Weihao Yu

ORCID: 0000-0003-3349-5890
Research Areas
  • Topic Modeling
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Natural Language Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications
  • Orbital Angular Momentum in Optics
  • Advanced Graph Neural Networks
  • Text Readability and Simplification
  • Visual Attention and Saliency Detection
  • Lung Cancer Diagnosis and Treatment
  • Sperm and Testicular Function
  • Traffic Prediction and Management Techniques
  • Microfluidic and Bio-sensing Technologies
  • Mental Health via Writing
  • Ferroelectric and Negative Capacitance Devices
  • COVID-19 diagnosis using AI
  • Solar Radiation and Photovoltaics
  • CCD and CMOS Imaging Sensors
  • Advanced Memory and Neural Computing
  • Recommender Systems and Techniques
  • Human Pose and Action Recognition
  • Opinion Dynamics and Social Influence
  • Human Mobility and Location-Based Analysis
  • Power Systems and Renewable Energy

National University of Singapore
2006-2024

China Telecom
2022-2024

Ocean University of China
2024

Chongqing University of Posts and Telecommunications
2024

Shanghai Jiao Tong University
2022-2023

North China Electric Power University
2019-2023

Sun Yat-sen University
2018-2020

Chinese Academy of Sciences
2013-2016

University of Science and Technology of China
2016

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as...

10.1109/iccv48922.2021.00060 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
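
A minimal sketch of the tokenization step this abstract critiques, assuming a standard ViT-style patch embedding (this is the generic fixed-length splitting, not the paper's tokens-to-token module):

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """ViT-style tokenization: split an image into fixed-size patches
    and linearly project each patch to an embedding (one token per patch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_tokens = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "split into patches + linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) sequence of tokens

tokens = PatchTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```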

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the Transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in Transformers with an embarrassingly...

10.1109/cvpr52688.2022.01055 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
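
A minimal sketch of a pooling-based token mixer in the spirit of this paper, assuming tokens laid out as a 2D feature map (illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

class PoolingMixer(nn.Module):
    """Replace self-attention with plain average pooling as the token mixer;
    the input is subtracted so the block's residual connection is not
    counted twice."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W) feature map of tokens
        return self.pool(x) - x    # the residual is added back by the block

y = PoolingMixer()(torch.randn(2, 64, 14, 14))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```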

To understand a scene in depth not only involves locating/recognizing individual objects, but also requires inferring the relationships and interactions among them. However, since the distribution of real-world relationships is seriously unbalanced, existing methods perform quite poorly for the less frequent relationships. In this work, we find that the statistical correlations between object pairs and their relationships can effectively regularize the semantic space and make prediction less ambiguous, and thus well address the unbalanced distribution issue. To achieve...

10.1109/cvpr.2019.00632 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
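
The statistical prior the abstract mentions can be illustrated with a toy sketch on hypothetical data: counting annotated (subject, relationship, object) triplets yields a conditional distribution over relationships for each object pair, the kind of co-occurrence correlation used to regularize prediction:

```python
from collections import Counter, defaultdict

# Toy triplets (hypothetical data, for illustration only).
triplets = [("person", "rides", "horse"), ("person", "rides", "bike"),
            ("person", "feeds", "horse"), ("person", "rides", "horse")]

counts = defaultdict(Counter)
for subj, rel, obj in triplets:
    counts[(subj, obj)][rel] += 1

def relationship_prior(subj, obj):
    """Estimate P(relationship | subject, object) from triplet counts."""
    c = counts[(subj, obj)]
    total = sum(c.values())
    return {rel: n / total for rel, n in c.items()}

print(relationship_prior("person", "horse"))  # {'rides': 0.67, 'feeds': 0.33}
```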

Recent studies show that Transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency...

10.48550/arxiv.2205.12956 preprint EN other-oa arXiv (Cornell University) 2022-01-01
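
A hedged sketch of the channel-splitting idea described above: some channels go through high-frequency branches (max-pooling, depthwise convolution) and the rest through a low-frequency branch (average pooling here stands in for attention). The branch choices and ratios are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class InceptionMixerSketch(nn.Module):
    """Split channels across parallel high- and low-frequency mixers,
    then concatenate the results back together."""
    def __init__(self, dim):
        super().__init__()
        c = dim // 4
        self.high_pool = nn.Sequential(nn.MaxPool2d(3, 1, 1), nn.Conv2d(c, c, 1))
        self.high_conv = nn.Conv2d(c, c, 3, padding=1, groups=c)   # depthwise conv
        self.low = nn.AvgPool2d(3, 1, 1)                           # low-frequency path
        self.split = (c, c, dim - 2 * c)

    def forward(self, x):                      # x: (B, C, H, W)
        h1, h2, lo = torch.split(x, self.split, dim=1)
        return torch.cat([self.high_pool(h1), self.high_conv(h2), self.low(lo)], dim=1)

y = InceptionMixerSketch(64)(torch.randn(2, 64, 14, 14))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```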

10.1109/cvpr52733.2024.00542 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, by migrating our focus away from token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and demonstrate their gratifying performance. We summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed...

10.1109/tpami.2023.3329173 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-11-01
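
A minimal MetaFormer block sketch under the abstraction described above: norm, pluggable token mixer, residual, then norm, channel MLP, residual. With nn.Identity() as the mixer it becomes an IdentityFormer-style block probing the lower bound (a simplified sketch, not the released code):

```python
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """General MetaFormer block: the token mixer is a plug-in argument."""
    def __init__(self, dim, mixer=None, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer or nn.Identity()    # identity mapping as token mixer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                      # x: (B, N, C) token sequence
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))

y = MetaFormerBlock(64)(torch.randn(2, 196, 64))
print(y.shape)  # torch.Size([2, 196, 64])
```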

Social relationships (e.g., friends, couple, etc.) form the basis of the social network in our daily life. Automatically interpreting such relationships bears a great potential for intelligent systems to understand human behavior in depth and to better interact with people at a social level. Human beings interpret the social relationships within a group not only based on the people alone; the interplay between such social relationships and the contextual information around the people also plays a significant role. However, these additional cues are largely overlooked by previous studies. We found that two...

10.24963/ijcai.2018/142 preprint EN 2018-07-01

Recent powerful pre-trained language models have achieved remarkable performance on most of the popular datasets for reading comprehension. It is time to introduce more challenging datasets to push the development of this field towards more comprehensive reasoning over text. In this paper, we introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations. As earlier studies suggest, human-annotated datasets usually contain biases, which are often exploited by models to achieve high...

10.48550/arxiv.2002.04326 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers a large memory footprint and computation cost. Although all its attention heads query the whole input sequence for generating the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which means the existence of computation redundancy. We therefore propose a novel span-based dynamic...

10.48550/arxiv.2008.02496 preprint EN other-oa arXiv (Cornell University) 2020-01-01
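
A simplified sketch of a dynamic convolution mixer, the general mechanism the abstract builds on: each position predicts its own kernel over a local span of tokens, replacing global attention with a local operation. This illustrates the idea, not ConvBERT's exact span-based variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvSketch(nn.Module):
    """Per-position kernels over a local window, shared across channels."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.k = kernel_size
        self.kernel_gen = nn.Linear(dim, kernel_size)  # predict kernel from token

    def forward(self, x):                              # x: (B, N, D)
        kernels = F.softmax(self.kernel_gen(x), dim=-1)           # (B, N, K)
        pad = self.k // 2
        windows = F.pad(x.transpose(1, 2), (pad, pad))            # (B, D, N + K - 1)
        windows = windows.unfold(2, self.k, 1)                    # (B, D, N, K)
        return torch.einsum("bdnk,bnk->bnd", windows, kernels)    # (B, N, D)

y = DynamicConvSketch(64)(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```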

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) how to systematically structure and evaluate the complicated multimodal tasks; (2) how to design evaluation metrics that work well across question and answer types; and (3) how to give...

10.48550/arxiv.2308.02490 preprint EN other-oa arXiv (Cornell University) 2023-01-01
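
A hedged sketch of an open-ended evaluation loop of the kind the abstract describes: a judge scores each model answer against the ground truth and scores are averaged per capability. Here `judge_score` is a hypothetical stand-in for an LLM-based grader, not MM-Vet's actual prompt or API:

```python
from statistics import mean

def judge_score(question: str, answer: str, ground_truth: str) -> float:
    """Placeholder: an LLM judge would return a soft score in [0, 1]."""
    return 1.0 if ground_truth.lower() in answer.lower() else 0.0

def evaluate(samples, model_answers):
    """Average judge scores per capability tag (e.g. 'ocr', 'math')."""
    per_capability = {}
    for sample, answer in zip(samples, model_answers):
        s = judge_score(sample["question"], answer, sample["ground_truth"])
        for cap in sample["capabilities"]:
            per_capability.setdefault(cap, []).append(s)
    return {cap: mean(scores) for cap, scores in per_capability.items()}

samples = [{"question": "What is 2+3 on the board?", "ground_truth": "5",
            "capabilities": ["ocr", "math"]}]
print(evaluate(samples, ["The board shows 5."]))  # {'ocr': 1.0, 'math': 1.0}
```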

Propagation of Airy-Gaussian vortex (AiGV) beams through the gradient-index medium is investigated analytically and numerically with the transfer matrix method. Deriving the analytic expression of the AiGV beams based on the Huygens diffraction integral formula, we obtain the propagation path, intensity and phase distributions, and Poynting vector of the first- and second-order AiGV beams propagating through the paraxial ABCD system. The ballistic trajectory is no longer the conventional parabolic one but takes trigonometric shapes in the gradient-index medium. Especially, the AiGV beams represent singular...

10.1364/josaa.33.001025 article EN Journal of the Optical Society of America A 2016-05-05
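
The Huygens/transfer-matrix treatment referenced above is, in its standard paraxial form, the Collins integral; the sketch below states it under one common sign convention, together with the trigonometric ABCD matrix of a gradient-index medium (assuming the usual quadratic index profile), which is why the trajectory bends sinusoidally rather than parabolically:

```latex
\[
  E(x_2, y_2, z) = \frac{i}{\lambda B}\iint E(x_1, y_1, 0)\,
  \exp\!\Big\{-\frac{ik}{2B}\big[A(x_1^2 + y_1^2) - 2(x_1 x_2 + y_1 y_2)
  + D(x_2^2 + y_2^2)\big]\Big\}\,\mathrm{d}x_1\,\mathrm{d}y_1
\]
% For n^2(r) = n_0^2 (1 - \alpha^2 r^2), one common normalization of the
% transfer matrix over a length z is trigonometric:
\[
  \begin{pmatrix} A & B \\ C & D \end{pmatrix}
  = \begin{pmatrix} \cos(\alpha z) & \sin(\alpha z)/\alpha \\
    -\alpha\,\sin(\alpha z) & \cos(\alpha z) \end{pmatrix}
\]
```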

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most of the recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner,...

10.48550/arxiv.2106.03714 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as...

10.48550/arxiv.2101.11986 preprint EN other-oa arXiv (Cornell University) 2021-01-01

MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K...

10.48550/arxiv.2210.13452 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Inspired by the long-range modeling ability of ViTs, large-kernel convolutions have been widely studied and adopted recently to enlarge the receptive field and improve model performance, like the remarkable work ConvNeXt which employs 7x7 depthwise convolution. Although such a depthwise operator only consumes a few FLOPs, it largely harms the model efficiency on powerful computing devices due to the high memory access costs. For example, ConvNeXt-T has similar FLOPs to ResNet-50 but only achieves ~60% of its throughput when trained on A100 GPUs with full...

10.48550/arxiv.2303.16900 preprint EN other-oa arXiv (Cornell University) 2023-01-01
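
A sketch of the decomposition idea behind this paper: instead of one expensive large-kernel depthwise convolution, channels are split into parallel branches with a small square kernel, two orthogonal band kernels (1xk and kx1), and an identity branch. The branch ratios here are illustrative:

```python
import torch
import torch.nn as nn

class InceptionDWConvSketch(nn.Module):
    """Decompose a large-kernel depthwise conv into cheap parallel branches."""
    def __init__(self, dim, band_kernel=11):
        super().__init__()
        g = dim // 8
        self.dw_sq = nn.Conv2d(g, g, 3, padding=1, groups=g)
        self.dw_w = nn.Conv2d(g, g, (1, band_kernel), padding=(0, band_kernel // 2), groups=g)
        self.dw_h = nn.Conv2d(g, g, (band_kernel, 1), padding=(band_kernel // 2, 0), groups=g)
        self.split = (g, g, g, dim - 3 * g)            # last chunk passes through

    def forward(self, x):                              # x: (B, C, H, W)
        a, b, c, d = torch.split(x, self.split, dim=1)
        return torch.cat([self.dw_sq(a), self.dw_w(b), self.dw_h(c), d], dim=1)

y = InceptionDWConvSketch(64)(torch.randn(2, 64, 14, 14))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```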

Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge for the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms fail to accurately generate the details of polyp boundary regions...

10.48550/arxiv.2501.16679 preprint EN arXiv (Cornell University) 2025-01-27

In self-supervised learning, multi-granular features are heavily desired though rarely investigated, as different downstream tasks (e.g., general and fine-grained classification) often require different or multi-granular features, e.g., fine- or coarse-grained ones or their mixture. In this work, for the first time, we propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features. Mugs has three complementary granular supervisions: 1) instance discrimination supervision...

10.48550/arxiv.2203.14415 preprint EN other-oa arXiv (Cornell University) 2022-01-01
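
Instance discrimination, the first supervision listed, is commonly implemented with an InfoNCE-style contrastive objective; the generic form below is illustrative and not necessarily Mugs' exact loss:

```latex
\[
  \mathcal{L}_{\mathrm{ins}} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}
       {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
\]
% z_i and z_i^+ are embeddings of two augmented views of the same image,
% sim is cosine similarity, and \tau is a temperature.
```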

The propagation dynamics of the Airy Gaussian vortex beams in uniaxial crystals orthogonal to the optical axis has been investigated analytically and numerically. The analytical propagation expression of the beams is obtained. The propagation features are shown with changes of the distribution factor and the ratio of the extraordinary refractive index to the ordinary refractive index. The correlations between the maximum intensity value during propagation and its appearing distance have also been investigated.

10.1088/1674-1056/25/4/044201 article EN Chinese Physics B 2016-04-01

There is a high demand for a tailorable three-dimensional (3D) distribution of focused laser beams in the simultaneous optical manipulation of multiple particles separately distributed in 3D space. In this letter, accurate control of the beam foci is demonstrated with an array of customized fractal zone plates (FZPs). The FZPs are designed with fractional numbers of segments, so the focal lengths can be finely tailored. The unique focusing properties are investigated in both simulations and experiments. The FZP is also found to possess...

10.1088/1612-2011/10/3/035003 article EN Laser Physics Letters 2013-02-05
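
As context for how zone geometry sets focal lengths, the standard Fresnel zone plate relation is sketched below; fractal and fractional-segment designs generalize this layout to produce tailored sequences of foci (a textbook relation, not the paper's derivation):

```latex
% Paraxial zone radii of a conventional Fresnel zone plate and the
% resulting primary focal length:
\[
  r_m = \sqrt{m \lambda f} \quad\Longrightarrow\quad f = \frac{r_1^2}{\lambda}
\]
```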

The propagation of right-hand circularly polarized Airy–Gaussian beams (RHCPAiGBs) through slabs of right-handed materials (RHMs) and left-handed materials (LHMs) is investigated analytically and numerically with the transfer matrix method. An approximate analytical expression for RHCPAiGBs passing through a paraxial ABCD optical system is derived on the basis of the Huygens diffraction integral formula. The intensity and phase distributions of RHCPAiGBs in RHMs and LHMs are demonstrated. The influence of the parameter χ0 on the propagation through the RHM and LHM slabs is investigated. The RHCPAiGBs possess...

10.1364/josaa.32.002104 article EN Journal of the Optical Society of America A 2015-10-16

To understand a scene in depth not only involves locating/recognizing individual objects, but also requires inferring the relationships and interactions among them. However, since the distribution of real-world relationships is seriously unbalanced, existing methods perform quite poorly for the less frequent relationships. In this work, we find that the statistical correlations between object pairs and their relationships can effectively regularize the semantic space and make prediction less ambiguous, and thus well address the unbalanced distribution issue. To achieve...

10.48550/arxiv.1903.03326 preprint EN other-oa arXiv (Cornell University) 2019-01-01