- Topic Modeling
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Orbital Angular Momentum in Optics
- Advanced Graph Neural Networks
- Text Readability and Simplification
- Visual Attention and Saliency Detection
- Lung Cancer Diagnosis and Treatment
- Sperm and Testicular Function
- Traffic Prediction and Management Techniques
- Microfluidic and Bio-sensing Technologies
- Mental Health via Writing
- Ferroelectric and Negative Capacitance Devices
- COVID-19 Diagnosis Using AI
- Solar Radiation and Photovoltaics
- CCD and CMOS Imaging Sensors
- Advanced Memory and Neural Computing
- Recommender Systems and Techniques
- Human Pose and Action Recognition
- Opinion Dynamics and Social Influence
- Human Mobility and Location-Based Analysis
- Power Systems and Renewable Energy
National University of Singapore
2006-2024
China Telecom
2022-2024
Ocean University of China
2024
China Telecom (China)
2022-2024
Chongqing University of Posts and Telecommunications
2024
Shanghai Jiao Tong University
2022-2023
North China Electric Power University
2019-2023
Sun Yat-sen University
2018-2020
Chinese Academy of Sciences
2013-2016
University of Science and Technology of China
2016
Transformers, which are popular for language modeling, have recently been explored for solving vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens of fixed length and then applies multiple Transformer layers to model their global relations for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find this is because: 1) the simple tokenization of input images fails to model important local structure such as...
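The fixed-length tokenization the abstract describes can be sketched in a few lines of NumPy; the function name, patch size, and shapes below are illustrative, not the paper's code.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patch tokens.

    Each non-overlapping patch x patch block becomes one token of length
    patch * patch * C, mirroring ViT's simple tokenization step.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = (
        image.reshape(h // patch, patch, w // patch, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch * patch * c)
    )
    return tokens

img = np.zeros((224, 224, 3), dtype=np.float32)
print(patchify(img).shape)  # (196, 768)
```

Because each patch is flattened independently, pixel adjacency across patch borders is invisible to the token sequence, which is exactly the local-structure limitation the abstract points out.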
Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the Transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module with an embarrassingly...
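A minimal sketch of such a non-attention token mixer, assuming a simple local-average-pooling operator with the identity subtracted (so the mixer only exchanges information between tokens); padding behavior and kernel size here are illustrative choices, not the paper's implementation.

```python
import numpy as np

def pool_mixer(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Embarrassingly simple token mixer: local average pooling minus identity.

    x has shape (H, W, C); each token becomes the mean of its k x k
    neighborhood (zero-padded), with the token itself subtracted so the
    mixer only propagates information *between* tokens.
    """
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = x.shape
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + w]
    return out / (k * k) - x
```

On a constant feature map the interior output is zero, confirming the mixer carries no per-token transformation of its own; the channel MLP and residual structure around it do that work.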
To understand a scene in depth involves not only locating/recognizing individual objects, but also inferring the relationships and interactions among them. However, since the distribution of real-world relationships is seriously unbalanced, existing methods perform quite poorly for the less frequent relationships. In this work, we find that the statistical correlations between object pairs and their relationships can effectively regularize the semantic space and make prediction less ambiguous, and thus well address the unbalanced issue. To achieve...
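The statistical correlations the abstract mentions can be illustrated as an empirical predicate prior per object-class pair, computed from annotated triples; this is a hypothetical sketch of the idea, not the paper's regularization method.

```python
import numpy as np
from collections import defaultdict

def cooccurrence_prior(triples, n_pred):
    """Empirical predicate distribution for each (subject, object) class pair.

    triples is a list of (subject_class, object_class, predicate) ids.
    The returned distributions capture which relationships each object
    pair statistically tends to take, and could serve as a prior to
    disambiguate predictions for rare relationships.
    """
    counts = defaultdict(lambda: np.zeros(n_pred))
    for s, o, p in triples:
        counts[(s, o)][p] += 1
    return {pair: c / c.sum() for pair, c in counts.items()}

# toy data: pair (0, 1) takes predicate 2 twice and predicate 3 once
prior = cooccurrence_prior([(0, 1, 2), (0, 1, 2), (0, 1, 3)], n_pred=4)
print(prior[(0, 1)])  # [0.  0.  0.667  0.333] approximately
```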
Recent studies show that Transformers have a strong capability of building long-range dependencies, yet are incompetent in capturing the high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing high-frequency...
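A channel-split mixer in the spirit of the abstract can be sketched as follows; the split ratio and both branch operators are illustrative stand-ins (a local max filter for the high-frequency branch, a crude global blend in place of attention for the low-frequency branch), not the iFormer design itself.

```python
import numpy as np

def inception_mixer(x: np.ndarray, split: float = 0.5) -> np.ndarray:
    """Channel-split token mixer sketch: local ops on one group of
    channels, global mixing on the rest.

    x has shape (H, W, C). High-frequency channels pass through a 3x3
    max filter; low-frequency channels are blended with the global mean
    (a toy stand-in for attention, which this sketch does not implement).
    """
    h, w, c = x.shape
    ch = int(c * split)
    hi, lo = x[..., :ch], x[..., ch:]
    # local max filter for the high-frequency branch
    pad = np.pad(hi, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    hi_out = np.stack(
        [pad[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    ).max(axis=0)
    # global mixing for the low-frequency branch
    lo_out = 0.5 * lo + 0.5 * lo.mean(axis=(0, 1), keepdims=True)
    return np.concatenate([hi_out, lo_out], axis=-1)
```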
MetaFormer, the abstracted architecture of the Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again by migrating our focus away from token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and demonstrate their gratifying performance. We summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting an identity mapping as the token mixer, the model, termed...
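The MetaFormer abstraction with a pluggable token mixer can be sketched as one block; the identity default corresponds to the identity-mapping baseline the abstract alludes to. Weight shapes, the ReLU choice, and the plain layer norm below are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last (channel) dimension to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def metaformer_block(x, w1, w2, mixer=lambda t: t):
    """One MetaFormer block: norm -> token mixer -> residual, then
    norm -> 2-layer channel MLP -> residual.

    With the default identity mixer, all modeling capacity comes from
    the channel MLPs and residual structure alone.
    """
    x = x + mixer(layer_norm(x))
    h = np.maximum(layer_norm(x) @ w1, 0.0)  # ReLU MLP
    return x + h @ w2
```

Swapping `mixer` for pooling, convolution, or attention recovers the various baselines without touching the rest of the block.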
Social relationships (e.g., friends, couples, etc.) form the basis of the social network in our daily life. Automatically interpreting such relationships bears great potential for intelligent systems to understand human behavior in depth and to better interact with people at a social level. Human beings interpret the relationships within a group not only based on the individuals alone; the interplay with the contextual information around them also plays a significant role. However, these additional cues are largely overlooked by previous studies. We found that two...
Recent powerful pre-trained language models have achieved remarkable performance on most of the popular datasets for reading comprehension. It is time to introduce more challenging datasets to push the development of this field towards more comprehensive reasoning over text. In this paper, we introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations. As earlier studies suggest, human-annotated datasets usually contain biases, which are often exploited by models to achieve high...
Pre-trained language models like BERT and its variants have recently achieved impressive performance on various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and computation cost. Although all its attention heads query the whole input sequence for generating the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which means the existence of computation redundancy. We therefore propose a novel span-based dynamic...
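The local-dependency observation can be illustrated with a toy attention head whose scores are masked to a fixed span; such a head never uses global context, which is why it could be replaced by a cheaper local operator (the paper's span-based dynamic convolution, not implemented here). The masking scheme below is illustrative.

```python
import numpy as np

def local_attention(q, k, v, span=2):
    """Single-head self-attention restricted to a local span.

    q, k, v have shape (n, d). Positions farther than `span` apart are
    masked to -inf before the softmax, so each token attends only to a
    small neighborhood rather than the whole sequence.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) > span
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```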
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) how to systematically structure and evaluate the complicated multimodal tasks; (2) how to design evaluation metrics that work well across question and answer types; and (3) how to give...
The propagation of Airy-Gaussian vortex (AiGV) beams through a gradient-index medium is investigated analytically and numerically with the transfer matrix method. Deriving the analytic expression of the AiGV beams based on the Huygens diffraction integral formula, we obtain the propagation path, intensity and phase distributions, and Poynting vector of the first- and second-order AiGV beams as they pass through the paraxial ABCD system. The ballistic trajectory is no longer the conventional parabolic one but takes trigonometric shapes in the gradient-index medium. In particular, the AiGV beams represent singular...
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works thus are dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner,...
MetaFormer, the abstracted architecture of the Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting an identity mapping as the token mixer, the model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K....
Inspired by the long-range modeling ability of ViTs, large-kernel convolutions have been widely studied and adopted recently to enlarge the receptive field and improve model performance, like the remarkable work ConvNeXt which employs 7x7 depthwise convolution. Although such a depthwise operator only consumes a few FLOPs, it largely harms model efficiency on powerful computing devices due to the high memory access costs. For example, ConvNeXt-T has similar FLOPs to ResNet-50 but only achieves about 60% of its throughput when trained on A100 GPUs with full...
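The FLOPs-versus-memory point can be made with a back-of-the-envelope cost model; the counting conventions below (one multiply-accumulate as 2 FLOPs, traffic counted in feature-map elements) are simplifying assumptions, not measurements from the paper.

```python
def depthwise_conv_cost(h, w, c, k):
    """Rough FLOPs and memory traffic for a k x k depthwise convolution
    on an (h, w, c) feature map.

    FLOPs grow with k*k, but memory traffic is dominated by reading the
    input and writing the output, i.e. O(h*w*c), independent of k. The
    resulting low arithmetic intensity is why large depthwise kernels
    can be cheap in FLOPs yet slow on memory-bound accelerators.
    """
    flops = h * w * c * k * k * 2           # one MAC counted as 2 FLOPs
    mem = h * w * c * 2 + c * k * k         # input read + output write + weights
    return flops, mem

f7, m7 = depthwise_conv_cost(56, 56, 96, 7)
f3, m3 = depthwise_conv_cost(56, 56, 96, 3)
print(f7 / f3)  # FLOPs grow ~5.4x from 3x3 to 7x7...
print(m7 / m3)  # ...while memory traffic is nearly unchanged
```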
Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge to the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms fail to accurately generate the details of polyp boundary regions...
In self-supervised learning, multi-granular features are heavily desired though rarely investigated, as different downstream tasks (e.g., general and fine-grained classification) often require different or multi-granular features, e.g., fine- or coarse-grained ones or their mixture. In this work, for the first time, we propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features. Mugs has three complementary granular supervisions: 1) instance discrimination supervision...
The propagation dynamics of Airy Gaussian vortex beams in uniaxial crystals orthogonal to the optical axis is investigated analytically and numerically. The analytical propagation expression is obtained. The propagation features are shown for different values of the distribution factor and the ratio of the extraordinary refractive index to the ordinary refractive index. The correlations between the maximum intensity value during propagation and the distance at which it appears have also been investigated.
There is high demand for a tailorable three-dimensional (3D) distribution of focused laser beams for simultaneous optical manipulation of multiple particles separately distributed in 3D space. In this letter, accurate control of the laser beam foci is demonstrated with an array of customized fractal zone plates (FZPs). The FZPs are designed with fractional numbers of segments, so the focal lengths can be finely tailored. The unique focusing properties are investigated in both simulations and experiments. The FZP is also found to possess...
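For background, the zone radii of a conventional Fresnel zone plate follow directly from the path-length condition; the fractal zone plates with fractional segment numbers in the abstract are a generalization not reproduced in this sketch.

```python
import math

def zone_radii(f, lam, n_zones):
    """Radii of a conventional Fresnel zone plate of focal length f.

    r_n = sqrt(n * lam * f + (n * lam / 2)**2), from requiring the path
    through zone n to exceed the axial path by n half-wavelengths.
    f and lam are in meters.
    """
    return [
        math.sqrt(n * lam * f + (n * lam / 2) ** 2)
        for n in range(1, n_zones + 1)
    ]

# e.g. a HeNe wavelength (633 nm) plate with 10 cm focal length
print(zone_radii(0.1, 633e-9, 3))
```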
The propagation of right-hand circularly polarized Airy–Gaussian beams (RHCPAiGBs) through slabs of right-handed materials (RHMs) and left-handed materials (LHMs) is investigated analytically and numerically with the transfer matrix method. An approximate analytical expression for RHCPAiGBs passing through a paraxial ABCD optical system is derived on the basis of the Huygens diffraction integral formula. The intensity and phase distributions of the RHCPAiGBs in RHMs and LHMs are demonstrated. The influence of the parameter χ0 on the propagation in RHMs and LHMs is investigated. The RHCPAiGBs possess...