Weihao Yu

ORCID: 0000-0003-3349-5890
Research Areas
  • Topic Modeling
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Natural Language Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications
  • Orbital Angular Momentum in Optics
  • Advanced Graph Neural Networks
  • Text Readability and Simplification
  • Visual Attention and Saliency Detection
  • Lung Cancer Diagnosis and Treatment
  • Sperm and Testicular Function
  • Traffic Prediction and Management Techniques
  • Microfluidic and Bio-sensing Technologies
  • Mental Health via Writing
  • Ferroelectric and Negative Capacitance Devices
  • COVID-19 diagnosis using AI
  • Solar Radiation and Photovoltaics
  • CCD and CMOS Imaging Sensors
  • Advanced Memory and Neural Computing
  • Recommender Systems and Techniques
  • Human Pose and Action Recognition
  • Opinion Dynamics and Social Influence
  • Human Mobility and Location-Based Analysis
  • Power Systems and Renewable Energy

National University of Singapore
2006-2024

China Telecom
2022-2024

Ocean University of China
2024

Chongqing University of Posts and Telecommunications
2024

Shanghai Jiao Tong University
2022-2023

North China Electric Power University
2019-2023

Sun Yat-sen University
2018-2020

Chinese Academy of Sciences
2013-2016

University of Science and Technology of China
2016

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as...

10.1109/iccv48922.2021.00060 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
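
A minimal sketch of the tokenization step this abstract critiques, assuming a standard ViT-style patch embedding (this is the generic fixed-length splitting, not the paper's tokens-to-token module):

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """ViT-style tokenization: split an image into fixed-size patches
    and linearly project each patch to an embedding (one token per patch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_tokens = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "split into patches + linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) sequence of tokens

tokens = PatchTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```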

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the Transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in Transformers with an embarrassingly...

10.1109/cvpr52688.2022.01055 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
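
A minimal sketch of a pooling-based token mixer in the spirit of this paper, assuming tokens laid out as a 2D feature map (illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

class PoolingMixer(nn.Module):
    """Replace self-attention with plain average pooling as the token mixer;
    the input is subtracted so the block's residual connection is not
    counted twice."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W) feature map of tokens
        return self.pool(x) - x    # the residual is added back by the block

y = PoolingMixer()(torch.randn(2, 64, 14, 14))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```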

To understand a scene in depth not only involves locating/recognizing individual objects, but also requires inferring the relationships and interactions among them. However, since the distribution of real-world relationships is seriously unbalanced, existing methods perform quite poorly for the less frequent relationships. In this work, we find that the statistical correlations between object pairs and their relationships can effectively regularize the semantic space and make prediction less ambiguous, and thus well address the unbalanced distribution issue. To achieve...

10.1109/cvpr.2019.00632 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
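
The statistical prior the abstract mentions can be illustrated with a toy sketch on hypothetical data: counting annotated (subject, relationship, object) triplets yields a conditional distribution over relationships for each object pair, the kind of co-occurrence correlation used to regularize prediction:

```python
from collections import Counter, defaultdict

# Toy triplets (hypothetical data, for illustration only).
triplets = [("person", "rides", "horse"), ("person", "rides", "bike"),
            ("person", "feeds", "horse"), ("person", "rides", "horse")]

counts = defaultdict(Counter)
for subj, rel, obj in triplets:
    counts[(subj, obj)][rel] += 1

def relationship_prior(subj, obj):
    """Estimate P(relationship | subject, object) from triplet counts."""
    c = counts[(subj, obj)]
    total = sum(c.values())
    return {rel: n / total for rel, n in c.items()}

print(relationship_prior("person", "horse"))  # {'rides': 0.67, 'feeds': 0.33}
```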

Recent studies show that Transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency...

10.48550/arxiv.2205.12956 preprint EN other-oa arXiv (Cornell University) 2022-01-01
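
A hedged sketch of the channel-splitting idea described above: some channels go through high-frequency branches (max-pooling, depthwise convolution) and the rest through a low-frequency branch (average pooling here stands in for attention). The branch choices and ratios are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class InceptionMixerSketch(nn.Module):
    """Split channels across parallel high- and low-frequency mixers,
    then concatenate the results back together."""
    def __init__(self, dim):
        super().__init__()
        c = dim // 4
        self.high_pool = nn.Sequential(nn.MaxPool2d(3, 1, 1), nn.Conv2d(c, c, 1))
        self.high_conv = nn.Conv2d(c, c, 3, padding=1, groups=c)   # depthwise conv
        self.low = nn.AvgPool2d(3, 1, 1)                           # low-frequency path
        self.split = (c, c, dim - 2 * c)

    def forward(self, x):                      # x: (B, C, H, W)
        h1, h2, lo = torch.split(x, self.split, dim=1)
        return torch.cat([self.high_pool(h1), self.high_conv(h2), self.low(lo)], dim=1)

y = InceptionMixerSketch(64)(torch.randn(2, 64, 14, 14))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```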

10.1109/cvpr52733.2024.00542 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, by migrating our focus away from token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and demonstrate their gratifying performance. We summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed...

10.1109/tpami.2023.3329173 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-11-01
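
A minimal MetaFormer block sketch under the abstraction described above: norm, pluggable token mixer, residual, then norm, channel MLP, residual. With nn.Identity() as the mixer it becomes an IdentityFormer-style block probing the lower bound (a simplified sketch, not the released code):

```python
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """General MetaFormer block: the token mixer is a plug-in argument."""
    def __init__(self, dim, mixer=None, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer or nn.Identity()    # identity mapping as token mixer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                      # x: (B, N, C) token sequence
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))

y = MetaFormerBlock(64)(torch.randn(2, 196, 64))
print(y.shape)  # torch.Size([2, 196, 64])
```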

Social relationships (e.g., friends, couple, etc.) form the basis of the social network in our daily life. Automatically interpreting such relationships bears a great potential for intelligent systems to understand human behavior in depth and to better interact with people at a social level. Human beings interpret the social relationships within a group not only based on the people alone; the interplay between such social relationships and the contextual information around the people also plays a significant role. However, these additional cues are largely overlooked by previous studies. We found that two...

10.24963/ijcai.2018/142 preprint EN 2018-07-01

Recent powerful pre-trained language models have achieved remarkable performance on most of the popular datasets for reading comprehension. It is time to introduce more challenging datasets to push the development of this field towards more comprehensive reasoning over text. In this paper, we introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations. As earlier studies suggest, human-annotated datasets usually contain biases, which are often exploited by models to achieve high...

10.48550/arxiv.2002.04326 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers a large memory footprint and computation cost. Although all its attention heads query the whole input sequence for generating the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which means the existence of computation redundancy. We therefore propose a novel span-based dynamic...

10.48550/arxiv.2008.02496 preprint EN other-oa arXiv (Cornell University) 2020-01-01
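
A simplified sketch of a dynamic convolution mixer, the general mechanism the abstract builds on: each position predicts its own kernel over a local span of tokens, replacing global attention with a local operation. This illustrates the idea, not ConvBERT's exact span-based variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvSketch(nn.Module):
    """Per-position kernels over a local window, shared across channels."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.k = kernel_size
        self.kernel_gen = nn.Linear(dim, kernel_size)  # predict kernel from token

    def forward(self, x):                              # x: (B, N, D)
        kernels = F.softmax(self.kernel_gen(x), dim=-1)           # (B, N, K)
        pad = self.k // 2
        windows = F.pad(x.transpose(1, 2), (pad, pad))            # (B, D, N + K - 1)
        windows = windows.unfold(2, self.k, 1)                    # (B, D, N, K)
        return torch.einsum("bdnk,bnk->bnd", windows, kernels)    # (B, N, D)

y = DynamicConvSketch(64)(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```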

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) how to systematically structure and evaluate the complicated multimodal tasks; (2) how to design evaluation metrics that work well across question and answer types; and (3) how to give...

10.48550/arxiv.2308.02490 preprint EN other-oa arXiv (Cornell University) 2023-01-01
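
A hedged sketch of an open-ended evaluation loop of the kind the abstract describes: a judge scores each model answer against the ground truth and scores are averaged per capability. Here `judge_score` is a hypothetical stand-in for an LLM-based grader, not MM-Vet's actual prompt or API:

```python
from statistics import mean

def judge_score(question: str, answer: str, ground_truth: str) -> float:
    """Placeholder: an LLM judge would return a soft score in [0, 1]."""
    return 1.0 if ground_truth.lower() in answer.lower() else 0.0

def evaluate(samples, model_answers):
    """Average judge scores per capability tag (e.g. 'ocr', 'math')."""
    per_capability = {}
    for sample, answer in zip(samples, model_answers):
        s = judge_score(sample["question"], answer, sample["ground_truth"])
        for cap in sample["capabilities"]:
            per_capability.setdefault(cap, []).append(s)
    return {cap: mean(scores) for cap, scores in per_capability.items()}

samples = [{"question": "What is 2+3 on the board?", "ground_truth": "5",
            "capabilities": ["ocr", "math"]}]
print(evaluate(samples, ["The board shows 5."]))  # {'ocr': 1.0, 'math': 1.0}
```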

Propagation of Airy-Gaussian vortex (AiGV) beams through the gradient-index medium is investigated analytically and numerically with the transfer matrix method. Deriving the analytic expression of the AiGV beams based on the Huygens diffraction integral formula, we obtain the propagation path, intensity and phase distributions, and Poynting vector of the first- and second-order AiGV beams propagating through the paraxial ABCD system. The ballistic trajectory is no longer the conventional parabolic one but takes trigonometric shapes in the gradient-index medium. Especially, the AiGV beams represent singular...

10.1364/josaa.33.001025 article EN Journal of the Optical Society of America A 2016-05-05
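
The Huygens/transfer-matrix treatment referenced above is, in its standard paraxial form, the Collins integral; the sketch below states it under one common sign convention, together with the trigonometric ABCD matrix of a gradient-index medium (assuming the usual quadratic index profile), which is why the trajectory bends sinusoidally rather than parabolically:

```latex
\[
  E(x_2, y_2, z) = \frac{i}{\lambda B}\iint E(x_1, y_1, 0)\,
  \exp\!\Big\{-\frac{ik}{2B}\big[A(x_1^2 + y_1^2) - 2(x_1 x_2 + y_1 y_2)
  + D(x_2^2 + y_2^2)\big]\Big\}\,\mathrm{d}x_1\,\mathrm{d}y_1
\]
% For n^2(r) = n_0^2 (1 - \alpha^2 r^2), one common normalization of the
% transfer matrix over a length z is trigonometric:
\[
  \begin{pmatrix} A & B \\ C & D \end{pmatrix}
  = \begin{pmatrix} \cos(\alpha z) & \sin(\alpha z)/\alpha \\
    -\alpha\,\sin(\alpha z) & \cos(\alpha z) \end{pmatrix}
\]
```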

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most of the recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner,...

10.48550/arxiv.2106.03714 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as...

10.48550/arxiv.2101.11986 preprint EN other-oa arXiv (Cornell University) 2021-01-01

MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K...

10.48550/arxiv.2210.13452 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Inspired by the long-range modeling ability of ViTs, large-kernel convolutions have been widely studied and adopted recently to enlarge the receptive field and improve model performance, like the remarkable work ConvNeXt which employs 7x7 depthwise convolution. Although such a depthwise operator only consumes a few FLOPs, it largely harms the model efficiency on powerful computing devices due to the high memory access costs. For example, ConvNeXt-T has similar FLOPs to ResNet-50 but only achieves ~60% of its throughput when trained on A100 GPUs with full...

10.48550/arxiv.2303.16900 preprint EN other-oa arXiv (Cornell University) 2023-01-01
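
A sketch of the decomposition idea behind this paper: instead of one expensive large-kernel depthwise convolution, channels are split into parallel branches with a small square kernel, two orthogonal band kernels (1xk and kx1), and an identity branch. The branch ratios here are illustrative:

```python
import torch
import torch.nn as nn

class InceptionDWConvSketch(nn.Module):
    """Decompose a large-kernel depthwise conv into cheap parallel branches."""
    def __init__(self, dim, band_kernel=11):
        super().__init__()
        g = dim // 8
        self.dw_sq = nn.Conv2d(g, g, 3, padding=1, groups=g)
        self.dw_w = nn.Conv2d(g, g, (1, band_kernel), padding=(0, band_kernel // 2), groups=g)
        self.dw_h = nn.Conv2d(g, g, (band_kernel, 1), padding=(band_kernel // 2, 0), groups=g)
        self.split = (g, g, g, dim - 3 * g)            # last chunk passes through

    def forward(self, x):                              # x: (B, C, H, W)
        a, b, c, d = torch.split(x, self.split, dim=1)
        return torch.cat([self.dw_sq(a), self.dw_w(b), self.dw_h(c), d], dim=1)

y = InceptionDWConvSketch(64)(torch.randn(2, 64, 14, 14))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```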

Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge for the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms fail to accurately generate the details of polyp boundary regions...

10.48550/arxiv.2501.16679 preprint EN arXiv (Cornell University) 2025-01-27

In self-supervised learning, multi-granular features are heavily desired though rarely investigated, as different downstream tasks (e.g., general and fine-grained classification) often require different or multi-granular features, e.g., fine- or coarse-grained ones or their mixture. In this work, for the first time, we propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features. Mugs has three complementary granular supervisions: 1) instance discrimination supervision...

10.48550/arxiv.2203.14415 preprint EN other-oa arXiv (Cornell University) 2022-01-01
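
Instance discrimination, the first supervision listed, is commonly implemented with an InfoNCE-style contrastive objective; the generic form below is illustrative and not necessarily Mugs' exact loss:

```latex
\[
  \mathcal{L}_{\mathrm{ins}} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}
       {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
\]
% z_i and z_i^+ are embeddings of two augmented views of the same image,
% sim is cosine similarity, and \tau is a temperature.
```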

The propagation dynamics of the Airy Gaussian vortex beams in uniaxial crystals orthogonal to the optical axis has been investigated analytically and numerically. The analytical propagation expression of the beams is obtained. The propagation features are shown with changes of the distribution factor and the ratio of the extraordinary refractive index to the ordinary refractive index. The correlations between the maximum intensity value during propagation and its appearing distance have also been investigated.

10.1088/1674-1056/25/4/044201 article EN Chinese Physics B 2016-04-01

There is a high demand for a tailorable three-dimensional (3D) distribution of focused laser beams in the simultaneous optical manipulation of multiple particles separately distributed in 3D space. In this letter, accurate control of the beam foci is demonstrated with an array of customized fractal zone plates (FZPs). The FZPs are designed with fractional numbers of segments, so the focal lengths can be finely tailored. The unique focusing properties are investigated in both simulations and experiments. The FZP is also found to possess...

10.1088/1612-2011/10/3/035003 article EN Laser Physics Letters 2013-02-05
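
As context for how zone geometry sets focal lengths, the standard Fresnel zone plate relation is sketched below; fractal and fractional-segment designs generalize this layout to produce tailored sequences of foci (a textbook relation, not the paper's derivation):

```latex
% Paraxial zone radii of a conventional Fresnel zone plate and the
% resulting primary focal length:
\[
  r_m = \sqrt{m \lambda f} \quad\Longrightarrow\quad f = \frac{r_1^2}{\lambda}
\]
```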

The propagation of right-hand circularly polarized Airy–Gaussian beams (RHCPAiGBs) through slabs of right-handed materials (RHMs) and left-handed materials (LHMs) is investigated analytically and numerically with the transfer matrix method. An approximate analytical expression for RHCPAiGBs passing through a paraxial ABCD optical system is derived on the basis of the Huygens diffraction integral formula. The intensity and phase distributions of RHCPAiGBs in RHMs and LHMs are demonstrated. The influence of the parameter χ0 on the propagation through the RHM and LHM slabs is investigated. The RHCPAiGBs possess...

10.1364/josaa.32.002104 article EN Journal of the Optical Society of America A 2015-10-16

To understand a scene in depth not only involves locating/recognizing individual objects, but also requires inferring the relationships and interactions among them. However, since the distribution of real-world relationships is seriously unbalanced, existing methods perform quite poorly for the less frequent relationships. In this work, we find that the statistical correlations between object pairs and their relationships can effectively regularize the semantic space and make prediction less ambiguous, and thus well address the unbalanced distribution issue. To achieve...

10.48550/arxiv.1903.03326 preprint EN other-oa arXiv (Cornell University) 2019-01-01