MAFormer: A transformer network with multi-scale attention fusion for visual recognition

FOS: Computer and information sciences 03 medical and health sciences 0302 clinical medicine Computer Vision and Pattern Recognition (cs.CV) Computer Science - Computer Vision and Pattern Recognition
DOI: 10.1016/j.neucom.2024.127828 Publication Date: 2024-05-10T15:27:35Z
ABSTRACT
Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. But conventional vision transformers often focus on global dependency at a coarse level, which suffer from a learning challenge on global relationships and fine-grained representation at a token level. In this paper, we introduce Multi-scale Attention Fusion into transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at a token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. Our MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9$\%$ Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7$\%$ and 0.6$\%$ respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7$\%$ mAPs on object detection and 1.4$\%$ on instance segmentation with similar-sized parameters, demonstrating the potential to be a general backbone network.<br/>9 pages, 2 figures<br/>
SUPPLEMENTAL MATERIAL
Coming soon ....
REFERENCES (73)
CITATIONS (4)
EXTERNAL LINKS
PlumX Metrics
RECOMMENDATIONS
FAIR ASSESSMENT
Coming soon ....
JUPYTER LAB
Coming soon ....