Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

DOI: 10.1109/TPAMI.2022.3145427 · Publication Date: 2022-01-25
ABSTRACT
In this paper, we present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition. Realizing the importance of the positional information carried by 2D feature representations, and unlike recent MLP-like models that encode spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows it to capture long-range dependencies while avoiding the attention-building process of transformers. The outputs are then aggregated in a mutually complementing manner to form expressive representations. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without any dependence on convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k), using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaled up to 88M parameters, it attains 83.2% top-1 accuracy, greatly improving on the performance of recent state-of-the-art MLP-like models. We hope this work could encourage research on rethinking the way of encoding spatial information and facilitate the development of MLP-like models. PyTorch/MindSpore/Jittor code is available at https://github.com/Andrew-Qibin/VisionPermutator.
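The core idea stated in the abstract, encoding features separately along the height and width dimensions with linear projections and then aggregating the branches, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation (see the linked repository for the official PyTorch/MindSpore/Jittor code); the class name, the fixed feature-map size, and the simple summation-based aggregation are assumptions made here for brevity.

```python
# Minimal sketch of height/width/channel mixing with linear projections.
# Illustrative only; the official Vision Permutator code differs in detail.
import torch
import torch.nn as nn


class PermuteMLPSketch(nn.Module):
    """Encodes features separately along height, width, and channels with
    linear projections, then aggregates the three branches."""

    def __init__(self, height: int, width: int, channels: int):
        super().__init__()
        self.mix_h = nn.Linear(height, height)      # mixes tokens along the height axis
        self.mix_w = nn.Linear(width, width)        # mixes tokens along the width axis
        self.mix_c = nn.Linear(channels, channels)  # mixes channels
        self.fuse = nn.Linear(channels, channels)   # combines the branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        h = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # project along H
        w = self.mix_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # project along W
        c = self.mix_c(x)                                          # project along C
        return self.fuse(h + w + c)  # aggregation (simple sum in this sketch)


if __name__ == "__main__":
    block = PermuteMLPSketch(height=14, width=14, channels=384)
    tokens = torch.randn(2, 14, 14, 384)
    print(block(tokens).shape)  # torch.Size([2, 14, 14, 384])
```

Because each branch projects along only one axis, the height branch can relate distant rows and the width branch distant columns while the other axis keeps its positional layout intact; the fused output combines the two views.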