Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
DOI:
10.1609/aaai.v36i3.20202
Publication Date:
2022-07-04
AUTHORS (9)
ABSTRACT
Vision transformers (ViTs) have recently received explosive popularity, but their huge computational cost remains a severe issue. Since the computational complexity of ViT is quadratic with respect to the input sequence length, a mainstream paradigm for computation reduction is to reduce the number of tokens. Existing designs include structured spatial compression, which uses a progressive shrinking pyramid to reduce the computation on large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, the limitations of existing token pruning are two-fold: 1) the incomplete spatial structure caused by pruning is not compatible with the structured spatial compression commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pre-training procedure. To tackle these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then, we propose to update the selected informative tokens and the uninformative tokens along different computation paths, namely, slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification. For example, our method accelerates DeiT-S by over 60% in throughput while sacrificing only 0.4% top-1 accuracy on ImageNet-1K, outperforming current token pruning methods in both accuracy and efficiency.