Rethinking and Improving Relative Position Encoding for Vision Transformer
DOI: 10.48550/arxiv.2107.14222
Publication Date: 2021-01-01
AUTHORS (5)
ABSTRACT
Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work equally well as absolute position encoding. To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight. They can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
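To make the idea concrete, below is a minimal PyTorch sketch of a learnable 2D relative position bias added to raw attention logits, in the spirit of the paper's bias mode. It is an illustration, not the authors' implementation (which also includes contextual modes and a piecewise index function; see the linked repository). The class name RelPosBias2D and the direct (2H-1)x(2W-1) offset bucketing are illustrative assumptions.

    # A minimal sketch of a 2D relative position bias for self-attention.
    # Assumptions: class name RelPosBias2D and direct offset bucketing are
    # illustrative, not the iRPE implementation.
    import torch
    import torch.nn as nn

    class RelPosBias2D(nn.Module):
        """Learnable bias b[i, j], indexed by the 2D offset between query
        patch i and key patch j, added to the raw attention logits."""

        def __init__(self, height: int, width: int, num_heads: int):
            super().__init__()
            # One bucket per possible (dy, dx) offset: (2H-1) * (2W-1) in total.
            num_buckets = (2 * height - 1) * (2 * width - 1)
            self.bias_table = nn.Parameter(torch.zeros(num_buckets, num_heads))

            # Precompute the bucket index for every query/key patch pair.
            ys, xs = torch.meshgrid(
                torch.arange(height), torch.arange(width), indexing="ij"
            )
            coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)  # (HW, 2)
            rel = coords[:, None, :] - coords[None, :, :]              # (HW, HW, 2)
            rel[..., 0] += height - 1                                  # shift offsets to >= 0
            rel[..., 1] += width - 1
            index = rel[..., 0] * (2 * width - 1) + rel[..., 1]        # (HW, HW)
            self.register_buffer("index", index)

        def forward(self, attn_logits: torch.Tensor) -> torch.Tensor:
            # attn_logits: (batch, heads, HW, HW) raw q.k / sqrt(d) scores.
            bias = self.bias_table[self.index]           # (HW, HW, heads)
            return attn_logits + bias.permute(2, 0, 1)   # broadcast over batch

    # Usage: drop the bias into a transformer block before the softmax.
    bias = RelPosBias2D(height=14, width=14, num_heads=6)
    logits = torch.randn(2, 6, 14 * 14, 14 * 14)
    out = bias(logits)  # same shape, now position-aware

Because the bias depends only on relative offsets rather than absolute patch indices, it adds few parameters and, as the abstract notes, such modules can be plugged into existing transformer blocks without retuning hyperparameters.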