PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2111.12710 Publication Date: 2021-01-01
ABSTRACT
This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve $\textbf{84.5\%}$ Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by $\textbf{+1.3\%}$ under the same pre-training epochs. Our approach also obtains significant improvements on object detection and segmentation on COCO and ADE20K. Equipped with a larger backbone ViT-H, we achieve state-of-the-art ImageNet accuracy ($\textbf{88.3\%}$) among methods using only ImageNet-1K data.
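The core idea in the abstract, adding a perceptual-similarity term to the dVAE reconstruction objective using deep features, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `deep_features` is a hypothetical stand-in for the self-supervised transformer feature extractor, and the weighting `lam` is an assumed hyperparameter.

```python
import numpy as np

def deep_features(img, weights):
    """Hypothetical stand-in for a self-supervised ViT feature extractor:
    flatten the image into 16-dim "patches" and apply a linear map + ReLU."""
    patches = img.reshape(-1, 16)
    return np.maximum(patches @ weights, 0)

def perceptual_loss(x, x_rec, weights):
    """Perceptual term: distance between deep features of the input and the
    dVAE reconstruction, computed on unit-normalized features (LPIPS-style)."""
    f_x = deep_features(x, weights)
    f_rec = deep_features(x_rec, weights)
    f_x = f_x / (np.linalg.norm(f_x, axis=1, keepdims=True) + 1e-8)
    f_rec = f_rec / (np.linalg.norm(f_rec, axis=1, keepdims=True) + 1e-8)
    return float(np.mean((f_x - f_rec) ** 2))

def dvae_loss(x, x_rec, weights, lam=1.0):
    """Total dVAE training loss: pixel-level reconstruction error plus
    the perceptual-similarity term (lam is an assumed weighting)."""
    pixel = float(np.mean((x - x_rec) ** 2))
    return pixel + lam * perceptual_loss(x, x_rec, weights)
```

A codebook trained with this combined loss is pushed to assign nearby tokens to perceptually similar images, which is the property the paper argues makes the tokens better BERT prediction targets.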