PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2111.12710 Publication Date: 2021-01-01
ABSTRACT
This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve $\textbf{84.5\%}$ Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by $\textbf{+1.3\%}$ under the same pre-training epochs. Our approach also obtains significant improvements on object detection and segmentation on COCO and ADE20K. Equipped with a larger backbone ViT-H, we achieve state-of-the-art ImageNet accuracy ($\textbf{88.3\%}$) among methods using only ImageNet-1K data.
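The core idea in the abstract, adding a perceptual-similarity term to the dVAE reconstruction objective using deep features, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `deep_features` is a hypothetical stand-in for the self-supervised transformer feature extractor, and the weighting `lam` is an assumed hyperparameter.

```python
import numpy as np

def deep_features(img, weights):
    """Hypothetical stand-in for a self-supervised ViT feature extractor:
    flatten the image into 16-dim "patches" and apply a linear map + ReLU."""
    patches = img.reshape(-1, 16)
    return np.maximum(patches @ weights, 0)

def perceptual_loss(x, x_rec, weights):
    """Perceptual term: distance between deep features of the input and the
    dVAE reconstruction, computed on unit-normalized features (LPIPS-style)."""
    f_x = deep_features(x, weights)
    f_rec = deep_features(x_rec, weights)
    f_x = f_x / (np.linalg.norm(f_x, axis=1, keepdims=True) + 1e-8)
    f_rec = f_rec / (np.linalg.norm(f_rec, axis=1, keepdims=True) + 1e-8)
    return float(np.mean((f_x - f_rec) ** 2))

def dvae_loss(x, x_rec, weights, lam=1.0):
    """Total dVAE training loss: pixel-level reconstruction error plus
    the perceptual-similarity term (lam is an assumed weighting)."""
    pixel = float(np.mean((x - x_rec) ** 2))
    return pixel + lam * perceptual_loss(x, x_rec, weights)
```

A codebook trained with this combined loss is pushed to assign nearby tokens to perceptually similar images, which is the property the paper argues makes the tokens better BERT prediction targets.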