CAE v2: Context Autoencoder with CLIP Target
Keywords: Autoencoder, Representation
DOI: 10.48550/arXiv.2211.09799
Publication Date: 2022-11-17
AUTHORS (13)
ABSTRACT
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM. However, it is still under-explored how CLIP supervision in MIM influences performance. To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio, and reveal two interesting perspectives, relying on our developed simple pipeline, context autoencoder with CLIP target (CAE v2). Firstly, we observe that the supervision on visible patches achieves remarkable performance, even better than that on masked patches, where the latter is the standard format in the existing MIM methods. Secondly, the optimal mask ratio positively correlates to the model size. That is to say, the smaller the model, the lower the mask ratio needs to be. Driven by these two discoveries, our simple and concise approach CAE v2 achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model achieves 81.7% and 86.7% top-1 accuracy on linear probing and fine-tuning on ImageNet-1K, and 55.9% mIoU on semantic segmentation on ADE20K with pre-training for 300 epochs. We hope our findings can be helpful guidelines for the pre-training in the MIM area, especially for small-scale models.
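The abstract's two key choices, where the CLIP supervision is applied (visible vs. masked patches) and how the mask ratio is set, can be illustrated with a short objective function. The following is a minimal PyTorch sketch, not the authors' released implementation: the function name clip_targeted_mim_loss is hypothetical, and random tensors stand in for the student encoder's outputs and the frozen CLIP teacher's patch features.

```python
import torch
import torch.nn.functional as F


def clip_targeted_mim_loss(student_tokens, clip_tokens, visible_mask):
    """Cosine-distance loss between student patch tokens and CLIP patch
    features, applied only on visible patches (the supervision position
    the abstract reports works best). Illustrative sketch, not the
    paper's actual loss definition."""
    pred = F.normalize(student_tokens, dim=-1)    # [B, N, D], unit-norm
    target = F.normalize(clip_tokens, dim=-1)     # [B, N, D], unit-norm
    per_patch = 1.0 - (pred * target).sum(dim=-1)  # [B, N] cosine distance
    return per_patch[visible_mask].mean()          # average over visible patches


# Toy shapes: batch of 2 images, 196 patches (14x14 grid of a 224px ViT/16), dim 768.
B, N, D = 2, 196, 768
mask_ratio = 0.5  # per the abstract, smaller models call for a lower ratio

student_tokens = torch.randn(B, N, D, requires_grad=True)  # stand-in: encoder output
clip_tokens = torch.randn(B, N, D)                          # stand-in: frozen CLIP features

# Randomly mark `mask_ratio` of the patches as masked; the rest stay visible.
visible_mask = torch.rand(B, N) >= mask_ratio

loss = clip_targeted_mim_loss(student_tokens, clip_tokens, visible_mask)
loss.backward()
print(f"loss = {loss.item():.4f}")
```

Flipping `visible_mask` to its complement recovers the standard format of existing MIM methods, which supervise the masked patches instead; the abstract's first finding is that the visible-patch variant performs better.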