CAE v2: Context Autoencoder with CLIP Target
Keywords: Autoencoder, Representation
DOI: 10.48550/arXiv.2211.09799
Publication Date: 2022-11-17
AUTHORS (13)
ABSTRACT
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM. However, it is still under-explored how CLIP supervision in MIM influences performance. To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio, and reveal two interesting perspectives, relying on our developed simple pipeline, context autoencoder with CLIP target (CAE v2). Firstly, we observe that the supervision on visible patches achieves remarkable performance, even better than that on masked patches, where the latter is the standard format in the existing MIM methods. Secondly, the optimal mask ratio positively correlates to the model size. That is to say, the smaller the model, the lower the mask ratio needs to be. Driven by these two discoveries, our simple and concise approach CAE v2 achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model achieves 81.7% and 86.7% top-1 accuracy on linear probing and fine-tuning on ImageNet-1K, and 55.9% mIoU on semantic segmentation on ADE20K with pre-training for 300 epochs. We hope our findings can be helpful guidelines for the pre-training in the MIM area, especially for small-scale models.
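The abstract's two key choices, where the CLIP supervision is applied (visible vs. masked patches) and how the mask ratio is set, can be illustrated with a short objective function. The following is a minimal PyTorch sketch, not the authors' released implementation: the function name clip_targeted_mim_loss is hypothetical, and random tensors stand in for the student encoder's outputs and the frozen CLIP teacher's patch features.

```python
import torch
import torch.nn.functional as F


def clip_targeted_mim_loss(student_tokens, clip_tokens, visible_mask):
    """Cosine-distance loss between student patch tokens and CLIP patch
    features, applied only on visible patches (the supervision position
    the abstract reports works best). Illustrative sketch, not the
    paper's actual loss definition."""
    pred = F.normalize(student_tokens, dim=-1)    # [B, N, D], unit-norm
    target = F.normalize(clip_tokens, dim=-1)     # [B, N, D], unit-norm
    per_patch = 1.0 - (pred * target).sum(dim=-1)  # [B, N] cosine distance
    return per_patch[visible_mask].mean()          # average over visible patches


# Toy shapes: batch of 2 images, 196 patches (14x14 grid of a 224px ViT/16), dim 768.
B, N, D = 2, 196, 768
mask_ratio = 0.5  # per the abstract, smaller models call for a lower ratio

student_tokens = torch.randn(B, N, D, requires_grad=True)  # stand-in: encoder output
clip_tokens = torch.randn(B, N, D)                          # stand-in: frozen CLIP features

# Randomly mark `mask_ratio` of the patches as masked; the rest stay visible.
visible_mask = torch.rand(B, N) >= mask_ratio

loss = clip_targeted_mim_loss(student_tokens, clip_tokens, visible_mask)
loss.backward()
print(f"loss = {loss.item():.4f}")
```

Flipping `visible_mask` to its complement recovers the standard format of existing MIM methods, which supervise the masked patches instead; the abstract's first finding is that the visible-patch variant performs better.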