Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders
DOI:
10.1609/aaai.v38i3.28052
Publication Date:
2024-03-25T09:20:55Z
AUTHORS (4)
ABSTRACT
Recent advances in vision-language pretraining (VLP) have been largely attributed to large-scale data collected from the web. However, uncurated datasets contain weakly correlated image-text pairs, which causes data inefficiency. To address this issue, knowledge distillation has been explored, at the expense of extra image and text momentum encoders that generate teaching signals for misaligned pairs. In this paper, our goal is to resolve the misalignment problem with an efficient framework. To this end, we propose ECLIPSE: Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders. ECLIPSE features a distinctive architecture wherein a shared text encoder is utilized between the online image encoder and the momentum image encoder. This strategic design choice enables the distillation to operate within a unified projected text embedding space, resulting in better performance. Based on this unified embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder. Through extensive experiments, we validate that there is a sweet spot of expedition where the partial view from the expedited online image encoder interacts complementarily with the momentum teacher. As a result, ECLIPSE outperforms its counterparts while achieving substantial acceleration in inference speed.
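The self-distillation scheme the abstract describes (an online image encoder distilled from a momentum image encoder, with both compared against one shared text encoder so teacher and student share a projected embedding space) can be sketched as follows. This is a minimal NumPy illustration under common assumptions for such methods (soft-target contrastive loss, EMA teacher update); all function names, the `alpha` mixing weight, and the temperature `tau` are illustrative, not the paper's actual implementation.

```python
import numpy as np

def l2norm(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def self_distilled_clip_loss(online_img, momentum_img, txt, tau=0.07, alpha=0.5):
    """Image-to-text contrastive loss with self-distilled soft targets.

    Both the online and the momentum image embeddings are scored against
    the SAME text embeddings (one shared text encoder), so the teacher's
    soft targets live in the same projected space as the student's logits.
    `alpha` mixes the hard diagonal targets with the teacher distribution
    (illustrative hyperparameter, not from the paper).
    """
    logits = l2norm(online_img) @ l2norm(txt).T / tau      # student scores
    t_logits = l2norm(momentum_img) @ l2norm(txt).T / tau  # teacher scores
    hard = np.eye(len(txt))                                # matched pairs on the diagonal
    soft = np.exp(log_softmax(t_logits))                   # teacher's soft distribution
    target = alpha * hard + (1.0 - alpha) * soft
    return float(-(target * log_softmax(logits)).sum(axis=1).mean())

def ema_update(student_params, teacher_params, m=0.995):
    """Momentum (EMA) update of the teacher image encoder's weights."""
    return [m * t + (1.0 - m) * s for s, t in zip(student_params, teacher_params)]

# Demo with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
txt = l2norm(rng.normal(size=(4, 8)))
online = txt + 0.1 * rng.normal(size=(4, 8))    # student's (expedited) view
momentum = txt + 0.05 * rng.normal(size=(4, 8)) # teacher's full view
loss = self_distilled_clip_loss(online, momentum, txt)
```

The expedition in ECLIPSE refers to feeding the online image encoder a cheaper partial view of the input; in this sketch that would only change how `online` is produced, while the shared text space keeps student logits and teacher targets directly comparable.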