$\boldsymbol{M^2}$-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
DOI:
10.48550/arxiv.2401.15896
Publication Date:
2024-01-29
AUTHORS (9)
ABSTRACT
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLM models supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce a comprehensive bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models to understand images well in both languages. To handle a dataset of such scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which reduces communication overhead and GPU memory demands significantly, facilitating a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model has achieved top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
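The abstract's key efficiency idea is computing the contrastive loss with "grouped aggregation" rather than contrasting every sample against the full global batch. The paper's exact mechanism is not spelled out here, so the sketch below is only one plausible reading: restrict each sample's negatives to a small group, shrinking the similarity matrix (and, in a distributed setting, the embeddings that must be gathered). All names such as grouped_contrastive_loss and group_size are hypothetical, not from the paper.

```python
# Minimal single-process sketch of a grouped-aggregation contrastive loss.
# Assumption: grouping limits negatives to within-group samples; the actual
# M^2-Encoder algorithm may differ in how groups are formed and synchronized.
import torch
import torch.nn.functional as F

def grouped_contrastive_loss(img_emb, txt_emb, group_size, temperature=0.07):
    """InfoNCE-style loss where each sample contrasts only against negatives
    from its own group of `group_size` pairs, not the full batch."""
    n = img_emb.shape[0]
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    losses = []
    for start in range(0, n, group_size):
        end = min(start + group_size, n)
        # Similarity matrix is (g, g) instead of (n, n), which is what would
        # cut GPU memory and cross-device communication at large batch sizes.
        logits = img_emb[start:end] @ txt_emb[start:end].T / temperature
        targets = torch.arange(end - start)
        losses.append((F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.T, targets)) / 2)
    return torch.stack(losses).mean()

# Toy usage: 64 image-text pairs split into groups of 16.
img = torch.randn(64, 256)
txt = torch.randn(64, 256)
print(grouped_contrastive_loss(img, txt, group_size=16).item())
```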