$\boldsymbol{M^2}$-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
DOI:
10.48550/arxiv.2401.15896
Publication Date:
2024-01-29
AUTHORS (9)
ABSTRACT
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLM models supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce a comprehensive bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models to understand images well in both languages. To handle a dataset of such scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which reduces communication overhead and GPU memory demands significantly, facilitating a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model has achieved top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
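The abstract's key efficiency idea is computing the contrastive loss with "grouped aggregation" rather than contrasting every sample against the full global batch. The paper's exact mechanism is not spelled out here, so the sketch below is only one plausible reading: restrict each sample's negatives to a small group, shrinking the similarity matrix (and, in a distributed setting, the embeddings that must be gathered). All names such as grouped_contrastive_loss and group_size are hypothetical, not from the paper.

```python
# Minimal single-process sketch of a grouped-aggregation contrastive loss.
# Assumption: grouping limits negatives to within-group samples; the actual
# M^2-Encoder algorithm may differ in how groups are formed and synchronized.
import torch
import torch.nn.functional as F

def grouped_contrastive_loss(img_emb, txt_emb, group_size, temperature=0.07):
    """InfoNCE-style loss where each sample contrasts only against negatives
    from its own group of `group_size` pairs, not the full batch."""
    n = img_emb.shape[0]
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    losses = []
    for start in range(0, n, group_size):
        end = min(start + group_size, n)
        # Similarity matrix is (g, g) instead of (n, n), which is what would
        # cut GPU memory and cross-device communication at large batch sizes.
        logits = img_emb[start:end] @ txt_emb[start:end].T / temperature
        targets = torch.arange(end - start)
        losses.append((F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.T, targets)) / 2)
    return torch.stack(losses).mean()

# Toy usage: 64 image-text pairs split into groups of 16.
img = torch.randn(64, 256)
txt = torch.randn(64, 256)
print(grouped_contrastive_loss(img, txt, group_size=16).item())
```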