WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
DOI:
10.48550/arxiv.2103.06561
Publication Date:
2021-01-01
AUTHORS (35)
ABSTRACT
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples under limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
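The core technical idea in the abstract is a two-tower contrastive objective where a MoCo-style queue of past embeddings supplies extra negatives beyond the current batch. The following is a minimal PyTorch sketch of that idea, not the authors' released BriVL implementation: the class name CrossModalMoCo, the encoder interfaces, and hyper-parameters such as queue_size and tau are illustrative assumptions, and the momentum-updated key encoder and the symmetric text-to-image loss of full MoCo are omitted for brevity.

    # Sketch of a cross-modal MoCo-style contrastive loss with a negative queue.
    # Illustrative only: names, dimensions, and hyper-parameters are assumptions.
    import torch
    import torch.nn.functional as F

    class CrossModalMoCo(torch.nn.Module):
        def __init__(self, img_encoder, txt_encoder,
                     dim=256, queue_size=16384, tau=0.07):
            super().__init__()
            self.img_encoder = img_encoder  # images -> (B, dim) embeddings
            self.txt_encoder = txt_encoder  # texts  -> (B, dim) embeddings
            self.tau = tau
            # Queue of past text embeddings used as extra negatives for images
            # (full MoCo would keep a symmetric image queue for the text side).
            self.register_buffer(
                "txt_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
            self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

        @torch.no_grad()
        def _enqueue(self, keys):
            # Replace the oldest entries with the newest keys (FIFO).
            n = keys.shape[0]
            ptr = int(self.ptr)
            self.txt_queue[ptr:ptr + n] = keys  # assumes queue_size % n == 0
            self.ptr[0] = (ptr + n) % self.txt_queue.shape[0]

        def forward(self, images, texts):
            q = F.normalize(self.img_encoder(images), dim=1)    # queries
            with torch.no_grad():  # keys: full MoCo uses a momentum encoder here
                k = F.normalize(self.txt_encoder(texts), dim=1)
            pos = (q * k).sum(dim=1, keepdim=True)              # (B, 1) positives
            neg = q @ self.txt_queue.t()                        # (B, K) negatives
            logits = torch.cat([pos, neg], dim=1) / self.tau
            labels = torch.zeros(logits.shape[0], dtype=torch.long,
                                 device=logits.device)          # positive is index 0
            loss = F.cross_entropy(logits, labels)              # InfoNCE loss
            self._enqueue(k)
            return loss

The queue is what lets the effective number of negatives (queue_size) greatly exceed the batch size, which is the abstract's stated reason this design fits limited GPU resources better than CLIP's in-batch negatives.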