WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
DOI:
10.48550/arxiv.2103.06561
Publication Date:
2021-01-01
AUTHORS (35)
ABSTRACT
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples under limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
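The core technical idea in the abstract is a two-tower contrastive objective where a MoCo-style queue of past embeddings supplies extra negatives beyond the current batch. The following is a minimal PyTorch sketch of that idea, not the authors' released BriVL implementation: the class name CrossModalMoCo, the encoder interfaces, and hyper-parameters such as queue_size and tau are illustrative assumptions, and the momentum-updated key encoder and the symmetric text-to-image loss of full MoCo are omitted for brevity.

    # Sketch of a cross-modal MoCo-style contrastive loss with a negative queue.
    # Illustrative only: names, dimensions, and hyper-parameters are assumptions.
    import torch
    import torch.nn.functional as F

    class CrossModalMoCo(torch.nn.Module):
        def __init__(self, img_encoder, txt_encoder,
                     dim=256, queue_size=16384, tau=0.07):
            super().__init__()
            self.img_encoder = img_encoder  # images -> (B, dim) embeddings
            self.txt_encoder = txt_encoder  # texts  -> (B, dim) embeddings
            self.tau = tau
            # Queue of past text embeddings used as extra negatives for images
            # (full MoCo would keep a symmetric image queue for the text side).
            self.register_buffer(
                "txt_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
            self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

        @torch.no_grad()
        def _enqueue(self, keys):
            # Replace the oldest entries with the newest keys (FIFO).
            n = keys.shape[0]
            ptr = int(self.ptr)
            self.txt_queue[ptr:ptr + n] = keys  # assumes queue_size % n == 0
            self.ptr[0] = (ptr + n) % self.txt_queue.shape[0]

        def forward(self, images, texts):
            q = F.normalize(self.img_encoder(images), dim=1)    # queries
            with torch.no_grad():  # keys: full MoCo uses a momentum encoder here
                k = F.normalize(self.txt_encoder(texts), dim=1)
            pos = (q * k).sum(dim=1, keepdim=True)              # (B, 1) positives
            neg = q @ self.txt_queue.t()                        # (B, K) negatives
            logits = torch.cat([pos, neg], dim=1) / self.tau
            labels = torch.zeros(logits.shape[0], dtype=torch.long,
                                 device=logits.device)          # positive is index 0
            loss = F.cross_entropy(logits, labels)              # InfoNCE loss
            self._enqueue(k)
            return loss

The queue is what lets the effective number of negatives (queue_size) greatly exceed the batch size, which is the abstract's stated reason this design fits limited GPU resources better than CLIP's in-batch negatives.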