Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training
Commonsense reasoning
DOI: 10.1609/aaai.v34i07.6795
Publication Date: 2020-06-29
ABSTRACT
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models, such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, where three tasks are employed: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and achieve state-of-the-art or comparable results, showing the powerful ability of cross-modal pre-training.
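The abstract describes a single multi-layer Transformer that consumes both text tokens and image region features, topped by three task heads. Below is a minimal sketch of that setup, not the authors' code: all class names, layer sizes, and the omission of positional/segment embeddings are assumptions made for illustration.

```python
# Minimal sketch of a Unicoder-VL-style joint encoder with three pre-training heads:
# Masked Language Modeling (MLM), Masked Object Classification (MOC),
# and Visual-linguistic Matching (VLM). Hyperparameters are illustrative only,
# and positional/segment embeddings are omitted for brevity.
import torch
import torch.nn as nn


class JointVisionLanguageEncoder(nn.Module):
    def __init__(self, vocab_size=30522, region_dim=2048, num_object_classes=1601,
                 d_model=768, nhead=12, num_layers=12):
        super().__init__()
        # Text side: token embeddings.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Vision side: project detected region features into the same space.
        self.region_proj = nn.Linear(region_dim, d_model)
        # Shared multi-layer Transformer over the concatenated sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # One head per pre-training task.
        self.mlm_head = nn.Linear(d_model, vocab_size)           # masked token prediction
        self.moc_head = nn.Linear(d_model, num_object_classes)   # masked region class prediction
        self.vlm_head = nn.Linear(d_model, 2)                    # image-text match / mismatch

    def forward(self, token_ids, region_feats):
        # token_ids: (batch, text_len); region_feats: (batch, num_regions, region_dim)
        text = self.token_emb(token_ids)
        vision = self.region_proj(region_feats)
        joint = torch.cat([text, vision], dim=1)   # one sequence covering both modalities
        hidden = self.encoder(joint)
        text_len = token_ids.size(1)
        text_h, vision_h = hidden[:, :text_len], hidden[:, text_len:]
        return {
            "mlm_logits": self.mlm_head(text_h),
            "moc_logits": self.moc_head(vision_h),
            "vlm_logits": self.vlm_head(hidden[:, 0]),  # first-token state scores the pair
        }


if __name__ == "__main__":
    model = JointVisionLanguageEncoder()
    tokens = torch.randint(0, 30522, (2, 16))    # dummy caption token ids
    regions = torch.randn(2, 36, 2048)           # dummy detected-region features
    out = model(tokens, regions)
    print({k: tuple(v.shape) for k, v in out.items()})
```

In this sketch the VLM head reads the first token's hidden state to decide whether the caption and image match, while the MLM and MOC heads are applied only to the text and vision positions respectively; each head adds just one linear layer, mirroring the "one additional output layer" transfer described in the abstract.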