X2-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks
DOI: 10.1109/tpami.2023.3339661
Publication Date: 2023-12-13
AUTHORS (6)
ABSTRACT
Vision language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision language alignments at the object level. In this paper, we propose to learn multi-grained vision language alignments by a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present X2-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X2-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experiment results show that X2-VLM performs the best at base and large scale for both image-text and video-text tasks, making a good trade-off between performance and model scale. Moreover, we show that the modular design of X2-VLM results in high transferability so that it can be utilized in any language or domain. For example, by simply replacing the text encoder with XLM-R, X2-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM .
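The abstract describes a flexible modular architecture in which the vision encoder, text encoder, and cross-modal fusion module are independent, so the text encoder can be swapped (e.g. for the multilingual XLM-R) without retraining the whole model. The PyTorch sketch below illustrates that general idea only; it is a minimal, hypothetical layout, not the released X2-VLM code, and all names (ModularVLM, num_fusion_layers, itm_head, the stand-in encoders) are assumptions made for illustration.

# Minimal, hypothetical sketch of a modular vision-language model in the spirit of
# the abstract: independent vision / text encoders plus a cross-modal fusion module,
# with the text encoder swappable (e.g. for a multilingual encoder such as XLM-R).
# This is NOT the released X2-VLM implementation; names and shapes are illustrative.
import torch
import torch.nn as nn


class ModularVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module,
                 hidden_dim: int = 768, num_fusion_layers: int = 6):
        super().__init__()
        self.vision_encoder = vision_encoder  # encodes image patches or video frames
        self.text_encoder = text_encoder      # swappable: monolingual or multilingual
        fusion_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=12,
                                                  batch_first=True)
        # Cross-modal fusion: text tokens attend to visual tokens via cross-attention.
        self.fusion = nn.TransformerDecoder(fusion_layer, num_layers=num_fusion_layers)
        self.itm_head = nn.Linear(hidden_dim, 2)  # image/video-text matching head

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(visual_feats)  # (B, Nv, D)
        text_tokens = self.text_encoder(text_embeds)       # (B, Nt, D)
        fused = self.fusion(tgt=text_tokens, memory=visual_tokens)
        return self.itm_head(fused[:, 0])                  # match logits from first token


if __name__ == "__main__":
    dim, batch = 768, 2
    # Stand-in encoders; in practice these would be a pre-trained ViT and BERT/XLM-R.
    vision = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=2)
    text = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=2)
    model = ModularVLM(vision, text, hidden_dim=dim)
    visual_in = torch.randn(batch, 196, dim)  # e.g. 14x14 image patches (or frame tokens)
    text_in = torch.randn(batch, 32, dim)     # e.g. 32 text token embeddings
    print(model(visual_in, text_in).shape)    # torch.Size([2, 2])

Swapping in a different text encoder only requires passing another module with the same output dimensionality, which is the transferability property the abstract claims for the multilingual setting.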