Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
DOI: 10.1007/s11633-022-1386-4
Publication Date: 2023-05-02
ABSTRACT
Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, which makes it difficult to generalize to real-world scenarios where weak correlations dominate. 2) Efficiency: Many recent works adopt a single-tower architecture with heavy detectors, which is inefficient during the inference stage because the costly computation needs to be repeated for every pair. In this work, to overcome these challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture, which enables a unified feature space where text and image modalities can be directly compared with each other, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a contrastive loss on the global image/text features to achieve high generalizability. To validate that our CMCL can be readily generalized to real-world scenarios, we construct a large multi-source dataset called WSCD. Extensive experiments show that CMCL outperforms the state-of-the-arts while being much more efficient.
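To make the described setup concrete, below is a minimal sketch of a two-tower contrastive pipeline in PyTorch. It is not the authors' implementation: the module names (TinyImageTower, TinyTextTower, multi_grid_split), the toy encoders, and the grid size are illustrative assumptions; only the overall pattern (separate encoders, shared embedding space, symmetric contrastive loss, detector-free grid patches) follows the abstract.

```python
# Hedged sketch of a two-tower contrastive setup; encoder details are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


def multi_grid_split(images: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Split each image (B, C, H, W) into grid*grid non-overlapping patches.

    One plausible reading of a "multi-grid split": fine-grained regions are
    obtained by slicing the image rather than running an object detector.
    Returns a tensor of shape (B * grid * grid, C, H/grid, W/grid).
    """
    b, c, h, w = images.shape
    patches = images.unfold(2, h // grid, h // grid).unfold(3, w // grid, w // grid)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, h // grid, w // grid)


class TinyImageTower(nn.Module):
    """Toy image encoder standing in for the real vision backbone."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(16, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.conv(x))).flatten(1)
        return F.normalize(self.proj(x), dim=-1)


class TinyTextTower(nn.Module):
    """Toy text encoder: token embedding followed by mean pooling."""
    def __init__(self, vocab: int = 1000, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.emb(tokens).mean(dim=1)
        return F.normalize(self.proj(x), dim=-1)


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over the matched image-text pairs in a batch."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    images = torch.randn(4, 3, 64, 64)        # dummy batch of 4 image-text pairs
    tokens = torch.randint(0, 1000, (4, 12))  # dummy token ids

    img_tower, txt_tower = TinyImageTower(), TinyTextTower()
    loss = contrastive_loss(img_tower(images), txt_tower(tokens))
    print("contrastive loss:", loss.item())

    # Detector-free fine-grained features: encode grid patches of each image.
    patches = multi_grid_split(images, grid=2)      # (16, 3, 32, 32)
    patch_emb = img_tower(patches).view(4, 4, -1)   # per-image grid embeddings
    print("patch embeddings:", patch_emb.shape)
```

Because each tower is run independently, gallery embeddings can be precomputed once and reused for every query, which is the efficiency argument the abstract makes against single-tower models that must re-score every image-text pair.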