Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

FOS: Computer and information sciences — Computer Vision and Pattern Recognition (cs.CV), Multimedia (cs.MM)
DOI: 10.1609/aaai.v38i16.29789 Publication Date: 2024-03-25T11:53:29Z
ABSTRACT
Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can easily be applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method can consistently improve retrieval performance and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
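The abstract describes soft-label supervision from a frozen uni-modal teacher, but gives no formulas. A common way to realize such supervision is to treat the teacher's row-wise similarity distribution as soft labels and minimize a KL divergence against the student's cross-modal similarity distribution. The sketch below illustrates that generic pattern; the function name, temperature value, and KL formulation are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_label_alignment_loss(student_logits, teacher_logits, tau=0.1):
    """Hypothetical soft-label alignment: mean KL(teacher || student).

    student_logits: (B, B) cross-modal (image-to-text) similarity scores
        produced by the retrieval model being trained.
    teacher_logits: (B, B) similarity scores from a frozen uni-modal
        pre-trained model; their softmax serves as the soft labels.
    tau: temperature controlling how peaked the distributions are
        (an assumed hyperparameter, not from the abstract).
    """
    p = softmax(teacher_logits / tau)  # soft labels from the teacher
    q = softmax(student_logits / tau)  # student's predicted distribution
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl))
```

Because the teacher distribution assigns nonzero probability to semantically close non-matching pairs, a loss of this shape penalizes the model less for ranking such "false negatives" highly, which is the intuition behind soft-label supervision.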