Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
Multimedia (cs.MM)
DOI:
10.1609/aaai.v38i16.29789
Publication Date:
2024-03-25
AUTHORS (4)
ABSTRACT
Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Extensive experiments on various image-text retrieval models and datasets demonstrate that our method can consistently improve retrieval performance and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
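The abstract does not spell out the loss formulation, so the following is only a rough sketch of how soft-label alignment in the spirit of CSA/USA could look: a frozen uni-modal teacher provides a similarity distribution over the batch, and the retrieval model's similarity distribution is pulled toward it with a KL divergence instead of one-hot contrastive targets. All names, temperatures, and the teacher/student pairing below are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only (assumed PyTorch formulation, not the official CUSA code).
import torch
import torch.nn.functional as F

def soft_label_alignment_loss(student_sim, teacher_sim, tau_student=0.05, tau_teacher=0.05):
    """KL divergence between teacher and student softmax similarity distributions.

    student_sim: (B, B) similarities from the retrieval model
                 (image-text for a CSA-style term, image-image or text-text
                 for a USA-style term).
    teacher_sim: (B, B) similarities from a frozen uni-modal pre-trained model,
                 used as soft labels in place of one-hot targets.
    """
    student_log_prob = F.log_softmax(student_sim / tau_student, dim=-1)
    teacher_prob = F.softmax(teacher_sim / tau_teacher, dim=-1).detach()
    return F.kl_div(student_log_prob, teacher_prob, reduction="batchmean")

# Hypothetical usage with L2-normalized batch embeddings (names are illustrative):
# img_emb, txt_emb: (B, D) embeddings from the image-text retrieval model
# teacher_txt_emb:  (B, D) embeddings from a uni-modal text encoder
# csa_loss = soft_label_alignment_loss(img_emb @ txt_emb.T, teacher_txt_emb @ teacher_txt_emb.T)
# usa_loss = soft_label_alignment_loss(txt_emb @ txt_emb.T, teacher_txt_emb @ teacher_txt_emb.T)
```

Because the sketch only adds auxiliary losses on similarity matrices, it is consistent with the plug-and-play claim: an existing retrieval model's architecture is untouched and only its training objective gains extra terms.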