Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval

DOI: 10.1145/3639469 Publication Date: 2024-01-09T15:11:36Z
ABSTRACT
In this article, we study a challenging cross-modal image retrieval task, Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text but a composed query, i.e., a reference image together with a modification text. Compared with conventional image-text retrieval, CQBIR is more demanding, as it requires properly preserving and modifying specific regions according to multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting the preserved and modified information and compositing it into a unified representation. However, we observe that the preserved regions learned by existing methods contain redundant information, which inevitably degrades the overall retrieval performance. To this end, we propose a novel method termed Cross-Modal Attention Preservation (CMAP). Specifically, we first leverage cross-level interaction to fully account for multi-granular semantic information, which aims to supplement high-level semantics for effective retrieval. Furthermore, different from conventional contrastive learning, our method introduces self-contrastive learning to prevent the model from confusing the attention on the preserved part with that on the modified part. Extensive experiments on three widely used datasets, i.e., FashionIQ, Shoes, and Fashion200k, demonstrate that the proposed CMAP significantly outperforms the current state-of-the-art methods on all datasets. The anonymous implementation code is available at https://github.com/CFM-MSG/Code_CMAP.
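To make the retrieval setup concrete, the sketch below illustrates the generic CQBIR training pattern the abstract describes: fuse a reference-image embedding with a modification-text embedding into a composed query, then apply a batch-wise contrastive loss so each composed query ranks its own target image highest. This is a minimal NumPy illustration under assumed additive fusion and an InfoNCE-style loss, not the paper's actual CMAP architecture or its self-contrastive objective.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize feature vectors to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def compose(ref_img, mod_txt):
    # Illustrative additive fusion of reference-image and modification-text
    # features; the paper's compositor is more elaborate (assumption).
    return l2norm(ref_img + mod_txt)

def contrastive_loss(queries, targets, tau=0.07):
    # InfoNCE over the batch: the i-th composed query should score the
    # i-th target image higher than every other target in the batch.
    sims = queries @ targets.T / tau                # (B, B) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)         # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives on the diagonal

# Toy batch of pre-extracted features (placeholders for encoder outputs).
rng = np.random.default_rng(0)
B, D = 4, 8
ref = l2norm(rng.normal(size=(B, D)))   # reference-image embeddings
txt = l2norm(rng.normal(size=(B, D)))   # modification-text embeddings
tgt = l2norm(rng.normal(size=(B, D)))   # target-image embeddings

loss = contrastive_loss(compose(ref, txt), tgt)
print(f"contrastive loss: {loss:.4f}")
```

At retrieval time the same composed embedding is compared against all gallery images and candidates are ranked by similarity; the loss above only drives the embedding spaces to align during training.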