A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing

DOI: 10.20944/preprints202306.2010.v1
Publication Date: 2023-06-30T05:48:56Z
ABSTRACT
In recent years, there has been growing interest in remote sensing image-text cross-modal retrieval, driven by the rapid development of space information technology and the significant increase in image data volume. One approach that has shown promising results on natural images is the multimodal fusion encoding method. However, remote sensing images have unique characteristics that make this task challenging. Firstly, their semantic features are fine-grained, meaning they can be divided into multiple basic units of semantic expression, and different combinations of these units generate diverse text descriptions. Additionally, these images exhibit variations in resolution, color, and perspective. These characteristics pose considerable challenges for cross-modal retrieval. To address them, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding approach. The model incorporates three training tasks: image-text matching (ITM), masked language modeling (MLM), and a newly introduced multi-view joint representations contrast (MVJRC) task. By jointly training the model with these tasks, we aim to enhance its capability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to improve the consistency of the model's feature expression and fine-grained correlation, particularly for images with large differences in resolution, color, and viewing angle. Furthermore, to reduce the computational complexity associated with large-scale fusion models and to improve retrieval efficiency, a retrieval filtering method is proposed, which achieves higher retrieval efficiency while minimizing accuracy loss. Extensive experiments were conducted on four public datasets to evaluate the proposed method and validate its effectiveness. Overall, this study introduces the MTGFE model, which combines multi-task guidance with retrieval filtering to improve both fine-grained correlation capture and retrieval efficiency, and the experimental results demonstrate its effectiveness.
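The abstract only names the three training objectives, so the following minimal PyTorch sketch illustrates, in broad strokes, how a cross-modal fusion encoder could be jointly supervised by an ITM head, an MLM head, and an MVJRC-style multi-view contrast term. Everything concrete here is an assumption for illustration, not the authors' implementation: the ToyFusionEncoder architecture, feature dimensions, equal loss weighting, and the use of a symmetric InfoNCE loss for the contrast term are all hypothetical.

# Minimal sketch of joint multi-task training for a fusion encoder.
# All architectural and loss details are illustrative assumptions, not MTGFE itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFusionEncoder(nn.Module):
    """Stand-in cross-modal fusion encoder (hypothetical, for illustration only)."""
    def __init__(self, dim=256, vocab_size=1000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(2048, dim)            # assumes 2048-d region/image features
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.itm_head = nn.Linear(dim, 2)               # matched / mismatched pair
        self.mlm_head = nn.Linear(dim, vocab_size)      # token reconstruction
        self.proj = nn.Linear(dim, 128)                 # projection for multi-view contrast

    def forward(self, img_feats, text_ids):
        # Concatenate projected image tokens and text embeddings, then fuse jointly.
        tokens = torch.cat([self.img_proj(img_feats), self.text_embed(text_ids)], dim=1)
        fused = self.fusion(tokens)
        n_img = img_feats.size(1)
        return fused[:, :n_img], fused[:, n_img:]       # image part, text part

def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE between global representations of two views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

model = ToyFusionEncoder()
img_v1 = torch.randn(8, 4, 2048)         # view 1 of each image (e.g., original)
img_v2 = torch.randn(8, 4, 2048)         # view 2 (e.g., resolution/color-augmented)
text_ids = torch.randint(0, 1000, (8, 12))
mlm_labels = text_ids.clone()            # real MLM masks a subset of tokens; all predicted here for brevity

img_tok1, txt_tok1 = model(img_v1, text_ids)
img_tok2, _ = model(img_v2, text_ids)

# ITM: in practice trained with matched and hard-negative pairs; all pairs treated as matched here.
itm_logits = model.itm_head(txt_tok1[:, 0])
loss_itm = F.cross_entropy(itm_logits, torch.ones(8, dtype=torch.long))
# MLM: predict token ids from fused text representations.
loss_mlm = F.cross_entropy(model.mlm_head(txt_tok1).flatten(0, 1), mlm_labels.flatten())
# MVJRC-style contrast: pull together representations of two views of the same image.
loss_mvjrc = info_nce(model.proj(img_tok1.mean(1)), model.proj(img_tok2.mean(1)))

loss = loss_itm + loss_mlm + loss_mvjrc  # joint multi-task objective (equal weights assumed)
loss.backward()

In this sketch the multi-view contrast term is what encourages consistent feature expression across resolution, color, and angle variations of the same scene, which is the role the abstract attributes to MVJRC; the actual loss formulation and weighting in the paper may differ.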
SUPPLEMENTAL MATERIAL
Coming soon ....