Towards Text-Image Interleaved Retrieval

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
DOI: 10.48550/arxiv.2502.12799
Publication Date: 2025-02-18
ABSTRACT
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a specific pipeline designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline based on an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline with substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
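To illustrate the core idea behind compressing visual tokens at different granularities, the sketch below mean-pools a sequence of visual token embeddings down to several nested, coarser sequences. This is a minimal, hypothetical illustration of Matryoshka-style token compression, not the paper's actual MME architecture; the function name, granularity levels, and pooling scheme are assumptions for demonstration.

```python
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, granularities=(64, 16, 4, 1)):
    """Pool a (n_tokens, dim) array of visual token embeddings into
    several coarser sequences, one per granularity level.

    Hypothetical sketch of multi-granularity token compression;
    the real MME operates inside an MLLM, not on raw arrays.
    """
    n, _ = tokens.shape
    compressed = {}
    for g in granularities:
        g = min(g, n)
        # Split the token sequence into g contiguous groups and
        # mean-pool each group into a single embedding.
        groups = np.array_split(tokens, g, axis=0)
        compressed[g] = np.stack([grp.mean(axis=0) for grp in groups])
    return compressed

# Example: 256 visual tokens of dimension 32 compressed to 64/16/4/1 tokens.
tokens = np.random.randn(256, 32)
compressed = compress_visual_tokens(tokens)
```

A retriever could then embed documents at a coarse granularity for cheap indexing and fall back to finer granularities when more visual detail is needed.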