Towards Text-Image Interleaved Retrieval

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
DOI: 10.48550/arxiv.2502.12799
Publication Date: 2025-02-18
ABSTRACT
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a specific pipeline designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline based on an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline with substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
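To illustrate the core idea behind compressing visual tokens at different granularities, the sketch below mean-pools a sequence of visual token embeddings down to several nested, coarser sequences. This is a minimal, hypothetical illustration of Matryoshka-style token compression, not the paper's actual MME architecture; the function name, granularity levels, and pooling scheme are assumptions for demonstration.

```python
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, granularities=(64, 16, 4, 1)):
    """Pool a (n_tokens, dim) array of visual token embeddings into
    several coarser sequences, one per granularity level.

    Hypothetical sketch of multi-granularity token compression;
    the real MME operates inside an MLLM, not on raw arrays.
    """
    n, _ = tokens.shape
    compressed = {}
    for g in granularities:
        g = min(g, n)
        # Split the token sequence into g contiguous groups and
        # mean-pool each group into a single embedding.
        groups = np.array_split(tokens, g, axis=0)
        compressed[g] = np.stack([grp.mean(axis=0) for grp in groups])
    return compressed

# Example: 256 visual tokens of dimension 32 compressed to 64/16/4/1 tokens.
tokens = np.random.randn(256, 32)
compressed = compress_visual_tokens(tokens)
```

A retriever could then embed documents at a coarse granularity for cheap indexing and fall back to finer granularities when more visual detail is needed.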