From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
DOI:
10.48550/arxiv.2502.09093
Publication Date:
2025-02-13
AUTHORS (8)
ABSTRACT
While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, limiting performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Utilizing dynamic embeddings from the MLP following the visual encoder, this approach supervises the hidden states at image positions and integrates image tokens into autoregressive training. Existing methods have primarily focused on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key improvement of this work is the reinterpretation of multimodal alignment as a process of recovering information from the input data, with particular emphasis on reconstructing detailed visual features. The proposed method integrates seamlessly into standard models without architectural changes. Experiments on 13 benchmarks show VDEP outperforms baselines, surpassing existing methods.
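To make the hybrid objective concrete, here is a minimal PyTorch sketch of a VDEP-style loss, assuming a LLaVA-like setup (vision encoder, MLP projector, causal LLM). All names (`vdep_loss`, `lambda_img`) and the exact form of the image term (MSE against the projected embeddings, shifted by one position) are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of a VDEP-style hybrid autoregressive loss.
# Assumes a Hugging Face-style causal LM (`llm`) that accepts
# `inputs_embeds` and returns `.logits` / `.hidden_states`.
import torch
import torch.nn.functional as F

def vdep_loss(llm, vision_encoder, mlp_projector,
              pixel_values, text_ids, text_labels, lambda_img=1.0):
    # 1. Dynamic embeddings: project visual patch features into the
    #    LLM embedding space via the MLP that follows the encoder.
    #    (Freezing the encoder here is an assumption.)
    with torch.no_grad():
        patch_feats = vision_encoder(pixel_values)     # (B, N_img, D_v)
    img_embeds = mlp_projector(patch_feats)            # (B, N_img, D_llm)

    # 2. Build the multimodal input: image embeddings prepended to the
    #    embedded text tokens, as in LLaVA-style MLLMs.
    txt_embeds = llm.get_input_embeddings()(text_ids)  # (B, N_txt, D_llm)
    inputs = torch.cat([img_embeds, txt_embeds], dim=1)

    out = llm(inputs_embeds=inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1]                     # (B, N_img+N_txt, D_llm)
    n_img = img_embeds.size(1)

    # 3. Text branch: standard next-token cross-entropy on text positions
    #    (position n_img + i predicts text token i + 1).
    txt_logits = out.logits[:, n_img:-1, :]
    loss_txt = F.cross_entropy(
        txt_logits.reshape(-1, txt_logits.size(-1)),
        text_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # 4. Image branch: supervise hidden states at image positions to
    #    reconstruct the next dynamic embedding, mirroring next-token
    #    prediction. (MSE and the one-step shift are assumptions.)
    loss_img = F.mse_loss(hidden[:, :n_img - 1, :], img_embeds[:, 1:, :])

    return loss_txt + lambda_img * loss_img
```

In this reading, the image branch mirrors next-token prediction: each image position is trained to predict the following dynamic embedding, which is what lets image tokens participate in the same autoregressive objective as text tokens without any architectural changes to the base model.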