From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2502.09093 Publication Date: 2025-02-13
ABSTRACT
While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, limiting performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Utilizing dynamic embeddings from the MLP following the visual encoder, the approach supervises the hidden states of image tokens and integrates them into autoregressive training. Existing MLLMs have primarily focused on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key improvement of this work is the reinterpretation of multimodal alignment as a process of recovering information from input data, with particular emphasis on reconstructing detailed visual features. The proposed method integrates seamlessly into standard models without architectural changes. Experiments on 13 benchmarks show that VDEP outperforms baselines, surpassing existing methods.
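The abstract suggests a hybrid objective: standard next-token cross-entropy on text, plus supervision of the LLM's hidden states at image-token positions against the dynamic embeddings produced by the MLP projector after the visual encoder. The sketch below illustrates one plausible form of such a loss; the function name, the MSE reconstruction term, the detached targets, and the loss weight are illustrative assumptions of this sketch, not the authors' implementation.

import torch.nn.functional as F

def vdep_style_loss(text_logits, text_labels,
                    image_hidden_states, image_embeddings,
                    image_loss_weight=1.0):
    # Standard autoregressive cross-entropy on text tokens.
    # text_logits: (N_text, vocab_size); text_labels: (N_text,),
    # already shifted for next-token prediction; -100 marks ignored positions.
    text_loss = F.cross_entropy(text_logits, text_labels, ignore_index=-100)

    # Supervise hidden states at image-token positions against the dynamic
    # embeddings from the MLP projector following the visual encoder.
    # Using MSE and detaching the targets are assumptions of this sketch.
    image_loss = F.mse_loss(image_hidden_states, image_embeddings.detach())

    # Hybrid objective; the relative weight is a hypothetical hyperparameter.
    return text_loss + image_loss_weight * image_loss

In use, text_logits would come from the LM head at text positions, while image_hidden_states would be gathered from the same transformer's outputs at image-token positions, so both modalities contribute to one autoregressive training signal.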
SUPPLEMENTAL MATERIAL
Coming soon...