From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
DOI:
10.48550/arxiv.2502.09093
Publication Date:
2025-02-13
AUTHORS (8)
ABSTRACT
While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, limiting performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Utilizing dynamic embeddings from the MLP following the visual encoder, this approach supervises the hidden states at image positions and integrates image tokens into autoregressive training. Existing methods have primarily focused on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key improvement of this work is the reinterpretation of multimodal alignment as a process of recovering information from the input data, with particular emphasis on reconstructing detailed visual features. The proposed method integrates seamlessly into standard models without architectural changes. Experiments on 13 benchmarks show VDEP outperforms baselines, surpassing existing methods.
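To make the hybrid objective concrete, here is a minimal PyTorch sketch of a VDEP-style loss, assuming a LLaVA-like setup (vision encoder, MLP projector, causal LLM). All names (`vdep_loss`, `lambda_img`) and the exact form of the image term (MSE against the projected embeddings, shifted by one position) are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of a VDEP-style hybrid autoregressive loss.
# Assumes a Hugging Face-style causal LM (`llm`) that accepts
# `inputs_embeds` and returns `.logits` / `.hidden_states`.
import torch
import torch.nn.functional as F

def vdep_loss(llm, vision_encoder, mlp_projector,
              pixel_values, text_ids, text_labels, lambda_img=1.0):
    # 1. Dynamic embeddings: project visual patch features into the
    #    LLM embedding space via the MLP that follows the encoder.
    #    (Freezing the encoder here is an assumption.)
    with torch.no_grad():
        patch_feats = vision_encoder(pixel_values)     # (B, N_img, D_v)
    img_embeds = mlp_projector(patch_feats)            # (B, N_img, D_llm)

    # 2. Build the multimodal input: image embeddings prepended to the
    #    embedded text tokens, as in LLaVA-style MLLMs.
    txt_embeds = llm.get_input_embeddings()(text_ids)  # (B, N_txt, D_llm)
    inputs = torch.cat([img_embeds, txt_embeds], dim=1)

    out = llm(inputs_embeds=inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1]                     # (B, N_img+N_txt, D_llm)
    n_img = img_embeds.size(1)

    # 3. Text branch: standard next-token cross-entropy on text positions
    #    (position n_img + i predicts text token i + 1).
    txt_logits = out.logits[:, n_img:-1, :]
    loss_txt = F.cross_entropy(
        txt_logits.reshape(-1, txt_logits.size(-1)),
        text_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # 4. Image branch: supervise hidden states at image positions to
    #    reconstruct the next dynamic embedding, mirroring next-token
    #    prediction. (MSE and the one-step shift are assumptions.)
    loss_img = F.mse_loss(hidden[:, :n_img - 1, :], img_embeds[:, 1:, :])

    return loss_txt + lambda_img * loss_img
```

In this reading, the image branch mirrors next-token prediction: each image position is trained to predict the following dynamic embedding, which is what lets image tokens participate in the same autoregressive objective as text tokens without any architectural changes to the base model.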