Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective

DOI: 10.48550/arxiv.2502.01524 Publication Date: 2025-02-03
ABSTRACT
The integration of vision-language modalities has been a significant focus in multimodal learning, traditionally relying on Vision-Language Pretrained Models. However, with the advent of Large Language Models (LLMs), there has been a notable shift towards incorporating LLMs with vision modalities. Following this, the training paradigms for integrating vision modalities into LLMs have evolved. Initially, the approach was to integrate the modalities through pretraining the modality integrator, a paradigm named Single-stage Tuning. It has since branched out into methods focusing on performance enhancement, denoted as Two-stage Tuning, and those prioritizing parameter efficiency, referred to as Direct Adaptation. However, existing surveys primarily address the latest Vision Large Language Models (VLLMs), leaving a gap in understanding the evolution of training paradigms and their unique parameter-efficient considerations. This paper categorizes and reviews 34 VLLMs from top conferences, journals, and highly cited Arxiv papers, focusing on parameter efficiency during adaptation from the training paradigm perspective. We first introduce the LLM architecture and parameter-efficient learning methods, followed by a discussion of vision encoders and a comprehensive taxonomy of modality integrators. We then review the three training paradigms and their efficiency considerations, summarizing benchmarks in the VLLM field. To gain deeper insights into the effectiveness of these methods, we compare and discuss the experimental results of representative models, among which the experiment of Direct Adaptation is replicated. Providing insights into recent developments and practical uses, this survey is a vital guide for researchers and practitioners navigating the efficient integration of vision modalities into LLMs.
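
To make the integrator-centric setup described above concrete, the following is a minimal sketch (not taken from the paper) of a generic VLLM layout in PyTorch: a small trainable "modality integrator" (here, a single linear projector) maps features from a frozen vision encoder into the frozen LLM's embedding space, so that only the integrator's parameters are updated, in the spirit of Single-stage Tuning. The module names, dimensions, and dummy encoder/LLM stand-ins are illustrative assumptions, not the survey's or any model's actual API.

    # Illustrative sketch only: names and sizes are assumptions, not from the survey.
    import torch
    import torch.nn as nn

    class ModalityIntegrator(nn.Module):
        """Projects vision features (d_vision) into the LLM token-embedding space (d_llm)."""

        def __init__(self, d_vision: int, d_llm: int):
            super().__init__()
            self.proj = nn.Linear(d_vision, d_llm)

        def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
            # vision_feats: (batch, num_patches, d_vision) -> (batch, num_patches, d_llm)
            return self.proj(vision_feats)

    def freeze_backbones(vision_encoder: nn.Module, llm: nn.Module,
                         integrator: ModalityIntegrator):
        """Single-stage-Tuning-style setup: freeze the vision encoder and the LLM,
        leaving only the integrator trainable (the parameter-efficient component)."""
        for p in vision_encoder.parameters():
            p.requires_grad = False
        for p in llm.parameters():
            p.requires_grad = False
        return [p for p in integrator.parameters() if p.requires_grad]

    if __name__ == "__main__":
        # Dummy stand-in modules so the sketch runs without pretrained checkpoints.
        d_vision, d_llm = 768, 4096
        vision_encoder = nn.Linear(3 * 14 * 14, d_vision)   # placeholder patch encoder
        llm = nn.Linear(d_llm, d_llm)                       # placeholder LLM block
        integrator = ModalityIntegrator(d_vision, d_llm)

        trainable = freeze_backbones(vision_encoder, llm, integrator)
        print(f"trainable integrator parameters: {sum(p.numel() for p in trainable):,}")

Under this assumed setup, Two-stage Tuning would additionally unfreeze (parts of) the LLM in a later stage, while Direct Adaptation would instead attach lightweight adapters to the frozen LLM; the sketch only illustrates the shared integrator pattern.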