Efficient Multimodal Learning from Data-centric Perspective

FOS: Computer and information sciences · Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2402.11530
Publication Date: 2024-02-18
ABSTRACT
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward solution is to leverage smaller pre-trained vision and language models, which, however, inevitably causes a significant performance drop. In this paper, we demonstrate the possibility of beating the scaling law and training a smaller but better MLLM by exploring more informative training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from condensed training data. Remarkably, our Bunny-3B outperforms state-of-the-art larger MLLMs, notably LLaVA-v1.5-13B, on multiple benchmarks. The code, models, and data can be found at https://github.com/BAAI-DCAI/Bunny.
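The abstract describes pairing a flexible vision backbone with a small language model for efficient multimodal learning. The PyTorch sketch below illustrates the general LLaVA-style design such lightweight MLLMs build on (a frozen vision encoder, a learned MLP projector, and a compact LLM). It is a minimal conceptual sketch only; the class names, dimensions, and interfaces are assumptions for illustration, not the authors' implementation.

```python
# Conceptual sketch (not the authors' code): a lightweight MLLM where a
# frozen vision encoder feeds a small language model through a learned
# cross-modal projector. All names and dimensions are illustrative.
import torch
import torch.nn as nn


class CrossModalProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(patch_features)


class LightweightMLLM(nn.Module):
    """Prepends projected image tokens to the text embeddings, then runs the LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()  # typically kept frozen
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)
        self.projector = CrossModalProjector(vision_dim, llm_dim)
        self.llm = llm  # a small language backbone, e.g. ~3B-parameter scale

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.vision_encoder(images)   # (B, num_patches, vision_dim)
        image_tokens = self.projector(patches)      # (B, num_patches, llm_dim)
        # Concatenate image tokens before the text embeddings and decode.
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))
```

In this style of design, swapping the `vision_encoder` or `llm` argument is what makes the backbones "flexible"; only the projector (and optionally the LLM) is trained on the condensed multimodal data.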