Efficient Multimodal Learning from Data-centric Perspective

FOS: Computer and information sciences · Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2402.11530
Publication Date: 2024-02-18
ABSTRACT
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward solution is to leverage smaller pre-trained vision and language models, which, however, inevitably causes a significant performance drop. In this paper, we demonstrate the possibility of beating the scaling law and training a smaller but better MLLM by exploring more informative training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from condensed training data. Remarkably, our Bunny-3B outperforms state-of-the-art larger MLLMs, notably LLaVA-v1.5-13B, on multiple benchmarks. The code, models, and data can be found at https://github.com/BAAI-DCAI/Bunny.
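The abstract describes pairing a flexible vision backbone with a small language model for efficient multimodal learning. The PyTorch sketch below illustrates the general LLaVA-style design such lightweight MLLMs build on (a frozen vision encoder, a learned MLP projector, and a compact LLM). It is a minimal conceptual sketch only; the class names, dimensions, and interfaces are assumptions for illustration, not the authors' implementation.

```python
# Conceptual sketch (not the authors' code): a lightweight MLLM where a
# frozen vision encoder feeds a small language model through a learned
# cross-modal projector. All names and dimensions are illustrative.
import torch
import torch.nn as nn


class CrossModalProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(patch_features)


class LightweightMLLM(nn.Module):
    """Prepends projected image tokens to the text embeddings, then runs the LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()  # typically kept frozen
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)
        self.projector = CrossModalProjector(vision_dim, llm_dim)
        self.llm = llm  # a small language backbone, e.g. ~3B-parameter scale

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.vision_encoder(images)   # (B, num_patches, vision_dim)
        image_tokens = self.projector(patches)      # (B, num_patches, llm_dim)
        # Concatenate image tokens before the text embeddings and decode.
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))
```

In this style of design, swapping the `vision_encoder` or `llm` argument is what makes the backbones "flexible"; only the projector (and optionally the LLM) is trained on the condensed multimodal data.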