Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
DOI: 10.48550/arxiv.2403.18814
Publication Date: 2024-03-27
AUTHORS (8)
ABSTRACT
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
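
The abstract's key mechanism is high-resolution refinement without increasing the visual token count. Below is a minimal PyTorch sketch of one plausible reading of that idea: each low-resolution visual token queries, via cross-attention, the high-resolution features of its own spatial sub-region, so the number of tokens passed to the LLM stays fixed. All module names, shapes, and the residual design here are illustrative assumptions, not the authors' implementation; consult the linked repository for the actual code.

    import torch
    import torch.nn as nn

    class PatchInfoMining(nn.Module):
        """Sketch: refine each low-res token with its high-res sub-region.
        Queries come from low-res tokens; keys/values from high-res features,
        so the output token count equals the low-res token count."""
        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, 2 * dim)
            self.proj = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, lr_tokens, hr_regions):
            # lr_tokens:  (B, N, C)    -- N low-res tokens, count unchanged
            # hr_regions: (B, N, M, C) -- M high-res features per token
            q = self.q(lr_tokens).unsqueeze(2)             # (B, N, 1, C)
            k, v = self.kv(hr_regions).chunk(2, dim=-1)    # (B, N, M, C) each
            attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, 1, M)
            out = (attn.softmax(dim=-1) @ v).squeeze(2)    # (B, N, C)
            return lr_tokens + self.proj(out)              # residual refinement

    # Usage: 576 low-res tokens, each refined from a 4x4 high-res sub-region
    # (numbers are illustrative, not taken from the paper).
    mining = PatchInfoMining(dim=1024)
    lr = torch.randn(1, 576, 1024)
    hr = torch.randn(1, 576, 16, 1024)
    refined = mining(lr, hr)  # still (1, 576, 1024): token count unchanged

The point of the sketch is the shape contract: the high-resolution encoder can contribute arbitrarily many features per region, but attention pools them back into the existing tokens, so LLM context cost does not grow with image resolution.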