Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
DOI: 10.48550/arxiv.2403.18814
Publication Date: 2024-03-27
AUTHORS (8)
ABSTRACT
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
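
The abstract's key mechanism is high-resolution refinement without increasing the visual token count. Below is a minimal PyTorch sketch of one plausible reading of that idea: each low-resolution visual token queries, via cross-attention, the high-resolution features of its own spatial sub-region, so the number of tokens passed to the LLM stays fixed. All module names, shapes, and the residual design here are illustrative assumptions, not the authors' implementation; consult the linked repository for the actual code.

    import torch
    import torch.nn as nn

    class PatchInfoMining(nn.Module):
        """Sketch: refine each low-res token with its high-res sub-region.
        Queries come from low-res tokens; keys/values from high-res features,
        so the output token count equals the low-res token count."""
        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, 2 * dim)
            self.proj = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, lr_tokens, hr_regions):
            # lr_tokens:  (B, N, C)    -- N low-res tokens, count unchanged
            # hr_regions: (B, N, M, C) -- M high-res features per token
            q = self.q(lr_tokens).unsqueeze(2)             # (B, N, 1, C)
            k, v = self.kv(hr_regions).chunk(2, dim=-1)    # (B, N, M, C) each
            attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, 1, M)
            out = (attn.softmax(dim=-1) @ v).squeeze(2)    # (B, N, C)
            return lr_tokens + self.proj(out)              # residual refinement

    # Usage: 576 low-res tokens, each refined from a 4x4 high-res sub-region
    # (numbers are illustrative, not taken from the paper).
    mining = PatchInfoMining(dim=1024)
    lr = torch.randn(1, 576, 1024)
    hr = torch.randn(1, 576, 16, 1024)
    refined = mining(lr, hr)  # still (1, 576, 1024): token count unchanged

The point of the sketch is the shape contract: the high-resolution encoder can contribute arbitrarily many features per region, but attention pools them back into the existing tokens, so LLM context cost does not grow with image resolution.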