InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
DOI:
10.48550/arxiv.2401.16420
Publication Date:
2024-01-29
AUTHORS (23)
ABSTRACT
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2, based on InternLM2-7B, in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multi-modal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
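The key mechanism named in the abstract is Partial LoRA (PLoRA): low-rank adapter weights are trained and applied only at image-token positions, while text tokens pass through the frozen pre-trained projection untouched, which is what preserves the base model's language ability. The following is a minimal PyTorch sketch of that idea under stated assumptions; it is not the released InternLM-XComposer2 code, and the names PartialLoRALinear, im_mask, and rank are illustrative.

    import torch
    import torch.nn as nn

    class PartialLoRALinear(nn.Module):
        """Linear layer whose low-rank update is applied only to image tokens.

        Illustrative sketch of the PLoRA idea, not the official implementation.
        `im_mask` marks which sequence positions are image tokens (assumption).
        """

        def __init__(self, in_features, out_features, rank=8, bias=True):
            super().__init__()
            # Frozen pre-trained projection shared by all tokens.
            self.base = nn.Linear(in_features, out_features, bias=bias)
            for p in self.base.parameters():
                p.requires_grad = False
            # Trainable low-rank adapters used only for image tokens.
            self.lora_a = nn.Linear(in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)  # update starts as a no-op

        def forward(self, x, im_mask):
            # x: (batch, seq_len, in_features); im_mask: (batch, seq_len) bool
            out = self.base(x)
            lora_out = self.lora_b(self.lora_a(x))
            # Add the low-rank correction only where the token is an image token.
            return out + lora_out * im_mask.unsqueeze(-1).to(out.dtype)

In this sketch, zero-initializing the second adapter matrix keeps the layer identical to the frozen pre-trained one at the start of training, and masking the correction by im_mask means text tokens always take the original language-model path while image tokens receive the additional visual adaptation.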