InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
DOI:
10.48550/arxiv.2401.16420
Publication Date:
2024-01-29
AUTHORS (23)
ABSTRACT
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2, based on InternLM2-7B, in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multi-modal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
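The key mechanism named in the abstract is Partial LoRA (PLoRA): low-rank adapter weights are trained and applied only at image-token positions, while text tokens pass through the frozen pre-trained projection untouched, which is what preserves the base model's language ability. The following is a minimal PyTorch sketch of that idea under stated assumptions; it is not the released InternLM-XComposer2 code, and the names PartialLoRALinear, im_mask, and rank are illustrative.

    import torch
    import torch.nn as nn

    class PartialLoRALinear(nn.Module):
        """Linear layer whose low-rank update is applied only to image tokens.

        Illustrative sketch of the PLoRA idea, not the official implementation.
        `im_mask` marks which sequence positions are image tokens (assumption).
        """

        def __init__(self, in_features, out_features, rank=8, bias=True):
            super().__init__()
            # Frozen pre-trained projection shared by all tokens.
            self.base = nn.Linear(in_features, out_features, bias=bias)
            for p in self.base.parameters():
                p.requires_grad = False
            # Trainable low-rank adapters used only for image tokens.
            self.lora_a = nn.Linear(in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)  # update starts as a no-op

        def forward(self, x, im_mask):
            # x: (batch, seq_len, in_features); im_mask: (batch, seq_len) bool
            out = self.base(x)
            lora_out = self.lora_b(self.lora_a(x))
            # Add the low-rank correction only where the token is an image token.
            return out + lora_out * im_mask.unsqueeze(-1).to(out.dtype)

In this sketch, zero-initializing the second adapter matrix keeps the layer identical to the frozen pre-trained one at the start of training, and masking the correction by im_mask means text tokens always take the original language-model path while image tokens receive the additional visual adaptation.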