VIGC: Visual Instruction Generation and Correction

FOS: Computer and information sciences; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.1609/aaai.v38i6.28338
Publication Date: 2024-03-25T09:46:03Z
ABSTRACT
The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes and therefore struggles to capture fine-grained image details. A practical solution to this problem would be to utilize available MLLMs to generate instruction data directly. However, currently accessible MLLMs are not as powerful as their LLM counterparts, and they tend to produce inadequate responses and false information. To address this issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables an MLLM to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct inaccuracies produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC
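The abstract's two-stage pipeline can be illustrated with a minimal sketch. The function names (`vig_generate`, `vic_correct`) and data layout below are hypothetical stand-ins for the finetuned MLLM calls, whose actual interfaces are not specified in the abstract; only the control flow (one VIG generation pass followed by iterative VIC correction passes) follows the described framework.

```python
# Hypothetical sketch of the VIGC data pipeline: VIG drafts a
# question-answer pair for an image, then VIC iteratively rewrites the
# answer to remove unsupported claims (reducing hallucination).

def vig_generate(image_path):
    """VIG step (stand-in): draft an instruction-tuning sample for the image."""
    return {
        "image": image_path,
        "question": "What is shown in the image?",
        "answer": "draft answer from the generation model",
    }

def vic_correct(sample):
    """VIC step (stand-in): return a copy of the sample with a revised answer.

    A real implementation would re-query the MLLM conditioned on the image
    and the current answer; here we just tag the answer as revised.
    """
    fixed = dict(sample)
    fixed["answer"] = fixed["answer"].replace("draft", "corrected")
    return fixed

def vigc_pipeline(image_path, num_vic_rounds=3):
    sample = vig_generate(image_path)      # VIG: generate diverse data
    for _ in range(num_vic_rounds):        # VIC: iterative update mechanism
        sample = vic_correct(sample)
    return sample

if __name__ == "__main__":
    sample = vigc_pipeline("img_001.jpg")
    print(sample["answer"])  # -> "corrected answer from the generation model"
```

The iterative VIC loop is the key design choice: rather than trusting a single generation, each round re-examines the current answer, so residual hallucinations have repeated chances to be removed before the sample enters the finetuning set.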