Visual In-Context Learning for Large Vision-Language Models
DOI:
10.48550/arxiv.2402.11574
Publication Date:
2024-02-18
AUTHORS (4)
ABSTRACT
In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual In-Context Learning (VICL) method comprising Demonstration Retrieval, Intent-Oriented Image Summarization, and Demonstration Composition. Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises them with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce the token count and alleviate the cross-modal interaction problem. Experimental evaluations on five visual reasoning datasets demonstrate the effectiveness of our method. Moreover, extensive experiments leverage information flow analysis to elucidate why the method works and to investigate the impact of demonstration length and position for LVLMs. The use of in-context unlearning further shows promise for resetting specific model knowledge without retraining.
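To make the pipeline described in the abstract concrete, the following minimal Python sketch illustrates its three stages: demonstration retrieval via retrieve-then-rerank, intent-oriented image summarization, and composition of a language-only prompt. It is a sketch under assumed interfaces; embed_image, summarize, and compose_prompt are hypothetical placeholders standing in for an image encoder and an LVLM call, not the authors' released implementation.

# Illustrative sketch of a VICL-style pipeline (not the authors' code).
from dataclasses import dataclass

import numpy as np


@dataclass
class Demo:
    image_id: str
    embedding: np.ndarray  # precomputed image embedding (assumed available)
    label: str


def embed_image(image_id: str, dim: int = 8) -> np.ndarray:
    """Hypothetical image encoder; returns a deterministic pseudo-embedding."""
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)


def retrieve(query_emb: np.ndarray, pool: list[Demo], k: int = 10) -> list[Demo]:
    """Stage 1: coarse retrieval of candidate demonstrations by cosine similarity."""
    return sorted(pool, key=lambda d: float(query_emb @ d.embedding), reverse=True)[:k]


def rerank(query_emb: np.ndarray, candidates: list[Demo], top_n: int = 3) -> list[Demo]:
    """Stage 2: rerank the shortlist (here the same similarity; a task-aware
    scorer or cross-encoder would be used in practice)."""
    return sorted(candidates, key=lambda d: float(query_emb @ d.embedding), reverse=True)[:top_n]


def summarize(demo: Demo, task_intent: str) -> str:
    """Intent-oriented image summarization; an LVLM call would go here."""
    return f"[summary of {demo.image_id} w.r.t. '{task_intent}']"


def compose_prompt(demos: list[Demo], task_intent: str, query_summary: str) -> str:
    """Compose language-based demonstrations into a single text prompt,
    replacing demonstration images with their summaries to save tokens."""
    lines = [f"Task: {task_intent}"]
    for d in demos:
        lines.append(f"Example: {summarize(d, task_intent)} -> {d.label}")
    lines.append(f"Query: {query_summary} ->")
    return "\n".join(lines)


if __name__ == "__main__":
    pool = [Demo(f"img_{i}", embed_image(f"img_{i}"), label=f"answer_{i}") for i in range(50)]
    q_emb = embed_image("query_img")
    demos = rerank(q_emb, retrieve(q_emb, pool, k=10), top_n=3)
    print(compose_prompt(demos, "visual reasoning", "[summary of query_img]"))

Running the script prints a text-only prompt in which each retrieved demonstration appears as an intent-conditioned summary plus its label, followed by the query; the prompt would then be fed to the LVLM for in-context prediction.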