ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2406.08164
Publication Date: 2024-06-12
ABSTRACT
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to their reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by the VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a CR benchmark built on a novel data generation pipeline that leverages VLMs themselves to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, which is also subsequently validated manually. Our benchmark provokes a noteworthy decrease in CR performance, up to 33% compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.
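The abstract's central mechanism is a generate-evaluate-select loop in which VLMs expose each other's weaknesses: one model proposes an image-grounded hard negative for a question, and the item is kept only if it fools the model under evaluation. Below is a minimal Python sketch of that loop. It is an illustrative assumption of how such a pipeline could be wired up, not the paper's actual implementation; the `VLM` callable interface, the prompts, and the names `generate_hard_negative`, `fools_model`, and `build_benchmark` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# A "VLM" here is any callable mapping (image_path, prompt) -> generated text.
# This interface is a placeholder, not an API from the paper.
VLM = Callable[[str, str], str]

@dataclass
class QAItem:
    image: str
    question: str
    correct: str
    negative: str

def generate_hard_negative(generator: VLM, image: str, question: str, correct: str) -> str:
    """Ask a strong VLM, with the image in context, to propose a plausible
    but incorrect answer (a hard negative) for this question."""
    prompt = (
        f"Question: {question}\nCorrect answer: {correct}\n"
        "Propose an answer that is plausible for this image but factually wrong."
    )
    return generator(image, prompt)

def fools_model(candidate: VLM, item: QAItem) -> bool:
    """Selection step: keep the item only if the candidate VLM prefers the
    negative, i.e. the question is genuinely hard for it."""
    prompt = (
        f"{item.question}\n(a) {item.correct}\n(b) {item.negative}\n"
        "Answer with (a) or (b)."
    )
    return "(b)" in candidate(item.image, prompt)

def build_benchmark(generator: VLM, candidate: VLM,
                    seed_qa: list[tuple[str, str, str]]) -> list[QAItem]:
    """Generate a hard negative per seed (image, question, answer) triple and
    retain only the items that the candidate model gets wrong."""
    kept = []
    for image, question, correct in seed_qa:
        negative = generate_hard_negative(generator, image, question, correct)
        item = QAItem(image, question, correct, negative)
        if fools_model(candidate, item):
            kept.append(item)
    return kept

if __name__ == "__main__":
    # Stub "models" so the sketch runs without any real VLM backend.
    generator = lambda image, prompt: "a blue mug"
    candidate = lambda image, prompt: "(b)"
    bench = build_benchmark(
        generator, candidate,
        [("img1.jpg", "What is on the table?", "a red mug")],
    )
    print(len(bench), "hard item(s) kept")
```

In this sketch the adversarial pressure comes entirely from the selection filter: only questions that the evaluated model actually fails survive, which is one plausible way to realize the up-to-33% performance drop the abstract reports.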