UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Visual Language
Visual reasoning
DOI:
10.48550/arxiv.2408.04810
Publication Date:
2024-08-08
AUTHORS (6)
ABSTRACT
Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities, from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many capabilities, it offers little benefit for reasoning or relations. Surprisingly, we also discover that today's best VLMs struggle on simple digit counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, more precise interventions, such as data quality or tailored learning objectives, offer promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run code-base with the full set of comparisons across 59 models, as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.
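The abstract describes a unified harness that runs many benchmarks, each tagged with a capability such as object recognition or counting, and reports progress per capability axis. The sketch below is a minimal illustration of that idea only; the Benchmark dataclass, the evaluate helper, and the toy scores are hypothetical and are not the actual UniBench API.

```python
# Hypothetical sketch of a capability-grouped evaluation loop (not the UniBench API).
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Benchmark:
    name: str
    capability: str                 # e.g. "object recognition", "counting", "relations"
    run: Callable[[object], float]  # returns an accuracy in [0, 1] for a given model


def evaluate(model: object, benchmarks: List[Benchmark]) -> Dict[str, float]:
    """Run every benchmark and report the mean score per capability axis."""
    per_capability: Dict[str, List[float]] = defaultdict(list)
    for bench in benchmarks:
        per_capability[bench.capability].append(bench.run(model))
    return {cap: sum(scores) / len(scores) for cap, scores in per_capability.items()}


if __name__ == "__main__":
    # Toy usage with two stand-in benchmarks and dummy scoring functions.
    benchmarks = [
        Benchmark("imagenet-zero-shot", "object recognition", lambda m: 0.72),
        Benchmark("mnist-digit-count", "counting", lambda m: 0.41),
    ]
    print(evaluate(model=None, benchmarks=benchmarks))
```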