UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Visual Language
Visual reasoning
DOI:
10.48550/arxiv.2408.04810
Publication Date:
2024-08-08
AUTHORS (6)
ABSTRACT
Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities, from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many capabilities, it offers little benefit for reasoning or relations. Surprisingly, we also discover that today's best VLMs struggle on simple digit counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, more precise interventions, such as data quality or tailored learning objectives, offer promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run code-base with the full set of comparisons across 59 models, as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.
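The abstract describes a unified harness that runs many benchmarks, each tagged with a capability such as object recognition or counting, and reports progress per capability axis. The sketch below is a minimal illustration of that idea only; the Benchmark dataclass, the evaluate helper, and the toy scores are hypothetical and are not the actual UniBench API.

```python
# Hypothetical sketch of a capability-grouped evaluation loop (not the UniBench API).
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Benchmark:
    name: str
    capability: str                 # e.g. "object recognition", "counting", "relations"
    run: Callable[[object], float]  # returns an accuracy in [0, 1] for a given model


def evaluate(model: object, benchmarks: List[Benchmark]) -> Dict[str, float]:
    """Run every benchmark and report the mean score per capability axis."""
    per_capability: Dict[str, List[float]] = defaultdict(list)
    for bench in benchmarks:
        per_capability[bench.capability].append(bench.run(model))
    return {cap: sum(scores) / len(scores) for cap, scores in per_capability.items()}


if __name__ == "__main__":
    # Toy usage with two stand-in benchmarks and dummy scoring functions.
    benchmarks = [
        Benchmark("imagenet-zero-shot", "object recognition", lambda m: 0.72),
        Benchmark("mnist-digit-count", "counting", lambda m: 0.41),
    ]
    print(evaluate(model=None, benchmarks=benchmarks))
```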