VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
DOI:
10.48550/arxiv.2408.06327
Publication Date:
2024-08-12
AUTHORS (30)
ABSTRACT
Large Multimodal Models (LMMs) have ushered in a new era of artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing of nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB constructs a trajectory training set through hybrid methods: Program-based Solvers, Agent Bootstrapping, and Human Demonstrations, promoting substantial performance improvements through behavior cloning. Our work not only aims to benchmark existing models but also provides a solid foundation for future development into visual foundation agents. Code, train & test data, and part of the fine-tuned models are available at https://github.com/THUDM/VisualAgentBench.
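The abstract mentions training LMMs by behavior cloning on a trajectory training set. The sketch below is only an illustration of that general idea, not the VAB pipeline or data format: field names such as "observation", "instruction", and "action" are assumptions made for the example. It shows how multi-step agent trajectories can be flattened into per-step supervised (prompt, target) pairs that a behavior-cloning fine-tuning run could consume.

```python
# Illustrative sketch (not the VAB codebase): flatten agent trajectories into
# per-step supervised pairs for behavior cloning. All field names are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    observation: str  # e.g. a screenshot path or rendered scene description
    action: str       # the ground-truth action string to clone


@dataclass
class Trajectory:
    instruction: str  # the task instruction given to the agent
    steps: List[Step]


def to_training_pairs(traj: Trajectory) -> List[dict]:
    """Turn one trajectory into per-step supervised examples.

    Each example conditions on the instruction, the current observation, and the
    previous actions, and asks the model to predict the next action.
    """
    pairs = []
    history: List[str] = []
    for step in traj.steps:
        prompt = (
            f"Task: {traj.instruction}\n"
            f"Previous actions: {history}\n"
            f"Observation: {step.observation}\n"
            "Next action:"
        )
        pairs.append({"prompt": prompt, "target": step.action})
        history.append(step.action)
    return pairs


# Example: a two-step GUI trajectory
demo = Trajectory(
    instruction="Open the settings page",
    steps=[
        Step(observation="screen_000.png", action="tap(icon='menu')"),
        Step(observation="screen_001.png", action="tap(text='Settings')"),
    ],
)
for pair in to_training_pairs(demo):
    print(pair["prompt"], "->", pair["target"])
```

In an actual fine-tuning setup, each prompt/target pair would be tokenized together with the corresponding image observation and the model trained with a standard next-token cross-entropy loss on the target action.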