VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
DOI:
10.48550/arxiv.2408.06327
Publication Date:
2024-08-12
AUTHORS (30)
ABSTRACT
Large Multimodal Models (LMMs) have ushered in a new era of artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing of nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB constructs a trajectory training set through hybrid methods: Program-based Solvers, Agent Bootstrapping, and Human Demonstrations, promoting substantial performance improvements through behavior cloning. Our work not only aims to benchmark existing models but also provides a solid foundation for future development into visual foundation agents. Code, train & test data, and part of the fine-tuned models are available at https://github.com/THUDM/VisualAgentBench.
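The abstract mentions training LMMs by behavior cloning on a trajectory training set. The sketch below is only an illustration of that general idea, not the VAB pipeline or data format: field names such as "observation", "instruction", and "action" are assumptions made for the example. It shows how multi-step agent trajectories can be flattened into per-step supervised (prompt, target) pairs that a behavior-cloning fine-tuning run could consume.

```python
# Illustrative sketch (not the VAB codebase): flatten agent trajectories into
# per-step supervised pairs for behavior cloning. All field names are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    observation: str  # e.g. a screenshot path or rendered scene description
    action: str       # the ground-truth action string to clone


@dataclass
class Trajectory:
    instruction: str  # the task instruction given to the agent
    steps: List[Step]


def to_training_pairs(traj: Trajectory) -> List[dict]:
    """Turn one trajectory into per-step supervised examples.

    Each example conditions on the instruction, the current observation, and the
    previous actions, and asks the model to predict the next action.
    """
    pairs = []
    history: List[str] = []
    for step in traj.steps:
        prompt = (
            f"Task: {traj.instruction}\n"
            f"Previous actions: {history}\n"
            f"Observation: {step.observation}\n"
            "Next action:"
        )
        pairs.append({"prompt": prompt, "target": step.action})
        history.append(step.action)
    return pairs


# Example: a two-step GUI trajectory
demo = Trajectory(
    instruction="Open the settings page",
    steps=[
        Step(observation="screen_000.png", action="tap(icon='menu')"),
        Step(observation="screen_001.png", action="tap(text='Settings')"),
    ],
)
for pair in to_training_pairs(demo):
    print(pair["prompt"], "->", pair["target"])
```

In an actual fine-tuning setup, each prompt/target pair would be tokenized together with the corresponding image observation and the model trained with a standard next-token cross-entropy loss on the target action.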