Understanding the Weakness of Large Language Model Agents within a Complex Android Environment
DOI: 10.48550/arxiv.2402.06596
Publication Date: 2024-02-09
AUTHORS (6)
ABSTRACT
Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. Firstly, the action space is vast and dynamic, posing difficulties for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Secondly, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Thirdly, agents need to identify optimal solutions aligning with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To address the high cost of manpower, we design a scalable and semi-automated method to construct the benchmark. In the task evaluation, AndroidArena incorporates adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as the primary reasons for the failure of LLM agents. Furthermore, we provide empirical analysis and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and offers a path forward for future research in this area. The environment, benchmark, and evaluation code are released at https://github.com/AndroidArenaAgent/AndroidArena.
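The abstract notes that AndroidArena's task metrics must handle non-unique solutions, since a user goal on Android often has several equally valid action paths. The sketch below is only a rough illustration of that general idea, not the paper's actual metric: it assumes a hypothetical representation of trajectories as action-name sequences and scores an agent against multiple reference solutions, keeping the best match so that no single "gold" path is privileged.

```python
# Hypothetical sketch of an adaptive metric for tasks with non-unique solutions.
# All action names and the LCS-based scoring are illustrative assumptions,
# not the metric defined in the AndroidArena paper.
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two action sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def adaptive_score(agent_actions: List[str], references: List[List[str]]) -> float:
    """Best normalized overlap between the agent's trajectory and any valid
    reference solution; 1.0 means the agent reproduced some valid path."""
    return max(lcs_length(agent_actions, ref) / max(len(ref), 1) for ref in references)


# Example: two equally valid (hypothetical) ways to share a photo.
refs = [
    ["open_gallery", "select_photo", "tap_share", "choose_app"],
    ["open_messenger", "tap_attach", "select_photo", "send"],
]
print(adaptive_score(["open_gallery", "select_photo", "tap_share", "choose_app"], refs))  # 1.0
```

A single-reference exact-match metric would penalize the second path even though it completes the same task; taking the maximum over reference solutions is one simple way a benchmark can adapt to that ambiguity.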