Understanding the Weakness of Large Language Model Agents within a Complex Android Environment
DOI: 10.48550/arxiv.2402.06596
Publication Date: 2024-02-09
AUTHORS (6)
ABSTRACT
Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. Firstly, the action space is vast and dynamic, posing difficulties for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Secondly, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Thirdly, agents need to identify optimal solutions aligning with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To address the high cost of manpower, we design a scalable and semi-automated method to construct the benchmark. In the task evaluation, AndroidArena incorporates adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as the primary reasons for the failure of LLM agents. Furthermore, we provide empirical analysis and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and offers a path forward for future research in this area. The environment, benchmark, and evaluation code are released at https://github.com/AndroidArenaAgent/AndroidArena.
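The abstract notes that AndroidArena's task metrics must handle non-unique solutions, since a user goal on Android often has several equally valid action paths. The sketch below is only a rough illustration of that general idea, not the paper's actual metric: it assumes a hypothetical representation of trajectories as action-name sequences and scores an agent against multiple reference solutions, keeping the best match so that no single "gold" path is privileged.

```python
# Hypothetical sketch of an adaptive metric for tasks with non-unique solutions.
# All action names and the LCS-based scoring are illustrative assumptions,
# not the metric defined in the AndroidArena paper.
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two action sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def adaptive_score(agent_actions: List[str], references: List[List[str]]) -> float:
    """Best normalized overlap between the agent's trajectory and any valid
    reference solution; 1.0 means the agent reproduced some valid path."""
    return max(lcs_length(agent_actions, ref) / max(len(ref), 1) for ref in references)


# Example: two equally valid (hypothetical) ways to share a photo.
refs = [
    ["open_gallery", "select_photo", "tap_share", "choose_app"],
    ["open_messenger", "tap_attach", "select_photo", "send"],
]
print(adaptive_score(["open_gallery", "select_photo", "tap_share", "choose_app"], refs))  # 1.0
```

A single-reference exact-match metric would penalize the second path even though it completes the same task; taking the maximum over reference solutions is one simple way a benchmark can adapt to that ambiguity.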