NFDI4DS | UHH-SEMS - Publication Details

Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents

FOS: Computer and information sciences; Computer Science - Computation and Language (cs.CL); Computer Science - Artificial Intelligence (cs.AI)

DOI: 10.48550/arxiv.2502.00510

Publication Date: 2025-01-01


AUTHORS (16)

Yang, Yingxuan

Huang, Bo

Qi, Siyuan

Feng, Chao

Hu, Haoyi

Zhu, Yuxuan

Hu, Jinbo

Zhao, Haoran

He, Ziyi

Liu, Xiao

Wang, Zongyu

Qiu, Lin

Cao, Xuezhi

Cai, Xunliang

Yu, Yong

Zhang, Weinan

ABSTRACT

Large Language Model (LLM) agent frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in cooperative game theory's Shapley Value, which systematically measures the marginal impact of individual modules and their interactions within an agent's architecture. By replacing default modules with test variants across all possible combinations, CapaBench provides a principled method for attributing performance contributions. Key contributions include: (1) We are the first to propose a Shapley Value-based methodology for quantifying the contributions of capabilities in LLM agents; (2) Modules with high Shapley Values consistently lead to predictable performance gains when combined, enabling targeted optimization; and (3) We build a multi-round dataset of over 1,500 entries spanning diverse domains and practical task scenarios, enabling comprehensive evaluation of agent capabilities. CapaBench bridges the gap between component-level evaluation and holistic system assessment, providing actionable insights for optimizing modular LLM agents and advancing their deployment in complex, real-world scenarios.
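The attribution idea in the abstract (swap default modules for test variants in every combination, then credit each module with its average marginal gain) can be illustrated with a minimal sketch of the standard Shapley Value computation. This is not CapaBench's implementation: the function names, the module labels as dictionary keys, and all scores in `toy_value` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from math import factorial

# Module names taken from the abstract; their use as keys here is an assumption.
MODULES = ("planning", "reasoning", "action", "reflection")


def shapley_values(modules, value):
    """Exact Shapley Values for a small set of modules.

    `value(S)` is the task score when the modules in frozenset S use the
    test variant and all remaining modules keep the default variant.
    """
    n = len(modules)
    phi = {}
    for m in modules:
        others = [x for x in modules if x != m]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain m.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from switching m to the test variant.
                total += weight * (value(s | {m}) - value(s))
        phi[m] = total
    return phi


def toy_value(upgraded):
    """Hypothetical scores: additive gains plus one planning/reasoning synergy."""
    gains = {"planning": 0.20, "reasoning": 0.10, "action": 0.05, "reflection": 0.01}
    score = 0.40 + sum(gains[m] for m in upgraded)  # 0.40 = all-default baseline
    if {"planning", "reasoning"} <= upgraded:
        score += 0.04  # interaction term, split equally by the Shapley Value
    return score


phi = shapley_values(MODULES, toy_value)
```

In this toy setting the 0.04 synergy is shared equally between planning and reasoning, and the four values sum to the gap between the all-test and all-default scores. With four modules the exact computation needs only 2^4 = 16 configurations, which is why enumerating all combinations is feasible here.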

