Evaluating Large Language Models with Runtime Behavior of Program Execution

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
DOI: 10.48550/arXiv.2403.16437
Publication Date: 2024-03-25
ABSTRACT
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of the execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). The evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
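To make the Incremental Consistency (IC) idea concrete, here is a minimal sketch of how such a score could be computed. It assumes each example is graded on a chain of runtime-reasoning tasks ordered from easier to harder, and counts an example as consistent when its correctness pattern never "recovers" after a failure along the chain; the task names and this exact aggregation are illustrative assumptions, not the paper's definition.

```python
from typing import List

# Hypothetical chain of runtime-reasoning tasks, ordered from easier to
# harder; the names and ordering are assumptions for illustration only.
TASK_CHAIN = ["coverage", "state", "path", "output"]

def is_incrementally_consistent(correct: List[bool]) -> bool:
    """An example is consistent if correctness never recovers after a
    failure, i.e. the per-task flags are monotonically non-increasing
    along the chain: (True, True, False, False) is consistent, while
    (True, False, True, False) is not."""
    return all(not (later and not earlier)
               for earlier, later in zip(correct, correct[1:]))

def ic_score(per_example_flags: List[List[bool]]) -> float:
    """Fraction of examples whose per-task correctness pattern is
    incrementally consistent, reported as a percentage."""
    consistent = sum(is_incrementally_consistent(f) for f in per_example_flags)
    return 100.0 * consistent / len(per_example_flags)

if __name__ == "__main__":
    flags = [
        [True, True, True, False],   # consistent: fails only at the end
        [True, False, True, False],  # inconsistent: right output, wrong path
        [True, True, False, False],  # consistent
    ]
    print(f"IC score: {ic_score(flags):.1f}")  # IC score: 66.7
```

Under this definition, a model that predicts the correct output while getting the execution path wrong is penalized as inconsistent, which matches the logical-consistency concern raised in the abstract.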