Evaluating Large Language Models with Runtime Behavior of Program Execution

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
DOI: 10.48550/arXiv.2403.16437
Publication Date: 2024-03-25
ABSTRACT
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of the execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). The evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
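To make the Incremental Consistency (IC) idea concrete, here is a minimal sketch of how such a score could be computed. It assumes each example is graded on a chain of runtime-reasoning tasks ordered from easier to harder, and counts an example as consistent when its correctness pattern never "recovers" after a failure along the chain; the task names and this exact aggregation are illustrative assumptions, not the paper's definition.

```python
from typing import List

# Hypothetical chain of runtime-reasoning tasks, ordered from easier to
# harder; the names and ordering are assumptions for illustration only.
TASK_CHAIN = ["coverage", "state", "path", "output"]

def is_incrementally_consistent(correct: List[bool]) -> bool:
    """An example is consistent if correctness never recovers after a
    failure, i.e. the per-task flags are monotonically non-increasing
    along the chain: (True, True, False, False) is consistent, while
    (True, False, True, False) is not."""
    return all(not (later and not earlier)
               for earlier, later in zip(correct, correct[1:]))

def ic_score(per_example_flags: List[List[bool]]) -> float:
    """Fraction of examples whose per-task correctness pattern is
    incrementally consistent, reported as a percentage."""
    consistent = sum(is_incrementally_consistent(f) for f in per_example_flags)
    return 100.0 * consistent / len(per_example_flags)

if __name__ == "__main__":
    flags = [
        [True, True, True, False],   # consistent: fails only at the end
        [True, False, True, False],  # inconsistent: right output, wrong path
        [True, True, False, False],  # consistent
    ]
    print(f"IC score: {ic_score(flags):.1f}")  # IC score: 66.7
```

Under this definition, a model that predicts the correct output while getting the execution path wrong is penalized as inconsistent, which matches the logical-consistency concern raised in the abstract.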