Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain
DOI:
10.48550/arxiv.2310.14053
Publication Date:
2023-01-01
AUTHORS (7)
ABSTRACT
Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates a model's self-consistency and conventional accuracy at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.
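As a rough illustration of the self-consistency chain described in the abstract, the sketch below alternates specification generation and code generation and flags a break as soon as a regenerated program fails the reference tests. The model wrapper functions (generate_code, generate_spec) and the test-based equivalence check are illustrative assumptions, not the framework's actual API; see the linked repository for the real implementation.

```python
# Minimal sketch of the NL -> PL -> NL -> PL self-consistency chain.
# Hypothetical interfaces: generate_code maps a natural language specification
# to a program; generate_spec maps a program back to a specification.
from typing import Callable, List


def passes_tests(code: str, tests: List[Callable[[str], bool]]) -> bool:
    """Approximate semantic check: the candidate program must pass every test."""
    return all(test(code) for test in tests)


def identity_chain(
    spec0: str,
    tests: List[Callable[[str], bool]],
    generate_code: Callable[[str], str],
    generate_spec: Callable[[str], str],
    length: int = 3,
) -> bool:
    """Return True if semantics are preserved across `length` rounds of
    spec -> code -> spec regeneration, i.e. the model looks self-consistent
    on this example under the test-based equivalence assumption."""
    spec = spec0
    for _ in range(length):
        code = generate_code(spec)          # PL generation from the current spec
        if not passes_tests(code, tests):   # semantic drift: the chain is broken
            return False
        spec = generate_spec(code)          # NL generation from the model's own code
    return True
```

A usage note: starting from a ground-truth specification and its reference tests, the fraction of examples for which this chain survives all rounds gives a simple self-consistency score that can be reported alongside conventional pass@1-style accuracy.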