Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering

Empirical Research
DOI: 10.48550/arxiv.2502.06193 Publication Date: 2025-02-10
ABSTRACT
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of these LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics, without relying on high-quality reference answers. Nevertheless, their exact alignment with human judgment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlation of 81.32 and 68.51 with human scores in code translation and code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. We provide...
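
To make the comparison described above concrete, the Python sketch below shows a minimal "output-based" judging loop and the Pearson-correlation check against human scores. The prompt wording, the 1-5 rating scale, and the query_judge_model helper are illustrative assumptions, not the paper's actual prompts or models; only the overall shape (ask the judge LLM for a direct score per response, then correlate with human ratings) follows the abstract, and scipy is used solely for the correlation.

from scipy.stats import pearsonr

# Illustrative judging prompt; the paper's exact instructions and scale may differ.
JUDGE_PROMPT = """You are evaluating a code translation.

Source program ({src_lang}):
{source}

Candidate translation ({tgt_lang}):
{candidate}

Rate the candidate's correctness and readability on a scale from 1 (unusable)
to 5 (perfect). Reply with a single integer."""


def query_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call to the judge LLM."""
    raise NotImplementedError("plug in your LLM provider here")


def judge_score(source: str, candidate: str, src_lang: str, tgt_lang: str) -> int:
    """Output-based judging: ask the LLM for a direct numeric rating of one response."""
    prompt = JUDGE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang, source=source, candidate=candidate
    )
    return int(query_judge_model(prompt).strip())


def human_alignment(judge_scores: list[float], human_scores: list[float]) -> float:
    """Pearson correlation between judge and human scores, scaled by 100
    to match how the abstract reports correlations (e.g., 81.32)."""
    r, _p_value = pearsonr(judge_scores, human_scores)
    return 100.0 * r

A conventional metric such as ChrF++ or BLEU would slot into the same human_alignment check in place of judge_scores, which is how the abstract's 81.32-versus-34.23 comparison is framed.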