Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering

Empirical Research
DOI: 10.48550/arxiv.2502.06193 Publication Date: 2025-02-10
ABSTRACT
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of these LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics, without relying on high-quality reference answers. Nevertheless, their exact alignment with human judgment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlation of 81.32 and 68.51 with human scores in code translation and code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. We provide...
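
To make the comparison described above concrete, the Python sketch below shows a minimal "output-based" judging loop and the Pearson-correlation check against human scores. The prompt wording, the 1-5 rating scale, and the query_judge_model helper are illustrative assumptions, not the paper's actual prompts or models; only the overall shape (ask the judge LLM for a direct score per response, then correlate with human ratings) follows the abstract, and scipy is used solely for the correlation.

from scipy.stats import pearsonr

# Illustrative judging prompt; the paper's exact instructions and scale may differ.
JUDGE_PROMPT = """You are evaluating a code translation.

Source program ({src_lang}):
{source}

Candidate translation ({tgt_lang}):
{candidate}

Rate the candidate's correctness and readability on a scale from 1 (unusable)
to 5 (perfect). Reply with a single integer."""


def query_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call to the judge LLM."""
    raise NotImplementedError("plug in your LLM provider here")


def judge_score(source: str, candidate: str, src_lang: str, tgt_lang: str) -> int:
    """Output-based judging: ask the LLM for a direct numeric rating of one response."""
    prompt = JUDGE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang, source=source, candidate=candidate
    )
    return int(query_judge_model(prompt).strip())


def human_alignment(judge_scores: list[float], human_scores: list[float]) -> float:
    """Pearson correlation between judge and human scores, scaled by 100
    to match how the abstract reports correlations (e.g., 81.32)."""
    r, _p_value = pearsonr(judge_scores, human_scores)
    return 100.0 * r

A conventional metric such as ChrF++ or BLEU would slot into the same human_alignment check in place of judge_scores, which is how the abstract's 81.32-versus-34.23 comparison is framed.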