An Empirical Study of Using Large Language Models for Unit Test Generation
Unit testing
Code coverage
Benchmark
Code generation
DOI:
10.48550/arxiv.2305.00418
Publication Date:
2023-01-01
AUTHORS (6)
ABSTRACT
A code generation model generates code by taking a prompt from a code comment, existing code, or a combination of both. Although code generation models (e.g., GitHub Copilot) are increasingly being adopted in practice, it is unclear whether they can successfully be used for unit test generation without fine-tuning for a strongly typed language like Java. To fill this gap, we investigated how well three models (Codex, GPT-3.5-Turbo, and StarCoder) can generate unit tests. We used two benchmarks (HumanEval and Evosuite SF110) to investigate the effect of context on the generation process. We evaluated the models based on compilation rates, test correctness, test coverage, and test smells. We found that the Codex model achieved above 80% coverage for the HumanEval dataset, but no model had more than 2% coverage for the EvoSuite SF110 benchmark. The generated tests also suffered from test smells, such as Duplicated Asserts and Empty Tests.
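As an illustration only (not taken from the paper), the following JUnit 5 sketch shows the two test smells named in the abstract; the Calculator class, its add method, and the test names are hypothetical placeholders for a unit under test and its generated tests.

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical unit under test, included only to make the sketch self-contained.
class Calculator {
    int add(int a, int b) { return a + b; }
}

class CalculatorGeneratedTest {

    // Duplicated Assert smell: the same assertion is repeated verbatim,
    // adding no extra checking power to the test.
    @Test
    void addTwoPositiveNumbers() {
        Calculator calc = new Calculator();
        assertEquals(4, calc.add(2, 2));
        assertEquals(4, calc.add(2, 2)); // duplicate of the previous assertion
    }

    // Empty Test smell: the test compiles and passes but verifies nothing.
    @Test
    void addNegativeNumbers() {
        // body left empty by the generation model
    }
}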