An Empirical Study of Using Large Language Models for Unit Test Generation
Unit testing
Code coverage
Benchmark
Code generation
DOI:
10.48550/arxiv.2305.00418
Publication Date:
2023-01-01
AUTHORS (6)
ABSTRACT
A code generation model generates code by taking a prompt from a code comment, existing code, or a combination of both. Although code generation models (e.g., GitHub Copilot) are increasingly being adopted in practice, it is unclear whether they can successfully be used for unit test generation without fine-tuning for a strongly typed language like Java. To fill this gap, we investigated how well three models (Codex, GPT-3.5-Turbo, and StarCoder) can generate unit tests. We used two benchmarks (HumanEval and Evosuite SF110) to investigate the effect of context on the generation process. We evaluated the models based on compilation rates, test correctness, test coverage, and test smells. We found that the Codex model achieved above 80% coverage for the HumanEval dataset, but no model had more than 2% coverage for the EvoSuite SF110 benchmark. The generated tests also suffered from test smells, such as Duplicated Asserts and Empty Tests.
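As an illustration only (not taken from the paper), the following JUnit 5 sketch shows the two test smells named in the abstract; the Calculator class, its add method, and the test names are hypothetical placeholders for a unit under test and its generated tests.

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical unit under test, included only to make the sketch self-contained.
class Calculator {
    int add(int a, int b) { return a + b; }
}

class CalculatorGeneratedTest {

    // Duplicated Assert smell: the same assertion is repeated verbatim,
    // adding no extra checking power to the test.
    @Test
    void addTwoPositiveNumbers() {
        Calculator calc = new Calculator();
        assertEquals(4, calc.add(2, 2));
        assertEquals(4, calc.add(2, 2)); // duplicate of the previous assertion
    }

    // Empty Test smell: the test compiles and passes but verifies nothing.
    @Test
    void addNegativeNumbers() {
        // body left empty by the generation model
    }
}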