
Evaluating Large Language Models Trained on Code

FOS: Computer and information sciences; Computer Science - Machine Learning (cs.LG); 0202 electrical engineering, electronic engineering, information engineering; 02 engineering and technology

DOI: 10.48550/arxiv.2107.03374
Publication Date: 2021-01-01


AUTHORS (58)

Chen, Mark

Tworek, Jerry

Jun, Heewoo

Yuan, Qiming

Pinto, Henrique P...

Kaplan, Jared

Edwards, Harri

Burda, Yuri

Joseph, Nicholas

Brockman, Greg

Ray, Alex

Puri, Raul

Krueger, Gretchen

Petrov, Michael

Khlaaf, Heidy

Sastry, Girish

Mishkin, Pamela

Chan, Brooke

Gray, Scott

Ryder, Nick

Pavlov, Mikhail

Power, Alethea

Kaiser, Lukasz

Bavarian, Mohammad

Winter, Clemens

Tillet, Philippe

Such, Felipe Petr...

Cummings, Dave

Plappert, Matthias

Chantzis, Fotios

Barnes, Elizabeth

Herbert-Voss, Ariel

Guss, William Hebgen

Nichol, Alex

Paino, Alex

Tezak, Nikolas

Tang, Jie

Babuschkin, Igor

Balaji, Suchir

Jain, Shantanu

Saunders, William

Hesse, Christopher

Carr, Andrew N.

Leike, Jan

Achiam, Josh

Misra, Vedant

Morikawa, Evan

Radford, Alec

Knight, Matthew

Brundage, Miles

Murati, Mira

Mayer, Katie

Welinder, Peter

McGrew, Bob

Amodei, Dario

McCandlish, Sam

Sutskever, Ilya

Zaremba, Wojciech

ABSTRACT

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
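
The abstract's two central ideas, functional correctness (a sample counts as solved only if it passes tests) and repeated sampling (a problem counts as solved if any of several samples passes), can be illustrated with a small sketch. The code below is not the authors' evaluation harness: the `add` problem, the candidate strings, and the helper names are hypothetical, and the candidates stand in for model samples. The `pass_at_k` function uses the commonly cited unbiased pass@k form, 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them pass.

```python
from math import comb

def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and satisfies every assert in test_src.

    WARNING: exec on untrusted model output is unsafe. A real harness would
    run candidates in a sandbox with timeouts; this is only an illustration.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the unit tests against it
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as failures.
        return False
    return True

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes,
    given that c out of n drawn samples passed."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: implement add(a, b). These strings stand in for samples.
CANDIDATES = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b):\n    return a - b\n",  # wrong result
    "def add(a, b):\n    return b + a\n",  # correct
    "def add(a, b)\n    return a + b\n",   # syntax error
]
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(src, TESTS) for src in CANDIDATES]
num_samples = len(results)
num_correct = sum(results)

for k in (1, 2, 3):
    print(f"pass@{k} = {pass_at_k(num_samples, num_correct, k):.3f}")
```

With two of the four hypothetical candidates passing, the estimate rises from 0.5 at k = 1 to 1.0 at k = 3, which mirrors in miniature why drawing more samples per problem raises the fraction of problems solved.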
