The effect of fine-tuning on language model toxicity

FOS: Computer and information sciences; Computer Science - Artificial Intelligence (cs.AI)

DOI: 10.48550/arxiv.2410.15821

Publication Date: 2024-01-01

AUTHORS (3)

Hawkins, Will

Mittelstadt, Brent

Russell, Chris

ABSTRACT

Fine-tuning language models has become increasingly popular following the proliferation of open models and improvements in cost-effective, parameter-efficient fine-tuning. However, fine-tuning can influence model properties such as safety. We assess how fine-tuning affects the propensity of different open models to output toxic content, studying Gemma, Llama, and Phi models through three experiments. First, we compare how toxicity is reduced by model developers during instruction-tuning. Second, we show that small amounts of parameter-efficient fine-tuning on developer-tuned models via low-rank adaptation on a non-adversarial dataset can significantly alter these results across models. Finally, we highlight the impact of this in the wild, demonstrating how toxicity rates of models fine-tuned by community contributors can deviate in hard-to-predict ways.

To be presented at NeurIPS 2024 Safe Generative AI Workshop.
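For readers unfamiliar with the low-rank adaptation mentioned in the abstract, the following is a minimal sketch of the general technique, not the authors' code. It assumes the standard formulation in which a frozen weight matrix W is adapted as W x + (alpha / r) B (A x), with B initialised to zero. The function names, dimensions, and hyperparameter values are hypothetical and chosen only for illustration; the abstract does not state the settings used in the paper.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x) for a LoRA-adapted linear layer.

    W is the frozen base weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the only trainable matrices.
    """
    r = len(A)  # the rank is the number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Trainable parameters for one weight matrix: (full fine-tuning, LoRA)."""
    return d_out * d_in, r * (d_in + d_out)


# Toy dimensions (hypothetical, for illustration only).
rng = random.Random(0)
d_out, d_in, r = 4, 6, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
B = [[0.0] * r for _ in range(d_out)]                                # trainable, zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapted layer reproduces the base layer exactly,
# so fine-tuning starts from the developer-tuned model's behaviour.
assert lora_forward(W, A, B, x, alpha=4.0) == matvec(W, x)

# Parameter counts for a hypothetical 4096 x 4096 projection at rank 8.
full, lora = trainable_params(4096, 4096, 8)
print(full, lora)
```

The parameter count shows why this kind of fine-tuning is cheap: only the two small matrices are trained, while the base weights stay frozen. Any shift in model behaviour, including toxicity, comes entirely from the learned product of B and A.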
