Robust Preference Optimization through Reward Model Distillation

FOS: Computer and information sciences; Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2405.19316 Publication Date: 2024-05-29
ABSTRACT
Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single annotation, or at most a few, per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to obtain a better proxy for the true preference distribution over generation pairs: we train the LM to produce preference probabilities that match those induced by a reward model trained on the data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that this approach yields improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.
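The sketch below illustrates the distillation idea described in the abstract: instead of fitting the policy's implicit DPO reward to hard 0/1 preference labels, the policy-induced preference probability is matched to the soft probability induced by a separately trained reward model. This is a minimal illustration, not the paper's implementation; the function name, tensor shapes, the `beta` hyperparameter default, and the use of a binary cross-entropy against soft targets are assumptions made for the example.

```python
# Hedged sketch of reward-model distillation for DPO-style training.
# Assumes per-example sequence log-probabilities have already been computed
# for the policy and the frozen reference model, and that a reward model
# provides scalar rewards for the chosen and rejected responses.
import torch
import torch.nn.functional as F

def distilled_preference_loss(
    policy_logps_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape [B]
    policy_logps_rejected: torch.Tensor,  # log pi_theta(y_l | x), shape [B]
    ref_logps_chosen: torch.Tensor,       # log pi_ref(y_w | x), shape [B]
    ref_logps_rejected: torch.Tensor,     # log pi_ref(y_l | x), shape [B]
    rm_reward_chosen: torch.Tensor,       # r*(x, y_w) from a trained reward model
    rm_reward_rejected: torch.Tensor,     # r*(x, y_l) from a trained reward model
    beta: float = 0.1,                    # assumed temperature, as in standard DPO
) -> torch.Tensor:
    # Policy's implicit reward margin (the DPO parameterization):
    # beta * [(log pi/pi_ref)(y_w) - (log pi/pi_ref)(y_l)]
    policy_margin = beta * (
        (policy_logps_chosen - ref_logps_chosen)
        - (policy_logps_rejected - ref_logps_rejected)
    )

    # Soft target: preference probability induced by the reward model,
    # p*(y_w preferred over y_l | x) = sigmoid(r*(x, y_w) - r*(x, y_l)).
    target_prob = torch.sigmoid(rm_reward_chosen - rm_reward_rejected)

    # Cross-entropy between the reward-model-induced preference probability
    # and the policy-induced one. With hard targets (target_prob = 1) this
    # reduces to the standard DPO loss.
    return F.binary_cross_entropy_with_logits(policy_margin, target_prob)
```

With hard labels the rewards that minimize this loss diverge, whereas soft targets from the reward model keep the optimal margin finite; the family-of-reward-models variant mentioned in the abstract would aggregate (e.g., pessimistically) over several such target distributions rather than a single one.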