Robust Preference Optimization through Reward Model Distillation

FOS: Computer and information sciences; Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2405.19316 Publication Date: 2024-05-29
ABSTRACT
Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single annotation, or at most a few, per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to obtain a better proxy for the true preference distribution over generation pairs: we train the LM to produce preference probabilities that match those induced by a reward model trained on the data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that this approach yields improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.
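The sketch below illustrates the distillation idea described in the abstract: instead of fitting the policy's implicit DPO reward to hard 0/1 preference labels, the policy-induced preference probability is matched to the soft probability induced by a separately trained reward model. This is a minimal illustration, not the paper's implementation; the function name, tensor shapes, the `beta` hyperparameter default, and the use of a binary cross-entropy against soft targets are assumptions made for the example.

```python
# Hedged sketch of reward-model distillation for DPO-style training.
# Assumes per-example sequence log-probabilities have already been computed
# for the policy and the frozen reference model, and that a reward model
# provides scalar rewards for the chosen and rejected responses.
import torch
import torch.nn.functional as F

def distilled_preference_loss(
    policy_logps_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape [B]
    policy_logps_rejected: torch.Tensor,  # log pi_theta(y_l | x), shape [B]
    ref_logps_chosen: torch.Tensor,       # log pi_ref(y_w | x), shape [B]
    ref_logps_rejected: torch.Tensor,     # log pi_ref(y_l | x), shape [B]
    rm_reward_chosen: torch.Tensor,       # r*(x, y_w) from a trained reward model
    rm_reward_rejected: torch.Tensor,     # r*(x, y_l) from a trained reward model
    beta: float = 0.1,                    # assumed temperature, as in standard DPO
) -> torch.Tensor:
    # Policy's implicit reward margin (the DPO parameterization):
    # beta * [(log pi/pi_ref)(y_w) - (log pi/pi_ref)(y_l)]
    policy_margin = beta * (
        (policy_logps_chosen - ref_logps_chosen)
        - (policy_logps_rejected - ref_logps_rejected)
    )

    # Soft target: preference probability induced by the reward model,
    # p*(y_w preferred over y_l | x) = sigmoid(r*(x, y_w) - r*(x, y_l)).
    target_prob = torch.sigmoid(rm_reward_chosen - rm_reward_rejected)

    # Cross-entropy between the reward-model-induced preference probability
    # and the policy-induced one. With hard targets (target_prob = 1) this
    # reduces to the standard DPO loss.
    return F.binary_cross_entropy_with_logits(policy_margin, target_prob)
```

With hard labels the rewards that minimize this loss diverge, whereas soft targets from the reward model keep the optimal margin finite; the family-of-reward-models variant mentioned in the abstract would aggregate (e.g., pessimistically) over several such target distributions rather than a single one.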