ODIN: Disentangled Reward Mitigates Hacking in RLHF

DOI: 10.48550/arxiv.2402.07319 Publication Date: 2024-02-11
ABSTRACT
In this work, we study the issue of reward hacking on response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response can often deceive LLM judges or even human evaluators into assigning high scores. The same issue also holds for some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between LLM evaluation score and response length obtained by varying training hyperparameters. Based on this evaluation, we conduct large-scale studies, whose results shed insights into the efficacy of the hyperparameters and tricks used in RL for mitigating length bias. We further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards, one trained to correlate with length and the other trained to decorrelate with length and therefore focus on the actual content. We then discard the length head in RL to prevent reward hacking on length. Experiments demonstrate that our approach almost eliminates the reward correlation with length and improves the obtained policy by a significant margin.
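The abstract describes a reward model with a shared feature backbone and two linear heads, where only the length-decorrelated head is kept for RL. Below is a minimal PyTorch sketch of that idea under stated assumptions: the class name, the pooled-feature interface, and the exact correlation penalty are illustrative choices, not the paper's reference implementation.

```python
# Sketch of a two-head ("disentangled") reward model in the spirit of ODIN.
# Assumption: the backbone yields a pooled feature vector per response.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                        # shared feature extractor
        self.quality_head = nn.Linear(hidden_size, 1)   # trained to decorrelate from length
        self.length_head = nn.Linear(hidden_size, 1)    # trained to correlate with length

    def forward(self, features: torch.Tensor):
        # features: (batch, hidden_size) pooled representation of each response
        r_quality = self.quality_head(features).squeeze(-1)
        r_length = self.length_head(features).squeeze(-1)
        return r_quality, r_length


def pearson_corr(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).mean() / (x.std() * y.std() + eps)


def disentangled_loss(r_q, r_l, chosen_idx, rejected_idx, lengths, lam=1.0):
    # Pairwise ranking loss on the sum of both heads (illustrative), plus
    # correlation terms that push length sensitivity into the length head
    # and away from the quality head.
    r_total = r_q + r_l
    rank_loss = -F.logsigmoid(r_total[chosen_idx] - r_total[rejected_idx]).mean()
    lengths = lengths.float()
    corr_q = pearson_corr(r_q, lengths)   # minimize |corr| for the quality head
    corr_l = pearson_corr(r_l, lengths)   # maximize corr for the length head
    return rank_loss + lam * (corr_q.abs() - corr_l)
```

At RL time, only the quality head's output would serve as the reward signal, corresponding to the "discard the length head" step described in the abstract.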