ODIN: Disentangled Reward Mitigates Hacking in RLHF

DOI: 10.48550/arxiv.2402.07319 Publication Date: 2024-02-11
ABSTRACT
In this work, we study the issue of reward hacking on response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response can often deceive LLM judges or even human evaluators into assigning high scores. The same issue also holds for some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between LLM evaluation score and response length obtained by varying training hyperparameters. Based on this evaluation, we conduct large-scale studies, whose results shed insights into the efficacy of the hyperparameters and tricks used in RL for mitigating length bias. We further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards, one trained to correlate with length and the other trained to decorrelate with length and therefore focus on the actual content. We then discard the length head in RL to prevent reward hacking on length. Experiments demonstrate that our approach almost eliminates the reward correlation with length and improves the obtained policy by a significant margin.
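The abstract describes a reward model with a shared feature backbone and two linear heads, where only the length-decorrelated head is kept for RL. Below is a minimal PyTorch sketch of that idea under stated assumptions: the class name, the pooled-feature interface, and the exact correlation penalty are illustrative choices, not the paper's reference implementation.

```python
# Sketch of a two-head ("disentangled") reward model in the spirit of ODIN.
# Assumption: the backbone yields a pooled feature vector per response.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                        # shared feature extractor
        self.quality_head = nn.Linear(hidden_size, 1)   # trained to decorrelate from length
        self.length_head = nn.Linear(hidden_size, 1)    # trained to correlate with length

    def forward(self, features: torch.Tensor):
        # features: (batch, hidden_size) pooled representation of each response
        r_quality = self.quality_head(features).squeeze(-1)
        r_length = self.length_head(features).squeeze(-1)
        return r_quality, r_length


def pearson_corr(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).mean() / (x.std() * y.std() + eps)


def disentangled_loss(r_q, r_l, chosen_idx, rejected_idx, lengths, lam=1.0):
    # Pairwise ranking loss on the sum of both heads (illustrative), plus
    # correlation terms that push length sensitivity into the length head
    # and away from the quality head.
    r_total = r_q + r_l
    rank_loss = -F.logsigmoid(r_total[chosen_idx] - r_total[rejected_idx]).mean()
    lengths = lengths.float()
    corr_q = pearson_corr(r_q, lengths)   # minimize |corr| for the quality head
    corr_l = pearson_corr(r_l, lengths)   # maximize corr for the length head
    return rank_loss + lam * (corr_q.abs() - corr_l)
```

At RL time, only the quality head's output would serve as the reward signal, corresponding to the "discard the length head" step described in the abstract.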