NFDI4DS | UHH-SEMS - Publication Details

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

FOS: Computer and information sciences Computer Science - Machine Learning Computer Science - Computation and Language Artificial Intelligence (cs.AI) Computer Science - Artificial Intelligence Computation and Language (cs.CL) Machine Learning (cs.LG)

DOI: 10.48550/arxiv.2401.16635 Publication Date: 2024-01-01

Abstract Supplemental Material References Cited by

AUTHORS (6)

Zhang, Shun

Chen, Zhenfang

Chen, Sunli

Shen, Yikang

Sun, Zhiqing

Gan, Chuang

ABSTRACT

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. As a result, RLHF may produce outputs that are misaligned with human values. To mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. As using an ensemble of large language model-based reward models can be computationally and resource-expensive, we explore efficient ensemble methods including linear-layer ensemble and LoRA-based ensemble. Empirically, we run Best-of-$n$ and Proximal Policy Optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of RLHF outputs.

SUPPLEMENTAL MATERIAL

Coming soon ....

REFERENCES ()

CITATIONS ()

EXTERNAL LINKS

OPENAIRE - Products

PlumX Metrics

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

RECOMMENDATIONS

FAIR ASSESSMENT

Coming soon ....

JUPYTER LAB

Coming soon ....