Is RLHF More Difficult than Standard RL?

FOS: Computer and information sciences. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
DOI: 10.48550/arxiv.2306.14111 Publication Date: 2023-01-01
ABSTRACT
Reinforcement Learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) learns directly from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, preference-based RL can be solved using existing algorithms and techniques for reward-based RL, with small or no extra cost. Specifically, (1) for preferences drawn from reward-based probabilistic models, the problem reduces to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences, where the objective is to find the von Neumann winner, the problem reduces to multiagent reward-based RL that finds Nash equilibria of factored Markov games under a restricted set of policies. The latter case can be further reduced to an adversarial MDP when preferences depend only on the final state. All reward-based RL subroutines are instantiated with concrete, provable algorithms, and the theory applies to a large class of models including tabular MDPs and MDPs with generic function approximation. Guarantees are also provided when K-wise comparisons are available.
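
ILLUSTRATIVE EXAMPLE
The sketch below is not from the paper; it is a minimal illustration of the idea behind reduction (1), assuming a Bradley-Terry-style model as one example of a reward-based probabilistic preference model. Pairwise preferences are fit by maximum likelihood to recover rewards up to bounded error, and those estimated rewards would then be handed to a robust reward-based RL routine. All names and parameters here are hypothetical.

# Minimal sketch (assumed Bradley-Terry preference model, not the paper's algorithm):
# recover per-state rewards from pairwise preferences, then a robust reward-based
# RL algorithm could consume the estimated (slightly noisy) rewards.
import numpy as np

rng = np.random.default_rng(0)

n_states = 5                          # tiny tabular setting
true_r = rng.normal(size=n_states)    # unknown per-state reward

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Preference data: each sample compares two single-state "trajectories" i, j and
# records whether i was preferred, with probability sigmoid(r[i] - r[j]).
n_samples = 5000
i_idx = rng.integers(n_states, size=n_samples)
j_idx = rng.integers(n_states, size=n_samples)
prefs = (rng.random(n_samples) < sigmoid(true_r[i_idx] - true_r[j_idx])).astype(float)

# Maximum-likelihood reward estimation (logistic regression on reward gaps).
r_hat = np.zeros(n_states)
lr = 1.0
for _ in range(2000):
    p = sigmoid(r_hat[i_idx] - r_hat[j_idx])
    grad = np.zeros(n_states)
    np.add.at(grad, i_idx, prefs - p)      # d log-likelihood / d r[i]
    np.add.at(grad, j_idx, -(prefs - p))   # d log-likelihood / d r[j]
    r_hat += lr * grad / n_samples         # averaged gradient ascent step

# Rewards are identified only up to an additive constant; compare centered values.
true_c = true_r - true_r.mean()
est_c = r_hat - r_hat.mean()
print("max reward estimation error:", np.abs(true_c - est_c).max())

The small residual error in the estimated rewards is exactly the kind of bounded reward perturbation that reduction (1) requires the downstream reward-based RL algorithm to be robust against.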