Logarithmic Regret for Online KL-Regularized Reinforcement Learning

FOS: Computer and information sciences
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
DOI: 10.48550/arxiv.2502.07460
Publication Date: 2025-02-11
ABSTRACT
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the analysis of the KL-regularized objective in decision making \citep{xiong2024iterative, xie2024exploratory, zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based online contextual bandit algorithm and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(\eta\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $\eta, N_{\mathcal R}, T, d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, the number of rounds, and the complexity of the reward function class, respectively. Furthermore, we extend our analysis to reinforcement learning by developing a decomposition over transition steps and also obtain a similar logarithmic regret bound.
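For context, a minimal sketch of the KL-regularized objective referenced in the abstract, in the form standard in the RLHF literature; the notation here (reference policy $\pi_{\mathrm{ref}}$, reward $r(x,a)$, context distribution $d_0$, and whether the regularization strength enters as $\eta$ or $1/\eta$) is an assumption and may differ from the paper's exact formulation:
$$
\max_{\pi}\;\mathbb{E}_{x\sim d_0,\; a\sim\pi(\cdot\mid x)}\big[r(x,a)\big]\;-\;\frac{1}{\eta}\,\mathbb{E}_{x\sim d_0}\Big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],
$$
whose maximizer admits the closed Gibbs form
$$
\pi^{*}(a\mid x)\;\propto\;\pi_{\mathrm{ref}}(a\mid x)\,\exp\big(\eta\, r(x,a)\big).
$$
Under this convention, the strong convexity contributed by the KL term is what yields the benign optimization landscape mentioned in the abstract, and the regularization parameter $\eta$ appearing in the regret bound reflects how strongly the learned policy may deviate from $\pi_{\mathrm{ref}}$.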