Logarithmic Regret for Online KL-Regularized Reinforcement Learning

FOS: Computer and information sciences
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
DOI: 10.48550/arxiv.2502.07460
Publication Date: 2025-02-11
ABSTRACT
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the analysis of the KL-regularized objective in decision making \citep{xiong2024iterative, xie2024exploratory, zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based online contextual bandit algorithm and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(\eta\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $\eta, N_{\mathcal R}, T, d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, the number of rounds, and the complexity of the reward function class, respectively. Furthermore, we extend our analysis to reinforcement learning by developing a decomposition over transition steps and also obtain a similar logarithmic regret bound.
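For context, a minimal sketch of the KL-regularized objective referenced in the abstract, in the form standard in the RLHF literature; the notation here (reference policy $\pi_{\mathrm{ref}}$, reward $r(x,a)$, context distribution $d_0$, and whether the regularization strength enters as $\eta$ or $1/\eta$) is an assumption and may differ from the paper's exact formulation:
$$
\max_{\pi}\;\mathbb{E}_{x\sim d_0,\; a\sim\pi(\cdot\mid x)}\big[r(x,a)\big]\;-\;\frac{1}{\eta}\,\mathbb{E}_{x\sim d_0}\Big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],
$$
whose maximizer admits the closed Gibbs form
$$
\pi^{*}(a\mid x)\;\propto\;\pi_{\mathrm{ref}}(a\mid x)\,\exp\big(\eta\, r(x,a)\big).
$$
Under this convention, the strong convexity contributed by the KL term is what yields the benign optimization landscape mentioned in the abstract, and the regularization parameter $\eta$ appearing in the regret bound reflects how strongly the learned policy may deviate from $\pi_{\mathrm{ref}}$.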