BOND: Aligning LLMs with Best-of-N Distillation

FOS: Computer and information sciences; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2407.14622
Publication Date: 2024-07-19
ABSTRACT
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling, which selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.
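The abstract mentions two ingredients: Best-of-N sampling (pick the highest-reward generation among N candidates) and a weighted Jeffreys divergence combining forward and backward KL. The sketch below is only an illustration of those two ideas on toy categorical distributions, not the paper's implementation; the function names, the weighting parameter `beta`, and the toy tensors are assumptions introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def best_of_n(candidates, rewards):
    """Best-of-N sampling: return the candidate with the highest reward."""
    best = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best]

def jeffreys_divergence(log_p, log_q, beta=0.5):
    """Weighted Jeffreys divergence between two categorical distributions
    given as log-probabilities: (1 - beta) * KL(p || q) + beta * KL(q || p).
    (The weighting is an assumed parameterization for this sketch.)"""
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="sum")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="sum")  # KL(q || p)
    return (1 - beta) * kl_pq + beta * kl_qp

# Toy usage: pick the best of N candidates under a stand-in reward model,
# then nudge a toy policy toward a hypothetical Best-of-N target distribution.
print(best_of_n(["a", "b", "c"], [0.1, 0.9, 0.4]))  # -> "b"

log_bon = torch.log_softmax(torch.tensor([2.0, 0.5, 0.1]), dim=-1)  # target (hypothetical)
logits = torch.randn(3, requires_grad=True)                          # stand-in policy logits
log_policy = torch.log_softmax(logits, dim=-1)
loss = jeffreys_divergence(log_policy, log_bon, beta=0.5)
loss.backward()  # gradients accumulate in logits.grad
```

Setting `beta` near 0 emphasizes the mode-covering forward KL, while `beta` near 1 emphasizes the mode-seeking backward KL, which is the trade-off the abstract refers to.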