BOND: Aligning LLMs with Best-of-N Distillation

FOS: Computer and information sciences; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2407.14622
Publication Date: 2024-07-19
ABSTRACT
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling, which selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.
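The abstract mentions two ingredients: Best-of-N sampling (pick the highest-reward generation among N candidates) and a weighted Jeffreys divergence combining forward and backward KL. The sketch below is only an illustration of those two ideas on toy categorical distributions, not the paper's implementation; the function names, the weighting parameter `beta`, and the toy tensors are assumptions introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def best_of_n(candidates, rewards):
    """Best-of-N sampling: return the candidate with the highest reward."""
    best = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best]

def jeffreys_divergence(log_p, log_q, beta=0.5):
    """Weighted Jeffreys divergence between two categorical distributions
    given as log-probabilities: (1 - beta) * KL(p || q) + beta * KL(q || p).
    (The weighting is an assumed parameterization for this sketch.)"""
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="sum")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="sum")  # KL(q || p)
    return (1 - beta) * kl_pq + beta * kl_qp

# Toy usage: pick the best of N candidates under a stand-in reward model,
# then nudge a toy policy toward a hypothetical Best-of-N target distribution.
print(best_of_n(["a", "b", "c"], [0.1, 0.9, 0.4]))  # -> "b"

log_bon = torch.log_softmax(torch.tensor([2.0, 0.5, 0.1]), dim=-1)  # target (hypothetical)
logits = torch.randn(3, requires_grad=True)                          # stand-in policy logits
log_policy = torch.log_softmax(logits, dim=-1)
loss = jeffreys_divergence(log_policy, log_bon, beta=0.5)
loss.backward()  # gradients accumulate in logits.grad
```

Setting `beta` near 0 emphasizes the mode-covering forward KL, while `beta` near 1 emphasizes the mode-seeking backward KL, which is the trade-off the abstract refers to.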