Human Alignment of Large Language Models through Online Preference Optimisation

FOS: Computer and information sciences
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
DOI: 10.48550/arxiv.2403.08635 Publication Date: 2024-03-13
ABSTRACT
Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently, and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is, when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss on such a stream of data then becomes equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policy), similarly to the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data, such as DPO and SLiC, on a summarisation task.
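As a minimal illustration of the IPO objective discussed in the abstract, the sketch below (not the authors' code; the tensor names and the regularisation constant `tau` are assumptions for illustration) regresses the reference-normalised log-likelihood margin between preferred and dispreferred completions onto 1/(2τ). In the online setting described above, both completions would be sampled from the current policy (or, for IPO-MD, from a mixture of the current and reference policies) and labelled by a trained preference model.

```python
# Minimal sketch of the IPO loss on a batch of preference pairs, assuming
# per-sequence log-probabilities have already been computed under the current
# policy (logp_pi_*) and a frozen reference policy (logp_ref_*).
import torch


def ipo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l, tau=0.1):
    """Squared-error IPO objective: push the log-likelihood-ratio margin
    between the preferred (w) and dispreferred (l) completions towards 1/(2*tau)."""
    margin = (logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities standing in for model outputs.
    batch = 4
    logp_pi_w, logp_pi_l = torch.randn(batch), torch.randn(batch)
    logp_ref_w, logp_ref_l = torch.randn(batch), torch.randn(batch)
    print(ipo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l))
```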