Human Alignment of Large Language Models through Online Preference Optimisation

FOS: Computer and information sciences
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
DOI: 10.48550/arxiv.2403.08635 Publication Date: 2024-03-13
ABSTRACT
Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently, and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is, when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss on such a stream of data then becomes equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policy), similarly to the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data, such as DPO and SLiC, on a summarisation task.
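As a minimal illustration of the IPO objective discussed in the abstract, the sketch below (not the authors' code; the tensor names and the regularisation constant `tau` are assumptions for illustration) regresses the reference-normalised log-likelihood margin between preferred and dispreferred completions onto 1/(2τ). In the online setting described above, both completions would be sampled from the current policy (or, for IPO-MD, from a mixture of the current and reference policies) and labelled by a trained preference model.

```python
# Minimal sketch of the IPO loss on a batch of preference pairs, assuming
# per-sequence log-probabilities have already been computed under the current
# policy (logp_pi_*) and a frozen reference policy (logp_ref_*).
import torch


def ipo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l, tau=0.1):
    """Squared-error IPO objective: push the log-likelihood-ratio margin
    between the preferred (w) and dispreferred (l) completions towards 1/(2*tau)."""
    margin = (logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities standing in for model outputs.
    batch = 4
    logp_pi_w, logp_pi_l = torch.randn(batch), torch.randn(batch)
    logp_ref_w, logp_ref_l = torch.randn(batch), torch.randn(batch)
    print(ipo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l))
```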