Human Alignment of Large Language Models through Online Preference Optimisation
FOS: Computer and information sciences
Machine Learning (cs.LG)
Artificial Intelligence (cs.AI)
Machine Learning (stat.ML)
DOI:
10.48550/arXiv.2403.08635
Publication Date:
2024-03-13
AUTHORS (13)
ABSTRACT
Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently, and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, the equivalence can be proven when we consider the online version of IPO, that is, when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss on such a stream of data then becomes equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policy), similarly to the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data, such as DPO and SLiC, on a summarisation task.
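The abstract describes online IPO as optimising the IPO loss on pairs of generations sampled from the current policy and annotated by a trained preference model. The snippet below is a minimal illustrative sketch of that loss on one such online batch, assuming log-probabilities under the online and reference policies are already computed; the function name, tensor shapes, and use of PyTorch are assumptions for illustration, not the authors' implementation, and the loss follows the published IPO objective (squared regression of the log-likelihood-ratio gap towards 1/(2τ)).

```python
# Illustrative sketch (not the authors' code): IPO loss on an online batch.
# Inputs are per-example log-probabilities of the preferred (w) and
# dispreferred (l) generations under the online policy pi and the frozen
# reference policy pi_ref. Preferences are assumed to come from a trained
# preference model applied to two samples drawn from the online policy.
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    # h = log[ pi(y_w)/pi_ref(y_w) ] - log[ pi(y_l)/pi_ref(y_l) ]
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # IPO regresses h towards 1/(2*tau); tau controls the strength of the
    # KL regularisation towards the reference policy.
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

# Toy usage with random log-probs standing in for model outputs.
logp_w = torch.randn(8, requires_grad=True)
logp_l = torch.randn(8, requires_grad=True)
ref_logp_w, ref_logp_l = torch.randn(8), torch.randn(8)
loss = ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)
loss.backward()  # gradients flow only to the online policy's log-probs
```

In the IPO-MD variant described above, the loss is unchanged; the two generations would instead be drawn from a mixture of the online and reference policies (a geometric mixture in Nash-MD), which is the regularised sampling the abstract refers to.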