Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning

FOS: Computer and information sciences
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2410.06508
Publication Date: 2024-10-08
ABSTRACT
Monte Carlo Tree Search (MCTS) has recently emerged as a powerful technique for enhancing the reasoning capabilities of LLMs. Techniques such as SFT or DPO have enabled LLMs to distill high-quality behaviors from MCTS, improving their reasoning performance. However, existing distillation methods underutilize the rich trajectory information generated by MCTS, limiting the potential improvements in LLM reasoning. In this paper, we propose AlphaLLM-CPL, a novel pairwise training framework that enables LLMs to self-improve through MCTS behavior distillation. AlphaLLM-CPL efficiently leverages MCTS trajectories via two key innovations: (1) AlphaLLM-CPL constructs stepwise trajectory pairs from child nodes sharing the same parent in the search tree, providing step-level information for more effective MCTS behavior distillation. (2) AlphaLLM-CPL introduces curriculum preference learning, dynamically adjusting the training sequence of trajectory pairs in each offline training epoch to prioritize critical learning steps and mitigate overfitting. Experimental results on mathematical reasoning tasks demonstrate that AlphaLLM-CPL significantly outperforms previous MCTS behavior distillation methods, substantially boosting the reasoning capabilities of LLMs.
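
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` and `StepPair` types, the `value` field, the `min_gap` threshold, and the use of the value gap as a curriculum priority are all assumptions made for illustration. The abstract states only that pairs come from child nodes sharing the same parent and that the pair order is adjusted in each offline training epoch.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """One reasoning step in a search tree (hypothetical structure)."""
    step: str                                   # text of this reasoning step
    value: float                                # value estimate from search (assumed)
    children: list["Node"] = field(default_factory=list)


@dataclass
class StepPair:
    """A step-level preference pair built from two sibling nodes."""
    prefix: tuple[str, ...]                     # steps shared by both candidates
    chosen: str                                 # sibling with the higher value
    rejected: str                               # sibling with the lower value
    gap: float                                  # value difference between them


def build_sibling_pairs(root: Node, min_gap: float = 0.0) -> list[StepPair]:
    """Collect preference pairs from children that share the same parent."""
    pairs = []
    stack = [(root, (root.step,))]
    while stack:
        node, prefix = stack.pop()
        for a, b in combinations(node.children, 2):
            hi, lo = (a, b) if a.value >= b.value else (b, a)
            gap = hi.value - lo.value
            if gap > min_gap:                   # skip ties and near-ties
                pairs.append(StepPair(prefix, hi.step, lo.step, gap))
        for child in node.children:
            stack.append((child, prefix + (child.step,)))
    return pairs


def curriculum_order(pairs, priority):
    """Return pairs sorted by a caller-supplied priority, highest first.

    Calling this once per epoch with an updated priority function is one
    way to re-sequence the training data; the actual scoring rule is not
    given in the abstract, so `priority` is a placeholder.
    """
    return sorted(pairs, key=priority, reverse=True)
```

In a training loop, `curriculum_order(pairs, priority=lambda p: p.gap)` could be called at the start of each epoch before the pairs are passed to a preference-learning objective such as DPO. A priority that depends on the current policy would make the ordering change from epoch to epoch.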