Scaling Laws for Upcycling Mixture-of-Experts Language Models

Scaling law
DOI: 10.48550/arxiv.2502.03009 Publication Date: 2025-02-05
ABSTRACT
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
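
The abstract does not state the functional form of the fitted law. As a purely illustrative sketch, the Python snippet below fits a joint power law with a multiplicative cross term between the dense pretraining token count D1 and the upcycling token count D2 to synthetic loss measurements. The functional form, parameter names, and data are assumptions for exposition only, not the paper's fitted law or results.

import numpy as np
from scipy.optimize import curve_fit

# Illustrative joint scaling-law form (an assumption for exposition, not the
# paper's fitted law): loss depends on the dense pretraining tokens D1 and the
# upcycling tokens D2 through separate power laws plus a cross term that
# couples the two dataset sizes.
def upcycling_loss(X, E, A, B, alpha, beta, gamma):
    D1, D2 = X
    return E + A * D1 ** (-alpha) + B * D2 ** (-beta) * D1 ** gamma

# Synthetic measurements generated from the same form with small noise,
# standing in for validation losses from a sweep over (D1, D2), in billions
# of tokens.
rng = np.random.default_rng(0)
D1 = np.repeat([10.0, 100.0, 1000.0], 4)   # dense pretraining tokens (B)
D2 = np.tile([1.0, 10.0, 50.0, 200.0], 3)  # upcycling tokens (B)
loss = upcycling_loss((D1, D2), 1.8, 2.0, 1.5, 0.30, 0.25, 0.05)
loss += rng.normal(scale=0.01, size=loss.shape)

# Fit the six constants by nonlinear least squares.
popt, _ = curve_fit(upcycling_loss, (D1, D2), loss,
                    p0=[2.0, 1.0, 1.0, 0.3, 0.3, 0.1], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta", "gamma"], popt)))

A fit of this kind can be compared against a from-scratch scaling law at equal total token budgets to decide which strategy is preferable, which is the sort of guidance the paper derives from its own fitted laws.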