Scaling Laws for Upcycling Mixture-of-Experts Language Models

Scaling law
DOI: 10.48550/arxiv.2502.03009 Publication Date: 2025-02-05
ABSTRACT
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
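
The abstract does not state the functional form of the fitted law. As a purely illustrative sketch, the Python snippet below fits a joint power law with a multiplicative cross term between the dense pretraining token count D1 and the upcycling token count D2 to synthetic loss measurements. The functional form, parameter names, and data are assumptions for exposition only, not the paper's fitted law or results.

import numpy as np
from scipy.optimize import curve_fit

# Illustrative joint scaling-law form (an assumption for exposition, not the
# paper's fitted law): loss depends on the dense pretraining tokens D1 and the
# upcycling tokens D2 through separate power laws plus a cross term that
# couples the two dataset sizes.
def upcycling_loss(X, E, A, B, alpha, beta, gamma):
    D1, D2 = X
    return E + A * D1 ** (-alpha) + B * D2 ** (-beta) * D1 ** gamma

# Synthetic measurements generated from the same form with small noise,
# standing in for validation losses from a sweep over (D1, D2), in billions
# of tokens.
rng = np.random.default_rng(0)
D1 = np.repeat([10.0, 100.0, 1000.0], 4)   # dense pretraining tokens (B)
D2 = np.tile([1.0, 10.0, 50.0, 200.0], 3)  # upcycling tokens (B)
loss = upcycling_loss((D1, D2), 1.8, 2.0, 1.5, 0.30, 0.25, 0.05)
loss += rng.normal(scale=0.01, size=loss.shape)

# Fit the six constants by nonlinear least squares.
popt, _ = curve_fit(upcycling_loss, (D1, D2), loss,
                    p0=[2.0, 1.0, 1.0, 0.3, 0.3, 0.1], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta", "gamma"], popt)))

A fit of this kind can be compared against a from-scratch scaling law at equal total token budgets to decide which strategy is preferable, which is the sort of guidance the paper derives from its own fitted laws.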