Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping

FOS: Computer and information sciences; Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2404.19429 Publication Date: 2024-04-30
ABSTRACT
The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it faces the challenge of extended all-to-all communication latency during the training process. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. Yet, these methods frequently fall short of achieving sufficient overlap, consequently restricting the potential for performance enhancements. In our study, we extend the scope of this overlap to the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, a system that uses compiler-based optimization to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3x compared with state-of-the-art solutions.
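To make the overlap idea concrete, below is a minimal PyTorch sketch (not Lancet's actual compiler-based implementation) of partitioning tokens into chunks and pipelining asynchronous all-to-all dispatch against other computation. The names `non_moe_block`, `expert_ffn`, and `CHUNKS` are illustrative assumptions, not identifiers from the paper; Lancet performs this kind of scheduling automatically over the whole training graph, including weight-gradient computation in the backward pass.

```python
# Minimal sketch, assuming an already-initialized process group
# (dist.init_process_group) and chunk sizes divisible by the world size.
import torch
import torch.distributed as dist

CHUNKS = 4  # number of pipeline partitions (assumed tunable)

def pipelined_moe_layer(tokens, non_moe_block, expert_ffn, group=None):
    """Split `tokens` along the batch dim, launch each chunk's all-to-all
    asynchronously, and run non-MoE computation while communication is
    in flight. `non_moe_block` and `expert_ffn` are hypothetical callables."""
    chunks = list(tokens.chunk(CHUNKS, dim=0))
    handles, recv_bufs = [], []

    # Phase 1: kick off an async all-to-all for every chunk. NCCL runs these
    # on its own stream, so the default stream stays free for computation.
    for c in chunks:
        recv = torch.empty_like(c)
        work = dist.all_to_all_single(recv, c.contiguous(),
                                      group=group, async_op=True)
        handles.append(work)
        recv_bufs.append(recv)

    # Phase 2: overlap -- execute non-MoE computation on data that is
    # already local while the all-to-all transfers complete.
    overlapped = [non_moe_block(c) for c in chunks]

    # Phase 3: as each all-to-all finishes, run the expert FFN on its result.
    expert_out = []
    for work, recv in zip(handles, recv_bufs):
        work.wait()
        expert_out.append(expert_ffn(recv))

    return torch.cat(expert_out, dim=0), torch.cat(overlapped, dim=0)
```

The design choice illustrated here is the one named in the abstract: by partitioning the token batch, communication for one chunk can proceed while the GPU computes on another, rather than serializing a single monolithic all-to-all before any further computation.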