Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping

FOS: Computer and information sciences; Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2404.19429 Publication Date: 2024-04-30
ABSTRACT
The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it faces the challenge of extended all-to-all communication latency during the training process. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. Yet, these methods frequently fall short of achieving sufficient overlap, consequently restricting the potential for performance enhancements. In our study, we extend the scope of this overlap to the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, a system that uses compiler-based optimization to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3x compared with state-of-the-art solutions.
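To make the overlap idea concrete, below is a minimal PyTorch sketch (not Lancet's actual compiler-based implementation) of partitioning tokens into chunks and pipelining asynchronous all-to-all dispatch against other computation. The names `non_moe_block`, `expert_ffn`, and `CHUNKS` are illustrative assumptions, not identifiers from the paper; Lancet performs this kind of scheduling automatically over the whole training graph, including weight-gradient computation in the backward pass.

```python
# Minimal sketch, assuming an already-initialized process group
# (dist.init_process_group) and chunk sizes divisible by the world size.
import torch
import torch.distributed as dist

CHUNKS = 4  # number of pipeline partitions (assumed tunable)

def pipelined_moe_layer(tokens, non_moe_block, expert_ffn, group=None):
    """Split `tokens` along the batch dim, launch each chunk's all-to-all
    asynchronously, and run non-MoE computation while communication is
    in flight. `non_moe_block` and `expert_ffn` are hypothetical callables."""
    chunks = list(tokens.chunk(CHUNKS, dim=0))
    handles, recv_bufs = [], []

    # Phase 1: kick off an async all-to-all for every chunk. NCCL runs these
    # on its own stream, so the default stream stays free for computation.
    for c in chunks:
        recv = torch.empty_like(c)
        work = dist.all_to_all_single(recv, c.contiguous(),
                                      group=group, async_op=True)
        handles.append(work)
        recv_bufs.append(recv)

    # Phase 2: overlap -- execute non-MoE computation on data that is
    # already local while the all-to-all transfers complete.
    overlapped = [non_moe_block(c) for c in chunks]

    # Phase 3: as each all-to-all finishes, run the expert FFN on its result.
    expert_out = []
    for work, recv in zip(handles, recv_bufs):
        work.wait()
        expert_out.append(expert_ffn(recv))

    return torch.cat(expert_out, dim=0), torch.cat(overlapped, dim=0)
```

The design choice illustrated here is the one named in the abstract: by partitioning the token batch, communication for one chunk can proceed while the GPU computes on another, rather than serializing a single monolithic all-to-all before any further computation.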