Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation

DOI: 10.48550/arxiv.2411.15419 Publication Date: 2024-11-22
ABSTRACT
Mixture-of-Experts (MoE) is an emerging technique for scaling large models with sparse activation. MoE models are typically trained in a distributed manner under an expert parallelism scheme, where the experts of each layer are placed across multiple GPUs. However, the default scheme suffers from a heavy network burden due to the all-to-all exchange of intermediate data among GPUs before and after each expert run. Some existing works have proposed to reduce these exchanges by transferring expert loads instead, which, however, decreases the level of execution parallelism and makes computation inefficient. These weaknesses motivate us to explore whether it is possible to reduce inter-GPU traffic while maintaining a high degree of parallelism. This paper gives a positive response by presenting Luffy, a communication-efficient MoE training system with two new techniques. First, Luffy migrates sequences to keep token pulling paths within a GPU and avoid copying tokens over the network. Second, we propose token condensation, which identifies similar tokens and then eliminates redundant transmissions. We implement Luffy based on PyTorch and evaluate its performance on a testbed of 16 V100 GPUs, where it achieves a speedup of up to 2.73x compared to state-of-the-art systems.
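To make the second technique more concrete, the sketch below shows one plausible form of token condensation applied before the all-to-all dispatch of expert parallelism: tokens routed to the same expert are greedily grouped by cosine similarity, only one representative per group is transmitted, and the expert's output is replicated back to the group members afterward. The similarity threshold, the greedy grouping rule, and the helper names (condense_tokens, restore_tokens) are illustrative assumptions, not Luffy's actual algorithm.

```python
# Hedged sketch of token condensation before all-to-all expert dispatch.
# Assumption: near-duplicate tokens (cosine similarity >= threshold) can share
# one representative; the paper's real condensation criterion may differ.
import torch


def condense_tokens(tokens: torch.Tensor, threshold: float = 0.95):
    """Greedily group near-duplicate tokens routed to one expert.

    tokens: (num_tokens, hidden_dim) activations.
    Returns (representatives, assignment) so expert outputs can later be
    scattered back to every original token slot.
    """
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    sim = normed @ normed.T                          # pairwise cosine similarity
    assigned = torch.full((tokens.size(0),), -1, dtype=torch.long)
    reps = []
    for i in range(tokens.size(0)):
        if assigned[i] >= 0:                          # already absorbed by a group
            continue
        rep_id = len(reps)
        reps.append(i)
        # Absorb all not-yet-assigned tokens similar enough to token i
        # (this includes token i itself, since sim[i, i] == 1.0).
        members = (sim[i] >= threshold) & (assigned < 0)
        assigned[members] = rep_id
    return tokens[reps], assigned                     # (num_reps, hidden), (num_tokens,)


def restore_tokens(expert_out: torch.Tensor, assigned: torch.Tensor):
    """Replicate each representative's expert output to its group members."""
    return expert_out[assigned]


# Usage: only the representatives would travel through the all-to-all exchange.
x = torch.randn(1024, 768)
reps, assigned = condense_tokens(x)
y_reps = reps * 2.0                                   # stand-in for the expert MLP
y = restore_tokens(y_reps, assigned)                  # (1024, 768) restored outputs
```

In this sketch only the representatives cross the network; when many routed tokens are near-duplicates, the transmitted volume shrinks accordingly, which is the inter-GPU traffic reduction the abstract targets.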