Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
DOI: 10.48550/arxiv.2406.04594
Publication Date: 2024-06-06
AUTHORS (25)
ABSTRACT
The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to the following two main issues. Firstly, hardware failures are inevitable, leading to interruptions in the training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Secondly, since the GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestion can greatly increase the waiting time of the GPUs. To address these challenges, this paper introduces a communication-driven solution, namely C4. The key insights of C4 are two-fold. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies certainly indicate some form of malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding the resource wastage caused by delays in anomaly detection. Second, the predictable communication pattern of model training, consisting of only a few large flows, allows C4 to efficiently execute traffic planning, substantially reducing network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.
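To make the two insights more concrete, here is a minimal Python sketch, not the paper's actual implementation: it assumes per-rank collective-communication timings and a list of candidate network paths are available, and the function names, the `tolerance` threshold, and the greedy least-loaded-path assignment are illustrative choices rather than C4's real mechanisms.

```python
# Hypothetical sketch of C4's two key ideas (not the paper's implementation):
# 1) flag anomalous workers from the periodic, homogeneous timing of collective ops;
# 2) spread the few large parameter-sync flows across paths to reduce congestion.

from statistics import median
from typing import Dict, List


def flag_slow_ranks(allreduce_ms: Dict[int, float], tolerance: float = 1.5) -> List[int]:
    """Return ranks whose collective-communication time deviates from the group.

    Because every rank performs the same collective each iteration, healthy
    ranks should report nearly identical durations; a clear outlier points to
    a faulty component that should be isolated before the task is restarted.
    """
    typical = median(allreduce_ms.values())
    return [rank for rank, ms in allreduce_ms.items() if ms > tolerance * typical]


def plan_flows(flow_sizes: Dict[str, float], paths: List[str]) -> Dict[str, str]:
    """Greedily assign each large flow to the currently least-loaded path.

    Training traffic is predictable (a few large, repeating flows), so a simple
    static plan can avoid the congestion that hash-based placement may cause.
    """
    load = {p: 0.0 for p in paths}
    plan: Dict[str, str] = {}
    # Place the biggest flows first so they land on the emptiest paths.
    for flow, size in sorted(flow_sizes.items(), key=lambda kv: -kv[1]):
        best = min(load, key=load.get)
        plan[flow] = best
        load[best] += size
    return plan


if __name__ == "__main__":
    # Rank 3 is ~3x slower than its peers: likely a faulty GPU, NIC, or link.
    print(flag_slow_ranks({0: 12.1, 1: 11.9, 2: 12.3, 3: 36.5}))  # -> [3]
    # Two uplinks, three recurring flows: the two big flows get separate paths.
    print(plan_flows({"dp_sync": 8.0, "tp_sync": 6.0, "pp_act": 1.0},
                     ["uplink_a", "uplink_b"]))
```

The sketch relies only on the properties the abstract states: timing anomalies in otherwise homogeneous collectives signal faults, and the small number of large, predictable flows makes static traffic planning tractable.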