Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
DOI:
10.48550/arxiv.1807.11205
Publication Date:
2018-07-30
AUTHORS (14)
ABSTRACT
Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve the system scalability by reducing the communication-to-computation ratio, it may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch sizes (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50 respectively, compared with NCCL-based training on a cluster with 1024 Tesla P40 GPUs. On training ResNet-50 for 90 epochs, the state-of-the-art GPU-based system with 1024 Tesla P100 GPUs spent 15 minutes and achieved 74.9% top-1 test accuracy, and another KNL-based system with 2048 Intel KNLs spent 20 minutes and achieved 75.4% accuracy. Our training system achieves 75.8% top-1 test accuracy in only 6.6 minutes. When training AlexNet for 95 epochs, our system achieves 58.7% top-1 test accuracy within 4 minutes, which also outperforms all other existing systems.
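
The abstract only summarizes the mixed-precision method; the paper's implementation is not reproduced here. As a rough, non-authoritative sketch of the standard recipe it alludes to (FP16 compute with FP32 master weights and loss scaling), the PyTorch snippet below shows one way such a training step can be organized. The function names, the static loss_scale value, and the use of PyTorch are illustrative assumptions, not the authors' code.

    # Illustrative sketch only, not the paper's implementation.
    import torch
    import torch.nn as nn

    def make_mixed_precision_state(model):
        # Cast the model to FP16 and keep an FP32 "master" copy of each parameter.
        model = model.half()
        master_params = [p.detach().clone().float().requires_grad_(True)
                         for p in model.parameters()]
        return model, master_params

    def mixed_precision_step(model, master_params, optimizer, inputs, targets,
                             loss_scale=1024.0):
        criterion = nn.CrossEntropyLoss()
        outputs = model(inputs.half())              # FP16 forward pass
        loss = criterion(outputs.float(), targets)  # loss computed in FP32

        model.zero_grad()
        (loss * loss_scale).backward()              # scale to avoid FP16 gradient underflow

        # Copy FP16 gradients onto the FP32 master weights and undo the scaling.
        for master, p in zip(master_params, model.parameters()):
            if p.grad is not None:
                master.grad = p.grad.detach().float() / loss_scale

        optimizer.step()                            # update the FP32 master weights

        with torch.no_grad():                       # refresh FP16 weights from the masters
            for master, p in zip(master_params, model.parameters()):
                p.copy_(master.half())
        return loss.item()

In this sketch the optimizer is constructed over the FP32 master parameters (for example, torch.optim.SGD(master_params, lr=0.1, momentum=0.9)); production systems typically replace the static loss scale with dynamic loss scaling and, as the abstract indicates, combine it with large-batch optimization and efficient all-reduce communication to scale across the cluster.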