ZeRO-Offload: Democratizing Billion-Scale Model Training

DOI: 10.48550/arxiv.2101.06840 Publication Date: 2021-01-01
ABSTRACT
Large-scale model training has been a playing ground for a limited few, requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to the CPU. To preserve compute efficiency, it is designed to minimize data movement to/from the GPU and reduce CPU compute time while maximizing memory savings on the GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single NVIDIA V100 GPU for a 10B-parameter model, compared to 30 TFlops using PyTorch alone for a 1.4B-parameter model, the largest that can be trained without running out of memory. ZeRO-Offload is also designed to scale on multiple GPUs when available, offering near-linear speedup on up to 128 GPUs. Additionally, it can work together with model parallelism to train models with over 70 billion parameters on a single DGX-2 box, a 4.5x increase in model size compared to using model parallelism alone. By combining compute and memory efficiency with ease of use, ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU.
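The sketch below illustrates the offload pattern the abstract describes, not the DeepSpeed implementation itself: fp16 forward and backward passes stay on the GPU, while the fp32 master parameters, Adam optimizer states, and the parameter update are kept on the CPU, with gradients and updated weights shuttled across. Layer sizes, the single-linear-layer model, and the absence of loss scaling are simplifying assumptions for illustration only.

import torch

# Fall back to fp32 on machines without a GPU (the offload idea still applies).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Compute-heavy fp16 model lives on the GPU.
model = torch.nn.Linear(1024, 1024).to(device=device, dtype=dtype)

# fp32 master copies of the parameters plus the Adam states (m, v) stay in CPU memory.
cpu_master = [p.detach().float().cpu() for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_master, lr=1e-3)

def train_step(x, y):
    # 1) GPU: fp16 forward and backward (loss scaling omitted for brevity).
    loss = torch.nn.functional.mse_loss(model(x), y)
    model.zero_grad(set_to_none=True)
    loss.backward()

    # 2) GPU -> CPU: copy gradients onto the fp32 master parameters.
    for p, m in zip(model.parameters(), cpu_master):
        m.grad = p.grad.detach().float().cpu()

    # 3) CPU: optimizer step runs entirely on the CPU copies,
    #    so neither the fp32 weights nor the Adam states consume GPU memory.
    optimizer.step()

    # 4) CPU -> GPU: write the updated fp32 parameters back into the fp16 model.
    with torch.no_grad():
        for p, m in zip(model.parameters(), cpu_master):
            p.copy_(m.to(device=device, dtype=p.dtype))
    return loss.item()

x = torch.randn(32, 1024, device=device, dtype=dtype)
y = torch.randn(32, 1024, device=device, dtype=dtype)
print(train_step(x, y))

In the released DeepSpeed library, this behavior is enabled through the ZeRO configuration rather than hand-written transfers, and the system additionally overlaps CPU-GPU communication with computation to keep the GPU busy.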