Thorough Characterization and Analysis of Large Transformer Model Training At-Scale

Performance metric
DOI: 10.1145/3673660.3655087
Publication Date: 2024-06-14
ABSTRACT
Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale training depends heavily on the network bandwidth, since the combination of model sharding and multiple parallelism strategies incurs various communication costs. However, prior characterizations on high-bandwidth DGX machines that use TFLOPS as the metric may not reflect the performance of a system with lower bandwidth. Furthermore, different parallelism strategies reveal significantly distinct training profiles under different network bandwidths at scale and, thus, require a thorough study. In this paper, we provide a bottom-up breakdown of training time into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end scaling. Our evaluation includes an in-depth exploration of scaling up to 512 GPUs under limited bandwidth and examines three sharding strategies among six model sizes. We also evaluate combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling and shaping future supercomputing system development and design.
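To make the compute/communication breakdown concrete, the sketch below gives a rough, back-of-the-envelope model of per-iteration time under data parallelism with parameter sharding. It uses the standard ring-collective cost estimates; the model size, bandwidths, compute time, and overlap fraction are illustrative assumptions and are not taken from the paper.

# Hypothetical analytical sketch of the compute/communication breakdown
# described in the abstract. Collective costs follow the usual ring-algorithm
# estimates; all numeric inputs below are assumed examples, not paper data.

def allreduce_bytes(param_bytes, n_gpus):
    # Ring all-reduce moves ~2*(N-1)/N of the gradient bytes per GPU.
    return 2.0 * (n_gpus - 1) / n_gpus * param_bytes

def allgather_bytes(param_bytes, n_gpus):
    # All-gathering fully sharded parameters moves ~(N-1)/N of the bytes per GPU.
    return (n_gpus - 1) / n_gpus * param_bytes

def step_time(compute_s, comm_bytes, bw_bytes_per_s, overlap=0.0):
    # End-to-end iteration time; `overlap` is the fraction of communication
    # hidden behind computation (0 = fully exposed, 1 = fully hidden).
    comm_s = comm_bytes / bw_bytes_per_s
    return compute_s + (1.0 - overlap) * comm_s

if __name__ == "__main__":
    param_bytes = 13e9 * 2  # e.g., 13B parameters in fp16 ~= 26 GB (assumed)
    # (GPU count, per-GPU network bandwidth in bytes/s): assumed scenarios.
    for n_gpus, bw in [(64, 100e9), (512, 25e9)]:
        comm = allreduce_bytes(param_bytes, n_gpus)   # data-parallel gradient sync
        comm += allgather_bytes(param_bytes, n_gpus)  # extra all-gather if parameters are sharded
        t = step_time(compute_s=1.0, comm_bytes=comm, bw_bytes_per_s=bw, overlap=0.5)
        print(f"{n_gpus} GPUs @ {bw/1e9:.0f} GB/s -> {t:.2f} s per step")

Even this crude model shows why TFLOPS measured on a high-bandwidth system can overstate throughput at scale on a lower-bandwidth one: as bandwidth drops, the exposed communication term grows while the compute term stays fixed.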