MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
DOI: 10.48550/arxiv.2103.15195
Publication Date: 2021-01-01
ABSTRACT
8 pages

Large-scale distributed training is increasingly communication-bound. Many gradient compression algorithms have been proposed to reduce communication overhead and improve scalability. However, it has been observed that in some cases gradient compression can even harm the performance of distributed training. In this paper, we propose MergeComp, a compression scheduler that optimizes the scalability of communication-efficient distributed training. It automatically schedules compression operations to optimize the performance of compression algorithms without knowledge of model architectures or system parameters. We have applied MergeComp to nine popular compression algorithms. Our evaluations show that MergeComp improves the performance of compression algorithms by up to 3.83x without losing accuracy, and it achieves a scaling factor of up to 99% for distributed training over high-speed networks.
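The abstract does not spell out the scheduling mechanism, but a common way to amortize per-tensor compression overhead is to merge many small per-layer gradients into one buffer and compress it once instead of once per layer. The sketch below illustrates that general idea with a simple top-k sparsifier; all function names are hypothetical illustrations, not the MergeComp implementation.

# A minimal, hypothetical sketch of merging per-layer gradients before
# compressing (here, top-k sparsification), so compression overhead is
# paid once per merged buffer rather than once per layer. This is an
# illustration of the general technique, not MergeComp's actual code.
import torch

def topk_compress(flat: torch.Tensor, ratio: float = 0.01):
    """Keep the largest-magnitude `ratio` fraction of entries."""
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices

def topk_decompress(values, indices, numel):
    """Scatter the kept values back into a dense zero buffer."""
    out = torch.zeros(numel, dtype=values.dtype, device=values.device)
    out[indices] = values
    return out

def merged_compress(grads, ratio=0.01):
    """Flatten and concatenate per-layer gradients, compress the merged
    buffer with a single call, then restore per-layer shapes."""
    shapes = [g.shape for g in grads]
    sizes = [g.numel() for g in grads]
    flat = torch.cat([g.reshape(-1) for g in grads])
    values, indices = topk_compress(flat, ratio)  # one compression call
    # ... in distributed training, values/indices would be exchanged
    # among workers (e.g., all-gather) before decompression ...
    restored = topk_decompress(values, indices, flat.numel())
    return [chunk.reshape(s) for chunk, s in zip(restored.split(sizes), shapes)]

# Toy usage: gradients from two "layers".
if __name__ == "__main__":
    grads = [torch.randn(4, 3), torch.randn(10)]
    out = merged_compress(grads, ratio=0.2)
    print([o.shape for o in out])

A scheduler in this setting would choose how tensors are grouped into merged buffers (e.g., buffer size and merge boundaries), which is the kind of decision the abstract says MergeComp automates without model- or system-specific knowledge.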