1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

DOI: 10.21437/interspeech.2014-274
ABSTRACT
We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively, down to one bit per value, if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors such as recent GPUs. We implement data-parallel, deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain. For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k frames per second (kfps) with a minibatch size of 2880 samples, and 51 kfps with 16k samples, on a server with 8 K20X GPUs. This corresponds to speed-ups over a single GPU of 3.6 and 6.3, respectively. Seven training passes over 309h of data complete in under 7h. Training a 160M-parameter model on 3300h of data completes in under 16h on 20 dual-GPU servers, a 10-fold speed-up, albeit at a small accuracy loss.
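The key trick summarized above, carrying the quantization error forward so that no gradient information is permanently discarded, can be sketched in a few lines. The NumPy snippet below is a minimal illustration rather than the paper's implementation; the threshold-at-zero encoding, the per-column reconstruction means, and the function name one_bit_sgd_quantize are assumptions made for this example.

```python
# Minimal sketch of 1-bit gradient quantization with error feedback.
# Assumptions (not taken from the paper's code): threshold at zero,
# per-column reconstruction values, single-process NumPy demo.
import numpy as np

def one_bit_sgd_quantize(grad, error):
    """Quantize a gradient matrix to 1 bit per value; return the bit pattern,
    per-column reconstruction values, and the residual to carry forward."""
    corrected = grad + error                  # error feedback: add previous residual
    bits = corrected >= 0.0                   # the single bit kept per value

    # Per-column reconstruction values: mean of the positive and of the negative
    # entries, one plausible choice that keeps the quantization error small.
    pos_count = np.maximum(bits.sum(axis=0), 1)
    neg_count = np.maximum((~bits).sum(axis=0), 1)
    pos_mean = np.where(bits, corrected, 0.0).sum(axis=0) / pos_count
    neg_mean = np.where(~bits, corrected, 0.0).sum(axis=0) / neg_count

    reconstructed = np.where(bits, pos_mean, neg_mean)
    new_error = corrected - reconstructed     # carried into the next minibatch
    return bits, (pos_mean, neg_mean), new_error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    error = np.zeros((4, 3))                  # start with zero residual
    for _ in range(5):                        # stand-in for successive minibatches
        grad = rng.standard_normal((4, 3))    # placeholder for a real DNN gradient
        bits, scales, error = one_bit_sgd_quantize(grad, error)
    print("mean residual magnitude after 5 steps:", np.abs(error).mean())
```

In a data-parallel setting, each worker would exchange only the bit pattern and the two per-column means with its peers, reconstruct the peers' gradients, aggregate, and apply the update, while the residual stays local to the worker.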