Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks

Pruning
DOI: 10.48550/arxiv.2106.03795
Publication Date: 2021-01-01
ABSTRACT
Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying cause that makes the networks amenable to such simple compression schemes is still missing. In this study, we address this fundamental question and reveal that the dynamics of the training algorithm has a key role in obtaining such compressible networks. Focusing our attention on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently; (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution. In the case where these two phenomena occur simultaneously, we prove that the resulting networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which indeed confirm that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy tails, which, in combination with overparametrization, result in compressibility.
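As an illustrative sketch only (not the paper's experimental setup), the snippet below contrasts magnitude pruning of synthetic heavy-tailed weights, drawn here from an $\alpha$-stable distribution as a stand-in for the heavy-tailed SGD stationary distribution, with Gaussian weights. The 10% keep ratio, the choice of $\alpha = 1.5$, and the helper names are assumptions for illustration; the relative $\ell_2$ pruning error is typically noticeably smaller in the heavy-tailed case, mirroring the compressibility claim in the abstract.

```python
import numpy as np
from scipy.stats import levy_stable


def magnitude_prune(w, keep_ratio):
    """Zero out all but the largest-magnitude entries of w (illustrative helper)."""
    k = max(1, int(keep_ratio * w.size))
    idx = np.argsort(np.abs(w))[-k:]   # indices of the k largest magnitudes
    pruned = np.zeros_like(w)
    pruned[idx] = w[idx]
    return pruned


def relative_error(w, keep_ratio, p=2):
    """Relative l_p error incurred by magnitude pruning."""
    pruned = magnitude_prune(w, keep_ratio)
    return np.linalg.norm(w - pruned, p) / np.linalg.norm(w, p)


rng = np.random.default_rng(0)
n, keep = 100_000, 0.10  # dimension and keep ratio (assumed values for illustration)

# Light-tailed baseline vs. heavy-tailed (alpha-stable) weights.
w_gauss = rng.standard_normal(n)
w_heavy = levy_stable.rvs(alpha=1.5, beta=0.0, size=n, random_state=0)

print(f"Gaussian weights    : relative l2 pruning error = {relative_error(w_gauss, keep):.3f}")
print(f"Heavy-tailed weights: relative l2 pruning error = {relative_error(w_heavy, keep):.3f}")
```

The design choice here is simply that, for heavy-tailed weights, most of the $\ell_p$ norm is carried by a small fraction of large entries, so discarding the rest changes the vector little; this is the intuition behind the '$\ell_p$-compressibility' result stated above.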