Accelerating Sparse DNNs Based on Tiled GEMM
DOI:
10.48550/arxiv.2402.10876
Publication Date:
2024-02-16
ABSTRACT
Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models often distribute the remaining weights randomly to maintain accuracy, leading to irregular computations. Consequently, unstructured sparse models cannot achieve meaningful speedups on commodity hardware built for dense matrix computations. Accelerators are usually modified or designed with structured sparsity-optimized architectures for exploiting sparsity. For example, the NVIDIA Ampere architecture introduces a sparse tensor core, which adopts the 2:4 sparsity pattern. We propose a pruning method that builds upon the insight that matrix multiplication generally breaks large matrices into multiple smaller tiles for parallel execution. We present the tile-wise sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows irregular pruning at the global scale for high accuracy. In addition, tile-wise sparsity is implemented at the global memory level, while the 2:4 sparsity executes at the register level inside the tensor core. We combine these two patterns into a tile-vector-wise (TVW) sparsity pattern to explore more fine-grained sparsity and further accelerate sparse DNN models. We evaluate TVW on the GPU, achieving average speedups of $1.85\times$, $2.75\times$, and $22.18\times$ over the dense model, block sparsity, and unstructured sparsity, respectively.
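To make the combination of tile-level and 2:4 sparsity concrete, below is a minimal NumPy sketch of how a weight matrix could be pruned into a tile-vector-wise layout: each tile keeps only its strongest columns (a regular pattern inside the tile, but irregular across tiles), and the surviving weights are then thinned to a 2:4 pattern. The function names, tile size, and keep ratio are illustrative assumptions, not the paper's exact algorithm or GPU kernel implementation.

```python
import numpy as np

def prune_2_to_4(block):
    """Zero the 2 smallest-magnitude weights in each group of 4 along a row (2:4 pattern)."""
    flat = block.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]   # indices of the 2 weakest weights per group
    np.put_along_axis(flat, drop, 0.0, axis=1)
    return flat.reshape(block.shape)

def prune_tile_vector_wise(W, tile=(16, 16), col_keep=0.5):
    """Illustrative tile-vector-wise (TVW) pruning sketch (not the paper's exact algorithm):
    - inside each tile, keep only the columns with the largest L1 norm
      (a regular pattern per tile; the kept columns may differ from tile to tile),
    - then enforce the fine-grained 2:4 pattern on the surviving weights.
    """
    H, Wd = W.shape
    th, tw = tile
    keep = int(tw * col_keep)
    assert H % th == 0 and Wd % tw == 0 and keep % 4 == 0
    out = np.zeros_like(W)
    for i in range(0, H, th):
        for j in range(0, Wd, tw):
            t = W[i:i + th, j:j + tw]
            col_norm = np.abs(t).sum(axis=0)
            cols = np.sort(np.argsort(col_norm)[::-1][:keep])  # per-tile column selection
            dense = prune_2_to_4(t[:, cols])                   # 2:4 inside the compacted tile
            block = np.zeros_like(t)
            block[:, cols] = dense
            out[i:i + th, j:j + tw] = block
    return out

W = np.random.randn(64, 64).astype(np.float32)
W_tvw = prune_tile_vector_wise(W)
print("density:", np.count_nonzero(W_tvw) / W_tvw.size)  # 0.25: half the columns kept, then 2 of every 4
```

In this sketch the per-tile column selection stands in for the coarse, memory-level tile-wise pattern, and the 2:4 step stands in for the register-level pattern consumed by the sparse tensor core.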