- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Generative Adversarial Networks and Image Synthesis
- Stochastic Gradient Optimization Techniques
- Ferroelectric and Negative Capacitance Devices
- Recommender Systems and Techniques
- Adversarial Robustness in Machine Learning
- Machine Learning and Data Classification
- Parallel Computing and Optimization Techniques
- Scientific Computing and Data Management
- Network Packet Processing and Optimization
- Robotic Path Planning Algorithms
- Energy Load and Power Forecasting
- Distributed and Parallel Computing Systems
- Neural Networks and Applications
- Robotics and Sensor-Based Localization
- Video Analysis and Summarization
- Web Data Mining and Analysis
Meta (United States), 2021-2023
University of Michigan, 2017-2021
Menlo School, 2021
As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the network sparsity caused by weight pruning will actually hurt the overall performance despite large reductions in the required multiply-accumulate operations....
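As a minimal sketch of the tension this abstract describes (illustrative numpy code, not the paper's benchmark setup), unstructured magnitude pruning slashes the nominal multiply-accumulate count, yet a dense kernel still touches every stored zero:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)  # cutoff below which weights are removed
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))
x = rng.standard_normal(512)

pruned, mask = magnitude_prune(w, sparsity=0.9)

# MACs drop roughly 10x on paper ...
print("dense MACs :", w.size)
print("sparse MACs:", int(mask.sum()))

# ... but a dense matvec still multiplies through every stored zero, so on
# hardware without sparsity support the runtime is essentially unchanged.
y_dense  = w @ x
y_pruned = pruned @ x   # same dense matvec cost despite ~90% zeros
```

Whether the sparse model actually runs faster then hinges on how well the irregular nonzero pattern maps to the platform, which is the effect the abstract reports.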
We propose Bit Prudent In-Cache Acceleration of Deep Convolutional Neural Networks - an in-SRAM architecture for accelerating Convolutional Neural Network (CNN) inference by leveraging network redundancy and massive parallelism. The redundancy is exploited in two ways. First, we prune and fine-tune the trained model and develop two distinct methods - coalescing and overlapping - to run inferences efficiently with sparse models. Second, we support models with a reduced bit width via bit-serial computation. Our proposed architecture achieves a 17.7×/3.7× speedup over a server...
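The bit-serial computation mentioned here can be illustrated with a short hedged sketch (a software model only, not the Bit Prudent microarchitecture): an n-bit weight is consumed one bit-plane per pass, so each pass needs only AND and add operations of the kind SRAM bitline logic can supply, and a shift replaces a full multiplier:

```python
import numpy as np

def bit_serial_dot(acts, weights, bits=8):
    """Dot product with bit-serial weights: one weight bit-plane per pass.

    Each pass only ANDs out a bit-plane and accumulates, weighted by the
    bit position. Illustrative model of bit-serial arithmetic in general.
    """
    acc = 0
    for b in range(bits):
        plane = (weights >> b) & 1          # 0/1 bit-plane of every weight
        acc += int((acts * plane).sum()) << b  # add selected activations, shifted
    return acc

rng = np.random.default_rng(1)
acts = rng.integers(0, 128, size=64)
weights = rng.integers(0, 256, size=64)

assert bit_serial_dot(acts, weights) == int(np.dot(acts, weights))
```

A reduced weight bit width means fewer planes, so latency scales down roughly in proportion, which is why bit-serial designs pair naturally with quantized models.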
In this paper, we provide a deep dive into the deployment of inference accelerators at Facebook. Many of our ML workloads have unique characteristics, such as sparse memory accesses, large model sizes, as well as high compute, memory, and network bandwidth requirements. We co-designed a high-performance, energy-efficient inference accelerator platform based on these requirements, and we describe the ecosystem developed and deployed at Facebook: both hardware, through the Open Compute Platform (OCP), and software framework and tooling, through Pytorch/Caffe2/Glow. A...
The density of FPGA on-chip memory has been continuously increasing, with modern FPGAs having thousands of block RAMs (BRAMs) distributed across their reconfigurable fabric. These BRAMs can provide a tremendous amount of bandwidth for the efficient acceleration of data-intensive applications. In this work, we propose enhancing the ubiquitous BRAMs with in-memory compute capabilities. As a result, the BRAMs can act as normal storage units, or their bitlines can be re-purposed as SIMD lanes executing bit-serial arithmetic operations. Our proposed...
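To make the "bitlines as SIMD lanes" idea concrete, here is a hedged software emulation (a toy model, not the proposed hardware): each vector element stands in for one lane, and every loop iteration performs one full-adder step across all lanes simultaneously, which is exactly the bit-serial style of computation such re-purposed bitlines execute:

```python
import numpy as np

def simd_bitserial_add(a, b, bits=9):
    """Add two vectors one bit-plane per 'cycle', all lanes in parallel.

    Emulates bitlines re-purposed as SIMD lanes: each cycle computes a
    per-lane sum bit and carry bit (a full-adder step). Illustrative only.
    """
    carry = np.zeros_like(a)
    result = np.zeros_like(a)
    for i in range(bits):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        result |= (abit ^ bbit ^ carry) << i                  # per-lane sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))       # per-lane carry out
    return result

rng = np.random.default_rng(2)
a = rng.integers(0, 256, size=1024)   # 1024 lanes, one per bitline pair
b = rng.integers(0, 256, size=1024)
assert np.array_equal(simd_bitserial_add(a, b), a + b)
```

The appeal is throughput: a single operation is slow (one bit per cycle), but thousands of BRAM bitlines can run such lanes in parallel across the fabric.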
Deep Neural Networks (DNNs) have become an essential component of various applications. While today's DNNs are mainly restricted to cloud services, network connectivity, energy, and data privacy problems make it important to support efficient DNN computation on low-cost, low-power processors like microcontrollers. However, due to the constrained resources, it is challenging to execute large models on such devices. Using sub-byte low-precision input activations and weights is a typical method to reduce computation. But...
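The storage side of the sub-byte idea is easy to sketch (hedged numpy illustration; the helper names are ours, and this says nothing about the paper's actual compute scheme): packing two 4-bit operands per byte halves the memory footprint, but a byte-oriented ISA still has to unpack before multiplying:

```python
import numpy as np

def pack_int4(vals):
    """Pack pairs of unsigned 4-bit values into single bytes (low nibble first)."""
    vals = vals.astype(np.uint8)
    return (vals[0::2] & 0xF) | ((vals[1::2] & 0xF) << 4)

def unpack_int4(packed):
    """Recover the two 4-bit values stored in each byte."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0xF
    out[1::2] = (packed >> 4) & 0xF
    return out

rng = np.random.default_rng(3)
w = rng.integers(0, 16, size=128, dtype=np.uint8)   # 4-bit weights
x = rng.integers(0, 16, size=128, dtype=np.uint8)   # 4-bit activations

packed_w = pack_int4(w)                             # 64 bytes instead of 128
y = int(np.dot(unpack_int4(packed_w).astype(np.int32), x.astype(np.int32)))
assert y == int(np.dot(w.astype(np.int32), x.astype(np.int32)))
```

That unpack-multiply gap is the kind of overhead the truncated "But..." is presumably about: sub-byte formats save memory for free, but saving cycles takes extra care on commodity microcontrollers.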
Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of machine learning tasks. However, the high accuracy usually comes at the cost of substantial computation and energy consumption, making CNNs difficult to deploy on mobile and embedded devices. In CNNs, compute-intensive convolutional layers are typically followed by a ReLU activation layer, which clamps negative outputs to zeros, resulting in large sparsity. By exploiting such sparsity in CNN models, we propose...
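A minimal numpy sketch of the effect (illustrative, not the paper's method): ReLU zeroes roughly half of zero-mean activations, and the next layer's matrix product can skip every zero activation together with the matching weight row instead of multiplying through the zeros:

```python
import numpy as np

rng = np.random.default_rng(4)
acts = np.maximum(rng.standard_normal((1, 4096)), 0.0)   # ReLU clamps negatives to zero
print(f"activation sparsity after ReLU: {(acts == 0).mean():.0%}")  # ~50% here

# The following layer only needs the rows of w selected by nonzero activations.
w = rng.standard_normal((4096, 1024))
nz = np.flatnonzero(acts[0])
y_sparse = acts[0, nz] @ w[nz, :]          # visits only nonzero activations
assert np.allclose(y_sparse, acts[0] @ w)
```

Unlike weight sparsity, this activation sparsity is input-dependent and only known at runtime, which is why exploiting it usually calls for dedicated architectural support.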
Deep convolutional neural networks (CNNs) are deployed in various applications but demand immense computational requirements. Pruning techniques and Winograd convolution are two typical methods to reduce the CNN computation. However, they cannot be directly combined because the Winograd transformation fills in the sparsity resulting from pruning. Li et al. (2017) propose sparse Winograd convolution, in which weights are pruned in the Winograd domain, but this technique is not very practical because Winograd-domain retraining requires low learning rates and hence...
Large scale deep learning provides a tremendous opportunity to improve the quality of content recommendation systems by employing both wider and deeper models, but this comes at great infrastructural cost and carbon footprint in modern data centers. Pruning is an effective technique that reduces both memory and compute demand for model inference. However, pruning an online model is challenging due to continuous distribution shift (a.k.a. non-stationary data). Although incremental training on the full model is able to adapt to the data,...
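A toy sketch of why the shift matters (entirely illustrative; the stream, model, and schedule here are ours, not the paper's): if a mask is chosen once from magnitudes, continued incremental training under a drifting data distribution leaves the model stuck with whichever weights happened to look important at pruning time:

```python
import numpy as np

rng = np.random.default_rng(5)

def stream_batch(t, n=256, d=32):
    """Toy non-stationary stream: the true regression weights drift over time."""
    w_true = np.sin(np.arange(d) + 0.01 * t)        # slowly shifting target
    X = rng.standard_normal((n, d))
    return X, X @ w_true

d = 32
w = np.zeros(d)
mask = np.ones(d)
for t in range(2000):
    X, y = stream_batch(t)
    grad = X.T @ (X @ (w * mask) - y) / len(y)      # incremental (online) update
    w -= 0.01 * grad
    w *= mask                                        # pruned weights stay zero
    if t == 500:                                     # prune once; under drift this
        mask = (np.abs(w) > np.quantile(np.abs(w), 0.5)).astype(float)
        w *= mask                                    # mask goes stale as w_true moves
print("live weights:", int(mask.sum()), "of", d)
```

Weights that are unimportant at t = 500 can become important later, so a one-shot mask degrades accuracy over time; handling that tension is the problem the abstract sets up.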
Deep learning recommendation systems at scale have provided remarkable gains through increasing model capacity (i.e. wider and deeper neural networks), but it comes at significant training and infrastructure cost. Model pruning is an effective technique to reduce computation overhead for deep neural networks by removing redundant parameters. However, modern recommendation systems are still thirsty for model capacity due to the demand for handling big data. Thus, pruning a model results in smaller capacity and consequently lower accuracy. To reduce cost without sacrificing capacity, we...
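One way to read "without sacrificing capacity" is an alternating grow/prune schedule. The sketch below is a hedged toy version of that general idea (our own simplification, not the paper's recipe): pruned parameter slots are periodically re-opened with fresh values, so the live model stays small while the total capacity explored over training stays large:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1024                                   # full (budgeted) parameter count
w = rng.standard_normal(n) * 0.01
active = np.ones(n, dtype=bool)            # which parameters are currently live

for phase in range(4):
    # --- prune phase: drop the smallest-magnitude live weights ---
    cutoff = np.quantile(np.abs(w[active]), 0.3)
    active &= np.abs(w) > cutoff           # ~30% of live weights pruned
    w[~active] = 0.0
    # --- grow phase: re-open some pruned slots with fresh small weights ---
    revive = (~active) & (rng.random(n) < 0.5)
    w[revive] = rng.standard_normal(int(revive.sum())) * 0.01
    active |= revive
    print(f"phase {phase}: live params = {int(active.sum())}/{n}")
    # (a real schedule would interleave many training steps between phases)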