- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Ferroelectric and Negative Capacitance Devices
- Advanced Data Storage Technologies
- Stochastic Gradient Optimization Techniques
- Caching and Content Delivery
- Recommender Systems and Techniques
- Cryptography and Data Security
- Advanced Graph Neural Networks
- Interconnection Networks and Systems
- Privacy-Preserving Technologies in Data
- Digital Filter Design and Implementation
- Topic Modeling
- Advanced Data Compression Techniques
- Distributed and Parallel Computing Systems
- Tensor decomposition and applications
- IoT and Edge/Fog Computing
- Cloud Computing and Resource Management
- Graph Theory and Algorithms
- Natural Language Processing Techniques
- Wireless Sensor Networks for Data Analysis
- Adversarial Robustness in Machine Learning
- Cryptographic Implementations and Security
- Algorithms and Data Compression
Korea Advanced Institute of Science and Technology
2009-2025
Kootenay Association for Science & Technology
2022-2024
Seoul National University
2021
Korea Institute of Science & Technology Information
2019
Universitat Politècnica de Catalunya
2019
Barcelona Supercomputing Center
2019
Pohang University of Science and Technology
2017-2018
Korea Post
2017-2018
Nvidia (United Kingdom)
2017
Nvidia (United States)
2016
Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and the zero-valued activations that arise from the common ReLU operator...
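As a rough sketch of the sparsity this abstract refers to (illustrative Python, not the SCNN dataflow): a convolution only needs to multiply (activation, weight) pairs where both operands are non-zero, so pruned weights and post-ReLU zeros can be skipped entirely.

```python
# Toy illustration (not SCNN itself): a 1-D convolution that skips
# zero-valued weights and activations, the sparsity SCNN exploits.
import numpy as np

def dense_conv1d(x, w):
    """Reference dense convolution (valid mode)."""
    n, k = len(x), len(w)
    return np.array([sum(x[i + j] * w[j] for j in range(k))
                     for i in range(n - k + 1)])

def sparse_conv1d(x, w):
    """Multiply only (activation, weight) pairs where both are non-zero."""
    n, k = len(x), len(w)
    out = np.zeros(n - k + 1)
    nz_w = [(j, wj) for j, wj in enumerate(w) if wj != 0.0]   # pruned weights
    nz_x = [(i, xi) for i, xi in enumerate(x) if xi != 0.0]   # post-ReLU activations
    for i, xi in nz_x:
        for j, wj in nz_w:
            o = i - j                 # output position this pair contributes to
            if 0 <= o < len(out):
                out[o] += xi * wj
    return out

x = np.maximum(np.random.randn(16), 0)               # ReLU-style sparse input
w = np.random.randn(5) * (np.random.rand(5) > 0.6)   # pruned filter
assert np.allclose(dense_conv1d(x, w), sparse_conv1d(x, w))
```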
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN...
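The idea can be sketched in a few lines: spill feature maps that the forward pass is done with to host memory, and fetch them back when the backward pass reaches that layer. All names below are illustrative; this is not the paper's actual runtime.

```python
# Minimal sketch of memory virtualization for DNN training: offload a
# layer's feature map to host DRAM once the forward pass moves on, then
# prefetch it back right before the backward pass needs it.
class VirtualizedTrainer:
    def __init__(self, num_layers, gpu_capacity):
        self.gpu = {}              # layer -> feature map resident on GPU
        self.host = {}             # layer -> feature map offloaded to CPU DRAM
        self.capacity = gpu_capacity

    def forward(self, layer, feature_map):
        self.gpu[layer] = feature_map
        # Evict feature maps of earlier layers: backward will not need
        # them until much later, so keep GPU memory under capacity.
        while len(self.gpu) > self.capacity:
            victim = min(self.gpu)                     # oldest layer
            self.host[victim] = self.gpu.pop(victim)   # DMA to host

    def backward(self, layer):
        if layer not in self.gpu:                      # prefetch back from host
            self.gpu[layer] = self.host.pop(layer)
        return self.gpu.pop(layer)

trainer = VirtualizedTrainer(num_layers=6, gpu_capacity=2)
for l in range(6):
    trainer.forward(l, f"fmap{l}")
for l in reversed(range(6)):
    trainer.backward(l)
```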
Popular deep learning frameworks require users to fine-tune their memory usage so that the training data of a deep neural network (DNN) fits within GPU physical memory. Prior work tries to address this restriction by virtualizing the memory usage of DNNs, enabling both GPU and CPU memory to be utilized for memory allocations. Despite its merits, virtualization can incur significant performance overheads when the time needed to copy data back and forth from CPU memory is higher than the latency to perform DNN computations. We introduce a high-performance virtualization strategy based on...
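To see why shrinking the offloaded data helps hide copy latency, here is a toy zero-value compression of sparse activations, a software stand-in for the kind of compression a DMA engine could apply (names are hypothetical):

```python
# Toy zero-value compression of ReLU activations: store a bitmask of
# non-zeros plus the packed non-zero values, mimicking in software why
# compressing sparse feature maps shrinks CPU<->GPU copy traffic.
import numpy as np

def compress(act):
    mask = act != 0.0
    return mask, act[mask]                 # bitmask + packed values

def decompress(mask, packed):
    out = np.zeros(mask.shape, dtype=packed.dtype)
    out[mask] = packed
    return out

act = np.maximum(np.random.randn(1 << 16).astype(np.float32), 0)
mask, packed = compress(act)
ratio = act.nbytes / (mask.size / 8 + packed.nbytes)
assert np.array_equal(act, decompress(mask, packed))
print(f"compression ratio ~{ratio:.2f}x")  # ~2x at ~50% sparsity
```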
Recent studies from several hyperscalars point to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and the associated tensor operations. We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations. These DIMMs are populated inside a GPU-centric system interconnect as a remote...
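The tensor primitive at stake is a gather of embedding rows followed by an element-wise reduction. The sketch below (illustrative only, not the paper's design) shows the operation and how interleaving the table lets each memory partition reduce its own slice, so that only partial sums cross the interconnect:

```python
# Embedding-layer core primitive: gather rows of a large table by index,
# then reduce them element-wise. The "rank" split only mimics how
# near-memory cores could each reduce their slice in parallel.
import numpy as np

table = np.random.randn(100_000, 64).astype(np.float32)  # embedding table
ids = np.random.randint(0, 100_000, size=40)              # sparse feature ids

def gather_reduce(table, ids):
    return table[ids].sum(axis=0)          # memory-bound: ~1 FLOP per 4 bytes

def rank_parallel_gather_reduce(table, ids, num_ranks=4):
    partials = []
    for r in range(num_ranks):
        local = ids[ids % num_ranks == r]  # ids mapped to this rank's rows
        partials.append(table[local].sum(axis=0))
    return np.sum(partials, axis=0)        # only partial sums move off-rank

assert np.allclose(gather_reduce(table, ids),
                   rank_parallel_gather_reduce(table, ids), atol=1e-4)
```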
Homomorphic encryption (HE) enables the secure offloading of computations to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes in which noise accumulates as more computations are applied to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited number of operations, or fully HE (FHE), by refreshing the ciphertext. Unfortunately, bootstrapping requires a significant amount of additional computation and memory bandwidth as well. Prior works have...
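The noise budget being described can be demonstrated with a deliberately insecure toy scheme (assumed parameters, no real cryptography): a bit is hidden under even noise, additions accumulate that noise, and once it wraps the modulus the message is lost, which is what bootstrapping exists to prevent.

```python
# Tiny "noisy encryption" toy (no security whatsoever): a bit m is
# hidden as c = m + 2e mod q (q odd). Adding ciphertexts adds plaintexts
# but also adds noise; once the noise term wraps the modulus, the
# parity (the message) is destroyed.
import random

Q = 2 ** 16 + 1                             # odd modulus so wrap-around flips parity

def encrypt(m, noise=1):
    e = random.randint(-noise, noise)
    return (m + 2 * e) % Q

def decrypt(c):
    centered = c if c < Q // 2 else c - Q   # lift to (-q/2, q/2]
    return centered % 2

def add(c1, c2):
    return (c1 + c2) % Q                    # noise of the sum ~ e1 + e2

assert decrypt(encrypt(1)) == 1             # fresh ciphertext: small noise, correct
c = encrypt(1, noise=6000)
for _ in range(200):                        # keep adding encryptions of 0
    c = add(c, encrypt(0, noise=6000))
# By now the accumulated noise has very likely wrapped the modulus, so
# the decrypted bit is unreliable; real FHE bootstraps before this point.
print("decrypted bit:", decrypt(c))
```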
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained accesses, however, are a poor match for emerging applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their memory hierarchies make CPU-specific memory system enhancements ineffective at improving performance...
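A quick back-of-the-envelope way to see the mismatch (illustrative Python, not the paper's methodology): measure how much of each coarse-grained fetch an access stream actually uses.

```python
# Fraction of each 128-byte line that an access pattern actually touches:
# dense streaming uses every word, a sparse gather wastes most of each fetch.
import random

LINE_BYTES, WORD_BYTES = 128, 4
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES

def line_utilization(addresses):
    """Fraction of fetched words that the access stream actually used."""
    touched = {}
    for addr in addresses:
        line = addr // WORDS_PER_LINE
        touched.setdefault(line, set()).add(addr % WORDS_PER_LINE)
    used = sum(len(words) for words in touched.values())
    return used / (len(touched) * WORDS_PER_LINE)

streaming = list(range(1024))                     # dense, regular
gather = random.sample(range(1 << 20), 1024)      # sparse, irregular
print(f"regular  : {line_utilization(streaming):.0%}")  # ~100%
print(f"irregular: {line_utilization(gather):.0%}")     # ~3% (1 word of 32)
```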
To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests. This paper makes a case for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput. We evaluate both the mechanisms that enable NPUs to be preemptible and the policies that utilize them to achieve scheduling objectives. We show that a preemptive NPU...
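The scheduling idea can be sketched as a toy priority queue with checkpoint-based preemption (illustrative names; the paper's mechanisms are hardware-level):

```python
# Toy timeline simulation of a preemptible accelerator: lower priority
# value = more urgent; a task preempted by a later, more urgent submit
# is checkpointed and simply loses the next time slice.
import heapq

class PreemptibleNPU:
    def __init__(self):
        self.ready = []            # heap of (priority, name, remaining_work)

    def submit(self, priority, name, work):
        heapq.heappush(self.ready, (priority, name, work))

    def step(self):
        """Run the most urgent task for one time slice."""
        if not self.ready:
            return None
        prio, name, work = heapq.heappop(self.ready)
        if work > 1:               # checkpoint state, requeue the remainder
            heapq.heappush(self.ready, (prio, name, work - 1))
        return name

npu = PreemptibleNPU()
npu.submit(priority=1, name="batch-job", work=4)
trace = [npu.step(), npu.step()]
npu.submit(priority=0, name="latency-critical", work=2)  # preempts batch-job
trace += [npu.step() for _ in range(4)]
print(trace)  # ['batch-job', 'batch-job', 'latency-critical',
              #  'latency-critical', 'batch-job', 'batch-job']
```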
Personalized recommendations are the backbone machine learning (ML) algorithm that powers several important application domains (e.g., ads, e-commerce) serviced from cloud datacenters. Sparse embedding layers are a crucial building block in designing recommendations, yet little attention has been paid to properly accelerating this important ML algorithm. This paper first provides a detailed workload characterization on personalized recommendations and identifies two significant performance limiters: memory-intensive embedding layers and compute-intensive multi-layer perceptron (MLP) layers...
Homomorphic Encryption (HE) is one of the most promising post-quantum cryptographic schemes that enable privacy-preserving computation on servers. However, noise accumulates as we perform operations on HE-encrypted data, restricting the number of possible operations. Fully HE (FHE) removes this restriction by introducing the bootstrapping operation, which refreshes the data; however, FHE schemes are highly memory-bound. Bootstrapping, in particular, requires loading GBs of evaluation keys and plaintexts from off-chip...
Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that their two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a series of sparse-dense matrix multiplications. However, prior work frequently suffers from inefficient data movements, leaving significant...
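A minimal NumPy rendering of the two stages makes the dataflow mismatch concrete (illustration only): aggregation is a sparse-dense product over an irregular adjacency, while combination is a regular dense product.

```python
# One GCN layer in two matmuls: aggregation (A @ X) walks a sparse,
# irregular adjacency; combination (H @ W) is dense and compute-bound.
import numpy as np

N, F_in, F_out = 6, 8, 4
A = (np.random.rand(N, N) < 0.3).astype(np.float32)   # sparse adjacency
X = np.random.randn(N, F_in).astype(np.float32)       # node features
W = np.random.randn(F_in, F_out).astype(np.float32)   # layer weights

def gcn_layer(A, X, W):
    H = A @ X            # aggregation: sparse-dense, irregular access to X
    Z = H @ W            # combination: dense-dense, regular and compute-bound
    return np.maximum(Z, 0.0)

out = gcn_layer(A, X, W)
print(out.shape)  # (6, 4)
```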
Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to its high design overheads and the lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market, ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we deep-dive into UPMEM's commercial PIM technology, a PIM-enabled parallel computing architecture that is...
GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can increase contention for various system resources, however, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly...
This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors. In these systems, requests from thousands of concurrent threads compete for a limited number of DRAM row buffers. As a result, only a fraction of the data fetched into a row buffer is used, leading to significant energy overheads. Our proposed architecture exploits the hierarchical organization of a DRAM bank to reduce the minimum row activation granularity. To avoid significant incremental area with this approach, we must partition the DRAM datapath...
In cloud ML inference systems, batching is an essential technique for increasing throughput, which helps optimize total cost of ownership. Prior graph batching combines individual DNN graphs into a single one, allowing multiple inputs to be concurrently executed in parallel. We observe that coarse-grained graph batching becomes suboptimal at effectively handling dynamic request traffic, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and...
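A simplified rendering of SLA-aware batching (not the paper's exact algorithm; SLA_MS and the latency table below are assumed numbers): grow the batch only while the oldest queued request can still meet its deadline.

```python
# Pick the largest supported batch whose completion still meets the
# oldest request's SLA; otherwise keep waiting for more arrivals.
from collections import deque

SLA_MS = 50
SERVICE_MS = {1: 10, 2: 14, 4: 20, 8: 30}   # assumed batched latencies

def dispatch(queue, now_ms):
    if not queue:
        return None
    oldest_arrival = queue[0]
    for size in sorted(SERVICE_MS, reverse=True):
        if size <= len(queue):
            done = now_ms + SERVICE_MS[size]
            if done - oldest_arrival <= SLA_MS:
                return [queue.popleft() for _ in range(size)]
    return None  # even a batch of 1 misses: shed or run best effort

queue = deque([0, 2, 5])            # arrival times (ms) of pending requests
print(dispatch(queue, now_ms=8))    # -> [0, 2]: largest batch meeting the SLA
```

A real scheduler would additionally predict whether waiting for more arrivals still leaves enough slack; the sketch only checks the batch it can form right now.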
Graph neural networks (GNNs) can extract features by learning both the representation of each object (i.e., graph nodes) and the relationships across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite their strengths, utilizing these algorithms in a production environment faces several challenges, as the number of graph nodes and edges amounts to several billions to hundreds of billions in scale, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an...
Current graphics processing units (GPUs) utilize the single instruction multiple thread (SIMT) execution model. With SIMT, a group of logical threads executes such that all threads in the group execute a common instruction on a particular cycle. To enable control flow to diverge within the group of threads, GPUs partially serialize execution and follow a single control flow path at a time. The execution of threads that are not on the current path is masked. Most current GPUs rely on a hardware reconvergence stack to track the multiple concurrent paths and to choose a single path for execution. Control flow paths are pushed onto the stack when they diverge and are popped off of the stack to reconverge the threads to a common path...
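The stack behavior described here can be modeled in a few lines of Python (a software illustration, not GPU hardware):

```python
# Software model of a SIMT reconvergence stack: each entry is
# (pc, active_mask); on a divergent branch both paths are pushed and
# executed one at a time, and popping reconverges the lanes.
def execute_branch(mask, taken_lanes, reconverge_pc, taken_pc, fall_pc):
    """Split an active mask at a divergent branch, returning stack
    entries in the order a hardware stack would execute them."""
    taken = [m and t for m, t in zip(mask, taken_lanes)]
    not_taken = [m and not t for m, t in zip(mask, taken_lanes)]
    stack = [(reconverge_pc, mask)]        # where both paths rejoin
    if any(not_taken):
        stack.append((fall_pc, not_taken))
    if any(taken):
        stack.append((taken_pc, taken))
    return stack

# 4 lanes, all active; lanes 0 and 2 take the branch.
stack = execute_branch([True] * 4, [True, False, True, False],
                       reconverge_pc=0x40, taken_pc=0x20, fall_pc=0x10)
while stack:
    pc, mask = stack.pop()                 # execute the top-of-stack path
    print(hex(pc), mask)
# Prints the taken path (lanes 0 and 2), then the fall-through path
# (lanes 1 and 3), then the reconvergence PC with all four lanes active.
```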
As the models and datasets used to train deep learning (DL) applications scale, system architects are faced with new challenges, one of which is the memory capacity bottleneck, where the limited physical memory inside the accelerator device constrains the algorithms that can be studied. We propose a memory-centric deep learning system that can transparently expand the memory capacity available to the accelerators while also providing fast inter-device communication for parallel training. Our proposal aggregates a pool of memory modules locally within the device-side interconnect, decoupled from the host...
Personalized recommendation systems are gaining significant traction due to their industrial importance. An important building block of recommendation systems consists of the embedding layers, which exhibit a highly memory-intensive characteristic. A fundamental primitive of embedding layers is the embedding vector gathers followed by vector reductions, which exhibit low arithmetic intensity and become bottlenecked by memory throughput. To tackle this challenge, recent proposals employ a near-data processing (NDP) solution at the DRAM rank-level, achieving...
Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control flow to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread, where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked SIMD lanes are wasted. This degradation can be mitigated by dynamically compacting...
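A toy rendering of compaction (illustrative; it deliberately ignores the lane-alignment constraint that real hardware must respect, which is exactly what lane permutation targets):

```python
# Compacting sparsely masked warps: two half-masked warps are merged so
# their unmasked threads fill one full-width SIMD issue slot.
def compact(warps, width=4):
    """Pack active thread ids from several masked warps into full rows."""
    active = [tid for warp in warps for tid, on in warp if on]
    return [active[i:i + width] for i in range(0, len(active), width)]

warp_a = [(0, True), (1, False), (2, True), (3, False)]   # 50% masked
warp_b = [(4, False), (5, True), (6, False), (7, True)]   # 50% masked
print(compact([warp_a, warp_b]))   # [[0, 2, 5, 7]] -> one full-width issue
```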
Personalized recommendations are one of the most widely deployed machine learning (ML) workloads serviced from cloud datacenters. As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior works. Unfortunately, little has been explored and understood regarding the training side of this emerging ML workload. In this paper, we first perform a detailed characterization study on training personalized recommendations, root-causing the sparse embedding layer as one of the most significant...
Current GPUs maintain high programmability by abstracting the SIMD nature of the hardware as independent concurrent threads of control, with hardware responsible for generating predicate masks to utilize the SIMD hardware for different flows of control. This dynamic masking leads to poor utilization of SIMD resources when the control flow of threads in the same group diverges. Prior research suggests that SIMD groups be formed dynamically by compacting a large number of threads into groups, mitigating the impact of divergence. To maintain hardware efficiency, however, the alignment of a thread to a SIMD lane is fixed, limiting the potential...