- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Ferroelectric and Negative Capacitance Devices
- Advanced Data Storage Technologies
- Stochastic Gradient Optimization Techniques
- Caching and Content Delivery
- Recommender Systems and Techniques
- Cryptography and Data Security
- Advanced Graph Neural Networks
- Interconnection Networks and Systems
- Privacy-Preserving Technologies in Data
- Digital Filter Design and Implementation
- Topic Modeling
- Advanced Data Compression Techniques
- Distributed and Parallel Computing Systems
- Tensor decomposition and applications
- IoT and Edge/Fog Computing
- Cloud Computing and Resource Management
- Graph Theory and Algorithms
- Natural Language Processing Techniques
- Wireless Sensor Networks for Data Analysis
- Adversarial Robustness in Machine Learning
- Cryptographic Implementations and Security
- Algorithms and Data Compression
Korea Advanced Institute of Science and Technology
2009-2025
Kootenay Association for Science & Technology
2022-2024
Seoul National University
2021
Korea Institute of Science & Technology Information
2019
Universitat Politècnica de Catalunya
2019
Barcelona Supercomputing Center
2019
Pohang University of Science and Technology
2017-2018
Korea Post
2017-2018
Nvidia (United Kingdom)
2017
Nvidia (United States)
2016
Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and the zero-valued activations that arise from the common ReLU operator...
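As a rough sketch of the sparsity this abstract refers to (illustrative Python, not the SCNN dataflow): a convolution only needs to multiply (activation, weight) pairs where both operands are non-zero, so pruned weights and post-ReLU zeros can be skipped entirely.

```python
# Toy illustration (not SCNN itself): a 1-D convolution that skips
# zero-valued weights and activations, the sparsity SCNN exploits.
import numpy as np

def dense_conv1d(x, w):
    """Reference dense convolution (valid mode)."""
    n, k = len(x), len(w)
    return np.array([sum(x[i + j] * w[j] for j in range(k))
                     for i in range(n - k + 1)])

def sparse_conv1d(x, w):
    """Multiply only (activation, weight) pairs where both are non-zero."""
    n, k = len(x), len(w)
    out = np.zeros(n - k + 1)
    nz_w = [(j, wj) for j, wj in enumerate(w) if wj != 0.0]   # pruned weights
    nz_x = [(i, xi) for i, xi in enumerate(x) if xi != 0.0]   # post-ReLU activations
    for i, xi in nz_x:
        for j, wj in nz_w:
            o = i - j                 # output position this pair contributes to
            if 0 <= o < len(out):
                out[o] += xi * wj
    return out

x = np.maximum(np.random.randn(16), 0)               # ReLU-style sparse input
w = np.random.randn(5) * (np.random.rand(5) > 0.6)   # pruned filter
assert np.allclose(dense_conv1d(x, w), sparse_conv1d(x, w))
```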
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN...
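The idea can be sketched in a few lines: spill feature maps that the forward pass is done with to host memory, and fetch them back when the backward pass reaches that layer. All names below are illustrative; this is not the paper's actual runtime.

```python
# Minimal sketch of memory virtualization for DNN training: offload a
# layer's feature map to host DRAM once the forward pass moves on, then
# prefetch it back right before the backward pass needs it.
class VirtualizedTrainer:
    def __init__(self, num_layers, gpu_capacity):
        self.gpu = {}              # layer -> feature map resident on GPU
        self.host = {}             # layer -> feature map offloaded to CPU DRAM
        self.capacity = gpu_capacity

    def forward(self, layer, feature_map):
        self.gpu[layer] = feature_map
        # Evict feature maps of earlier layers: backward will not need
        # them until much later, so keep GPU memory under capacity.
        while len(self.gpu) > self.capacity:
            victim = min(self.gpu)                     # oldest layer
            self.host[victim] = self.gpu.pop(victim)   # DMA to host

    def backward(self, layer):
        if layer not in self.gpu:                      # prefetch back from host
            self.gpu[layer] = self.host.pop(layer)
        return self.gpu.pop(layer)

trainer = VirtualizedTrainer(num_layers=6, gpu_capacity=2)
for l in range(6):
    trainer.forward(l, f"fmap{l}")
for l in reversed(range(6)):
    trainer.backward(l)
```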
Popular deep learning frameworks require users to fine-tune their memory usage so that the training data of a deep neural network (DNN) fits within GPU physical memory. Prior work tries to address this restriction by virtualizing the memory usage of DNNs, enabling both GPU and CPU memory to be utilized for memory allocations. Despite its merits, virtualization can incur significant performance overheads when the time needed to copy data back and forth from CPU memory is higher than the latency to perform DNN computations. We introduce a high-performance virtualization strategy based on...
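To see why shrinking the offloaded data helps hide copy latency, here is a toy zero-value compression of sparse activations, a software stand-in for the kind of compression a DMA engine could apply (names are hypothetical):

```python
# Toy zero-value compression of ReLU activations: store a bitmask of
# non-zeros plus the packed non-zero values, mimicking in software why
# compressing sparse feature maps shrinks CPU<->GPU copy traffic.
import numpy as np

def compress(act):
    mask = act != 0.0
    return mask, act[mask]                 # bitmask + packed values

def decompress(mask, packed):
    out = np.zeros(mask.shape, dtype=packed.dtype)
    out[mask] = packed
    return out

act = np.maximum(np.random.randn(1 << 16).astype(np.float32), 0)
mask, packed = compress(act)
ratio = act.nbytes / (mask.size / 8 + packed.nbytes)
assert np.array_equal(act, decompress(mask, packed))
print(f"compression ratio ~{ratio:.2f}x")  # ~2x at ~50% sparsity
```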
Recent studies from several hyperscalars point to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and the associated tensor operations. We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations. These DIMMs are populated inside a GPU-centric system interconnect as a remote...
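The tensor primitive at stake is a gather of embedding rows followed by an element-wise reduction. The sketch below (illustrative only, not the paper's design) shows the operation and how interleaving the table lets each memory partition reduce its own slice, so that only partial sums cross the interconnect:

```python
# Embedding-layer core primitive: gather rows of a large table by index,
# then reduce them element-wise. The "rank" split only mimics how
# near-memory cores could each reduce their slice in parallel.
import numpy as np

table = np.random.randn(100_000, 64).astype(np.float32)  # embedding table
ids = np.random.randint(0, 100_000, size=40)              # sparse feature ids

def gather_reduce(table, ids):
    return table[ids].sum(axis=0)          # memory-bound: ~1 FLOP per 4 bytes

def rank_parallel_gather_reduce(table, ids, num_ranks=4):
    partials = []
    for r in range(num_ranks):
        local = ids[ids % num_ranks == r]  # ids mapped to this rank's rows
        partials.append(table[local].sum(axis=0))
    return np.sum(partials, axis=0)        # only partial sums move off-rank

assert np.allclose(gather_reduce(table, ids),
                   rank_parallel_gather_reduce(table, ids), atol=1e-4)
```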
Homomorphic encryption (HE) enables the secure offloading of computations to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes in which noise accumulates as more computations are applied to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited number of operations, or fully HE (FHE), by refreshing the ciphertext. Unfortunately, bootstrapping requires a significant amount of additional computation and memory bandwidth as well. Prior works have...
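The noise budget being described can be demonstrated with a deliberately insecure toy scheme (assumed parameters, no real cryptography): a bit is hidden under even noise, additions accumulate that noise, and once it wraps the modulus the message is lost, which is what bootstrapping exists to prevent.

```python
# Tiny "noisy encryption" toy (no security whatsoever): a bit m is
# hidden as c = m + 2e mod q (q odd). Adding ciphertexts adds plaintexts
# but also adds noise; once the noise term wraps the modulus, the
# parity (the message) is destroyed.
import random

Q = 2 ** 16 + 1                             # odd modulus so wrap-around flips parity

def encrypt(m, noise=1):
    e = random.randint(-noise, noise)
    return (m + 2 * e) % Q

def decrypt(c):
    centered = c if c < Q // 2 else c - Q   # lift to (-q/2, q/2]
    return centered % 2

def add(c1, c2):
    return (c1 + c2) % Q                    # noise of the sum ~ e1 + e2

assert decrypt(encrypt(1)) == 1             # fresh ciphertext: small noise, correct
c = encrypt(1, noise=6000)
for _ in range(200):                        # keep adding encryptions of 0
    c = add(c, encrypt(0, noise=6000))
# By now the accumulated noise has very likely wrapped the modulus, so
# the decrypted bit is unreliable; real FHE bootstraps before this point.
print("decrypted bit:", decrypt(c))
```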
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained accesses, however, are a poor match for emerging applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their memory hierarchies make CPU-specific memory system enhancements ineffective at improving performance...
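A quick back-of-the-envelope way to see the mismatch (illustrative Python, not the paper's methodology): measure how much of each coarse-grained fetch an access stream actually uses.

```python
# Fraction of each 128-byte line that an access pattern actually touches:
# dense streaming uses every word, a sparse gather wastes most of each fetch.
import random

LINE_BYTES, WORD_BYTES = 128, 4
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES

def line_utilization(addresses):
    """Fraction of fetched words that the access stream actually used."""
    touched = {}
    for addr in addresses:
        line = addr // WORDS_PER_LINE
        touched.setdefault(line, set()).add(addr % WORDS_PER_LINE)
    used = sum(len(words) for words in touched.values())
    return used / (len(touched) * WORDS_PER_LINE)

streaming = list(range(1024))                     # dense, regular
gather = random.sample(range(1 << 20), 1024)      # sparse, irregular
print(f"regular  : {line_utilization(streaming):.0%}")  # ~100%
print(f"irregular: {line_utilization(gather):.0%}")     # ~3% (1 word of 32)
```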
To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests. This paper makes a case for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput. We evaluate both the mechanisms that enable NPUs to be preemptible and the policies that utilize them to achieve scheduling objectives. We show that a preemptive NPU...
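The scheduling idea can be sketched as a toy priority queue with checkpoint-based preemption (illustrative names; the paper's mechanisms are hardware-level):

```python
# Toy timeline simulation of a preemptible accelerator: lower priority
# value = more urgent; a task preempted by a later, more urgent submit
# is checkpointed and simply loses the next time slice.
import heapq

class PreemptibleNPU:
    def __init__(self):
        self.ready = []            # heap of (priority, name, remaining_work)

    def submit(self, priority, name, work):
        heapq.heappush(self.ready, (priority, name, work))

    def step(self):
        """Run the most urgent task for one time slice."""
        if not self.ready:
            return None
        prio, name, work = heapq.heappop(self.ready)
        if work > 1:               # checkpoint state, requeue the remainder
            heapq.heappush(self.ready, (prio, name, work - 1))
        return name

npu = PreemptibleNPU()
npu.submit(priority=1, name="batch-job", work=4)
trace = [npu.step(), npu.step()]
npu.submit(priority=0, name="latency-critical", work=2)  # preempts batch-job
trace += [npu.step() for _ in range(4)]
print(trace)  # ['batch-job', 'batch-job', 'latency-critical',
              #  'latency-critical', 'batch-job', 'batch-job']
```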
Personalized recommendations are the backbone machine learning (ML) algorithm that powers several important application domains (e.g., ads, e-commerce) serviced from cloud datacenters. Sparse embedding layers are a crucial building block in designing recommendations, yet little attention has been paid to properly accelerating this important ML algorithm. This paper first provides a detailed workload characterization on personalized recommendations and identifies two significant performance limiters: memory-intensive embedding layers and compute-intensive multi-layer perceptron (MLP) layers...
Homomorphic Encryption (HE) is one of the most promising post-quantum cryptographic schemes that enable privacy-preserving computation on servers. However, noise accumulates as we perform operations on HE-encrypted data, restricting the number of possible operations. Fully HE (FHE) removes this restriction by introducing the bootstrapping operation, which refreshes the data; however, FHE schemes are highly memory-bound. Bootstrapping, in particular, requires loading GBs of evaluation keys and plaintexts from off-chip...
Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that their two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a series of sparse-dense matrix multiplications. However, prior work frequently suffers from inefficient data movements, leaving significant...
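A minimal NumPy rendering of the two stages makes the dataflow mismatch concrete (illustration only): aggregation is a sparse-dense product over an irregular adjacency, while combination is a regular dense product.

```python
# One GCN layer in two matmuls: aggregation (A @ X) walks a sparse,
# irregular adjacency; combination (H @ W) is dense and compute-bound.
import numpy as np

N, F_in, F_out = 6, 8, 4
A = (np.random.rand(N, N) < 0.3).astype(np.float32)   # sparse adjacency
X = np.random.randn(N, F_in).astype(np.float32)       # node features
W = np.random.randn(F_in, F_out).astype(np.float32)   # layer weights

def gcn_layer(A, X, W):
    H = A @ X            # aggregation: sparse-dense, irregular access to X
    Z = H @ W            # combination: dense-dense, regular and compute-bound
    return np.maximum(Z, 0.0)

out = gcn_layer(A, X, W)
print(out.shape)  # (6, 4)
```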
Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to its high design overheads and the lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market, ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we deep-dive into UPMEM's commercial PIM technology, a PIM-enabled parallel computing architecture that is...
GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can increase contention for various system resources, however, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly...
This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors. In these systems, requests from thousands of concurrent threads compete for a limited number of DRAM row buffers. As a result, only a fraction of the data fetched into a row buffer is used, leading to significant energy overheads. Our proposed architecture exploits the hierarchical organization of a DRAM bank to reduce the minimum row activation granularity. To avoid significant incremental area with this approach, we must partition the DRAM datapath...
In cloud ML inference systems, batching is an essential technique for increasing throughput, which helps optimize total cost of ownership. Prior graph batching combines individual DNN graphs into a single one, allowing multiple inputs to be concurrently executed in parallel. We observe that coarse-grained graph batching becomes suboptimal at effectively handling dynamic request traffic, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and...
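A simplified rendering of SLA-aware batching (not the paper's exact algorithm; SLA_MS and the latency table below are assumed numbers): grow the batch only while the oldest queued request can still meet its deadline.

```python
# Pick the largest supported batch whose completion still meets the
# oldest request's SLA; otherwise keep waiting for more arrivals.
from collections import deque

SLA_MS = 50
SERVICE_MS = {1: 10, 2: 14, 4: 20, 8: 30}   # assumed batched latencies

def dispatch(queue, now_ms):
    if not queue:
        return None
    oldest_arrival = queue[0]
    for size in sorted(SERVICE_MS, reverse=True):
        if size <= len(queue):
            done = now_ms + SERVICE_MS[size]
            if done - oldest_arrival <= SLA_MS:
                return [queue.popleft() for _ in range(size)]
    return None  # even a batch of 1 misses: shed or run best effort

queue = deque([0, 2, 5])            # arrival times (ms) of pending requests
print(dispatch(queue, now_ms=8))    # -> [0, 2]: largest batch meeting the SLA
```

A real scheduler would additionally predict whether waiting for more arrivals still leaves enough slack; the sketch only checks the batch it can form right now.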
Graph neural networks (GNNs) can extract features by learning both the representation of each object (i.e., graph nodes) and the relationships across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite their strengths, utilizing these algorithms in a production environment faces several challenges, as the number of graph nodes and edges amounts to several billions to hundreds of billions in scale, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an...
Current graphics processing units (GPUs) utilize the single instruction multiple thread (SIMT) execution model. With SIMT, a group of logical threads executes such that all threads in the group execute a common instruction on a particular cycle. To enable control flow to diverge within the group of threads, GPUs partially serialize execution and follow a single control flow path at a time. The execution of threads that are not on the current path is masked. Most current GPUs rely on a hardware reconvergence stack to track the multiple concurrent paths and to choose a single path for execution. Control flow paths are pushed onto the stack when they diverge and are popped off of the stack to reconverge the threads to a common path...
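The stack behavior described here can be modeled in a few lines of Python (a software illustration, not GPU hardware):

```python
# Software model of a SIMT reconvergence stack: each entry is
# (pc, active_mask); on a divergent branch both paths are pushed and
# executed one at a time, and popping reconverges the lanes.
def execute_branch(mask, taken_lanes, reconverge_pc, taken_pc, fall_pc):
    """Split an active mask at a divergent branch, returning stack
    entries in the order a hardware stack would execute them."""
    taken = [m and t for m, t in zip(mask, taken_lanes)]
    not_taken = [m and not t for m, t in zip(mask, taken_lanes)]
    stack = [(reconverge_pc, mask)]        # where both paths rejoin
    if any(not_taken):
        stack.append((fall_pc, not_taken))
    if any(taken):
        stack.append((taken_pc, taken))
    return stack

# 4 lanes, all active; lanes 0 and 2 take the branch.
stack = execute_branch([True] * 4, [True, False, True, False],
                       reconverge_pc=0x40, taken_pc=0x20, fall_pc=0x10)
while stack:
    pc, mask = stack.pop()                 # execute the top-of-stack path
    print(hex(pc), mask)
# Prints the taken path (lanes 0 and 2), then the fall-through path
# (lanes 1 and 3), then the reconvergence PC with all four lanes active.
```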
As the models and datasets used to train deep learning (DL) applications scale, system architects are faced with new challenges, one of which is the memory capacity bottleneck, where the limited physical memory inside the accelerator device constrains the algorithms that can be studied. We propose a memory-centric deep learning system that can transparently expand the memory capacity available to the accelerators while also providing fast inter-device communication for parallel training. Our proposal aggregates a pool of memory modules locally within the device-side interconnect, decoupled from the host...
Personalized recommendation systems are gaining significant traction due to their industrial importance. An important building block of recommendation systems consists of the embedding layers, which exhibit a highly memory-intensive characteristic. A fundamental primitive of embedding layers is the embedding vector gathers followed by vector reductions, which exhibit low arithmetic intensity and become bottlenecked by memory throughput. To tackle this challenge, recent proposals employ a near-data processing (NDP) solution at the DRAM rank-level, achieving...
Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control flow to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread, where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked SIMD lanes are wasted. This degradation can be mitigated by dynamically compacting...
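A toy rendering of compaction (illustrative; it deliberately ignores the lane-alignment constraint that real hardware must respect, which is exactly what lane permutation targets):

```python
# Compacting sparsely masked warps: two half-masked warps are merged so
# their unmasked threads fill one full-width SIMD issue slot.
def compact(warps, width=4):
    """Pack active thread ids from several masked warps into full rows."""
    active = [tid for warp in warps for tid, on in warp if on]
    return [active[i:i + width] for i in range(0, len(active), width)]

warp_a = [(0, True), (1, False), (2, True), (3, False)]   # 50% masked
warp_b = [(4, False), (5, True), (6, False), (7, True)]   # 50% masked
print(compact([warp_a, warp_b]))   # [[0, 2, 5, 7]] -> one full-width issue
```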
Personalized recommendations are one of the most widely deployed machine learning (ML) workloads serviced from cloud datacenters. As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior works. Unfortunately, little has been explored and understood regarding the training side of this emerging ML workload. In this paper, we first perform a detailed characterization study on training personalized recommendations, root-causing the sparse embedding layer as one of the most significant...
Current GPUs maintain high programmability by abstracting the SIMD nature of the hardware as independent concurrent threads of control, with hardware responsible for generating predicate masks to utilize the SIMD hardware for different flows of control. This dynamic masking leads to poor utilization of SIMD resources when the control flow of threads in the same group diverges. Prior research suggests that SIMD groups be formed dynamically by compacting a large number of threads into groups, mitigating the impact of divergence. To maintain hardware efficiency, however, the alignment of a thread to a SIMD lane is fixed, limiting the potential...