Youngsok Kim

ORCID: 0000-0002-1015-9969
Research Areas
  • Advanced Neural Network Applications
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Advanced Memory and Neural Computing
  • Advanced Graph Neural Networks
  • Ferroelectric and Negative Capacitance Devices
  • Caching and Content Delivery
  • Domain Adaptation and Few-Shot Learning
  • Cloud Computing and Resource Management
  • Graph Theory and Algorithms
  • Adversarial Robustness in Machine Learning
  • Topic Modeling
  • Neural Networks and Applications
  • Stochastic Gradient Optimization Techniques
  • Distributed systems and fault tolerance
  • Interconnection Networks and Systems
  • IoT and Edge/Fog Computing
  • Security and Verification in Computing
  • Low-power high-performance VLSI design
  • VLSI and Analog Circuit Testing
  • Distributed and Parallel Computing Systems
  • Algorithms and Data Compression
  • Natural Language Processing Techniques
  • Neural dynamics and brain function
  • Industrial Vision Systems and Defect Detection

Yonsei University
2019-2024

Seoul National University
2017-2019

Pohang University of Science and Technology
2014-2016

Korea Post
2013-2015

We are experiencing explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices: the energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we...

10.1145/3173162.3173177 article EN 2018-03-19
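The data-movement argument above can be illustrated with a back-of-the-envelope calculation. The picojoule figures below are rough textbook ballparks chosen for the example, not numbers from the paper:

```python
# Back-of-the-envelope sketch of why data movement dominates: a DRAM
# access costs orders of magnitude more energy than an arithmetic op.
# The picojoule costs below are illustrative ballparks, not measured values.
ENERGY_PJ = {"add": 1.0, "dram_access": 640.0}

def workload_energy(num_ops, num_dram_accesses):
    compute = num_ops * ENERGY_PJ["add"]
    movement = num_dram_accesses * ENERGY_PJ["dram_access"]
    return compute, movement

# Even with 10x more arithmetic ops than DRAM accesses, movement dominates.
compute, movement = workload_energy(num_ops=1_000_000, num_dram_accesses=100_000)
print(movement / (compute + movement))  # data movement's share of total energy
```

Under these assumed costs, data movement accounts for roughly 98% of the energy despite arithmetic operations outnumbering DRAM accesses ten to one.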

Graphics processing units (GPUs) are important components of modern computing devices for not only graphics rendering, but also efficient parallel computations. However, their security problems have been ignored despite their importance and popularity. In this paper, we first perform an in-depth security analysis on GPUs to detect security vulnerabilities. We observe that contemporary, widely-used GPUs, both NVIDIA's and AMD's, do not initialize newly allocated GPU memory pages, which may contain sensitive user data. By exploiting...

10.1109/sp.2014.9 article EN IEEE Symposium on Security and Privacy 2014-05-01
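The uninitialized-memory problem the paper observes can be sketched with a toy allocator. This is a hypothetical model, not a real GPU API; it only shows why reusing freed pages without zeroing leaks one process's data to another:

```python
# Hypothetical sketch (not a real GPU API): models how an allocator that
# does not zero freed pages can leak one process's data to another.
class NaiveGpuAllocator:
    """Reuses freed pages verbatim, as the paper observes real GPUs doing."""
    def __init__(self):
        self.free_pages = []

    def alloc(self, size):
        # Reuse a freed page if available; the old contents survive.
        if self.free_pages:
            return self.free_pages.pop()
        return bytearray(size)  # fresh pages happen to start zeroed here

    def free(self, page):
        self.free_pages.append(page)  # no zeroing on free -> the bug

allocator = NaiveGpuAllocator()

# "Victim" process writes sensitive data, then frees its GPU memory.
victim_page = allocator.alloc(16)
victim_page[:6] = b"secret"
allocator.free(victim_page)

# "Attacker" process allocates and simply reads the uninitialized page.
attacker_page = allocator.alloc(16)
leaked = bytes(attacker_page[:6])
print(leaked)  # the victim's data is still there
```

A zeroing `free` (or zero-on-allocate) breaks this leak, which is the mitigation direction such analyses point to.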

Emerging mobile services heavily utilize Neural Networks (NNs) to improve user experiences. Such NN-assisted services depend on fast NN execution for high responsiveness, demanding that mobile devices minimize the NN execution latency by efficiently utilizing their underlying hardware resources. To better utilize the resources, existing NN frameworks either employ various CPU-friendly optimizations (e.g., vectorization, quantization) or exploit data parallelism using heterogeneous processors such as GPUs and DSPs. However, their performance is...

10.1145/3302424.3303950 article EN 2019-03-22

In training of modern large natural language processing (NLP) models, it has become a common practice to split the models using 3D parallelism across multiple GPUs. Such a technique, however, suffers from the high overhead of inter-node communication. Compressing the communication is one way to mitigate the overhead by reducing the traffic volume; however, existing compression techniques have critical limitations when applied to NLP models trained with 3D parallelism in that 1) only the data-parallel traffic is targeted, and 2) the existing compression schemes already harm model quality too much.

10.1145/3575693.3575712 article EN 2023-01-27
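One common family of communication-compression techniques in this space is top-k gradient sparsification, sketched below. The 30% keep ratio and the gradient values are arbitrary choices for the example, not from the paper:

```python
# Illustrative sketch of top-k gradient sparsification, one family of
# communication-compression techniques; ratio and values are arbitrary.
def topk_compress(grad, ratio=0.1):
    """Keep only the largest-magnitude entries; send (index, value) pairs."""
    k = max(1, int(len(grad) * ratio))
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in idx]

def decompress(pairs, length):
    """Rebuild a dense gradient, zero-filling the dropped entries."""
    out = [0.0] * length
    for i, v in pairs:
        out[i] = v
    return out

grad = [0.01, -2.0, 0.03, 0.5, -0.02, 1.2, 0.0, -0.04, 0.06, 0.9]
pairs = topk_compress(grad, ratio=0.3)       # only 3 of 10 entries are sent
restored = decompress(pairs, len(grad))
print(len(pairs), restored[1], restored[5])  # → 3 -2.0 1.2
```

The traffic drops to the keep ratio, but the zeroed small entries are exactly the quality-loss risk the abstract mentions.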

10.1145/3296957.3173177 article EN ACM SIGPLAN Notices 2018-03-19

Thanks to the recent advances in Deep Neural Networks (DNNs), DNN-based object detection systems have become highly accurate and are widely used in real-time environments such as autonomous vehicles, drones, and security robots. Although such systems should detect objects within a time limit that can vary depending on their execution environments, such as vehicle speeds, existing systems blindly execute entire long-latency DNNs without reflecting the time-varying limits, and thus they cannot guarantee real-time constraints. This work proposes a novel system...

10.1109/rtas48715.2020.000-8 article EN 2020-04-01

Model quantization is considered a promising method to greatly reduce the resource requirements of deep neural networks. To deal with the performance drop induced by quantization errors, popular methods use training data to fine-tune quantized networks. In real-world environments, however, such methods are frequently infeasible because the training data is unavailable due to security, privacy, or confidentiality concerns. Zero-shot quantization addresses such problems, usually by taking information from the weights of a full-precision teacher network to compensate for the performance drop. In this paper, we first analyze...

10.1109/cvpr52688.2022.00813 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
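The basic operation whose induced error zero-shot quantization tries to recover is uniform weight quantization, sketched here. The 4-bit width and the weight values are illustrative only:

```python
# Minimal sketch of symmetric uniform weight quantization; the accuracy
# drop it induces is what zero-shot quantization tries to recover.
# Bit-width and weights are illustrative, not from the paper.
def quantize(weights, bits=4):
    """Map floats to integers in [-2^(b-1), 2^(b-1)-1] and back."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    deq = [v * scale for v in q]  # dequantized values seen at inference
    return q, deq

w = [0.7, -0.3, 0.1, -0.7]
q, deq = quantize(w, bits=4)
err = max(abs(a - b) for a, b in zip(w, deq))
print(q)            # integer codes, e.g. [7, -3, 1, -7]
print(err <= 0.05)  # error stays within half a quantization step
```

With training data one would fine-tune away `err`'s effect on accuracy; zero-shot methods must do so using only the full-precision weights.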

The recent huge advance of Large Language Models (LLMs) is mainly driven by the increase in the number of parameters. This has led to substantial memory capacity requirements, necessitating the use of dozens of GPUs just to meet the capacity. One popular solution to this is storage-offloaded training, which uses host memory and storage as an extended memory hierarchy. However, this obviously comes at the cost of a storage bandwidth bottleneck, because storage devices have orders of magnitude lower bandwidth compared to that of GPU device memories. Our work, Smart-Infinity, addresses...

10.1109/hpca57654.2024.00034 article EN 2024-03-02

Efficient cache tag management is a primary design objective for large, in-package DRAM caches. Recently, Tagless DRAM Caches (TDCs) have been proposed to completely eliminate tagging structures from both on-die SRAM and DRAM, which are a major scalability bottleneck for future multi-gigabyte DRAM caches. However, TDC imposes a constraint that the cache block size be the same as the OS page size (e.g., 4KB), as it takes a unified approach to address translation and cache tag management. Caching at a page granularity, or page-based caching, incurs significant...

10.1109/hpca.2016.7446068 article EN 2016-03-01

We present dataflow mirroring, architectural support for low-overhead fine-grained systolic array allocation which overcomes the limitations of prior coarse-grained spatial-multitasking Neural Processing Unit (NPU) architectures. The key idea of dataflow mirroring is to reverse the dataflows of co-located Neural Networks (NNs) in horizontal and/or vertical directions, allowing boundaries to be set between any adjacent rows and columns of a systolic array and supporting up to four-way spatial multitasking. Our detailed experiments using MLPerf...

10.1109/dac18074.2021.9586312 article EN 2021-11-08
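The partitioning idea can be sketched conceptually: one row boundary and one column boundary split the array into quadrants, and each co-located NN streams its dataflow from its own corner (the "mirrored" directions). The array size and boundary positions below are illustrative, not from the paper:

```python
# Conceptual sketch of four-way spatial multitasking on a systolic array:
# a row boundary and a column boundary define four quadrants, one per NN.
# Array size and boundary positions are illustrative.
ROWS, COLS = 8, 8

def partition(row_boundary, col_boundary):
    """Return a ROWS x COLS map of which NN (0-3) owns each PE."""
    grid = [[0] * COLS for _ in range(ROWS)]
    for r in range(ROWS):
        for c in range(COLS):
            quadrant = (r >= row_boundary) * 2 + (c >= col_boundary)
            grid[r][c] = quadrant
    return grid

# Mirrored dataflow origins: each NN injects data from a distinct corner,
# so the quadrants never collide even though boundaries are arbitrary.
origins = {0: (0, 0), 1: (0, COLS - 1), 2: (ROWS - 1, 0), 3: (ROWS - 1, COLS - 1)}

grid = partition(row_boundary=5, col_boundary=3)
sizes = [sum(row.count(q) for row in grid) for q in range(4)]
print(sizes)  # PEs per NN for boundaries (5, 3)
```

Because the boundaries can sit between any adjacent rows and columns, the per-NN allocation is fine-grained rather than fixed halves or quarters.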

This work presents DANCE, a differentiable approach towards the co-exploration of hardware accelerator and network architecture design. At the heart of DANCE is the evaluator network. By modeling the evaluation software with a neural network, the relation between the design choices and the design metrics becomes differentiable, allowing the search to be performed with backpropagation. Compared to naive existing approaches, our method performs the co-exploration in a significantly shorter time, while achieving superior accuracy and cost metrics.

10.1109/dac18074.2021.9586121 article EN 2021-11-08

Modern dual in-line memory modules (DIMMs) support processing-in-memory (PIM) by implementing in-DIMM processors (IDPs) located near memory banks. PIM can greatly accelerate in-memory join, whose performance is frequently bounded by main-memory accesses, by offloading the operations of join from the host central processing units (CPUs) to the IDPs. As real PIM hardware has not been available until very recently, prior PIM-assisted join algorithms have relied on simulators which assume fast shared memory between the IDPs and...

10.1145/3589258 article EN Proceedings of the ACM on Management of Data 2023-06-13
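The offloading idea can be sketched as a partitioned hash join: tuples are hash-partitioned so that each (simulated) IDP builds and probes only its own partition, with no inter-IDP communication. The IDP count and tables are illustrative, and real IDPs would of course run in parallel rather than in a Python loop:

```python
# Simplified sketch of offloading a hash join to in-DIMM processors:
# hash-partition both tables so each IDP joins only its own partition.
# IDP count and table contents are illustrative.
NUM_IDPS = 4

def partitioned_hash_join(r_table, s_table):
    """r_table, s_table: lists of (key, payload); returns joined tuples."""
    r_parts = [[] for _ in range(NUM_IDPS)]
    s_parts = [[] for _ in range(NUM_IDPS)]
    for t in r_table:
        r_parts[hash(t[0]) % NUM_IDPS].append(t)
    for t in s_table:
        s_parts[hash(t[0]) % NUM_IDPS].append(t)

    result = []
    for idp in range(NUM_IDPS):          # each iteration = one IDP's work
        table = {}
        for k, v in r_parts[idp]:        # build phase, local to the IDP
            table.setdefault(k, []).append(v)
        for k, v in s_parts[idp]:        # probe phase, local to the IDP
            for rv in table.get(k, []):
                result.append((k, rv, v))
    return result

r = [(1, "r1"), (2, "r2"), (3, "r3")]
s = [(2, "s2"), (3, "s3"), (4, "s4")]
out = sorted(partitioned_hash_join(r, s))
print(out)  # → [(2, 'r2', 's2'), (3, 'r3', 's3')]
```

Matching keys always hash to the same partition, so the per-IDP joins are independent; that independence is what makes the offload attractive when IDPs lack fast shared memory.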

Model quantization is known as a promising method to compress deep neural networks, especially for inference on lightweight mobile or edge devices. However, model quantization usually requires access to the original training data to maintain the accuracy of the full-precision models, which is often infeasible in real-world scenarios due to security and privacy issues. A popular approach is to perform quantization without the original data, using synthetically generated samples based on batch-normalization statistics or adversarial learning. A drawback of such approaches is that...

10.48550/arxiv.2111.02625 preprint EN cc-by-nc-nd arXiv (Cornell University) 2021-01-01

GPU programmers suffer from programmer-managed GPU memory because both performance and programmability heavily depend on GPU memory allocation and CPU-GPU data transfer mechanisms. To improve programmability, programmers should be able to place in GPU memory only the data frequently accessed by the GPU, while overlapping data transfers and GPU executions as much as possible. However, current GPU architectures and programming models blindly place entire data in GPU memory, requiring a significantly large GPU memory size. Otherwise, they must trigger unnecessary data transfers due to an insufficient GPU memory size. In this paper, we...

10.1109/hpca.2014.6835963 article EN 2014-02-01

Spiking Neural Networks (SNNs) play an important role in neuroscience as they help neuroscientists understand how the nervous system works. To model the nervous system, SNNs incorporate the concept of time into neurons and inter-neuron interactions called spikes; a neuron's internal state changes with respect to time and input spikes, and a neuron fires an output spike when its internal state satisfies certain conditions. As the neurons forming the nervous system behave differently, SNN simulation frameworks must be able to simulate the diverse behaviors of neurons. To support any...

10.1109/isca.2018.00032 article EN 2018-06-01
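One of the simplest neuron behaviors such a framework must support is the leaky integrate-and-fire (LIF) model, sketched below with illustrative constants:

```python
# Minimal leaky integrate-and-fire (LIF) neuron, one of the diverse neuron
# models an SNN framework must support; constants are illustrative.
def simulate_lif(input_spikes, leak=0.9, weight=0.5, threshold=1.0):
    """Return the time steps at which the neuron fires."""
    v, fired = 0.0, []
    for t, spike in enumerate(input_spikes):
        v = v * leak + weight * spike   # leak the state, integrate the input
        if v >= threshold:              # fire when the state crosses threshold
            fired.append(t)
            v = 0.0                     # reset after the output spike
    return fired

spikes = [1, 1, 0, 1, 1, 1, 0, 0]
print(simulate_lif(spikes))  # → [3]
```

Other neuron models (e.g., Izhikevich or Hodgkin-Huxley) replace the one-line state update with richer dynamics, which is exactly the diversity that makes a flexible simulation substrate necessary.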

Conventional servers have achieved high performance by employing fast CPUs to run compute-intensive workloads, while making operating systems manage relatively slow I/O devices through memory accesses and interrupts. However, as emerging workloads are becoming heavily data-intensive and fast I/O devices (e.g., NVM storage, high-bandwidth NICs, GPUs) come to enable low-latency device operations, traditional host-centric server architectures fail to deliver high performance due to their inefficient device handling mechanisms. Furthermore,...

10.1145/2830772.2830794 article EN 2015-12-05

Graph convolutional networks (GCNs) are becoming increasingly popular as they overcome the limited applicability of prior neural networks. One recent trend in GCNs is the use of deep network architectures. As opposed to traditional GCNs, which only span around two to five layers deep, modern GCNs now incorporate tens to hundreds of layers with the help of residual connections. From such deep GCNs, we find an important characteristic: they exhibit very high intermediate feature sparsity. This reveals a new opportunity for accelerators...

10.1109/hpca56546.2023.10071102 article EN 2023-02-01

Programmer-managed GPU memory is a major challenge in writing GPGPU applications. Programmers must rewrite and optimize an existing code for a different GPU memory size to achieve both portability and performance. Alternatively, they can achieve only portability by disabling GPU memory at the cost of significant performance degradation. In this paper, we propose ScaleGPU, a novel GPU architecture to enable high-performance memory-unaware GPU programming. ScaleGPU uses GPU memory as a cache of CPU memory to provide programmers a view of CPU memory-sized programming space. ScaleGPU also achieves high...

10.1109/l-ca.2013.19 article EN IEEE Computer Architecture Letters 2013-07-16

To understand how the human brain works, neuroscientists heavily rely on brain simulations which incorporate the concept of time into their operating model. In the simulations, neurons transmit signals through synapses whose weights change over time and by the activity of the associated neurons. Such changes in synaptic weights, known as learning, are thought to contribute to memory, and various learning rules exist to model different behaviors of the brain. Due to the diverse learning rules, simulations are performed using highly programmable general-purpose processors...

10.1145/3352460.3358268 article EN 2019-10-11

Application caching is a key feature to enable fast application switches for mobile devices by keeping the entire memory pages of applications in the device's physical memory. However, it requires a prohibitive amount of memory unless swap is employed to maintain only the working sets of applications. Unfortunately, mobile devices often disable the invaluable swap, as it can severely decrease the already marginal lifespan of flash-based local storage due to increased writes to the device. As a result, modern mobile devices suffering from insufficient memory space end up killing memory-hungry applications and keeping only a few...

10.1109/ccgrid.2016.22 article EN 2016-05-01

With the advance in genome sequencing technology, the lengths of deoxyribonucleic acid (DNA) sequencing results are rapidly increasing at lower prices than ever. However, longer sequences come at the cost of a heavy computational burden on aligning them. For example, aligning long sequences to a human reference genome can take tens or even hundreds of hours. The current de facto standard approach for alignment is based on the guided dynamic programming method. Although this approach takes a long time and could potentially benefit from high-throughput graphic processing...

10.1145/3627535.3638474 article EN other-oa 2024-02-20
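The guided dynamic programming idea can be sketched as banded global alignment: only DP cells within a band around the main diagonal are computed, which bounds the work for long sequences. Scoring values and band width below are illustrative:

```python
# Sketch of banded global alignment (Needleman-Wunsch restricted to a
# diagonal band), illustrating the guided dynamic programming idea.
# Scoring values and band width are illustrative.
def banded_align(a, b, band=2, match=2, mismatch=-1, gap=-2):
    NEG = float("-inf")
    n, m = len(a), len(b)
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for j in range(1, min(m, band) + 1):   # first row, within the band
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        # Only cells within `band` of the main diagonal are computed.
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                dp[i][0] = i * gap
                continue
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # match/mismatch
                           dp[i - 1][j] + gap,       # deletion
                           dp[i][j - 1] + gap)       # insertion
    return dp[n][m]

print(banded_align("GATTACA", "GATTTACA"))  # → 12 (7 matches, 1 gap)
```

The band cuts the DP from O(nm) cells to O(n * band), which is why a guide that keeps the true alignment inside the band is so valuable for long reads.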

Analytical models can help computer architects perform early-stage design space exploration orders of magnitude faster than using cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models which capture first-order stall events causing performance degradation; however, the existing models cannot accurately model modern GPUs due to their outdated and highly abstract core microarchitecture assumptions. Therefore, to accurately evaluate modern GPUs,...

10.1145/3470496.3527384 article EN 2022-05-31
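A toy first-order model in the spirit the abstract describes: estimated cycles are compute cycles plus the memory stall cycles not hidden by multithreading. All inputs and the overlap factor are illustrative, not the paper's actual model:

```python
# Toy first-order GPU performance model: estimated cycles = compute cycles
# plus exposed (non-overlapped) memory stalls. All numbers are illustrative
# and not from the paper's actual model.
def estimate_cycles(compute_cycles, mem_requests, mem_latency, overlap=0.75):
    """overlap: fraction of memory latency hidden by other warps."""
    mem_cycles = mem_requests * mem_latency
    exposed = mem_cycles * (1.0 - overlap)   # first-order stall estimate
    return compute_cycles + exposed

cycles = estimate_cycles(compute_cycles=10_000,
                         mem_requests=500, mem_latency=400, overlap=0.75)
print(cycles)  # 10_000 + 200_000 * 0.25 = 60_000.0
```

The paper's point is that such abstract models miss modern core microarchitecture details (e.g., how stalls actually interleave), so the fixed `overlap` factor here is exactly the kind of assumption that breaks down.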

Neural Processing Units (NPUs) frequently suffer from low hardware utilization, as the efficiency of their systolic arrays heavily depends on the characteristics of a deep neural network (DNN). Spatial multitasking is a promising solution to overcome the low NPU utilization; however, the state-of-the-art spatial-multitasking NPU architecture achieves sub-optimal performance due to its coarse-grained systolic-array distribution and incurs significant implementation costs. In this paper, we propose...

10.1109/tc.2023.3299030 article EN IEEE Transactions on Computers 2023-08-01