- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Embedded Systems Design Techniques
- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Neural Networks and Applications
- CCD and CMOS Imaging Sensors
- Stochastic Gradient Optimization Techniques
- Caching and Content Delivery
- Low-Power High-Performance VLSI Design
- Advanced Image and Video Retrieval Techniques
- Face and Expression Recognition
- Machine Learning and ELM
- Ferroelectric and Negative Capacitance Devices
- Distributed Systems and Fault Tolerance
- Domain Adaptation and Few-Shot Learning
- Semiconductor Materials and Devices
- Face Recognition and Analysis
- Enhanced Oil Recovery Techniques
- Robotic Mechanisms and Dynamics
- Graph Theory and Algorithms
- Electrical and Thermal Properties of Materials
Yonsei University
2022-2025
Korea Advanced Institute of Science and Technology
2024
Hansei University
2024
Hanyang University
2017-2022
Anyang University
2018
Hongik University
2015-2017
University of Michigan–Ann Arbor
2009-2013
Pohang University of Science and Technology
2009
The demand for multitasking on graphics processing units (GPUs) is constantly increasing, as they have become one of the default components of modern computer systems along with traditional processors (CPUs). Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead due to the large context size in GPUs. The overhead comes in two dimensions: a preempting kernel suffers from long preemption latency, and system throughput is wasted during the switch. Without precise...
Mobile computing in the form of smart phones, netbooks, and personal digital assistants has become an integral part of our everyday lives. Moving ahead to next-generation mobile devices, we believe that multimedia will become a more critical product-differentiating feature. High-definition audio and video, as well as 3D graphics, provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are complex...
Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data-transfer management. Unfortunately, this distribution can be a poor solution: it underutilizes the CPU, has difficulty generalizing beyond a single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance competitive with GPUs for many...
As graphics processing units (GPUs) are broadly adopted, running multiple applications on a GPU at the same time is beginning to attract wide attention. Recent proposals for multitasking GPUs have focused on either spatial multitasking, which partitions GPU resources at the streaming multiprocessor (SM) granularity, or simultaneous multikernel (SMK), which runs multiple kernels within an SM. However, performance varies heavily depending on the resource partitioning within each scheme and on the application mixes. In this paper, we propose Maestro, which performs dynamic...
Atomic layer deposition (ALD) of Co was developed using bis(N,N′-diisopropylacetamidinato)cobalt(II) as a precursor, producing pure Co thin films with excellent conformality and nanoscale thickness controllability. The films were also deposited using an alternative gas reactant; compared with that process, the thermal ALD process showed higher film quality, lower resistivity, and higher density. The thermal process was applied to area-selective deposition using an octadecyltrichlorosilane self-assembled monolayer as a blocking layer, which produced wide line patterns without...
Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing programmability with the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs have been used effectively for innermost loops that contain an abundance of instruction-level parallelism. Conversely, non-loop and outer-loop code regions are latency constrained and do not offer significant amounts of instruction-level parallelism. In these situations, CGRAs are ineffective, as the majority of their resources remain idle. In this paper,...
Near-threshold operation has emerged as a competitive approach for energy-efficient architecture design. In particular, the combination of near-threshold circuit techniques and parallel SIMD computations achieves excellent energy efficiency for easy-to-parallelize applications. However, near-threshold operations suffer from delay variations due to increased process variability. This problem is exacerbated in wide SIMD architectures, where the number of critical paths is multiplied by the SIMD width. This paper provides a systematic, in-depth...
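The effect of multiplying critical paths by SIMD width can be illustrated with a simple probability sketch. This is an illustrative model, not the paper's analysis: it assumes each lane's critical path independently meets timing with the same probability, so an N-wide operation succeeds only if all N lanes do.

```python
def simd_timing_yield(p_lane: float, width: int) -> float:
    """Probability that an N-wide SIMD operation meets timing,
    assuming each lane's critical path meets timing independently
    with probability p_lane (toy independence model)."""
    return p_lane ** width

# A lane yield of 99.9% looks safe in isolation...
print(simd_timing_yield(0.999, 1))
# ...but a 64-wide datapath exposes 64 critical paths at once,
# dropping the per-operation yield to roughly 94%.
print(simd_timing_yield(0.999, 64))
```

Under this model, widening the SIMD datapath shifts the effective timing distribution toward the slowest lane, which is why near-threshold delay variation hurts wide architectures disproportionately.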
When low-salinity water containing sulfate ions is injected into carbonate reservoirs, rock dissolution and in situ precipitation occur, altering permeability and wettability. In particular, when barium ions are present in the formation water, they react chemically with $${\text{SO}}_{4}^{2 - }$$, and BaSO4 is precipitated. These reactions can have a serious impact on the efficiency of enhanced oil recovery (EOR). Therefore, the main purpose of this study was to identify the EOR effects induced by low-salinity waterflooding (LSWF) with Ba2+...
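Whether BaSO4 precipitation occurs can be checked with the standard solubility-product criterion. The sketch below uses a commonly cited Ksp for BaSO4 at 25 °C as an illustrative value; actual reservoir conditions (temperature, ionic strength) shift it.

```python
KSP_BASO4 = 1.08e-10  # approximate solubility product of BaSO4 at 25 deg C

def will_precipitate(ba_conc: float, so4_conc: float) -> bool:
    """BaSO4 precipitates when the ion product Q = [Ba2+][SO4 2-]
    exceeds the solubility product Ksp (concentrations in mol/L)."""
    return ba_conc * so4_conc > KSP_BASO4

# Millimolar Ba2+ in formation water plus millimolar sulfate in the
# injected low-salinity water gives Q = 1e-6, far above Ksp.
print(will_precipitate(1e-3, 1e-3))  # True
# Micromolar levels keep the ion product below Ksp.
print(will_precipitate(1e-6, 1e-5))  # False
```

Because Ksp for BaSO4 is so small, even dilute sulfate in the injected brine can drive in situ precipitation when Ba2+ is present, which is the scaling risk the study examines.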
Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. A memory access takes hundreds of cycles, which is difficult to hide by simply interleaving the execution of tens of warps. While the cache hierarchy helps reduce memory-system pressure, massive Thread-Level Parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions...
Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting large levels of data-level parallelism (DLP). However, they are often much less effective for media applications due to low trip count...
Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing an even richer user experience and compelling new capabilities: higher-definition multimedia, 3D graphics, augmented reality, games, and voice interfaces. To address these goals, the core capabilities of the devices must be scaled. However, energy budgets are increasing at a much lower rate, requiring fundamental improvements in computing efficiency. SIMD accelerators offer...
GaN nanowires and InGaN disk heterostructures are grown on an amorphous SiO2 layer by plasma-assisted molecular beam epitaxy. Structural studies using scanning electron microscopy and high-resolution transmission electron microscopy reveal that the nanowires grow vertically without any extended defects, similarly to growth on Si. The as-grown nanowires have an intermediate region consisting of Ga, O, and Si, rather than SiNx, at the interface with SiO2. The measured photoluminescence shows a variation of peak wavelengths ranging from 580 nm to 635 nm because...
Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and extract important information based on the matrix representation. As it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced for graphics processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize GPU resources fully, owing to load imbalance between threads in the expansion...
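The row-wise formulation underlying most GPU spGEMM implementations can be sketched briefly. This is a minimal Python rendition of Gustavson-style row-wise spGEMM with dict-of-dicts matrices, not any library's implementation; it also makes the load-imbalance issue visible, since the work per output row depends on the nonzero structure.

```python
from collections import defaultdict

def spgemm(a_rows, b_rows):
    """Row-wise sparse GEMM: output row i accumulates rows of B
    selected and scaled by the nonzeros of A's row i.
    Matrices are {row: {col: value}} dictionaries of nonzeros."""
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = defaultdict(float)
        for k, a_ik in a_row.items():          # nonzeros of A row i
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj          # expansion + merge
        c_rows[i] = dict(acc)
    return c_rows

a = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
b = {0: {0: 4.0}, 1: {0: 1.0, 2: 5.0}}
print(spgemm(a, b))
# {0: {0: 6.0, 2: 10.0}, 1: {0: 3.0, 2: 15.0}}
```

Note that row 0 performs three partial products while row 1 performs two; on real matrices this per-row work varies by orders of magnitude, which is exactly the thread-level load imbalance the abstract refers to.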
As GPUs have become essential components of embedded computing systems, a shared GPU with multiple CPU cores needs to efficiently support concurrent execution of different applications. Spatial multitasking, which assigns a disjoint set of streaming multiprocessors (SMs) to each application, is one of the most common solutions for this. However, it is not a panacea for maximizing total resource utilization. This is because an SM consists of many sub-resources such as caches, execution units, and scheduling units, and the requirements per kernel are...
Graphics processing units (GPUs) are increasingly utilized as throughput engines in modern computer systems. GPUs rely on fast context switching between thousands of threads to hide long-latency operations; however, they still stall due to memory operations. To minimize stalls, memory operations should be overlapped with other operations as much as possible to maximize memory-level parallelism (MLP). In this paper, we propose Earliest Load First (ELF) warp scheduling, which maximizes MLP by giving higher priority...
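The priority rule can be illustrated with a toy scheduler. This is a hedged sketch of the idea only, not the paper's hardware design: warps are lists of upcoming instruction kinds, and the scheduler prefers a warp whose next instruction is a load, so that memory requests issue as early as possible and overlap with other warps' computation.

```python
def pick_warp(warps):
    """Earliest-load-first selection (toy model): among non-empty
    warps, prefer one whose next instruction is a load; otherwise
    fall back to any ready warp."""
    ready = [w for w in warps if w]
    loads_first = [w for w in ready if w[0] == 'load']
    candidates = loads_first if loads_first else ready
    return candidates[0] if candidates else None

warps = [['alu', 'load'], ['load', 'alu'], ['alu', 'alu']]
print(pick_warp(warps))  # ['load', 'alu'] -- about to issue a load
```

Issuing that load now means its long memory latency overlaps with the ALU work of the other warps, which is the MLP-maximizing behavior the abstract describes.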
Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing richer user experiences and compelling new capabilities: higher-definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core capabilities of mobile terminals must be scaled within highly constrained energy budgets. Coarse-grained reconfigurable architectures (CGRAs) are an appealing hardware platform for such systems...
Despite ceaseless efforts, the extremely large and complex optimization space makes even state-of-the-art compilers fail to deliver the most performant settings that can fully utilize the underlying hardware. Although this inefficiency suggests an opportunity for tuning, it has been challenging for prior tuning methods to consider interactions between optimizations to maximize quality while handling local optima efficiently. To tackle this problem, we suggest an intelligent auto-tuning strategy, called SRTuner,...
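The problem setting can be sketched with a tiny tuner. Everything here is hypothetical: the flag names and the cost function are stand-ins for compiling and timing a real program, and the exhaustive search is the naive baseline that motivates smarter strategies like SRTuner's, since real flag spaces are far too large to enumerate. The cost function deliberately contains an interaction (vectorization only pays off when unrolling is on) of the kind prior tuners struggle to model.

```python
import itertools

# Hypothetical boolean flags; runtime_ms stands in for a real
# compile-and-measure step.
FLAGS = ['unroll', 'vectorize', 'inline']

def runtime_ms(cfg):
    cost = 100.0
    if cfg['unroll']:
        cost -= 10
    if cfg['vectorize'] and cfg['unroll']:
        cost -= 25  # interaction: vectorize helps only with unroll
    if cfg['inline']:
        cost -= 5
    return cost

def exhaustive_tune():
    """Baseline tuner: evaluate every flag combination, keep the best."""
    best_cfg, best_t = None, float('inf')
    for bits in itertools.product([False, True], repeat=len(FLAGS)):
        cfg = dict(zip(FLAGS, bits))
        t = runtime_ms(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

print(exhaustive_tune())
# ({'unroll': True, 'vectorize': True, 'inline': True}, 60.0)
```

With n binary flags the space has 2^n points, so exhaustive search stops scaling almost immediately; a practical tuner must instead learn which flag interactions matter from a limited number of measurements.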
Mobile devices are ubiquitous in our daily lives. From smartphones to tablets, customers are constantly demanding richer user experiences through more visual and interactive interfaces with prolonged battery life. To meet these demands, accelerators are commonly adopted in system-on-chip (SoC) designs for various applications. Coarse-grained reconfigurable architecture (CGRA) is a promising solution, which accelerates hot loops via software pipelining. Although CGRAs have shown that they can support multimedia...
This paper proposes a new architecture, called Adaptive PREfetching and Scheduling (APRES), which improves the cache efficiency of GPUs. APRES relies on the observation that GPU loads tend to have either high locality or strided access patterns across warps. APRES schedules warps so that as many cache hits as possible are generated before the generation of any cache miss. Without directly predicting future hits/misses for each warp, APRES creates a group of warps that will execute the same static load shortly and prioritizes the grouped warps. If the first-executed...
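The grouping step can be illustrated with a small sketch. This is a toy model of the idea, not the hardware mechanism: each warp is tagged with the program counter of the static load it will execute next, and warps headed for the same load PC are grouped so they can be scheduled back to back, clustering their cache accesses.

```python
from collections import defaultdict

def group_by_next_load(next_load_pc):
    """Group warp IDs by the PC of the static load each will execute
    next (toy model of APRES-style warp grouping), and return the
    largest group for prioritized scheduling."""
    groups = defaultdict(list)
    for warp_id, pc in next_load_pc.items():
        groups[pc].append(warp_id)
    return max(groups.values(), key=len)

# Warps 0, 2, and 3 are all approaching the load at PC 0x40.
warps = {0: 0x40, 1: 0x80, 2: 0x40, 3: 0x40}
print(group_by_next_load(warps))  # [0, 2, 3]
```

Scheduling such a group consecutively means that if the load has high inter-warp locality, the later warps hit on lines the first warp brought in, before unrelated warps can generate misses.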
Processing-in-Memory (PIM) is an attractive device that can effectively satisfy the rapidly increasing demands of memory-intensive workloads in emerging application domains, such as deep learning and big data processing. Thanks to the integrated design of main memory (MRAM) and multiple processing units (DPUs) on a single chip, PIM devices provide massive parallelism from the numerous DPUs and substantial bandwidth between MRAM and the DPUs, thus achieving high performance on such workloads. However, although recent...