- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Embedded Systems Design Techniques
- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Neural Networks and Applications
- CCD and CMOS Imaging Sensors
- Stochastic Gradient Optimization Techniques
- Caching and Content Delivery
- Low-Power High-Performance VLSI Design
- Advanced Image and Video Retrieval Techniques
- Face and Expression Recognition
- Machine Learning and ELM
- Ferroelectric and Negative Capacitance Devices
- Distributed Systems and Fault Tolerance
- Domain Adaptation and Few-Shot Learning
- Semiconductor Materials and Devices
- Face Recognition and Analysis
- Enhanced Oil Recovery Techniques
- Robotic Mechanisms and Dynamics
- Graph Theory and Algorithms
- Electrical and Thermal Properties of Materials
Yonsei University
2022-2025
Korea Advanced Institute of Science and Technology
2024
Hansei University
2024
Hanyang University
2017-2022
Anyang University
2018
Hongik University
2015-2017
University of Michigan–Ann Arbor
2009-2013
Pohang University of Science and Technology
2009
The demand for multitasking on graphics processing units (GPUs) is constantly increasing, as they have become one of the default components of modern computer systems along with traditional processors (CPUs). Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead due to the large context size in GPUs. The overhead comes in two dimensions: a preempting kernel suffers from long preemption latency, and system throughput is wasted during the switch. Without precise...
Mobile computing in the form of smart phones, netbooks, and personal digital assistants has become an integral part of our everyday lives. Moving ahead to next-generation mobile devices, we believe that multimedia will become a more critical product-differentiating feature. High-definition audio and video, as well as 3D graphics, provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are complex...
Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data-transfer management. Unfortunately, this distribution can be a poor solution: it underutilizes the CPU, has difficulty generalizing beyond a single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance competitive with GPUs for many...
As graphics processing units (GPUs) are broadly adopted, running multiple applications on a GPU at the same time is beginning to attract wide attention. Recent proposals for multitasking GPUs have focused on either spatial multitasking, which partitions GPU resources at the streaming multiprocessor (SM) granularity, or simultaneous multikernel (SMK), which runs multiple kernels within an SM. However, performance varies heavily depending on the resource partitioning within each scheme and on the application mixes. In this paper, we propose Maestro, which performs dynamic...
Atomic layer deposition (ALD) of Co was developed using bis(N,N′-diisopropylacetamidinato)cobalt(II) as a precursor, producing pure Co thin films with excellent conformality and nanoscale thickness controllability. The films were also deposited using an alternative gas reactant; compared with that process, the thermal ALD process showed higher film quality, lower resistivity, and higher density. The thermal process was applied to area-selective deposition using an octadecyltrichlorosilane self-assembled monolayer as a blocking layer, which produced wide line patterns without...
Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing programmability with the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs have been used effectively for innermost loops that contain an abundance of instruction-level parallelism. Conversely, non-loop and outer-loop code regions are latency constrained and do not offer significant amounts of instruction-level parallelism. In these situations, CGRAs are ineffective, as the majority of their resources remain idle. In this paper,...
Near-threshold operation has emerged as a competitive approach for energy-efficient architecture design. In particular, the combination of near-threshold circuit techniques and parallel SIMD computations achieves excellent energy efficiency for easy-to-parallelize applications. However, near-threshold operations suffer from delay variations due to increased process variability. This problem is exacerbated in wide SIMD architectures, where the number of critical paths is multiplied by the SIMD width. This paper provides a systematic, in-depth...
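The effect of multiplying critical paths by SIMD width can be illustrated with a simple probability sketch. This is an illustrative model, not the paper's analysis: it assumes each lane's critical path independently meets timing with the same probability, so an N-wide operation succeeds only if all N lanes do.

```python
def simd_timing_yield(p_lane: float, width: int) -> float:
    """Probability that an N-wide SIMD operation meets timing,
    assuming each lane's critical path meets timing independently
    with probability p_lane (toy independence model)."""
    return p_lane ** width

# A lane yield of 99.9% looks safe in isolation...
print(simd_timing_yield(0.999, 1))
# ...but a 64-wide datapath exposes 64 critical paths at once,
# dropping the per-operation yield to roughly 94%.
print(simd_timing_yield(0.999, 64))
```

Under this model, widening the SIMD datapath shifts the effective timing distribution toward the slowest lane, which is why near-threshold delay variation hurts wide architectures disproportionately.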
When low-salinity water containing sulfate ions is injected into carbonate reservoirs, rock dissolution and in situ precipitation occur, altering permeability and wettability. In particular, when barium ions are present in the formation water, they react chemically with $${\text{SO}}_{4}^{2 - }$$, and BaSO4 is precipitated. These reactions can have a serious impact on the efficiency of enhanced oil recovery (EOR). Therefore, the main purpose of this study was to identify the EOR effects induced by low-salinity waterflooding (LSWF) with Ba2+...
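Whether BaSO4 precipitation occurs can be checked with the standard solubility-product criterion. The sketch below uses a commonly cited Ksp for BaSO4 at 25 °C as an illustrative value; actual reservoir conditions (temperature, ionic strength) shift it.

```python
KSP_BASO4 = 1.08e-10  # approximate solubility product of BaSO4 at 25 deg C

def will_precipitate(ba_conc: float, so4_conc: float) -> bool:
    """BaSO4 precipitates when the ion product Q = [Ba2+][SO4 2-]
    exceeds the solubility product Ksp (concentrations in mol/L)."""
    return ba_conc * so4_conc > KSP_BASO4

# Millimolar Ba2+ in formation water plus millimolar sulfate in the
# injected low-salinity water gives Q = 1e-6, far above Ksp.
print(will_precipitate(1e-3, 1e-3))  # True
# Micromolar levels keep the ion product below Ksp.
print(will_precipitate(1e-6, 1e-5))  # False
```

Because Ksp for BaSO4 is so small, even dilute sulfate in the injected brine can drive in situ precipitation when Ba2+ is present, which is the scaling risk the study examines.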
Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. A memory access takes hundreds of cycles, which is difficult to hide by simply interleaving the execution of tens of warps. While the cache hierarchy helps reduce memory-system pressure, massive Thread-Level Parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions...
Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting large levels of data-level parallelism (DLP). However, they are often much less effective for media applications due to low trip count...
Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing an even richer user experience and compelling new capabilities: higher-definition multimedia, 3D graphics, augmented reality, games, and voice interfaces. To address these goals, the core capabilities of the devices must be scaled. However, energy budgets are increasing at a much lower rate, requiring fundamental improvements in computing efficiency. SIMD accelerators offer...
GaN nanowires and InGaN disk heterostructures are grown on an amorphous SiO2 layer by plasma-assisted molecular beam epitaxy. Structural studies using scanning electron microscopy and high-resolution transmission electron microscopy reveal that the nanowires grow vertically without any extended defects, similarly to growth on Si. The as-grown nanowires have an intermediate region consisting of Ga, O, and Si, rather than SiNx, at the interface with SiO2. The measured photoluminescence shows a variation of peak wavelengths ranging from 580 nm to 635 nm because...
Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and extract important information based on the matrix representation. As it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced for graphics processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize GPU resources fully, owing to load imbalance between threads in the expansion...
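The row-wise formulation underlying most GPU spGEMM implementations can be sketched briefly. This is a minimal Python rendition of Gustavson-style row-wise spGEMM with dict-of-dicts matrices, not any library's implementation; it also makes the load-imbalance issue visible, since the work per output row depends on the nonzero structure.

```python
from collections import defaultdict

def spgemm(a_rows, b_rows):
    """Row-wise sparse GEMM: output row i accumulates rows of B
    selected and scaled by the nonzeros of A's row i.
    Matrices are {row: {col: value}} dictionaries of nonzeros."""
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = defaultdict(float)
        for k, a_ik in a_row.items():          # nonzeros of A row i
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj          # expansion + merge
        c_rows[i] = dict(acc)
    return c_rows

a = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
b = {0: {0: 4.0}, 1: {0: 1.0, 2: 5.0}}
print(spgemm(a, b))
# {0: {0: 6.0, 2: 10.0}, 1: {0: 3.0, 2: 15.0}}
```

Note that row 0 performs three partial products while row 1 performs two; on real matrices this per-row work varies by orders of magnitude, which is exactly the thread-level load imbalance the abstract refers to.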
As GPUs have become essential components of embedded computing systems, a shared GPU with multiple CPU cores needs to efficiently support concurrent execution of different applications. Spatial multitasking, which assigns a disjoint set of streaming multiprocessors (SMs) to each application, is one of the most common solutions for this. However, it is not a panacea for maximizing total resource utilization. This is because an SM consists of many sub-resources such as caches, execution units, and scheduling units, and the requirements per kernel are...
Graphics processing units (GPUs) are increasingly utilized as throughput engines in modern computer systems. GPUs rely on fast context switching between thousands of threads to hide long-latency operations; however, they still stall due to memory operations. To minimize stalls, memory operations should be overlapped with other operations as much as possible to maximize memory-level parallelism (MLP). In this paper, we propose Earliest Load First (ELF) warp scheduling, which maximizes MLP by giving higher priority...
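The priority rule can be illustrated with a toy scheduler. This is a hedged sketch of the idea only, not the paper's hardware design: warps are lists of upcoming instruction kinds, and the scheduler prefers a warp whose next instruction is a load, so that memory requests issue as early as possible and overlap with other warps' computation.

```python
def pick_warp(warps):
    """Earliest-load-first selection (toy model): among non-empty
    warps, prefer one whose next instruction is a load; otherwise
    fall back to any ready warp."""
    ready = [w for w in warps if w]
    loads_first = [w for w in ready if w[0] == 'load']
    candidates = loads_first if loads_first else ready
    return candidates[0] if candidates else None

warps = [['alu', 'load'], ['load', 'alu'], ['alu', 'alu']]
print(pick_warp(warps))  # ['load', 'alu'] -- about to issue a load
```

Issuing that load now means its long memory latency overlaps with the ALU work of the other warps, which is the MLP-maximizing behavior the abstract describes.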
Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing richer user experiences and compelling new capabilities: higher-definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core capabilities of mobile terminals must be scaled within highly constrained energy budgets. Coarse-grained reconfigurable architectures (CGRAs) are an appealing hardware platform for such systems...
Despite ceaseless efforts, the extremely large and complex optimization space makes even state-of-the-art compilers fail to deliver the most performant settings that can fully utilize the underlying hardware. Although this inefficiency suggests an opportunity for tuning, it has been challenging for prior tuning methods to consider interactions between optimizations to maximize quality while handling local optima efficiently. To tackle this problem, we suggest an intelligent auto-tuning strategy, called SRTuner,...
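The problem setting can be sketched with a tiny tuner. Everything here is hypothetical: the flag names and the cost function are stand-ins for compiling and timing a real program, and the exhaustive search is the naive baseline that motivates smarter strategies like SRTuner's, since real flag spaces are far too large to enumerate. The cost function deliberately contains an interaction (vectorization only pays off when unrolling is on) of the kind prior tuners struggle to model.

```python
import itertools

# Hypothetical boolean flags; runtime_ms stands in for a real
# compile-and-measure step.
FLAGS = ['unroll', 'vectorize', 'inline']

def runtime_ms(cfg):
    cost = 100.0
    if cfg['unroll']:
        cost -= 10
    if cfg['vectorize'] and cfg['unroll']:
        cost -= 25  # interaction: vectorize helps only with unroll
    if cfg['inline']:
        cost -= 5
    return cost

def exhaustive_tune():
    """Baseline tuner: evaluate every flag combination, keep the best."""
    best_cfg, best_t = None, float('inf')
    for bits in itertools.product([False, True], repeat=len(FLAGS)):
        cfg = dict(zip(FLAGS, bits))
        t = runtime_ms(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

print(exhaustive_tune())
# ({'unroll': True, 'vectorize': True, 'inline': True}, 60.0)
```

With n binary flags the space has 2^n points, so exhaustive search stops scaling almost immediately; a practical tuner must instead learn which flag interactions matter from a limited number of measurements.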
Mobile devices are ubiquitous in our daily lives. From smartphones to tablets, customers are constantly demanding richer user experiences through more visual and interactive interfaces with prolonged battery life. To meet these demands, accelerators are commonly adopted in system-on-chip (SoC) designs for various applications. Coarse-grained reconfigurable architecture (CGRA) is a promising solution, which accelerates hot loops via software pipelining. Although CGRAs have shown that they can support multimedia...
This paper proposes a new architecture, called Adaptive PREfetching and Scheduling (APRES), which improves the cache efficiency of GPUs. APRES relies on the observation that GPU loads tend to have either high locality or strided access patterns across warps. APRES schedules warps so that as many cache hits as possible are generated before the generation of any cache miss. Without directly predicting future hits/misses for each warp, APRES creates a group of warps that will execute the same static load shortly and prioritizes the grouped warps. If the first-executed...
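The grouping step can be illustrated with a small sketch. This is a toy model of the idea, not the hardware mechanism: each warp is tagged with the program counter of the static load it will execute next, and warps headed for the same load PC are grouped so they can be scheduled back to back, clustering their cache accesses.

```python
from collections import defaultdict

def group_by_next_load(next_load_pc):
    """Group warp IDs by the PC of the static load each will execute
    next (toy model of APRES-style warp grouping), and return the
    largest group for prioritized scheduling."""
    groups = defaultdict(list)
    for warp_id, pc in next_load_pc.items():
        groups[pc].append(warp_id)
    return max(groups.values(), key=len)

# Warps 0, 2, and 3 are all approaching the load at PC 0x40.
warps = {0: 0x40, 1: 0x80, 2: 0x40, 3: 0x40}
print(group_by_next_load(warps))  # [0, 2, 3]
```

Scheduling such a group consecutively means that if the load has high inter-warp locality, the later warps hit on lines the first warp brought in, before unrelated warps can generate misses.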
Processing-in-Memory (PIM) is an attractive device that can effectively satisfy the rapidly increasing demands of memory-intensive workloads in emerging application domains, such as deep learning and big data processing. Thanks to the integrated design of main memory (MRAM) and multiple processing units (DPUs) on a single chip, PIM devices provide massive parallelism from the numerous DPUs and substantial bandwidth between MRAM and the DPUs, thus achieving high performance on such workloads. However, although recent...