- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Distributed Systems and Fault Tolerance
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Advanced Neural Network Applications
- Real-Time Systems Scheduling
- Logic, Programming, and Type Systems
- Advanced Memory and Neural Computing
- Chaos-based Image/Signal Encryption
- Advanced Data Compression Techniques
- Video Coding and Compression Technologies
- Advanced Steganography and Watermarking Techniques
- Educational Systems and Policies
- Ferroelectric and Negative Capacitance Devices
- Advanced Image and Video Retrieval Techniques
- Digital Media Forensic Detection
- Computer Graphics and Visualization Techniques
- Domain Adaptation and Few-Shot Learning
- Radiation Effects in Electronics
- Internet of Things and Social Network Interactions
- Security and Verification in Computing
- Innovation in Digital Healthcare Systems
Seoul National University
2015-2024
Pusan National University
2022
National University of Singapore
2021
Chungbuk National University
2004-2006
Seoul National University of Science and Technology
2006
Dongguk University
1999-2003
Michigan State University
2000-2003
University of Illinois Urbana-Champaign
1996-2002
Georgia Institute of Technology
1997
National Center for Supercomputing Applications
1996
Heterogeneous parallel computing platforms, which are composed of different types of processors (e.g., CPUs, GPUs, FPGAs, and DSPs), are widening their user base in all computing domains. With this trend, programming models need to achieve portability across the different processors as well as high performance with reasonable programming effort. OpenCL (Open Computing Language) is an open standard and emerging programming model for writing applications for such heterogeneous platforms. In this paper, we characterize the implementation of the NAS Parallel Benchmark suite (NPB) on a...
In this paper, we propose SnuCL, an OpenCL framework for heterogeneous CPU/GPU clusters. We show that the original OpenCL semantics naturally fits the cluster programming environment, and that SnuCL achieves both high performance and ease of programming. The target cluster architecture consists of a designated, single host node and many compute nodes. They are connected by an interconnection network, such as Gigabit Ethernet or InfiniBand switches. Each compute node is equipped with multicore CPUs and multiple GPUs. A set of CPU cores or each GPU becomes...
In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual device image to the user makes an application written for a single GPU portable to a platform that has multiple GPU devices. It also allows the application to exploit the full computing power of the devices and the total amount of memory available in the platform. Our framework automatically distributes an OpenCL kernel at run time into multiple CUDA kernels that execute on the multiple GPUs. It applies a memory access range analysis to the kernel by performing a sampling run and identifies the optimal workload...
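The workload distribution this abstract describes can be illustrated with a toy sketch: splitting a kernel's one-dimensional global index range across devices in proportion to assumed relative throughputs (the throughput ratios below are hypothetical, not from the paper).

```python
# Sketch: split a 1-D kernel index range across multiple GPUs in
# proportion to each device's (assumed) relative throughput, so a
# single "virtual device" can dispatch one kernel to many GPUs.
def partition_index_range(global_size, throughputs):
    total = sum(throughputs)
    bounds, start = [], 0
    for i, t in enumerate(throughputs):
        if i == len(throughputs) - 1:
            end = global_size          # last device absorbs rounding
        else:
            end = start + round(global_size * t / total)
        bounds.append((start, end))
        start = end
    return bounds

# Two fast GPUs and one slow one (hypothetical 2:2:1 ratio).
print(partition_index_range(1000, [2, 2, 1]))
# → [(0, 400), (400, 800), (800, 1000)]
```

Each device would then run the kernel only over its `[start, end)` slice of the global index space.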
Thanks to modern deep learning frameworks that exploit GPUs, convolutional neural networks (CNNs) have been greatly successful in visual recognition tasks. In this paper, we analyze the GPU performance characteristics of five popular deep learning frameworks: Caffe, CNTK, TensorFlow, Theano, and Torch, from the perspective of a representative CNN model, AlexNet. Based on the characteristics obtained, we suggest possible optimization methods to increase the efficiency of CNN models built by the frameworks. We also show how the different convolution algorithms of each...
This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: the correlation table is a software structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition,...
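As a rough illustration of correlation prefetching (not the paper's exact table organization), a software correlation table can record, for each miss address, the addresses that followed it on past misses, and replay them as prefetch candidates on the next miss to that address:

```python
from collections import defaultdict, deque

class CorrelationTable:
    """Toy software correlation (Markov) prefetcher: for each miss
    address, remember up to `num_succ` distinct successor misses and
    suggest them as prefetch candidates when that address misses again."""
    def __init__(self, num_succ=2):
        self.num_succ = num_succ
        self.table = defaultdict(deque)   # miss addr -> recent successors
        self.last_miss = None

    def on_miss(self, addr):
        # Learn: record addr as a successor of the previous miss.
        if self.last_miss is not None:
            succ = self.table[self.last_miss]
            if addr in succ:
                succ.remove(addr)
            succ.appendleft(addr)          # most recent first
            while len(succ) > self.num_succ:
                succ.pop()
        self.last_miss = addr
        # Predict: prefetch the recorded successors of this miss.
        return list(self.table[addr])

pf = CorrelationTable()
for a in [0x100, 0x200, 0x300, 0x100]:
    pf.on_miss(a)
print(pf.on_miss(0x200))   # 0x300 followed 0x200 before → [768]
```

In the ULMT setting, this table would live in main memory and the loop body would run on the memory-side processor, pushing predicted lines toward the main processor's L2.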
The rise of Java, C#, and other explicitly parallel languages has increased the importance of compiling for different software memory models. This paper describes co-operating escape, thread structure, and delay set analyses that enable high performance for sequentially consistent programs. We compare a set of Java programs compiled for sequential consistency (SC) with the same programs compiled for weak consistency. For SC, we observe a slowdown of 10% on average on an architecture based on the Intel Xeon processor, and 26% on the IBM Power3.
In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The target architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. Our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory. To boost performance, the runtime relies on three...
Deep Reinforcement Learning (Deep RL) is applied to many areas where an agent learns how to interact with the environment to achieve a certain goal, such as video game plays and robot controls. Deep RL exploits a DNN to eliminate the need for handcrafted feature engineering that requires prior domain knowledge. The Asynchronous Advantage Actor-Critic (A3C) method is one of the state-of-the-art Deep RL methods. In this paper, we present an FPGA-based A3C platform, called FA3C. Traditionally, FPGA-based DNN accelerators have mainly focused on inference...
In general, the hardware memory consistency model in a multiprocessor system is not identical to the memory model at the programming language level. Consequently, the language-level model must be mapped onto the hardware model. Memory fence instructions can be inserted by the compiler where needed to accomplish this mapping. We have developed and implemented several fence insertion and optimization algorithms in our Pensieve project. We present the different techniques that were used to guarantee sequential consistency at the language level, and compare them using performance data. Our targets are two relaxed models...
Checkpointing, i.e., recording the volatile state of a virtual machine (VM) running as a guest in a virtual machine monitor (VMM) for later restoration, includes storing the memory available to the VM. Typically, a full image of the VM's memory along with processor and device states is recorded. With memory sizes of up to several gigabytes, the size of checkpoint images becomes more and more of a concern.
Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework \sys that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions....
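A minimal CPU-only MinHash-LSH sketch illustrates the underlying deduplication algorithm; `zlib.crc32` stands in here as a generic non-cryptographic hash, and is not the hash family the paper optimizes.

```python
import zlib

def minhash_signature(tokens, num_hashes=16):
    """MinHash signature: for each seeded non-cryptographic hash
    function, keep the minimum hash value over the document's tokens."""
    return [min(zlib.crc32(t.encode(), seed) for t in tokens)
            for seed in range(num_hashes)]

def lsh_buckets(docs, num_hashes=16, bands=4):
    """Band the signatures; documents sharing any band bucket become
    duplicate candidates (verified by a full comparison if desired).
    Near-duplicates collide with probability rising with Jaccard
    similarity; identical token sets collide in every band."""
    rows = num_hashes // bands
    buckets = {}
    for name, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(name)
    return [v for v in buckets.values() if len(v) > 1]

docs = {
    "a": {"the", "quick", "brown", "fox"},
    "b": {"the", "quick", "brown", "fox"},   # exact duplicate of a
    "c": {"completely", "different", "text", "here"},
}
print(lsh_buckets(docs))
# → [['a', 'b'], ['a', 'b'], ['a', 'b'], ['a', 'b']]
```

The paper's contribution lies in mapping this pipeline efficiently onto GPU clusters and reusing partial hash computations; the sketch above only fixes ideas.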
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to interference between threads and parallel execution overhead. To maximize performance in an SMT multiprocessor, finding the optimal number of threads is important. This paper presents adaptive execution techniques that find the optimal execution mode for SMT multiprocessor architectures. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal execution mode for each...
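The dynamic-feedback idea can be sketched as a tiny exploration loop: time the parallel loop once at each candidate thread count, then commit to the fastest. The candidate counts and timings below are hypothetical.

```python
class AdaptiveThreadCount:
    """Toy adaptive execution: time the loop once at each candidate
    thread count, then stick with the fastest observed mode."""
    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.timings = {}
        self.best = None

    def choose(self):
        for n in self.candidates:
            if n not in self.timings:
                return n               # still exploring this candidate
        if self.best is None:          # exploration done: commit
            self.best = min(self.timings, key=self.timings.get)
        return self.best

    def report(self, n, elapsed):
        self.timings[n] = elapsed

# Hypothetical per-invocation runtimes: 4 threads interfere on SMT.
simulated = {1: 1.0, 2: 0.6, 4: 0.8}
sched = AdaptiveThreadCount([1, 2, 4])
for _ in range(5):
    n = sched.choose()
    sched.report(n, simulated[n])
print(sched.choose())   # → 2
```

In the paper's setting the preprocessor emits this logic around each parallel loop, and `report` would use real elapsed times rather than a table.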
In this paper, we propose a novel, application-specific demand paging mechanism for low-end embedded systems that use flash memory as secondary storage. These systems are not equipped with virtual memory. A small memory space called an execution buffer is allocated to page the application. An application-specific page manager manages the buffer. The page manager is generated by a compiler post-pass and combined with the application image. Our compiler post-pass analyzes the ELF executable image of the application and transforms function call/return instructions into calls to the page manager. As a result, each code...
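A toy model of such a page manager (simplified: one page per function, LRU eviction, which is an assumption rather than the paper's exact policy) shows the intercept-on-call mechanism:

```python
from collections import OrderedDict

class PageManager:
    """Toy application-specific page manager: keeps at most `slots`
    function pages in a small execution buffer, loading pages from
    (simulated) flash on demand and evicting the least recently used."""
    def __init__(self, slots):
        self.slots = slots
        self.buffer = OrderedDict()   # function name -> loaded page
        self.flash_loads = 0

    def call(self, func):
        # Every transformed call/return instruction lands here first.
        if func in self.buffer:
            self.buffer.move_to_end(func)        # hit: refresh LRU order
        else:
            self.flash_loads += 1                # miss: load from flash
            if len(self.buffer) >= self.slots:
                self.buffer.popitem(last=False)  # evict LRU page
            self.buffer[func] = f"code:{func}"
        return self.buffer[func]

pm = PageManager(slots=2)
for f in ["main", "init", "main", "draw", "main"]:
    pm.call(f)
print(pm.flash_loads)   # → 3 (one miss each for main, init, draw)
```

On the real system the "load" is a flash read into the execution buffer and the call is then redirected to the buffered copy.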
In this paper, we propose a fully automatic dynamic scratch-pad memory (SPM) management technique for instructions. Our technique loads required code segments into the SPM on demand at runtime. Our approach is based on postpass analysis and optimization techniques, and it handles the whole program, including libraries. The code mapping is determined by solving a mixed integer linear programming formulation that approximates our demand paging technique. We increase the effectiveness of demand paging by extracting from functions natural loops that are smaller...
In this paper, we present a dynamic scratchpad memory allocation strategy targeting a horizontally partitioned memory subsystem for contemporary embedded processors. The subsystem is equipped with a memory management unit (MMU), and the physically addressed scratchpad memory (SPM) is mapped into the virtual address space. A small minicache is added to further reduce energy consumption and improve performance. Using the MMU's page fault exception mechanism, we track page accesses and copy frequently executed code sections into the SPM before they are executed. Because...
Deep Convolutional Neural Networks (CNNs) are empirically known to be invariant to moderate translation but not to rotation in image classification. This paper proposes a deep CNN model, called CyCNN, which exploits polar mapping of input images to convert rotation to translation. To deal with the cylindrical property of the polar coordinates, we replace the convolution layers in conventional CNNs with cylindrical convolutional (CyConv) layers. A CyConv layer exploits a cylindrically sliding window (CSW) mechanism that vertically extends the input-image...
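The rotation-to-translation property can be demonstrated with a minimal nearest-neighbor polar resampler (this is an illustrative sketch, not CyCNN's exact sampling scheme):

```python
import math

def polar_map(image, num_r, num_theta):
    """Resample a square image onto an (r, theta) grid with
    nearest-neighbor sampling, centered at the image center.
    A rotation of the input becomes a cyclic shift along theta."""
    n = len(image)
    cx = cy = (n - 1) / 2
    out = []
    for ri in range(num_r):
        r = (ri + 0.5) * (n / 2) / num_r
        row = []
        for ti in range(num_theta):
            theta = 2 * math.pi * ti / num_theta
            x = round(cx + r * math.cos(theta))
            y = round(cy + r * math.sin(theta))
            row.append(image[y][x] if 0 <= x < n and 0 <= y < n else 0)
        out.append(row)
    return out

img = [[0] * 9 for _ in range(9)]
img[2][6] = 1                                   # an off-center feature
rot = [[img[x][8 - y] for x in range(9)] for y in range(9)]  # 90° rotation
p0 = polar_map(img, 3, 8)
p1 = polar_map(rot, 3, 8)
# Rotating the image by 90 degrees cyclically shifts each polar row
# by a quarter of the theta samples:
print(all(p1[r][t] == p0[r][(t + 2) % 8] for r in range(3) for t in range(8)))
# → True
```

Because the shifted pattern wraps around at theta = 2π, convolutions over this representation must also wrap vertically, which is exactly what the CyConv layers' cylindrically sliding windows provide.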
In this work, we present a dynamic memory allocation technique for a novel, horizontally partitioned memory subsystem targeting contemporary embedded processors with a memory management unit (MMU). We propose to replace the on-chip instruction cache with a scratchpad memory (SPM) and a small minicache. Serializing the address translation with the actual memory access enables the system to access either only the SPM or only the minicache. Independent of the SPM size, and based solely on profiling information, a postpass optimizer classifies the code of an application binary into pageable and cacheable...
There have been strong demands for fast and cycle-accurate virtual platforms in the embedded systems area, where developers can do meaningful software development, including performance debugging, in the context of the entire platform. In this paper, we describe the design and implementation of an architecture simulator called FaCSim as a first step towards such a virtual platform. FaCSim accurately models the ARM9E-S processor core and the ARM926EJ-S processor's memory subsystem. It simulates exceptions and interrupts to enable whole-system simulation...
This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have the advantage that resources such as processor cycles and the L1 cache are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions based on a new...
Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and varying latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to place their code and data. Programmers of such heterogeneous multicore architectures must explicitly manage data transfers between an accelerator core's local memory and the globally shared main memory. This is a tedious and error-prone programming task. A...
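One way such transfers can be automated is with a software-managed cache on the accelerator side, as in the OpenCL runtime abstract above. A toy direct-mapped version (the geometry and replacement details are illustrative assumptions) looks like this:

```python
class SoftwareCache:
    """Toy direct-mapped software cache: an accelerator core that has
    only local memory checks a tag array on each access and copies a
    whole line from (simulated) shared main memory on a miss."""
    def __init__(self, main_memory, num_lines=4, line_size=8):
        self.mem = main_memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines             # tag array in local memory
        self.lines = [[0] * line_size for _ in range(num_lines)]
        self.misses = 0

    def load(self, addr):
        tag, offset = divmod(addr, self.line_size)
        idx = tag % self.num_lines
        if self.tags[idx] != tag:                  # miss: DMA-like line copy
            self.misses += 1
            base = tag * self.line_size
            self.lines[idx] = self.mem[base:base + self.line_size]
            self.tags[idx] = tag
        return self.lines[idx][offset]

mem = list(range(100))
sc = SoftwareCache(mem)
print([sc.load(a) for a in [0, 1, 7, 8, 0]], sc.misses)
# → [0, 1, 7, 8, 0] 2
```

The compiler or runtime rewrites each shared-memory access into a `load`-style lookup, so the programmer no longer writes explicit transfers; the cost is the software tag check on every access.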
To exploit the abundant computational power of the world's fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, accurate performance estimation models for modern GPUs do not exist. This paper presents two performance models for GPUs: a sampling-based linear model, and a model based on machine-learning (ML) techniques that improves the accuracy of the linear model and is applicable to modern GPUs with and without caches. We first construct the linear model to predict the runtime of an arbitrary...
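The sampling-based linear model reduces, in its simplest form, to fitting runtime ≈ a·size + b from a few small sampled runs and extrapolating; the sample points below are hypothetical.

```python
def fit_linear(samples):
    """Least-squares fit of runtime = a * size + b from sampled
    (size, runtime) pairs, as in a sampling-based linear model."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical runtimes measured for small problem sizes.
a, b = fit_linear([(100, 1.2), (200, 2.2), (400, 4.2)])
print(round(a * 1000 + b, 2))   # predicted runtime for size 1000 → 10.2
```

A balanced workload distribution then assigns each device a share inversely proportional to its predicted runtime; the paper's ML model replaces this single linear feature with richer ones to stay accurate on cached GPUs.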
As parallel programming becomes mainstream due to multicore processors, dynamic memory allocators used in C and C++ can suppress the performance of multi-threaded applications if they are not scalable. In this paper, we present a new dynamic memory allocator for multi-threaded applications. The allocator never uses any synchronization in common cases. It uses only lock-free mechanisms in uncommon cases. Each thread owns a private heap and handles memory requests on that heap. Our allocator is completely synchronization-free when a thread allocates a memory block and deallocates it by itself....
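The ownership rule can be modeled in a few lines: allocation and same-thread frees touch only the private heap, while a block freed by a different thread is deferred to a remote queue that the owner drains (lock-free in the real allocator; a plain deque in this sketch, whose class and method names are illustrative).

```python
from collections import deque

class PrivateHeap:
    """Toy model of a scalable allocator's per-thread heap: the owner
    allocates and frees without synchronization; blocks freed by other
    threads are deferred to a remote-free queue drained by the owner."""
    def __init__(self):
        self.free_list = []          # touched only by the owner thread
        self.remote_frees = deque()  # appended by other threads
        self.next_block = 0

    def malloc(self):
        self._drain_remote()
        if self.free_list:
            return self.free_list.pop()       # fast, sync-free path
        self.next_block += 1
        return ("blk", self.next_block)       # grow the heap

    def free(self, block, owner=True):
        if owner:
            self.free_list.append(block)      # sync-free local free
        else:
            self.remote_frees.append(block)   # deferred remote free

    def _drain_remote(self):
        while self.remote_frees:
            self.free_list.append(self.remote_frees.popleft())

heap = PrivateHeap()
b1 = heap.malloc()
heap.free(b1)              # local free: block is reused immediately
b2 = heap.malloc()
print(b1 == b2)            # → True
```

The fast path performs no atomic operations at all; only the remote-free queue (the uncommon case) would need lock-free operations in a real implementation.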
Deep neural networks (DNNs) are continuing to get wider and deeper. As a result, training them requires a tremendous amount of GPU memory and computing power. In this paper, we propose a framework called DeepUM that exploits CUDA Unified Memory (UM) to allow GPU memory oversubscription for DNNs. While UM allows memory oversubscription using a page fault mechanism, page migration introduces enormous overhead. DeepUM uses a new correlation prefetching technique to hide the page migration overhead. It is fully automatic and transparent to users. We also propose two optimization techniques to minimize the page fault handling...