- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Distributed Systems and Fault Tolerance
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Advanced Neural Network Applications
- Real-Time Systems Scheduling
- Logic, Programming, and Type Systems
- Advanced Memory and Neural Computing
- Chaos-based Image/Signal Encryption
- Advanced Data Compression Techniques
- Video Coding and Compression Technologies
- Advanced Steganography and Watermarking Techniques
- Educational Systems and Policies
- Ferroelectric and Negative Capacitance Devices
- Advanced Image and Video Retrieval Techniques
- Digital Media Forensic Detection
- Computer Graphics and Visualization Techniques
- Domain Adaptation and Few-Shot Learning
- Radiation Effects in Electronics
- Internet of Things and Social Network Interactions
- Security and Verification in Computing
- Innovation in Digital Healthcare Systems
Seoul National University
2015-2024
Pusan National University
2022
National University of Singapore
2021
Chungbuk National University
2004-2006
Seoul National University of Science and Technology
2006
Dongguk University
1999-2003
Michigan State University
2000-2003
University of Illinois Urbana-Champaign
1996-2002
Georgia Institute of Technology
1997
National Center for Supercomputing Applications
1996
Heterogeneous parallel computing platforms, which are composed of different types of processors (e.g., CPUs, GPUs, FPGAs, and DSPs), are widening their user base in all computing domains. With this trend, programming models need to achieve portability across the different processors as well as high performance with reasonable programming effort. OpenCL (Open Computing Language) is an open standard and emerging programming model for writing applications for such heterogeneous platforms. In this paper, we characterize the implementation of the NAS Parallel Benchmark suite (NPB) on a...
In this paper, we propose SnuCL, an OpenCL framework for heterogeneous CPU/GPU clusters. We show that the original OpenCL semantics naturally fits the cluster programming environment, and that SnuCL achieves both high performance and ease of programming. The target cluster architecture consists of a designated, single host node and many compute nodes. They are connected by an interconnection network, such as Gigabit Ethernet or InfiniBand switches. Each compute node is equipped with multicore CPUs and multiple GPUs. A set of CPU cores or each GPU becomes...
In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual device image to the user makes an application written for a single GPU portable to a platform that has multiple GPU devices. It also allows the application to exploit the full computing power of the devices and the total amount of memory available in the platform. Our framework automatically distributes an OpenCL kernel at run time into multiple CUDA kernels that execute on the multiple GPUs. It applies a memory access range analysis to the kernel by performing a sampling run and identifies the optimal workload...
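The workload distribution this abstract describes can be illustrated with a toy sketch: splitting a kernel's one-dimensional global index range across devices in proportion to assumed relative throughputs (the throughput ratios below are hypothetical, not from the paper).

```python
# Sketch: split a 1-D kernel index range across multiple GPUs in
# proportion to each device's (assumed) relative throughput, so a
# single "virtual device" can dispatch one kernel to many GPUs.
def partition_index_range(global_size, throughputs):
    total = sum(throughputs)
    bounds, start = [], 0
    for i, t in enumerate(throughputs):
        if i == len(throughputs) - 1:
            end = global_size          # last device absorbs rounding
        else:
            end = start + round(global_size * t / total)
        bounds.append((start, end))
        start = end
    return bounds

# Two fast GPUs and one slow one (hypothetical 2:2:1 ratio).
print(partition_index_range(1000, [2, 2, 1]))
# → [(0, 400), (400, 800), (800, 1000)]
```

Each device would then run the kernel only over its `[start, end)` slice of the global index space.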
Thanks to modern deep learning frameworks that exploit GPUs, convolutional neural networks (CNNs) have been greatly successful in visual recognition tasks. In this paper, we analyze the GPU performance characteristics of five popular deep learning frameworks: Caffe, CNTK, TensorFlow, Theano, and Torch, from the perspective of a representative CNN model, AlexNet. Based on the characteristics obtained, we suggest possible optimization methods to increase the efficiency of CNN models built by the frameworks. We also show how the different convolution algorithms of each...
This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: the correlation table is a software structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition,...
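As a rough illustration of correlation prefetching (not the paper's exact table organization), a software correlation table can record, for each miss address, the addresses that followed it on past misses, and replay them as prefetch candidates on the next miss to that address:

```python
from collections import defaultdict, deque

class CorrelationTable:
    """Toy software correlation (Markov) prefetcher: for each miss
    address, remember up to `num_succ` distinct successor misses and
    suggest them as prefetch candidates when that address misses again."""
    def __init__(self, num_succ=2):
        self.num_succ = num_succ
        self.table = defaultdict(deque)   # miss addr -> recent successors
        self.last_miss = None

    def on_miss(self, addr):
        # Learn: record addr as a successor of the previous miss.
        if self.last_miss is not None:
            succ = self.table[self.last_miss]
            if addr in succ:
                succ.remove(addr)
            succ.appendleft(addr)          # most recent first
            while len(succ) > self.num_succ:
                succ.pop()
        self.last_miss = addr
        # Predict: prefetch the recorded successors of this miss.
        return list(self.table[addr])

pf = CorrelationTable()
for a in [0x100, 0x200, 0x300, 0x100]:
    pf.on_miss(a)
print(pf.on_miss(0x200))   # 0x300 followed 0x200 before → [768]
```

In the ULMT setting, this table would live in main memory and the loop body would run on the memory-side processor, pushing predicted lines toward the main processor's L2.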
The rise of Java, C#, and other explicitly parallel languages has increased the importance of compiling for different software memory models. This paper describes co-operating escape, thread structure, and delay set analyses that enable high performance for sequentially consistent programs. We compare a set of Java programs compiled for sequential consistency (SC) with the same programs compiled for weak consistency. For SC, we observe a slowdown of 10% on average on an architecture based on the Intel Xeon processor, and 26% on the IBM Power3.
In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The target architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. Our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory. To boost performance, the runtime relies on three...
Deep Reinforcement Learning (Deep RL) is applied to many areas where an agent learns how to interact with the environment to achieve a certain goal, such as video game plays and robot controls. Deep RL exploits a DNN to eliminate the need for handcrafted feature engineering that requires prior domain knowledge. The Asynchronous Advantage Actor-Critic (A3C) method is one of the state-of-the-art Deep RL methods. In this paper, we present an FPGA-based A3C platform, called FA3C. Traditionally, FPGA-based DNN accelerators have mainly focused on inference...
In general, the hardware memory consistency model in a multiprocessor system is not identical to the memory model at the programming language level. Consequently, the language-level model must be mapped onto the hardware model. Memory fence instructions can be inserted by the compiler where needed to accomplish this mapping. We have developed and implemented several fence insertion and optimization algorithms in our Pensieve project. We present the different techniques that were used to guarantee sequential consistency at the language level, and compare them using performance data. Our targets are two relaxed models...
Checkpointing, i.e., recording the volatile state of a virtual machine (VM) running as a guest in a virtual machine monitor (VMM) for later restoration, includes storing the memory available to the VM. Typically, a full image of the VM's memory along with processor and device states is recorded. With memory sizes of up to several gigabytes, the size of checkpoint images becomes more and more of a concern.
Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework \sys that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions....
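A minimal CPU-only MinHash-LSH sketch illustrates the underlying deduplication algorithm; `zlib.crc32` stands in here as a generic non-cryptographic hash, and is not the hash family the paper optimizes.

```python
import zlib

def minhash_signature(tokens, num_hashes=16):
    """MinHash signature: for each seeded non-cryptographic hash
    function, keep the minimum hash value over the document's tokens."""
    return [min(zlib.crc32(t.encode(), seed) for t in tokens)
            for seed in range(num_hashes)]

def lsh_buckets(docs, num_hashes=16, bands=4):
    """Band the signatures; documents sharing any band bucket become
    duplicate candidates (verified by a full comparison if desired).
    Near-duplicates collide with probability rising with Jaccard
    similarity; identical token sets collide in every band."""
    rows = num_hashes // bands
    buckets = {}
    for name, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(name)
    return [v for v in buckets.values() if len(v) > 1]

docs = {
    "a": {"the", "quick", "brown", "fox"},
    "b": {"the", "quick", "brown", "fox"},   # exact duplicate of a
    "c": {"completely", "different", "text", "here"},
}
print(lsh_buckets(docs))
# → [['a', 'b'], ['a', 'b'], ['a', 'b'], ['a', 'b']]
```

The paper's contribution lies in mapping this pipeline efficiently onto GPU clusters and reusing partial hash computations; the sketch above only fixes ideas.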
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to interference between threads and parallel execution overhead. To maximize performance in an SMT multiprocessor, finding the optimal number of threads is important. This paper presents adaptive execution techniques that find the optimal execution mode for SMT multiprocessor architectures. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal execution mode for each...
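The dynamic-feedback idea can be sketched as a tiny exploration loop: time the parallel loop once at each candidate thread count, then commit to the fastest. The candidate counts and timings below are hypothetical.

```python
class AdaptiveThreadCount:
    """Toy adaptive execution: time the loop once at each candidate
    thread count, then stick with the fastest observed mode."""
    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.timings = {}
        self.best = None

    def choose(self):
        for n in self.candidates:
            if n not in self.timings:
                return n               # still exploring this candidate
        if self.best is None:          # exploration done: commit
            self.best = min(self.timings, key=self.timings.get)
        return self.best

    def report(self, n, elapsed):
        self.timings[n] = elapsed

# Hypothetical per-invocation runtimes: 4 threads interfere on SMT.
simulated = {1: 1.0, 2: 0.6, 4: 0.8}
sched = AdaptiveThreadCount([1, 2, 4])
for _ in range(5):
    n = sched.choose()
    sched.report(n, simulated[n])
print(sched.choose())   # → 2
```

In the paper's setting the preprocessor emits this logic around each parallel loop, and `report` would use real elapsed times rather than a table.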
In this paper, we propose a novel, application-specific demand paging mechanism for low-end embedded systems that use flash memory as secondary storage. These systems are not equipped with virtual memory. A small memory space called an execution buffer is allocated to page the application. An application-specific page manager manages the buffer. The page manager is generated by a compiler post-pass and combined with the application image. Our compiler post-pass analyzes the ELF executable image of the application and transforms function call/return instructions into calls to the page manager. As a result, each code...
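A toy model of such a page manager (simplified: one page per function, LRU eviction, which is an assumption rather than the paper's exact policy) shows the intercept-on-call mechanism:

```python
from collections import OrderedDict

class PageManager:
    """Toy application-specific page manager: keeps at most `slots`
    function pages in a small execution buffer, loading pages from
    (simulated) flash on demand and evicting the least recently used."""
    def __init__(self, slots):
        self.slots = slots
        self.buffer = OrderedDict()   # function name -> loaded page
        self.flash_loads = 0

    def call(self, func):
        # Every transformed call/return instruction lands here first.
        if func in self.buffer:
            self.buffer.move_to_end(func)        # hit: refresh LRU order
        else:
            self.flash_loads += 1                # miss: load from flash
            if len(self.buffer) >= self.slots:
                self.buffer.popitem(last=False)  # evict LRU page
            self.buffer[func] = f"code:{func}"
        return self.buffer[func]

pm = PageManager(slots=2)
for f in ["main", "init", "main", "draw", "main"]:
    pm.call(f)
print(pm.flash_loads)   # → 3 (one miss each for main, init, draw)
```

On the real system the "load" is a flash read into the execution buffer and the call is then redirected to the buffered copy.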
In this paper, we propose a fully automatic dynamic scratch-pad memory (SPM) management technique for instructions. Our technique loads required code segments into the SPM on demand at runtime. Our approach is based on postpass analysis and optimization techniques, and it handles the whole program, including libraries. The code mapping is determined by solving a mixed integer linear programming formulation that approximates our demand paging technique. We increase the effectiveness of demand paging by extracting from functions natural loops that are smaller...
In this paper, we present a dynamic scratchpad memory allocation strategy targeting a horizontally partitioned memory subsystem for contemporary embedded processors. The subsystem is equipped with a memory management unit (MMU), and the physically addressed scratchpad memory (SPM) is mapped into the virtual address space. A small minicache is added to further reduce energy consumption and improve performance. Using the MMU's page fault exception mechanism, we track page accesses and copy frequently executed code sections into the SPM before they are executed. Because...
Deep Convolutional Neural Networks (CNNs) are empirically known to be invariant to moderate translation but not to rotation in image classification. This paper proposes a deep CNN model, called CyCNN, which exploits polar mapping of input images to convert rotation to translation. To deal with the cylindrical property of the polar coordinates, we replace the convolution layers in conventional CNNs with cylindrical convolutional (CyConv) layers. A CyConv layer exploits a cylindrically sliding window (CSW) mechanism that vertically extends the input-image...
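The rotation-to-translation property can be demonstrated with a minimal nearest-neighbor polar resampler (this is an illustrative sketch, not CyCNN's exact sampling scheme):

```python
import math

def polar_map(image, num_r, num_theta):
    """Resample a square image onto an (r, theta) grid with
    nearest-neighbor sampling, centered at the image center.
    A rotation of the input becomes a cyclic shift along theta."""
    n = len(image)
    cx = cy = (n - 1) / 2
    out = []
    for ri in range(num_r):
        r = (ri + 0.5) * (n / 2) / num_r
        row = []
        for ti in range(num_theta):
            theta = 2 * math.pi * ti / num_theta
            x = round(cx + r * math.cos(theta))
            y = round(cy + r * math.sin(theta))
            row.append(image[y][x] if 0 <= x < n and 0 <= y < n else 0)
        out.append(row)
    return out

img = [[0] * 9 for _ in range(9)]
img[2][6] = 1                                   # an off-center feature
rot = [[img[x][8 - y] for x in range(9)] for y in range(9)]  # 90° rotation
p0 = polar_map(img, 3, 8)
p1 = polar_map(rot, 3, 8)
# Rotating the image by 90 degrees cyclically shifts each polar row
# by a quarter of the theta samples:
print(all(p1[r][t] == p0[r][(t + 2) % 8] for r in range(3) for t in range(8)))
# → True
```

Because the shifted pattern wraps around at theta = 2π, convolutions over this representation must also wrap vertically, which is exactly what the CyConv layers' cylindrically sliding windows provide.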
In this work, we present a dynamic memory allocation technique for a novel, horizontally partitioned memory subsystem targeting contemporary embedded processors with a memory management unit (MMU). We propose to replace the on-chip instruction cache with a scratchpad memory (SPM) and a small minicache. Serializing the address translation with the actual memory access enables the system to access either only the SPM or only the minicache. Independent of the SPM size, and based solely on profiling information, a postpass optimizer classifies the code of an application binary into pageable and cacheable...
There have been strong demands for fast and cycle-accurate virtual platforms in the embedded systems area, where developers can do meaningful software development, including performance debugging, in the context of the entire platform. In this paper, we describe the design and implementation of an architecture simulator called FaCSim as a first step towards such a virtual platform. FaCSim accurately models the ARM9E-S processor core and the ARM926EJ-S processor's memory subsystem. It simulates exceptions and interrupts to enable whole-system simulation...
This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have the advantage that resources such as processor cycles and the L1 cache are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions based on a new...
Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and varying latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to place their code and data. Programmers of such heterogeneous multicore architectures must explicitly manage data transfers between an accelerator core's local memory and the globally shared main memory. This is a tedious and error-prone programming task. A...
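One way such transfers can be automated is with a software-managed cache on the accelerator side, as in the OpenCL runtime abstract above. A toy direct-mapped version (the geometry and replacement details are illustrative assumptions) looks like this:

```python
class SoftwareCache:
    """Toy direct-mapped software cache: an accelerator core that has
    only local memory checks a tag array on each access and copies a
    whole line from (simulated) shared main memory on a miss."""
    def __init__(self, main_memory, num_lines=4, line_size=8):
        self.mem = main_memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines             # tag array in local memory
        self.lines = [[0] * line_size for _ in range(num_lines)]
        self.misses = 0

    def load(self, addr):
        tag, offset = divmod(addr, self.line_size)
        idx = tag % self.num_lines
        if self.tags[idx] != tag:                  # miss: DMA-like line copy
            self.misses += 1
            base = tag * self.line_size
            self.lines[idx] = self.mem[base:base + self.line_size]
            self.tags[idx] = tag
        return self.lines[idx][offset]

mem = list(range(100))
sc = SoftwareCache(mem)
print([sc.load(a) for a in [0, 1, 7, 8, 0]], sc.misses)
# → [0, 1, 7, 8, 0] 2
```

The compiler or runtime rewrites each shared-memory access into a `load`-style lookup, so the programmer no longer writes explicit transfers; the cost is the software tag check on every access.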
To exploit the abundant computational power of the world's fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, accurate performance estimation models for modern GPUs do not exist. This paper presents two performance models for GPUs: a sampling-based linear model, and a model based on machine-learning (ML) techniques that improves the accuracy of the linear model and is applicable to modern GPUs with and without caches. We first construct the linear model to predict the runtime of an arbitrary...
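The sampling-based linear model reduces, in its simplest form, to fitting runtime ≈ a·size + b from a few small sampled runs and extrapolating; the sample points below are hypothetical.

```python
def fit_linear(samples):
    """Least-squares fit of runtime = a * size + b from sampled
    (size, runtime) pairs, as in a sampling-based linear model."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical runtimes measured for small problem sizes.
a, b = fit_linear([(100, 1.2), (200, 2.2), (400, 4.2)])
print(round(a * 1000 + b, 2))   # predicted runtime for size 1000 → 10.2
```

A balanced workload distribution then assigns each device a share inversely proportional to its predicted runtime; the paper's ML model replaces this single linear feature with richer ones to stay accurate on cached GPUs.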
As parallel programming becomes mainstream due to multicore processors, dynamic memory allocators used in C and C++ can suppress the performance of multi-threaded applications if they are not scalable. In this paper, we present a new dynamic memory allocator for multi-threaded applications. The allocator never uses any synchronization in common cases. It uses only lock-free mechanisms in uncommon cases. Each thread owns a private heap and handles memory requests on that heap. Our allocator is completely synchronization-free when a thread allocates a memory block and deallocates it by itself....
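The ownership rule can be modeled in a few lines: allocation and same-thread frees touch only the private heap, while a block freed by a different thread is deferred to a remote queue that the owner drains (lock-free in the real allocator; a plain deque in this sketch, whose class and method names are illustrative).

```python
from collections import deque

class PrivateHeap:
    """Toy model of a scalable allocator's per-thread heap: the owner
    allocates and frees without synchronization; blocks freed by other
    threads are deferred to a remote-free queue drained by the owner."""
    def __init__(self):
        self.free_list = []          # touched only by the owner thread
        self.remote_frees = deque()  # appended by other threads
        self.next_block = 0

    def malloc(self):
        self._drain_remote()
        if self.free_list:
            return self.free_list.pop()       # fast, sync-free path
        self.next_block += 1
        return ("blk", self.next_block)       # grow the heap

    def free(self, block, owner=True):
        if owner:
            self.free_list.append(block)      # sync-free local free
        else:
            self.remote_frees.append(block)   # deferred remote free

    def _drain_remote(self):
        while self.remote_frees:
            self.free_list.append(self.remote_frees.popleft())

heap = PrivateHeap()
b1 = heap.malloc()
heap.free(b1)              # local free: block is reused immediately
b2 = heap.malloc()
print(b1 == b2)            # → True
```

The fast path performs no atomic operations at all; only the remote-free queue (the uncommon case) would need lock-free operations in a real implementation.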
Deep neural networks (DNNs) are continuing to get wider and deeper. As a result, training them requires a tremendous amount of GPU memory and computing power. In this paper, we propose a framework called DeepUM that exploits CUDA Unified Memory (UM) to allow GPU memory oversubscription for DNNs. While UM allows memory oversubscription using a page fault mechanism, page migration introduces enormous overhead. DeepUM uses a new correlation prefetching technique to hide the page migration overhead. It is fully automatic and transparent to users. We also propose two optimization techniques to minimize the page fault handling...