- Low-power high-performance VLSI design
- Advanced Memory and Neural Computing
- Ferroelectric and Negative Capacitance Devices
- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Algorithms and Data Compression
- Analog and Mixed-Signal Circuit Design
- Advancements in Semiconductor Devices and Circuit Design
- Embedded Systems Design Techniques
- CCD and CMOS Imaging Sensors
- Medical Image Segmentation Techniques
- Advanced Image and Video Retrieval Techniques
- Numerical Methods and Algorithms
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Particle accelerators and beam dynamics
- Advanced Data Compression Techniques
- Magnetic confinement fusion research
Pohang University of Science and Technology
2021-2025
In this paper, we present a novel approximate computing scheme suitable for realizing energy-efficient multiply-accumulate (MAC) processing. In contrast to prior works that suffer from error accumulation limiting the range, we utilize different approximate multipliers in an interleaved way to compensate errors in the opposite direction during accumulate operations. For balanced error accumulation, we first design approximate 4-2 compressors generating errors in opposite directions while minimizing computational costs. Based on probabilistic analysis, positive and...
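The interleaving idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's design: the two multipliers below are hypothetical stand-ins that simply round the exact product down or up to a multiple of `2**K`, whereas the paper builds its multipliers from approximate 4-2 compressors. The names `mul_under`, `mul_over`, `mac`, and the value of `K` are assumptions made for illustration only.

```python
import random

K = 4  # assumed number of low-order product bits that are approximated


def mul_under(a, b, k=K):
    """Hypothetical multiplier whose result never exceeds the exact product."""
    return ((a * b) >> k) << k


def mul_over(a, b, k=K):
    """Hypothetical multiplier whose result is never below the exact product."""
    return ((a * b + (1 << k) - 1) >> k) << k


def mac(pairs, multipliers):
    """Accumulate products, cycling through the given multipliers in order."""
    acc = 0
    for i, (a, b) in enumerate(pairs):
        acc += multipliers[i % len(multipliers)](a, b)
    return acc


rng = random.Random(0)  # fixed seed so the run is repeatable
pairs = [(rng.randrange(256), rng.randrange(256)) for _ in range(1000)]
exact = sum(a * b for a, b in pairs)

# One-sided errors pile up; alternating signs let them largely cancel.
err_single = mac(pairs, [mul_under]) - exact
err_interleaved = mac(pairs, [mul_under, mul_over]) - exact
print("single multiplier error:", err_single)
print("interleaved error:", err_interleaved)
```

With a single one-sided multiplier the accumulated error grows with the number of terms, while alternating the two multipliers keeps the running error close to zero in this toy setting.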
The introduction of 8-bit floating-point (FP8) computation units in modern AI accelerators has generated significant interest in FP8-based large language model (LLM) inference. Unlike 16-bit formats, FP8 in deep learning requires a shared scaling factor. Additionally, while E4M3 and E5M2 are well-defined at the individual value level, their accumulation methods remain unspecified and vary across hardware and software implementations. As a result, FP8 behaves more like a quantization format than a standard numeric...
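To make the shared scaling factor mentioned in this abstract concrete, here is a minimal pure-Python sketch of per-tensor scaling into an emulated E4M3 range. It assumes the common E4M3 variant with 3 mantissa bits, a largest finite magnitude of 448, and saturation on overflow; the function names and the choice of one scale per list are illustrative assumptions, not the scheme studied in the paper, and accumulation behaviour is not modelled at all.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude of the assumed E4M3 variant
MANT_BITS = 3     # explicit mantissa bits in E4M3
MIN_EXP = -6      # exponent of the smallest normal value


def round_to_e4m3(x):
    """Round x to the nearest value on the emulated E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)            # saturate instead of overflowing
    _, e = math.frexp(a)                 # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_EXP)            # clamp so small values use the subnormal spacing
    step = 2.0 ** (exp - MANT_BITS)      # spacing between neighbouring grid points
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize(values):
    """Map values onto the E4M3 grid using one scale shared by the whole list."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Undo the shared scaling."""
    return [c * scale for c in codes]


values = [0.5, -3.0, 100.0, 0.01]
codes, scale = quantize(values)
restored = dequantize(codes, scale)
print(codes, scale, restored)
```

Because every element shares one scale, the largest magnitude in the list determines how finely the smaller elements are resolved, which is one reason such a format behaves like a quantization scheme rather than a self-contained numeric type.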
Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving full precision for activations, resulting in practical speed-ups during inference on a single GPU. However, these improvements primarily stem...
In this brief, we present a novel design methodology of cost-effective approximate radix-4 Booth multipliers, which can significantly reduce the power consumption of error-resilient signal processing tasks. In contrast to prior studies that only focus on the approximation of either partial product generation with encoders or partial product reductions with compressors, the proposed method considers the two major steps jointly by forcing the generated error directions to be opposite to each other. As the internal errors are naturally balanced to have...
We present the energy-efficient TF-MVP architecture, a sparsity-aware transformer accelerator, by introducing novel algorithm-hardware co-optimization techniques. From the previous fine-grained pruning map, for the first time, the direction strength is developed to analyze the pruning patterns quantitatively, indicating the major pruning direction and size of each layer. Then, the mixed-length vector pruning (MVP) is proposed to generate a hardware-friendly pruned-transformer model, which is fully supported by our TF-MVP accelerator with a reconfigurable PE structure....
Key-Value (KV) caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models (LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment, as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods were proposed to select and evict unimportant KV pairs from the cache to reduce memory consumption, the potential ramifications of eviction on the generative process are yet to be thoroughly examined. In this...
Based on recent RISC-V designs, we present in this paper a low-power vector processor architecture for efficiently deploying vision transformer (ViT) models. To fairly measure the processing efficiency of different processor designs with instruction/data cache memories, we first develop an evaluation framework based on numerous design tools, jointly considering algorithm, architecture, and circuit performances, numerically revealing that the previous CSR-based data compression cannot accelerate the pruned...