- Speech Recognition and Synthesis
- Topic Modeling
- Advanced Neural Network Applications
- Natural Language Processing Techniques
- Music and Audio Processing
- Speech and Audio Processing
- Domain Adaptation and Few-Shot Learning
- Sparse and Compressive Sensing Techniques
- Advanced Text Analysis Techniques
- Adversarial Robustness in Machine Learning
- Machine Learning and ELM
- Blind Source Separation Techniques
- Neural Networks and Applications
- Speech and dialogue systems
- Asian Culture and Media Studies
- Quantum Computing Algorithms and Architecture
- Quantum Information and Cryptography
- Information Retrieval and Search Behavior
- Text and Document Classification Technologies
- Multimodal Machine Learning Applications
- COVID-19 diagnosis using AI
- Advancements in Photolithography Techniques
- Computational Physics and Python Applications
- Parallel Computing and Optimization Techniques
- Seismic Imaging and Inversion Techniques
Apple (United Kingdom)
2023-2025
Apple (United States)
2024-2025
IBM (United States)
2020
IBM Research - Austin
2020
The University of Texas at Austin
2006
Spotting user-defined/flexible keywords represented as text frequently relies on an expensive text encoder for joint analysis with the audio embedding space, which can suffer from heterogeneous modality representation (i.e., a large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder: it inherently has a homogeneous embedding space, and it is also much smaller than a comparable text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme...
Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM into a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This...
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory and bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into...
In this paper, an algorithm for scan vector ordering, PEAKASO, is proposed to minimize the peak temperature during testing. Given a circuit with scan chains and test vectors, hotspots are predicted by window-based power analysis. The peak temperature on hotspots is minimized by global ordering, which expedites heat dissipation to the ambient air through a large thermal gradient. Further reduction is achieved by local reordering based on overheat precompensation. As output, PEAKASO provides a vector order with lower peak temperature. Note that the vectors themselves are not changed at...
The size of LLMs (i.e., billions of parameters) requires highly effective compression techniques to fit into storage-limited devices. Among the many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression, and it is supported by modern smartphones. Yet, its training overhead is prohibitively significant for LLM fine-tuning. In particular, Differentiable KMeans Clustering, or DKM, has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory...
We present a novel multi-scale embedding scheme that links conventional QM/MM embedding and bootstrap embedding (BE) to allow simulations of large chemical systems on limited quantum devices. We also propose a mixed-basis BE scheme that facilitates BE calculations of extended systems using classical computers with limited memory resources. Benchmark data suggest the combination of these two strategies as a robust path toward attaining correlation energies for realistic systems, combining the proven accuracy of BE on systems of biological interest with a lower computational cost. Due...
This report provides an overview of recent work that harnesses the Big Data Revolution and Large Scale Computing to address grand computational challenges in Multi-Messenger Astrophysics, with a particular emphasis on real-time discovery campaigns. Acknowledging the transdisciplinary nature of this endeavor, this document has been prepared by members of the physics, astronomy, computer science, data science, software, and cyberinfrastructure communities who attended the NSF-, DOE-, and NVIDIA-funded "Deep Learning for Multi-Messenger Astrophysics: Real-time...
Deep neural networks (DNNs) have achieved significant success in a variety of real-world applications, e.g., image classification. However, the huge number of parameters in the networks restricts their efficiency due to large model size and intensive computation. To address this issue, various approximation techniques have been investigated, which seek a lightweight network with little performance degradation in exchange for a smaller model size or faster inference. Both low-rankness and sparsity are appealing properties for network approximation. In this paper we...
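A minimal sketch of the low-rank-plus-sparse idea the abstract above refers to, on a single weight matrix (illustrative only, not the paper's algorithm): the matrix is split into a truncated-SVD low-rank part plus a sparse matrix that keeps the largest residual entries.

```python
import numpy as np

def lowrank_plus_sparse(W, rank, sparsity):
    """Approximate W as L + S: L is the rank-`rank` truncated SVD of W,
    and S keeps only the largest-magnitude entries of the residual W - L."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    R = W - L
    k = int(sparsity * R.size)              # number of residual entries to keep
    S = np.zeros_like(R)
    if k > 0:
        flat = np.argsort(np.abs(R), axis=None)[-k:]
        idx = np.unravel_index(flat, R.shape)
        S[idx] = R[idx]
    return L, S

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
L, S = lowrank_plus_sparse(W, rank=8, sparsity=0.05)
err_lr = np.linalg.norm(W - L) / np.linalg.norm(W)
err_both = np.linalg.norm(W - L - S) / np.linalg.norm(W)
```

Combining both structures lets the sparse term absorb the few large entries that a pure low-rank factorization handles poorly.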
As deep neural networks become more complex and input datasets grow larger, it can take days or even weeks to train a network to the desired accuracy. Therefore, distributed Deep Learning at massive scale is a critical capability, since it offers the potential to reduce training time from days or weeks to hours. In this paper, we present a software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs. The core algorithm is a multi-ring communication pattern that provides a good tradeoff between latency...
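The ring communication pattern mentioned above can be sketched as a simulated single-ring all-reduce (the paper's system uses a multi-ring variant; this is just the textbook ring, not their implementation):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulated ring all-reduce over n workers: a reduce-scatter pass
    (n-1 steps) leaves each worker owning the full sum of one chunk,
    then an all-gather pass (n-1 steps) circulates the completed chunks.
    Each step moves only 1/n of the data per worker, which is the
    latency/bandwidth trade-off a ring pattern buys."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]
    for step in range(n - 1):                     # reduce-scatter
        for i in range(n):
            c = (i - step) % n                    # chunk worker i sends
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
    for step in range(n - 1):                     # all-gather
        for i in range(n):
            c = (i + 1 - step) % n                # completed chunk to pass on
            chunks[(i + 1) % n][c] = chunks[i][c].copy()
    return [np.concatenate(ch) for ch in chunks]

rng = np.random.default_rng(0)
grads = [rng.standard_normal(10) for _ in range(4)]   # 4 simulated workers
reduced = ring_allreduce(grads)
```

After both passes every worker holds the elementwise sum of all gradients, which is the invariant a real all-reduce guarantees.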
Knowing the similarity between sets of data has a number of positive implications for training an effective model, such as assisting an informed selection out of known datasets favorable to model transfer, or data augmentation for problems with an unknown dataset. Common practices to estimate the similarity include comparing in the original sample space, comparing in the embedding space of a model performing a certain task, or fine-tuning a pretrained model with different datasets and evaluating the performance changes therefrom. However, these would suffer from shallow comparisons, task-specific...
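One standard way to compare two datasets in a sample or embedding space, as the abstract above discusses, is Maximum Mean Discrepancy (a common baseline, not necessarily the method this paper proposes):

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel: compares the kernel mean
    embeddings of two sample sets; larger values mean the distributions
    (and hence the datasets) look less similar."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 2))
B = rng.standard_normal((100, 2))          # same distribution as A
C = rng.standard_normal((100, 2)) + 3.0    # shifted distribution
```

Similar datasets (A, B) yield a near-zero MMD, while the shifted dataset C is clearly separated, so the score can rank candidate datasets for transfer or augmentation.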
Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering for DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regularizers and parameters, DKM-based compression keeps...
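The "clustering as attention" step can be sketched on scalar weights (a simplified illustration; real DKM operates on full weight tensors inside the training loop): assignments become a softmax over negative weight-centroid distances, so both weights and centroids stay differentiable.

```python
import numpy as np

def dkm_soft_cluster(weights, centroids, temperature=0.05, iters=20):
    """Attention-style soft k-means: the assignment matrix is a softmax
    over -|w - c| / temperature, and centroids are the attention-weighted
    means of the weights."""
    w = weights.reshape(-1, 1).astype(float)
    c = centroids.astype(float).copy()
    for _ in range(iters):
        logits = -np.abs(w - c) / temperature        # (n, k) similarity
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)            # soft assignments
        c = (a * w).sum(axis=0) / a.sum(axis=0)      # centroid update
    return (a * c).sum(axis=1), c                    # soft-quantized weights

w = np.array([0.11, 0.09, 0.10, 0.52, 0.48, 0.50])
q, c = dkm_soft_cluster(w, centroids=np.array([0.0, 0.6]))
```

Because every operation is smooth, gradients can flow through the quantized weights back to the original parameters, which is what enables joint train-time optimization.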
Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the learning capacity of such streaming models (i.e., by adding more parameters) to improve their predictive power may not be viable for real-world tasks. In this work, we propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames. More specifically, our SAL and its focal variations dynamically...
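A focal-style frame weighting, in the spirit of the "focal variations" mentioned above, can be sketched as follows (illustrative only; the abstract does not specify SAL's exact weighting, so this is plain focal-modulated frame-wise cross-entropy):

```python
import numpy as np

def focal_frame_loss(probs, targets, gamma=2.0):
    """Per-frame binary cross-entropy scaled by a focal term
    (1 - p_t)**gamma, which shrinks the loss on frames the model already
    predicts well so the gradient signal concentrates on hard frames."""
    eps = 1e-8
    p_t = np.where(targets == 1, probs, 1.0 - probs)  # prob of true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)

# An easy frame (p=0.95) contributes far less than a hard one (p=0.55).
losses = focal_frame_loss(np.array([0.95, 0.55]), np.array([1, 1]))
```

With gamma = 0 the modulation vanishes and the loss reduces to ordinary frame-wise cross-entropy.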
Large Language Model or LLM inference has two phases: the prompt (or prefill) phase to output the first token, and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits...
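Why chunked KV-cache population is even possible: with causal attention, a prompt chunk only needs the keys/values of earlier chunks, so chunks can be handed to different processes. A toy single-head check that chunked prefill reproduces one-shot prefill (a sketch of the enabling property, not KV-Runahead's actual orchestration):

```python
import numpy as np

def causal_attn(Q, K, V, offset):
    """Single-head causal attention: local query row i may attend to
    global key positions 0 .. offset + i."""
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        s = (K[: offset + i + 1] @ Q[i]) / np.sqrt(K.shape[1])
        a = np.exp(s - s.max())
        a /= a.sum()
        out[i] = a @ V[: offset + i + 1]
    return out

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 4))            # toy prompt: Q = K = V = X

full = causal_attn(X, X, X, offset=0)       # one-shot prefill

# Chunked prefill: each chunk extends the KV-cache, then attends to the
# cache of all earlier chunks plus itself.
outs, K_cache, V_cache = [], np.empty((0, 4)), np.empty((0, 4))
for chunk in np.array_split(X, 3):
    K_cache = np.vstack([K_cache, chunk])
    V_cache = np.vstack([V_cache, chunk])
    outs.append(causal_attn(chunk, K_cache, V_cache,
                            offset=K_cache.shape[0] - chunk.shape[0]))
chunked = np.vstack(outs)
```

Because the two computations agree exactly, the prefill work can be pipelined across processes without changing the first token that gets generated.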
The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that...
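The selection step behind attention-based prompt-token pruning can be sketched as follows (a simplified illustration; LazyLLM's actual criterion and its progressive, per-layer schedule may differ):

```python
import numpy as np

def prune_prompt_tokens(scores, keep_ratio=0.5):
    """Keep the prompt-token positions with the highest importance
    scores, in their original order; later layers then run on the
    shorter sequence, cutting prefill compute."""
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[-k:])

# Toy per-token importance (e.g., attention mass from the final token).
scores = np.array([0.01, 0.40, 0.02, 0.30, 0.05, 0.22])
keep = prune_prompt_tokens(scores, keep_ratio=0.5)   # positions [1, 3, 5]
```

Only the KV entries at the kept positions need to be computed for the pruned layers, which is where the time-to-first-token savings come from.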
Quantum computers can accurately compute ground state energies using phase estimation, but this requires a guiding state which has significant overlap with the true ground state. For large molecules and extended materials, it becomes difficult to find guiding states with good overlap as molecule sizes grow. Additionally, the required number of qubits and quantum gates may become prohibitively large. One approach for dealing with these challenges is to use a quantum embedding method, which allows a reduction to one or multiple smaller quantum cores embedded in a larger...
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advancements in mixture-of-experts (MoE) models, speculative decoding, and early-exit strategies leverage the insight that computational demands can vary significantly based on the complexity and nature of the input. However, identifying optimal routing patterns for dynamic execution remains an open challenge, limiting the full potential...
User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, which has been largely under-leveraged in prior works. Our analysis of the keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting)...
The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small models are less expensive to train, but they cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of...
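One known way to initialize a larger model from a smaller trained one is function-preserving width expansion in the style of Net2WiderNet (shown here as a sketch on a two-layer ReLU network; it is not necessarily the scheme this paper proposes):

```python
import numpy as np

def widen_linear(W1, W2, new_width, rng):
    """Grow the hidden layer of y = W2 @ relu(W1 @ x) to `new_width`
    units by duplicating randomly chosen units and splitting their
    outgoing weights among the copies, so the widened network computes
    exactly the same function as the small one."""
    old = W1.shape[0]
    idx = np.concatenate([np.arange(old),
                          rng.integers(0, old, new_width - old)])
    counts = np.bincount(idx, minlength=old)   # copies per original unit
    return W1[idx], W2[:, idx] / counts[idx]

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
W1_big, W2_big = widen_linear(W1, W2, new_width=7, rng=rng)

x = rng.standard_normal(3)
y_small = W2 @ np.maximum(W1 @ x, 0.0)
y_big = W2_big @ np.maximum(W1_big @ x, 0.0)
```

The widened network starts from the small model's accuracy rather than from random initialization, which is the benefit such initialization schemes aim for.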