Minsik Cho

ORCID: 0000-0003-0481-2682
Research Areas
  • Speech Recognition and Synthesis
  • Topic Modeling
  • Advanced Neural Network Applications
  • Natural Language Processing Techniques
  • Music and Audio Processing
  • Speech and Audio Processing
  • Domain Adaptation and Few-Shot Learning
  • Sparse and Compressive Sensing Techniques
  • Advanced Text Analysis Techniques
  • Adversarial Robustness in Machine Learning
  • Machine Learning and ELM
  • Blind Source Separation Techniques
  • Neural Networks and Applications
  • Speech and dialogue systems
  • Asian Culture and Media Studies
  • Quantum Computing Algorithms and Architecture
  • Quantum Information and Cryptography
  • Information Retrieval and Search Behavior
  • Text and Document Classification Technologies
  • Multimodal Machine Learning Applications
  • COVID-19 diagnosis using AI
  • Advancements in Photolithography Techniques
  • Computational Physics and Python Applications
  • Parallel Computing and Optimization Techniques
  • Seismic Imaging and Inversion Techniques

Apple (United Kingdom)
2023-2025

Apple (United States)
2024-2025

IBM (United States)
2020

IBM Research - Austin
2020

The University of Texas at Austin
2006

Spotting user-defined/flexible keywords represented in text frequently uses an expensive text encoder for joint analysis with an audio embedding in an embedding space, which can suffer from heterogeneous modality representation (i.e., a large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder, which inherently has a homogeneous embedding space with audio and is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme...

10.1109/icassp48485.2024.10447547 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
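
To make the idea concrete, here is a minimal, hypothetical sketch of detection with an audio-compliant text encoder: the keyword is mapped to phonemes, pooled over representative phoneme vectors into the shared embedding space, and compared with the audio embedding by cosine similarity. The toy G2P lexicon, the random phoneme table, and the threshold are illustrative stand-ins, not the paper's actual components.

    import numpy as np

    PHONEME_DIM = 64
    rng = np.random.default_rng(0)

    # Toy stand-ins: a two-word G2P lexicon and random representative
    # phoneme vectors shared with the (hypothetical) audio encoder.
    G2P = {"hey": ["HH", "EY"], "siri": ["S", "IH", "R", "IY"]}
    PHONEME_VECS = {p: rng.standard_normal(PHONEME_DIM)
                    for p in ["HH", "EY", "S", "IH", "R", "IY"]}

    def encode_text(keyword: str) -> np.ndarray:
        """Phoneme-based text embedding living in the audio embedding space."""
        phones = [p for w in keyword.lower().split() for p in G2P[w]]
        emb = np.mean([PHONEME_VECS[p] for p in phones], axis=0)
        return emb / np.linalg.norm(emb)

    def spot(audio_emb: np.ndarray, keyword: str, threshold: float = 0.5) -> bool:
        """Declare a detection when cosine similarity clears the threshold."""
        audio_emb = audio_emb / np.linalg.norm(audio_emb)
        return float(audio_emb @ encode_text(keyword)) > threshold

    audio_emb = encode_text("hey siri") + 0.1 * rng.standard_normal(PHONEME_DIM)
    print(spot(audio_emb, "hey siri"))   # True for a well-matched utterance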

Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM into a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This...

10.48550/arxiv.2502.12325 preprint EN arXiv (Cornell University) 2025-02-17
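
A hedged sketch of the token-difficulty routing this describes: a tiny router scores each token and dispatches it to a cheap or an expensive feed-forward expert. The random weights, the linear difficulty proxy, and the threshold are invented for illustration; the paper derives its experts from the pre-trained dense model.

    import numpy as np

    rng = np.random.default_rng(1)
    D_MODEL, D_SMALL, D_LARGE = 32, 16, 128

    # Two FFN "experts" of different capacity plus a linear router that
    # scores per-token difficulty; all weights here are random stand-ins.
    W_small = rng.standard_normal((D_MODEL, D_SMALL)) * 0.1
    W_large = rng.standard_normal((D_MODEL, D_LARGE)) * 0.1
    router_w = rng.standard_normal(D_MODEL) * 0.1

    def moe_layer(tokens: np.ndarray, threshold: float = 0.0) -> np.ndarray:
        """Route each token to a cheap or expensive expert by difficulty score."""
        scores = tokens @ router_w                     # per-token difficulty proxy
        out = np.empty_like(tokens)
        for i, (tok, s) in enumerate(zip(tokens, scores)):
            if s <= threshold:                         # "easy" token: cheap path
                out[i] = np.maximum(tok @ W_small, 0) @ W_small.T
            else:                                      # "hard" token: full path
                out[i] = np.maximum(tok @ W_large, 0) @ W_large.T
        return out

    tokens = rng.standard_normal((8, D_MODEL))
    print(moe_layer(tokens).shape)                     # (8, 32)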

10.1109/icassp49660.2025.10888533 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into...

10.48550/arxiv.2312.11514 preprint EN cc-by-sa arXiv (Cornell University) 2023-01-01
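
The core mechanism, on-demand parameter movement from flash to DRAM, can be sketched as below. The memory-mapped file standing in for flash, the FIFO eviction, and the row granularity are assumptions for illustration; the paper's cost model, windowing, and row-column bundling are not reproduced.

    import numpy as np, tempfile, os

    ROWS, COLS, DRAM_BUDGET = 1024, 256, 64   # cache at most 64 rows in "DRAM"

    # A memory-mapped file stands in for flash storage holding the weights.
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    np.random.default_rng(2).standard_normal((ROWS, COLS)).astype(np.float32).tofile(path)
    flash = np.memmap(path, dtype=np.float32, mode="r", shape=(ROWS, COLS))

    dram: dict[int, np.ndarray] = {}          # row index -> cached row

    def get_rows(needed: list[int]) -> np.ndarray:
        """Fetch rows on demand, evicting the oldest rows past the DRAM budget."""
        for r in needed:
            if r not in dram:
                dram[r] = np.array(flash[r])  # copy flash -> DRAM only when missing
        out = np.stack([dram[r] for r in needed])
        while len(dram) > DRAM_BUDGET:
            dram.pop(next(iter(dram)))        # FIFO eviction, oldest first
        return out

    print(get_rows([3, 17, 3]).shape)         # (3, 256); row 3 served from cache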

In this paper, an algorithm for scan vector ordering, PEAKASO, is proposed to minimize the peak temperature during testing. Given a circuit with scan chains and test vectors, the hotspot is predicted by window-based power analysis. The temperature on the hotspot is minimized by global vector ordering, which expedites heat dissipation to the ambient air through a large thermal gradient. Further reduction is achieved by local reordering based on overheat precompensation. As output, PEAKASO provides a vector order with a lower peak temperature. Note that the vectors themselves are not changed at...

10.1109/vts.2006.56 article EN 2006-05-25
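
One way to picture the global-ordering step is the toy heuristic below, which spreads high-power vectors apart so heat can dissipate between them. This is only an illustration of the intuition; PEAKASO itself relies on window-based power analysis and thermal prediction rather than this simple interleaving.

    def interleave_order(vector_power: list[float]) -> list[int]:
        """Return an ordering that alternates high- and low-power test vectors."""
        ranked = sorted(range(len(vector_power)), key=lambda i: vector_power[i])
        order, lo, hi = [], 0, len(ranked) - 1
        while lo <= hi:
            order.append(ranked[hi]); hi -= 1      # hottest remaining vector
            if lo <= hi:
                order.append(ranked[lo]); lo += 1  # coolest remaining vector
        return order

    power = [0.9, 0.1, 0.8, 0.2, 0.7]              # per-vector power estimates
    print(interleave_order(power))                 # [0, 1, 2, 3, 4]: hot, cool, hot, ...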

The size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression, and it is supported by modern smartphones. Yet, its training overhead is prohibitively significant for LLM fine-tuning. In particular, Differentiable KMeans Clustering, or DKM, has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory...

10.1109/lca.2024.3363492 article EN IEEE Computer Architecture Letters 2024-01-01
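
A sketch of the uniquification idea that shrinks DKM's memory footprint: the weight-to-centroid attention map is stored only for unique weight values (a 16-bit tensor has at most 65,536 of them) together with an index array. The tensor sizes and the distance-based softmax below are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    w = rng.standard_normal(1_000_000).astype(np.float16)   # many repeated values
    centroids = np.linspace(-2, 2, 16, dtype=np.float32)

    # Store attention only over unique weight values, plus an index map.
    uniq, inv = np.unique(w, return_inverse=True)
    dist = np.abs(uniq.astype(np.float32)[:, None] - centroids[None, :])
    attn_uniq = np.exp(-dist) / np.exp(-dist).sum(1, keepdims=True)

    # Any full attention row is recovered through `inv` when needed.
    full_row_0 = attn_uniq[inv[0]]
    print(w.size, uniq.size)   # memory scales with uniq.size, not w.size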

We present a novel multi-scale embedding scheme that links conventional QM/MM embedding and bootstrap embedding (BE) to allow simulations of large chemical systems on limited quantum devices. We also propose a mixed-basis BE that facilitates calculations of extended systems using classical computers with limited memory resources. Benchmark data suggest the combination of these two strategies as a robust path to attaining the correlation energies of realistic systems, combining proven accuracy for systems of biological interest with the lower computational cost of the method. Due...

10.48550/arxiv.2409.06813 preprint EN arXiv (Cornell University) 2024-09-10

This report provides an overview of recent work that harnesses the Big Data Revolution and Large Scale Computing to address grand computational challenges in Multi-Messenger Astrophysics, with a particular emphasis on real-time discovery campaigns. Acknowledging the transdisciplinary nature of this endeavor, this document has been prepared by members of the physics, astronomy, computer science, data science, software, and cyberinfrastructure communities who attended the NSF-, DOE-, and NVIDIA-funded workshop "Deep Learning for Multi-Messenger Astrophysics: Real-time...

10.48550/arxiv.1902.00522 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Deep neural networks (DNNs) have achieved significant success in a variety of real-world applications, e.g., image classification. However, the tons of parameters in the networks restrict efficiency due to large model size and intensive computation. To address this issue, various approximation techniques have been investigated, which seek a light-weighted network with little performance degradation in exchange for a smaller model size or faster inference. Both low-rankness and sparsity are appealing properties for network approximation. In this paper we...

10.1109/ictai.2019.00060 article EN 2019-11-01
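
A minimal sketch of the combined approximation the abstract points to, W ≈ L + S with a truncated-SVD low-rank part and a sparse residual. The rank, the keep fraction, and the thresholding rule are arbitrary choices for illustration.

    import numpy as np

    def lowrank_plus_sparse(W: np.ndarray, rank: int, keep: float):
        """Approximate W as a rank-`rank` matrix plus a sparse residual."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # low-rank structure
        R = W - L
        thresh = np.quantile(np.abs(R), 1.0 - keep)       # keep largest entries
        S = np.where(np.abs(R) >= thresh, R, 0.0)         # sparse residual
        return L, S

    W = np.random.default_rng(4).standard_normal((64, 64))
    L, S = lowrank_plus_sparse(W, rank=8, keep=0.05)
    err = np.linalg.norm(W - (L + S)) / np.linalg.norm(W)
    print(f"relative error: {err:.3f}, nonzeros in S: {int((S != 0).sum())}")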

As deep neural networks become more complex and input datasets grow larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Therefore, distributed Deep Learning at massive scale is a critical capability, since it offers the potential to reduce the training time from days or weeks to hours. In this paper, we present a software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs. The core algorithm is a multi-ring communication pattern that provides a good tradeoff between latency...

10.48550/arxiv.1708.02188 preprint EN other-oa arXiv (Cornell University) 2017-01-01
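
The ring communication pattern at the heart of such systems can be simulated in a single process, as below: each worker's gradient is split into chunks, reduce-scatter sums each chunk around the ring, and all-gather circulates the finished chunks, so every link carries only 1/N of the data per step. This sketches plain ring all-reduce; the paper's multi-ring variant layers several such rings.

    import numpy as np

    def ring_allreduce(grads: list[np.ndarray]) -> list[np.ndarray]:
        """Single-process simulation of ring all-reduce over N workers."""
        n = len(grads)
        chunks = [np.array_split(g, n) for g in grads]     # per-worker chunks
        for step in range(n - 1):                          # reduce-scatter
            for rank in range(n):
                send = (rank - step) % n                   # chunk rank forwards
                chunks[(rank + 1) % n][send] = chunks[(rank + 1) % n][send] + chunks[rank][send]
        for step in range(n - 1):                          # all-gather
            for rank in range(n):
                send = (rank + 1 - step) % n               # finished chunk circulates
                chunks[(rank + 1) % n][send] = chunks[rank][send]
        return [np.concatenate(c) for c in chunks]

    grads = [np.full(8, float(i)) for i in range(4)]       # workers 0..3
    print(ring_allreduce(grads)[0])                        # every entry sums to 6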

Knowing the similarity between sets of data has a number of positive implications in training an effective model, such as assisting an informed selection out of known datasets favorable to model transfer or data augmentation for problems with an unknown dataset. Common practices to estimate the similarity include comparing in the original sample space, comparing in the embedding space from a model performing a certain task, or fine-tuning a pretrained model on different datasets and evaluating the performance changes therefrom. However, these would suffer from shallow comparisons, task-specific...

10.48550/arxiv.2001.04893 preprint EN other-oa arXiv (Cornell University) 2020-01-01
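
As one concrete example of the "embedding space" practice the abstract critiques, the sketch below scores dataset similarity by a diagonal-covariance Frechet-style distance between embedding statistics. It illustrates the baseline, not the paper's proposed estimator.

    import numpy as np

    def embedding_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
        """Frechet-style distance with a diagonal-covariance simplification."""
        mu_a, mu_b = emb_a.mean(0), emb_b.mean(0)
        var_a, var_b = emb_a.var(0), emb_b.var(0)
        return float(((mu_a - mu_b) ** 2).sum()
                     + (var_a + var_b - 2 * np.sqrt(var_a * var_b)).sum())

    rng = np.random.default_rng(5)
    a = rng.standard_normal((500, 32))            # embeddings of dataset A
    b = rng.standard_normal((500, 32)) + 0.5      # shifted "dataset" B
    print(embedding_distance(a, a[:250]), embedding_distance(a, b))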

Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering-based DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regularizers and parameters, DKM-based compression keeps...

10.48550/arxiv.2108.12659 preprint EN cc-by-nc-nd arXiv (Cornell University) 2021-01-01
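
A minimal sketch of the attention formulation: hard k-means assignment is replaced by a softmax over negative weight-centroid distances, so gradients can flow to both the weights and the centroids. The temperature, cluster count, and 1-D weights below are illustrative.

    import numpy as np

    def dkm_step(w: np.ndarray, c: np.ndarray, tau: float = 0.05):
        """One soft clustering step: attention matrix and updated centroids."""
        dist = np.abs(w[:, None] - c[None, :])            # |W| x |C| distances
        attn = np.exp(-dist / tau)
        attn /= attn.sum(axis=1, keepdims=True)           # softmax over centroids
        new_c = (attn * w[:, None]).sum(0) / attn.sum(0)  # attention-weighted means
        w_soft = attn @ new_c                             # differentiable "quantized" weights
        return attn, new_c, w_soft

    rng = np.random.default_rng(6)
    w = rng.standard_normal(4096)
    c = np.linspace(-2, 2, 16)                            # 16 clusters ~ 4-bit weights
    for _ in range(10):
        attn, c, w_soft = dkm_step(w, c)
    print(np.abs(w - w_soft).mean())                      # small soft-quantization error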

Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the learning capacity of such streaming models (i.e., by adding more parameters) to improve the predictive power may not be viable for real-world tasks. In this work, we propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames. More specifically, our SAL and its focal variations dynamically...

10.1109/icassp48485.2024.10447222 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
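
A heavily hedged sketch of the frame-weighting mechanism: per-frame cross-entropy is scaled up around presumed essential frames, here label transitions such as a keyword onset. The exact weighting and focal variations in the paper differ; only the mechanism is illustrated, and the boost window is invented.

    import numpy as np

    def anchored_frame_loss(probs: np.ndarray, labels: np.ndarray, boost: float = 4.0):
        """Weighted frame-wise cross-entropy; probs has shape (T, classes)."""
        T = len(labels)
        weights = np.ones(T)
        transitions = np.flatnonzero(np.diff(labels) != 0) + 1
        for t in transitions:                       # emphasize frames at/after onsets
            weights[t:min(t + 3, T)] = boost
        ce = -np.log(probs[np.arange(T), labels] + 1e-9)
        return float((weights * ce).sum() / weights.sum())

    T, C = 20, 3
    rng = np.random.default_rng(7)
    probs = rng.dirichlet(np.ones(C), size=T)       # stand-in frame posteriors
    labels = np.array([0] * 10 + [1] * 10)          # one transition at frame 10
    print(anchored_frame_loss(probs, labels))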

Large Language Model or LLM inference has two phases, the prompt (or prefill) phase to output the first token and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache and minimizes the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits...

10.48550/arxiv.2405.05329 preprint EN arXiv (Cornell University) 2024-05-08
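
The orchestration can be pictured with the toy simulation below: the prompt is split across cooperating "processes", each appends its segment's keys and values to the cache handed over by its predecessor, and the final holder computes the first token's attention over the full cache. The single-head linear projections are stand-ins, and the paper's context-level load balancing is omitted.

    import numpy as np

    D = 16
    rng = np.random.default_rng(8)
    Wk, Wv = rng.standard_normal((D, D)), rng.standard_normal((D, D))

    def prefill_segment(cache: tuple, segment: np.ndarray) -> tuple:
        """Append this segment's keys/values to the running KV-cache."""
        K, V = cache
        K = np.vstack([K, segment @ Wk])
        V = np.vstack([V, segment @ Wv])
        return K, V

    prompt = rng.standard_normal((64, D))
    cache = (np.empty((0, D)), np.empty((0, D)))
    for segment in np.array_split(prompt, 4):   # 4 cooperating "processes"
        cache = prefill_segment(cache, segment)

    K, V = cache
    scores = (prompt[-1] @ K.T) / np.sqrt(D)    # last token attends to full cache
    attn = np.exp(scores - scores.max()); attn /= attn.sum()
    first_token_hidden = attn @ V               # input for emitting the first token
    print(K.shape, first_token_hidden.shape)    # (64, 16) (16,)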

The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that...

10.48550/arxiv.2407.14057 preprint EN arXiv (Cornell University) 2024-07-19
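
A sketch of the token-selection intuition: score prompt tokens by an attention-like relevance to the final position and keep only the top fraction for the next layer. The last-token query, the keep ratio, and the never-drop rule are illustrative simplifications of the paper's progressive, revivable pruning.

    import numpy as np

    def select_tokens(hidden: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
        """Return indices of prompt tokens to keep for the next layer."""
        q = hidden[-1]                                   # query for the first output token
        scores = hidden @ q / np.sqrt(hidden.shape[1])   # attention-like relevance
        k = max(1, int(len(hidden) * keep_ratio))
        keep = np.argsort(scores)[-k:]
        keep = np.union1d(keep, [len(hidden) - 1])       # never drop the last token
        return np.sort(keep)

    rng = np.random.default_rng(9)
    hidden = rng.standard_normal((128, 64))              # long prompt, one layer in
    idx = select_tokens(hidden, keep_ratio=0.25)
    print(len(idx), "of", len(hidden), "tokens kept for the next layer")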

Quantum computers can accurately compute ground state energies using phase estimation, but this requires a guiding state which has significant overlap with the true ground state. For large molecules and extended materials, it becomes difficult to find guiding states with good ground state overlap for growing molecule sizes. Additionally, the required number of qubits and quantum gates may become prohibitively large. One approach to dealing with these challenges is to use a quantum embedding method, which allows a reduction to one or multiple smaller quantum cores embedded in a larger...

10.48550/arxiv.2408.01940 preprint EN arXiv (Cornell University) 2024-08-04

Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advancements in mixture of expert (MoE) models, speculative decoding, and early exit strategies leverage the insight that computational demands can vary significantly based on the complexity and nature of the input. However, identifying optimal routing patterns for dynamic execution remains an open challenge, limiting the full potential...

10.48550/arxiv.2410.10846 preprint EN arXiv (Cornell University) 2024-10-01
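
One way to study such routing is an oracle like the hypothetical sketch below: run both a small and a large module per token and record the cheapest option whose loss is within a tolerance of the best. The stand-in losses and the tolerance are invented; only the oracle-routing idea is illustrated.

    import numpy as np

    rng = np.random.default_rng(10)

    def oracle_route(loss_small: np.ndarray, loss_large: np.ndarray, tol: float = 0.05):
        """Pick the small module whenever it is nearly as good as the large one."""
        best = np.minimum(loss_small, loss_large)
        return np.where(loss_small <= best + tol, "small", "large")

    loss_small = rng.uniform(0.0, 1.0, size=10)               # per-token losses (stand-ins)
    loss_large = loss_small - rng.uniform(0.0, 0.3, size=10)  # large module is never worse here
    print(oracle_route(loss_small, loss_large))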

User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, which has been largely under-leveraged in prior works. Our analysis of keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting)...

10.48550/arxiv.2409.09067 preprint EN arXiv (Cornell University) 2024-09-05
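
The length-constrained framing can be illustrated with the naive matcher below: because keywords are bounded by a maximum length, only subsequences up to that bound need scoring, with no aggregation over arbitrary lengths. The equality test over a toy phoneme stream stands in for SLiCK's actual subsequence-level matching.

    MAX_KW_LEN = 6   # maximum keyword length (in phonemes), the key constraint

    def spot_keyword(stream: list[str], keyword: list[str]) -> list[int]:
        """Return start offsets where the keyword occurs in the phoneme stream."""
        assert len(keyword) <= MAX_KW_LEN, "keyword exceeds the length bound"
        hits = []
        for start in range(len(stream) - len(keyword) + 1):
            if stream[start:start + len(keyword)] == keyword:
                hits.append(start)
        return hits

    stream = ["S", "IH", "R", "IY", "HH", "EY", "S", "IH", "R", "IY"]
    print(spot_keyword(stream, ["S", "IH", "R", "IY"]))   # [0, 6]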

The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms...

10.48550/arxiv.2409.12903 preprint EN arXiv (Cornell University) 2024-09-19
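
The function-preserving expansion idea behind such initialization can be sketched for a single linear layer: tile the small layer into a doubled one so that duplicated inputs yield duplicated outputs, and the large model starts out computing exactly what the small one did. The doubling factor and the plain linear layer are simplifications; the paper's scheme also covers attention and normalization.

    import numpy as np

    def expand_linear(W: np.ndarray) -> np.ndarray:
        """Double both dimensions of W while preserving the computed function."""
        half = W / 2.0
        return np.block([[half, half], [half, half]])

    rng = np.random.default_rng(11)
    W = rng.standard_normal((4, 3))        # small pre-trained layer
    x = rng.standard_normal(3)

    y_small = W @ x
    y_big = expand_linear(W) @ np.concatenate([x, x])   # duplicated input
    print(np.allclose(np.concatenate([y_small, y_small]), y_big))   # True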