- Speech Recognition and Synthesis
- Topic Modeling
- Advanced Neural Network Applications
- Natural Language Processing Techniques
- Music and Audio Processing
- Speech and Audio Processing
- Domain Adaptation and Few-Shot Learning
- Sparse and Compressive Sensing Techniques
- Advanced Text Analysis Techniques
- Adversarial Robustness in Machine Learning
- Machine Learning and ELM
- Blind Source Separation Techniques
- Neural Networks and Applications
- Speech and dialogue systems
- Asian Culture and Media Studies
- Quantum Computing Algorithms and Architecture
- Quantum Information and Cryptography
- Information Retrieval and Search Behavior
- Text and Document Classification Technologies
- Multimodal Machine Learning Applications
- COVID-19 diagnosis using AI
- Advancements in Photolithography Techniques
- Computational Physics and Python Applications
- Parallel Computing and Optimization Techniques
- Seismic Imaging and Inversion Techniques
Apple (United Kingdom)
2023-2025
Apple (United States)
2024-2025
IBM (United States)
2020
IBM Research - Austin
2020
The University of Texas at Austin
2006
Spotting user-defined/flexible keywords represented as text frequently relies on an expensive text encoder for joint analysis with the audio embedding space, which can suffer from heterogeneous modality representation (i.e., a large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder: it inherently has a homogeneous embedding space, and it is also much smaller than a comparable text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme...
Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM into a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This...
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory and bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into...
In this paper, an algorithm for scan vector ordering, PEAKASO, is proposed to minimize the peak temperature during testing. Given a circuit with scan chains and test vectors, hotspots are predicted by window-based power analysis. The peak temperature on hotspots is minimized by global ordering, which expedites heat dissipation to the ambient air through a large thermal gradient. Further reduction is achieved by local reordering based on overheat precompensation. As output, PEAKASO provides a vector order with lower peak temperature. Note that the vectors themselves are not changed at...
The size of LLMs (i.e., billions of parameters) requires highly effective compression techniques to fit into storage-limited devices. Among the many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression, and it is supported by modern smartphones. Yet, its training overhead is prohibitively significant for LLM fine-tuning. In particular, Differentiable KMeans Clustering, or DKM, has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory...
We present a novel multi-scale embedding scheme that links conventional QM/MM embedding and bootstrap embedding (BE) to allow simulations of large chemical systems on limited quantum devices. We also propose a mixed-basis BE scheme that facilitates BE calculations of extended systems using classical computers with limited memory resources. Benchmark data suggest the combination of these two strategies as a robust path toward attaining correlation energies for realistic systems, combining the proven accuracy of BE on systems of biological interest with a lower computational cost. Due...
This report provides an overview of recent work that harnesses the Big Data Revolution and Large Scale Computing to address grand computational challenges in Multi-Messenger Astrophysics, with a particular emphasis on real-time discovery campaigns. Acknowledging the transdisciplinary nature of this endeavor, this document has been prepared by members of the physics, astronomy, computer science, data science, software, and cyberinfrastructure communities who attended the NSF-, DOE-, and NVIDIA-funded "Deep Learning for Multi-Messenger Astrophysics: Real-time...
Deep neural networks (DNNs) have achieved significant success in a variety of real-world applications, e.g., image classification. However, the huge number of parameters in the networks restricts their efficiency due to large model size and intensive computation. To address this issue, various approximation techniques have been investigated, which seek a lightweight network with little performance degradation in exchange for a smaller model size or faster inference. Both low-rankness and sparsity are appealing properties for network approximation. In this paper we...
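A minimal sketch of the low-rank-plus-sparse idea the abstract above refers to, on a single weight matrix (illustrative only, not the paper's algorithm): the matrix is split into a truncated-SVD low-rank part plus a sparse matrix that keeps the largest residual entries.

```python
import numpy as np

def lowrank_plus_sparse(W, rank, sparsity):
    """Approximate W as L + S: L is the rank-`rank` truncated SVD of W,
    and S keeps only the largest-magnitude entries of the residual W - L."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    R = W - L
    k = int(sparsity * R.size)              # number of residual entries to keep
    S = np.zeros_like(R)
    if k > 0:
        flat = np.argsort(np.abs(R), axis=None)[-k:]
        idx = np.unravel_index(flat, R.shape)
        S[idx] = R[idx]
    return L, S

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
L, S = lowrank_plus_sparse(W, rank=8, sparsity=0.05)
err_lr = np.linalg.norm(W - L) / np.linalg.norm(W)
err_both = np.linalg.norm(W - L - S) / np.linalg.norm(W)
```

Combining both structures lets the sparse term absorb the few large entries that a pure low-rank factorization handles poorly.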
As deep neural networks become more complex and input datasets grow larger, it can take days or even weeks to train a network to the desired accuracy. Therefore, distributed Deep Learning at massive scale is a critical capability, since it offers the potential to reduce training time from days or weeks to hours. In this paper, we present a software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs. The core algorithm is a multi-ring communication pattern that provides a good tradeoff between latency...
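The ring communication pattern mentioned above can be sketched as a simulated single-ring all-reduce (the paper's system uses a multi-ring variant; this is just the textbook ring, not their implementation):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulated ring all-reduce over n workers: a reduce-scatter pass
    (n-1 steps) leaves each worker owning the full sum of one chunk,
    then an all-gather pass (n-1 steps) circulates the completed chunks.
    Each step moves only 1/n of the data per worker, which is the
    latency/bandwidth trade-off a ring pattern buys."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]
    for step in range(n - 1):                     # reduce-scatter
        for i in range(n):
            c = (i - step) % n                    # chunk worker i sends
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
    for step in range(n - 1):                     # all-gather
        for i in range(n):
            c = (i + 1 - step) % n                # completed chunk to pass on
            chunks[(i + 1) % n][c] = chunks[i][c].copy()
    return [np.concatenate(ch) for ch in chunks]

rng = np.random.default_rng(0)
grads = [rng.standard_normal(10) for _ in range(4)]   # 4 simulated workers
reduced = ring_allreduce(grads)
```

After both passes every worker holds the elementwise sum of all gradients, which is the invariant a real all-reduce guarantees.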
Knowing the similarity between sets of data has a number of positive implications for training an effective model, such as assisting an informed selection out of known datasets favorable to model transfer, or data augmentation for problems with an unknown dataset. Common practices to estimate the similarity include comparing in the original sample space, comparing in the embedding space of a model performing a certain task, or fine-tuning a pretrained model with different datasets and evaluating the performance changes therefrom. However, these would suffer from shallow comparisons, task-specific...
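One standard way to compare two datasets in a sample or embedding space, as the abstract above discusses, is Maximum Mean Discrepancy (a common baseline, not necessarily the method this paper proposes):

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel: compares the kernel mean
    embeddings of two sample sets; larger values mean the distributions
    (and hence the datasets) look less similar."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 2))
B = rng.standard_normal((100, 2))          # same distribution as A
C = rng.standard_normal((100, 2)) + 3.0    # shifted distribution
```

Similar datasets (A, B) yield a near-zero MMD, while the shifted dataset C is clearly separated, so the score can rank candidate datasets for transfer or augmentation.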
Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering for DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regularizers and parameters, DKM-based compression keeps...
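The "clustering as attention" step can be sketched on scalar weights (a simplified illustration; real DKM operates on full weight tensors inside the training loop): assignments become a softmax over negative weight-centroid distances, so both weights and centroids stay differentiable.

```python
import numpy as np

def dkm_soft_cluster(weights, centroids, temperature=0.05, iters=20):
    """Attention-style soft k-means: the assignment matrix is a softmax
    over -|w - c| / temperature, and centroids are the attention-weighted
    means of the weights."""
    w = weights.reshape(-1, 1).astype(float)
    c = centroids.astype(float).copy()
    for _ in range(iters):
        logits = -np.abs(w - c) / temperature        # (n, k) similarity
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)            # soft assignments
        c = (a * w).sum(axis=0) / a.sum(axis=0)      # centroid update
    return (a * c).sum(axis=1), c                    # soft-quantized weights

w = np.array([0.11, 0.09, 0.10, 0.52, 0.48, 0.50])
q, c = dkm_soft_cluster(w, centroids=np.array([0.0, 0.6]))
```

Because every operation is smooth, gradients can flow through the quantized weights back to the original parameters, which is what enables joint train-time optimization.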
Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the learning capacity of such streaming models (i.e., by adding more parameters) to improve their predictive power may not be viable for real-world tasks. In this work, we propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames. More specifically, our SAL and its focal variations dynamically...
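A focal-style frame weighting, in the spirit of the "focal variations" mentioned above, can be sketched as follows (illustrative only; the abstract does not specify SAL's exact weighting, so this is plain focal-modulated frame-wise cross-entropy):

```python
import numpy as np

def focal_frame_loss(probs, targets, gamma=2.0):
    """Per-frame binary cross-entropy scaled by a focal term
    (1 - p_t)**gamma, which shrinks the loss on frames the model already
    predicts well so the gradient signal concentrates on hard frames."""
    eps = 1e-8
    p_t = np.where(targets == 1, probs, 1.0 - probs)  # prob of true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)

# An easy frame (p=0.95) contributes far less than a hard one (p=0.55).
losses = focal_frame_loss(np.array([0.95, 0.55]), np.array([1, 1]))
```

With gamma = 0 the modulation vanishes and the loss reduces to ordinary frame-wise cross-entropy.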
Large Language Model or LLM inference has two phases: the prompt (or prefill) phase to output the first token, and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits...
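Why chunked KV-cache population is even possible: with causal attention, a prompt chunk only needs the keys/values of earlier chunks, so chunks can be handed to different processes. A toy single-head check that chunked prefill reproduces one-shot prefill (a sketch of the enabling property, not KV-Runahead's actual orchestration):

```python
import numpy as np

def causal_attn(Q, K, V, offset):
    """Single-head causal attention: local query row i may attend to
    global key positions 0 .. offset + i."""
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        s = (K[: offset + i + 1] @ Q[i]) / np.sqrt(K.shape[1])
        a = np.exp(s - s.max())
        a /= a.sum()
        out[i] = a @ V[: offset + i + 1]
    return out

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 4))            # toy prompt: Q = K = V = X

full = causal_attn(X, X, X, offset=0)       # one-shot prefill

# Chunked prefill: each chunk extends the KV-cache, then attends to the
# cache of all earlier chunks plus itself.
outs, K_cache, V_cache = [], np.empty((0, 4)), np.empty((0, 4))
for chunk in np.array_split(X, 3):
    K_cache = np.vstack([K_cache, chunk])
    V_cache = np.vstack([V_cache, chunk])
    outs.append(causal_attn(chunk, K_cache, V_cache,
                            offset=K_cache.shape[0] - chunk.shape[0]))
chunked = np.vstack(outs)
```

Because the two computations agree exactly, the prefill work can be pipelined across processes without changing the first token that gets generated.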
The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that...
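The selection step behind attention-based prompt-token pruning can be sketched as follows (a simplified illustration; LazyLLM's actual criterion and its progressive, per-layer schedule may differ):

```python
import numpy as np

def prune_prompt_tokens(scores, keep_ratio=0.5):
    """Keep the prompt-token positions with the highest importance
    scores, in their original order; later layers then run on the
    shorter sequence, cutting prefill compute."""
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[-k:])

# Toy per-token importance (e.g., attention mass from the final token).
scores = np.array([0.01, 0.40, 0.02, 0.30, 0.05, 0.22])
keep = prune_prompt_tokens(scores, keep_ratio=0.5)   # positions [1, 3, 5]
```

Only the KV entries at the kept positions need to be computed for the pruned layers, which is where the time-to-first-token savings come from.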
Quantum computers can accurately compute ground state energies using phase estimation, but this requires a guiding state which has significant overlap with the true ground state. For large molecules and extended materials, it becomes difficult to find guiding states with good overlap as molecule sizes grow. Additionally, the required number of qubits and quantum gates may become prohibitively large. One approach for dealing with these challenges is to use a quantum embedding method, which allows a reduction to one or multiple smaller quantum cores embedded in a larger...
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advancements in mixture-of-experts (MoE) models, speculative decoding, and early-exit strategies leverage the insight that computational demands can vary significantly based on the complexity and nature of the input. However, identifying optimal routing patterns for dynamic execution remains an open challenge, limiting the full potential...
User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, which has been largely under-leveraged in prior works. Our analysis of the keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting)...
The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small models are less expensive to train, but they cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of...
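One known way to initialize a larger model from a smaller trained one is function-preserving width expansion in the style of Net2WiderNet (shown here as a sketch on a two-layer ReLU network; it is not necessarily the scheme this paper proposes):

```python
import numpy as np

def widen_linear(W1, W2, new_width, rng):
    """Grow the hidden layer of y = W2 @ relu(W1 @ x) to `new_width`
    units by duplicating randomly chosen units and splitting their
    outgoing weights among the copies, so the widened network computes
    exactly the same function as the small one."""
    old = W1.shape[0]
    idx = np.concatenate([np.arange(old),
                          rng.integers(0, old, new_width - old)])
    counts = np.bincount(idx, minlength=old)   # copies per original unit
    return W1[idx], W2[:, idx] / counts[idx]

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
W1_big, W2_big = widen_linear(W1, W2, new_width=7, rng=rng)

x = rng.standard_normal(3)
y_small = W2 @ np.maximum(W1 @ x, 0.0)
y_big = W2_big @ np.maximum(W1_big @ x, 0.0)
```

The widened network starts from the small model's accuracy rather than from random initialization, which is the benefit such initialization schemes aim for.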