Chieh-Hsin Lai

ORCID: 0009-0009-3059-929X
Research Areas
  • Music and Audio Processing
  • Generative Adversarial Networks and Image Synthesis
  • Speech and Audio Processing
  • Domain Adaptation and Few-Shot Learning
  • AI in cancer detection
  • Music Technology and Sound Studies
  • Anomaly Detection Techniques and Applications
  • Model Reduction and Neural Networks
  • Face recognition and analysis
  • Advanced X-ray Imaging Techniques
  • Image and Signal Denoising Methods
  • Acoustic Wave Phenomena Research
  • Cellular Automata and Applications
  • Speech Recognition and Synthesis
  • Digital Media Forensic Detection
  • Advanced Neuroimaging Techniques and Applications
  • Natural Language Processing Techniques
  • Electron and X-Ray Spectroscopy Techniques
  • Numerical methods in inverse problems
  • Differential Equations and Numerical Methods
  • Statistical Methods and Bayesian Inference
  • Machine Learning and Algorithms
  • Iterative Learning Control Systems
  • Fault Detection and Control Systems
  • Gaussian Processes and Bayesian Inference

Dexerials (Japan)
2025

Sony Computer Science Laboratories
2024

RIKEN
2023-2024

Sony Corporation (United States)
2023

Twin Cities Orthopedics
2021

University of Minnesota
2021

University of Minnesota System
2020

This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the...

10.5334/tismir.171 article EN cc-by Transactions of the International Society for Music Information Retrieval 2024-01-01

We propose a physics-aware Consistency Training (CT) method that accelerates sampling in Diffusion Models with physical constraints. Our approach leverages a two-stage strategy: (1) learning the noise-to-data mapping via CT, and (2) incorporating physics constraints as a regularizer. Experiments on toy examples show that our method generates samples in a single step while adhering to the imposed constraints. This approach has the potential to efficiently solve partial differential equations (PDEs) using deep generative modeling.

10.48550/arxiv.2502.07636 preprint EN arXiv (Cornell University) 2025-02-11
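
A minimal sketch of the two-stage idea described above, assuming an increasing noise schedule `sigmas`, a consistency network `f_theta(x, sigma)` with an EMA copy `f_ema`, and a user-supplied `pde_residual` function; all names are illustrative, not the paper's implementation.

```python
import torch

def physics_aware_ct_loss(f_theta, f_ema, x0, sigmas, i, pde_residual, lam=0.1):
    """Illustrative loss: consistency matching between adjacent noise levels
    plus a physics-residual regularizer on the one-step sample (assumed form)."""
    noise = torch.randn_like(x0)
    x_hi = x0 + sigmas[i + 1] * noise              # sample at the higher noise level
    x_lo = x0 + sigmas[i] * noise                  # same trajectory, lower noise level
    # Stage 1: standard consistency-training term against a stop-gradient EMA teacher
    ct = (f_theta(x_hi, sigmas[i + 1]) - f_ema(x_lo, sigmas[i]).detach()).pow(2).mean()
    # Stage 2: penalize violation of the physical constraint on the generated sample
    phys = pde_residual(f_theta(x_hi, sigmas[i + 1])).pow(2).mean()
    return ct + lam * phys
```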

Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary...

10.48550/arxiv.2310.02279 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01
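
A toy sketch of the kind of network the abstract describes: a single model conditioned on both the current time and the target time, so one forward pass can jump anywhere along the trajectory (setting the target time to 0 recovers a consistency-model-style jump straight to data). The architecture and time values are illustrative only.

```python
import torch
import torch.nn as nn

class TrajectoryNet(nn.Module):
    """Toy stand-in for a trajectory model g(x_t, t, s): it predicts the point
    at time s on the PF-ODE trajectory passing through x_t at time t."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, s):
        t = t.expand(x_t.shape[0], 1)              # current time
        s = s.expand(x_t.shape[0], 1)              # target time
        return self.net(torch.cat([x_t, t, s], dim=-1))

# One-step generation (s = 0) versus a two-hop traversal t -> 0.5 -> 0
model = TrajectoryNet()
x_T = torch.randn(16, 2)
one_step = model(x_T, torch.tensor([[1.0]]), torch.tensor([[0.0]]))
x_mid = model(x_T, torch.tensor([[1.0]]), torch.tensor([[0.5]]))
two_step = model(x_mid, torch.tensor([[0.5]]), torch.tensor([[0.0]]))
```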

Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation of music contains two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they rely on sufficiently diverse and numerous pairs of reverberant observations and retrieved data for training in order to be generalizable to unseen observations during inference. To...

10.1109/icassp49357.2023.10095761 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

We propose a neural network for unsupervised anomaly detection with a novel robust subspace recovery layer (RSR layer). This layer seeks to extract the underlying subspace from a latent representation of the given data and removes outliers that lie away from this subspace. It is used within an autoencoder. The encoder maps the data into a latent space, from which the RSR layer extracts the subspace. The decoder then smoothly maps back the underlying subspace to a "manifold" close to the original inliers. Inliers and outliers are distinguished according to the distances between the original and mapped positions (small for inliers and large for outliers)...

10.48550/arxiv.1904.00152 preprint EN other-oa arXiv (Cornell University) 2019-01-01
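
A minimal sketch of an autoencoder with a linear RSR layer, under assumed toy MLP sizes; the loss combines a robust (unsquared l2) reconstruction term, a distance-to-subspace term, and an orthogonality penalty on the projection, which conveys the flavor of the objective rather than its exact form.

```python
import torch
import torch.nn as nn

class RSRAE(nn.Module):
    """Minimal autoencoder with a linear RSR layer projecting the latent
    representation onto a learned low-dimensional subspace (illustrative sizes)."""
    def __init__(self, in_dim=100, lat_dim=32, sub_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, lat_dim))
        self.A = nn.Parameter(torch.randn(sub_dim, lat_dim) / lat_dim ** 0.5)  # RSR layer
        self.dec = nn.Sequential(nn.Linear(sub_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.enc(x)                 # latent representation
        z_sub = z @ self.A.T            # project onto the learned subspace
        return self.dec(z_sub), z, z_sub

def rsrae_loss(model, x, lam1=0.1, lam2=0.1):
    x_hat, z, z_sub = model(x)
    recon = (x - x_hat).norm(dim=1).mean()                # robust reconstruction term
    proj = (z - z_sub @ model.A).norm(dim=1).mean()       # distance to the subspace
    eye = torch.eye(model.A.shape[0], device=x.device)
    ortho = ((model.A @ model.A.T - eye) ** 2).sum()      # keep the projection row-orthonormal
    return recon + lam1 * proj + lam2 * ortho
```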

Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measurement operator is unknown. GibbsDDRM constructs a joint distribution of the data, measurements, and linear operator by using a pre-trained diffusion model for the data prior, and it solves...

10.48550/arxiv.2301.12686 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01
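
A schematic of the alternating (Gibbs-style) structure the abstract alludes to for the blind setting y = Hx + noise. Both conditional samplers are placeholders, and the actual method interleaves these updates inside the diffusion sampler rather than running them as an outer loop.

```python
def gibbs_blind_restore(y, sample_x_given_H, sample_H_given_x, H_init, n_sweeps=20):
    """Illustrative alternation for a blind linear inverse problem, assuming two
    user-supplied conditional samplers:
      sample_x_given_H(y, H) -> x   (e.g., a diffusion-prior posterior sampler)
      sample_H_given_x(y, x) -> H   (e.g., a posterior step over operator parameters)"""
    H = H_init
    x = None
    for _ in range(n_sweeps):
        x = sample_x_given_H(y, H)   # restore the signal under the current operator
        H = sample_H_given_x(y, x)   # refine the operator given the current signal
    return x, H
```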

To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the generator beyond the original resolution. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data into a structured latent space by solving the PF-ODE forward...

10.48550/arxiv.2405.14822 preprint EN arXiv (Cornell University) 2024-05-23

Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping...

10.48550/arxiv.2405.17251 preprint EN arXiv (Cornell University) 2024-05-27

In many physical systems, inputs related by intrinsic system symmetries are mapped to the same output. When inverting such systems, i.e., solving the associated inverse problems, there is no unique solution. This causes fundamental difficulties for deploying the emerging end-to-end deep learning approach. Using the generalized phase retrieval problem as an illustrative example, we show that careful symmetry breaking on the training data can help get rid of the difficulties and significantly improve the learning performance. We also extract...

10.48550/arxiv.2003.09077 preprint EN other-oa arXiv (Cornell University) 2020-01-01
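
An assumed, concrete instance of symmetry breaking for real-valued phase retrieval, where y = |Ax| is invariant to the sign flip x -> -x: each training target is flipped into a fixed half-space before supervised training. The choice of reference direction is illustrative, not prescribed by the paper.

```python
import numpy as np

def break_sign_symmetry(X, ref=None):
    """Canonicalize training targets X (rows) so that each has a nonnegative
    inner product with a reference direction, removing the x ~ -x ambiguity."""
    if ref is None:
        ref = np.zeros(X.shape[1])
        ref[0] = 1.0
    signs = np.sign(X @ ref)
    signs[signs == 0] = 1.0
    return X * signs[:, None]

# Usage: targets are canonicalized before training a regressor on (|A x|, x) pairs.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16))
X_canon = break_sign_symmetry(X)
assert (X_canon[:, 0] >= 0).all()
```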

One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend that the quantization is stochastic at the initial stage...

10.48550/arxiv.2205.07547 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01
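
A hedged sketch of stochastic quantization in the spirit described above: instead of nearest-neighbour assignment, a code index is sampled from a categorical whose logits are scaled negative distances to the codebook entries; as the variance shrinks, this approaches deterministic VQ. Names and the exact parameterization are assumptions.

```python
import torch
import torch.nn.functional as F

def stochastic_quantize(z, codebook, log_var):
    """Sample code indices from a distance-based categorical.
    z: (B, D) latents, codebook: (K, D), log_var: scalar temperature parameter."""
    d2 = torch.cdist(z, codebook) ** 2                 # (B, K) squared distances
    logits = -0.5 * d2 / log_var.exp()
    probs = F.softmax(logits, dim=-1)
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)
    z_q = codebook[idx]                                # quantized latents
    return z_q, idx, probs

codebook = torch.randn(512, 64)
z = torch.randn(8, 64)
z_q, idx, probs = stochastic_quantize(z, codebook, log_var=torch.tensor(0.0))
```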

Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution. We derive metrizable conditions, sufficient conditions for the discriminator to serve as a distance between the distributions, by connecting the GAN formulation with the concept of sliced optimal transport. Furthermore, by leveraging these theoretical results, we propose...

10.48550/arxiv.2301.12811 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01
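
For reference, a Monte-Carlo estimate of the sliced 1-Wasserstein distance, the sliced-optimal-transport quantity the analysis connects the discriminator to; this is background, not the paper's proposed training procedure.

```python
import torch

def sliced_wasserstein(x, y, n_proj=128):
    """Project two equally sized sample sets onto random unit directions and
    average the 1-D Wasserstein distances (sorted-difference form)."""
    d = x.shape[1]
    theta = torch.randn(n_proj, d)
    theta = theta / theta.norm(dim=1, keepdim=True)    # random unit directions
    px = torch.sort(x @ theta.T, dim=0).values         # sorted 1-D projections
    py = torch.sort(y @ theta.T, dim=0).values
    return (px - py).abs().mean()

x = torch.randn(256, 10)
y = torch.randn(256, 10) + 1.0
print(sliced_wasserstein(x, y))
```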

Image generative models can learn the distributions of the training data and consequently generate examples by sampling from these distributions. However, when the training dataset is corrupted with outliers, generative models will likely produce examples that are also similar to the outliers. In fact, a small portion of outliers may induce state-of-the-art generative models, such as Vector Quantized-Variational AutoEncoder (VQ-VAE), to learn a significant mode from the outliers. To mitigate this problem, we propose a robust generative model based on VQ-VAE, which we name Robust VQ-VAE (RVQ-VAE)...

10.48550/arxiv.2202.01987 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel...

10.48550/arxiv.2401.00365 preprint EN cc-by-nc-sa arXiv (Cornell University) 2024-01-01
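
A small diagnostic commonly used when discussing codebook collapse (general practice, not specific to this paper): the perplexity of the empirical code-usage distribution, which falls far below the codebook size when only a few codes are active.

```python
import torch

def codebook_perplexity(indices, codebook_size):
    """Perplexity of the empirical distribution of code assignments."""
    counts = torch.bincount(indices.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    entropy = -(probs * (probs + 1e-10).log()).sum()
    return entropy.exp()

# Example: 10k assignments drawn from only 8 of 512 entries -> low perplexity
idx = torch.randint(0, 8, (10_000,))
print(codebook_perplexity(idx, 512))   # roughly 8, far below 512
```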

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify potential issues that degrade the performance of current DPS-based methods and introduce a way to mitigate them, inspired by diverse guidance...

10.1109/icassp48485.2024.10446423 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
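
A minimal sketch of a DPS-style update for a restoration task, assuming placeholder callables for the degradation operator, the denoiser's clean-signal estimate, and the unconditional reverse step; the guidance weighting is deliberately simplified.

```python
import torch

def dps_guidance_step(x_t, y, degrade, x0_from, prior_step, zeta=1.0):
    """Take an unconditional reverse-diffusion step, then nudge it along the
    gradient of a measurement-consistency term evaluated at the clean estimate."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = x0_from(x_t)                              # model's estimate of the clean signal
    residual = (y - degrade(x0_hat)).flatten(1).norm(dim=1).sum()
    grad = torch.autograd.grad(residual, x_t)[0]
    x_prev = prior_step(x_t).detach()                  # unconditional reverse step
    return x_prev - zeta * grad                        # measurement-consistency correction
```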

Multimodal representation learning to integrate different modalities, such as text, vision, and audio, is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is a key concept in multimodal representation learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of pointwise mutual information and show that encoders that achieve the optimal similarity in pretraining are good for downstream classification tasks under mild assumptions. Based on our results, we also propose a new metric...

10.48550/arxiv.2404.19228 preprint EN arXiv (Cornell University) 2024-04-29
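
For context, the symmetric InfoNCE objective the abstract analyzes, in its standard CLIP-style form: cosine-similarity logits over all image/text pairs in a batch, with cross-entropy in both directions and matched pairs on the diagonal as positives.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric (CLIP-style) InfoNCE loss over a batch of paired embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(img.shape[0], device=img.device)
    loss_i2t = F.cross_entropy(logits, labels)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, labels)       # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```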

Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model...

10.48550/arxiv.2405.18503 preprint EN arXiv (Cornell University) 2024-05-28

Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument characteristics of a source, and the collection of such blocks forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent...

10.48550/arxiv.2408.10807 preprint EN arXiv (Cornell University) 2024-08-20

Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly...

10.48550/arxiv.2410.05116 preprint EN arXiv (Cornell University) 2024-10-07

Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like $\tau$-leaping accelerate this process, they introduce $\textit{Compounding Decoding Error}$ (CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. In this work, we present $\textit{Jump Your Steps}$...

10.48550/arxiv.2410.07761 preprint EN arXiv (Cornell University) 2024-10-10

Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a Gaussian prior. During inference, one model is designated as the source model to map the input audio to its corresponding Gaussian prior, and another is designated as the target model to reconstruct the target audio from this prior, thereby...

10.48550/arxiv.2409.06096 preprint EN arXiv (Cornell University) 2024-09-09
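
A schematic of the dual-bridge inference described above, with both bridge routines left as placeholders rather than the paper's API: the source-instrument model encodes the input to the shared Gaussian prior, and the target-instrument model decodes from that prior.

```python
def transfer_timbre(x_src, encode_to_prior, decode_from_prior):
    """Illustrative two-stage transfer through a shared Gaussian prior.
      encode_to_prior(x)   -- run the source-instrument bridge forward to the prior
      decode_from_prior(z) -- run the target-instrument bridge in reverse to audio features"""
    z = encode_to_prior(x_src)       # source instrument -> shared Gaussian prior
    x_tgt = decode_from_prior(z)     # shared prior -> target instrument
    return x_tgt
```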