- Speech and Audio Processing
- Speech Recognition and Synthesis
- Music and Audio Processing
- Natural Language Processing Techniques
- Advanced Adaptive Filtering Techniques
- Topic Modeling
- Blind Source Separation Techniques
- Speech and dialogue systems
- Advanced Data Compression Techniques
- Hearing Loss and Rehabilitation
- Wireless Communication Networks Research
- Digital Media Forensic Detection
- Advanced Wireless Communication Techniques
- Advanced Steganography and Watermarking Techniques
- Advanced Text Analysis Techniques
- Quantum Chromodynamics and Particle Interactions
- Text Readability and Simplification
- Particle physics theoretical and experimental studies
- Healthcare and Venom Research
- Image and Signal Denoising Methods
- Phonetics and Phonology Research
- Veterinary Orthopedics and Neurology
- Structural Health Monitoring Techniques
- Indoor and Outdoor Localization Technologies
- Video Analysis and Summarization
Seoul National University
2016-2025
Seoul Media Institute of Technology
2005-2025
Indiana University Bloomington
2022
Brookhaven National Laboratory
2022
Voith (United States)
2020
Inha University
2020
Jeonbuk National University
2015
Soonchunhyang University
2004-2012
Hanyang University Seoul Hospital
2011
Seoul National University of Science and Technology
2008
In this letter, we develop a robust voice activity detector (VAD) for the application to variable-rate speech coding. The developed VAD employs the decision-directed parameter estimation method for the likelihood ratio test. In addition, we propose an effective hang-over scheme which considers the previous observations by a first-order Markov process modeling of speech occurrences. According to our simulation results, the proposed VAD shows significantly better performances than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular...
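The likelihood ratio test with decision-directed estimation mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lrt_vad`, the smoothing factor `alpha`, the `threshold` value, and the use of a Wiener-style gain to estimate the previous clean-speech power are all hypothetical choices, the noise power is assumed known and stationary, and the hang-over scheme is omitted.

```python
import math

def lrt_vad(power_frames, noise_power, alpha=0.98, threshold=0.5):
    """Return one speech/non-speech decision per frame.

    power_frames: list of frames, each a list of per-bin power values |X_k|^2.
    noise_power:  per-bin noise power (assumed known; hypothetical input).
    """
    prev_clean_power = [0.0] * len(noise_power)
    decisions = []
    for frame in power_frames:
        llr_sum = 0.0
        new_clean_power = []
        for k, (p, lam) in enumerate(zip(frame, noise_power)):
            gamma = p / lam  # a posteriori SNR
            # Decision-directed a priori SNR estimate: blend the previous
            # clean-speech estimate with the current instantaneous SNR.
            xi = alpha * prev_clean_power[k] / lam + (1.0 - alpha) * max(gamma - 1.0, 0.0)
            # Per-bin log likelihood ratio under a Gaussian model.
            llr_sum += gamma * xi / (1.0 + xi) - math.log1p(xi)
            # Wiener-style gain used here as a simple clean-power estimate
            # (an assumption made to keep the sketch short).
            gain = xi / (1.0 + xi)
            new_clean_power.append(gain * gain * p)
        prev_clean_power = new_clean_power
        # Compare the mean log likelihood ratio over bins with a threshold.
        decisions.append(llr_sum / len(frame) > threshold)
    return decisions
```

Calling `lrt_vad(frames, noise)` with frames whose power matches the noise floor yields `False`, while frames well above the floor yield `True`.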
One of the key issues in practical speech processing is to achieve robust voice activity detection (VAD) against background noise. Most statistical model-based approaches have tried to employ the Gaussian assumption in the discrete Fourier transform (DFT) domain, which, however, deviates from the real observation. In this paper, we propose a class of VAD algorithms based on several statistical models. In addition to the Gaussian model, we also incorporate the complex Laplacian and Gamma probability density functions in our analysis of statistical properties. With...
Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvements to their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps. In order to learn the distribution...
In this letter, we propose a novel speech enhancement technique based on global soft decision. The proposed approach provides a unified framework for such procedures as speech absence probability (SAP) computation, spectral gain modification, and noise spectrum estimation using the same statistical model assumption. Performances of the proposed algorithm are evaluated by subjective tests under various environments and show better results compared with the IS-127 standard method.
Non-negative matrix factorization (NMF) is one of the most well-known techniques that are applied to separate a desired source from mixture data. In the NMF framework, a collection of data is factorized into a basis matrix and an encoding matrix. The basis matrix for mixture data is usually constructed by augmenting the basis matrices for independent sources. However, target source separation with the concatenated basis matrix turns out to be problematic if there exists some overlap between the subspaces that the bases for the individual sources span. In this letter, we propose a novel approach to improve...
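The basic factorization into a basis matrix and an encoding matrix described above can be sketched as follows. This is only a plain NMF illustration, not the approach proposed in the letter: it assumes the Euclidean cost with the standard multiplicative update rules (the abstract does not name a solver), and the helper names `matmul`, `transpose`, `frob_error`, and `nmf` are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into basis W (n x rank) and encoding H (rank x m)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random initialization.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h
```

In a separation setting, the basis matrices learned for each source would be concatenated column-wise before estimating the encoding of a mixture, which is the step the abstract identifies as problematic when the source subspaces overlap.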
Ethics regarding social bias has recently raised striking issues in natural language processing. Especially for gender-related topics, the need for a system that reduces model bias has grown in areas such as image captioning, content recommendation, and automated employment. However, detection and evaluation of gender bias in machine translation systems are not yet thoroughly investigated, the task being cross-lingual and challenging to define. In this paper, we propose a scheme for making up a test set that evaluates the gender bias in a machine translation system, with Korean,...
In this letter, we propose a new statistical model, the two-sided generalized gamma distribution (GΓD), for an efficient parametric characterization of speech spectra. GΓD forms a generalized class of parametric distributions, including the Gaussian, Laplacian, and Gamma probability density functions (pdfs) as special cases. We also propose a computationally inexpensive online maximum likelihood (ML) parameter estimation algorithm for GΓD. Likelihoods, coefficients of variation (CVs), and Kolmogorov-Smirnov (KS) tests...
This letter presents a speech enhancement technique combining statistical models and non-negative matrix factorization (NMF) with on-line update of noise bases. The statistical model-based enhancement methods have been known to be less effective for non-stationary noises, while the template-based enhancement techniques can deal with them quite well. However, the template-based techniques usually rely on a priori information. To overcome the shortcomings of both approaches, we propose a novel speech enhancement method that combines the statistical model-based enhancement scheme with the NMF-based gain function. For better performance in...
Flow-based generative models are composed of invertible transformations between two random variables of the same dimension. Therefore, flow-based models cannot be adequately trained if the dimension of the data distribution does not match that of the underlying target distribution. In this paper, we propose SoftFlow, a probabilistic framework for training normalizing flows on manifolds. To sidestep the dimension mismatch problem, SoftFlow estimates a conditional distribution of the perturbed input data instead of learning the data distribution directly. We experimentally show that SoftFlow can...
The majority of recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention mechanisms. Convolutional layers of these models typically have a fixed kernel size, e.g., 3 or 5. In this study, we further contribute to this line of research utilising a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional layer to adaptively select the kernel size in a data-driven fashion. It is based on an attention mechanism which exploits both the frequency and channel domain. We first...
We present a novel training method for small-scale RNN-T models, which are widely used in real-world speech recognition applications. Despite efforts to scale down the models for edge devices, the demand for even smaller and more compact speech recognition models persists to accommodate a broader range of devices. In this letter, we propose Sampling-based Pruned Knowledge Distillation (SP-KD) for training lightweight RNN-T models. In contrast to conventional knowledge distillation techniques, the proposed method enables the student model to distill knowledge from the distribution of the teacher model, which is...
In this letter, we propose a novel approach to voice activity detection (VAD) based on the modified maximum a posteriori (MAP) criterion conditioned on the voice activity decision made in the previous frame. To exploit the inter-frame correlation of voice activity, the probability of voice presence conditioned on both the observed spectrum and the voice activity decision in the previous frame is employed instead of the conventional strategy that depends only on the current observation. The proposed conditional MAP criterion incorporating temporal correlations leads to two separate thresholds for the likelihood ratio test (LRT)...
The knowledge distillation (KD) approach is widely used in the deep learning field, mainly for model size reduction. KD utilizes the soft labels of a teacher model, which contain the dark knowledge that the one-hot ground-truth does not have. This knowledge can improve the performance of an already saturated student model. In the case of multiple-teacher models, generally, the same weighted average (interpolated training) of the multiple teachers' soft labels is applied to KD training. However, if the knowledge characteristics among teachers are somewhat different, the interpolated...
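The interpolated multiple-teacher training described above can be sketched as follows: each teacher's logits are softened with a temperature, the resulting soft labels are averaged with fixed weights, and the student is trained with a cross-entropy loss against that average. This is a minimal sketch of the conventional baseline the abstract refers to, not the method the paper goes on to propose; the function names, the temperature of 2.0, and the example logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate_teachers(teacher_logits, weights, temperature=2.0):
    """Weighted average of the teachers' soft labels (weights should sum to 1)."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)]

def kd_loss(student_logits, soft_targets, temperature=2.0):
    """Cross-entropy between the interpolated soft labels and the student output."""
    q = softmax(student_logits, temperature)
    return -sum(t * math.log(qc) for t, qc in zip(soft_targets, q))

# Two hypothetical teachers over three classes, equally weighted.
targets = interpolate_teachers([[2.0, 0.0, -1.0], [1.0, 1.0, -2.0]], [0.5, 0.5])
loss = kd_loss([2.0, 0.5, -1.5], targets)
```

In practice this term is usually combined with the ordinary cross-entropy against the one-hot ground truth; that combination is left out here for brevity.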
Recently, notable improvements in the voice activity detection (VAD) problem have been achieved by adopting several machine learning techniques. Among them, the deep neural network (DNN), which learns the mapping between the noisy speech features and the corresponding voice activity status with its deep hidden structure, has been one of the most popular techniques. In this letter, we propose a novel approach which enhances the robustness of the DNN in mismatched noise conditions with a multi-task learning (MTL) framework. In the proposed algorithm, a feature enhancement task for the DNN is jointly...
In recent years, there has been a great deal of research in developing end-to-end speech recognition models, which enable simplifying the traditional pipeline and achieving promising results. Despite their remarkable performance improvements, end-to-end models typically require expensive computational cost to show successful performance. To reduce this computational burden, knowledge distillation (KD), which is a popular model compression method, has been used to transfer knowledge from a deep and complex model (teacher) to a shallower and simpler model (student)....
Depression and suicide are critical social problems worldwide, but tools to objectively diagnose them are lacking. Therefore, this study aimed to diagnose depression through machine learning and to determine whether it is possible to identify groups at high risk of suicide through words spoken by the participants in a semi-structured interview. A total 83 healthy depressed patients were recruited. All participants were recorded during the Mini-International Neuropsychiatric Interview. Through assessment from the interview items, patients with depression were classified into...
In this letter, we propose a robust approach to time-delay estimation for acoustic indoor localization. Particularly, we focus on localization that works in real environments, which have been seldom addressed in previous studies. In actual environments, it is difficult to estimate the correct time delays due to the multipath signals caused by reverberation. The proposed algorithm minimizes the effect of multipaths and determines the target position based on a novel reliability measure. Experiments conducted in both room and simulated...
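For context, the baseline form of time-delay estimation that such work builds on can be sketched as a search for the lag that maximizes the cross-correlation between two received signals. This is a plain cross-correlation sketch under idealized assumptions (no reverberation, integer-sample delay); it does not include the multipath handling or the reliability measure proposed in the letter, and the function name `estimate_delay` and its parameters are hypothetical.

```python
def estimate_delay(reference, delayed, max_lag):
    """Return the integer lag (in samples) that maximizes the cross-correlation.

    A positive result means `delayed` lags behind `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(reference):
            m = n + lag
            if 0 <= m < len(delayed):
                score += x * delayed[m]
        # Keep the first lag that achieves the highest correlation.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A short pulse and a copy of it shifted by three samples.
pulse = [0.0, 0.0, 1.0, 3.0, -2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0] * 3 + pulse[:-3]
lag = estimate_delay(pulse, shifted, max_lag=5)
```

With several microphones, pairwise delays estimated this way would feed a position solver; in reverberant rooms the correlation peak can belong to a reflected path, which is the difficulty the abstract points to.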
Recently, generative adversarial networks (GANs) have been successfully applied to speech enhancement. However, there still remain two issues that need to be addressed: (1) GAN-based training is typically unstable due to its non-convex property, and (2) most of the conventional methods do not fully take advantage of the speech characteristics, which could result in a sub-optimal solution. In order to deal with these problems, we propose a progressive generator that can handle the speech in a multi-resolution fashion. Additionally,...