Yang Ai

ORCID: 0000-0001-6668-022X
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Advanced Data Compression Techniques
  • Natural Language Processing Techniques
  • Neural Networks and Applications
  • Phonetics and Phonology Research
  • Face Recognition and Analysis
  • Advanced Adaptive Filtering Techniques
  • Blind Source Separation Techniques
  • Indoor and Outdoor Localization Technologies
  • AI-based Problem Solving and Planning
  • Image and Signal Denoising Methods
  • Underwater Acoustics Research
  • Topic Modeling
  • Artificial Intelligence in Healthcare
  • Biomedical Text Mining and Ontologies
  • Voice and Speech Disorders
  • Service-Oriented Architecture and Web Services
  • Digital Filter Design and Implementation
  • Neural Networks and Reservoir Computing
  • COVID-19 diagnosis using AI
  • Advanced Steganography and Watermarking Techniques
  • Biometric Identification and Security
  • Advanced Computational Techniques and Applications

University of Science and Technology of China
2018-2025

Jilin University
2009-2010

University of Trento
2010

Jilin Province Science and Technology Department
2010

Ministry of Culture
2006

Fudan University
2005

This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping...

10.1109/icassp49357.2023.10096553 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
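
A minimal sketch may help picture the parallel estimation architecture: two parallel linear convolutional layers produce pseudo real and imaginary parts, and an atan2-style phase calculation formula wraps the output into the principal value interval (-π, π]. This assumes PyTorch; the class and layer names are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

class ParallelEstimation(nn.Module):
    # Sketch: two parallel linear convolutional layers act as pseudo real
    # and imaginary parts, and atan2 stands in for the paper's phase
    # calculation formula, so every output is wrapped to (-pi, pi].
    def __init__(self, channels: int, n_bins: int):
        super().__init__()
        self.conv_r = nn.Conv1d(channels, n_bins, kernel_size=1)  # pseudo Re
        self.conv_i = nn.Conv1d(channels, n_bins, kernel_size=1)  # pseudo Im

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, channels, frames) from the residual conv network
        return torch.atan2(self.conv_i(hidden), self.conv_r(hidden))
```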

Recently, fine-grained prosody representations have emerged and attracted growing attention as a way to address the one-to-many problem in text-to-speech (TTS). In this paper, we propose PhonemeVec, a pre-trained fine-grained prosody representation that takes contextual information into consideration. To obtain the representations, we improve the data2vec framework according to the characteristics of prosody, extract PhonemeVec from the low-band mel-spectrogram, and pre-train on a 960-hour Chinese corpus with high quality and diverse pronunciation. PhonemeVec is subsequently integrated into FastSpeech2,...

10.1145/3711828 article EN ACM Transactions on Asian and Low-Resource Language Information Processing 2025-01-15

10.1109/icassp49660.2025.10888169 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889792 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10890694 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889831 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10888985 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10887949 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/taslpro.2025.3557193 article EN IEEE Transactions on Audio Speech and Language Processing 2025-01-01

This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples, using a neural network composed...

10.1109/taslp.2018.2798811 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2018-01-26
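
The hierarchical conditioning can be pictured with a toy two-tier sketch: a frame-level GRU summarizes the narrowband input and conditions a sample-level GRU that predicts each high-frequency sample autoregressively. This assumes PyTorch; all sizes (frame length, hidden width, the 256 quantization levels) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HRNNSketch(nn.Module):
    # Toy two-tier hierarchy in the spirit of the HRNN described above:
    # a frame tier conditions a sample tier that predicts a categorical
    # distribution over quantized high-frequency samples.
    def __init__(self, frame_size=80, hidden=256, q_levels=256):
        super().__init__()
        self.frame_rnn = nn.GRU(frame_size, hidden, batch_first=True)
        self.sample_rnn = nn.GRU(1 + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, q_levels)  # logits over sample values

    def forward(self, narrowband_frames, prev_samples):
        # frame tier summarizes the narrowband input ...
        frame_h, _ = self.frame_rnn(narrowband_frames)
        # ... and conditions the sample tier (samples per frame must divide
        # evenly here; a simplifying assumption of this sketch)
        cond = frame_h.repeat_interleave(
            prev_samples.size(1) // frame_h.size(1), dim=1)
        sample_in = torch.cat([prev_samples.unsqueeze(-1), cond], dim=-1)
        h, _ = self.sample_rnn(sample_in)
        return self.out(h)
```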

This article presents a neural vocoder named HiNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically. Different from existing neural vocoders such as WaveNet, SampleRNN and WaveRNN which directly generate waveform samples using single neural networks, the HiNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a simple DNN model which predicts log amplitude spectra (LAS) from acoustic features. The predicted LAS are then sent into the PSP for phase recovery. Considering the issue of phase warping and the difficulty of phase modeling,...

10.1109/taslp.2020.2970241 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2020-01-01
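
The hierarchy reduces to a short pipeline: the ASP maps acoustic features to log amplitude spectra, the PSP recovers phase from those spectra, and an inverse STFT combines the two. A minimal sketch, with asp, psp and istft as placeholder callables standing for the trained components:

```python
def hinet_vocoder(acoustic_feats, asp, psp, istft):
    # asp, psp and istft are placeholders for the trained amplitude
    # predictor, phase predictor and inverse STFT, respectively
    log_amp = asp(acoustic_feats)   # stage 1: log amplitude spectra (LAS)
    phase = psp(log_amp)            # stage 2: phase recovered from the LAS
    return istft(log_amp, phase)    # combine the two spectra into a waveform
```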

This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech and contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dependencies among consecutive frames. For F0 modeling, discretized F0 values are used, and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also...

10.21437/interspeech.2019-1563 article EN Interspeech 2019 2019-09-13
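
A deep autoregressive (DAR) frame model of this kind can be sketched briefly: the previous frame's discretized F0 class is embedded and fed, together with conditioning features, to a GRU that outputs a distribution over the next frame's F0 class. This assumes PyTorch; class count and layer sizes are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class DARF0(nn.Module):
    # Sketch of a DAR model for discretized F0: autoregressive over the
    # F0 history so frame-to-frame dynamics (e.g., vibrato) are captured.
    def __init__(self, n_classes=256, cond_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 64)  # previous F0 class
        self.rnn = nn.GRU(64 + cond_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, prev_f0_ids, cond):
        # prev_f0_ids: (batch, frames) long; cond: (batch, frames, cond_dim)
        # conditioning features, e.g., derived from the musical score
        x = torch.cat([self.embed(prev_f0_ids), cond], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # logits over discretized F0 values
```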

This paper presents a SampleRNN-based neural vocoder for statistical parametric speech synthesis. This method utilizes a conditional SampleRNN model composed of a hierarchical structure of GRU layers and feed-forward layers to capture long-span dependencies between acoustic features and waveform sequences. Compared with conventional vocoders based on the source-filter model, our proposed vocoder is trained without assumptions derived from prior knowledge of speech production and is able to provide better modeling and recovery of phase information...

10.1109/icassp.2018.8461878 article EN ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018-04-01

This paper presents a novel neural vocoder named APNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra directly. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network which predicts frame-level log amplitude spectra from acoustic features. The PSP also adopts a residual convolution network using acoustic features as input, then passes the output of this network through two parallel linear convolution layers respectively, and finally integrates them into a phase calculation formula to estimate frame-level phase spectra. Finally, the outputs of the ASP and PSP are combined...

10.1109/taslp.2023.3277276 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2023-01-01
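
The final combination step admits a compact sketch: rebuild the complex short-time spectrum from the predicted log amplitude and phase spectra, then invert it with iSTFT. This assumes PyTorch; the shapes, FFT size and hop length are illustrative, not the paper's settings.

```python
import torch

def reconstruct_waveform(log_amp: torch.Tensor, phase: torch.Tensor,
                         n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    # log_amp, phase: (batch, n_fft // 2 + 1, frames), frame-level outputs
    # of the ASP and PSP; rebuild the complex STFT and invert it
    spec = torch.exp(log_amp) * torch.exp(1j * phase)  # complex spectrum
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)
```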

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct phase prediction. It consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by...

10.1109/taslp.2024.3385285 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01
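
The anti-wrapping idea can be made concrete with a worked example: wrap each phase error back to its principal value before measuring it, so an error of nearly 2π is treated as nearly zero rather than expanding the loss. A sketch assuming PyTorch, using the common wrapping form |x - 2π·round(x / 2π)|; the paper's exact loss terms may differ.

```python
import math
import torch

def anti_wrapping(x: torch.Tensor) -> torch.Tensor:
    # wrap each phase error to its principal value before measuring it,
    # so an error of almost 2*pi counts as an error of almost zero
    return torch.abs(x - 2 * math.pi * torch.round(x / (2 * math.pi)))

def instantaneous_phase_loss(pred: torch.Tensor,
                             target: torch.Tensor) -> torch.Tensor:
    # one anti-wrapping term; analogous terms can be built on the
    # frequency- and time-axis differences of the phase spectra
    return anti_wrapping(pred - target).mean()
```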

This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At the training stage, this method builds a multiple-target DNN, which predicts the log amplitude spectra of natural high-bit waveforms together with the ratios between natural and distorted spectra. Log amplitude spectra of the distorted speech are adopted as model input. At the generation stage, enhanced amplitude spectra are obtained by an ensemble decoding strategy, and are further combined with phase spectra to produce the final waveform by inverse STFT. In our experiments on WaveRNN...

10.1109/icassp.2019.8683016 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
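
One way to picture the ensemble decoding: the enhanced log amplitude spectrum (LAS) can be estimated either directly by the DNN or by adding the predicted log ratio to the distorted LAS, and the two estimates are then averaged. A sketch assuming PyTorch tensors; the equal weighting is an assumption, not necessarily the paper's exact rule.

```python
import torch

def ensemble_enhanced_las(las_distorted: torch.Tensor,
                          las_pred: torch.Tensor,
                          log_ratio_pred: torch.Tensor,
                          w: float = 0.5) -> torch.Tensor:
    # Branch 1: the DNN's direct estimate of the natural LAS (las_pred).
    # Branch 2: the distorted LAS corrected by the predicted log ratio
    # between natural and distorted spectra (a sum in the log domain).
    las_from_ratio = las_distorted + log_ratio_pred
    # Average the two estimates as a simple ensemble decoding rule.
    return w * las_pred + (1.0 - w) * las_from_ratio
```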

This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the recovered...

10.1109/icassp49357.2023.10096920 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
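
Domain adversarial training is typically implemented with a gradient reversal layer: identity on the forward pass, negated gradient on the backward pass, so the feature extractor learns to fool the domain classifier. A generic PyTorch sketch of that standard building block, not the authors' exact setup:

```python
import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the
    # backward pass, pushing features toward domain invariance.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # insert between the feature extractor and the domain classifier
    return GradReverse.apply(x, lam)
```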

Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under zero-shot conditions, because extra speaker embeddings need to be extracted from natural reference speech, which is unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis...

10.1109/icassp49357.2023.10096464 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech and contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dependencies among consecutive frames. For F0 modeling, discretized F0 values are used, and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also...

10.48550/arxiv.1906.08977 preprint EN arXiv (Cornell University) 2019-01-01

In our previous work, we proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on the combination of STFT...

10.21437/interspeech.2020-1046 article EN Interspeech 2020 2020-10-25