- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Advanced Data Compression Techniques
- Natural Language Processing Techniques
- Neural Networks and Applications
- Phonetics and Phonology Research
- Face Recognition and Analysis
- Advanced Adaptive Filtering Techniques
- Blind Source Separation Techniques
- Indoor and Outdoor Localization Technologies
- AI-based Problem Solving and Planning
- Image and Signal Denoising Methods
- Underwater Acoustics Research
- Topic Modeling
- Artificial Intelligence in Healthcare
- Biomedical Text Mining and Ontologies
- Voice and Speech Disorders
- Service-Oriented Architecture and Web Services
- Digital Filter Design and Implementation
- Neural Networks and Reservoir Computing
- COVID-19 Diagnosis Using AI
- Advanced Steganography and Watermarking Techniques
- Biometric Identification and Security
- Advanced Computational Techniques and Applications
University of Science and Technology of China
2018-2025
Jilin University
2009-2010
University of Trento
2010
Jilin Province Science and Technology Department
2010
Ministry of Culture
2006
Fudan University
2005
This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear layers and a phase calculation formula, imitating the process of calculating the phase from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping...
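A minimal PyTorch sketch of the two ideas named in this abstract (a parallel estimation head that keeps phase in the principal interval, and an anti-wrapping error measure); the layer sizes, names and the single loss term are illustrative assumptions rather than the paper's implementation:

```python
import math
import torch
import torch.nn as nn


class ParallelEstimationHead(nn.Module):
    """Maps hidden features to wrapped phase spectra via pseudo real/imaginary parts."""

    def __init__(self, hidden_dim: int, n_freq_bins: int):
        super().__init__()
        self.to_real = nn.Conv1d(hidden_dim, n_freq_bins, kernel_size=1)  # pseudo real part
        self.to_imag = nn.Conv1d(hidden_dim, n_freq_bins, kernel_size=1)  # pseudo imaginary part

    def forward(self, h):                            # h: (batch, hidden_dim, frames)
        real, imag = self.to_real(h), self.to_imag(h)
        # atan2 keeps every predicted phase inside the principal interval (-pi, pi].
        return torch.atan2(imag, real)


def anti_wrapping(x):
    """Measure a phase difference modulo 2*pi so wrapping does not inflate the error."""
    return torch.abs(x - 2.0 * math.pi * torch.round(x / (2.0 * math.pi)))


def phase_loss(predicted_phase, natural_phase):
    # Instantaneous-phase term only; further derivative-based terms are omitted here.
    return anti_wrapping(predicted_phase - natural_phase).mean()
```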
Recently, fine-grained prosody representations have emerged and attracted growing attention as a way to address the one-to-many problem in text-to-speech (TTS). In this paper, we propose PhonemeVec, a pre-trained phoneme-level prosody representation that takes contextual information into account. To obtain such representations, we improve the data2vec framework according to the characteristics of prosody, extract PhonemeVec from low-band mel-spectrograms, and pre-train on a 960-hour Chinese corpus with high quality and diverse pronunciation. PhonemeVec is subsequently integrated into FastSpeech2,...
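As a rough illustration of the low-band mel-spectrogram input mentioned above, one possible extraction with torchaudio is sketched below; the 20-bin / 2 kHz cut-off, the STFT settings and the file name `utterance.wav` are assumptions, not the paper's configuration:

```python
import torch
import torchaudio

# Restrict the mel filterbank to the lower frequency range where prosody
# (F0, energy contours) dominates; all settings here are assumptions.
low_band_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=256,
    n_mels=20,       # few, narrow bands
    f_min=0.0,
    f_max=2000.0,    # keep only the low band of the spectrum
)

waveform, sr = torchaudio.load("utterance.wav")   # hypothetical input file
mel = low_band_mel(waveform)                      # (channels, n_mels, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))   # log compression for stability
```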
This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods, which predict spectral parameters for reconstructing wideband waveforms, this method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, an unconditional audio generator, the HRNN model represents the distribution of each wideband or high-frequency sample conditioned on the input narrowband samples, using a network composed...
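A heavily simplified two-tier sketch of the hierarchical idea (a slower frame-level RNN conditioning a faster sample-level predictor), assuming PyTorch; all dimensions and the 256-way sample quantization are assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn


class TinyHRNN(nn.Module):
    def __init__(self, frame_size=80, cond_dim=128, n_classes=256):
        super().__init__()
        # Slow tier: summarizes narrowband frames into conditioning vectors.
        self.frame_rnn = nn.GRU(frame_size, cond_dim, batch_first=True)
        # Fast tier: predicts a categorical distribution for each sample.
        self.sample_net = nn.Sequential(
            nn.Linear(cond_dim + 1, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, nb_frames, prev_samples):
        # nb_frames:    (batch, n_frames, frame_size) narrowband input frames
        # prev_samples: (batch, n_frames, frame_size, 1) previous high-band samples
        cond, _ = self.frame_rnn(nb_frames)                     # (batch, n_frames, cond_dim)
        cond = cond.unsqueeze(2).expand(-1, -1, prev_samples.size(2), -1)
        logits = self.sample_net(torch.cat([cond, prev_samples], dim=-1))
        return logits                                           # per-sample class logits
```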
This article presents a neural vocoder named HiNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically. Different from existing vocoders such as WaveNet, SampleRNN and WaveRNN, which directly generate waveform samples using single networks, the HiNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a simple DNN model which predicts log amplitude spectra (LAS) from acoustic features. The predicted LAS are then sent into the PSP for phase recovery. Considering the issue of phase warping and the difficulty of phase modeling,...
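A hedged sketch of the amplitude half of this hierarchy: a plain feed-forward ASP mapping acoustic features to frame-level log amplitude spectra, whose output would then condition a separate phase predictor. All layer sizes are placeholder assumptions:

```python
import torch.nn as nn


class SimpleASP(nn.Module):
    """Feed-forward amplitude spectrum predictor: acoustic features -> LAS."""

    def __init__(self, acoustic_dim=80, n_freq_bins=513, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq_bins),       # frame-level log amplitude spectrum
        )

    def forward(self, acoustic_features):         # (batch, frames, acoustic_dim)
        return self.net(acoustic_features)


# The predicted LAS would then be fed to a phase spectrum predictor, e.g.
# las = SimpleASP()(feats); phase = psp(las); waveform = istft(las, phase)
```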
This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing differs from speech in that it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features in order to better describe the dependencies among consecutive frames. For F0 modeling, discretized F0 values are used, and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also...
This paper presents a SampleRNN-based neural vocoder for statistical parametric speech synthesis. The method utilizes a conditional SampleRNN model composed of a hierarchical structure of GRU layers and feed-forward layers to capture long-span dependencies between acoustic features and waveform sequences. Compared with conventional vocoders based on the source-filter model, our proposed vocoder is trained without assumptions derived from prior knowledge of speech production and is able to provide better modeling and recovery of phase information...
This paper presents a novel neural vocoder named APNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra directly. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network which predicts frame-level log amplitude spectra from acoustic features. The PSP also adopts a residual convolution network using acoustic features as input, then passes the output of this network through two parallel linear layers respectively, and finally integrates them into a phase calculation formula to estimate the phase spectra. Finally, the outputs of the ASP and PSP are combined...
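The final reconstruction step described above (combining predicted log amplitude and phase spectra into a waveform) can be illustrated with a short inverse-STFT sketch; the STFT settings below are assumptions:

```python
import torch


def reconstruct_waveform(log_amp, phase, n_fft=1024, hop_length=256):
    # log_amp, phase: (batch, n_fft // 2 + 1, frames)
    complex_spec = torch.polar(torch.exp(log_amp), phase)   # amplitude * e^{j*phase}
    return torch.istft(
        complex_spec,
        n_fft=n_fft,
        hop_length=hop_length,
        window=torch.hann_window(n_fft),
        return_complex=False,
    )
```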
This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct phase prediction. It consists of two parallel linear layers and a phase calculation formula, imitating the process of calculating the phase from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by...
This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At the training stage, this method builds a multiple-target DNN which predicts the log amplitude spectra of natural high-bit waveforms together with the ratios between natural and distorted spectra. Log amplitude spectra of the distorted waveforms are adopted as the model input. At the generation stage, the enhanced amplitude spectra are obtained by an ensemble decoding strategy and are further combined with the phase spectra to produce the final waveforms by inverse STFT. In our experiments on WaveRNN...
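A hedged sketch of the multiple-target prediction and ensemble decoding described above: one head predicts the natural log amplitude spectrum directly, the other predicts a correction ratio added to the distorted input, and the two estimates are averaged at generation time. Equal averaging weights and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn


class MultiTargetEnhancer(nn.Module):
    def __init__(self, n_bins=513, hidden=1024):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.direct_head = nn.Linear(hidden, n_bins)   # natural log amplitude spectrum
        self.ratio_head = nn.Linear(hidden, n_bins)    # log ratio natural / distorted

    def forward(self, distorted_las):                  # (batch, frames, n_bins)
        h = self.backbone(distorted_las)
        return self.direct_head(h), self.ratio_head(h)

    @torch.no_grad()
    def enhance(self, distorted_las):
        direct, ratio = self.forward(distorted_las)
        # Ensemble of the direct estimate and the ratio-corrected input.
        return 0.5 * (direct + (distorted_las + ratio))
```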
This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the recovered...
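Domain adversarial training is commonly implemented with a gradient reversal layer; whether the paper uses exactly this operator is an assumption, but the sketch below shows how a domain classifier can push an encoder toward features shared by silent and vocalized recordings:

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped (scaled) gradient backward."""

    @staticmethod
    def forward(ctx, x, alpha: float):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None


def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)


# Usage: domain_logits = domain_classifier(grad_reverse(encoder_features))
# so minimizing the domain loss makes the encoder features domain-invariant.
```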
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under zero-shot conditions, because extra speaker embeddings need to be extracted from natural reference speech, which is unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a personalized Lip2Speech synthesis...
This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing differs from speech in that it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features in order to better describe the dependencies among consecutive frames. For F0 modeling, discretized F0 values are used, and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also...
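A hedged sketch of the discretized-F0 autoregression mentioned above: a GRU consumes the previous frame's F0 class together with context features and outputs a categorical distribution for the current frame. The 256-class quantization and feature sizes are assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn


class DARF0Model(nn.Module):
    def __init__(self, n_f0_classes=256, context_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_f0_classes, 64)   # previous frame's discretized F0
        self.rnn = nn.GRU(64 + context_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_f0_classes)

    def forward(self, prev_f0_ids, context):
        # prev_f0_ids: (batch, frames) int64; context: (batch, frames, context_dim)
        x = torch.cat([self.embed(prev_f0_ids), context], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                            # logits over F0 classes per frame
```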
In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, the acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on the combination of STFT...
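A heavily simplified, hedged sketch of a knowledge-driven recovery of approximate log amplitude spectra from cepstral coefficients and F0, using the source-filter relation log|S| ≈ log|E| + log|H|; mel-frequency warping and the paper's exact excitation model are omitted, and all sizes are assumptions:

```python
import numpy as np


def envelope_log_amplitude(cepstrum, n_fft=1024):
    """Spectral envelope log|H(k)| from (unwarped) cepstral coefficients."""
    buf = np.zeros(n_fft)
    buf[:len(cepstrum)] = cepstrum
    buf[-(len(cepstrum) - 1):] = cepstrum[1:][::-1]   # make the cepstrum symmetric
    return np.real(np.fft.rfft(buf))                  # log amplitude of the envelope


def excitation_log_amplitude(f0, sample_rate=16000, n_fft=1024, voiced_gain=6.0):
    """Very crude excitation spectrum: harmonic peaks if voiced, flat if unvoiced."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    if f0 <= 0:                                       # unvoiced frame
        return np.zeros_like(freqs)
    harmonic_dist = np.abs((freqs / f0) - np.round(freqs / f0))
    return voiced_gain * np.exp(-(harmonic_dist ** 2) / 0.005)


def approximate_las(cepstrum, f0):
    # Source-filter combination in the log amplitude domain.
    return envelope_log_amplitude(cepstrum) + excitation_log_amplitude(f0)
```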