- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Advanced Data Compression Techniques
- Natural Language Processing Techniques
- Neural Networks and Applications
- Phonetics and Phonology Research
- Face Recognition and Analysis
- Advanced Adaptive Filtering Techniques
- Blind Source Separation Techniques
- Indoor and Outdoor Localization Technologies
- AI-based Problem Solving and Planning
- Image and Signal Denoising Methods
- Underwater Acoustics Research
- Topic Modeling
- Artificial Intelligence in Healthcare
- Biomedical Text Mining and Ontologies
- Voice and Speech Disorders
- Service-Oriented Architecture and Web Services
- Digital Filter Design and Implementation
- Neural Networks and Reservoir Computing
- COVID-19 Diagnosis Using AI
- Advanced Steganography and Watermarking Techniques
- Biometric Identification and Security
- Advanced Computational Techniques and Applications
University of Science and Technology of China
2018-2025
Jilin University
2009-2010
University of Trento
2010
Jilin Province Science and Technology Department
2010
Ministry of Culture
2006
Fudan University
2005
This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear layers and a phase calculation formula, imitating the process of calculating the phase from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping...
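A minimal PyTorch sketch of the two ideas named in this abstract (a parallel estimation head that keeps phase in the principal interval, and an anti-wrapping error measure); the layer sizes, names and the single loss term are illustrative assumptions rather than the paper's implementation:

```python
import math
import torch
import torch.nn as nn


class ParallelEstimationHead(nn.Module):
    """Maps hidden features to wrapped phase spectra via pseudo real/imaginary parts."""

    def __init__(self, hidden_dim: int, n_freq_bins: int):
        super().__init__()
        self.to_real = nn.Conv1d(hidden_dim, n_freq_bins, kernel_size=1)  # pseudo real part
        self.to_imag = nn.Conv1d(hidden_dim, n_freq_bins, kernel_size=1)  # pseudo imaginary part

    def forward(self, h):                            # h: (batch, hidden_dim, frames)
        real, imag = self.to_real(h), self.to_imag(h)
        # atan2 keeps every predicted phase inside the principal interval (-pi, pi].
        return torch.atan2(imag, real)


def anti_wrapping(x):
    """Measure a phase difference modulo 2*pi so wrapping does not inflate the error."""
    return torch.abs(x - 2.0 * math.pi * torch.round(x / (2.0 * math.pi)))


def phase_loss(predicted_phase, natural_phase):
    # Instantaneous-phase term only; further derivative-based terms are omitted here.
    return anti_wrapping(predicted_phase - natural_phase).mean()
```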
Recently, fine-grained prosody representations have emerged and attracted growing attention as a way to address the one-to-many problem in text-to-speech (TTS). In this paper, we propose PhonemeVec, a pre-trained phoneme-level prosody representation that takes contextual information into account. To obtain such representations, we improve the data2vec framework according to the characteristics of prosody, extract PhonemeVec from low-band mel-spectrograms, and pre-train on a 960-hour Chinese corpus with high quality and diverse pronunciation. PhonemeVec is subsequently integrated into FastSpeech2,...
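As a rough illustration of the low-band mel-spectrogram input mentioned above, one possible extraction with torchaudio is sketched below; the 20-bin / 2 kHz cut-off, the STFT settings and the file name `utterance.wav` are assumptions, not the paper's configuration:

```python
import torch
import torchaudio

# Restrict the mel filterbank to the lower frequency range where prosody
# (F0, energy contours) dominates; all settings here are assumptions.
low_band_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=256,
    n_mels=20,       # few, narrow bands
    f_min=0.0,
    f_max=2000.0,    # keep only the low band of the spectrum
)

waveform, sr = torchaudio.load("utterance.wav")   # hypothetical input file
mel = low_band_mel(waveform)                      # (channels, n_mels, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))   # log compression for stability
```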
This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods, which predict spectral parameters for reconstructing wideband waveforms, this method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, an unconditional audio generator, the HRNN model represents the distribution of each wideband or high-frequency sample conditioned on the input narrowband samples, using a network composed...
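A heavily simplified two-tier sketch of the hierarchical idea (a slower frame-level RNN conditioning a faster sample-level predictor), assuming PyTorch; all dimensions and the 256-way sample quantization are assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn


class TinyHRNN(nn.Module):
    def __init__(self, frame_size=80, cond_dim=128, n_classes=256):
        super().__init__()
        # Slow tier: summarizes narrowband frames into conditioning vectors.
        self.frame_rnn = nn.GRU(frame_size, cond_dim, batch_first=True)
        # Fast tier: predicts a categorical distribution for each sample.
        self.sample_net = nn.Sequential(
            nn.Linear(cond_dim + 1, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, nb_frames, prev_samples):
        # nb_frames:    (batch, n_frames, frame_size) narrowband input frames
        # prev_samples: (batch, n_frames, frame_size, 1) previous high-band samples
        cond, _ = self.frame_rnn(nb_frames)                     # (batch, n_frames, cond_dim)
        cond = cond.unsqueeze(2).expand(-1, -1, prev_samples.size(2), -1)
        logits = self.sample_net(torch.cat([cond, prev_samples], dim=-1))
        return logits                                           # per-sample class logits
```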
This article presents a neural vocoder named HiNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically. Different from existing vocoders such as WaveNet, SampleRNN and WaveRNN, which directly generate waveform samples using single networks, the HiNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a simple DNN model which predicts log amplitude spectra (LAS) from acoustic features. The predicted LAS are then sent into the PSP for phase recovery. Considering the issue of phase warping and the difficulty of phase modeling,...
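A hedged sketch of the amplitude half of this hierarchy: a plain feed-forward ASP mapping acoustic features to frame-level log amplitude spectra, whose output would then condition a separate phase predictor. All layer sizes are placeholder assumptions:

```python
import torch.nn as nn


class SimpleASP(nn.Module):
    """Feed-forward amplitude spectrum predictor: acoustic features -> LAS."""

    def __init__(self, acoustic_dim=80, n_freq_bins=513, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq_bins),       # frame-level log amplitude spectrum
        )

    def forward(self, acoustic_features):         # (batch, frames, acoustic_dim)
        return self.net(acoustic_features)


# The predicted LAS would then be fed to a phase spectrum predictor, e.g.
# las = SimpleASP()(feats); phase = psp(las); waveform = istft(las, phase)
```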
This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing differs from speech in that it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features in order to better describe the dependencies among consecutive frames. For F0 modeling, discretized F0 values are used, and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also...
This paper presents a SampleRNN-based neural vocoder for statistical parametric speech synthesis. The method utilizes a conditional SampleRNN model composed of a hierarchical structure of GRU layers and feed-forward layers to capture long-span dependencies between acoustic features and waveform sequences. Compared with conventional vocoders based on the source-filter model, our proposed vocoder is trained without assumptions derived from prior knowledge of speech production and is able to provide better modeling and recovery of phase information...
This paper presents a novel neural vocoder named APNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra directly. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network which predicts frame-level log amplitude spectra from acoustic features. The PSP also adopts a residual convolution network using acoustic features as input, then passes the output of this network through two parallel linear layers respectively, and finally integrates them into a phase calculation formula to estimate the phase spectra. Finally, the outputs of the ASP and PSP are combined...
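The final reconstruction step described above (combining predicted log amplitude and phase spectra into a waveform) can be illustrated with a short inverse-STFT sketch; the STFT settings below are assumptions:

```python
import torch


def reconstruct_waveform(log_amp, phase, n_fft=1024, hop_length=256):
    # log_amp, phase: (batch, n_fft // 2 + 1, frames)
    complex_spec = torch.polar(torch.exp(log_amp), phase)   # amplitude * e^{j*phase}
    return torch.istft(
        complex_spec,
        n_fft=n_fft,
        hop_length=hop_length,
        window=torch.hann_window(n_fft),
        return_complex=False,
    )
```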
This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct phase prediction. It consists of two parallel linear layers and a phase calculation formula, imitating the process of calculating the phase from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by...
This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At the training stage, this method builds a multiple-target DNN which predicts the log amplitude spectra of natural high-bit waveforms together with the ratios between natural and distorted spectra. Log amplitude spectra of the distorted waveforms are adopted as the model input. At the generation stage, the enhanced amplitude spectra are obtained by an ensemble decoding strategy and are further combined with the phase spectra to produce the final waveforms by inverse STFT. In our experiments on WaveRNN...
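A hedged sketch of the multiple-target prediction and ensemble decoding described above: one head predicts the natural log amplitude spectrum directly, the other predicts a correction ratio added to the distorted input, and the two estimates are averaged at generation time. Equal averaging weights and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn


class MultiTargetEnhancer(nn.Module):
    def __init__(self, n_bins=513, hidden=1024):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.direct_head = nn.Linear(hidden, n_bins)   # natural log amplitude spectrum
        self.ratio_head = nn.Linear(hidden, n_bins)    # log ratio natural / distorted

    def forward(self, distorted_las):                  # (batch, frames, n_bins)
        h = self.backbone(distorted_las)
        return self.direct_head(h), self.ratio_head(h)

    @torch.no_grad()
    def enhance(self, distorted_las):
        direct, ratio = self.forward(distorted_las)
        # Ensemble of the direct estimate and the ratio-corrected input.
        return 0.5 * (direct + (distorted_las + ratio))
```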
This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the recovered...
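Domain adversarial training is commonly implemented with a gradient reversal layer; whether the paper uses exactly this operator is an assumption, but the sketch below shows how a domain classifier can push an encoder toward features shared by silent and vocalized recordings:

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped (scaled) gradient backward."""

    @staticmethod
    def forward(ctx, x, alpha: float):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None


def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)


# Usage: domain_logits = domain_classifier(grad_reverse(encoder_features))
# so minimizing the domain loss makes the encoder features domain-invariant.
```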
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under zero-shot conditions, because extra speaker embeddings need to be extracted from natural reference speech, which is unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a personalized Lip2Speech synthesis...
This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing differs from speech in that it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features in order to better describe the dependencies among consecutive frames. For F0 modeling, discretized F0 values are used, and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also...
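A hedged sketch of the discretized-F0 autoregression mentioned above: a GRU consumes the previous frame's F0 class together with context features and outputs a categorical distribution for the current frame. The 256-class quantization and feature sizes are assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn


class DARF0Model(nn.Module):
    def __init__(self, n_f0_classes=256, context_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_f0_classes, 64)   # previous frame's discretized F0
        self.rnn = nn.GRU(64 + context_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_f0_classes)

    def forward(self, prev_f0_ids, context):
        # prev_f0_ids: (batch, frames) int64; context: (batch, frames, context_dim)
        x = torch.cat([self.embed(prev_f0_ids), context], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                            # logits over F0 classes per frame
```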
In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, the acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on the combination of STFT...
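A heavily simplified, hedged sketch of a knowledge-driven recovery of approximate log amplitude spectra from cepstral coefficients and F0, using the source-filter relation log|S| ≈ log|E| + log|H|; mel-frequency warping and the paper's exact excitation model are omitted, and all sizes are assumptions:

```python
import numpy as np


def envelope_log_amplitude(cepstrum, n_fft=1024):
    """Spectral envelope log|H(k)| from (unwarped) cepstral coefficients."""
    buf = np.zeros(n_fft)
    buf[:len(cepstrum)] = cepstrum
    buf[-(len(cepstrum) - 1):] = cepstrum[1:][::-1]   # make the cepstrum symmetric
    return np.real(np.fft.rfft(buf))                  # log amplitude of the envelope


def excitation_log_amplitude(f0, sample_rate=16000, n_fft=1024, voiced_gain=6.0):
    """Very crude excitation spectrum: harmonic peaks if voiced, flat if unvoiced."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    if f0 <= 0:                                       # unvoiced frame
        return np.zeros_like(freqs)
    harmonic_dist = np.abs((freqs / f0) - np.round(freqs / f0))
    return voiced_gain * np.exp(-(harmonic_dist ** 2) / 0.005)


def approximate_las(cepstrum, f0):
    # Source-filter combination in the log amplitude domain.
    return envelope_log_amplitude(cepstrum) + excitation_log_amplitude(f0)
```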