- Speech and Audio Processing
- Speech Recognition and Synthesis
- Music and Audio Processing
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Natural Language Processing Techniques
- Blind Source Separation Techniques
- Neural Networks and Reservoir Computing
- Image and Signal Denoising Methods
- Infant Health and Development
- Phonetics and Phonology Research
- Emotion and Mood Recognition
- COVID-19 Diagnosis Using AI
- Digital Filter Design and Implementation
- Artificial Intelligence in Healthcare
- Optical Polarization and Ellipsometry
- Voice and Speech Disorders
University of Science and Technology of China
2023-2025
This paper proposes MP-SENet, a novel Speech Enhancement Network which directly denoises Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by convolution-augmented Transformers. The encoder aims to encode time-frequency representations from the input noisy magnitude and phase spectra. The decoder is composed of a parallel magnitude mask decoder and phase decoder, recovering the clean magnitude and clean wrapped phase spectra by incorporating a learnable sigmoid activation and a phase estimation architecture, ...
Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder...
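The parallel magnitude/phase decoding idea described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the trainable-slope sigmoid parameters (slope, beta) are hypothetical, and the phase head simply combines two unbounded outputs with atan2 so the result is wrapped by construction.

```python
import numpy as np

def learnable_sigmoid(x, slope, beta=2.0):
    # Magnitude-mask activation: a sigmoid with a trainable slope, scaled
    # by beta so the mask can exceed 1 (parameter values are assumptions).
    return beta / (1.0 + np.exp(-slope * x))

def phase_head(pseudo_real, pseudo_imag):
    # Parallel phase estimation: two unbounded network outputs combined by
    # atan2, so the estimated phase is wrapped into [-pi, pi] by design,
    # sidestepping an explicit wrapping step.
    return np.arctan2(pseudo_imag, pseudo_real)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 5))            # toy decoder features (F, T)
mask = learnable_sigmoid(feat, slope=1.2)     # bounded in (0, 2)
noisy_mag = np.abs(rng.standard_normal((4, 5)))
clean_mag = mask * noisy_mag                  # masked magnitude estimate
phase = phase_head(rng.standard_normal((4, 5)),
                   rng.standard_normal((4, 5)))
```

Because the mask is bounded and the phase is natively wrapped, neither head needs a post-hoc clamping or wrapping operation.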
This paper proposes a novel neural audio codec framework which incorporates bandwidth reduction and recovery, facilitating its application in scenarios with high sampling rates and low bitrates. The proposed codec consists of a two-stage-downsampling-based encoder, a quantizer, and a two-stage-upsampling-based decoder. The encoder initially reduces the bandwidth of the high-sampling-rate waveform before encoding it. Therefore, the discrete tokens outputted by the quantizer are derived from the low-sampling-rate waveform, resulting in a low bitrate. ...
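The bitrate benefit of tokenizing a downsampled waveform can be shown with back-of-the-envelope arithmetic: fewer low-rate samples per second mean fewer token frames. All concrete numbers below are illustrative assumptions, not the paper's configuration.

```python
import math

def bitrate_bps(sample_rate, down_factor, hop, n_quantizers, codebook_size):
    # Token frames per second are computed on the downsampled waveform,
    # so a larger down_factor directly lowers the bitrate.
    frames_per_sec = (sample_rate / down_factor) / hop
    bits_per_frame = n_quantizers * math.log2(codebook_size)
    return frames_per_sec * bits_per_frame

# Assumed example: 48 kHz input, 4x bandwidth reduction, hop of 240
# low-rate samples, 4 RVQ stages with 1024-entry codebooks.
bps = bitrate_bps(48000, 4, 240, 4, 1024)
print(bps)  # 2000.0 bits/s, i.e. 2 kbps
```

Without the 4x reduction the same token layout would cost 8 kbps, which is the trade the two-stage downsampling encoder exploits.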
This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable of both feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, BiVocoder takes the amplitude and phase spectra derived from STFT as inputs and transforms them into long-frame-shift, low-dimensional features through convolutional networks. The extracted features are demonstrated to be suitable for direct prediction by acoustic models, supporting its application in...
Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction via the short-time Fourier transform (STFT). To tackle this issue, we present a novel long-frame-shift neural speech phase prediction (LFS-NSPP) method which enables precise prediction from long-frame-shift log amplitude...
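The role of the frame shift in exact STFT waveform reconstruction can be illustrated with a small NumPy round trip. The sketch below (frame length and hop are arbitrary choices) uses a periodic Hann window at 50% overlap, which satisfies the constant-overlap-add (COLA) condition, so overlap-add of the inverse transforms reproduces the interior of the signal exactly.

```python
import numpy as np

N, hop = 8, 4
# Periodic Hann window: adjacent 50%-overlapped copies sum to exactly 1.
win = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(N) / N))

rng = np.random.default_rng(1)
x = rng.standard_normal(64)

# Analysis: windowed frames -> complex spectra (amplitude + phase).
spec = [np.fft.rfft(win * x[i:i + N])
        for i in range(0, len(x) - N + 1, hop)]

# Synthesis: inverse FFT per frame, then overlap-add.
y = np.zeros(len(x))
for k, s in enumerate(spec):
    y[k * hop:k * hop + N] += np.fft.irfft(s, N)
```

Interior samples (covered by two overlapping windows) are recovered exactly; pushing the hop beyond the COLA-compatible bound would break this identity, which is why the frame shift has a theoretical upper limit.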
Speech bandwidth extension (BWE) refers to widening the frequency range of speech signals, enhancing speech quality towards a brighter and fuller sound. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband waveform generation. The proposed AP-BWE generator is based entirely on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, where...
This paper introduces a novel neural audio codec targeting high waveform sampling rates and low bitrates, named APCodec, which seamlessly integrates the strengths of parametric codecs and waveform codecs. The APCodec revolutionizes the process of encoding and decoding by concurrently handling the amplitude and phase spectra as audio characteristics, like parametric codecs. It is composed of an encoder and a decoder with a modified ConvNeXt v2 network backbone, connected by a quantizer based on the residual vector quantization (RVQ) mechanism. The encoder compresses the amplitude and phase spectra in parallel, ...
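The residual vector quantization (RVQ) mechanism mentioned above can be sketched in a few lines of NumPy. This is a generic illustration with random, untrained codebooks, not APCodec's quantizer: each stage quantizes the residual left by the previous stages against its own codebook, and the decoder-side estimate is just the sum of the selected codes.

```python
import numpy as np

def rvq_encode(z, codebooks):
    # Residual vector quantization: stage s picks the code nearest to the
    # residual z - sum(codes chosen so far), producing one index stream
    # per stage.
    quantized = np.zeros_like(z)
    indices = []
    for cb in codebooks:                          # cb: (codebook_size, dim)
        residual = z - quantized
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)                    # nearest code per vector
        quantized = quantized + cb[idx]
        indices.append(idx)
    return indices, quantized

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8))                  # 16 latent vectors, dim 8
codebooks = [rng.standard_normal((32, 8)) for _ in range(4)]
idx, q = rvq_encode(z, codebooks)                 # 4 index streams + estimate
```

With trained codebooks, later stages capture progressively finer detail, which is what lets such codecs trade bitrate for quality by dropping trailing index streams.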
The majority of existing speech bandwidth extension (BWE) methods operate under the constraint of fixed source and target sampling rates, which limits their flexibility in practical applications. In this paper, we propose a multi-stage BWE model named MS-BWE, which can handle a set of sampling rate pairs and achieve flexible extensions of frequency bandwidth. The proposed MS-BWE comprises a cascade of BWE blocks, with each block featuring a dual-stream architecture to realize amplitude and phase extension, progressively painting the high-frequency bands...
This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from the input amplitude spectrum using two-stage neural networks. In the initial prior-construction stage, we preliminarily predict a rough prior phase spectrum. The subsequent refinement stage transforms it into a refined, high-quality phase spectrum conditioned on the prior phase. Networks in both stages use ConvNeXt v2 blocks as the backbone and adopt adversarial training by innovatively introducing a phase spectrum discriminator (PSD)....
In this paper, we propose MDCTCodec, an efficient lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of audio as input, encoding it into a continuous latent code which is then discretized by a residual vector quantizer (RVQ). Subsequently, the decoder decodes the MDCT spectrum from the quantized latent code and reconstructs the audio via inverse MDCT. During the training phase, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to discriminate natural or...
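A compact NumPy sketch of the MDCT/inverse-MDCT round trip underlying such a codec (frame size is an arbitrary choice; this is the textbook transform, not MDCTCodec's front end). It uses a sine window satisfying the Princen-Bradley condition (w[n]^2 + w[n+N]^2 = 1), so 50%-overlapped frames overlap-add to an exact reconstruction despite each frame's transform being lossy on its own.

```python
import numpy as np

N = 8                                            # MDCT bins per frame
n, k = np.arange(2 * N), np.arange(N)
# MDCT basis: rows are the N bins, columns the 2N time samples.
C = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
win = np.sin(np.pi / (2 * N) * (n + 0.5))        # sine (Princen-Bradley) window

def mdct(frame):                                 # 2N samples -> N coeffs
    return C @ (win * frame)

def imdct(coeffs):                               # N coeffs -> 2N samples
    # Windowed again on synthesis; 2/N normalization makes the
    # overlap-add of adjacent frames cancel the time-domain aliasing.
    return (2.0 / N) * win * (C.T @ coeffs)

rng = np.random.default_rng(2)
x = rng.standard_normal(8 * N)
y = np.zeros_like(x)
for i in range(0, len(x) - 2 * N + 1, N):        # 50% overlap
    y[i:i + 2 * N] += imdct(mdct(x[i:i + 2 * N]))
```

Each frame yields only N coefficients for 2N samples (critical sampling), which is the efficiency argument for building a lightweight codec on the MDCT rather than on a redundant STFT.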
Assessing the naturalness of speech using mean opinion score (MOS) prediction models has positive implications for the automatic evaluation of speech synthesis systems. Early MOS prediction models took the raw waveform or amplitude spectrum as input, whereas more advanced methods employed self-supervised-learning (SSL) based models to extract semantic representations for MOS prediction. These methods utilized limited aspects of speech information for prediction, resulting in restricted accuracy. Therefore, in this paper, we propose SAMOS, a MOS prediction model that leverages...
We participated in track 2 of the VoiceMOS Challenge 2024, which aimed to predict the mean opinion score (MOS) of singing samples. Our submission secured first place among all participating teams, excluding the official baseline. In this paper, we further improve our system and propose a novel Pitch-and-Spectrum-aware Singing Quality Assessment (PS-SQA) method. The PS-SQA is designed based on a self-supervised-learning (SSL) based MOS predictor, incorporating pitch and spectral information, which are extracted using a pitch histogram...
This paper proposes ESTVocoder, a novel excitation-spectral-transformed neural vocoder within the framework of source-filter theory. The ESTVocoder transforms the amplitude and phase spectra of the excitation into the corresponding speech spectra using a neural filter whose backbone consists of ConvNeXt v2 blocks. Finally, the speech waveform is reconstructed through the inverse short-time Fourier transform (ISTFT). The excitation is constructed based on the F0: for voiced segments, it contains full harmonic information, while unvoiced segments are represented by noise. The excitation provides the filter with...
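The F0-based excitation described above can be sketched as follows. The sampling rate, hop size, and plain harmonic summation are illustrative assumptions, not ESTVocoder's exact construction: voiced samples get a sum of all harmonics below Nyquist, unvoiced samples (F0 == 0) get white noise.

```python
import numpy as np

def build_excitation(f0, sr=16000, hop=80, seed=3):
    # f0: one F0 value per frame, 0.0 marking unvoiced frames.
    rng = np.random.default_rng(seed)
    f0_up = np.repeat(f0, hop)                   # sample-level F0 contour
    t = np.arange(f0_up.size) / sr
    voiced = f0_up > 0
    e = np.zeros_like(t)
    if voiced.any():
        max_h = int((sr / 2) // f0_up[voiced].min())
        for h in range(1, max_h + 1):            # full harmonic stack
            keep = voiced & (h * f0_up < sr / 2) # only harmonics < Nyquist
            e[keep] += np.sin(2 * np.pi * h * f0_up[keep] * t[keep])
    e[~voiced] = rng.standard_normal((~voiced).sum())  # noise excitation
    return e

f0 = np.array([0.0, 200.0, 200.0, 0.0])          # toy contour: U, V, V, U
exc = build_excitation(f0)
```

Because the voiced portion already carries the full harmonic structure, the downstream neural filter only has to shape spectral envelopes rather than invent periodicity.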
This paper proposes a novel neural denoising vocoder that can generate clean speech waveforms from noisy mel-spectrograms. The proposed vocoder consists of two components, i.e., a spectrum predictor and a spectrum enhancement module. The spectrum predictor first predicts the amplitude and phase spectra from the input noisy mel-spectrogram, and the enhancement module subsequently recovers the clean ones. Finally, clean speech waveforms are reconstructed through the inverse short-time Fourier transform (iSTFT). All operations are performed in the frame-level spectral domain, with the APNet vocoder and MP-SENet model used as backbones...
This paper proposes an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method, dubbed IDEA-TTS, which can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To effectively disentangle the environment, speaker, and text factors, we propose an incremental disentanglement process, where an environment estimator is designed to first decompose the environmental spectrogram into...