- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and dialogue systems
- Voice and Speech Disorders
- Natural Language Processing Techniques
- Emotion and Mood Recognition
- Neural Networks and Applications
- Advanced Computational Techniques and Applications
- Service-Oriented Architecture and Web Services
- Ultrasonics and Acoustic Wave Propagation
- Power Systems and Technologies
- Face recognition and analysis
- Semantic Web and Ontologies
- Lexicography and Language Studies
- Quantum Computing Algorithms and Architecture
- Quantum Mechanics and Applications
- Advanced Algorithms and Applications
- Face and Expression Recognition
- Vehicle License Plate Recognition
- Quantum Information and Cryptography
- Neural dynamics and brain function
- Visual perception and processing mechanisms
- Multisensory perception and integration
National University of Singapore
2020-2024
Yunnan University
2023
Xinqiao Hospital
2011
Army Medical University
2011
Nanchang University
2010
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representations, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained...
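The one-hot conditioning described above can be illustrated with a minimal sketch: the decoder input is the encoder's emotion-independent latent concatenated, frame by frame, with a discrete emotion label. All names and dimensions here are illustrative, not the paper's actual configuration.

```python
import numpy as np

def one_hot(emotion_id, num_emotions):
    """Discrete emotion representation: a one-hot vector."""
    v = np.zeros(num_emotions)
    v[emotion_id] = 1.0
    return v

def condition_latent(latent, emotion_id, num_emotions):
    """Concatenate the encoder latent with a one-hot emotion label,
    frame by frame, to form the decoder input."""
    label = one_hot(emotion_id, num_emotions)
    # latent: (frames, latent_dim) -> (frames, latent_dim + num_emotions)
    tiled = np.tile(label, (latent.shape[0], 1))
    return np.concatenate([latent, tiled], axis=1)

# toy example: 100 frames of a 64-dim emotion-independent latent, 5 emotions
z = np.random.randn(100, 64)
decoder_in = condition_latent(z, emotion_id=2, num_emotions=5)
print(decoder_in.shape)  # (100, 69)
```

Because the label set is fixed at training time, a decoder conditioned this way can only reproduce the styles it has seen, which is the limitation the abstract points at.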
In this paper, we first provide a review of the state-of-the-art emotional voice conversion research and existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic...
Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe that it is more adequate to model F0 in multiple temporal scales by using the continuous wavelet transform. We propose a CycleGAN...
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual...
Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence text-to-speech framework. During the training, the framework does not only explicitly...
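A relative-difference formulation of this kind can be sketched as a pairwise ranking objective: a scalar function scores emotion intensity, and a sample of a stronger emotion should out-score a neutral sample by some margin. The linear scorer, the margin, and the toy features below are all illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def intensity_score(features, w):
    """Scalar emotion-intensity score: a simple linear ranking function."""
    return features @ w

def ranking_loss(score_emotional, score_neutral, margin=1.0):
    """Hinge loss on the relative difference between a pair of samples:
    the emotional sample should out-score the neutral one by the margin."""
    return np.maximum(0.0, margin - (score_emotional - score_neutral))

w = np.array([0.5, -0.2, 1.0])          # hypothetical learned weights
happy = np.array([2.0, 0.1, 1.5])       # features of an emotional sample
neutral = np.array([0.2, 0.3, 0.1])     # features of a neutral sample
loss = ranking_loss(intensity_score(happy, w), intensity_score(neutral, w))
print(float(loss))  # 0.0 -- the pair is already correctly ranked
```

Once such a scorer is trained from pairwise comparisons, intermediate scores give a continuous handle on intensity, which is what makes run-time mixing of emotions possible.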
Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. The prior studies on emotional voice conversion are mostly carried out under the assumption that emotion is speaker-dependent. We consider that there is a common code between speakers for emotional expression in spoken language; therefore, a speaker-independent mapping between emotional states is possible. In this paper, we propose a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for parallel data. We propose a VAW-GAN based encoder-decoder...
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features...
Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose...
Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines, one for spectrum conversion and one for prosody conversion. We train a spectral encoder that disentangles...
Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotion style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the emotion style for different speakers. Inspired by the recent success of speaker disentanglement with the variational autoencoder (VAE), we propose an any-to-any expressive voice conversion framework, called StyleVC. StyleVC is designed to disentangle linguistic content, speaker identity, pitch, and emotion style information. We study the use of a style encoder to model emotion style explicitly. At run-time, the framework converts both...
Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. In this paper, we propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. We note that the proposed EVC framework leverages text-to-speech (TTS), as they share the common goal of generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style and linguistic content. In stage 2, ...
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides...
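One way to picture a hierarchical emotion distribution of this kind: frame-level emotion posteriors are aggregated within phoneme, word, and utterance segments, giving one distribution per unit at each level. The segment boundaries and the simple averaging below are assumptions for illustration, not the paper's actual extractor.

```python
import numpy as np

def segment_mean(frame_probs, boundaries):
    """Average frame-level emotion posteriors within each segment.
    boundaries: list of (start, end) frame indices, end exclusive."""
    return np.stack([frame_probs[s:e].mean(axis=0) for s, e in boundaries])

# toy posteriors over 3 emotion categories for 12 frames
frames = np.random.dirichlet(np.ones(3), size=12)

# hypothetical alignments: 4 phonemes, 2 words, 1 utterance
phonemes = segment_mean(frames, [(0, 3), (3, 6), (6, 9), (9, 12)])
words = segment_mean(frames, [(0, 6), (6, 12)])
utterance = segment_mean(frames, [(0, 12)])
print(phonemes.shape, words.shape, utterance.shape)  # (4, 3) (2, 3) (1, 3)
```

Each level summarizes the same frame-level evidence at a different granularity, so the finer levels retain local intensity variation that a single utterance-level vector would average away.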
Traditional voice conversion (VC) has been focused on converting the speaker identity of speech with a neutral expression. We note that emotional expression plays an essential role in daily communication, and the emotion style of speech can be speaker-dependent. In this paper, we study a technique to jointly convert the speaker identity and speaker-dependent emotion style, which is called expressive voice conversion. We propose a StarGAN-based framework to learn a many-to-many mapping across different speakers that takes emotion style into account without the need for parallel data. To this end, we condition...
Singing voice conversion aims to convert a singer's voice from the source to the target without changing the singing content. Parallel training data is typically required for the training of a singing voice conversion system, which is however not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping through non-parallel training data. In this paper, we propose a singing voice conversion framework based on VAW-GAN. We train an encoder to disentangle...
Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker, when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages, and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, thus it is insufficient to just use a linear method. We propose the use of the continuous wavelet...