Kun Zhou

ORCID: 0000-0002-7869-4474
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech and Dialogue Systems
  • Voice and Speech Disorders
  • Natural Language Processing Techniques
  • Emotion and Mood Recognition
  • Neural Networks and Applications
  • Advanced Computational Techniques and Applications
  • Service-Oriented Architecture and Web Services
  • Ultrasonics and Acoustic Wave Propagation
  • Power Systems and Technologies
  • Face Recognition and Analysis
  • Semantic Web and Ontologies
  • Lexicography and Language Studies
  • Quantum Computing Algorithms and Architecture
  • Quantum Mechanics and Applications
  • Advanced Algorithms and Applications
  • Face and Expression Recognition
  • Vehicle License Plate Recognition
  • Quantum Information and Cryptography
  • Neural Dynamics and Brain Function
  • Visual Perception and Processing Mechanisms
  • Multisensory Perception and Integration

National University of Singapore
2020-2024

Yunnan University
2023

Xinqiao Hospital
2011

Army Medical University
2011

Nanchang University
2010

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on a discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained...

10.1109/icassp39728.2021.9413391 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
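
A minimal PyTorch sketch of the conditioning idea in the abstract above: the decoder receives a continuous emotion embedding (in the paper, produced by a pre-trained speech emotion recognition model) rather than a one-hot label. All layer sizes here are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class ConditionedDecoder(nn.Module):
        """Decode spectral frames from a latent code plus an emotion embedding."""
        def __init__(self, latent_dim=64, emo_dim=128, out_dim=513):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim + emo_dim, 256), nn.ReLU(),
                nn.Linear(256, out_dim),
            )

        def forward(self, z, emo_embedding):
            # Concatenate the (ideally emotion-independent) latent code with
            # the continuous emotion condition, frame by frame.
            return self.net(torch.cat([z, emo_embedding], dim=-1))

    z = torch.randn(8, 64)            # latent codes from an encoder
    emo = torch.randn(8, 128)         # stand-in for a pre-trained SER embedding
    frames = ConditionedDecoder()(z, emo)   # (8, 513) spectral frames

Because the condition lives in a continuous space, the same decoder can in principle render emotional styles that were never assigned a discrete label during training, which is the motivation for replacing one-hot codes.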

In this paper, we first provide a review of the state-of-the-art emotional voice conversion research and existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic...

10.1016/j.specom.2021.11.006 article EN cc-by Speech Communication 2021-12-21
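
For orientation, here is a small Python sketch that tallies utterances per emotion in a local copy of ESD. The directory layout assumed here (per-speaker folders containing per-emotion subfolders of .wav files) is a guess at the released package, so adjust the paths to match the actual download.

    from collections import Counter
    from pathlib import Path

    def count_utterances(esd_root):
        """Count .wav files grouped by the name of their parent folder,
        which is assumed here to be the emotion category."""
        counts = Counter()
        for wav in Path(esd_root).rglob("*.wav"):
            counts[wav.parent.name] += 1
        return counts

    if __name__ == "__main__":
        print(count_utterances("ESD"))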

Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe it is more adequate to model it over different temporal scales by using the wavelet transform. We propose a CycleGAN...

10.21437/odyssey.2020-33 article EN 2020-05-15
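
The wavelet idea can be illustrated with a short numpy sketch: an interpolated log-F0 contour is decomposed into coarse-to-fine temporal scales with a Ricker (Mexican hat) wavelet. The choice of wavelet, the dyadic scales and the normalisation below are common conventions, not necessarily the paper's exact settings.

    import numpy as np

    def ricker(points, a):
        """Ricker (Mexican hat) wavelet with width parameter a."""
        t = np.arange(points) - (points - 1) / 2.0
        norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
        return norm * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))

    def cwt_f0(log_f0, num_scales=10):
        """Decompose a log-F0 contour into dyadic temporal scales."""
        x = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)  # zero mean, unit var
        scales = [2.0 ** (i + 1) for i in range(num_scales)]
        return np.stack([
            np.convolve(x, ricker(min(10 * int(a), len(x)), a), mode="same")
            for a in scales
        ])

    f0 = np.exp(0.1 * np.random.randn(200) + 5.0)   # toy F0 contour in Hz
    coeffs = cwt_f0(np.log(f0))                     # shape (10, 200)

Low-index scales capture fast, segment-level F0 movements while high-index scales capture phrase-level trends, which is what makes a scale-wise mapping more expressive than a single linear transform.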

Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the emotional style from the linguistic content and encode the style into an embedding in a continuous space that forms a prototype emotion embedding. We further learn the actual...

10.1109/taffc.2022.3175578 article EN cc-by IEEE Transactions on Affective Computing 2022-05-19
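
The intensity modelling rests on a ranking idea: learn a function whose score is larger for more emotional utterances. The toy gradient-descent ranker below illustrates that intuition with a hinge loss on ordered pairs (in the spirit of relative attributes / RankSVM); the paper's actual features and solver differ.

    import numpy as np

    def train_ranker(feats_neutral, feats_emotional, lr=0.01, epochs=200):
        """Learn w so that w @ x scores emotional utterances above neutral ones."""
        w = np.zeros(feats_neutral.shape[1])
        for _ in range(epochs):
            for xn, xe in zip(feats_neutral, feats_emotional):
                if w @ xe - w @ xn < 1.0:   # hinge margin violated
                    w += lr * (xe - xn)
        return w

    rng = np.random.default_rng(0)
    neutral = rng.normal(0.0, 1.0, (100, 16))
    emotional = rng.normal(0.5, 1.0, (100, 16))
    w = train_ranker(neutral, emotional)
    intensity = emotional @ w               # per-utterance intensity scores

At run-time, the scalar score (suitably normalised) can serve as a continuous intensity control fed to the conversion model.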

Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence text-to-speech framework. During training, the framework does not only explicitly...

10.1109/taffc.2022.3233324 article EN cc-by IEEE Transactions on Affective Computing 2022-12-30
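
One simple way to realise a run-time mixture, sketched below, is a convex combination of per-emotion representations. This is only a plausible mechanism for illustration; the paper's formulation is built on relative differences between emotion samples rather than a fixed embedding bank.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical per-emotion embeddings, e.g. centroids of a learned space.
    emotion_bank = {
        "happy": rng.standard_normal(128),
        "sad": rng.standard_normal(128),
        "surprise": rng.standard_normal(128),
    }

    def mix_emotions(weights, bank=emotion_bank):
        """Blend emotion embeddings with non-negative weights that sum to 1."""
        w = np.array([max(0.0, weights.get(k, 0.0)) for k in bank])
        w = w / w.sum()
        return sum(wi * v for wi, v in zip(w, bank.values()))

    embedding = mix_emotions({"happy": 0.7, "surprise": 0.3})  # a blended style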

Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. The prior studies on emotional voice conversion are mostly carried out under the assumption that emotion is speaker-dependent. We consider that there is a common code between speakers for emotional expression in spoken language; therefore, a speaker-independent mapping between emotional states is possible. In this paper, we propose a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for parallel data. We propose a VAW-GAN based encoder-decoder...

10.21437/interspeech.2020-2014 article EN Interspeech 2020 2020-10-25

The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features...

10.48550/arxiv.2501.10045 preprint EN arXiv (Cornell University) 2025-01-17

Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose...

10.48550/arxiv.2501.10052 preprint EN arXiv (Cornell University) 2025-01-17
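
For background, the sketch below shows the standard DDPM forward-noising step that diffusion-based enhancement builds on: clean features are progressively corrupted, and a network is trained to undo the corruption. This is the generic recipe, not the specific model proposed in the paper.

    import torch

    def q_sample(x0, t, alphas_cumprod):
        """Forward diffusion: noise clean features x0 to timestep t."""
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t]
        return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

    T = 50
    betas = torch.linspace(1e-4, 0.05, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    x0 = torch.randn(1, 80, 100)     # toy mel-spectrogram of clean speech
    x_t, eps = q_sample(x0, t=25, alphas_cumprod=alphas_cumprod)
    # A denoising network is trained to predict eps from (x_t, t); for
    # enhancement it is typically conditioned on the noisy mixture as well.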

10.1109/icassp49660.2025.10890477 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10888978 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10888627 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines, one for spectrum conversion and one for prosody conversion. We train a spectral encoder that disentangles...

10.1109/slt48900.2021.9383526 article EN 2021 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19

Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the emotional style of different speakers. Inspired by the recent success of disentanglement with the variational autoencoder (VAE), we propose an any-to-any expressive voice conversion framework, called StyleVC. StyleVC is designed to disentangle linguistic content, speaker identity, pitch, and style information. We study the use of a style encoder to model the style explicitly. At run-time, StyleVC converts both...

10.21437/interspeech.2022-10249 article EN Interspeech 2022 2022-09-16

Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. In this paper, we propose a novel 2-stage training strategy for sequence-to-sequence EVC with a limited amount of emotional speech data. We note that the proposed EVC framework leverages text-to-speech (TTS), as they share the common goal of generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style and linguistic content. In stage 2,...

10.21437/interspeech.2021-781 article EN Interspeech 2021 2021-08-27
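
The 2-stage recipe can be summarised as: pre-train on a large multi-speaker TTS corpus, then fine-tune on the small emotional dataset. The sketch below shows that schedule with a stand-in reconstruction model; the backbone, loss and learning rates are illustrative assumptions, not the paper's.

    import torch
    import torch.nn as nn

    # Stand-in frame-level model; the paper uses a sequence-to-sequence network.
    backbone = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
    opt = torch.optim.Adam(backbone.parameters(), lr=1e-3)

    def train_stage(batches, epochs):
        """One training stage: reconstruct target mel frames with an L1 loss."""
        for _ in range(epochs):
            for mel_in, mel_tgt in batches:
                loss = nn.functional.l1_loss(backbone(mel_in), mel_tgt)
                opt.zero_grad()
                loss.backward()
                opt.step()

    # Stage 1: style initialization on a multi-speaker TTS corpus (toy batch).
    tts_batches = [(torch.randn(4, 100, 80), torch.randn(4, 100, 80))]
    train_stage(tts_batches, epochs=1)

    # Stage 2: emotion training on limited emotional data, smaller step size.
    for g in opt.param_groups:
        g["lr"] = 1e-4
    emo_batches = [(torch.randn(4, 100, 80), torch.randn(4, 100, 80))]
    train_stage(emo_batches, epochs=1)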

It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides...

10.1109/icassp48485.2024.10445996 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
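
The hierarchical ED amounts to pooling intensity scores over nested segmentations. The sketch below shows that pooling with hypothetical frame-level intensities and alignments; in the paper the intensities are extracted from ground-truth audio rather than random values.

    import numpy as np

    def hierarchical_ed(frame_intensity, phone_spans, word_spans):
        """Average frame-level emotion intensities at phoneme, word and
        utterance granularity to form a hierarchical descriptor."""
        def pool(spans):
            return np.array([frame_intensity[s:e].mean() for s, e in spans])
        return {
            "phoneme": pool(phone_spans),
            "word": pool(word_spans),
            "utterance": frame_intensity.mean(),
        }

    frames = np.abs(np.random.randn(60))        # toy per-frame intensities
    phones = [(0, 15), (15, 35), (35, 60)]      # frame spans per phoneme
    words = [(0, 35), (35, 60)]                 # frame spans per word
    ed = hierarchical_ed(frames, phones, words)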

Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. The prior studies on emotional voice conversion are mostly carried out under the assumption that emotion is speaker-dependent. We consider that there is a common code between speakers for emotional expression in spoken language; therefore, a speaker-independent mapping between emotional states is possible. In this paper, we propose a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for parallel data. We propose a VAW-GAN based encoder-decoder...

10.48550/arxiv.2005.07025 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe it is more adequate to model it over different temporal scales by using the wavelet transform. We propose a CycleGAN network...

10.48550/arxiv.2002.00198 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Traditional voice conversion (VC) has been focused on speaker identity conversion for speech with a neutral expression. We note that emotional expression plays an essential role in daily communication, and the emotional style of speech can be speaker-dependent. In this paper, we study the technique to jointly convert the speaker identity and the speaker-dependent emotional style, which is called expressive voice conversion. We propose a StarGAN-based framework to learn a many-to-many mapping across different speakers that takes emotional style into account without the need for parallel data. To this end, we condition...

10.1109/asru51503.2021.9687906 article EN 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021-12-13

10.1109/apsipaasc63619.2025.10848721 article EN 2024 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2024-12-03

Singing voice conversion aims to convert a singer's voice from source to target without changing the singing content. Parallel training data is typically required for the training of such a system, which is however not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping from non-parallel training data. In this paper, we propose a singing voice conversion framework based on VAW-GAN. We train an encoder to disentangle...

10.48550/arxiv.2008.03992 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on a discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained...

10.48550/arxiv.2010.14794 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker, when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages; hence, it is more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical; thus it is insufficient to just use a linear method. We propose the use of the continuous wavelet...

10.48550/arxiv.2008.04562 preprint EN other-oa arXiv (Cornell University) 2020-01-01

In this paper, we first provide a review of the state-of-the-art emotional voice conversion research and existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment...

10.48550/arxiv.2105.14762 preprint EN cc-by arXiv (Cornell University) 2021-01-01

Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines, one for spectrum conversion and one for prosody conversion. We train a spectral encoder that disentangles...

10.48550/arxiv.2011.02314 preprint EN other-oa arXiv (Cornell University) 2020-01-01