- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and dialogue systems
- Voice and Speech Disorders
- Natural Language Processing Techniques
- Emotion and Mood Recognition
- Neural Networks and Applications
- Advanced Computational Techniques and Applications
- Service-Oriented Architecture and Web Services
- Ultrasonics and Acoustic Wave Propagation
- Power Systems and Technologies
- Face recognition and analysis
- Semantic Web and Ontologies
- Lexicography and Language Studies
- Quantum Computing Algorithms and Architecture
- Quantum Mechanics and Applications
- Advanced Algorithms and Applications
- Face and Expression Recognition
- Vehicle License Plate Recognition
- Quantum Information and Cryptography
- Neural dynamics and brain function
- Visual perception and processing mechanisms
- Multisensory perception and integration
National University of Singapore
2020-2024
Yunnan University
2023
Xinqiao Hospital
2011
Army Medical University
2011
Nanchang University
2010
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representations, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained...
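The one-hot conditioning described above can be illustrated with a minimal sketch: the decoder input is the encoder's emotion-independent latent concatenated, frame by frame, with a discrete emotion label. All names and dimensions here are illustrative, not the paper's actual configuration.

```python
import numpy as np

def one_hot(emotion_id, num_emotions):
    """Discrete emotion representation: a one-hot vector."""
    v = np.zeros(num_emotions)
    v[emotion_id] = 1.0
    return v

def condition_latent(latent, emotion_id, num_emotions):
    """Concatenate the encoder latent with a one-hot emotion label,
    frame by frame, to form the decoder input."""
    label = one_hot(emotion_id, num_emotions)
    # latent: (frames, latent_dim) -> (frames, latent_dim + num_emotions)
    tiled = np.tile(label, (latent.shape[0], 1))
    return np.concatenate([latent, tiled], axis=1)

# toy example: 100 frames of a 64-dim emotion-independent latent, 5 emotions
z = np.random.randn(100, 64)
decoder_in = condition_latent(z, emotion_id=2, num_emotions=5)
print(decoder_in.shape)  # (100, 69)
```

Because the label set is fixed at training time, a decoder conditioned this way can only reproduce the styles it has seen, which is the limitation the abstract points at.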
In this paper, we first provide a review of the state-of-the-art emotional voice conversion research and existing emotional speech databases. We then motivate the development of a novel emotional speech database (ESD) that addresses the increasing research need. With this paper, the ESD database is now made available to the research community. The database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic...
Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe that it is more adequate to model F0 in multiple temporal scales by using the continuous wavelet transform. We propose a CycleGAN...
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual...
Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence text-to-speech framework. During the training, the framework does not only explicitly...
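A relative-difference formulation of this kind can be sketched as a pairwise ranking objective: a scalar function scores emotion intensity, and a sample of a stronger emotion should out-score a neutral sample by some margin. The linear scorer, the margin, and the toy features below are all illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def intensity_score(features, w):
    """Scalar emotion-intensity score: a simple linear ranking function."""
    return features @ w

def ranking_loss(score_emotional, score_neutral, margin=1.0):
    """Hinge loss on the relative difference between a pair of samples:
    the emotional sample should out-score the neutral one by the margin."""
    return np.maximum(0.0, margin - (score_emotional - score_neutral))

w = np.array([0.5, -0.2, 1.0])          # hypothetical learned weights
happy = np.array([2.0, 0.1, 1.5])       # features of an emotional sample
neutral = np.array([0.2, 0.3, 0.1])     # features of a neutral sample
loss = ranking_loss(intensity_score(happy, w), intensity_score(neutral, w))
print(float(loss))  # 0.0 -- the pair is already correctly ranked
```

Once such a scorer is trained from pairwise comparisons, intermediate scores give a continuous handle on intensity, which is what makes run-time mixing of emotions possible.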
Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. The prior studies on emotional voice conversion are mostly carried out under the assumption that emotion is speaker-dependent. We consider that there is a common code between speakers for emotional expression in spoken language; therefore, a speaker-independent mapping between emotional states is possible. In this paper, we propose a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for parallel data. We propose a VAW-GAN based encoder-decoder...
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features...
Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose...
Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines, one for spectrum conversion and one for prosody conversion. We train a spectral encoder that disentangles...
Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotion style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the emotion style for different speakers. Inspired by the recent success of speaker disentanglement with the variational autoencoder (VAE), we propose an any-to-any expressive voice conversion framework, called StyleVC. StyleVC is designed to disentangle linguistic content, speaker identity, pitch, and emotion style information. We study the use of a style encoder to model emotion style explicitly. At run-time, the framework converts both...
Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. In this paper, we propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. We note that the proposed EVC framework leverages text-to-speech (TTS), as they share the common goal of generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style and linguistic content. In stage 2, ...
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides...
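One way to picture a hierarchical emotion distribution of this kind: frame-level emotion posteriors are aggregated within phoneme, word, and utterance segments, giving one distribution per unit at each level. The segment boundaries and the simple averaging below are assumptions for illustration, not the paper's actual extractor.

```python
import numpy as np

def segment_mean(frame_probs, boundaries):
    """Average frame-level emotion posteriors within each segment.
    boundaries: list of (start, end) frame indices, end exclusive."""
    return np.stack([frame_probs[s:e].mean(axis=0) for s, e in boundaries])

# toy posteriors over 3 emotion categories for 12 frames
frames = np.random.dirichlet(np.ones(3), size=12)

# hypothetical alignments: 4 phonemes, 2 words, 1 utterance
phonemes = segment_mean(frames, [(0, 3), (3, 6), (6, 9), (9, 12)])
words = segment_mean(frames, [(0, 6), (6, 12)])
utterance = segment_mean(frames, [(0, 12)])
print(phonemes.shape, words.shape, utterance.shape)  # (4, 3) (2, 3) (1, 3)
```

Each level summarizes the same frame-level evidence at a different granularity, so the finer levels retain local intensity variation that a single utterance-level vector would average away.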
Traditional voice conversion (VC) has been focused on converting the speaker identity of speech with a neutral expression. We note that emotional expression plays an essential role in daily communication, and the emotion style of speech can be speaker-dependent. In this paper, we study a technique to jointly convert the speaker identity and speaker-dependent emotion style, which is called expressive voice conversion. We propose a StarGAN-based framework to learn a many-to-many mapping across different speakers that takes emotion style into account without the need for parallel data. To this end, we condition...
Singing voice conversion aims to convert a singer's voice from the source to the target without changing the singing content. Parallel training data is typically required for the training of a singing voice conversion system, which is however not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping through non-parallel training data. In this paper, we propose a singing voice conversion framework based on VAW-GAN. We train an encoder to disentangle...
Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker, when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages, and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, thus it is insufficient to just use a linear method. We propose the use of the continuous wavelet...