- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Topic Modeling
- Natural Language Processing Techniques
- Speech and Dialogue Systems
- Phonetics and Phonology Research
- Voice and Speech Disorders
- Music Technology and Sound Studies
- AI in Service Interactions
- Sensor Technology and Measurement Systems
- Adversarial Robustness in Machine Learning
- Advanced Data Compression Techniques
- Geophysical Methods and Applications
- Wireless Signal Modulation Classification
- Advanced Sensor Technologies Research
- Flow Measurement and Analysis
- Computational Physics and Python Applications
- Generative Adversarial Networks and Image Synthesis
Chinese University of Hong Kong, Shenzhen
2023-2024
Shenzhen Research Institute of Big Data
2024
Northwestern Polytechnical University
2018-2024
Association for Symbolic Logic
2023
An emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred synthetic speech is not accurate and expressive enough, with emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers, one after the reference encoder and one after the decoder output, to enhance...
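The following is a minimal PyTorch sketch of the classifier idea this abstract describes, not the paper's exact implementation: a small auxiliary emotion classifier is attached both to the reference embedding and to the decoder output, and its cross-entropy terms are added to the usual TTS reconstruction loss. All module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Small auxiliary classifier usable on any (batch, time, dim) features."""
    def __init__(self, in_dim: int, num_emotions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, num_emotions)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x.mean(dim=1))  # mean-pool over time, then classify

def auxiliary_emotion_loss(ref_embed, decoder_mels, emotion_ids, clf_ref, clf_dec):
    # ref_embed: (batch, dim) reference embedding; decoder_mels: (batch, T, mel_dim)
    ce = nn.CrossEntropyLoss()
    loss_ref = ce(clf_ref(ref_embed.unsqueeze(1)), emotion_ids)
    loss_dec = ce(clf_dec(decoder_mels), emotion_ids)
    return loss_ref + loss_dec  # added to the usual TTS reconstruction loss
```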
Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time compute for speech...
Voice conversion for highly expressive speech is challenging. Current approaches struggle with the balance between speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that leverages advantages from both the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder to form a content extractor that learns linguistic and para-linguistic features respectively,...
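A minimal sketch of the two-branch content extractor described above, with hypothetical encoder modules: one branch consumes ASR bottleneck features for linguistic content, the other a speaker-perturbed waveform for para-linguistic detail, and the two frame-aligned streams are fused before decoding.

```python
import torch
import torch.nn as nn

class TwoBranchContentExtractor(nn.Module):
    def __init__(self, bnf_encoder: nn.Module, wav_encoder: nn.Module, dim: int = 256):
        super().__init__()
        self.bnf_encoder = bnf_encoder  # consumes ASR bottleneck features (linguistic)
        self.wav_encoder = wav_encoder  # consumes perturbed waveform (para-linguistic)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, bnf, perturbed_wav):
        ling = self.bnf_encoder(bnf)            # (batch, T, dim)
        para = self.wav_encoder(perturbed_wav)  # (batch, T, dim), frame-aligned
        return self.fuse(torch.cat([ling, para], dim=-1))
```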
Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it incompatible with signals like singing voices that require...
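The fixed-resolution limitation can be made concrete with the common multi-resolution workaround: several STFT-based sub-discriminators, each with a different window and hop, so no single time-frequency trade-off dominates. This PyTorch sketch is illustrative and is not the paper's proposed discriminator.

```python
import torch
import torch.nn as nn

class STFTSubDiscriminator(nn.Module):
    def __init__(self, n_fft: int, hop: int):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> log-magnitude spectrogram -> real/fake score map
        spec = torch.stft(
            wav, self.n_fft, self.hop,
            window=torch.hann_window(self.n_fft, device=wav.device),
            return_complex=True,
        ).abs()
        return self.conv(torch.log1p(spec).unsqueeze(1))

class MultiResolutionDiscriminator(nn.Module):
    """Each sub-discriminator sees a different time-frequency trade-off."""
    def __init__(self, resolutions=((512, 128), (1024, 256), (2048, 512))):
        super().__init__()
        self.subs = nn.ModuleList(STFTSubDiscriminator(n, h) for n, h in resolutions)

    def forward(self, wav):
        return [d(wav) for d in self.subs]
```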
Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, with both simplified system-building pipelines and high-quality speech. With a unique encoder-decoder structure, Tacotron2 no longer needs a separately learned text analysis front-end, duration model, acoustic model, or audio synthesis module. The key lies in its attention mechanism, which learns an alignment between the encoder and decoder, serving as an implicit duration model...
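A toy sketch of one decoder step of such an attention mechanism, assuming simple dot-product scoring (Tacotron2 itself uses location-sensitive attention): the softmax over encoder frames is the soft alignment that, accumulated across decoder steps, acts as the implicit duration model the abstract mentions.

```python
import torch
import torch.nn.functional as F

def attention_step(query: torch.Tensor, keys: torch.Tensor):
    # query: (batch, dim) decoder state; keys: (batch, T_enc, dim) encoder outputs
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)  # (batch, T_enc)
    align = F.softmax(scores, dim=-1)                          # soft alignment
    context = torch.bmm(align.unsqueeze(1), keys).squeeze(1)   # attended summary
    return context, align  # stacking `align` over steps gives the alignment matrix
```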
When deploying a Chinese neural text-to-speech (TTS) system, one of the challenges is to synthesize utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker are available. Specifically, we view the problem from two aspects: consistency within an utterance and naturalness. We start the investigation with an average voice model, which is built from multi-speaker monolingual data, i.e., Mandarin data. On the basis of that, we look into speaker embedding for consistency within an utterance and phoneme embedding for naturalness...
The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically rich discrete sequence. Furthermore, the encoder is enhanced with 1) a contextual modeling BLSTM module to exploit temporal information, 2) a hybrid sampling...
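A minimal sketch of single-codebook quantization with a straight-through estimator; the disentanglement into a time-invariant embedding plus a discrete content sequence is only indicated in the comments, and all names and dimensions are hypothetical rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SingleCodebookVQ(nn.Module):
    """One codebook, so each utterance maps to a single discrete sequence."""
    def __init__(self, dim: int = 256, codebook_size: int = 1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):          # z: (batch, T, dim), content only
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).view_as(z)
        idx = idx.view(z.shape[:-1])             # (batch, T) phonetic token sequence
        # straight-through estimator: gradients bypass the discrete choice
        return z + (q - z).detach(), idx

# In a disentangled setup, a separate encoder would first produce a
# time-invariant (e.g., mean-pooled) utterance embedding carrying timbre,
# leaving the VQ sequence to carry the phonetically rich content.
```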
Deep learning has advanced automatic speaker verification (ASV) in the past few years. Although it is known that deep learning-based ASV systems are vulnerable to adversarial examples in digital access, there are few studies on adversarial attacks in the context of physical access, where a replay process (i.e., over the air) is involved. An over-the-air attack involves a loudspeaker, a microphone, and a replaying environment that impacts the movement of the sound wave. Our initial experiment confirms the effect of the replay process on attack performance. This study performs an...
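As a rough sketch of how an over-the-air replay channel is often approximated in simulation (the study itself concerns physical replay): convolve the adversarial waveform with a room impulse response and add microphone noise at a chosen SNR. The function and parameters below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_replay(wav: np.ndarray, rir: np.ndarray, snr_db: float = 30.0):
    """Loudspeaker -> room -> microphone, approximated by RIR + additive noise."""
    out = fftconvolve(wav, rir)[: len(wav)]       # room reverberation
    out = out / (np.max(np.abs(out)) + 1e-9)      # crude playback gain control
    noise = np.random.randn(len(out))
    target_pow = np.mean(out ** 2) / (10 ** (snr_db / 10))
    noise *= np.sqrt(target_pow / (np.mean(noise ** 2) + 1e-12))
    return out + noise                            # microphone pickup
```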
With the development of large text-to-speech (TTS) models and scaled-up training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for TTS tasks, it is refined by adjusting segment boundaries, enhancing audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process,...
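A hypothetical sketch of quality-based segment filtering of the kind such a pipeline applies; the threshold values and the mos_fn quality estimator are placeholders, not the corpus's actual criteria.

```python
def filter_segments(segments, mos_fn, min_mos=3.5, min_dur=1.0, max_dur=30.0):
    """Keep segments whose duration and predicted quality pass the thresholds."""
    kept = []
    for seg in segments:  # seg: dict with "audio", "duration", "text"
        if not (min_dur <= seg["duration"] <= max_dur):
            continue
        if mos_fn(seg["audio"]) < min_mos:  # mos_fn: any MOS/quality estimator
            continue
        kept.append(seg)
    return kept
```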
Recent advancements in neural end-to-end text-to-speech (TTS) models have shown high-quality, natural synthesized speech in conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty of training, we propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training. Three...
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, aiming to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. Additionally, it provides interactive visualizations and demonstrations of classic...
Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural network (DNN) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional data. Specifically, we study three typical model adaptation approaches: (1) retraining the model with emotion-specific data (retrain), (2) augmenting the network input using emotion codes (code), and (3) using emotion-dependent output layers with shared...
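Approach (2) is easy to make concrete. In this PyTorch sketch, a one-hot emotion code is concatenated to the frame-level linguistic features so a single network serves all emotions; the layer sizes and dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionCodeDNN(nn.Module):
    def __init__(self, ling_dim=300, num_emotions=4, out_dim=80):
        super().__init__()
        self.num_emotions = num_emotions
        self.net = nn.Sequential(
            nn.Linear(ling_dim + num_emotions, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, out_dim),  # frame-level acoustic features
        )

    def forward(self, ling_feats, emotion_id):
        # ling_feats: (batch, T, ling_dim); emotion_id: (batch,) long tensor
        code = F.one_hot(emotion_id, self.num_emotions).float()
        code = code.unsqueeze(1).expand(-1, ling_feats.size(1), -1)
        return self.net(torch.cat([ling_feats, code], dim=-1))
```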
Accent conversion aims to convert the accent of a source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and different acoustic...
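The alignment idea can be sketched as a simple representation-matching loss; the choice of L1 distance and the assumption of frame-aligned inputs are illustrative, not the paper's stated objective.

```python
import torch.nn.functional as F

def representation_alignment_loss(ac_repr, tts_repr):
    # both: (batch, T, dim), assumed frame-aligned; pushes the conversion
    # model's representations toward the TTS-derived accent-agnostic ones
    return F.l1_loss(ac_repr, tts_repr)
```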
In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a display of the generation process in diffusion models, showcasing the step-by-step denoising of a noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the resulting...
Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant time-frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible...
Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of that speaker, without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representations during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook the variation in speaker information richness in the temporal and frequency channel dimensions of speech. This insufficient speaker modeling hampers the ability to accurately represent unseen speakers who are not in the training dataset. In...
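A minimal sketch of one standard remedy for this kind of under-modeling: attentive pooling over the temporal dimension plus a squeeze-and-excitation-style gate over channels, so frames and channels contribute unequally to the speaker embedding. This is illustrative, not the paper's method.

```python
import torch
import torch.nn as nn

class AttentiveSpeakerPool(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.time_attn = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1)
        )
        self.chan_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, dim) frame-level features
        w = torch.softmax(self.time_attn(feats), dim=1)  # per-frame weights
        pooled = (w * feats).sum(dim=1)                  # weighted temporal pool
        return pooled * self.chan_gate(pooled)           # channel re-weighting
```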