Liumeng Xue

ORCID: 0000-0003-2815-8494
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Topic Modeling
  • Natural Language Processing Techniques
  • Speech and dialogue systems
  • Phonetics and Phonology Research
  • Voice and Speech Disorders
  • Music Technology and Sound Studies
  • AI in Service Interactions
  • Sensor Technology and Measurement Systems
  • Adversarial Robustness in Machine Learning
  • Advanced Data Compression Techniques
  • Geophysical Methods and Applications
  • Wireless Signal Modulation Classification
  • Advanced Sensor Technologies Research
  • Flow Measurement and Analysis
  • Computational Physics and Python Applications
  • Generative Adversarial Networks and Image Synthesis

Affiliations

Chinese University of Hong Kong, Shenzhen
2023-2024

Shenzhen Research Institute of Big Data
2024

Northwestern Polytechnical University
2018-2024

Association for Symbolic Logic
2023

Publications

An emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate and expressive enough, with category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired strength. To solve these problems, we propose a novel method based on Tacotron. First, we plug in two classifiers - one after the encoder, one after the decoder output - to enhance...

10.1109/iscslp49672.2021.9362069 article EN 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2021-01-24
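As a rough illustration of attaching auxiliary emotion classifiers to an encoder-decoder TTS model, the sketch below adds one classifier on the encoder output and one on the predicted mel-spectrogram; all dimensions, module names, and the pooling strategy are assumptions for illustration, not the paper's implementation.

```python
# Sketch only: auxiliary emotion classifiers on encoder and decoder outputs.
# Dimensions (256-dim encoder states, 80-dim mel frames, 5 emotions) are assumed.
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Mean-pools a variable-length sequence and predicts an emotion category."""
    def __init__(self, feat_dim: int, num_emotions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_emotions))

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, feat_dim)
        return self.net(seq.mean(dim=1))

enc_clf = EmotionClassifier(feat_dim=256, num_emotions=5)  # after the encoder
dec_clf = EmotionClassifier(feat_dim=80, num_emotions=5)   # after the decoder (mel) output

def auxiliary_emotion_loss(enc_out, mel_out, emotion_id):
    """Cross-entropy terms that would be added to the usual TTS reconstruction loss."""
    ce = nn.CrossEntropyLoss()
    return ce(enc_clf(enc_out), emotion_id) + ce(dec_clf(mel_out), emotion_id)

# Dummy example
loss = auxiliary_emotion_loss(torch.randn(4, 120, 256),   # encoder states
                              torch.randn(4, 400, 80),    # predicted mel frames
                              torch.randint(0, 5, (4,)))  # emotion labels
```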

Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time compute for speech...

10.48550/arxiv.2502.04128 preprint EN arXiv (Cornell University) 2025-02-06

Voice conversion for highly expressive speech is challenging. Current approaches struggle with the balance between speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that leverages advantages from both the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder to form a content extractor that learns linguistic and para-linguistic features respectively,...

10.1109/icassp49357.2023.10096057 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
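A minimal sketch of the two-branch content extractor idea follows, assuming GRU encoders, illustrative feature dimensions, and simple concatenation for fusion; the actual Expressive-VC architecture is described in the paper.

```python
# Sketch only: a content extractor with a bottleneck-feature (BNF) branch for
# linguistic content and a perturbed-waveform branch for para-linguistic content.
# Encoder types, dimensions, and concatenation-based fusion are assumptions.
import torch
import torch.nn as nn

class ContentExtractor(nn.Module):
    def __init__(self, bnf_dim=256, perturb_dim=80, hidden=256):
        super().__init__()
        self.bnf_enc = nn.GRU(bnf_dim, hidden, batch_first=True)          # linguistic branch
        self.perturb_enc = nn.GRU(perturb_dim, hidden, batch_first=True)  # para-linguistic branch
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, bnf, perturbed_feats):
        # bnf: (batch, T, bnf_dim) ASR bottleneck features
        # perturbed_feats: (batch, T, perturb_dim) features of the perturbed waveform
        h_bnf, _ = self.bnf_enc(bnf)
        h_ptb, _ = self.perturb_enc(perturbed_feats)
        T = min(h_bnf.size(1), h_ptb.size(1))
        return self.fuse(torch.cat([h_bnf[:, :T], h_ptb[:, :T]], dim=-1))

content = ContentExtractor()(torch.randn(2, 200, 256), torch.randn(2, 200, 80))
```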

Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it incompatible with signals like singing voices that require...

10.1109/icassp48485.2024.10448436 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
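The fixed-resolution limitation can be made concrete with a small sketch that computes magnitude spectrograms at several STFT window sizes, each of which could feed its own sub-discriminator; the window sizes and hop lengths here are assumptions, not the paper's configuration.

```python
# Sketch only: multi-resolution magnitude spectrograms, one per STFT window size,
# each of which could feed a separate sub-discriminator. Window/hop sizes are assumed.
import torch

def multi_resolution_spectrograms(wav: torch.Tensor, n_ffts=(512, 1024, 2048)):
    """wav: (batch, samples) -> list of magnitude spectrograms, one per resolution."""
    specs = []
    for n_fft in n_ffts:
        spec = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                          window=torch.hann_window(n_fft), return_complex=True)
        specs.append(spec.abs())  # (batch, n_fft // 2 + 1, frames)
    return specs

specs = multi_resolution_spectrograms(torch.randn(2, 16000))
```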

Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder structure, Tacotron2 no longer needs a separately learned text analysis front-end, duration model, acoustic model, or audio synthesis module. The key lies in the attention mechanism, which learns an alignment between the encoder and decoder, serving as an implicit duration model...

10.1109/access.2019.2914149 article EN cc-by-nc-nd IEEE Access 2019-01-01

When deploying a Chinese neural Text-to-Speech (TTS) system, one of the challenges is to synthesize utterances with English phrases or words embedded. This paper looks into this problem in the encoder-decoder framework when only monolingual data from the target speaker is available. Specifically, we view the problem from two aspects: consistency within an utterance and naturalness. We start the investigation with an average voice model which is built from multi-speaker data, i.e., Mandarin data. On the basis of that, we look into phoneme embedding for naturalness...

10.21437/interspeech.2019-3191 article EN Interspeech 2019 2019-09-13

The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermore, the encoder is enhanced with 1) a contextual modeling BLSTM module to exploit temporal information, 2) a hybrid sampling...

10.21437/interspeech.2024-1559 article EN Interspeech 2024 2024-09-01
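To illustrate the decoupling into a time-invariant embedding and a single discrete sequence, here is a toy single-codebook vector-quantization sketch; the codebook size, dimensions, and mean-pooled global embedding are illustrative assumptions rather than the Single-Codec implementation.

```python
# Sketch only: a toy single-codebook VQ layer plus a crude time-invariant embedding.
# Codebook size, dimensions, and mean-pooling are illustrative assumptions.
import torch
import torch.nn as nn

class SingleCodebookVQ(nn.Module):
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) continuous content features
        flat = z.reshape(-1, z.size(-1))
        # squared Euclidean distance to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=-1).view(z.size(0), z.size(1))  # one discrete sequence
        return self.codebook(indices), indices

frames = torch.randn(2, 150, 256)              # encoder output (assumed 256-dim)
global_emb = frames.mean(dim=1, keepdim=True)  # crude time-invariant embedding
quantized, codes = SingleCodebookVQ()(frames - global_emb)
# `codes` plays the role of the single token sequence an LLM-based TTS would predict.
```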

Deep learning has advanced Automatic Speaker Verification (ASV) in the past few years. Although it is known that deep learning-based ASV systems are vulnerable to adversarial examples in digital access, there are few studies on adversarial attacks in the context of physical access, where a replay process (i.e., over the air) is involved. An over-the-air attack involves a loudspeaker, a microphone, and a replaying environment that impacts the movement of the sound wave. Our initial experiment confirms the effectiveness of over-the-air attacks on ASV performance. This study performs an...

10.1109/icassp48485.2024.10447811 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for TTS tasks, it is refined by adjusting segment boundaries, enhancing audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process,...

10.21437/interspeech.2024-2343 article EN Interspeech 2024 2024-09-01
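As a purely hypothetical illustration of quality-based segment filtering, the snippet below drops segments by duration and an estimated quality score; the field names and thresholds are invented for illustration and do not reflect the corpus's actual filtering criteria or tiers.

```python
# Hypothetical sketch of quality-based segment filtering; field names and thresholds
# are invented for illustration and are not the corpus's actual criteria.
segments = [
    {"id": "utt_0001", "duration": 6.2, "quality_score": 3.8},
    {"id": "utt_0002", "duration": 1.1, "quality_score": 2.4},
]

def keep(seg, min_dur=2.0, max_dur=20.0, min_quality=3.0):
    """Keep segments with an acceptable duration and estimated audio quality."""
    return min_dur <= seg["duration"] <= max_dur and seg["quality_score"] >= min_quality

filtered = [seg for seg in segments if keep(seg)]
print(f"kept {len(filtered)} of {len(segments)} segments")
```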

Recent advancements in neural end-to-end text-to-speech (TTS) models have shown high-quality, natural synthesized speech in conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty in training, we propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training. Three...

10.1109/taslp.2022.3202126 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01

10.1109/taslp.2024.3468005 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01

Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. Additionally, it provides interactive visualizations and demonstrations of classic...

10.48550/arxiv.2312.09911 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural network (DNN) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional data. Specifically, we investigate three typical model adaptation approaches: (1) retraining the model with emotion-specific data (retrain), (2) augmenting the network input using emotion codes (code), and (3) using emotion-dependent output layers with shared...

10.1145/3267935.3267947 article EN 2018-10-19
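The "code" adaptation approach can be sketched as appending a one-hot emotion code to every input frame of the acoustic model; the layer sizes and feature dimensions below are assumptions for illustration, not the paper's configuration.

```python
# Sketch only: the "code" adaptation approach, where a one-hot emotion code is
# appended to every input frame of the acoustic model. Dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 4          # assumed number of emotion categories
LINGUISTIC_DIM = 300      # assumed frame-level linguistic feature size
ACOUSTIC_DIM = 187        # assumed acoustic feature size

acoustic_model = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM + NUM_EMOTIONS, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, ACOUSTIC_DIM),
)

def forward_with_emotion_code(linguistic_feats, emotion_id):
    """linguistic_feats: (frames, LINGUISTIC_DIM); emotion_id: integer label."""
    code = F.one_hot(torch.tensor(emotion_id), NUM_EMOTIONS).float()
    code = code.expand(linguistic_feats.size(0), -1)  # repeat the code per frame
    return acoustic_model(torch.cat([linguistic_feats, code], dim=-1))

pred = forward_with_emotion_code(torch.randn(100, LINGUISTIC_DIM), emotion_id=2)
```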

Accent conversion aims to convert the accent of source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent of the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the voice conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and different acoustic...

10.48550/arxiv.2401.03538 preprint EN cc-by arXiv (Cornell University) 2024-01-01

In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the resulting...

10.48550/arxiv.2402.12660 preprint EN arXiv (Cornell University) 2024-02-19

Accent conversion aims to convert the accent of source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent of the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the voice conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and different acoustic...

10.1109/icassp48485.2024.10447205 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR) based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant time-frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible...

10.48550/arxiv.2404.17161 preprint EN arXiv (Cornell University) 2024-04-26

Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of that speaker, without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representations during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook the variation of speaker information richness in the temporal and frequency channel dimensions of speech. This insufficient speaker modeling hampers the ability to accurately represent unseen speakers who are not in the training dataset. In...

10.1109/taslp.2024.3407577 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01
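As a generic illustration of weighting speech frames by how much speaker information they carry, here is an attentive-pooling speaker encoder sketch; it is a standard technique written under assumed dimensions, not this paper's specific temporal- and channel-dimension modeling.

```python
# Sketch only: a generic attentive-pooling speaker encoder. Frames that carry more
# speaker information receive larger weights. Dimensions are assumed.
import torch
import torch.nn as nn

class AttentivePoolingSpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, emb_dim=192):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.attn = nn.Linear(hidden, 1)       # scalar score per frame
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, feat_dim)
        h = self.frame_net(mel)
        w = torch.softmax(self.attn(h), dim=1)  # temporal attention weights
        return self.proj((w * h).sum(dim=1))    # (batch, emb_dim) speaker embedding

spk_emb = AttentivePoolingSpeakerEncoder()(torch.randn(2, 300, 80))
```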