- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Topic Modeling
- Natural Language Processing Techniques
- Speech and Dialogue Systems
- Phonetics and Phonology Research
- Voice and Speech Disorders
- Music Technology and Sound Studies
- AI in Service Interactions
- Sensor Technology and Measurement Systems
- Adversarial Robustness in Machine Learning
- Advanced Data Compression Techniques
- Geophysical Methods and Applications
- Wireless Signal Modulation Classification
- Advanced Sensor Technologies Research
- Flow Measurement and Analysis
- Computational Physics and Python Applications
- Generative Adversarial Networks and Image Synthesis
Chinese University of Hong Kong, Shenzhen
2023-2024
Shenzhen Research Institute of Big Data
2024
Northwestern Polytechnical University
2018-2024
Association for Symbolic Logic
2023
An emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred synthetic speech is not accurate and expressive enough, with emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers, one after the reference encoder and one after the decoder output, to enhance...
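The following is a minimal PyTorch sketch of the classifier idea this abstract describes, not the paper's exact implementation: a small auxiliary emotion classifier is attached both to the reference embedding and to the decoder output, and its cross-entropy terms are added to the usual TTS reconstruction loss. All module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Small auxiliary classifier usable on any (batch, time, dim) features."""
    def __init__(self, in_dim: int, num_emotions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, num_emotions)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x.mean(dim=1))  # mean-pool over time, then classify

def auxiliary_emotion_loss(ref_embed, decoder_mels, emotion_ids, clf_ref, clf_dec):
    # ref_embed: (batch, dim) reference embedding; decoder_mels: (batch, T, mel_dim)
    ce = nn.CrossEntropyLoss()
    loss_ref = ce(clf_ref(ref_embed.unsqueeze(1)), emotion_ids)
    loss_dec = ce(clf_dec(decoder_mels), emotion_ids)
    return loss_ref + loss_dec  # added to the usual TTS reconstruction loss
```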
Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time compute for speech...
Voice conversion for highly expressive speech is challenging. Current approaches struggle with the balance between speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that leverages advantages from both the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder to form a content extractor that learns linguistic and para-linguistic features respectively,...
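A minimal sketch of the two-branch content extractor described above, with hypothetical encoder modules: one branch consumes ASR bottleneck features for linguistic content, the other a speaker-perturbed waveform for para-linguistic detail, and the two frame-aligned streams are fused before decoding.

```python
import torch
import torch.nn as nn

class TwoBranchContentExtractor(nn.Module):
    def __init__(self, bnf_encoder: nn.Module, wav_encoder: nn.Module, dim: int = 256):
        super().__init__()
        self.bnf_encoder = bnf_encoder  # consumes ASR bottleneck features (linguistic)
        self.wav_encoder = wav_encoder  # consumes perturbed waveform (para-linguistic)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, bnf, perturbed_wav):
        ling = self.bnf_encoder(bnf)            # (batch, T, dim)
        para = self.wav_encoder(perturbed_wav)  # (batch, T, dim), frame-aligned
        return self.fuse(torch.cat([ling, para], dim=-1))
```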
Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it incompatible with signals like singing voices that require...
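The fixed-resolution limitation can be made concrete with the common multi-resolution workaround: several STFT-based sub-discriminators, each with a different window and hop, so no single time-frequency trade-off dominates. This PyTorch sketch is illustrative and is not the paper's proposed discriminator.

```python
import torch
import torch.nn as nn

class STFTSubDiscriminator(nn.Module):
    def __init__(self, n_fft: int, hop: int):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> log-magnitude spectrogram -> real/fake score map
        spec = torch.stft(
            wav, self.n_fft, self.hop,
            window=torch.hann_window(self.n_fft, device=wav.device),
            return_complex=True,
        ).abs()
        return self.conv(torch.log1p(spec).unsqueeze(1))

class MultiResolutionDiscriminator(nn.Module):
    """Each sub-discriminator sees a different time-frequency trade-off."""
    def __init__(self, resolutions=((512, 128), (1024, 256), (2048, 512))):
        super().__init__()
        self.subs = nn.ModuleList(STFTSubDiscriminator(n, h) for n, h in resolutions)

    def forward(self, wav):
        return [d(wav) for d in self.subs]
```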
Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, with both simplified system-building pipelines and high-quality speech. With a unique encoder-decoder structure, Tacotron2 no longer needs a separately learned text analysis front-end, duration model, acoustic model, or audio synthesis module. The key lies in its attention mechanism, which learns an alignment between the encoder and decoder, serving as an implicit duration model...
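A toy sketch of one decoder step of such an attention mechanism, assuming simple dot-product scoring (Tacotron2 itself uses location-sensitive attention): the softmax over encoder frames is the soft alignment that, accumulated across decoder steps, acts as the implicit duration model the abstract mentions.

```python
import torch
import torch.nn.functional as F

def attention_step(query: torch.Tensor, keys: torch.Tensor):
    # query: (batch, dim) decoder state; keys: (batch, T_enc, dim) encoder outputs
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)  # (batch, T_enc)
    align = F.softmax(scores, dim=-1)                          # soft alignment
    context = torch.bmm(align.unsqueeze(1), keys).squeeze(1)   # attended summary
    return context, align  # stacking `align` over steps gives the alignment matrix
```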
When deploying a Chinese neural text-to-speech (TTS) system, one of the challenges is to synthesize utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker are available. Specifically, we view the problem from two aspects: consistency within an utterance and naturalness. We start the investigation with an average voice model, which is built from multi-speaker monolingual data, i.e., Mandarin data. On the basis of that, we look into speaker embedding for consistency within an utterance and phoneme embedding for naturalness...
The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically rich discrete sequence. Furthermore, the encoder is enhanced with 1) a contextual modeling BLSTM module to exploit temporal information, 2) a hybrid sampling...
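A minimal sketch of single-codebook quantization with a straight-through estimator; the disentanglement into a time-invariant embedding plus a discrete content sequence is only indicated in the comments, and all names and dimensions are hypothetical rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SingleCodebookVQ(nn.Module):
    """One codebook, so each utterance maps to a single discrete sequence."""
    def __init__(self, dim: int = 256, codebook_size: int = 1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):          # z: (batch, T, dim), content only
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).view_as(z)
        idx = idx.view(z.shape[:-1])             # (batch, T) phonetic token sequence
        # straight-through estimator: gradients bypass the discrete choice
        return z + (q - z).detach(), idx

# In a disentangled setup, a separate encoder would first produce a
# time-invariant (e.g., mean-pooled) utterance embedding carrying timbre,
# leaving the VQ sequence to carry the phonetically rich content.
```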
Deep learning has advanced automatic speaker verification (ASV) in the past few years. Although it is known that deep learning-based ASV systems are vulnerable to adversarial examples in digital access, there are few studies on adversarial attacks in the context of physical access, where a replay process (i.e., over the air) is involved. An over-the-air attack involves a loudspeaker, a microphone, and a replaying environment that impacts the movement of the sound wave. Our initial experiment confirms the effect of the replay process on attack performance. This study performs an...
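As a rough sketch of how an over-the-air replay channel is often approximated in simulation (the study itself concerns physical replay): convolve the adversarial waveform with a room impulse response and add microphone noise at a chosen SNR. The function and parameters below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_replay(wav: np.ndarray, rir: np.ndarray, snr_db: float = 30.0):
    """Loudspeaker -> room -> microphone, approximated by RIR + additive noise."""
    out = fftconvolve(wav, rir)[: len(wav)]       # room reverberation
    out = out / (np.max(np.abs(out)) + 1e-9)      # crude playback gain control
    noise = np.random.randn(len(out))
    target_pow = np.mean(out ** 2) / (10 ** (snr_db / 10))
    noise *= np.sqrt(target_pow / (np.mean(noise ** 2) + 1e-12))
    return out + noise                            # microphone pickup
```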
With the development of large text-to-speech (TTS) models and scaled-up training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for TTS tasks, it is refined by adjusting segment boundaries, enhancing audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process,...
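A hypothetical sketch of quality-based segment filtering of the kind such a pipeline applies; the threshold values and the mos_fn quality estimator are placeholders, not the corpus's actual criteria.

```python
def filter_segments(segments, mos_fn, min_mos=3.5, min_dur=1.0, max_dur=30.0):
    """Keep segments whose duration and predicted quality pass the thresholds."""
    kept = []
    for seg in segments:  # seg: dict with "audio", "duration", "text"
        if not (min_dur <= seg["duration"] <= max_dur):
            continue
        if mos_fn(seg["audio"]) < min_mos:  # mos_fn: any MOS/quality estimator
            continue
        kept.append(seg)
    return kept
```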
Recent advancements in neural end-to-end text-to-speech (TTS) models have shown high-quality, natural synthesized speech in conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty of training, we propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training. Three...
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, aiming to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. Additionally, it provides interactive visualizations and demonstrations of classic...
Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural network (DNN) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional data. Specifically, we study three typical model adaptation approaches: (1) retraining the model with emotion-specific data (retrain), (2) augmenting the network input using emotion codes (code), and (3) using emotion-dependent output layers with shared...
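Approach (2) is easy to make concrete. In this PyTorch sketch, a one-hot emotion code is concatenated to the frame-level linguistic features so a single network serves all emotions; the layer sizes and dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionCodeDNN(nn.Module):
    def __init__(self, ling_dim=300, num_emotions=4, out_dim=80):
        super().__init__()
        self.num_emotions = num_emotions
        self.net = nn.Sequential(
            nn.Linear(ling_dim + num_emotions, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, out_dim),  # frame-level acoustic features
        )

    def forward(self, ling_feats, emotion_id):
        # ling_feats: (batch, T, ling_dim); emotion_id: (batch,) long tensor
        code = F.one_hot(emotion_id, self.num_emotions).float()
        code = code.unsqueeze(1).expand(-1, ling_feats.size(1), -1)
        return self.net(torch.cat([ling_feats, code], dim=-1))
```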
Accent conversion aims to convert the accent of a source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and different acoustic...
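The alignment idea can be sketched as a simple representation-matching loss; the choice of L1 distance and the assumption of frame-aligned inputs are illustrative, not the paper's stated objective.

```python
import torch.nn.functional as F

def representation_alignment_loss(ac_repr, tts_repr):
    # both: (batch, T, dim), assumed frame-aligned; pushes the conversion
    # model's representations toward the TTS-derived accent-agnostic ones
    return F.l1_loss(ac_repr, tts_repr)
```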
In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a display of the generation process in diffusion models, showcasing the step-by-step denoising of a noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the resulting...
Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant time-frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible...
Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of that speaker, without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representations during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook the variation in speaker information richness in the temporal and frequency channel dimensions of speech. This insufficient speaker modeling hampers the ability to accurately represent unseen speakers who are not in the training dataset. In...
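A minimal sketch of one standard remedy for this kind of under-modeling: attentive pooling over the temporal dimension plus a squeeze-and-excitation-style gate over channels, so frames and channels contribute unequally to the speaker embedding. This is illustrative, not the paper's method.

```python
import torch
import torch.nn as nn

class AttentiveSpeakerPool(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.time_attn = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1)
        )
        self.chan_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, dim) frame-level features
        w = torch.softmax(self.time_attn(feats), dim=1)  # per-frame weights
        pooled = (w * feats).sum(dim=1)                  # weighted temporal pool
        return pooled * self.chan_gate(pooled)           # channel re-weighting
```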