NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

DOI: 10.48550/arxiv.2403.03100 Publication Date: 2024-03-05
ABSTRACT
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate the attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
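The idea of quantizing each attribute against its own codebook can be illustrated with a toy sketch. This is not the paper's codec: the fixed slicing of one latent frame into four sub-vectors, the names SUBSPACES, TOY_CODEBOOKS, nearest_code, and factorized_quantize, and the two-entry codebooks are all assumptions made for illustration. A real system would learn the encoders and codebooks and would need additional mechanisms to actually disentangle the attributes.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: (attribute name, start, end) slices of one latent frame.
# The abstract only names the four attributes; the slicing is an assumption.
SUBSPACES: List[Tuple[str, int, int]] = [
    ("content", 0, 2),
    ("prosody", 2, 4),
    ("timbre", 4, 6),
    ("acoustic_details", 6, 8),
]

# Toy codebooks (assumed values): two 2-D entries per attribute.
TOY_CODEBOOKS: Dict[str, List[List[float]]] = {
    name: [[0.0, 0.0], [1.0, 1.0]] for name, _, _ in SUBSPACES
}


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def factorized_quantize(
    frame: Sequence[float],
    codebooks: Dict[str, List[List[float]]],
) -> Tuple[Dict[str, int], List[float]]:
    """Quantize each attribute slice of a frame with its own codebook.

    Returns one code index per attribute and the concatenated
    quantized vector.
    """
    indices: Dict[str, int] = {}
    quantized: List[float] = []
    for name, start, end in SUBSPACES:
        idx = nearest_code(frame[start:end], codebooks[name])
        indices[name] = idx
        quantized.extend(codebooks[name][idx])
    return indices, quantized


if __name__ == "__main__":
    frame = [0.1, -0.1, 0.9, 1.2, 0.4, 0.3, 0.8, 0.6]
    codes, recon = factorized_quantize(frame, TOY_CODEBOOKS)
    print(codes)
    print(recon)
```

Because each attribute has its own discrete code, a downstream generator could in principle predict the codes for one attribute at a time, each conditioned on its own prompt, which is the divide-and-conquer intuition the abstract describes.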