NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

DOI: 10.48550/arxiv.2403.03100
Publication Date: 2024-03-05

AUTHORS (19)

Zeqian Ju

Yuancheng Wang

Kai Shen

Xu Tan

Detai Xin

Dongchao Yang

Yanqing Liu

Yichong Leng

Kaitao Song

Siliang Tang

Zhizheng Wu

Tao Qin

Xiangyang Li

Wei Ye

Shikun Zhang

Jiang Bian

Lei He

Jinyu Li

Sheng Zhao

ABSTRACT

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model the intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
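The idea of quantizing each attribute subspace with its own codebook can be illustrated with a toy sketch. This is not the NaturalSpeech 3 codec: the fixed slicing of a latent frame, the tiny hand-written codebooks, and the names `nearest_code`, `factorized_quantize`, and `reconstruct` are all assumptions made for illustration, whereas the system described in the abstract learns its subspaces and codebooks from data.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def factorized_quantize(frame, codebooks, slices):
    """Quantize each attribute slice of `frame` with its own codebook.

    `slices` maps an attribute name to a (start, stop) range of the frame.
    This fixed split is a hypothetical stand-in for a learned factorization.
    """
    return {
        name: nearest_code(frame[start:stop], codebooks[name])
        for name, (start, stop) in slices.items()
    }


def reconstruct(tokens, codebooks, slices):
    """Rebuild a frame by writing each selected code back into its slice."""
    length = max(stop for _, stop in slices.values())
    frame = [0.0] * length
    for name, (start, stop) in slices.items():
        frame[start:stop] = codebooks[name][tokens[name]]
    return frame


# Toy setup: a 4-dimensional frame split into two 2-dimensional subspaces.
# The attribute labels are borrowed from the abstract; the values are made up.
slices = {"content": (0, 2), "prosody": (2, 4)}
codebooks = {
    "content": [[0.0, 0.0], [1.0, 1.0]],
    "prosody": [[0.0, 0.0], [2.0, 2.0]],
}

frame = [0.9, 1.2, 0.1, -0.2]
tokens = factorized_quantize(frame, codebooks, slices)
approx = reconstruct(tokens, codebooks, slices)
print(tokens, approx)
```

Because each subspace has its own discrete token, a generator can in principle model or replace one attribute's tokens without touching the others, which is the divide-and-conquer property the abstract refers to.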
