NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Mean opinion score Leverage (statistics) Quality Score Benchmark (surveying)
DOI: 10.48550/arxiv.2205.04421 Publication Date: 2022-01-01
ABSTRACT
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. This raises natural questions: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
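The evaluation criterion in the abstract (a CMOS near zero together with a non-significant Wilcoxon signed rank test) can be illustrated with a minimal sketch. The ratings below are hypothetical and are not data from the paper; the function names are illustrative; and the p-value uses the standard normal approximation with tie correction, which is an assumption here since the abstract does not state how the test was computed.

```python
import math
from collections import Counter


def cmos(scores):
    """Comparative mean opinion score: mean of paired comparison ratings
    (system relative to reference, e.g. on a -3..+3 scale)."""
    return sum(scores) / len(scores)


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed rank test via the normal approximation.

    Zero differences are dropped and tied magnitudes receive average ranks.
    Returns (W_plus, p_value). Assumption: normal approximation, which is
    rough for very small samples.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0  # no evidence of any difference

    # Assign the average rank to each distinct magnitude |d|.
    counts = Counter(abs(d) for d in nonzero)
    rank_of = {}
    start = 1
    for magnitude in sorted(counts):
        t = counts[magnitude]
        rank_of[magnitude] = start + (t - 1) / 2  # mean of start..start+t-1
        start += t

    # Sum of ranks belonging to positive differences.
    w_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)

    # Null mean and tie-corrected variance of W_plus.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    var -= sum(t**3 - t for t in counts.values()) / 48

    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical per-sentence comparison ratings (system vs. recording).
    ratings = [1, -1, 0, 2, -2, 0, 1, -1, 0, 0]
    w, p = wilcoxon_signed_rank(ratings)
    print(f"CMOS = {cmos(ratings):+.2f}, W+ = {w}, p = {p:.3f}")
```

Under this criterion, a p-value well above 0.05 means the test finds no statistically significant difference between the two sets of ratings; it does not by itself prove the two are identical.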