
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Mean opinion score; Leverage (statistics); Quality Score; Benchmark (surveying)

DOI: 10.48550/arxiv.2205.04421

Publication Date: 2022-01-01


AUTHORS (14)

Xu Tan

Jiawei Chen

Haohe Liu

Jian Cong

Chen Zhang

Yanqing Liu

Xi Wang

Yichong Leng

Yuanhao Yi

Lei He

Frank K. Soong

Tao Qin

Sheng Zhao

Tie‐Yan Liu

ABSTRACT

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experiment evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) to human recordings at the sentence level, with a Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
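The statistical criterion named in the abstract (a mean CMOS close to zero together with a Wilcoxon signed rank test that fails to reject the null hypothesis) can be illustrated with a short sketch. The code below is not the authors' evaluation script: the function names `average_ranks` and `wilcoxon_signed_rank` and the ratings are hypothetical, and the p-value uses the normal approximation with a tie correction rather than an exact test. It assumes each rating is one listener's comparative score for a synthesized sentence against the matching recording, with zero meaning no preference.

```python
import math
from collections import Counter
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(scores):
    """Two-sided Wilcoxon signed rank test of paired differences against zero.

    Zero differences are dropped. The p-value comes from the normal
    approximation with a tie correction, so it is only a rough guide
    for small samples. Returns (w_plus, w_minus, p_value).
    """
    diffs = [s for s in scores if s != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    abs_diffs = [abs(d) for d in diffs]
    ranks = average_ranks(abs_diffs)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean_w = n * (n + 1) / 4
    tie_term = sum(t ** 3 - t for t in Counter(abs_diffs).values()) / 48
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    z = (min(w_plus, w_minus) - mean_w) / math.sqrt(var_w)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2)))
    return w_plus, w_minus, p_value


# Hypothetical comparative ratings (synthesized minus recording), not data from the paper.
balanced = [1, -1, 0, 2, -2, 0, 1, -1, 0, 0]
shifted = [1, 2, 3, 1, 2, 3, 1, 2, 3, 2]

for name, ratings in (("balanced", balanced), ("shifted", shifted)):
    w_plus, w_minus, p = wilcoxon_signed_rank(ratings)
    print(f"{name}: CMOS={mean(ratings):+.2f} W+={w_plus} W-={w_minus} p={p:.4f}")
```

Under this reading, a system would be judged on par with recordings when the mean rating is near zero and the p-value stays well above 0.05, as in the balanced sample; the shifted sample shows the opposite case, where listeners consistently prefer one side.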
