Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining
DOI:
10.24963/ijcai.2023/575
Publication Date:
2023-08-11
AUTHORS (6)
ABSTRACT
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing the language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
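The abstract describes a two-stage recipe: masked language model pretraining on multilingual text-only data, then supervised TTS training on paired data with the language-aware embedding layer frozen, so that a language seen only during text pretraining can still be synthesized. Below is a minimal PyTorch sketch of that idea. It is not the authors' implementation; the module names, dimensions, mask token, and the toy mel decoder are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of two-stage training with a frozen
# language-aware embedding layer. All sizes and the [MASK] id are assumptions.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Shared text encoder with token and language-aware embeddings."""
    def __init__(self, vocab_size=256, num_langs=100, d_model=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(num_langs, d_model)  # language-aware embedding layer
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens, lang_id):
        # Add the language embedding to every token position.
        x = self.token_emb(tokens) + self.lang_emb(lang_id).unsqueeze(1)
        return self.encoder(x)

enc = TextEncoder()

# Stage 1: masked-LM pretraining on multilingual text-only data.
mlm_head = nn.Linear(128, 256)            # predicts identities of masked tokens
tokens = torch.randint(1, 256, (8, 32))   # stand-in batch of character ids
lang_id = torch.randint(0, 100, (8,))
masked = tokens.clone()
mask = torch.rand_like(tokens, dtype=torch.float) < 0.15
masked[mask] = 0                          # id 0 plays the role of [MASK] (assumption)
logits = mlm_head(enc(masked, lang_id))
mlm_loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
mlm_loss.backward()

# Stage 2: supervised TTS training on paired data. The language-aware
# embedding layer is frozen, so embeddings of languages seen only in
# text pretraining keep their pretrained values.
enc.lang_emb.weight.requires_grad_(False)
decoder = nn.Linear(128, 80)              # toy stand-in for a mel-spectrogram decoder
mel_target = torch.randn(8, 32, 80)
mel_pred = decoder(enc(tokens, lang_id))
tts_loss = nn.functional.l1_loss(mel_pred, mel_target)
tts_loss.backward()

# Zero-shot inference: pass the id of a language absent from the paired data
# but present in pretraining; its frozen, pretrained embedding is reused.
```

Freezing the language embedding is the key design choice in this sketch: if it were updated during supervised training, embeddings of languages without paired data would drift out of alignment with the rest of the encoder, breaking zero-shot synthesis.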