Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech
DOI: 10.21437/ssw.2021-17
Publication Date: 2021-08-24
AUTHORS (8)
ABSTRACT
Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (~10 hours) model using only 15 minutes of target speaker data, whereas with 30 minutes or more, we significantly outperform it. The following improvements are proposed: 1) changing from an autoregressive, attention-based TTS model to a non-autoregressive model, replacing attention with an external duration model, and 2) an additional Conditional Generative Adversarial Network (cGAN) based fine-tuning step.
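
The two proposed changes lend themselves to a short illustration. The sketch below is not the authors' implementation; it is a minimal PyTorch-style example with hypothetical module names, assuming 1) a non-autoregressive acoustic model in which attention is replaced by upsampling phoneme encodings with frame counts from an external duration model, and 2) a conditional discriminator of the kind used for cGAN fine-tuning, conditioned here on the same upsampled encodings.

import torch
import torch.nn as nn


def length_regulate(phoneme_enc: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding durations[i] times along the time axis.

    phoneme_enc: (num_phonemes, hidden) encodings
    durations:   (num_phonemes,) integer frame counts from an external duration model
    returns:     (total_frames, hidden) frame-level conditioning, no attention needed
    """
    return torch.repeat_interleave(phoneme_enc, durations, dim=0)


class NonAutoregressiveAcousticModel(nn.Module):
    """Phoneme encoder -> duration-based upsampling -> mel decoder, all frames in parallel."""

    def __init__(self, num_phonemes: int = 100, hidden_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.to_mel = nn.Linear(hidden_dim, n_mels)

    def forward(self, phonemes: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # phonemes: (num_phonemes,) IDs, durations: (num_phonemes,) frame counts
        enc, _ = self.encoder(self.embedding(phonemes).unsqueeze(0))   # (1, P, 2H)
        upsampled = length_regulate(enc.squeeze(0), durations)         # (T, 2H)
        dec, _ = self.decoder(upsampled.unsqueeze(0))                  # (1, T, H)
        return self.to_mel(dec)                                        # (1, T, n_mels)


class MelDiscriminator(nn.Module):
    """Conditional discriminator: scores (mel frame, phoneme condition) pairs as real/fake.

    In a cGAN fine-tuning step, an adversarial loss from a discriminator like this
    would be added to the usual reconstruction loss (an assumption of this sketch,
    not a description of the paper's exact objective).
    """

    def __init__(self, n_mels: int = 80, cond_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + cond_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, mel: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([mel, cond], dim=-1))  # per-frame score


# Toy usage: durations would normally come from the external duration model.
model = NonAutoregressiveAcousticModel()
phonemes = torch.randint(0, 100, (12,))    # 12 phoneme IDs
durations = torch.randint(1, 8, (12,))     # frames per phoneme
mel = model(phonemes, durations)           # (1, total_frames, 80), decoded in parallel

Because all mel frames are generated in parallel from the duration-expanded encodings, there is no attention mechanism to fail at inference time; the conditional discriminator only supplies an additional fine-tuning signal on top of the base model.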