Enhancing audio quality for expressive Neural Text-to-Speech
Autoencoder
Signal quality
Mean opinion score
Granularity
DOI:
10.48550/arxiv.2108.06270
Publication Date:
2021-01-01
AUTHORS (10)
ABSTRACT
Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with quality similar to that of human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures, since there seems to be a trade-off between the expressiveness of the generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data. The proposed techniques include: tuning the granularity of the autoregressive loop during training; using Generative Adversarial Networks for acoustic modelling; and using Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques greatly closed the gap in perceived quality between the baseline system and recordings by 39% in MUSHRA scores for an expressive celebrity voice.
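The abstract only names the techniques at a high level. As an illustrative aside, the "granularity" of an autoregressive acoustic model is commonly controlled by a reduction factor, i.e. how many mel-spectrogram frames the decoder emits per autoregressive step. The sketch below is a minimal, hypothetical PyTorch decoder step that exposes this as a tunable hyperparameter; the class name, dimensions, and GRU-based structure are assumptions for illustration, not details taken from the paper.

import torch
import torch.nn as nn

class CoarseDecoderStep(nn.Module):
    """One step of a hypothetical autoregressive mel decoder.

    reduction_factor controls the granularity of the autoregressive loop:
    the number of mel frames produced per decoder step.
    """

    def __init__(self, context_dim: int = 256, mel_dim: int = 80,
                 hidden_dim: int = 512, reduction_factor: int = 2):
        super().__init__()
        self.r = reduction_factor
        self.mel_dim = mel_dim
        self.rnn = nn.GRUCell(context_dim + mel_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, mel_dim * reduction_factor)

    def forward(self, context, prev_frame, hidden):
        # Condition on the attention context and the last emitted frame,
        # then predict r frames at once; larger r gives a coarser loop.
        hidden = self.rnn(torch.cat([context, prev_frame], dim=-1), hidden)
        frames = self.proj(hidden).view(-1, self.r, self.mel_dim)
        return frames, hidden

# Usage sketch: r consecutive frames come from a single state update, and the
# last of them would be fed back as prev_frame at the next decoder step.
step = CoarseDecoderStep(reduction_factor=2)
context = torch.zeros(1, 256)
prev_frame = torch.zeros(1, 80)
hidden = torch.zeros(1, 512)
frames, hidden = step(context, prev_frame, hidden)  # frames: (1, 2, 80)

A coarser loop (larger reduction factor) shortens the feedback chain and tends to stabilise training, while a finer loop gives the model more temporal resolution per step; tuning this trade-off is one plausible reading of the "granularity" knob mentioned in the abstract.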