Enhancing audio quality for expressive Neural Text-to-Speech
Autoencoder
Signal quality
Mean opinion score
Granularity
DOI:
10.48550/arxiv.2108.06270
Publication Date:
2021-01-01
AUTHORS (10)
ABSTRACT
Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with quality similar to that of human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures, since there seems to be a trade-off between the expressiveness of the generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data. The proposed techniques include: tuning the granularity of the autoregressive loop during training; using Generative Adversarial Networks for acoustic modelling; and using Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques greatly closed the gap in perceived quality between the baseline system and recordings by 39% in MUSHRA scores for an expressive celebrity voice.
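The abstract only names the techniques at a high level. As an illustrative aside, the "granularity" of an autoregressive acoustic model is commonly controlled by a reduction factor, i.e. how many mel-spectrogram frames the decoder emits per autoregressive step. The sketch below is a minimal, hypothetical PyTorch decoder step that exposes this as a tunable hyperparameter; the class name, dimensions, and GRU-based structure are assumptions for illustration, not details taken from the paper.

import torch
import torch.nn as nn

class CoarseDecoderStep(nn.Module):
    """One step of a hypothetical autoregressive mel decoder.

    reduction_factor controls the granularity of the autoregressive loop:
    the number of mel frames produced per decoder step.
    """

    def __init__(self, context_dim: int = 256, mel_dim: int = 80,
                 hidden_dim: int = 512, reduction_factor: int = 2):
        super().__init__()
        self.r = reduction_factor
        self.mel_dim = mel_dim
        self.rnn = nn.GRUCell(context_dim + mel_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, mel_dim * reduction_factor)

    def forward(self, context, prev_frame, hidden):
        # Condition on the attention context and the last emitted frame,
        # then predict r frames at once; larger r gives a coarser loop.
        hidden = self.rnn(torch.cat([context, prev_frame], dim=-1), hidden)
        frames = self.proj(hidden).view(-1, self.r, self.mel_dim)
        return frames, hidden

# Usage sketch: r consecutive frames come from a single state update, and the
# last of them would be fed back as prev_frame at the next decoder step.
step = CoarseDecoderStep(reduction_factor=2)
context = torch.zeros(1, 256)
prev_frame = torch.zeros(1, 80)
hidden = torch.zeros(1, 512)
frames, hidden = step(context, prev_frame, hidden)  # frames: (1, 2, 80)

A coarser loop (larger reduction factor) shortens the feedback chain and tends to stabilise training, while a finer loop gives the model more temporal resolution per step; tuning this trade-off is one plausible reading of the "granularity" knob mentioned in the abstract.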