NFDI4DS | UHH-SEMS - Publication Details

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Signal Processing (eess.SP) FOS: Computer and information sciences Sound (cs.SD) Audio and Speech Processing (eess.AS) FOS: Electrical engineering, electronic engineering, information engineering 0202 electrical engineering, electronic engineering, information engineering 02 engineering and technology Electrical Engineering and Systems Science - Signal Processing Computer Science - Sound Electrical Engineering and Systems Science - Audio and Speech Processing

DOI: 10.21437/interspeech.2022-277 Publication Date: 2022-09-16T15:42:06Z

Abstract Supplemental Material References Cited by

AUTHORS (5)

Yanqing Liu

Ruiqing Xue

Lei He

Xu Tan

Sheng Zhao

ABSTRACT

Current text to speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representations, which suffer from two limitations: 1) the acoustic model and vocoder are separately trained instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrogram) are pre-designed and lose phase information, which are sub-optimal. To solve these problems, in this paper, we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like mel-spectrograms) and reconstruct speech waveform; 2) we jointly optimize the acoustic model (based on DelightfulTTS) and the vocoder (the decoder of VQ-GAN), with an auxiliary loss on the acoustic model to predict intermediate speech representations. Experiments show that DelightfulTTS 2 achieves a CMOS gain +0.14 over DelightfulTTS, and more method analyses further verify the effectiveness of the developed system.<br/>To appear in Interspeech 2022<br/>

SUPPLEMENTAL MATERIAL

Coming soon ....

REFERENCES (0)

CITATIONS (13)

EXTERNAL LINKS

OPENAIRE - Products CROSSREF - Publications

PlumX Metrics

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

RECOMMENDATIONS

FAIR ASSESSMENT

Coming soon ....

JUPYTER LAB

Coming soon ....