NFDI4DS | UHH-SEMS - Publication Details

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

FOS: Computer and information sciences; Sound (cs.SD); Audio and Speech Processing (eess.AS); FOS: Electrical engineering, electronic engineering, information engineering; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing

DOI: 10.48550/arxiv.2406.02430

Publication Date: 2024-01-01

AUTHORS (46)

Anastassiou, Philip

Chen, Jiawei

Chen, Jitong

Chen, Yuanzhe

Chen, Zhuo

Chen, Ziyi

Cong, Jian

Deng, Lelai

Ding, Chuang

Gao, Lu

Gong, Mingqing

Huang, Peisong

Huang, Qingqing

Huang, Zhiying

Huo, Yuanyuan

Jia, Dongya

Li, Chumin

Li, Feiya

Li, Hui

Li, Jiaxin

Li, Xiaoyang

Li, Xingxing

Liu, Lin

Liu, Shouda

Liu, Sichao

Liu, Xudong

Liu, Yuchen

Liu, Zhengxi

Lu, Lu

Pan, Junjie

Wang, Xin

Wang, Yuping

Wang, Yuxuan

Wei, Zhen

Wu, Jian

Yao, Chao

Yang, Yifeng

Yi, Yuanhao

Zhang, Junteng

Zhang, Qidi

Zhang, Shuo

Zhang, Wenjie

Zhang, Yang

Zhao, Zilin

Zhong, Dejian

Zhuang, Xiaobin

ABSTRACT

We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.
