ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering

DOI: 10.48550/arxiv.2401.07333 Publication Date: 2024-01-01
ABSTRACT
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the output synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with the autoregressive (AR) model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms VALL-E in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies. The code will be open-sourced after cleanups. Audio samples are available at https://ereboas.github.io/ELLAV/.
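As a rough illustration of the interleaving idea described in the abstract, the following Python sketch places each phoneme token immediately ahead of the acoustic (codec) tokens aligned to it. This is a minimal sketch under assumed inputs, not the paper's implementation: the `alignment` structure, the `interleave` helper, and the token string formats are all hypothetical.

```python
# Illustrative sketch (hypothetical names, not the authors' released code):
# given a phoneme-level alignment, build an interleaved token sequence in
# which each phoneme token precedes the acoustic tokens aligned to it.

from typing import List, Tuple

def interleave(alignment: List[Tuple[str, List[int]]]) -> List[str]:
    """alignment: list of (phoneme, acoustic_token_ids) pairs in time order."""
    sequence: List[str] = []
    for phoneme, acoustic_ids in alignment:
        sequence.append(f"<ph:{phoneme}>")                   # phoneme token first
        sequence.extend(f"<ac:{a}>" for a in acoustic_ids)   # then its acoustic tokens
    return sequence

# Toy example: two phonemes, each aligned to a few codec token ids.
print(interleave([("HH", [12, 7]), ("AH", [3, 3, 9])]))
# ['<ph:HH>', '<ac:12>', '<ac:7>', '<ph:AH>', '<ac:3>', '<ac:3>', '<ac:9>']
```

Ordering the sequence this way gives the AR decoder an explicit, local alignment cue before it emits the acoustic tokens for each phoneme, which is the stability mechanism the abstract attributes to ELLA-V.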