ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering

DOI: 10.48550/arxiv.2401.07333 Publication Date: 2024-01-01
ABSTRACT
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the output synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with the autoregressive (AR) model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms VALL-E in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies. The code will be open-sourced after cleanups. Audio samples are available at https://ereboas.github.io/ELLAV/.
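As a rough illustration of the interleaving idea described in the abstract, the following Python sketch places each phoneme token immediately ahead of the acoustic (codec) tokens aligned to it. This is a minimal sketch under assumed inputs, not the paper's implementation: the `alignment` structure, the `interleave` helper, and the token string formats are all hypothetical.

```python
# Illustrative sketch (hypothetical names, not the authors' released code):
# given a phoneme-level alignment, build an interleaved token sequence in
# which each phoneme token precedes the acoustic tokens aligned to it.

from typing import List, Tuple

def interleave(alignment: List[Tuple[str, List[int]]]) -> List[str]:
    """alignment: list of (phoneme, acoustic_token_ids) pairs in time order."""
    sequence: List[str] = []
    for phoneme, acoustic_ids in alignment:
        sequence.append(f"<ph:{phoneme}>")                   # phoneme token first
        sequence.extend(f"<ac:{a}>" for a in acoustic_ids)   # then its acoustic tokens
    return sequence

# Toy example: two phonemes, each aligned to a few codec token ids.
print(interleave([("HH", [12, 7]), ("AH", [3, 3, 9])]))
# ['<ph:HH>', '<ac:12>', '<ac:7>', '<ph:AH>', '<ac:3>', '<ac:3>', '<ac:9>']
```

Ordering the sequence this way gives the AR decoder an explicit, local alignment cue before it emits the acoustic tokens for each phoneme, which is the stability mechanism the abstract attributes to ELLA-V.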