SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

DOI: 10.48550/arxiv.2502.13128 Publication Date: 2025-02-18
ABSTRACT
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/, and the code will be available at https://github.com/LiuZH-19/SongGen.
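The abstract mentions token pattern strategies for the dual-track mode without spelling them out. As a purely illustrative sketch, and not the authors' implementation, the snippet below shows two generic ways that two token tracks (vocals and accompaniment) could be arranged for a single auto-regressive decoder: frame-level interleaving into one sequence, and parallel stacking of one token pair per step. The function names and the integer token values are hypothetical.

```python
from typing import List, Tuple


def interleave_tracks(vocal: List[int], accomp: List[int]) -> List[int]:
    """Merge two equal-length token tracks as v0, a0, v1, a1, ...

    Assumption: both tracks have one token per audio frame, so the merged
    sequence is twice as long as either track.
    """
    if len(vocal) != len(accomp):
        raise ValueError("tracks must have the same number of frames")
    merged: List[int] = []
    for v, a in zip(vocal, accomp):
        merged.extend((v, a))
    return merged


def split_tracks(merged: List[int]) -> Tuple[List[int], List[int]]:
    """Invert interleave_tracks: even positions are vocal, odd are accompaniment."""
    if len(merged) % 2 != 0:
        raise ValueError("interleaved sequence must have even length")
    return merged[0::2], merged[1::2]


def stack_tracks(vocal: List[int], accomp: List[int]) -> List[Tuple[int, int]]:
    """Parallel arrangement: one decoding step per frame, carrying both tokens."""
    if len(vocal) != len(accomp):
        raise ValueError("tracks must have the same number of frames")
    return list(zip(vocal, accomp))


if __name__ == "__main__":
    vocal_tokens = [1, 2, 3]        # hypothetical vocal codec tokens
    accomp_tokens = [10, 20, 30]    # hypothetical accompaniment codec tokens
    print(interleave_tracks(vocal_tokens, accomp_tokens))
    print(stack_tracks(vocal_tokens, accomp_tokens))
```

In this sketch, interleaving keeps a single token stream at the cost of doubling the sequence length, while stacking keeps the sequence length equal to the number of frames but requires the decoder to emit two tokens per step. Which patterns SongGen actually uses, and how they compare, is described in the paper itself.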