Soundwave: Less is More for Speech-Text Alignment in LLMs
FOS: Computer and information sciences
Sound (cs.SD)
Computer Science - Computation and Language
Artificial Intelligence (cs.AI)
Computer Science - Artificial Intelligence
Computation and Language (cs.CL)
Computer Science - Sound
DOI:
10.48550/arxiv.2502.12900
Publication Date:
2025-02-18
AUTHORS (6)
ABSTRACT
Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.