NFDI4DS | UHH-SEMS - Publication Details

Soundwave: Less is More for Speech-Text Alignment in LLMs

FOS: Computer and information sciences; Sound (cs.SD); Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

DOI: 10.48550/arxiv.2502.12900

Publication Date: 2025-02-18

AUTHORS (6)

Yuhao Zhang

Zhi‐Heng Liu

Fan Bu

Ruiyu Zhang

Benyou Wang

Haizhou Li

ABSTRACT

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.
