NFDI4DS | UHH-SEMS - Publication Details

AudioPaLM: A Large Language Model That Can Speak and Listen

FOS: Computer and information sciences; Sound (cs.SD); Computer Science - Computation and Language; Artificial Intelligence (cs.AI); Computer Science - Artificial Intelligence; Statistics - Machine Learning; Audio and Speech Processing (eess.AS); FOS: Electrical engineering, electronic engineering, information engineering; Machine Learning (stat.ML); Computation and Language (cs.CL); Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing

DOI: 10.48550/arxiv.2306.12925

Publication Date: 2023-01-01

AUTHORS (30)

Rubenstein, Paul K.

Asawaroengchai, C...

Nguyen, Duc Dung

Bapna, Ankur

Borsos, Zalán

Quitry, Félix de ...

Chen, Peter

Badawy, Dalia El

Han, Wei

Kharitonov, Eugene

Muckenhirn, Hannah

Padfield, Dirk

Qin, James

Rozenberg, Danny

Sainath, Tara

Schalkwyk, Johan

Sharifi, Matt

Ramanovich, Miche...

Tagliasacchi, Marco

Tudor, Alexandru

Velimirović, Mihajlo

Vincent, Damien

Yu, Jiahui

Wang, Yongqiang

Zayats, Vicky

Zeghidour, Neil

Zhang, Yu

Zhang, Zhishuai

Zilka, Lukas

Frank, Christian

ABSTRACT

Technical report

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
