ASR2K: Speech Recognition for Around 2000 Languages without Audio

Audio mining

DOI: 10.21437/interspeech.2022-10712
Publication Date: 2022-09-16T15:42:06Z

AUTHORS (5)

Xinjian Li

Florian Metze

David R. Mortensen

Alan W Black

Shinji Watanabe

ABSTRACT

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models are multilingual models that require no supervision in the target language. The language model is built from either n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only, and improve these to 45% CER and 69% WER when using 10000 raw text utterances.

INTERSPEECH 2022
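The abstract states that the language model can be built either from raw text or from n-gram statistics alone. The sketch below illustrates why those two routes can be interchangeable for a count-based model: a bigram model only ever consults count tables, so counts collected from raw text and precomputed counts (of the kind an n-gram database such as Crubadan supplies) yield the same model. This is a minimal illustration under stated assumptions, not the authors' implementation: the word-level bigram order, add-one smoothing, the helper names `counts_from_text` and `BigramLM`, and the toy counts are all hypothetical, and neither the paper's actual language model nor the Crubadan file format is reproduced here.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence-boundary markers


def counts_from_text(lines):
    """Path 1 (hypothetical helper): unigram and bigram counts from raw text, one utterance per line."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        tokens = [BOS] + line.lower().split() + [EOS]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


class BigramLM:
    """Hypothetical add-one smoothed bigram LM that needs only count tables, never raw text."""

    def __init__(self, unigrams, bigrams):
        self.unigrams = unigrams
        self.bigrams = bigrams
        # Vocabulary size includes the boundary markers (a simplification).
        self.vocab_size = len(unigrams)

    def logprob(self, prev, word):
        # P(word | prev) = (c(prev, word) + 1) / (c(prev) + V).
        # c(prev) is the unigram count; it equals prev's count as a history
        # because every token except </s> is followed by another token.
        num = self.bigrams[(prev, word)] + 1
        den = self.unigrams[prev] + self.vocab_size
        return math.log(num / den)

    def score(self, sentence):
        """Total log-probability of a sentence, including boundary transitions."""
        tokens = [BOS] + sentence.lower().split() + [EOS]
        return sum(self.logprob(p, w) for p, w in zip(tokens, tokens[1:]))


# Path 2: toy precomputed statistics standing in for an n-gram database (hypothetical values).
stats_unigrams = Counter({BOS: 2, "the": 2, "cat": 1, "dog": 1, "sat": 2, EOS: 2})
stats_bigrams = Counter({
    (BOS, "the"): 2, ("the", "cat"): 1, ("the", "dog"): 1,
    ("cat", "sat"): 1, ("dog", "sat"): 1, ("sat", EOS): 2,
})

text_lm = BigramLM(*counts_from_text(["the cat sat", "the dog sat"]))
stats_lm = BigramLM(stats_unigrams, stats_bigrams)

# Rank competing word sequences, as the language-model component of a decoder would.
candidates = ["sat the cat", "the cat sat"]
print(max(candidates, key=stats_lm.score))  # prints "the cat sat"
```

Because the toy statistics equal the counts extracted from the two example sentences, `text_lm` and `stats_lm` assign identical scores. In a pipeline like the one described, such a model would rank word sequences proposed by the acoustic and pronunciation models; the `max(..., key=stats_lm.score)` call stands in for that ranking step only.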

CITATIONS (8)
