NFDI4DS | UHH-SEMS - Publication Details

CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese

Keywords: Word error rate; Brazilian Portuguese

DOI: 10.48550/arxiv.2110.15731
Publication Date: 2021-01-01


AUTHORS (11)

Arnaldo Cândido

Edresson Casanova

Anderson da Silva...

Frederico Santos ...

Lucas Silva de Ol...

Ricardo Corso Fer...

Daniel Peixoto Pi...

Fernando Gorgulho...

Bruno Baldissera ...

Lucas Rafael Stef...

Sandra Maria Aluísio

ABSTRACT

Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were about 376 hours publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which are essential in different ASR applications. This paper presents CORAA (Corpus of Annotated Audios) v1. with 290.77 hours, a publicly available dataset for ASR in BP containing validated pairs (audio-transcription). CORAA also contains European Portuguese audios (4.69 hours). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53 and fine-tuned over CORAA. Our model achieved a Word Error Rate of 24.18% on the CORAA test set and 20.08% on the Common Voice test set. When measuring the Character Error Rate, we obtained 11.02% and 6.34% for CORAA and Common Voice, respectively. The CORAA corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license.
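For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how Word Error Rate and Character Error Rate are commonly computed: the edit (Levenshtein) distance between reference and hypothesis, divided by the reference length. It is an illustration only, not the authors' evaluation script. The function names, the whitespace tokenization, the choice to count spaces as characters, and the example sentences are all assumptions; published figures also depend on text normalization that is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Assumption: spaces count as characters; some toolkits strip them first.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example pair (not taken from CORAA): one accent is missed.
reference = "o gato está em casa"
hypothesis = "o gato esta em casa"
print(f"WER = {wer(reference, hypothesis):.2%}")
print(f"CER = {cer(reference, hypothesis):.2%}")
```

In this example a single missed accent counts as one wrong word out of five but only one wrong character out of nineteen, which is consistent with the CER values in the abstract being lower than the WER values on the same test sets.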
