CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese
Word error rate
Brazilian Portuguese
Transcription
DOI:
10.1007/s10579-022-09621-4
Publication Date:
2022-11-22T23:04:40Z
AUTHORS (11)
ABSTRACT
Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were around 376 h publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 h. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which is essential for several ASR applications. This paper presents CORAA (Corpus of Annotated Audios) with 290 h, a publicly available dataset for ASR in BP containing validated pairs of audio-transcription. CORAA also contains European Portuguese audios (4.6 h). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53, fine-tuned over CORAA ASR. Our model achieved a Word Error Rate (WER) of 24.18% on the CORAA ASR test set and 20.08% on the Common Voice test set. When measuring the Character Error Rate (CER), we obtained 11.02% and 6.34% for CORAA ASR and Common Voice, respectively. The CORAA corpora were assembled to both improve ASR models in BP with phenomena from spontaneous speech and motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license.
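The WER and CER figures quoted in the abstract are edit-distance metrics: the number of substitutions, deletions, and insertions needed to turn the model output into the reference transcription, divided by the reference length in words (WER) or characters (CER). The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function names and the example sentences are hypothetical, and the sketch assumes whitespace tokenization, no text normalization, and that spaces count as characters for CER; the paper's actual preprocessing may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra token in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.

    Assumption: spaces are counted as characters.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example pair, not taken from the corpus.
    reference = "o gato dorme no sofá"
    hypothesis = "o gato dorme sofá"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Because insertions are counted, both rates can exceed 100% when the hypothesis is much longer than the reference. Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances rather than averaging per-utterance rates.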
REFERENCES (39)
CITATIONS (5)