CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese
Word error rate
Brazilian Portuguese
DOI: 10.48550/arxiv.2110.15731
Publication Date: 2021-01-01
AUTHORS (11)
ABSTRACT
Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were about 376 hours publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which are essential in different ASR applications. This paper presents CORAA (Corpus of Annotated Audios) v1, with 290.77 hours, a publicly available dataset for ASR in BP containing validated pairs (audio-transcription). CORAA also contains European Portuguese audios (4.69 hours). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53 and fine-tuned over CORAA. Our model achieved a Word Error Rate of 24.18% on the CORAA test set and 20.08% on the Common Voice test set. When measuring the Character Error Rate, we obtained 11.02% and 6.34% for CORAA and Common Voice, respectively. The CORAA corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license.
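The abstract reports results as Word Error Rate (WER) and Character Error Rate (CER). As a reminder of what these metrics measure, here is a minimal sketch using the standard definition: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. This is not the authors' evaluation code; the function names and the example sentence are illustrative assumptions, and any text normalization the paper applies before scoring is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn `ref` into `hyp`. Works on lists of words or on strings.
    """
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the recognizer drops one word out of five.
reference = "o gato subiu no telhado"
hypothesis = "o gato subiu telhado"
print(f"WER = {wer(reference, hypothesis):.2%}")
print(f"CER = {cer(reference, hypothesis):.2%}")
```

In this made-up example one of the five reference words is deleted, so the WER is 1/5, while the CER is 3/23 because only three characters (the word and one space) are missing. This illustrates why CER figures are typically lower than WER figures for the same output, as in the results quoted above.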