Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages
Transformers
Electronic computers. Computer science
Multilingual
Pretraining
Language Modelling
QA75.5-76.95
0102 computer and information sciences
01 natural sciences
Low-Resource Languages
DOI:
10.3897/jucs.118889
Publication Date:
2024-12-20T15:51:42Z
AUTHORS (3)
ABSTRACT
To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objectives, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we initially pretrain models from scratch on a specific language using a variety of configurations, and then incrementally add related languages to explore the effect of additional data on the performance of these models. We demonstrate that smaller data volumes can be leveraged effectively and that the choice of pretraining objective significantly influences performance. Our monolingual models exhibit competitive, and in some cases superior, performance compared to the established XLM-R-base and AfroXLM-R-base.
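To make the described setup concrete, the following is a minimal sketch of pretraining a small masked language model from scratch on a monolingual corpus and optionally extending the training data with a related language. It is not the authors' code: the model size, file paths (e.g. "./zulu-tokenizer", "zulu_corpus.txt"), and hyperparameters are illustrative assumptions using the Hugging Face Transformers and Datasets libraries.

```python
# Hypothetical sketch: pretraining a compact masked language model (MLM) from
# scratch on a low-resource corpus. Paths and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed: a tokenizer previously trained on the target-language corpus.
tokenizer = RobertaTokenizerFast.from_pretrained("./zulu-tokenizer")

# A deliberately small configuration suited to limited training data.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = RobertaForMaskedLM(config)

# Monolingual corpus; a related language (e.g. a Xhosa text file) could be
# appended to data_files to study the effect of additional multilingual data.
dataset = load_dataset("text", data_files={"train": "zulu_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Masked-language-modelling objective; the masking probability is one of the
# pretraining-objective choices that could be varied across configurations.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./zulu-mlm",
        num_train_epochs=3,
        per_device_train_batch_size=32,
    ),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```

The same loop can be rerun with different model configurations or with related-language data added to the training files, which mirrors the incremental, configuration-by-configuration comparison the abstract describes.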