Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages

Keywords: Transformers, Multilingual Pretraining, Language Modelling, Low-Resource Languages
DOI: 10.3897/jucs.118889 Publication Date: 2024-12-20T15:51:42Z
ABSTRACT
To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objectives, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we initially pretrain models from scratch on specific languages using a variety of configurations, then incrementally add related languages to explore the effect of additional data on the performance of these models. We demonstrate that smaller data volumes can be effectively leveraged and that the choice of pretraining objective significantly influences performance. Our monolingual models exhibit competitive, and in some cases superior, performance compared to the established XLM-R-base and AfroXLM-R-base models.
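To make the described setup concrete, the following is a minimal sketch (not the authors' code) of pretraining a small masked language model from scratch on a single low-resource language, with an optional mix-in of a related language. It assumes the Hugging Face transformers and datasets libraries; the file paths, tokenizer directory, model size, and hyperparameters are illustrative placeholders, and other pretraining objectives would swap out the model head and data collator.

```python
from datasets import load_dataset
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Monolingual corpus (e.g. Zulu); optionally append a related language (e.g. Xhosa)
# to study the effect of multilingual dataset composition.
data_files = ["zulu_corpus.txt"]        # hypothetical path
data_files += ["xhosa_corpus.txt"]      # remove for the purely monolingual run
dataset = load_dataset("text", data_files=data_files, split="train")

# Tokenizer previously trained on the same corpus (hypothetical local directory).
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Deliberately small configuration, reflecting the resource-constrained setting.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256, num_hidden_layers=6, num_attention_heads=4,
    intermediate_size=1024, max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Masked-language-modelling objective with standard 15% masking.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./mlm-zulu",
        per_device_train_batch_size=32,
        num_train_epochs=3,
        learning_rate=1e-4,
    ),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```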