Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages

Keywords: Transformers, Multilingual Pretraining, Language Modelling, Low-Resource Languages
DOI: 10.3897/jucs.118889 Publication Date: 2024-12-20T15:51:42Z
ABSTRACT
To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objectives, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we initially pretrain models from scratch on specific languages using a variety of configurations, then incrementally add related languages to explore the effect of additional data on the performance of these models. We demonstrate that smaller data volumes can be effectively leveraged and that the choice of pretraining objective significantly influences performance. Our monolingual models exhibit competitive, and in some cases superior, performance compared to the established XLM-R-base and AfroXLM-R-base models.
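To make the described setup concrete, the following is a minimal sketch (not the authors' code) of pretraining a small masked language model from scratch on a single low-resource language, with an optional mix-in of a related language. It assumes the Hugging Face transformers and datasets libraries; the file paths, tokenizer directory, model size, and hyperparameters are illustrative placeholders, and other pretraining objectives would swap out the model head and data collator.

```python
from datasets import load_dataset
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Monolingual corpus (e.g. Zulu); optionally append a related language (e.g. Xhosa)
# to study the effect of multilingual dataset composition.
data_files = ["zulu_corpus.txt"]        # hypothetical path
data_files += ["xhosa_corpus.txt"]      # remove for the purely monolingual run
dataset = load_dataset("text", data_files=data_files, split="train")

# Tokenizer previously trained on the same corpus (hypothetical local directory).
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Deliberately small configuration, reflecting the resource-constrained setting.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256, num_hidden_layers=6, num_attention_heads=4,
    intermediate_size=1024, max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Masked-language-modelling objective with standard 15% masking.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./mlm-zulu",
        per_device_train_batch_size=32,
        num_train_epochs=3,
        learning_rate=1e-4,
    ),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```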