Multilingual Language Model Pretraining using Machine-translated Data
FOS: Computer and information sciences
Computer Science - Computation and Language
Computation and Language (cs.CL)
DOI:
10.48550/arxiv.2502.13252
Publication Date:
2025-02-18
AUTHORS (8)
ABSTRACT
High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual corpora. In this work, we find that machine-translated texts from a single source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, an English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of domain-specific data sets a new state of the art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.
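CODE SKETCH
A minimal, illustrative sketch of the corpus-translation step summarized above: streaming English web documents and translating them into one target language with an off-the-shelf machine-translation model. The dataset configuration ("sample-10BT"), the MT model (NLLB-200), and the target language below are assumptions for illustration only; the abstract does not specify which translation system or settings the authors used.

# Sketch only: translate a few FineWeb-Edu documents into a target language.
# The dataset config, MT model, and language code are assumptions, not the
# authors' actual pipeline.
from datasets import load_dataset
from transformers import pipeline

# Stream a small English web-text sample (assumed config name).
dataset = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
)

# Generic multilingual MT model as a stand-in for the paper's translator.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ita_Latn",  # e.g. Italian, one of several target languages
    max_length=512,
)

# Translate a handful of documents; a real run would shard and batch at scale.
for i, example in enumerate(dataset):
    if i >= 3:
        break
    translated = translator(example["text"][:1000])[0]["translation_text"]
    print(translated[:200])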