Multilingual Language Model Pretraining using Machine-translated Data
FOS: Computer and information sciences
Computer Science - Computation and Language
Computation and Language (cs.CL)
DOI:
10.48550/arxiv.2502.13252
Publication Date:
2025-02-18
AUTHORS (8)
ABSTRACT
High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual corpora. In this work, we find that machine-translated texts from a single source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, an English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of domain-specific data sets a new state of the art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.
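CODE SKETCH
A minimal, illustrative sketch of the corpus-translation step summarized above: streaming English web documents and translating them into one target language with an off-the-shelf machine-translation model. The dataset configuration ("sample-10BT"), the MT model (NLLB-200), and the target language below are assumptions for illustration only; the abstract does not specify which translation system or settings the authors used.

# Sketch only: translate a few FineWeb-Edu documents into a target language.
# The dataset config, MT model, and language code are assumptions, not the
# authors' actual pipeline.
from datasets import load_dataset
from transformers import pipeline

# Stream a small English web-text sample (assumed config name).
dataset = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
)

# Generic multilingual MT model as a stand-in for the paper's translator.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ita_Latn",  # e.g. Italian, one of several target languages
    max_length=512,
)

# Translate a handful of documents; a real run would shard and batch at scale.
for i, example in enumerate(dataset):
    if i >= 3:
        break
    translated = translator(example["text"][:1000])[0]["translation_text"]
    print(translated[:200])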