Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

FOS: Computer and information sciences; Machine Learning (cs.LG); Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
DOI: 10.48550/arxiv.2502.12982 Publication Date: 2025-02-18
ABSTRACT
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the model in an efficient manner, including five key aspects: data curation, pre-training, post-training, customization, and evaluation. We hope that Sailor2 (Apache 2.0 license) will drive language development in the SEA region, and that the cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
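The token budget quoted in the abstract implies a simple data-mixture ratio. The short Python sketch below only works through that arithmetic; the function name `mixture_fractions` is hypothetical, and how the two pools are actually interleaved during training is not stated in this record.

```python
# Token budget as stated in the abstract, in billions of tokens.
SEA_TOKENS_B = 400     # SEA-specific tokens
REPLAY_TOKENS_B = 100  # replay tokens


def mixture_fractions(sea_b: float, replay_b: float) -> dict:
    """Return the total budget and each pool's share of it (illustrative helper)."""
    total = sea_b + replay_b
    return {
        "total_b": total,
        "sea": sea_b / total,
        "replay": replay_b / total,
    }


budget = mixture_fractions(SEA_TOKENS_B, REPLAY_TOKENS_B)
print(budget)
```

Under these figures, SEA-specific data accounts for four fifths of the continued pre-training budget and replay data for the remaining fifth.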