Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
FOS: Computer and information sciences
Artificial Intelligence (cs.AI)
Computation and Language (cs.CL)
Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2502.12982
Publication Date: 2025-02-18
AUTHORS (41)
ABSTRACT
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop multilingual models in an efficient manner, covering five key aspects: data curation, pre-training, post-training, model customization, and evaluation. We hope that the Sailor2 models (Apache 2.0 license) will drive language development in the SEA region, and that the Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
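The continual pre-training recipe mixes 400B SEA-specific tokens with 100B replay tokens from the base model's original distribution to avoid forgetting Chinese and English. The sketch below is purely illustrative of such a replay mixture: the source names, budgets, and sampler are hypothetical, not the authors' training code.

```python
# Illustrative sketch only: sampling data sources in proportion to the
# 400B SEA-specific / 100B replay token split described in the abstract.
# Names and the sampling scheme are assumptions, not the Sailor2 pipeline.
import random

# Hypothetical token budgets, in billions of tokens.
MIXTURE = {
    "sea_specific": 400,  # new data covering the 13 SEA languages
    "replay": 100,        # Chinese/English replay to retain base-model proficiency
}


def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its token budget."""
    total = sum(MIXTURE.values())
    r = rng.uniform(0, total)
    cumulative = 0.0
    for name, budget in MIXTURE.items():
        cumulative += budget
        if r <= cumulative:
            return name
    return name  # fallback for floating-point edge cases


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_source(rng) for _ in range(10_000)]
    # Roughly 80% of draws should come from SEA-specific data, 20% from replay.
    print({name: draws.count(name) / len(draws) for name in MIXTURE})
```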