SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Benchmark (surveying)
DOI: 10.48550/arxiv.2406.10118
Publication Date: 2024-06-14
ABSTRACT
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.