MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

DOI: 10.1162/tacl_a_00595 Publication Date: 2023-09-06T14:08:06Z
ABSTRACT
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close as well as distant from 10 language families and 13 sub-families, associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.