Multi-SimLex: A Large-Scale Evaluation of Multilingual and Crosslingual Lexical Semantic Similarity

Similarity (geometry) Concreteness Lexical analysis
DOI: 10.1162/coli_a_00391 Publication Date: 2020-10-22T19:51:46Z
ABSTRACT
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex–style resources for additional languages. We make these contributions (the public release of the Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant advances in NLP across languages.
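As a rough illustration of how a similarity benchmark of this kind is typically used, the sketch below scores a set of word vectors against human-rated word pairs. It assumes the conventional setup for such benchmarks (cosine similarity between vectors, compared with human ratings via Spearman's rank correlation); the abstract itself does not name the metric. All function names, the toy vectors, and the toy ratings are hypothetical and are not taken from the released data. The last helper shows where the figure of 66 crosslingual sets comes from: it is the number of unordered pairs among 12 languages.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def evaluate(pairs, vectors):
    """Correlate model similarities with human scores.

    pairs: list of (word1, word2, human_score) tuples.
    Pairs with an out-of-vocabulary word are skipped, which is a
    simplification assumed here, not a documented protocol.
    """
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            pred.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, pred)


def crosslingual_pair_count(n_languages):
    """Number of unordered language pairs among n_languages."""
    return len(list(combinations(range(n_languages), 2)))


# Toy data, invented purely for illustration.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
toy_pairs = [("a", "b", 5.0), ("a", "c", 0.5), ("b", "c", 1.0)]

rho = evaluate(toy_pairs, toy_vectors)
n_pairs = crosslingual_pair_count(12)
```

In the same spirit, a crosslingual evaluation would look up the two words of each pair in a shared crosslingual vector space rather than in a single monolingual one; the scoring step stays the same.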