Improving the state-of-the-art in Thai semantic similarity using distributional semantics and ontological information
SemEval
Similarity (geometry)
Distributional semantics
DOI: 10.1371/journal.pone.0246751
Publication Date: 2021-02-17T19:58:25Z
AUTHORS (4)
ABSTRACT
Research into semantic similarity has a long history in lexical semantics, and it has applications in many natural language processing (NLP) tasks like word sense disambiguation or machine translation. The task of calculating semantic similarity is usually presented in the form of datasets which contain word pairs and a human-assigned similarity score. Algorithms are then evaluated by their ability to approximate the gold standard similarity scores. Many such datasets, with different characteristics, have been created for the English language. Recently, four of those were transformed to Thai language versions, namely WordSim-353, SimLex-999, SemEval-2017-500, and R&G-65. Given those four datasets, in this work we aim to improve the previous baseline evaluations for Thai semantic similarity and solve the challenges of unsegmented Asian languages (particularly the high fraction of out-of-vocabulary (OOV) dataset terms). To this end we apply and integrate different strategies to compute similarity, including traditional word-level embeddings, subword-unit embeddings, and ontological or hybrid sources like WordNet and ConceptNet. With our best model, which combines self-trained fastText subword embeddings with ConceptNet Numberbatch, we managed to raise the state-of-the-art, measured with the harmonic mean of Pearson's r and Spearman's ρ, by a large margin: from 0.356 to 0.688 for TH-WordSim-353, from 0.286 to 0.769 for TH-SemEval-500, from 0.397 to 0.717 for TH-SimLex-999, and from 0.505 to 0.901 for TWS-65.
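The evaluation metric named in the abstract, the harmonic mean of Pearson's r and Spearman's ρ between model scores and gold-standard scores, can be illustrated with a minimal sketch. The function names and the toy scores below are hypothetical and are not taken from the paper; the sketch also assumes both correlations are positive, since the harmonic mean is not meaningful otherwise.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def harmonic_mean(r, rho):
    """Harmonic mean of two correlations (assumes both are positive)."""
    return 2 * r * rho / (r + rho)


# Hypothetical toy data: human-assigned scores and model similarities
# for four word pairs. These values are illustrative only.
gold = [9.0, 7.5, 3.2, 1.0]
predicted = [0.82, 0.70, 0.35, 0.30]

score = harmonic_mean(pearson(gold, predicted), spearman(gold, predicted))
print(f"harmonic mean of r and rho: {score:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model scores well only when it matches the gold scores both linearly (r) and in rank order (ρ).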