- Natural Language Processing Techniques
- Topic Modeling
- Semantic Web and Ontologies
- Linguistics and Discourse Analysis
- Advanced Text Analysis Techniques
- Speech and dialogue systems
- French Language Learning Methods
- Biomedical Text Mining and Ontologies
- Digital Humanities and Scholarship
- Language and cultural evolution
- Linguistics and terminology studies
- Sentiment Analysis and Opinion Mining
- Text Readability and Simplification
- Historical Linguistics and Language Studies
- Web Data Mining and Analysis
- Language, Metaphor, and Cognition
- Syntax, Semantics, Linguistic Variation
- Translation Studies and Practices
- Lexicography and Language Studies
- Linguistics and language evolution
- Linguistics and Cultural Studies
- Complex Network Analysis Techniques
- Cultural Insights and Digital Impacts
- Computational and Text Analysis Methods
- Phonetics and Phonology Research
Langues, Textes, Traitements Informatiques, Cognition
2015-2024
École Normale Supérieure - PSL
2009-2023
École Normale Supérieure
2011-2023
Université Sorbonne Nouvelle
2012-2022
Centre National de la Recherche Scientifique
2012-2022
Sorbonne Université
2016-2022
University of Pisa
2022
University of California, Berkeley
2020
Bocconi University
2020
University of Michigan
2020
Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that, to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this...
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields,...
A central quest of probing is to uncover how pre-trained models encode a linguistic property within their representations. An encoding, however, might be spurious, i.e., the model might not rely on it when making predictions. In this paper, we try to find an encoding that the model actually uses, introducing a usage-based probing setup. We first choose a behavioral task which cannot be solved without using the linguistic property. Then, we attempt to remove the property by intervening on the model’s representations. We contend that, if an encoding is used by the model, its removal should harm the performance...
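The removal-based reasoning in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's actual procedure: the "representations" are synthetic vectors, the "encoding" is a single direction estimated as the difference of class means, the "behavioral task" is a fixed linear read-out, and removal is a plain orthogonal projection. The names `remove_direction`, `readout_accuracy`, `probe_dir`, and `control_dir` are hypothetical.

```python
import numpy as np

def remove_direction(X, u):
    """Return X with its component along direction u projected out."""
    u = u / np.linalg.norm(u)
    return X - np.outer(X @ u, u)

def readout_accuracy(X, y, w):
    """Accuracy of a fixed linear read-out (X @ w > 0) against labels y in {0, 1}."""
    return float(((X @ w > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n)   # toy binary property (e.g. singular vs. plural)
X = rng.normal(size=(n, d))      # synthetic stand-in for model representations
# Assumption: the property is carried mostly by dimension 0.
X[:, 0] = 3.0 * (2 * y - 1) + rng.normal(scale=0.5, size=n)

# Stand-in for the model's behavior: a fixed read-out that relies on dimension 0.
w_task = np.zeros(d)
w_task[0] = 1.0

# Candidate encoding: difference of class means (a very simple linear probe).
probe_dir = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
# Control: a direction the read-out does not use.
control_dir = np.zeros(d)
control_dir[1] = 1.0

acc_before = readout_accuracy(X, y, w_task)
acc_removed = readout_accuracy(remove_direction(X, probe_dir), y, w_task)
acc_control = readout_accuracy(remove_direction(X, control_dir), y, w_task)
print(acc_before, acc_removed, acc_control)
```

In this toy setting, projecting out the probed direction pushes the read-out towards chance, while projecting out the unused control direction leaves it unchanged. That contrast is what distinguishes an encoding the model uses from a spurious one.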
Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.
This paper presents a multilingual system designed to recognize named entities in a wide variety of languages (currently more than 12 languages are concerned). The system includes original strategies to deal with character set encodings, along with analysis strategies and algorithms to process these languages.
This paper gives an overview of the Caderige project. This project involves teams from different areas (biology, machine learning, natural language processing) in order to develop high-level analysis tools for extracting structured information from biological bibliographical databases, especially Medline. The paper presents the approach and compares it to the state of the art.
We describe the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) Treebank feature representations, (ii) Multilingual word representations, (iii) ELMo representations obtained via unsupervised learning from external resources. Our parser performed well in the official end-to-end evaluation (73.02 LAS –...
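For readers unfamiliar with the LAS figure quoted above, the following is a minimal sketch of how unlabeled and labeled attachment scores can be computed. It assumes that gold and predicted tokens are already aligned one-to-one. The official shared-task evaluation parses from raw text and therefore has to align system tokens with gold tokens first, which this sketch does not attempt. The function name `attachment_scores` and the example sentence are hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent for token-aligned lists of (head, deprel) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    n = len(gold)
    # UAS counts tokens whose predicted head is correct.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / n, 100.0 * las_hits / n

# Hypothetical four-token sentence: (head index, dependency relation) per token.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=75.00 LAS=50.00
```

Here the third token has the right head but the wrong label, so it counts towards UAS only, and the fourth token has the wrong head, so it counts towards neither.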