- Topic Modeling
- Natural Language Processing Techniques
- Speech and dialogue systems
- Multimodal Machine Learning Applications
- Biomedical Text Mining and Ontologies
- Machine Learning in Bioinformatics
- Text Readability and Simplification
- Explainable Artificial Intelligence (XAI)
- Anomaly Detection Techniques and Applications
- Data Management and Algorithms
- Data Mining Algorithms and Applications
- Data Quality and Management
- Text and Document Classification Technologies
- Authorship Attribution and Profiling
- Advanced Text Analysis Techniques
- Image Retrieval and Classification Techniques
- Neurobiology of Language and Bilingualism
University of Cambridge
2017-2023
Center for Applied Linguistics
2020
In order to simulate human language capacity, natural language processing systems must be able to reason about the dynamics of everyday situations, including their possible causes and effects. Moreover, they should generalise the acquired world knowledge to new languages, modulo cultural differences. Advances in machine reasoning and cross-lingual transfer depend on the availability of challenging evaluation benchmarks. Motivated by both demands, we introduce Cross-lingual Choice of Plausible Alternatives (XCOPA), a...
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields,...
Anne Lauscher, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, Goran Glavaš. Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. 2020.
Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, its potential is not fully realized, as current multilingual ToD datasets, both for modular and end-to-end modeling, suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these...
In task-oriented dialogue (ToD), a user holds a conversation with an artificial agent with the aim of completing a concrete task. Although this technology represents one of the central objectives of AI and has been the focus of ever more intense research and development efforts, it is currently limited to a few narrow domains (e.g., food ordering, ticket booking) and a handful of languages (e.g., English, Chinese). This work provides an extensive overview of existing methods and resources in multilingual ToD as an entry point to this exciting and emerging field....
Olga Majewska, Ivan Vulić, Goran Glavaš, Edoardo Maria Ponti, Anna Korhonen. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Background: Recent advances in representation learning have enabled large strides in natural language understanding; however, verbal reasoning remains a challenge for state-of-the-art systems. External sources of structured, expert-curated verb-related knowledge have been shown to boost model performance in different Natural Language Processing (NLP) tasks where accurate handling of verb meaning and behaviour is critical. The costliness and time required for manual lexicon construction has been a major obstacle...
VerbNet, an extensive computational verb lexicon for English, has proved useful for supporting a wide range of Natural Language Processing tasks requiring information about the behaviour and meaning of verbs. Biomedical text processing and mining could benefit from a similar resource. We take the first step towards the development of BioVerbNet: a VerbNet specifically aimed at describing verbs in the area of biomedicine. Because VerbNet-style classification is extremely time consuming, we start from a small manual classification of biomedical...
Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and stimuli. There are limitations with these approaches that either do not provide any context for judgments, and thereby ignore ambiguity,...
Recent advances in deep learning have also enabled fast progress in the research of task-oriented dialogue (ToD) systems. However, the majority of ToD systems are developed for English and merely a handful of other widely spoken languages, e.g., Chinese and German. This hugely limits the global reach and, consequently, the transformative socioeconomic potential of such systems. In this tutorial, we will thus discuss and demonstrate the importance of (building) multilingual ToD systems, and then provide a systematic overview of current research gaps, challenges...
VerbNet, the most extensive online verb lexicon currently available for English, has proved useful in supporting a variety of NLP tasks. However, its exploitation in multilingual NLP has been limited by the fact that such classifications are available for few languages only. Since manual development of VerbNet is a major undertaking, researchers have recently translated VerbNet classes from English to other languages. However, no systematic investigation has been conducted into the applicability and accuracy of such a translation approach across different,...
Recent developments in language modeling have enabled large text encoders to derive a wealth of linguistic information from raw text corpora without supervision. Their success across natural language processing (NLP) tasks has called into question the role of man-made computational resources, such as verb lexicons, in supporting modern NLP. Still, probing analyses have concurrently exposed the limitations of the knowledge possessed by neural architectures, revealing them to be clever task solvers rather than self-taught...
We present the first evaluation of the applicability of a spatial arrangement method (SpAM) to a typologically diverse language sample, and its potential to produce semantic resources to support multilingual NLP, with a focus on verb semantics. We demonstrate SpAM’s utility in allowing for quick bottom-up creation of large-scale datasets that balance cross-lingual alignment with language specificity. Starting from a shared sample of 825 English verbs, translated into Chinese, Japanese, Finnish, Polish, and Italian, we apply a two-phase...
Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, the potential of this technology is not fully realised, as current datasets for multilingual ToD, both for modular and end-to-end modelling, suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these...
In task-oriented dialogue (ToD), a user holds a conversation with an artificial agent to complete a concrete task. Although this technology represents one of the central objectives of AI and has been the focus of ever more intense research and development efforts, it is currently limited to a few narrow domains (e.g., food ordering, ticket booking) and a handful of languages (e.g., English, Chinese). This work provides an extensive overview of existing methods and resources in multilingual ToD as an entry point to this exciting and emerging field. We...
In parallel to their overwhelming success across NLP tasks, the language ability of deep Transformer networks pretrained via language modeling (LM) objectives has undergone extensive scrutiny. While probing revealed that these models encode a range of syntactic and semantic properties of language, they are still prone to fall back on superficial cues and simple heuristics to solve downstream tasks, rather than leverage deeper linguistic knowledge. In this paper, we target one such area of their deficiency, verbal reasoning. We...