- Topic Modeling
- Natural Language Processing Techniques
- Speech and Dialogue Systems
- Multimodal Machine Learning Applications
- Text Readability and Simplification
- Speech Recognition and Synthesis
- Domain Adaptation and Few-Shot Learning
- Advanced Text Analysis Techniques
- Text and Document Classification Technologies
- Semantic Web and Ontologies
- Biomedical Text Mining and Ontologies
- Advanced Image and Video Retrieval Techniques
- Software Engineering Research
- Hate Speech and Cyberbullying Detection
- Web Data Mining and Analysis
- Authorship Attribution and Profiling
- Recommender Systems and Techniques
- Advanced Graph Neural Networks
- AI in Service Interactions
- Cooperative Communication and Network Coding
- Advanced MIMO Systems Optimization
- Video Analysis and Summarization
- Network Security and Intrusion Detection
- Information and Cyber Security
- Sentiment Analysis and Opinion Mining
- University of Cambridge (2015-2024)
- IT University of Copenhagen (2023)
- Tokyo Institute of Technology (2023)
- Administration for Community Living (2023)
- Language Science (South Korea) (2016-2023)
- University of Defence (2019-2023)
- Cambridge School (2023)
- University of Glasgow (2023)
- American Jewish Committee (2023)
- University of Edinburgh (2023)
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and serve as a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo...
The main goal behind state-of-the-art pre-trained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping NLP applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. However, due to limited model capacity, their transfer performance is the weakest exactly on such low-resource languages and languages unseen during pre-training. We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations...
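As a rough illustration of the adapter idea behind such modular transfer, the sketch below (a simplification under my own assumptions, not the MAD-X implementation) stacks a language adapter and a task adapter, each a small residual bottleneck, on top of a frozen transformer layer output; swapping the language adapter at inference time is what makes the same task module portable to another language.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A small residual bottleneck module inserted after a frozen transformer layer."""
    def __init__(self, hidden_size: int, reduction: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_size, hidden_size // reduction)
        self.up = nn.Linear(hidden_size // reduction, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen backbone's representation intact.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

class AdaptedLayer(nn.Module):
    """Stacks a language adapter and a task adapter on a frozen layer's output."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.language_adapter = BottleneckAdapter(hidden_size)
        self.task_adapter = BottleneckAdapter(hidden_size)

    def forward(self, frozen_layer_output: torch.Tensor) -> torch.Tensor:
        return self.task_adapter(self.language_adapter(frozen_layer_output))

# Transfer idea: train the task adapter once, then swap in a target-language
# adapter at inference time while keeping the backbone and task adapter fixed.
layer = AdaptedLayer(hidden_size=768)
hidden = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
print(layer(hidden).shape)         # torch.Size([2, 16, 768])
```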
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, Iryna Gurevych. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020.
Unsupervised machine translation, i.e., not assuming any cross-lingual supervision signal (whether a dictionary, translations, or comparable corpora), seems impossible, but nevertheless, Lample et al. (2017) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised cross-lingual word embedding technique for bilingual dictionary induction (Conneau et al., 2017), which we examine here. Our results identify the limitations of current unsupervised MT: it performs much worse on morphologically rich...
We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors, known as word embeddings (WE), from comparable data. To this end, we make several important contributions: (1) we present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG), the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) we demonstrate a simple yet effective approach to building...
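A minimal sketch of the underlying merge-and-shuffle idea, assuming gensim's skip-gram as the monolingual learner, language-prefixed tokens, and toy data; the paper's actual merging strategy and hyperparameters differ.

```python
import random
from gensim.models import Word2Vec

def merge_and_shuffle(doc_en, doc_it, seed=0):
    """Build one pseudo-bilingual document from a pair of document-aligned texts."""
    merged = [f"en_{w}" for w in doc_en] + [f"it_{w}" for w in doc_it]
    random.Random(seed).shuffle(merged)   # mix words of both languages
    return merged

# Toy document-aligned comparable corpus (same topic, not translations).
aligned_pairs = [
    (["economy", "market", "growth"], ["economia", "mercato", "crescita"]),
    (["music", "concert", "band"],    ["musica", "concerto", "gruppo"]),
]
pseudo_corpus = [merge_and_shuffle(en, it, seed=i)
                 for i, (en, it) in enumerate(aligned_pairs)]

# Running a plain monolingual skip-gram over the merged corpus yields a single
# shared space in which words of both languages live together.
model = Word2Vec(sentences=pseudo_corpus, vector_size=32, window=5,
                 min_count=1, sg=1, epochs=50, seed=0)
print(model.wv.most_similar("en_market", topn=3))
```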
Building conversational systems in new domains and with added functionality requires resource-efficient models that work under low-data regimes (i.e., in few-shot setups). Motivated by these requirements, we introduce intent detection methods backed by pretrained dual sentence encoders such as USE and ConveRT. We demonstrate the usefulness and wide applicability of the proposed intent detectors, showing that: 1) they outperform intent detectors based on fine-tuning the full BERT-Large model or using BERT as a fixed black-box...
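The general recipe can be sketched with stand-in components: a frozen off-the-shelf sentence encoder (here sentence-transformers' all-MiniLM-L6-v2, used purely as an assumption in place of USE or ConveRT) and a light classifier trained on a few examples per intent.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Stand-in for a fixed pretrained sentence encoder (the paper uses USE / ConveRT).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Tiny few-shot training set: a handful of utterances per intent.
train_utterances = [
    "I want to book a table for two", "reserve a table tonight",
    "what time do you close", "when are you open on sunday",
]
train_intents = ["book_table", "book_table", "opening_hours", "opening_hours"]

# The encoder stays frozen; only a light classifier is trained on top of its outputs.
X_train = encoder.encode(train_utterances)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_intents)

X_test = encoder.encode(["could you book me a table for 7pm"])
print(clf.predict(X_test))   # should map to the book_table intent
```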
Data scarcity is a long-standing and crucial challenge that hinders quick development of task-oriented dialogue systems across multiple domains: models are expected to learn grammar, syntax, dialogue reasoning, decision making, and language generation from absurdly small amounts of task-specific data. In this paper, we demonstrate that recent progress in language modeling pre-training and transfer learning shows promise to overcome this problem. We propose a task-oriented dialogue model that operates solely on text input: it effectively bypasses explicit policy...
Massively multilingual transformers (MMTs) pretrained via language modeling (e.g., mBERT, XLM-R) have become a default paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched transfer performance. Current evaluations, however, verify their efficacy only in transfers (a) to languages with sufficiently large pretraining corpora, and (b) between close languages. In this work, we analyze the limitations of downstream transfer with MMTs, showing that, much like cross-lingual word embeddings, they are substantially less...
Verbs play a critical role in the meaning of sentences, but these ubiquitous words have received little attention in recent distributional semantics research. We introduce SimVerb-3500, an evaluation resource that provides human ratings for the similarity of 3,500 verb pairs. SimVerb-3500 covers all normed verb types from the USF free-association database, providing at least three examples for every VerbNet class. This broad coverage facilitates detailed analyses of how syntactic and semantic phenomena together...
Cross-lingual word embeddings (CLEs) facilitate cross-lingual transfer of NLP models. Despite their ubiquitous downstream usage, increasingly popular projection-based CLE models are almost exclusively evaluated on bilingual lexicon induction (BLI). Even the BLI evaluations vary greatly, hindering our ability to correctly interpret the performance and properties of different CLE models. In this work, we take the first step towards a comprehensive evaluation of CLE models: we thoroughly evaluate both supervised and unsupervised...
Viable cross-lingual transfer critically depends on the availability of parallel texts. Shortage of such resources imposes a development and evaluation bottleneck in multilingual processing. We introduce JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. In this paper, we present the resource and showcase its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection.
We present Attract-Repel, an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. Attract-Repel facilitates the use of constraints from mono- and cross-lingual resources, yielding semantically specialized cross-lingual vector spaces. Our evaluation shows that the method can make use of existing cross-lingual lexicons to construct high-quality vector spaces for a plethora of different languages, facilitating semantic transfer from high- to lower-resource ones. The effectiveness of our approach is demonstrated with...
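A toy sketch of the attract-repel principle, assuming simple margin losses, an Adagrad optimizer, and a regularizer toward the original vectors; the published algorithm is more involved (mini-batches with sampled negative examples, tuned margins).

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and trainable vectors (in practice: pretrained word embeddings).
vocab = {"cheap": 0, "inexpensive": 1, "expensive": 2, "costly": 3}
emb = torch.nn.Parameter(torch.randn(len(vocab), 50))
initial = emb.detach().clone()

attract = [(0, 1), (2, 3)]   # synonym pairs -> pull together
repel   = [(0, 2), (1, 3)]   # antonym pairs -> push apart
optimizer = torch.optim.Adagrad([emb], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    cos = lambda pairs: F.cosine_similarity(emb[[i for i, _ in pairs]],
                                            emb[[j for _, j in pairs]])
    # Margin losses: synonyms should be highly similar, antonyms dissimilar.
    attract_loss = F.relu(0.9 - cos(attract)).sum()
    repel_loss = F.relu(cos(repel)).sum()
    # Regularization keeps vectors near their original (distributional) positions.
    reg = 0.1 * (emb - initial).pow(2).sum()
    (attract_loss + repel_loss + reg).backward()
    optimizer.step()

print(F.cosine_similarity(emb[0:1], emb[1:2]).item())  # synonyms: high similarity
print(F.cosine_similarity(emb[0:1], emb[2:3]).item())  # antonyms: low similarity
```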
The success of large pretrained language models (LMs) such as BERT and RoBERTa has sparked interest in probing their representations, in order to unveil what types of knowledge they implicitly capture. While prior research focused on morphosyntactic, semantic, and world knowledge, it remains unclear to which extent LMs also derive lexical type-level knowledge from words in context. In this work, we present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks, addressing the...
General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train. We propose ConveRT (Conversational Representations from Transformers), a pretraining framework for conversational tasks satisfying all the following requirements: it is effective, affordable, and quick to train. We pretrain using a retrieval-based response selection task, effectively leveraging quantization and subword-level parameterization in the dual encoder...
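A hedged sketch of the kind of dual-encoder response selection objective used for such pretraining, with a tiny stand-in encoder and in-batch negatives; it does not reproduce ConveRT's quantization or subword-level parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in encoder: embedding + mean pooling + projection (not ConveRT itself)."""
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.emb(token_ids).mean(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)

context_encoder, response_encoder = TinyEncoder(), TinyEncoder()

# A batch of (context, response) pairs; every other response in the batch
# serves as a negative example ("in-batch negatives").
contexts = torch.randint(0, 1000, (8, 20))
responses = torch.randint(0, 1000, (8, 12))

scores = context_encoder(contexts) @ response_encoder(responses).T   # (8, 8)
labels = torch.arange(scores.size(0))         # the i-th response matches the i-th context
loss = F.cross_entropy(scores / 0.1, labels)  # temperature-scaled softmax over the batch
loss.backward()
print(float(loss))
```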
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Linguistic typology aims to capture structural and semantic variation across the world's languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human-labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that, to date, the use of existing typological databases has resulted in consistent but modest improvements in system performance. We show that this...
Ivan Vulić, Marie-Francine Moens. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2015.
In order to simulate human language capacity, natural language processing systems must be able to reason about the dynamics of everyday situations, including their possible causes and effects. Moreover, they should be able to generalise the acquired world knowledge to new languages, modulo cultural differences. Advances in machine reasoning and cross-lingual transfer depend on the availability of challenging evaluation benchmarks. Motivated by both demands, we introduce Cross-lingual Choice of Plausible Alternatives (XCOPA), a...
We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, this article reveals that BWEs may be learned solely on the basis of document-aligned comparable data without any additional lexical resources nor...
A shared bilingual word embedding space (SBWES) is an indispensable resource in a variety of cross-language NLP and IR tasks. A common approach to SBWES induction is to learn a mapping function between monolingual semantic spaces, where the mapping critically relies on a seed lexicon used in the learning process. In this work, we analyze the importance and properties of seed lexicons for SBWES induction across different dimensions (i.e., lexicon source, lexicon size, translation method, translation pair reliability). On the basis of our analysis, we propose a simple but effective...
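For illustration, a minimal projection-based SBWES induction via orthogonal Procrustes on a synthetic seed lexicon; this is a generic sketch of the mapping approach, not the specific method analyzed in the paper.

```python
import numpy as np

def orthogonal_procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Find the orthogonal W minimizing ||XW - Y||_F (source rows X, target rows Y)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
dim, lexicon_size = 300, 5000

# Toy stand-ins for monolingual embeddings of seed-lexicon translation pairs:
# row i of X (source language) translates to row i of Y (target language).
X = rng.standard_normal((lexicon_size, dim))
true_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
Y = X @ true_rotation + 0.01 * rng.standard_normal((lexicon_size, dim))

W = orthogonal_procrustes(X, Y)
# After mapping, source vectors live in the target space: a shared bilingual space.
print(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))   # small relative residual
```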
We present LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of lexical entailment (LE), also known as the IS-A or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of the vectors to reflect the actual...
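One simple instantiation of such an asymmetric measure, sketched below under the assumption that more general concepts receive larger vector norms, combines cosine distance with a normalized norm-difference term; the exact formulation used by LEAR is given in the paper.

```python
import numpy as np

def le_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Asymmetric lexical-entailment distance: small when x is a hyponym of y.

    Combines symmetric cosine distance with a norm-difference term, so that a
    specific term (shorter vector) entailing a general term (longer vector)
    receives a lower distance than the reverse direction.
    """
    cos_dist = 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    norm_term = (np.linalg.norm(x) - np.linalg.norm(y)) / (np.linalg.norm(x) + np.linalg.norm(y))
    return cos_dist + norm_term

# Toy vectors: "terrier" (specific, shorter norm) vs "dog" (general, longer norm).
terrier = np.array([1.0, 0.9, 0.1])
dog = 2.0 * np.array([1.0, 1.0, 0.2])
print(le_distance(terrier, dog))   # low: terrier IS-A dog
print(le_distance(dog, terrier))   # higher: dog is not a type of terrier
```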
We introduce HyperLex, a data set and evaluation resource that quantifies the extent of semantic category membership, that is, the type-of relation, also known as the hyponymy-hypernymy or lexical entailment (LE) relation, between 2,616 concept pairs. Cognitive psychology research has established that typicality and category/class membership are computed in human memory as a gradual rather than binary relation. Nevertheless, most NLP research and existing large-scale inventories (WordNet, DBPedia, etc.) treat the LE relation as binary. To address...
Ivan Vulić, Goran Glavaš, Roi Reichart, Anna Korhonen. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, Pei-Hao Su. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.