- Topic Modeling
- Natural Language Processing Techniques
- Speech and Dialogue Systems
- Multimodal Machine Learning Applications
- Text Readability and Simplification
- Speech Recognition and Synthesis
- Domain Adaptation and Few-Shot Learning
- Advanced Text Analysis Techniques
- Text and Document Classification Technologies
- Semantic Web and Ontologies
- Biomedical Text Mining and Ontologies
- Advanced Image and Video Retrieval Techniques
- Software Engineering Research
- Hate Speech and Cyberbullying Detection
- Web Data Mining and Analysis
- Authorship Attribution and Profiling
- Recommender Systems and Techniques
- Advanced Graph Neural Networks
- AI in Service Interactions
- Cooperative Communication and Network Coding
- Advanced MIMO Systems Optimization
- Video Analysis and Summarization
- Network Security and Intrusion Detection
- Information and Cyber Security
- Sentiment Analysis and Opinion Mining
- University of Cambridge (2015-2024)
- IT University of Copenhagen (2023)
- Tokyo Institute of Technology (2023)
- Administration for Community Living (2023)
- Language Science (South Korea) (2016-2023)
- University of Defence (2019-2023)
- Cambridge School (2023)
- University of Glasgow (2023)
- American Jewish Committee (2023)
- University of Edinburgh (2023)
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and serve as a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo...
The main goal behind state-of-the-art pre-trained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping NLP applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. However, due to limited model capacity, their transfer performance is the weakest exactly on such low-resource languages and languages unseen during pre-training. We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations...
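As a rough illustration of the adapter idea behind such modular transfer, the sketch below (a simplification under my own assumptions, not the MAD-X implementation) stacks a language adapter and a task adapter, each a small residual bottleneck, on top of a frozen transformer layer output; swapping the language adapter at inference time is what makes the same task module portable to another language.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A small residual bottleneck module inserted after a frozen transformer layer."""
    def __init__(self, hidden_size: int, reduction: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_size, hidden_size // reduction)
        self.up = nn.Linear(hidden_size // reduction, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen backbone's representation intact.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

class AdaptedLayer(nn.Module):
    """Stacks a language adapter and a task adapter on a frozen layer's output."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.language_adapter = BottleneckAdapter(hidden_size)
        self.task_adapter = BottleneckAdapter(hidden_size)

    def forward(self, frozen_layer_output: torch.Tensor) -> torch.Tensor:
        return self.task_adapter(self.language_adapter(frozen_layer_output))

# Transfer idea: train the task adapter once, then swap in a target-language
# adapter at inference time while keeping the backbone and task adapter fixed.
layer = AdaptedLayer(hidden_size=768)
hidden = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
print(layer(hidden).shape)         # torch.Size([2, 16, 768])
```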
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, Iryna Gurevych. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020.
Unsupervised machine translation, i.e., not assuming any cross-lingual supervision signal (whether a dictionary, translations, or comparable corpora), seems impossible, but nevertheless, Lample et al. (2017) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised cross-lingual word embedding technique for bilingual dictionary induction (Conneau et al., 2017), which we examine here. Our results identify the limitations of current unsupervised MT: it performs much worse on morphologically rich...
We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors, known as word embeddings (WE), from comparable data. To this end, we make several important contributions: (1) we present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG), the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) we demonstrate a simple yet effective approach to building...
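A minimal sketch of the underlying merge-and-shuffle idea, assuming gensim's skip-gram as the monolingual learner, language-prefixed tokens, and toy data; the paper's actual merging strategy and hyperparameters differ.

```python
import random
from gensim.models import Word2Vec

def merge_and_shuffle(doc_en, doc_it, seed=0):
    """Build one pseudo-bilingual document from a pair of document-aligned texts."""
    merged = [f"en_{w}" for w in doc_en] + [f"it_{w}" for w in doc_it]
    random.Random(seed).shuffle(merged)   # mix words of both languages
    return merged

# Toy document-aligned comparable corpus (same topic, not translations).
aligned_pairs = [
    (["economy", "market", "growth"], ["economia", "mercato", "crescita"]),
    (["music", "concert", "band"],    ["musica", "concerto", "gruppo"]),
]
pseudo_corpus = [merge_and_shuffle(en, it, seed=i)
                 for i, (en, it) in enumerate(aligned_pairs)]

# Running a plain monolingual skip-gram over the merged corpus yields a single
# shared space in which words of both languages live together.
model = Word2Vec(sentences=pseudo_corpus, vector_size=32, window=5,
                 min_count=1, sg=1, epochs=50, seed=0)
print(model.wv.most_similar("en_market", topn=3))
```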
Building conversational systems in new domains and with added functionality requires resource-efficient models that work under low-data regimes (i.e., in few-shot setups). Motivated by these requirements, we introduce intent detection methods backed by pretrained dual sentence encoders such as USE and ConveRT. We demonstrate the usefulness and wide applicability of the proposed intent detectors, showing that: 1) they outperform intent detectors based on fine-tuning the full BERT-Large model or using BERT as a fixed black-box...
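The general recipe can be sketched with stand-in components: a frozen off-the-shelf sentence encoder (here sentence-transformers' all-MiniLM-L6-v2, used purely as an assumption in place of USE or ConveRT) and a light classifier trained on a few examples per intent.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Stand-in for a fixed pretrained sentence encoder (the paper uses USE / ConveRT).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Tiny few-shot training set: a handful of utterances per intent.
train_utterances = [
    "I want to book a table for two", "reserve a table tonight",
    "what time do you close", "when are you open on sunday",
]
train_intents = ["book_table", "book_table", "opening_hours", "opening_hours"]

# The encoder stays frozen; only a light classifier is trained on top of its outputs.
X_train = encoder.encode(train_utterances)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_intents)

X_test = encoder.encode(["could you book me a table for 7pm"])
print(clf.predict(X_test))   # should map to the book_table intent
```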
Data scarcity is a long-standing and crucial challenge that hinders quick development of task-oriented dialogue systems across multiple domains: models are expected to learn grammar, syntax, dialogue reasoning, decision making, and language generation from absurdly small amounts of task-specific data. In this paper, we demonstrate that recent progress in language modeling pre-training and transfer learning shows promise to overcome this problem. We propose a task-oriented dialogue model that operates solely on text input: it effectively bypasses explicit policy...
Massively multilingual transformers (MMTs) pretrained via language modeling (e.g., mBERT, XLM-R) have become a default paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched transfer performance. Current evaluations, however, verify their efficacy only in transfers (a) to languages with sufficiently large pretraining corpora, and (b) between close languages. In this work, we analyze the limitations of downstream transfer with MMTs, showing that, much like cross-lingual word embeddings, they are substantially less...
Verbs play a critical role in the meaning of sentences, but these ubiquitous words have received little attention in recent distributional semantics research. We introduce SimVerb-3500, an evaluation resource that provides human ratings for the similarity of 3,500 verb pairs. SimVerb-3500 covers all normed verb types from the USF free-association database, providing at least three examples for every VerbNet class. This broad coverage facilitates detailed analyses of how syntactic and semantic phenomena together...
Cross-lingual word embeddings (CLEs) facilitate cross-lingual transfer of NLP models. Despite their ubiquitous downstream usage, increasingly popular projection-based CLE models are almost exclusively evaluated on bilingual lexicon induction (BLI). Even the BLI evaluations vary greatly, hindering our ability to correctly interpret the performance and properties of different CLE models. In this work, we take the first step towards a comprehensive evaluation of CLE models: we thoroughly evaluate both supervised and unsupervised...
Viable cross-lingual transfer critically depends on the availability of parallel texts. Shortage of such resources imposes a development and evaluation bottleneck in multilingual processing. We introduce JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. In this paper, we present the resource and showcase its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection.
We present Attract-Repel, an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. Attract-Repel facilitates the use of constraints from mono- and cross-lingual resources, yielding semantically specialized cross-lingual vector spaces. Our evaluation shows that the method can make use of existing cross-lingual lexicons to construct high-quality vector spaces for a plethora of different languages, facilitating semantic transfer from high- to lower-resource ones. The effectiveness of our approach is demonstrated with...
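A toy sketch of the attract-repel principle, assuming simple margin losses, an Adagrad optimizer, and a regularizer toward the original vectors; the published algorithm is more involved (mini-batches with sampled negative examples, tuned margins).

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and trainable vectors (in practice: pretrained word embeddings).
vocab = {"cheap": 0, "inexpensive": 1, "expensive": 2, "costly": 3}
emb = torch.nn.Parameter(torch.randn(len(vocab), 50))
initial = emb.detach().clone()

attract = [(0, 1), (2, 3)]   # synonym pairs -> pull together
repel   = [(0, 2), (1, 3)]   # antonym pairs -> push apart
optimizer = torch.optim.Adagrad([emb], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    cos = lambda pairs: F.cosine_similarity(emb[[i for i, _ in pairs]],
                                            emb[[j for _, j in pairs]])
    # Margin losses: synonyms should be highly similar, antonyms dissimilar.
    attract_loss = F.relu(0.9 - cos(attract)).sum()
    repel_loss = F.relu(cos(repel)).sum()
    # Regularization keeps vectors near their original (distributional) positions.
    reg = 0.1 * (emb - initial).pow(2).sum()
    (attract_loss + repel_loss + reg).backward()
    optimizer.step()

print(F.cosine_similarity(emb[0:1], emb[1:2]).item())  # synonyms: high similarity
print(F.cosine_similarity(emb[0:1], emb[2:3]).item())  # antonyms: low similarity
```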
The success of large pretrained language models (LMs) such as BERT and RoBERTa has sparked interest in probing their representations, in order to unveil what types of knowledge they implicitly capture. While prior research focused on morphosyntactic, semantic, and world knowledge, it remains unclear to which extent LMs also derive lexical type-level knowledge from words in context. In this work, we present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks, addressing the...
General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train. We propose ConveRT (Conversational Representations from Transformers), a pretraining framework for conversational tasks satisfying all the following requirements: it is effective, affordable, and quick to train. We pretrain using a retrieval-based response selection task, effectively leveraging quantization and subword-level parameterization in the dual encoder...
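A hedged sketch of the kind of dual-encoder response selection objective used for such pretraining, with a tiny stand-in encoder and in-batch negatives; it does not reproduce ConveRT's quantization or subword-level parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in encoder: embedding + mean pooling + projection (not ConveRT itself)."""
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.emb(token_ids).mean(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)

context_encoder, response_encoder = TinyEncoder(), TinyEncoder()

# A batch of (context, response) pairs; every other response in the batch
# serves as a negative example ("in-batch negatives").
contexts = torch.randint(0, 1000, (8, 20))
responses = torch.randint(0, 1000, (8, 12))

scores = context_encoder(contexts) @ response_encoder(responses).T   # (8, 8)
labels = torch.arange(scores.size(0))         # the i-th response matches the i-th context
loss = F.cross_entropy(scores / 0.1, labels)  # temperature-scaled softmax over the batch
loss.backward()
print(float(loss))
```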
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Linguistic typology aims to capture structural and semantic variation across the world's languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human-labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that, to date, the use of existing typological databases has resulted in consistent but modest improvements in system performance. We show that this...
Ivan Vulić, Marie-Francine Moens. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2015.
In order to simulate human language capacity, natural language processing systems must be able to reason about the dynamics of everyday situations, including their possible causes and effects. Moreover, they should be able to generalise the acquired world knowledge to new languages, modulo cultural differences. Advances in machine reasoning and cross-lingual transfer depend on the availability of challenging evaluation benchmarks. Motivated by both demands, we introduce Cross-lingual Choice of Plausible Alternatives (XCOPA), a...
We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, this article reveals that BWEs may be learned solely on the basis of document-aligned comparable data without any additional lexical resources nor...
A shared bilingual word embedding space (SBWES) is an indispensable resource in a variety of cross-language NLP and IR tasks. A common approach to SBWES induction is to learn a mapping function between monolingual semantic spaces, where the mapping critically relies on a seed lexicon used in the learning process. In this work, we analyze the importance and properties of seed lexicons for SBWES induction across different dimensions (i.e., lexicon source, lexicon size, translation method, translation pair reliability). On the basis of our analysis, we propose a simple but effective...
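For illustration, a minimal projection-based SBWES induction via orthogonal Procrustes on a synthetic seed lexicon; this is a generic sketch of the mapping approach, not the specific method analyzed in the paper.

```python
import numpy as np

def orthogonal_procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Find the orthogonal W minimizing ||XW - Y||_F (source rows X, target rows Y)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
dim, lexicon_size = 300, 5000

# Toy stand-ins for monolingual embeddings of seed-lexicon translation pairs:
# row i of X (source language) translates to row i of Y (target language).
X = rng.standard_normal((lexicon_size, dim))
true_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
Y = X @ true_rotation + 0.01 * rng.standard_normal((lexicon_size, dim))

W = orthogonal_procrustes(X, Y)
# After mapping, source vectors live in the target space: a shared bilingual space.
print(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))   # small relative residual
```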
We present LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of lexical entailment (LE), also known as the IS-A or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of the vectors to reflect the actual...
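One simple instantiation of such an asymmetric measure, sketched below under the assumption that more general concepts receive larger vector norms, combines cosine distance with a normalized norm-difference term; the exact formulation used by LEAR is given in the paper.

```python
import numpy as np

def le_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Asymmetric lexical-entailment distance: small when x is a hyponym of y.

    Combines symmetric cosine distance with a norm-difference term, so that a
    specific term (shorter vector) entailing a general term (longer vector)
    receives a lower distance than the reverse direction.
    """
    cos_dist = 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    norm_term = (np.linalg.norm(x) - np.linalg.norm(y)) / (np.linalg.norm(x) + np.linalg.norm(y))
    return cos_dist + norm_term

# Toy vectors: "terrier" (specific, shorter norm) vs "dog" (general, longer norm).
terrier = np.array([1.0, 0.9, 0.1])
dog = 2.0 * np.array([1.0, 1.0, 0.2])
print(le_distance(terrier, dog))   # low: terrier IS-A dog
print(le_distance(dog, terrier))   # higher: dog is not a type of terrier
```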
We introduce HyperLex, a data set and evaluation resource that quantifies the extent of semantic category membership, that is, the type-of relation, also known as the hyponymy-hypernymy or lexical entailment (LE) relation, between 2,616 concept pairs. Cognitive psychology research has established that typicality and category/class membership are computed in human memory as a gradual rather than binary relation. Nevertheless, most NLP research and existing large-scale inventories (WordNet, DBPedia, etc.) treat the LE relation as binary. To address...
Ivan Vulić, Goran Glavaš, Roi Reichart, Anna Korhonen. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, Pei-Hao Su. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.