NFDI4DS | UHH-SEMS - Publication Details

Aitor Soroa

ORCID: 0000-0001-8573-2654

About

Contact & Profiles

A5053230169

Research Areas

Natural Language Processing Techniques
Topic Modeling
Semantic Web and Ontologies
Speech and dialogue systems
Text Readability and Simplification
Basque language and culture studies
Wikis in Education and Collaboration
Spanish Linguistics and Language Studies
Biomedical Text Mining and Ontologies
Multimodal Machine Learning Applications
Web Data Mining and Analysis
Advanced Text Analysis Techniques
Advanced Database Systems and Queries
Artificial Intelligence in Games
AI in Service Interactions
Advanced Image and Video Retrieval Techniques
Data Quality and Management
Linguistics and terminology studies
Text and Document Classification Technologies
Translation Studies and Practices
Video Analysis and Summarization
Service-Oriented Architecture and Web Services
Expert finding and Q&A systems
3D Surveying and Cultural Heritage
Robotics and Automated Systems

University of the Basque Country
2014-2023

Association of Electronic and Information Technologies
2021

Basque Center on Cognition, Brain and Language
2021

Wageningen University & Research
2021

Bocconi University
2021

Yangon University Of Distance Education
2014

National University of Distance Education
2014

University of Edinburgh
2012

Ikerbasque
2012

University of Sheffield
2010

A study on similarity and relatedness using distributional and WordNet-based approaches

OPENALEX - Publications

Eneko Agirre Enrique Alfonseca Keith Hall Jana Kravalová Marius Paşca and 1 more

This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provides the best results in its class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.

10.3115/1620754.1620758 article EN 2009-01-01

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

OPENALEX - Publications

Teven Le Scao Angela Fan Christopher Akiki Ellie Pavlick Suzana Ilić and 95 more

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...

10.48550/arxiv.2211.05100 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Personalizing PageRank for word sense disambiguation

OPENALEX - Publications

Eneko Agirre Aitor Soroa

In this paper we propose a new graph-based method that uses the knowledge in an LKB (based on WordNet) in order to perform unsupervised Word Sense Disambiguation. Our algorithm uses the full graph of the LKB efficiently, performing better than previous approaches on English all-words datasets. We also show that the algorithm can be easily ported to other languages with good results, with the only requirement of having a wordnet. In addition, we make an analysis of the performance of the algorithm, showing that it is efficient and that it could be tuned to be faster.

10.3115/1609067.1609070 article EN 2009-01-01
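The abstract above describes running Personalized PageRank over a knowledge-base graph, with the context of the target word deciding where the random walk restarts. The following is a minimal sketch of that general idea, not the authors' implementation: the toy graph, the node names (`bank#finance`, `bank#river`), the undirected edges, and the parameter defaults are all illustrative assumptions.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration in which the restart mass goes only to the seed nodes."""
    nodes = list(graph)
    # Restart (teleport) distribution: uniform over the seeds, zero elsewhere.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            # Every node in this toy graph has at least one neighbour.
            share = damping * rank[n] / len(graph[n])
            for m in graph[n]:
                new_rank[m] += share
        rank = new_rank
    return rank


def best_sense(graph, context_words, candidate_senses):
    """Pick the candidate sense with the highest rank given the context."""
    scores = personalized_pagerank(graph, set(context_words))
    return max(candidate_senses, key=lambda s: scores[s])


# Hypothetical toy knowledge base: the word "bank", its two senses, and
# a few related concepts. A real LKB such as WordNet is far larger.
GRAPH = build_graph([
    ("bank", "bank#finance"),
    ("bank", "bank#river"),
    ("money", "bank#finance"),
    ("deposit", "bank#finance"),
    ("water", "bank#river"),
])

if __name__ == "__main__":
    senses = ["bank#finance", "bank#river"]
    print(best_sense(GRAPH, ["money", "deposit"], senses))
```

Seeding the walk with `money` and `deposit` concentrates rank on the neighbouring financial sense, while the river sense only receives mass indirectly through the shared `bank` node.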

Random Walks for Knowledge-Based Word Sense Disambiguation

OPENALEX - Publications

Eneko Agirre Oier López de Lacalle Aitor Soroa

Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a word in context. In this article we present a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm and LKB combination compares favorably to other knowledge-based approaches in the literature that use similar knowledge on a variety of English data sets and a data set on Spanish. We include a detailed analysis...

10.1162/coli_a_00164 article EN cc-by-nc-nd Computational Linguistics 2013-04-23

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

OPENALEX - Publications

Hugo Laurençon Lucile Saulnier Thomas J. Wang Christopher Akiki A. Villanova del Moral and 49 more

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration...

10.48550/arxiv.2303.03915 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Building event-centric knowledge graphs from news

OPENALEX - Publications

Marco Rospocher Marieke van Erp Piek Vossen Antske Fokkens Itziar Aldabe and 4 more

10.1016/j.websem.2015.12.004 article EN Journal of Web Semantics 2016-01-13

SemEval-2007 Task 02

OPENALEX - Publications

Eneko Agirre Aitor Soroa

The goal of this task is to allow for comparison across sense-induction and discrimination systems, and also to compare these systems to other supervised and knowledge-based systems. In total there were 6 participating systems. We reused the SemEval-2007 English lexical sample subtask of task 17, and set up both clustering-style unsupervised evaluation (using OntoNotes senses as gold-standard) and a supervised evaluation (using part of the dataset for mapping). We provide a comparison to the results of the systems participating in the lexical sample subtask of task 17.

10.3115/1621474.1621476 article EN 2007-01-01

WikiWalk

OPENALEX - Publications

Eric Yeh Daniel Ramage Christopher D. Manning Eneko Agirre Aitor Soroa

Computing semantic relatedness of natural language texts is a key component of tasks such as information retrieval and summarization, and often depends on knowledge of a broad range of real-world concepts and relationships. We address this knowledge integration issue by computing semantic relatedness using personalized PageRank (random walks) on a graph derived from Wikipedia. This paper evaluates methods for building the graph, including link selection strategies, and two methods for representing input texts as distributions over the graph nodes: one based on a dictionary lookup,...

10.3115/1708124.1708133 article EN 2009-01-01

Image captioning for effective use of language models in knowledge-based visual question answering

OPENALEX - Publications

Ander Salaberria Gorka Azkune Oier López de Lacalle Aitor Soroa Eneko Agirre

Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires...

10.1016/j.eswa.2022.118669 article EN cc-by Expert Systems with Applications 2022-08-28

Do Multilingual Language Models Think Better in English?

OPENALEX - Publications

Julen Etxaniz Gorka Azkune Aitor Soroa Oier Lacalle Mikel Artetxe

10.18653/v1/2024.naacl-short.46 article EN 2024-01-01

Big data for Natural Language Processing: A streaming approach

OPENALEX - Publications

Rodrigo Agerri Xabier Artola Zubillaga Zuhaitz Beloki Germán Rigau Aitor Soroa

10.1016/j.knosys.2014.11.007 article EN Knowledge-Based Systems 2014-11-20

Random Walks and Neural Network Language Models on Knowledge Bases

OPENALEX - Publications

Josu Goikoetxea Aitor Soroa Eneko Agirre

Random walks over large knowledge bases like WordNet have been successfully used in word similarity, relatedness and disambiguation tasks. Unfortunately, those algorithms are relatively slow for large repositories, with significant memory footprints. In this paper we present a novel algorithm which encodes the structure of a knowledge base in a continuous vector space, combining random walks and neural net language models in order to produce novel word representations. Evaluation in word relatedness and similarity datasets yields equal or better results than...

10.3115/v1/n15-1165 article EN Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2015-01-01

Analyzing the Limitations of Cross-lingual Word Embedding Mappings

OPENALEX - Publications

Aitor Ormazabal Mikel Artetxe Gorka Labaka Aitor Soroa Eneko Agirre

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. So as to answer this question, we experiment with...

10.18653/v1/p19-1492 preprint EN cc-by 2019-01-01

Two graph-based algorithms for state-of-the-art WSD

OPENALEX - Publications

Eneko Agirre David Martínez Oier López de Lacalle Aitor Soroa

This paper explores the use of two graph algorithms for unsupervised induction and tagging of nominal word senses based on corpora. Our main contribution is the optimization of the free parameters of those algorithms and its evaluation against publicly available gold standards. We present a thorough evaluation comprising supervised and unsupervised modes, and both lexical-sample and all-words tasks. The results show that, in spite of the information loss inherent to mapping the induced senses to the gold-standard, the optimization of parameters based on a small sample of nouns carries over to all nouns, performing close to supervised systems...

10.3115/1610075.1610157 article EN 2006-01-01

Graph-based Word Sense Disambiguation of biomedical documents

OPENALEX - Publications

Eneko Agirre Aitor Soroa Mark Stevenson

Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This article presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus which is represented as a graph. A state-of-the-art algorithm, Personalized PageRank, is used to perform WSD. When evaluated on the NLM-WSD...

10.1093/bioinformatics/btq555 article EN Bioinformatics 2010-10-07

Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet

OPENALEX - Publications

Josu Goikoetxea Eneko Agirre Aitor Soroa

Text and Knowledge Bases are complementary sources of information. Given the success of distributed word representations learned from text, several techniques to infuse additional information from sources like WordNet into word representations have been proposed. In this paper, we follow an alternative route. We learn word representations from text and WordNet independently, and then explore simple and sophisticated methods to combine them. The combined representations are applied to an extensive set of datasets on word similarity and relatedness. Simple combination methods happen to perform better than more complex methods like CCA or...

10.1609/aaai.v30i1.10321 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2016-03-05

DoQA - Accessing Domain-Specific FAQs via Conversational QA

OPENALEX - Publications

Jon Ander Campos Arantxa Otegi Aitor Soroa Jan Deriu Mark Cieliebak and 1 more

The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and is multi-domain. In addition, we introduce a more realistic information retrieval...

10.18653/v1/2020.acl-main.652 article EN cc-by 2020-01-01

KIDE4I: A Generic Semantics-Based Task-Oriented Dialogue System for Human-Machine Interaction in Industry 5.0

OPENALEX - Publications

Cristina Aceta Izaskun Fernández Aitor Soroa

In Industry 5.0, human workers and their wellbeing are placed at the centre of the production process. In this context, task-oriented dialogue systems allow workers to delegate simple tasks to industrial assets while working on other, more complex ones. The possibility of naturally interacting with these systems reduces the cognitive demand to use them and triggers acceptation. Most modern solutions, however, do not allow a natural communication, and modern techniques to obtain such systems require large amounts of data to be trained, which is scarce in these scenarios....

10.3390/app12031192 article EN cc-by Applied Sciences 2022-01-24

EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching

OPENALEX - Publications

Maite Heredia Jeremy Barnes Aitor Soroa

Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce a first approach to develop a naturally...

10.48550/arxiv.2502.03188 preprint EN arXiv (Cornell University) 2025-02-05

Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data

OPENALEX - Publications

Maite Heredia Gorka Labaka Jeremy Barnes Aitor Soroa

Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation,...

10.48550/arxiv.2502.12924 preprint EN arXiv (Cornell University) 2025-02-18

Two birds with one stone

OPENALEX - Publications

Roberto Navigli Stefano Faralli Aitor Soroa Oier López de Lacalle Eneko Agirre

In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain Word Sense Disambiguation (WSD). In order to learn a semantic model for each domain we first extract relevant terms from the texts in the domain and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check the semantic models, choose the appropriate domain for that text and use the best-matching model to perform WSD. Our results show considerable improvements on text categorization and domain WSD tasks.

10.1145/2063576.2063955 article EN 2011-10-24

Improving search over Electronic Health Records using UMLS-based query expansion through random walks

OPENALEX - Publications

David Martínez Arantxa Otegi Aitor Soroa Eneko Agirre

10.1016/j.jbi.2014.04.013 article EN publisher-specific-oa Journal of Biomedical Informatics 2014-04-21
