Aitor Soroa

ORCID: 0000-0001-8573-2654
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Semantic Web and Ontologies
  • Speech and Dialogue Systems
  • Text Readability and Simplification
  • Basque language and culture studies
  • Wikis in Education and Collaboration
  • Spanish Linguistics and Language Studies
  • Biomedical Text Mining and Ontologies
  • Multimodal Machine Learning Applications
  • Web Data Mining and Analysis
  • Advanced Text Analysis Techniques
  • Advanced Database Systems and Queries
  • Artificial Intelligence in Games
  • AI in Service Interactions
  • Advanced Image and Video Retrieval Techniques
  • Data Quality and Management
  • Linguistics and Terminology Studies
  • Text and Document Classification Technologies
  • Translation Studies and Practices
  • Video Analysis and Summarization
  • Service-Oriented Architecture and Web Services
  • Expert finding and Q&A systems
  • 3D Surveying and Cultural Heritage
  • Robotics and Automated Systems

University of the Basque Country
2014-2023

Association of Electronic and Information Technologies
2021

Basque Center on Cognition, Brain and Language
2021

Wageningen University & Research
2021

Bocconi University
2021

Yangon University of Distance Education
2014

National University of Distance Education
2014

University of Edinburgh
2012

Ikerbasque
2012

University of Sheffield
2010

This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provides the best results in their class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.

10.3115/1620754.1620758 article EN 2009-01-01
Teven Le Scao Angela Fan Christopher Akiki Ellie Pavlick Suzana Ilić and 95 more Daniel Hesslow Roman Castagné Alexandra Sasha Luccioni François Yvon Matthias Gallé Jonathan Tow Alexander M. Rush Stella Biderman Albert Webson Pawan Sasanka Ammanamanchi Thomas J. Wang Benoît Sagot Niklas Muennighoff A. Villanova del Moral Olatunji Ruwase Rachel Bawden Stas Bekman Angelina McMillan-Major Iz Beltagy Huu Du Nguyen Lucile Saulnier Samson Tan Pedro Ortiz Suarez Victor Sanh Hugo Laurençon Yacine Jernite Julien Launay Margaret Mitchell Colin Raffel Aaron Gokaslan Adi Simhi Aitor Soroa Alham Fikri Aji Amit Alfassy Anna Rogers Ariel Kreisberg Nitzav Canwen Xu Chenghao Mou Chris Chinenye Emezue Christopher Klamm Colin Leong Daniel van Strien David Ifeoluwa Adelani Dragomir Radev Eduardo González Ponferrada Efrat Levkovizh Ethan Kim Eyal Bar Natan Francesco De Toni Gérard Dupont Germán Kruszewski Giada Pistilli Hady Elsahar Hamza Benyamina Hieu Tran Ian Yu Idris Abdulmumin Isaac Johnson Itziar González-Dios Javier de la Rosa Jenny Chim Jesse Dodge Jianguo Zhu Jonathan Chang Jörg Frohberg Joseph Tobing Joydeep Bhattacharjee Khalid Almubarak Kimbo Chen Kyle Lo Leandro von Werra Leon Weber Long Phan Loubna Ben Allal Ludovic Tanguy Manan Dey Manuel Romero Muñoz Maraim Masoud María Grandury Mario Šaško Max Tze Han Huang Maximin Coavoux Mayank Singh Mike Tian-Jian Jiang Minh Chien Vu Mohammad Ali Jauhar Mustafa Ghaleb Nishant Subramani Nora Kassner Nurulaqilla Khamis Olivier Nguyen Omar Espejel Ona De Gibert Paulo Villegas Peter Henderson

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on ROOTS...

10.48550/arxiv.2211.05100 preprint EN cc-by arXiv (Cornell University) 2022-01-01

In this paper we propose a new graph-based method that uses the knowledge in a LKB (based on WordNet) in order to perform unsupervised Word Sense Disambiguation. Our algorithm uses the full graph of the LKB efficiently, performing better than previous approaches on English all-words datasets. We also show that it can be easily ported to other languages with good results, the only requirement being to have a wordnet. In addition, we make an analysis of the performance of the algorithm, showing that it is efficient and that it could be tuned to be faster.

10.3115/1609067.1609070 article EN 2009-01-01

Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a word in context. In this article we present a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm and LKB combination compares favorably to other knowledge-based approaches in the literature that use similar knowledge on a variety of English data sets and a data set on Spanish. We include a detailed analysis...

10.1162/coli_a_00164 article EN cc-by-nc-nd Computational Linguistics 2013-04-23
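The core of the random-walk approach above is Personalized PageRank over a sense graph: the walker is repeatedly restarted at the senses of the context words, and the highest-ranked sense of the target word wins. The following is a minimal illustrative sketch, not the paper's implementation; the toy graph, sense identifiers, and damping value are invented for the example.

```python
def personalized_pagerank(graph, teleport, damping=0.85, iters=50):
    """Power iteration for Personalized PageRank.

    graph:    dict node -> list of neighbour nodes
    teleport: dict node -> restart probability (should sum to 1)
    Dangling mass is dropped, which is fine for ranking purposes.
    """
    nodes = list(graph)
    rank = {n: teleport.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) * teleport.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                new[m] += share
        rank = new
    return rank

# Hypothetical toy sense graph: two senses of "bank" plus context senses.
graph = {
    "bank#finance": ["money#1", "deposit#1"],
    "bank#river":   ["water#1"],
    "money#1":      ["bank#finance", "deposit#1"],
    "deposit#1":    ["bank#finance", "money#1"],
    "water#1":      ["bank#river"],
}
# Personalize the walk on the senses of the context words.
context = {"money#1": 0.5, "deposit#1": 0.5}
scores = personalized_pagerank(graph, context)
best = max(["bank#finance", "bank#river"], key=scores.get)
# With a financial context, the finance sense outranks the river sense.
```

In the real setting the graph is WordNet with tens of thousands of synsets, so an efficient sparse implementation matters; the power-iteration structure is the same.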

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken to assemble the Responsible Open-science Open-collaboration...

10.48550/arxiv.2303.03915 preprint EN cc-by arXiv (Cornell University) 2023-01-01

The goal of this task is to allow for comparison across sense-induction and discrimination systems, and also to compare these systems to other supervised and knowledge-based systems. In total there were 6 participating systems. We reused the SemEval-2007 English lexical sample subtask of task 17, and set up both a clustering-style unsupervised evaluation (using OntoNotes senses as gold-standard) and a supervised evaluation (using part of the dataset for mapping). We also provide a comparison to the results of the systems participating in task 17.

10.3115/1621474.1621476 article EN 2007-01-01

Computing semantic relatedness of natural language texts is a key component of tasks such as information retrieval and summarization, and often depends on knowledge of a broad range of real-world concepts and relationships. We address this knowledge integration issue by computing semantic relatedness using personalized PageRank (random walks) on a graph derived from Wikipedia. This paper evaluates methods for building the graph, including link selection strategies, and two methods for representing input texts as distributions over the graph nodes: one based on a dictionary lookup,...

10.3115/1708124.1708133 article EN 2009-01-01

Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a task which requires...

10.1016/j.eswa.2022.118669 article EN cc-by Expert Systems with Applications 2022-08-28

Random walks over large knowledge bases like WordNet have been successfully used in word similarity, relatedness and disambiguation tasks. Unfortunately, those algorithms are relatively slow for large repositories, with significant memory footprints. In this paper we present a novel algorithm which encodes the structure of a knowledge base in a continuous vector space, combining random walks and neural net language models in order to produce new word representations. Evaluation on word similarity datasets yields equal or better results than...

10.3115/v1/n15-1165 article EN Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2015-01-01
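The idea of encoding a knowledge base into a vector space via random walks can be sketched in two steps: emit random walks over the graph as pseudo-sentences, then feed them to any word-embedding trainer (e.g. skip-gram) so that graph neighbourhoods become distributional contexts. Here is a minimal sketch of the first step only, with an invented toy graph; it is an illustration of the general technique, not the paper's actual pipeline or parameters.

```python
import random

def random_walks(graph, walks_per_node=10, walk_len=8, seed=0):
    """Emit pseudo-sentences of node names by walking the graph.

    graph: dict node -> list of neighbour nodes.
    The resulting 'corpus' can be handed to an off-the-shelf
    embedding trainer, so that nodes sharing walk contexts end up
    with nearby vectors.
    """
    rng = random.Random(seed)
    corpus = []
    for start in graph:
        for _ in range(walks_per_node):
            node, walk = start, [start]
            for _ in range(walk_len - 1):
                neighbours = graph[node]
                if not neighbours:
                    break  # dead end: stop this walk early
                node = rng.choice(neighbours)
                walk.append(node)
            corpus.append(walk)
    return corpus

# Hypothetical miniature taxonomy fragment.
graph = {
    "dog":    ["canine", "pet"],
    "canine": ["dog", "mammal"],
    "pet":    ["dog", "cat"],
    "cat":    ["pet", "mammal"],
    "mammal": ["canine", "cat"],
}
corpus = random_walks(graph)
# 5 nodes x 10 walks each = 50 pseudo-sentences of up to 8 nodes.
```

Real knowledge bases need many more and longer walks, and the embedding step dominates the cost, but the walk-generation loop is this simple.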

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. So as to answer this question, we experiment with...

10.18653/v1/p19-1492 preprint EN cc-by 2019-01-01

This paper explores the use of two graph algorithms for unsupervised induction and tagging of nominal word senses based on corpora. Our main contribution is the optimization of the free parameters of those algorithms and their evaluation against publicly available gold standards. We present a thorough evaluation comprising supervised and unsupervised modes, and both lexical-sample and all-words tasks. The results show that, in spite of the information loss inherent to mapping the induced senses to the gold-standard, the optimization over a small sample of nouns carries over to all nouns, performing close to supervised systems...

10.3115/1610075.1610157 article EN 2006-01-01

Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This article presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus, which is represented as a graph. A state-of-the-art knowledge-based WSD algorithm, Personalized PageRank, is used to perform WSD. When evaluated on NLM-WSD...

10.1093/bioinformatics/btq555 article EN Bioinformatics 2010-10-07

Text and Knowledge Bases are complementary sources of information. Given the success of distributed word representations learned from text, several techniques to infuse additional information from resources like WordNet into word representations have been proposed. In this paper, we follow an alternative route. We learn representations from text and WordNet independently, and then explore simple and sophisticated methods to combine them. The combined representations are applied to an extensive set of datasets on word similarity and relatedness. Simple combination methods happen to perform better than more complex methods like CCA or...

10.1609/aaai.v30i1.10321 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2016-03-05
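The simplest of the combination strategies compared above is plain concatenation: for each word present in both spaces, glue its text-based vector and its KB-based vector together. A minimal sketch, with invented toy vectors and dimensions (this is the generic technique, not the paper's exact configuration):

```python
def combine(text_vecs, kb_vecs):
    """Concatenate text-based and KB-based vectors.

    Only words present in both vocabularies get a combined vector;
    list '+' performs the concatenation.
    """
    combined = {}
    for word in text_vecs.keys() & kb_vecs.keys():
        combined[word] = text_vecs[word] + kb_vecs[word]
    return combined

# Hypothetical 2-d text embeddings and 3-d KB embeddings.
text_vecs = {"dog": [0.1, 0.2], "cat": [0.3, 0.1], "car": [0.9, 0.0]}
kb_vecs   = {"dog": [1.0, 0.0, 0.5], "cat": [0.9, 0.1, 0.4]}
combined = combine(text_vecs, kb_vecs)
# "car" is absent from the KB space, so only "dog" and "cat" survive,
# each with a 2+3 = 5-dimensional combined vector.
```

In practice each sub-vector is usually length-normalized before concatenation so that neither source dominates similarity computations; the combining step itself stays this simple.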

The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and is multi-domain. In addition, we introduce a more realistic retrieval...

10.18653/v1/2020.acl-main.652 article EN cc-by 2020-01-01

In Industry 5.0, human workers and their wellbeing are placed at the centre of the production process. In this context, task-oriented dialogue systems allow workers to delegate simple tasks to industrial assets while working on other, more complex ones. The possibility of naturally interacting with these systems reduces the cognitive demand of using them and triggers their acceptation. Most modern solutions, however, do not allow for natural communication, and the techniques to obtain such systems require large amounts of data to be trained, which is scarce in these scenarios....

10.3390/app12031192 article EN cc-by Applied Sciences 2022-01-24

Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and to support the development and evaluation of models capable of understanding and generating code-switched language for this pair are almost non-existent. We introduce a first approach to develop a naturally...

10.48550/arxiv.2502.03188 preprint EN arXiv (Cornell University) 2025-02-05

Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, resulting in a parallel corpus that can be used to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation,...

10.48550/arxiv.2502.12924 preprint EN arXiv (Cornell University) 2025-02-18

In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain-specific Word Sense Disambiguation (WSD). In order to learn a model for each domain, we first extract relevant terms from the texts in that domain, and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check it against the domain models, choose the most appropriate model for that text, and use the best-matching model to perform WSD. Our results show considerable improvements on the categorization and WSD tasks.

10.1145/2063576.2063955 article EN 2011-10-24