- Natural Language Processing Techniques
- Topic Modeling
- Semantic Web and Ontologies
- Speech and dialogue systems
- Text Readability and Simplification
- Basque language and culture studies
- Wikis in Education and Collaboration
- Spanish Linguistics and Language Studies
- Biomedical Text Mining and Ontologies
- Multimodal Machine Learning Applications
- Web Data Mining and Analysis
- Advanced Text Analysis Techniques
- Advanced Database Systems and Queries
- Artificial Intelligence in Games
- AI in Service Interactions
- Advanced Image and Video Retrieval Techniques
- Data Quality and Management
- Linguistics and terminology studies
- Text and Document Classification Technologies
- Translation Studies and Practices
- Video Analysis and Summarization
- Service-Oriented Architecture and Web Services
- Expert finding and Q&A systems
- 3D Surveying and Cultural Heritage
- Robotics and Automated Systems
University of the Basque Country
2014-2023
Association of Electronic and Information Technologies
2021
Basque Center on Cognition, Brain and Language
2021
Wageningen University & Research
2021
Bocconi University
2021
Yangon University of Distance Education
2014
National University of Distance Education
2014
University of Edinburgh
2012
Ikerbasque
2012
University of Sheffield
2010
This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provides the best results in its class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...
In this paper we propose a new graph-based method that uses the knowledge in an LKB (based on WordNet) in order to perform unsupervised Word Sense Disambiguation. Our algorithm uses the full graph of the LKB efficiently, performing better than previous approaches on English all-words datasets. We also show that the algorithm can be easily ported to other languages with good results, the only requirement being a wordnet. In addition, we make an analysis of the performance of the algorithm, showing that it is efficient and that it could be tuned to be faster.
Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a word in context. In this article we present a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm and LKB combination compares favorably to other knowledge-based approaches in the literature that use similar knowledge on a variety of English data sets and a data set on Spanish. We include a detailed analysis...
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration...
The goal of this task is to allow for comparison across sense-induction and discrimination systems, and also to compare these systems to other supervised and knowledge-based systems. In total there were 6 participating systems. We reused the SemEval-2007 English lexical sample subtask of task 17, and set up both a clustering-style unsupervised evaluation (using OntoNotes senses as gold-standard) and a supervised evaluation (using part of the dataset for mapping). We provide a comparison to the results of the systems participating in the lexical sample subtask of task 17.
Computing semantic relatedness of natural language texts is a key component of tasks such as information retrieval and summarization, and often depends on knowledge of a broad range of real-world concepts and relationships. We address this knowledge integration issue by computing semantic relatedness using personalized PageRank (random walks) on a graph derived from Wikipedia. This paper evaluates methods for building the graph, including link selection strategies, and two methods for representing input texts as distributions over the graph nodes: one based on a dictionary lookup,...
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires...
Random walks over large knowledge bases like WordNet have been successfully used in word similarity, relatedness and disambiguation tasks. Unfortunately, those algorithms are relatively slow for large repositories, with significant memory footprints. In this paper we present a novel algorithm which encodes the structure of a knowledge base in a continuous vector space, combining random walks and neural net language models in order to produce new word representations. Evaluation on similarity datasets yields equal or better results than...
Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. So as to answer this question, we experiment with...
This paper explores the use of two graph algorithms for unsupervised induction and tagging of nominal word senses based on corpora. Our main contribution is the optimization of the free parameters of those algorithms and its evaluation against publicly available gold standards. We present a thorough evaluation comprising supervised and unsupervised modes, and both lexical-sample and all-words tasks. The results show that, in spite of the information loss inherent to mapping the induced senses to the gold-standard, the optimization of parameters based on a small sample of nouns carries over to all nouns, performing close to supervised systems...
Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This article presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus, which is represented as a graph. A state-of-the-art algorithm, Personalized PageRank, is used to perform WSD. When evaluated on the NLM-WSD...
Text and Knowledge Bases are complementary sources of information. Given the success of distributed word representations learned from text, several techniques to infuse additional information from sources like WordNet into word representations have been proposed. In this paper, we follow an alternative route. We learn word representations from text and WordNet independently, and then explore simple and sophisticated methods to combine them. The combined representations are applied to an extensive set of datasets on word similarity and relatedness. Simple combination methods happen to perform better than more complex methods like CCA or...
The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and is multi-domain. In addition, we introduce a more realistic information retrieval...
In Industry 5.0, human workers and their wellbeing are placed at the centre of the production process. In this context, task-oriented dialogue systems allow workers to delegate simple tasks to industrial assets while working on other, more complex ones. The possibility of naturally interacting with these systems reduces the cognitive demand to use them and triggers acceptance. Most modern solutions, however, do not allow natural communication, and modern techniques to obtain such systems require large amounts of data to be trained, which is scarce in these scenarios....
Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce the first approach to develop a naturally...
Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation,...
In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain Word Sense Disambiguation (WSD). In order to learn a semantic model for each domain, we first extract relevant terms from the texts in the domain and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check the semantic models, choose the appropriate domain for that text and use the best-matching model to perform WSD. Our results show considerable improvements on text categorization and domain WSD tasks.
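
Several of the abstracts above rely on a random walk that is initialized from a set of terms and run over a knowledge graph such as WordNet. The following is only a minimal sketch of that general idea (Personalized PageRank by power iteration), not the authors' implementation. The toy graph, the node names, the function name `personalized_pagerank`, and the damping and iteration values are all assumptions made for illustration; a real system would use the full WordNet graph.

```python
from collections import defaultdict

def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank whose restart mass sits on `seeds`."""
    # Build an undirected adjacency list from the edge pairs.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = sorted(neighbours)

    # Restart distribution: uniform over the seed nodes found in the graph.
    seeds_in_graph = [s for s in seeds if s in neighbours]
    if not seeds_in_graph:
        raise ValueError("none of the seed terms appear in the graph")
    restart = {n: 0.0 for n in nodes}
    for s in seeds_in_graph:
        restart[s] += 1.0 / len(seeds_in_graph)

    rank = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps back to a seed.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            share = damping * rank[n] / len(neighbours[n])
            for m in neighbours[n]:
                new_rank[m] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: two senses of "bank" and a few related concepts.
EDGES = [
    ("bank#finance", "money"), ("bank#finance", "loan"), ("money", "loan"),
    ("bank#river", "water"), ("bank#river", "shore"), ("water", "shore"),
    ("bank#finance", "deposit"), ("bank#river", "deposit"),
]

# Context terms act as seeds; the highest-ranked candidate sense is chosen.
ranks = personalized_pagerank(EDGES, seeds=["money", "loan"])
best = max(["bank#finance", "bank#river"], key=ranks.get)
print(best, round(ranks[best], 3))
```

Concentrating the restart mass on the context terms is what biases the walk towards senses that are close to the context in the graph. In this sketch the seeds are plain concept nodes, which is a simplification of how context words would be linked to their senses in a lexical knowledge base.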