Gorka Labaka

ORCID: 0000-0003-4611-2502
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Text Readability and Simplification
  • Speech and dialogue systems
  • Spanish Linguistics and Language Studies
  • Basque language and culture studies
  • Wikis in Education and Collaboration
  • Multimodal Machine Learning Applications
  • Translation Studies and Practices
  • Semantic Web and Ontologies
  • Software Engineering Research
  • Biomedical Text Mining and Ontologies
  • Hate Speech and Cyberbullying Detection
  • Speech Recognition and Synthesis
  • Algorithms and Data Compression
  • Linguistic Studies and Language Acquisition
  • Discourse Analysis in Language Studies
  • Linguistics and Discourse Analysis
  • Text and Document Classification Technologies
  • Reproductive Health and Technologies
  • Language, Metaphor, and Cognition
  • Hearing Impairment and Communication
  • Authorship Attribution and Profiling
  • Reproductive Biology and Fertility
  • Interpreting and Communication in Healthcare

University of the Basque Country
2014-2024

Google (United States)
2020

Universitat de Barcelona
2017

In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely remove the need of parallel data and propose a novel method to train an NMT system in a fully unsupervised manner, relying on nothing but monolingual corpora. Our...

10.48550/arxiv.1710.11041 preprint EN other-oa arXiv (Cornell University) 2017-01-01

Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning...

10.18653/v1/p18-1073 preprint EN cc-by Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018-01-01

Most methods to learn bilingual word embeddings rely on large parallel corpora, which is difficult to obtain for most language pairs. This has motivated an active research line to relax this requirement, with methods that use document-aligned corpora or bilingual dictionaries of a few thousand words instead. In this work, we further reduce the need of bilingual resources using a very simple self-learning approach that can be combined with any dictionary-based mapping technique. Our method exploits the structural similarity of embedding spaces, and...

10.18653/v1/p17-1042 article EN cc-by Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2017-01-01
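The self-learning loop described in the abstract above can be sketched as follows. This is a minimal illustration on toy data where the target space is an exact rotation of the source; real embedding spaces are only approximately isomorphic, so practical implementations add robustness tricks (frequency cutoffs, better retrieval criteria) not shown here:

```python
import numpy as np

def procrustes(X, Z):
    # closed-form orthogonal map W minimizing ||XW - Z||_F
    # for dictionary-aligned rows of X (source) and Z (target)
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def self_learning(src, trg, seed_pairs, iters=5):
    """Iteratively grow a bilingual dictionary from a tiny seed:
    fit a mapping on the current dictionary, then re-induce the
    dictionary by nearest-neighbor retrieval in the mapped space."""
    pairs = list(seed_pairs)
    for _ in range(iters):
        s = [i for i, _ in pairs]
        t = [j for _, j in pairs]
        W = procrustes(src[s], trg[t])            # 1. fit mapping on dictionary
        sims = (src @ W) @ trg.T                  # 2. cosine sims (rows unit-norm)
        pairs = [(i, int(np.argmax(sims[i])))     # 3. re-induce dictionary
                 for i in range(len(src))]
    return W, pairs

rng = np.random.default_rng(0)
src = rng.standard_normal((50, 4))
src /= np.linalg.norm(src, axis=1, keepdims=True)   # unit-normalize rows
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))    # hidden "true" rotation
trg = src @ Q                                       # toy target: rotated source
W, pairs = self_learning(src, trg, seed_pairs=[(i, i) for i in range(5)])
print(sum(i == j for i, j in pairs) / len(pairs))   # induced dictionary accuracy
```

On this idealized data the loop recovers the hidden rotation and a perfect dictionary from just five seed pairs, which is the intuition behind starting from a small dictionary of numerals or even an automatically induced seed.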

Mapping word embeddings of different languages into a single space has multiple applications. In order to map from a source space into a target space, a common approach is to learn a linear mapping that minimizes the distances between equivalences listed in a bilingual dictionary. In this paper, we propose a framework that generalizes previous work, provides an efficient exact method to learn the optimal transformation, and yields the best results in translation induction while preserving monolingual performance in the analogy task.

10.18653/v1/d16-1250 article EN cc-by Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing 2016-01-01
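For constrained linear mappings of this kind, the optimal orthogonal transformation has an exact closed-form solution via SVD (orthogonal Procrustes). A minimal sketch, on made-up noise-free toy embeddings rather than real dictionary data:

```python
import numpy as np

def optimal_orthogonal_map(X, Z):
    """Exact solution of min_W ||XW - Z||_F subject to W orthogonal,
    where row i of X (source) translates to row i of Z (target)."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))                 # toy source embeddings
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # hidden true rotation
Z = X @ Q                                         # toy target embeddings
W = optimal_orthogonal_map(X, Z)
print(np.allclose(W, Q), np.allclose(W.T @ W, np.eye(4)))  # True True
```

The orthogonality constraint is what preserves monolingual performance: distances and angles within each space are unchanged by the mapping.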

While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method...

10.18653/v1/d18-1399 preprint EN cc-by Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018-01-01

Using a dictionary to map independently trained word embeddings to a shared space has shown to be an effective approach to learn bilingual word embeddings. In this work, we propose a multi-step framework of linear transformations that generalizes a substantial body of previous work. The core step of the framework is an orthogonal transformation, and existing methods can be explained in terms of additional normalization, whitening, re-weighting, de-whitening and dimensionality reduction steps. This allows us to gain new insights into the behavior...

10.1609/aaai.v32i1.11992 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2018-04-27
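A rough illustration of such a multi-step pipeline (length normalization, then whitening, then orthogonal mapping of both spaces into a shared space) on toy data where the target space is an exact rotation of the source. The re-weighting, de-whitening, and dimensionality-reduction steps of the full framework are omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))            # source embeddings (dict-aligned rows)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
Z = X @ Q                                    # toy target embeddings: rotated source

def length_normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def whitening(M):
    # decorrelating transform: M @ W has identity covariance
    vals, vecs = np.linalg.eigh(M.T @ M / len(M))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# chain the steps: normalize -> whiten -> orthogonal maps to a shared space
Xn, Zn = length_normalize(X), length_normalize(Z)
Wx, Wz = whitening(Xn), whitening(Zn)
U, s, Vt = np.linalg.svd((Xn @ Wx).T @ (Zn @ Wz))
W1, W2 = U, Vt.T                             # orthogonal map for each side
src_mapped = Xn @ Wx @ W1
trg_mapped = Zn @ Wz @ W2
print(np.allclose(src_mapped, trg_mapped, atol=1e-6))  # both sides coincide
```

Composing optional steps like these is exactly what lets one framework reproduce several previously proposed mapping methods as special cases.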

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well-founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved system to initialize...

10.18653/v1/p19-1019 preprint EN 2019-01-01

Relation extraction systems require large amounts of labeled examples, which are costly to annotate. In this work we reformulate relation extraction as an entailment task, with simple, hand-made verbalizations of relations produced in less than 15 minutes per relation. The system relies on a pretrained textual entailment engine which is run as-is (no training examples, zero-shot) or further fine-tuned (few-shot or fully trained). In our experiments on TACRED we attain 63% F1 zero-shot and 69% with 16 examples per relation (17 F1 points better than the best supervised system with the same...

10.18653/v1/2021.emnlp-main.92 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01
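The recasting of relation extraction as entailment can be sketched as follows. The `nli_score` function here is a toy keyword-overlap stand-in for a pretrained entailment model, and the templates and relation labels are illustrative, not the paper's actual TACRED verbalizations:

```python
def verbalize(relation, subj, obj):
    # hand-written templates, one per candidate relation (illustrative only)
    templates = {
        "per:city_of_birth": "{subj} was born in {obj}.",
        "per:employee_of":   "{subj} is an employee of {obj}.",
        "no_relation":       "{subj} and {obj} are unrelated.",
    }
    return templates[relation].format(subj=subj, obj=obj)

def nli_score(premise, hypothesis):
    # TOY stand-in for an NLI model: fraction of hypothesis words
    # that also appear in the premise
    p = set(premise.lower().replace(".", "").split())
    h = hypothesis.lower().replace(".", "").split()
    return sum(w in p for w in h) / len(h)

def extract(premise, subj, obj, relations):
    # zero-shot: pick the relation whose verbalization is most "entailed"
    scored = {r: nli_score(premise, verbalize(r, subj, obj)) for r in relations}
    return max(scored, key=scored.get)

premise = "Mikel was born in Bilbao and works for a local newspaper."
pred = extract(premise, "Mikel", "Bilbao",
               ["per:city_of_birth", "per:employee_of", "no_relation"])
print(pred)  # per:city_of_birth
```

Swapping the toy scorer for a real entailment model gives the zero-shot system; fine-tuning that model on a handful of verbalized examples gives the few-shot variant.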

Sign Languages (SLs) are employed by deaf and hard-of-hearing (DHH) people to communicate on a daily basis. However, communication with hearing people still faces some barriers, mainly because of the scarce knowledge about SLs among hearing people. Hence, tools that allow communication between users of either sign or spoken languages must be encouraged. A stepping stone in this direction is research on the sign language translation (SLT) task, which aims to produce a spoken language translation from a sign language video or vice versa. By implementing these types of translators in portable devices, we...

10.1016/j.eswa.2022.118993 article EN cc-by Expert Systems with Applications 2022-10-13

Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such a translation process can introduce subtle artifacts with a notable impact on existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, which current models are...

10.18653/v1/2020.emnlp-main.618 preprint EN cc-by 2020-01-01

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. So as to answer this question, we experiment with...

10.18653/v1/p19-1492 preprint EN cc-by 2019-01-01

We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world's languages. However, we argue that a scenario without any parallel data but abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation...

10.18653/v1/2020.acl-main.658 preprint EN cc-by 2020-01-01

A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation. This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrase-table, combine it with a language...

10.18653/v1/p19-1494 preprint EN cc-by 2019-01-01
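The nearest-neighbor retrieval baseline that this paper improves on can be sketched as follows; the word lists, vectors, and identity mapping below are made up purely for illustration:

```python
import numpy as np

def induce_lexicon(src_emb, trg_emb, W, src_words, trg_words):
    """Nearest-neighbor translation induction in the mapped space.
    W maps source embeddings into the target space; rows are
    unit-normalized, so the dot product is cosine similarity."""
    mapped = src_emb @ W
    sims = mapped @ trg_emb.T
    return {src_words[i]: trg_words[int(np.argmax(sims[i]))]
            for i in range(len(src_words))}

# toy example with a known mapping (identity) and hand-built unit vectors
src_words = ["txakur", "katu"]
trg_words = ["dog", "cat", "house"]
src_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
trg_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
trg_emb /= np.linalg.norm(trg_emb, axis=1, keepdims=True)
lexicon = induce_lexicon(src_emb, trg_emb, np.eye(2), src_words, trg_words)
print(lexicon)  # {'txakur': 'dog', 'katu': 'cat'}
```

The paper's alternative replaces this direct retrieval step with a phrase table and language model, i.e., a full unsupervised translation pipeline, before reading translation pairs off the resulting system.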

Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, with the resulting parallel corpus used to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation,...

10.48550/arxiv.2502.12924 preprint EN arXiv (Cornell University) 2025-02-18

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than what is directly apparent. A linear transformation that adjusts the similarity order of the model without any external resource can tailor it to achieve better results in those aspects,...

10.18653/v1/k18-1028 preprint EN cc-by 2018-01-01

To analyze techniques for machine translation of electronic health records (EHRs) between long-distance languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts in Basque or Spanish. We trained recurrent neural networks on an out-of-domain corpus with different hyperparameter values. Subsequently, we used the optimal configuration to evaluate translation of EHR templates into Spanish, with manual translations into standard Spanish as the reference. We successively...

10.1093/jamia/ocz110 article EN Journal of the American Medical Informatics Association 2019-05-31