- Natural Language Processing Techniques
- Topic Modeling
- Text Readability and Simplification
- Speech Recognition and Synthesis
- Multimodal Machine Learning Applications
- Speech and Dialogue Systems
- Semantic Web and Ontologies
- Algorithms and Data Compression
- Translation Studies and Practices
- Hate Speech and Cyberbullying Detection
- Music and Audio Processing
- Biomedical Text Mining and Ontologies
- Software Engineering Research
- Explainable Artificial Intelligence (XAI)
- Web Data Mining and Analysis
- Subtitles and Audiovisual Media
- Linguistics and Terminology Studies
- Authorship Attribution and Profiling
- Adversarial Robustness in Machine Learning
- Wikis in Education and Collaboration
- Handwritten Text Recognition Techniques
- Gender Studies in Language
- Advanced Text Analysis Techniques
- Text and Document Classification Technologies
- Spanish Linguistics and Language Studies
Universitat Politècnica de Catalunya
2014-2023
University of the Basque Country
2021
Apple (Germany)
2020
National Student Clearinghouse Research Center
2005-2019
Uppsala University
2019
Google (United States)
2019
Tokyo Metropolitan University
2019
University of Michigan–Ann Arbor
2019
Stanford University
2019
Hamad bin Khalifa University
2016
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, Marcos Zampieri. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing...
Neural Machine Translation (MT) has reached state-of-the-art results. However, one of the main challenges that neural MT still faces is dealing with very large vocabularies and morphologically rich languages. In this paper, we propose a neural MT system using character-based embeddings in combination with convolutional and highway layers to replace the standard lookup-based word representations. The resulting unlimited-vocabulary and affix-aware source word embeddings are tested in a neural MT system based on an attention-based bidirectional recurrent...
Gender bias is highly impacting natural language processing applications. Word embeddings have clearly been proven both to keep and amplify gender biases that are present in current data sources. Recently, contextualized word embeddings have enhanced previous word embedding techniques by computing word vector representations dependent on the sentence they appear in. In this paper, we study the impact of this conceptual change in the word embedding computation in relation with gender bias. Our analysis includes different measures previously applied...
Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems composed of multiple subsystems performing translation in a cascaded fashion exist [1–3], scalable and high-performing unified systems [4,5] remain underexplored. To address this gap, here we introduce SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a single model that supports...
This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which happens to be in the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).
Neural machine translation has significantly pushed forward the quality of the field. However, there are remaining big issues with the output translations and one of them is fairness. Neural models are trained on large text corpora which contain biases and stereotypes. As a consequence, models inherit these social biases. Recent methods have shown results in reducing gender bias in other natural language processing tools such as word embeddings. We take advantage of the fact that word embeddings are used in neural machine translation to propose a method to equalize...
The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality. However, scaling quality NMT requires large volumes of parallel bilingual data, which are not equally available for the 7,000+ languages in the world [1]. Focusing on improving the translation qualities of a relatively small group...
This survey on hybrid machine translation (MT) is motivated by the fact that hybridization techniques have become popular as they attempt to combine the best characteristics of highly advanced pure rule or corpus-based MT approaches. Existing research typically covers either simple or more complex architectures guided by either rule or corpus-based approaches. The goal is to combine the best properties of each type. This survey provides a detailed overview of the modification of the standard rule-based architecture to include statistical knowledge, the introduction of rules in corpus-based approaches, and...
Continual learning (CL) aims to enable information systems to learn from a continuous data stream across time. However, it is difficult for existing deep learning architectures to learn a new task without largely forgetting previously acquired knowledge. Furthermore, CL is particularly challenging for language learning, as natural language is ambiguous: it is discrete, compositional, and its meaning is context-dependent. In this work, we look at the problem of CL through the lens of various NLP tasks. Our survey discusses major challenges in CL and current...
Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audios, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented...
While the problem of hallucinations in neural machine translation has long been recognized, so far the progress on its alleviation is very little. Indeed, recently it turned out that without artificially encouraging models to hallucinate, previously existing methods fall short and even the standard sequence log-probability is more informative. It means that internal characteristics of the model can give much more information than we expect, and before using external models and measures, we first need to ask: how far can we go if we use nothing but...
Carlos Escolano, Marta R. Costa-jussà, José A. Fonollosa, Mikel Artetxe. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.
Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language...
This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first, each of these source languages being represented among the 23 languages commonly used by half of the world's population and therefore having the potential to serve as pivot languages that will enable more accurate translations. The dataset is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features. In...
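This paper presents experiments comparing character-based and byte-based neural machine translation systems. The main motivation of the byte-based neural machine translation system is to build multi-lingual neural machine translation systems that can share the same vocabulary. We compare the performance of both systems in several language pairs and we see that the performance in test is similar for most language pairs while the training time is slightly reduced in the case of byte-based neural machine translation.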
Recently, multilingual question answering became a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then used this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluated...