NFDI4DS | UHH-SEMS - Publication Details

Marta R. Costa‐jussà

ORCID: 0000-0002-5703-520X

About

Contact & Profiles

A5074210163

Research Areas

Natural Language Processing Techniques
Topic Modeling
Text Readability and Simplification
Speech Recognition and Synthesis
Multimodal Machine Learning Applications
Speech and dialogue systems
Semantic Web and Ontologies
Algorithms and Data Compression
Translation Studies and Practices
Hate Speech and Cyberbullying Detection
Music and Audio Processing
Biomedical Text Mining and Ontologies
Software Engineering Research
Explainable Artificial Intelligence (XAI)
Web Data Mining and Analysis
Subtitles and Audiovisual Media
linguistics and terminology studies
Authorship Attribution and Profiling
Adversarial Robustness in Machine Learning
Wikis in Education and Collaboration
Handwritten Text Recognition Techniques
Gender Studies in Language
Advanced Text Analysis Techniques
Text and Document Classification Technologies
Spanish Linguistics and Language Studies

Universitat Politècnica de Catalunya
2014-2023

University of the Basque Country
2021

Apple (Germany)
2020

National Student Clearinghouse Research Center
2005-2019

Uppsala University
2019

Google (United States)
2019

Tokyo Metropolitan University
2019

University of Michigan–Ann Arbor
2019

Stanford University
2019

Hamad bin Khalifa University
2016

Findings of the 2019 Conference on Machine Translation (WMT19)

OPENALEX - Publications

Loïc Barrault Ondřej Bojar Marta R. Costa‐jussà Christian Federmann Mark Fishel and 10 more

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, Marcos Zampieri. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.

10.18653/v1/w19-5301 article EN cc-by 2019-01-01

No Language Left Behind: Scaling Human-Centered Machine Translation

NLLB Team Marta R. Costa‐jussà James H. Cross Onur Çelebi Maha Elbayad and 34 more

Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing...

10.48550/arxiv.2207.04672 preprint EN cc-by-sa arXiv (Cornell University) 2022-01-01

Character-based Neural Machine Translation

Marta R. Costa‐jussà José A. R. Fonollosa

Neural Machine Translation (MT) has reached state-of-the-art results. However, one of the main challenges that neural MT still faces is dealing with very large vocabularies and morphologically rich languages. In this paper, we propose a neural MT system using character-based embeddings in combination with convolutional and highway layers to replace the standard lookup-based word representations. The resulting unlimited-vocabulary and affix-aware source word embeddings are tested in a state-of-the-art neural MT based on an attention-based bidirectional recurrent...

10.18653/v1/p16-2058 article EN cc-by 2016-01-01

Evaluating the Underlying Gender Bias in Contextualized Word Embeddings

Christine Basta Marta R. Costa‐jussà Noé Casas

Gender bias is highly impacting natural language processing applications. Word embeddings have clearly been proven both to keep and amplify gender biases that are present in current data sources. Recently, contextualized word embeddings have enhanced previous word embedding techniques by computing word vector representations dependent on the sentence they appear in. In this paper, we study the impact of this conceptual change in the word embedding computation in relation with gender bias. Our analysis includes different measures previously applied...

10.18653/v1/w19-3805 preprint EN cc-by 2019-01-01

Joint speech and text machine translation for up to 100 languages

Loïc Barrault Yu-An Chung Mariano Coria Meglioli David Dale Ning Dong and 62 more

Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems composed of multiple subsystems performing translation in a cascaded fashion exist [1–3], scalable and high-performing unified systems [4,5] remain underexplored. To address this gap, here we introduce SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a single model that supports...

10.1038/s41586-024-08359-z article EN cc-by-nc-nd Nature 2025-01-15

N-gram-based Machine Translation

José Bernardo Mariño Acebal Rafael E. Banchs Josep Crego Adrià de Gispert Patrik Lambert and 2 more

This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which happens to be in the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).

10.1162/coli.2006.32.4.527 article EN Computational Linguistics 2006-11-21

Equalizing Gender Bias in Neural Machine Translation with Word Embeddings Techniques

Joel Escudé Font Marta R. Costa‐jussà

Neural machine translation has significantly pushed forward the quality of the field. However, there are remaining big issues with the output translations and one of them is fairness. Neural models are trained on large text corpora which contain biases and stereotypes. As a consequence, models inherit these social biases. Recent methods have shown results in reducing gender bias in other natural language processing tools such as word embeddings. We take advantage of the fact that word embeddings are used in neural machine translation to propose a method to equalize...

10.18653/v1/w19-3821 article EN cc-by 2019-01-01

Scaling neural machine translation to 200 languages

Marta R. Costa‐jussà James H. Cross Onur Çelebi Maha Elbayad Kenneth Heafield and 33 more

The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality. However, scaling quality NMT requires large volumes of parallel bilingual data, which are not equally available for the 7,000+ languages in the world [1]. Focusing on improving the translation qualities of a relatively small group...

10.1038/s41586-024-07335-x article EN cc-by Nature 2024-06-05

Latest trends in hybrid machine translation and its applications

Marta R. Costa‐jussà José A. R. Fonollosa

This survey on hybrid machine translation (MT) is motivated by the fact that hybridization techniques have become popular as they attempt to combine the best characteristics of highly advanced pure rule or corpus-based MT approaches. Existing research typically covers either simple or more complex architectures guided by either rule or corpus-based approaches. The goal is to combine the best properties of each type. This survey provides a detailed overview of the modification of the standard rule-based architecture to include statistical knowledge, the introduction of rules in corpus-based approaches, and...

10.1016/j.csl.2014.11.001 article EN cc-by-nc-nd Computer Speech & Language 2014-11-17

End-to-End Speech Translation with the Transformer

Laura Cross Vila Carlos Escolano José A. R. Fonollosa Marta R. Costa‐jussà

10.21437/iberspeech.2018-13 article EN 2018-11-19

Continual Lifelong Learning in Natural Language Processing: A Survey

Magdalena Biesialska Katarzyna Biesialska Marta R. Costa‐jussà

Continual learning (CL) aims to enable information systems to learn from a continuous data stream across time. However, it is difficult for existing deep learning architectures to learn a new task without largely forgetting previously acquired knowledge. Furthermore, CL is particularly challenging for language learning, as natural language is ambiguous: it is discrete, compositional, and its meaning is context-dependent. In this work, we look at the problem of CL through the lens of various NLP tasks. Our survey discusses major challenges in CL and current...

10.18653/v1/2020.coling-main.574 article EN cc-by Proceedings of the 17th international conference on Computational linguistics - 2020-01-01

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

Ioannis Tsiamas Gerard I. Gállego José A. R. Fonollosa Marta R. Costa‐jussà

Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audios, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented...

10.21437/interspeech.2022-59 article EN Interspeech 2022 2022-09-16

Detecting and Mitigating Hallucinations in Machine Translation: Model Internal Workings Alone Do Well, Sentence Similarity Even Better

David C. Dale Elena Voita Loïc Barrault Marta R. Costa‐jussà

While the problem of hallucinations in neural machine translation has long been recognized, so far the progress on its alleviation is very little. Indeed, recently it turned out that without artificially encouraging models to hallucinate, previously existing methods fall short and even the standard sequence log-probability is more informative. It means that internal characteristics of the model can give much more information than we expect, and before using external models and measures, we first need to ask: how far can we go if we use nothing but...

10.18653/v1/2023.acl-long.3 article EN cc-by 2023-01-01

An analysis of gender bias studies in natural language processing

Marta R. Costa‐jussà

10.1038/s42256-019-0105-5 article EN Nature Machine Intelligence 2019-10-14

Equalizing Gender Biases in Neural Machine Translation with Word Embeddings Techniques

Joel Escudé Font Marta R. Costa‐jussà

10.48550/arxiv.1901.03116 preprint EN cc-by-nc-sa arXiv (Cornell University) 2019-01-01

Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders

Carlos Escolano Marta R. Costa‐jussà José A. R. Fonollosa Mikel Artetxe

Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.80 article EN cc-by 2021-01-01

Seamless: Multilingual Expressive and Streaming Speech Translation

Seamless Communication Loïc Barrault Yu-An Chung Mariano Coria Meglioli David C. Dale and 60 more

Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language...

10.48550/arxiv.2312.05187 preprint EN cc-by-sa arXiv (Cornell University) 2023-01-01

BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

The Omnilingual MT Team Pierre Andrews Mikel Artetxe Mariano Coria Meglioli Marta R. Costa‐jussà and 12 more

This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first, each of these source languages being represented among the 23 languages commonly used by half of the world's population and therefore having the potential to serve as pivot languages that will enable more accurate translations. The dataset is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features. In...

10.48550/arxiv.2502.04314 preprint EN arXiv (Cornell University) 2025-02-06

Introduction to the special issue on deep learning approaches for machine translation

Marta R. Costa‐jussà Alexandre Allauzen Loïc Barrault Kyunghyun Cho Holger Schwenk

10.1016/j.csl.2017.03.001 article EN Computer Speech & Language 2017-05-25

Byte-based Neural Machine Translation

Marta R. Costa‐jussà Carlos Escolano José A. R. Fonollosa

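This paper presents experiments comparing character-based and byte-based neural machine translation systems. The main motivation of the byte-based neural machine translation system is to build multi-lingual neural machine translation systems that can share the same vocabulary. We compare the performance of both systems in several language pairs and we see that the performance in test is similar for most language pairs while the training time is slightly reduced in the case of byte-based neural machine translation.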

10.18653/v1/w17-4123 article EN cc-by 2017-01-01

Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering

Casimiro Pio Carrino Marta R. Costa‐jussà José A. R. Fonollosa

Recently, multilingual question answering became a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then used this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluated...

10.48550/arxiv.1912.05200 preprint EN cc-by arXiv (Cornell University) 2019-01-01
