- Natural Language Processing Techniques
- Topic Modeling
- Text Readability and Simplification
- Semantic Web and Ontologies
- Multimodal Machine Learning Applications
- Hand Gesture Recognition Systems
- Biomedical Text Mining and Ontologies
- Speech and Dialogue Systems
- Translation Studies and Practices
- Speech Recognition and Synthesis
- Wikis in Education and Collaboration
- Hearing Impairment and Communication
- Human Pose and Action Recognition
- Cosmology and Gravitation Theories
- Radiomics and Machine Learning in Medical Imaging
- Spanish Linguistics and Language Studies
- Lung Cancer Treatments and Mutations
- Neural Networks and Applications
- Linguistic Studies and Language Acquisition
- Machine Learning and Data Classification
- Computational and Text Analysis Methods
- Media Influence and Politics
- Authorship Attribution and Profiling
- Discourse Analysis and Cultural Communication
- Language and Cultural Evolution
German Research Centre for Artificial Intelligence
2017-2025
Saarland University
2017-2022
Universitat Politècnica de Catalunya
2009-2019
Hamad bin Khalifa University
2017
University of Sheffield
2017
National Student Clearinghouse Research Center
2016
Qatar Foundation
2015
Chalmers University of Technology
2012-2013
Universitat de Barcelona
2003-2009
End-to-end neural machine translation has overtaken statistical machine translation in terms of quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a semantic representation of words (or sentences) which, unlike standard...
We present a simple new method where an emergent NMT system is used for simultaneously selecting training data and learning internal representations. This is done in a self-supervised way without parallel data, such that both tasks enhance each other during training. The method is language independent, introduces no additional hyper-parameters, and achieves BLEU scores of 29.21 (en2fr) and 27.36 (fr2en) on newstest2014 using English and French Wikipedia data.
Multiple approaches to grab comparable data from the Web have been developed to date. Nevertheless, coming up with a high-quality corpus on a specific topic is not straightforward. We present a model for the automatic extraction of texts in multiple languages and on specific topics from Wikipedia. In order to prove the value of the model, we automatically extract parallel sentences from the collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English-Spanish pair in domains including Computer Science and Sports show...
Mathias Müller, Malihe Alikhani, Eleftherios Avramidis, Richard Bowden, Annelies Braffort, Necati Cihan Camgöz, Sarah Ebling, Cristina España-Bonet, Anne Göhring, Roman Grundkiewicz, Mert Inan, Zifan Jiang, Oscar Koller, Amit Moryossef, Annette Rios, Dimitar Shterionov, Sandra Sidler-Miserez, Katja Tissi, Davy Van Landuyt. Proceedings of the Eighth Conference on Machine Translation. 2023.
Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in interpreting with higher frequency than in text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across translations from text and speech. We find that machine translation shows traces of translationese,...
We integrate new mechanisms in a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each translation given its context. Second, we extend the decoder with a stochastic mechanism that, at translation time, allows it to introduce changes oriented to consistency. We evaluate our system on...
This paper describes the TALP-UPC system in the Spanish-English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules progressively improve the final translation as measured by a combination of several lexical metrics.
Self-supervised neural machine translation (SSNMT) jointly learns to identify and select suitable training data from comparable (rather than parallel) corpora and to translate, in a way that the two tasks support each other in a virtuous circle. In this study, we provide an in-depth analysis of the sampling choices the SSNMT model makes during training. We show how, without it having been told to do so, the model self-selects samples of increasing (i) complexity and (ii) task-relevance in combination with (iii) performing denoising...
Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key to achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and the lack of appropriate data. Instead, most approaches fall back on computing document embeddings based on sentence representations. Although there exist architectures and models that encode documents fully, they are in general limited...
We propose a simple log-bilinear softmax-based model to deal with vocabulary expansion in machine translation. Our model uses word embeddings trained on significantly large unlabelled monolingual corpora and learns over a fairly small, word-to-word bilingual dictionary. Given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language using the embeddings. We integrate these translation options into a standard phrase-based statistical machine translation system and obtain consistent...
We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information, from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset...
Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target-language text and from translations into that language to predict relations between languages in an unsupervised way. We use different views of the data (words, parts of speech, tags and synsets) to track translationese. Our analysis shows that (i) distances...
We propose a language-independent graph-based method to build à-la-carte article collections on user-defined domains from Wikipedia. The core model is based on the exploration of the encyclopedia’s category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our model reaches an average precision of 84%...
A design for an Arabic-to-English translation system is presented. The core of the system implements a standard phrase-based statistical machine translation architecture, but it is extended by incorporating a local discriminative phrase selection model to address the semantic ambiguity of Arabic. Local classifiers are trained using linguistic information and context to translate a phrase, and this significantly increases the accuracy in phrase selection with respect to the most frequent translation traditionally considered. These classifiers are integrated into the translation system so that the global task gets...
This paper describes the UdS-DFKI submission to the WMT2019 news translation task for Gujarati–English (low-resourced pair) and German–English (document-level evaluation). Our systems rely on the on-line extraction of parallel sentences from comparable corpora for the first scenario and on the inclusion of coreference-related information in the training data for the second one.