- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Advanced Image and Video Retrieval Techniques
- Text Readability and Simplification
- Speech and Dialogue Systems
- Human Pose and Action Recognition
- Semantic Web and Ontologies
- Software Engineering Research
- Sentiment Analysis and Opinion Mining
- Advanced Text Analysis Techniques
- Domain Adaptation and Few-Shot Learning
- Biomedical Text Mining and Ontologies
- Computational and Text Analysis Methods
- Translation Studies and Practices
- Mobile Crowdsensing and Crowdsourcing
- Lexicography and Language Studies
- Explainable Artificial Intelligence (XAI)
- AI in Service Interactions
- Algorithms and Data Compression
- Cognitive Science and Mapping
- Authorship Attribution and Profiling
- Mathematics, Computing, and Information Processing
- Logic, Programming, and Type Systems
Trinity College Dublin
2015-2024
Dublin City University
2007-2021
University of Sheffield
2017-2021
University of Amsterdam
2016-2021
Bar-Ilan University
2021
University of Helsinki
2021
Tel Aviv University
2021
Technical University of Darmstadt
2021
University of Copenhagen
2021
Edinburgh Napier University
2021
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Christof Monz. Proceedings of the Third Conference on Machine Translation: Shared Task Papers. 2018.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, Marcos Zampieri. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. 2016.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, Marcos Zampieri. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, Marco Turchi. Proceedings of the Second Conference on Machine Translation. 2017.
This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less "metrics" and constitute submissions to the joint task with the Quality Estimation Task, "QE as a Metric". In addition, we computed 11 baseline metrics: 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, chrF) and 3 reimplementations (chrF+,...
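The baseline metrics listed above can all be computed from plain-text hypotheses and references; below is a minimal sketch using the sacrebleu package for two of them (BLEU and chrF). It assumes sacrebleu 2.x and toy data, and is not the shared task's own baseline implementation.

```python
# Minimal sketch: computing two of the baseline metrics (BLEU, chrF) with sacrebleu 2.x.
# The shared task's own baseline implementations may differ; this is illustrative only.
import sacrebleu

hypotheses = ["The cat sat on the mat .", "A quick brown fox ."]   # system outputs (toy data)
references = ["The cat sat on the mat .", "The quick brown fox ."]  # one reference per segment

bleu = sacrebleu.corpus_bleu(hypotheses, [references])   # corpus-level BLEU
chrf = sacrebleu.corpus_chrf(hypotheses, [references])   # corpus-level chrF

print(f"BLEU = {bleu.score:.2f}")
print(f"chrF = {chrf.score:.2f}")
```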
Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method of filtering is via agreement with experts, but even amongst experts, agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own assessment strategy....
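One way to realise this kind of quality filtering is to have each worker also rate artificially degraded copies of some translations and keep only workers who rate the degraded copies significantly lower. The sketch below illustrates that general idea with a paired t-test; it is my own simplification, not necessarily the exact filter used in the paper.

```python
# Sketch of crowd quality control: a worker also rates degraded copies of some items;
# keep the worker only if original items score significantly higher than degraded ones.
# This illustrates the general idea of filtering unreliable workers; the exact
# procedure in the paper may differ.
from scipy.stats import ttest_rel

def worker_passes(original_scores, degraded_scores, alpha=0.05):
    """original_scores[i] and degraded_scores[i] are the worker's ratings of the
    same translation before and after artificial degradation."""
    t_stat, p_two_sided = ttest_rel(original_scores, degraded_scores)
    # One-sided test: originals should be rated higher than degraded copies.
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha

# Toy example: 0-100 adequacy ratings for ten repeated items.
orig = [78, 65, 90, 72, 88, 61, 70, 84, 77, 69]
degr = [40, 35, 55, 48, 60, 30, 42, 50, 45, 38]
print(worker_passes(orig, degr))  # True: this worker rates degraded output lower
```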
We provide an analysis of current evaluation methodologies applied to summarization metrics and identify the following areas of concern: (1) movement away from evaluation by correlation with human assessment; (2) omission of important components of human assessment from evaluations, in addition to large numbers of metric variants; (3) absence of methods of significance testing of improvements over a baseline. We outline a methodology that overcomes all such challenges, providing the first method of significance testing suitable for summarization metrics. Our evaluation reveals for the first time which...
This paper presents the results of the WMT17 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the news translation task and the neural MT training task. We collected scores of 14 metrics from 8 research groups. In addition to that, we computed 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The metrics were evaluated in terms of system-level correlation (how well each metric's scores correlate with the official manual ranking of systems) and segment-level correlation (how often a metric agrees with humans...
This paper presents the results of the WMT16 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the Translation Task. We collected scores of 16 metrics from 9 research groups. In addition to that, we computed standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The metrics were evaluated in terms of system-level correlation (how well each metric's scores correlate with the official manual ranking of systems) and segment-level correlation (how often a metric agrees with humans in comparing two translations...
This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the News Translation Task with automatic metrics. We collected scores of 10 metrics from 8 research groups. In addition to that, we computed standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER and CDER) as baselines. The metrics were evaluated in terms of system-level correlation (how well each metric's scores correlate with the official manual ranking of systems) and segment-level correlation (how often a metric agrees with humans...
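The two evaluation criteria recurring across these task reports (system-level and segment-level correlation) can be illustrated with scipy. The sketch below uses toy numbers and generic correlation statistics (Pearson and Kendall's tau); it is not the tasks' official scoring code.

```python
# Sketch of the two correlation criteria used to evaluate MT metrics.
# System level: Pearson correlation between a metric's system scores and human scores.
# Segment level: how often the metric agrees with humans when comparing translations,
# summarised here with Kendall's tau over per-segment scores.
from scipy.stats import pearsonr, kendalltau

# Toy data: one score per MT system (system level) ...
human_sys  = [68.2, 64.5, 59.1, 71.0]
metric_sys = [0.62, 0.58, 0.55, 0.66]
r, _ = pearsonr(metric_sys, human_sys)
print(f"system-level Pearson r = {r:.3f}")

# ... and one score per segment for a single system (segment level).
human_seg  = [80, 55, 90, 40, 65, 75]
metric_seg = [0.7, 0.5, 0.9, 0.3, 0.6, 0.8]
tau, _ = kendalltau(metric_seg, human_seg)
print(f"segment-level Kendall tau = {tau:.3f}")
```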
Evaluation of segment-level machine translation metrics is currently hampered by: (1) low inter-annotator agreement levels in human assessments; (2) lack of an effective mechanism for evaluation of translations of equal quality; and (3) lack of methods of significance testing of improvements over a baseline. In this paper, we provide solutions to each of these challenges and outline a new methodology aimed specifically at assessment of segment-level metrics. We replicate the human evaluation component of WMT-13 and reveal that current state-of-the-art performance...
We report results from the SR'19 Shared Task, the second edition of a multilingual surface realisation task organised as part of the EMNLP'19 Workshop on Multilingual Surface Realisation. As in SR'18, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where, additionally, functional words and morphological information were removed. The shallow track was offered in eleven languages and the deep track in three. Systems were evaluated...
The term translationese has been used to describe features of translated text, and in this paper, we provide a detailed analysis of its potential adverse effects on machine translation evaluation. Our analysis shows differences in conclusions drawn from evaluations that include translationese test data compared to experiments tested only with text originally composed in that language. For this reason we recommend that reverse-created test data be omitted from future test sets. In addition, we provide a re-evaluation of a past evaluation claiming human-parity of MT. One important issue not...
Evaluation of open-domain dialogue systems is highly challenging, and the development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that...
Automatic metrics are widely used in machine translation as a substitute for human assessment. With the introduction of any new metric comes the question of just how well that metric mimics human assessment of translation quality. This is often measured by correlation with human judgment. Significance tests are generally not used to establish whether improvements over existing methods such as BLEU are statistically significant or have occurred simply by chance, however. In this paper, we introduce a significance test for comparing the correlations of two metrics,...
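A test for comparing two correlations that share the human judgments as a common variable can be sketched as below using the standard Williams test formulation; whether this matches the paper's exact implementation is an assumption on my part.

```python
# Sketch of the Williams test for comparing two dependent correlations with a
# shared variable: r12 = corr(metric A, human), r13 = corr(metric B, human),
# r23 = corr(metric A, metric B), n = number of observations.
import math
from scipy.stats import t as t_dist

def williams_test(r12, r13, r23, n):
    """One-tailed p-value for the hypothesis that r12 > r13."""
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * K * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    t_stat = numerator / denominator
    return 1 - t_dist.cdf(t_stat, df=n - 3)

# Toy example: metric A correlates 0.87 with human judgments, metric B 0.82,
# the two metrics correlate 0.90 with each other, over 560 observations.
print(f"p = {williams_test(0.87, 0.82, 0.90, 560):.4f}")
```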
The term translationese has been used to describe the presence of unusual features in translated text. In this paper, we provide a detailed analysis of its adverse effects on machine translation evaluation results. Our analysis shows evidence of differences between text originally written in a given language and translated text, and shows how these can potentially negatively impact the accuracy of evaluations. For this reason we recommend that reverse-created test data be omitted from future test sets. In addition, we provide a re-evaluation of a past high-profile evaluation claiming...
Recent human evaluation of machine translation has focused on relative preference judgments of translation quality, making it difficult to track longitudinal improvements over time. We carry out a large-scale crowd-sourcing experiment to estimate the degree to which state-of-the-art performance has increased over the past five years. To facilitate this evaluation, we move away from relative preference judgments and instead ask judges to provide direct estimates of the quality of individual translations in isolation from alternate outputs. For seven European language pairs, our...
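Direct estimates collected from many crowd workers are typically made comparable by standardising each worker's raw scores before averaging per system. The sketch below shows that step only, under my own simplifications (e.g. no minimum number of ratings per worker); it is not the paper's full pipeline.

```python
# Sketch: standardise each worker's 0-100 ratings into z-scores (to remove
# individual scoring biases), then average the standardised scores per system.
from collections import defaultdict
from statistics import mean, pstdev

# (worker, system, raw 0-100 score) tuples -- toy data.
ratings = [
    ("w1", "sysA", 80), ("w1", "sysB", 60), ("w1", "sysA", 90),
    ("w2", "sysA", 55), ("w2", "sysB", 30), ("w2", "sysB", 45),
]

by_worker = defaultdict(list)
for worker, _, score in ratings:
    by_worker[worker].append(score)
stats = {w: (mean(s), pstdev(s)) for w, s in by_worker.items()}

by_system = defaultdict(list)
for worker, system, score in ratings:
    mu, sigma = stats[worker]
    if sigma > 0:                      # skip workers with constant scores
        by_system[system].append((score - mu) / sigma)

for system, zscores in sorted(by_system.items()):
    print(system, round(mean(zscores), 3))
```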
Yvette Graham. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015.
Existing metrics to evaluate the quality of Machine Translation hypotheses take different perspectives into account. DPMFcomb, a metric combining the merits of a range of metrics, achieved the best performance for evaluation of to-English language pairs in the previous two years of the WMT Metrics Shared Tasks. This year, we submit a novel combined metric, Blend, to the WMT17 Metrics task. Compared to DPMFcomb, Blend includes the following adaptations: i) We use DA human assessment to guide the training process, with a vast reduction in the required training data, while still...
We report results from the SR'18 Shared Task, a new multilingual surface realisation task organised as part of the ACL'18 Workshop on Multilingual Surface Realisation. As in its English-only predecessor SR'11, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where, additionally, functional words and morphological information were removed. The shallow track was offered in ten languages and the deep track in three. Systems were evaluated...
Randomized methods of significance testing enable estimation of the probability that an increase in score has occurred simply by chance. In this paper, we examine the accuracy of three randomized methods in the context of machine translation: paired bootstrap resampling, bootstrap resampling and approximate randomization. We carry out a large-scale human evaluation of shared task systems for two language pairs to provide a gold standard for the tests. Results show very little difference across the methods of significance testing. Notably, all test/metric combinations...
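Paired bootstrap resampling, the first of the three randomized tests examined, can be sketched as follows. This is a generic illustration over per-segment metric scores with toy data, not the paper's exact experimental setup.

```python
# Sketch of paired bootstrap resampling: repeatedly resample test-set segments
# with replacement, score both systems on each resample with the same metric,
# and estimate how often the apparent improvement disappears by chance.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """scores_a[i], scores_b[i]: per-segment metric scores for systems A and B.
    Returns the fraction of resamples in which B is at least as good as A."""
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b >= mean_a:
            losses += 1
    return losses / n_resamples   # small value => A's improvement is robust

# Toy example with per-segment scores for two systems.
a = [0.42, 0.55, 0.61, 0.30, 0.48, 0.57, 0.66, 0.51]
b = [0.40, 0.50, 0.60, 0.28, 0.45, 0.52, 0.60, 0.49]
print(f"p = {paired_bootstrap(a, b):.3f}")
```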