- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Advanced Image and Video Retrieval Techniques
- Text Readability and Simplification
- Speech and Dialogue Systems
- Human Pose and Action Recognition
- Semantic Web and Ontologies
- Software Engineering Research
- Sentiment Analysis and Opinion Mining
- Advanced Text Analysis Techniques
- Domain Adaptation and Few-Shot Learning
- Biomedical Text Mining and Ontologies
- Computational and Text Analysis Methods
- Translation Studies and Practices
- Mobile Crowdsensing and Crowdsourcing
- Lexicography and Language Studies
- Explainable Artificial Intelligence (XAI)
- AI in Service Interactions
- Algorithms and Data Compression
- Cognitive Science and Mapping
- Authorship Attribution and Profiling
- Mathematics, Computing, and Information Processing
- Logic, Programming, and Type Systems
Trinity College Dublin
2015-2024
Dublin City University
2007-2021
University of Sheffield
2017-2021
University of Amsterdam
2016-2021
Bar-Ilan University
2021
University of Helsinki
2021
Tel Aviv University
2021
Technical University of Darmstadt
2021
University of Copenhagen
2021
Edinburgh Napier University
2021
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Christof Monz. Proceedings of the Third Conference on Machine Translation: Shared Task Papers. 2018.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, Marcos Zampieri. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. 2016.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, Marcos Zampieri. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, Marco Turchi. Proceedings of the Second Conference on Machine Translation. 2017.
This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less "metrics" and constitute submissions to the joint task with the Quality Estimation Task, "QE as a Metric". In addition, we computed 11 baseline metrics: 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, chrF) and 3 reimplementations (chrF+,...
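The baseline metrics listed above can all be computed from plain-text hypotheses and references; below is a minimal sketch using the sacrebleu package for two of them (BLEU and chrF). It assumes sacrebleu 2.x and toy data, and is not the shared task's own baseline implementation.

```python
# Minimal sketch: computing two of the baseline metrics (BLEU, chrF) with sacrebleu 2.x.
# The shared task's own baseline implementations may differ; this is illustrative only.
import sacrebleu

hypotheses = ["The cat sat on the mat .", "A quick brown fox ."]   # system outputs (toy data)
references = ["The cat sat on the mat .", "The quick brown fox ."]  # one reference per segment

bleu = sacrebleu.corpus_bleu(hypotheses, [references])   # corpus-level BLEU
chrf = sacrebleu.corpus_chrf(hypotheses, [references])   # corpus-level chrF

print(f"BLEU = {bleu.score:.2f}")
print(f"chrF = {chrf.score:.2f}")
```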
Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method of filtering is via agreement with experts, but even amongst experts, agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own assessment strategy....
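One way to realise this kind of quality filtering is to have each worker also rate artificially degraded copies of some translations and keep only workers who rate the degraded copies significantly lower. The sketch below illustrates that general idea with a paired t-test; it is my own simplification, not necessarily the exact filter used in the paper.

```python
# Sketch of crowd quality control: a worker also rates degraded copies of some items;
# keep the worker only if original items score significantly higher than degraded ones.
# This illustrates the general idea of filtering unreliable workers; the exact
# procedure in the paper may differ.
from scipy.stats import ttest_rel

def worker_passes(original_scores, degraded_scores, alpha=0.05):
    """original_scores[i] and degraded_scores[i] are the worker's ratings of the
    same translation before and after artificial degradation."""
    t_stat, p_two_sided = ttest_rel(original_scores, degraded_scores)
    # One-sided test: originals should be rated higher than degraded copies.
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha

# Toy example: 0-100 adequacy ratings for ten repeated items.
orig = [78, 65, 90, 72, 88, 61, 70, 84, 77, 69]
degr = [40, 35, 55, 48, 60, 30, 42, 50, 45, 38]
print(worker_passes(orig, degr))  # True: this worker rates degraded output lower
```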
We provide an analysis of current evaluation methodologies applied to summarization metrics and identify the following areas of concern: (1) movement away from evaluation by correlation with human assessment; (2) omission of important components of human assessment from evaluations, in addition to large numbers of metric variants; (3) absence of methods of significance testing of improvements over a baseline. We outline a methodology that overcomes all such challenges, providing the first method of significance testing suitable for summarization metrics. Our evaluation reveals for the first time which...
This paper presents the results of the WMT17 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the news translation task and the neural MT training task. We collected scores of 14 metrics from 8 research groups. In addition to that, we computed 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The metrics were evaluated in terms of system-level correlation (how well each metric's scores correlate with the official manual ranking of systems) and segment-level correlation (how often a metric agrees with humans...
This paper presents the results of the WMT16 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the Translation Task. We collected scores of 16 metrics from 9 research groups. In addition to that, we computed standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The metrics were evaluated in terms of system-level correlation (how well each metric's scores correlate with the official manual ranking of systems) and segment-level correlation (how often a metric agrees with humans in comparing two translations...
This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the News Translation Task with automatic metrics. We collected scores of 10 metrics from 8 research groups. In addition to that, we computed standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER and CDER) as baselines. The metrics were evaluated in terms of system-level correlation (how well each metric's scores correlate with the official manual ranking of systems) and segment-level correlation (how often a metric agrees with humans...
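The two evaluation criteria recurring across these task reports (system-level and segment-level correlation) can be illustrated with scipy. The sketch below uses toy numbers and generic correlation statistics (Pearson and Kendall's tau); it is not the tasks' official scoring code.

```python
# Sketch of the two correlation criteria used to evaluate MT metrics.
# System level: Pearson correlation between a metric's system scores and human scores.
# Segment level: how often the metric agrees with humans when comparing translations,
# summarised here with Kendall's tau over per-segment scores.
from scipy.stats import pearsonr, kendalltau

# Toy data: one score per MT system (system level) ...
human_sys  = [68.2, 64.5, 59.1, 71.0]
metric_sys = [0.62, 0.58, 0.55, 0.66]
r, _ = pearsonr(metric_sys, human_sys)
print(f"system-level Pearson r = {r:.3f}")

# ... and one score per segment for a single system (segment level).
human_seg  = [80, 55, 90, 40, 65, 75]
metric_seg = [0.7, 0.5, 0.9, 0.3, 0.6, 0.8]
tau, _ = kendalltau(metric_seg, human_seg)
print(f"segment-level Kendall tau = {tau:.3f}")
```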
Evaluation of segment-level machine translation metrics is currently hampered by: (1) low inter-annotator agreement levels in human assessments; (2) lack of an effective mechanism for evaluation of translations of equal quality; and (3) lack of methods of significance testing of improvements over a baseline. In this paper, we provide solutions to each of these challenges and outline a new methodology aimed specifically at assessment of segment-level metrics. We replicate the human evaluation component of WMT-13 and reveal that current state-of-the-art performance...
We report results from the SR'19 Shared Task, the second edition of a multilingual surface realisation task organised as part of the EMNLP'19 Workshop on Multilingual Surface Realisation. As in SR'18, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where, additionally, functional words and morphological information were removed. The shallow track was offered in eleven languages and the deep track in three. Systems were evaluated...
The term translationese has been used to describe features of translated text, and in this paper, we provide a detailed analysis of its potential adverse effects on machine translation evaluation. Our analysis shows differences in conclusions drawn from evaluations that include translationese test data compared to experiments tested only with text originally composed in that language. For this reason we recommend that reverse-created test data be omitted from future test sets. In addition, we provide a re-evaluation of a past evaluation claiming human-parity of MT. One important issue not...
Evaluation of open-domain dialogue systems is highly challenging, and the development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that...
Automatic metrics are widely used in machine translation as a substitute for human assessment. With the introduction of any new metric comes the question of just how well that metric mimics human assessment of translation quality. This is often measured by correlation with human judgment. Significance tests are generally not used to establish whether improvements over existing methods such as BLEU are statistically significant or have occurred simply by chance, however. In this paper, we introduce a significance test for comparing the correlations of two metrics,...
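A test for comparing two correlations that share the human judgments as a common variable can be sketched as below using the standard Williams test formulation; whether this matches the paper's exact implementation is an assumption on my part.

```python
# Sketch of the Williams test for comparing two dependent correlations with a
# shared variable: r12 = corr(metric A, human), r13 = corr(metric B, human),
# r23 = corr(metric A, metric B), n = number of observations.
import math
from scipy.stats import t as t_dist

def williams_test(r12, r13, r23, n):
    """One-tailed p-value for the hypothesis that r12 > r13."""
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * K * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    t_stat = numerator / denominator
    return 1 - t_dist.cdf(t_stat, df=n - 3)

# Toy example: metric A correlates 0.87 with human judgments, metric B 0.82,
# the two metrics correlate 0.90 with each other, over 560 observations.
print(f"p = {williams_test(0.87, 0.82, 0.90, 560):.4f}")
```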
The term translationese has been used to describe the presence of unusual features in translated text. In this paper, we provide a detailed analysis of its adverse effects on machine translation evaluation results. Our analysis shows evidence of differences between text originally written in a given language and translated text, and shows how these can potentially negatively impact the accuracy of evaluations. For this reason we recommend that reverse-created test data be omitted from future test sets. In addition, we provide a re-evaluation of a past high-profile evaluation claiming...
Recent human evaluation of machine translation has focused on relative preference judgments of translation quality, making it difficult to track longitudinal improvements over time. We carry out a large-scale crowd-sourcing experiment to estimate the degree to which state-of-the-art performance has increased over the past five years. To facilitate this evaluation, we move away from relative preference judgments and instead ask judges to provide direct estimates of the quality of individual translations in isolation from alternate outputs. For seven European language pairs, our...
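Direct estimates collected from many crowd workers are typically made comparable by standardising each worker's raw scores before averaging per system. The sketch below shows that step only, under my own simplifications (e.g. no minimum number of ratings per worker); it is not the paper's full pipeline.

```python
# Sketch: standardise each worker's 0-100 ratings into z-scores (to remove
# individual scoring biases), then average the standardised scores per system.
from collections import defaultdict
from statistics import mean, pstdev

# (worker, system, raw 0-100 score) tuples -- toy data.
ratings = [
    ("w1", "sysA", 80), ("w1", "sysB", 60), ("w1", "sysA", 90),
    ("w2", "sysA", 55), ("w2", "sysB", 30), ("w2", "sysB", 45),
]

by_worker = defaultdict(list)
for worker, _, score in ratings:
    by_worker[worker].append(score)
stats = {w: (mean(s), pstdev(s)) for w, s in by_worker.items()}

by_system = defaultdict(list)
for worker, system, score in ratings:
    mu, sigma = stats[worker]
    if sigma > 0:                      # skip workers with constant scores
        by_system[system].append((score - mu) / sigma)

for system, zscores in sorted(by_system.items()):
    print(system, round(mean(zscores), 3))
```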
Yvette Graham. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015.
Existing metrics to evaluate the quality of Machine Translation hypotheses take different perspectives into account. DPMFcomb, a metric combining the merits of a range of metrics, achieved the best performance for evaluation of to-English language pairs in the previous two years of the WMT Metrics Shared Tasks. This year, we submit a novel combined metric, Blend, to the WMT17 Metrics task. Compared to DPMFcomb, Blend includes the following adaptations: i) We use DA human assessment to guide the training process, with a vast reduction in the required training data, while still...
We report results from the SR'18 Shared Task, a new multilingual surface realisation task organised as part of the ACL'18 Workshop on Multilingual Surface Realisation. As in its English-only predecessor SR'11, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where, additionally, functional words and morphological information were removed. The shallow track was offered in ten languages and the deep track in three. Systems were evaluated...
Randomized methods of significance testing enable estimation of the probability that an increase in score has occurred simply by chance. In this paper, we examine the accuracy of three randomized methods in the context of machine translation: paired bootstrap resampling, bootstrap resampling and approximate randomization. We carry out a large-scale human evaluation of shared task systems for two language pairs to provide a gold standard for the tests. Results show very little difference across the methods of significance testing. Notably, all test/metric combinations...
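Paired bootstrap resampling, the first of the three randomized tests examined, can be sketched as follows. This is a generic illustration over per-segment metric scores with toy data, not the paper's exact experimental setup.

```python
# Sketch of paired bootstrap resampling: repeatedly resample test-set segments
# with replacement, score both systems on each resample with the same metric,
# and estimate how often the apparent improvement disappears by chance.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """scores_a[i], scores_b[i]: per-segment metric scores for systems A and B.
    Returns the fraction of resamples in which B is at least as good as A."""
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b >= mean_a:
            losses += 1
    return losses / n_resamples   # small value => A's improvement is robust

# Toy example with per-segment scores for two systems.
a = [0.42, 0.55, 0.61, 0.30, 0.48, 0.57, 0.66, 0.51]
b = [0.40, 0.50, 0.60, 0.28, 0.45, 0.52, 0.60, 0.49]
print(f"p = {paired_bootstrap(a, b):.3f}")
```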