NFDI4DS | UHH-SEMS - Publication Details

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

OPENALEX - Publications

Craig Thomson, Ehud Reiter

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer-generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

10.18653/v1/2020.inlg-1.22 article EN cc-by 2020-01-01

Evaluating factual accuracy in complex data-to-text

Craig Thomson, Ehud Reiter, Barkavi Sundararajan

10.1016/j.csl.2023.101482 article EN Computer Speech & Language 2023-01-05

AI in Energy Digital Twining: A Reinforcement Learning-Based Adaptive Digital Twin Model for Green Cities

Lal Verda Çakır, Kübra Duran, Craig Thomson, Matthew Broadbent, Berk Canberk

10.1109/icc51166.2024.10622773 article EN ICC 2024 - IEEE International Conference on Communications 2024-06-09

Underreporting of errors in NLG output, and what to do about it

Emiel van Miltenburg, Miruna Clinciu, Ondřej Dušek, Dimitra Gkatzia, Stephanie Inglis and 6 more

Emiel van Miltenburg, Miruna Clinciu, Ondřej Dušek, Dimitra Gkatzia, Stephanie Inglis, Leo Leppänen, Saad Mahamood, Emma Manning, Stephanie Schoch, Craig Thomson, Luou Wen. Proceedings of the 14th International Conference on Natural Language Generation. 2021.

10.18653/v1/2021.inlg-1.14 preprint EN cc-by 2021-01-01

Common Flaws in Running Human Evaluation Experiments in NLP

Craig Thomson, Ehud Reiter, Anja Belz

While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems...

10.1162/coli_a_00508 article EN cc-by-nc-nd Computational Linguistics 2024-01-01

AI in Energy Digital Twining: A Reinforcement Learning-based Adaptive Digital Twin Model for Green Cities

Lal Verda Çakır, Kübra Duran, Craig Thomson, Matthew Broadbent, Berk Canberk

Digital Twins (DT) have become crucial to achieve sustainable and effective smart urban solutions. However, current DT modelling techniques cannot support the dynamicity of these smart city environments. This is caused by the lack of right-time data capturing in traditional approaches, resulting in inaccurate modelling and high resource and energy consumption challenges. To fill this gap, we explore spatiotemporal graphs and propose the Reinforcement Learning-based Adaptive Twining (RL-AT) mechanism with Deep Q Networks (DQN). By...

10.1109/icc51166.2024.10622773 preprint EN arXiv (Cornell University) 2024-01-28

Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP

Anja Belz, Craig Thomson, Ehud Reiter, Simon Mille

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP, which means that their repeatability and the reproducibility of their results is currently an open question. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years which we assessed...

10.18653/v1/2023.findings-acl.226 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2023 2023-01-01

Generation Challenges: Results of the Accuracy Evaluation Shared Task

Craig Thomson, Ehud Reiter

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

10.18653/v1/2021.inlg-1.23 article EN cc-by 2021-01-01

Shared Task on Evaluating Accuracy

Ehud Reiter, Craig Thomson

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts, specifically summaries of basketball games produced from basketball box score and other game data. We welcome submissions based on protocols for human evaluation, automatic metrics, as well as combinations of human evaluations and metrics.

10.18653/v1/2020.inlg-1.28 article EN cc-by 2020-01-01

Studying the Impact of Filling Information Gaps on the Output Quality of Neural Data-to-Text

Craig Thomson, Zhijie Zhao, Somayajulu Sripada

It is unfair to expect neural data-to-text to produce high quality output when there are gaps between system input data and information contained in the training text. Thomson et al. (2020) identify and narrow information gaps in Rotowire, a popular data-to-text dataset. In this paper, we describe a study which finds that a state-of-the-art neural data-to-text system produces higher quality output, according to information extraction (IE) based metrics, when additional input data is carefully selected from this newly available source. It remains to be shown, however, whether the IE metrics used in this study correlate well with...

10.18653/v1/2020.inlg-1.6 article EN cc-by 2020-01-01

Comprehension Driven Document Planning in Natural Language Generation Systems

Craig Thomson, Ehud Reiter, Somayajulu Sripada

This paper proposes an approach to NLG system design which focuses on generating output text which can be more easily processed by the reader. Ways in which cognitive theory might be combined with existing NLG techniques are discussed and two simple experiments in content ordering are presented.

10.18653/v1/w18-6544 article EN cc-by 2018-01-01

Enhancing factualness and controllability of Data-to-Text Generation via data Views and constraints

Craig Thomson, Clément Rebuffel, Ehud Reiter, Laure Soulier, Somayajulu Sripada and 1 more

Neural data-to-text systems lack the control and factual accuracy required to generate useful and insightful summaries of multidimensional data. We propose a solution in the form of data views, where each data view describes an entity and its attributes along specific dimensions. A sequence of views can then be used as a high-level schema for document planning, with the neural model handling the complexities of micro-planning and surface realization. We show that our view-based system retains factual accuracy while offering output that can be tailored based on...

10.18653/v1/2023.inlg-main.16 article EN cc-by 2023-01-01

AI-based traffic analysis in digital twin networks

Sarah Al–Shareeda, Khayal Huseynov, Lal Verda Çakır, Craig Thomson, Mehmet Özdem and 1 more

In today's networked world, Digital Twin Networks (DTNs) are revolutionizing how we understand and optimize physical networks. These networks, also known as 'Digital Twin Networks (DTNs)' or 'Networks Digital Twins (NDTs),' encompass many physical networks, from cellular and wireless to optical and satellite. They leverage computational power and AI capabilities to provide virtual representations, leading to highly refined recommendations for real-world network challenges. Within DTNs, tasks include network performance enhancement, latency optimization,...

10.48550/arxiv.2411.00681 preprint EN arXiv (Cornell University) 2024-11-01

Min-max Training: Adversarially Robust Learning Models for Network Intrusion Detection Systems

Sam Grierson, Craig Thomson, Pavlos Papadopoulos, Bill Buchanan

Intrusion detection systems are integral to the security of networked systems for detecting malicious or anomalous network traffic. As traditional approaches are becoming less effective, machine learning and deep learning-based intrusion detection systems are vital research areas for improved detection systems. Past research into computer vision using deep learning revealed that the classifiers themselves are vulnerable to adversarial attacks, and these attacks have been investigated extensively. However, adversarial attacks are restricted not only to the domain of image recognition. As indicated by previous...

10.1109/sin54109.2021.9699157 article EN 2021-12-15

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

Craig Thomson, Ehud Reiter

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer-generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

10.48550/arxiv.2011.03992 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Shared Task on Evaluating Accuracy in Natural Language Generation

Ehud Reiter, Craig Thomson

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts. Participants will measure the accuracy of basketball game summaries produced by NLG systems from basketball box score data.

10.48550/arxiv.2006.12234 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Generation Challenges: Results of the Accuracy Evaluation Shared Task

Craig Thomson, Ehud Reiter

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

10.48550/arxiv.2108.05644 preprint EN other-oa arXiv (Cornell University) 2021-01-01