We Need to Consider Disagreement in Evaluation

Keywords: evaluation methodologies; annotator disagreement; annotation; evaluation; agreement; soft labels; Natural Language Processing (NLP); ground truth; data annotation
DOI: 10.18653/v1/2021.bppf-1.3
Publication Date: 2021-07-27
ABSTRACT
Evaluation is of paramount importance in data-driven research fields such as Natural Language Processing (NLP) and Computer Vision (CV). Current evaluation practice largely hinges on the existence of a single “ground truth” against which we can meaningfully compare the prediction of a model. However, this comparison is flawed for two reasons. 1) In many cases, more than one answer is correct. 2) Even where there is a single answer, disagreement among annotators is ubiquitous, making it difficult to decide on a gold standard. We argue that the current methods of adjudication, agreement, and evaluation need serious reconsideration. Some researchers now propose to minimize disagreement and to fix datasets. We argue that this is a gross oversimplification, and likely to conceal the underlying complexity. Instead, we suggest that we need to better capture the sources of disagreement to improve today’s evaluation practice. We discuss three sources of disagreement: from the annotator, the data, and the context, and show how this affects even seemingly objective tasks. Datasets with multiple annotations are becoming more common, as are methods to integrate disagreement into modeling. The logical next step is to extend this to evaluation.
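To make the closing point concrete, the sketch below (not from the paper; the data and function names are hypothetical) contrasts conventional evaluation against a majority-vote gold label with a disagreement-aware alternative that scores a model's predicted distribution against soft labels derived from multiple annotators.

```python
import numpy as np

def soft_labels(annotations, num_classes):
    """Turn per-item annotator votes into a probability distribution per item."""
    dist = np.zeros((len(annotations), num_classes))
    for i, votes in enumerate(annotations):
        for v in votes:
            dist[i, v] += 1
        dist[i] /= len(votes)
    return dist

def hard_accuracy(pred_probs, annotations):
    """Conventional evaluation: compare the argmax prediction to the majority vote."""
    majority = np.array([np.bincount(v).argmax() for v in annotations])
    return float((pred_probs.argmax(axis=1) == majority).mean())

def mean_cross_entropy(pred_probs, soft, eps=1e-12):
    """Disagreement-aware evaluation: cross-entropy between the predicted
    distribution and the annotator distribution (lower is better)."""
    return float(-(soft * np.log(pred_probs + eps)).sum(axis=1).mean())

# Toy example (hypothetical): 3 items, 3 classes, 5 annotators per item.
annotations = [[0, 0, 0, 1, 0],   # near-unanimous
               [1, 2, 1, 2, 1],   # genuine disagreement between classes 1 and 2
               [2, 2, 2, 2, 2]]   # unanimous
pred_probs = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.5, 0.4],
                       [0.1, 0.1, 0.8]])

soft = soft_labels(annotations, num_classes=3)
print("accuracy vs. majority vote:", hard_accuracy(pred_probs, annotations))
print("cross-entropy vs. soft labels:", mean_cross_entropy(pred_probs, soft))
```

Under accuracy, the disagreed-upon second item simply counts as "correct"; under the soft-label cross-entropy, the best score is obtained by a prediction that matches the annotator distribution, and overconfident predictions are penalized.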