With Little Power Comes Great Responsibility
Benchmark (surveying)
Statistical power
DOI:
10.18653/v1/2020.emnlp-main.745
Publication Date:
2020-11-29T14:51:46Z
AUTHORS (6)
ABSTRACT
Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
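The definition of power in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the authors' notebooks. It assumes two models are compared on the same test set with a McNemar-style paired test (normal approximation), and it estimates power as the fraction of simulated test sets on which the null hypothesis is rejected. The function name `simulate_power` and the parameters `delta` (assumed true accuracy gain) and `disagreement` (assumed fraction of examples on which exactly one model is correct) are hypothetical choices made for this example.

```python
import random
from statistics import NormalDist


def simulate_power(n, delta, disagreement, alpha=0.05, n_sims=500, seed=0):
    """Estimate the power of a McNemar-style paired test by simulation.

    n            -- number of test examples (assumed)
    delta        -- assumed true accuracy gain of model B over model A
    disagreement -- assumed fraction of examples where exactly one model is right
    """
    if abs(delta) > disagreement:
        raise ValueError("delta cannot exceed the disagreement rate")
    rng = random.Random(seed)
    p_b_only = (disagreement + delta) / 2  # B right, A wrong
    p_a_only = (disagreement - delta) / 2  # A right, B wrong
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value

    rejections = 0
    for _ in range(n_sims):
        b_only = a_only = 0
        # Draw one simulated test set and count the discordant examples.
        for _ in range(n):
            u = rng.random()
            if u < p_b_only:
                b_only += 1
            elif u < p_b_only + p_a_only:
                a_only += 1
        discordant = b_only + a_only
        if discordant == 0:
            continue  # no discordant pairs, so the test cannot reject
        z = (b_only - a_only) / discordant ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    # Power is the share of simulated experiments that reject the null.
    return rejections / n_sims


if __name__ == "__main__":
    # Hypothetical settings: a 2-point accuracy gain, 10% disagreement.
    for n in (250, 1000, 4000):
        print(n, simulate_power(n, delta=0.02, disagreement=0.10))
```

Under these assumptions, estimated power grows with the test set size, which matches the abstract's point that small test sets leave comparisons underpowered. Setting `delta=0` gives the false positive rate instead, which should sit near `alpha`.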