Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow
DOI: 10.48550/arxiv.2302.04434
Publication Date: 2023-01-01
AUTHORS (5)
ABSTRACT
Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a shift towards robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA through expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our study, we observe that the created samples are adversarial across models, with performance drops of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).
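The human-and-metric-in-the-loop workflow described in the abstract can be pictured as a small loop: a crowdworker drafts a sample, automatic quality metrics score it in real time, and the resulting feedback and recommendations guide revision until the sample passes. The Python sketch below only illustrates that loop under stated assumptions; it is not the VAIDA system, and the metric, the acceptance threshold, and the revision hook (length_diversity_metric, QUALITY_THRESHOLD, simulated_crowdworker) are hypothetical placeholders.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feedback:
    metric: str          # which quality metric produced this feedback
    score: float         # 0.0 (poor) to 1.0 (good)
    recommendation: str  # suggestion shown to the crowdworker in real time


@dataclass
class Sample:
    text: str
    label: str
    history: List[Feedback] = field(default_factory=list)


# A "metric" is any callable that scores a sample and suggests a correction.
MetricFn = Callable[[Sample], Feedback]

QUALITY_THRESHOLD = 0.8  # hypothetical acceptance bar


def length_diversity_metric(sample: Sample) -> Feedback:
    # Toy artifact heuristic (hypothetical): very short samples are easier
    # for models to shortcut, so nudge the worker to add context.
    score = min(len(sample.text.split()) / 15.0, 1.0)
    tip = ("Add more varied context so the label cannot be guessed from "
           "a few surface cues." if score < QUALITY_THRESHOLD else "Looks good.")
    return Feedback(metric="length/diversity", score=score, recommendation=tip)


def review_loop(sample: Sample,
                metrics: List[MetricFn],
                revise: Callable[[Sample, List[Feedback]], Sample],
                max_rounds: int = 5) -> Sample:
    """Score the sample, surface feedback, and let the human revise it
    until every metric clears the threshold (or rounds run out)."""
    for _ in range(max_rounds):
        feedback = [m(sample) for m in metrics]
        sample.history.extend(feedback)
        if all(f.score >= QUALITY_THRESHOLD for f in feedback):
            return sample  # accepted into the benchmark
        sample = revise(sample, feedback)  # human acts on the visual feedback
    return sample  # still failing: route to an analyst for manual review


if __name__ == "__main__":
    def simulated_crowdworker(s: Sample, fb: List[Feedback]) -> Sample:
        # Stand-in for the human: lengthen the sample in response to feedback.
        return Sample(text=s.text + " because the pacing dragged even though "
                      "the cast was strong", label=s.label, history=s.history)

    draft = Sample(text="The movie was fine.", label="neutral")
    final = review_loop(draft, [length_diversity_metric], simulated_crowdworker)
    print(final.text)
    print([round(f.score, 2) for f in final.history])

In the paper's setting, the scores and recommendations would be surfaced through VAIDA's visual interface rather than printed, and samples that repeatedly fail would be escalated to analysts for review.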