Task Ambiguity in Humans and Language Models

DOI: 10.48550/arxiv.2212.10711 Publication Date: 2022-01-01
ABSTRACT
Language models have recently achieved strong performance across a wide range of NLP benchmarks. However, unlike benchmarks, real-world tasks are often poorly specified, and agents must deduce the user's intended behavior from a combination of context, instructions, and examples. We investigate how both humans and models behave in the face of such task ambiguity by proposing AmbiBench, a new benchmark of six ambiguously-specified classification tasks. We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient. In addition, we show how to dramatically improve the accuracy of language models trained without large-scale human feedback data by finetuning on a small number of ambiguous in-context examples, providing a promising direction for teaching models to generalize in the face of ambiguity.
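The evaluation setup described above can be pictured as building prompts that pair an instruction of chosen ambiguity with a small number of labeled examples, then scoring a model's predictions against the intended rule. The following is a minimal, purely illustrative sketch of that protocol in Python; it is not the authors' released code, and the task, instructions, and function names are all hypothetical stand-ins for the AmbiBench format.

```python
import random

# Hypothetical miniature of an AmbiBench-style task: each sentence mentions
# either a human or an animal, and either an indoor or an outdoor location.
# Under the ambiguous instruction, the model cannot tell which feature
# determines the label without consulting the labeled examples.
EXAMPLES = [
    ("The researcher walked into the museum.", {"subject": "human",  "location": "indoor"}),
    ("The dog ran across the field.",          {"subject": "animal", "location": "outdoor"}),
    ("The duck waddled through the kitchen.",  {"subject": "animal", "location": "indoor"}),
    ("The senator jogged along the beach.",    {"subject": "human",  "location": "outdoor"}),
]

INSTRUCTIONS = {
    # Ambiguous: compatible with both the "subject" rule and the "location" rule.
    "ambiguous":   "Output X if the sentence matches the hidden rule, and Y otherwise.",
    # Unambiguous: names the intended feature explicitly.
    "unambiguous": "Output X if the sentence involves a human, and Y otherwise.",
}

def gold_label(features, salient="subject"):
    """Gold label under the intended rule (here: subject is human -> X)."""
    return "X" if features[salient] == "human" else "Y"

def build_prompt(instruction, shots, query_sentence):
    """Concatenate an instruction, k labeled examples, and an unlabeled query."""
    lines = [instruction]
    for sentence, features in shots:
        lines.append(f"Sentence: {sentence}\nLabel: {gold_label(features)}")
    lines.append(f"Sentence: {query_sentence}\nLabel:")
    return "\n\n".join(lines)

def evaluate(model_fn, instruction_type="ambiguous", k=2, trials=100):
    """Estimate accuracy of `model_fn` (a text-in, label-out callable)."""
    correct = 0
    for _ in range(trials):
        shots = random.sample(EXAMPLES, k)
        query_sentence, query_features = random.choice(EXAMPLES)
        prompt = build_prompt(INSTRUCTIONS[instruction_type], shots, query_sentence)
        if model_fn(prompt).strip() == gold_label(query_features):
            correct += 1
    return correct / trials

# Usage with a stand-in "model" that always answers X; a real run would call
# a language model and compare accuracy across instruction types and values of k.
print(evaluate(lambda prompt: "X", instruction_type="ambiguous", k=2))
```

Varying `instruction_type` and `k` in this sketch mirrors the two axes of the benchmark: how explicitly the task is specified, and how many labeled examples are available for disambiguation.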