Evaluating AI systems under uncertain ground truth: a case study in dermatology

Keywords: Ground truth, Normalization, Uncertainty Quantification
DOI: 10.48550/arxiv.2307.02191 Publication Date: 2023-01-01
ABSTRACT
For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is actually not the case and the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences such as overestimating future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty, which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses. The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores this uncertainty. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and that IRN-based evaluation severely over-estimates performance without providing uncertainty estimates.
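To make the contrast concrete, below is a minimal Python sketch, not the authors' code: a deterministic IRN-style aggregation of ranked differential diagnoses, next to the Plackett-Luce likelihood that a statistical model would build posterior inference on. The function names, the 1/rank weighting, and the example condition labels are all assumptions for illustration.

```python
import math
from collections import defaultdict

def irn_aggregate(differentials, classes):
    """Deterministic IRN-style aggregation (a sketch, assuming 1/rank weights).

    Each annotator gives a ranked differential diagnosis (best guess first).
    A condition at rank r gets weight 1/r; weights are normalized per
    annotator and averaged across annotators, producing a single point
    estimate of the ground-truth distribution. This is exactly where
    annotation uncertainty is discarded.
    """
    scores = defaultdict(float)
    for ranking in differentials:
        weights = [1.0 / (rank + 1) for rank in range(len(ranking))]
        total = sum(weights)
        for condition, w in zip(ranking, weights):
            scores[condition] += w / total
    n = len(differentials)
    return {c: scores[c] / n for c in classes}

def plackett_luce_loglik(ranking, plausibilities):
    """Log-likelihood of one partial ranking under a Plackett-Luce model.

    `plausibilities` maps every class to a positive score. In the statistical
    framing these are latent quantities to be inferred (e.g., by posterior
    sampling) rather than fixed deterministically; this function is only the
    likelihood building block.
    """
    remaining = dict(plausibilities)
    ll = 0.0
    for condition in ranking:
        # Probability of picking `condition` next among the remaining classes.
        ll += math.log(remaining[condition]) - math.log(sum(remaining.values()))
        del remaining[condition]
    return ll

# Three annotators' differential diagnoses for one image (hypothetical labels).
differentials = [
    ["eczema", "psoriasis"],
    ["eczema", "tinea", "psoriasis"],
    ["psoriasis", "eczema"],
]
print(irn_aggregate(differentials, ["eczema", "psoriasis", "tinea"]))

plaus = {"eczema": 0.5, "psoriasis": 0.4, "tinea": 0.1}
print(plackett_luce_loglik(["eczema", "psoriasis"], plaus))
```

Evaluating the Plackett-Luce likelihood over many candidate plausibility vectors, rather than committing to the single IRN point estimate, is what allows the framework to quantify annotation uncertainty per example.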