Active label cleaning for improved dataset quality under resource constraints

DOI: 10.1038/s41467-022-28818-3
Publication Date: 2022-03-04
ABSTRACT
Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term "active label cleaning". We propose to rank instances according to the estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.
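
To make the ranking idea concrete, below is a minimal sketch of how samples might be prioritised for re-annotation. It assumes a simple scoring heuristic: label "incorrectness" is taken as the cross-entropy of the currently assigned label under a trained model's posterior, and labelling difficulty as the entropy of that posterior. The function name and the exact scoring details are illustrative assumptions, not the paper's precise method.

import numpy as np

def rank_for_relabelling(posteriors: np.ndarray, noisy_labels: np.ndarray) -> np.ndarray:
    """Rank samples for re-annotation, most likely mislabelled first.

    posteriors:   (n_samples, n_classes) class probabilities from a trained model.
    noisy_labels: (n_samples,) integer labels as currently annotated.

    Returns sample indices ordered so that labels the model disagrees with
    most come first; ties are broken towards easier (lower-entropy) samples.
    """
    eps = 1e-12
    # Estimated label incorrectness: cross-entropy of the assigned label
    # under the model posterior (high when the model disagrees with the label).
    incorrectness = -np.log(posteriors[np.arange(len(noisy_labels)), noisy_labels] + eps)
    # Labelling difficulty: entropy of the posterior (high for ambiguous samples).
    difficulty = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    # lexsort uses the last key as primary: sort by descending incorrectness,
    # then by ascending difficulty.
    return np.lexsort((difficulty, -incorrectness))

# Example: three samples, two classes.
post = np.array([[0.9, 0.1],   # confident class 0, labelled 1 -> likely wrong
                 [0.5, 0.5],   # ambiguous
                 [0.2, 0.8]])  # confident class 1, labelled 1 -> likely right
labels = np.array([1, 0, 1])
print(rank_for_relabelling(post, labels))  # -> [0 1 2]

Prioritising confidently mislabelled yet easy-to-annotate samples is what lets a fixed re-annotation budget correct more labels than random selection, which is the efficiency gain the abstract reports.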