How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation
DOI:
10.1609/aaai.v35i15.17599
Publication Date:
2022-09-08T20:07:41Z
AUTHORS (2)
ABSTRACT
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in ranking and an overall reduction in previously reported performance, thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in selecting the model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, found that the prototype reduces effort by 41% on average.
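The abstract's central mechanism is difficulty-weighted evaluation: each test sample contributes to a model's score in proportion to its estimated difficulty, so a model that only solves easy samples can rank below one that solves hard ones. The sketch below is a minimal illustration of that idea under assumed names and a made-up weighting scheme (weight = 0.5 + difficulty); it is not the paper's actual metric or code.

```python
# Minimal sketch of difficulty-weighted evaluation (hypothetical weighting,
# not the paper's exact metric): harder test samples count more toward the score.

from typing import Sequence


def weighted_accuracy(correct: Sequence[bool], difficulty: Sequence[float]) -> float:
    """Difficulty-weighted fraction of correct predictions.

    correct[i]    -- whether the model answered sample i correctly
    difficulty[i] -- estimated difficulty of sample i, in [0, 1]
    """
    weights = [0.5 + d for d in difficulty]  # assumed scheme: harder samples weigh more
    total = sum(weights)
    return sum(w for w, c in zip(weights, correct) if c) / total


if __name__ == "__main__":
    # Two toy models with equal unweighted accuracy (2/4) on the same test set:
    # model B solves the hard samples, so it ranks higher under the weighted metric.
    difficulty = [0.1, 0.2, 0.9, 0.8]
    model_a = [True, True, False, False]   # solves only the easy samples
    model_b = [False, False, True, True]   # solves only the hard samples
    print(weighted_accuracy(model_a, difficulty))  # 0.325
    print(weighted_accuracy(model_b, difficulty))  # 0.675
```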