Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation
FOS: Computer and information sciences
Artificial Intelligence (cs.AI)
Machine Learning (cs.LG)
DOI:
10.48550/arxiv.2502.13576
Publication Date:
2025-02-19
AUTHORS (10)
ABSTRACT
Evaluating models on large benchmarks is very resource-intensive, especially during this period of rapid model evolution. Existing efficient evaluation methods estimate the performance of target models by testing them only on a small, static coreset of the benchmark, which is derived from the publicly available evaluation results of source models. These methods rely on the assumption that target models have high prediction consistency with source models. However, we demonstrate that this assumption does not generalize well in practice. To alleviate the inconsistency issue, we present TailoredBench, a method that conducts customized evaluation tailored to each target model. Specifically, a Global-coreset is first constructed as a probe to identify the most consistent source models for each target model with an adaptive source model selection strategy. Afterwards, a scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model. According to the predictions on the Native-coresets, we obtain the performance of target models on the whole benchmark with a calibrated estimation strategy. Comprehensive experiments on 5 benchmarks across over 300 models show that, compared to the best performing baselines, TailoredBench achieves an average reduction of 31.4% in the MAE of accuracy estimates under the same inference budgets, showcasing strong effectiveness and generalizability.
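The abstract describes a coreset-based estimation pipeline: cluster benchmark examples by how source models answer them, evaluate the target model only on representative examples, and extrapolate to the full benchmark. The sketch below is not the authors' implementation; it is a minimal illustration, assuming toy 0/1 correctness data and a simple cluster-size-weighted estimate, of how K-Medoids coreset selection and accuracy extrapolation can work in general. All variable names and data shapes are assumptions for illustration.

```python
# Illustrative sketch (not the TailoredBench code): pick a small coreset of
# benchmark examples via K-Medoids over source-model prediction patterns,
# then estimate a target model's full-benchmark accuracy from the coreset alone.
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Plain K-Medoids on a precomputed distance matrix; returns medoid indices and labels."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # medoid = the member minimizing total distance to the rest of its cluster
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy data: 0/1 correctness of 20 source models on a 500-example benchmark.
rng = np.random.default_rng(1)
source_preds = (rng.random((20, 500)) < 0.6).astype(float)    # (models, examples)

# Cluster benchmark examples by how similarly the source models answer them.
example_profiles = source_preds.T                              # (examples, models)
dist = np.linalg.norm(example_profiles[:, None] - example_profiles[None, :], axis=-1)
coreset, labels = k_medoids(dist, k=30)

# Evaluate the target model only on the coreset, then weight each medoid's
# result by its cluster size to estimate full-benchmark accuracy.
target_correct = (rng.random(500) < 0.55).astype(float)        # toy correctness labels
cluster_sizes = np.bincount(labels, minlength=len(coreset))
est_acc = np.sum(target_correct[coreset] * cluster_sizes) / cluster_sizes.sum()
print(f"estimated accuracy: {est_acc:.3f}  true accuracy: {target_correct.mean():.3f}")
```

This sketch omits the paper's adaptive source model selection and calibration steps; it only shows the clustering-and-extrapolation idea that those steps refine.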