IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
FOS: Computer and information sciences
Computer Science - Computation and Language
Computation and Language (cs.CL)
DOI:
10.48550/arxiv.2409.18892
Publication Date:
2024-09-27
AUTHORS (10)
ABSTRACT
As Large Language Models (LLMs) grow increasingly adept at managing complex tasks, the evaluation set must keep pace with these advancements to ensure it remains sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs so that the evaluation set can continually update and refine according to model abilities. Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, allowing effective discrimination of their relative strengths and weaknesses across various tasks and domains. To produce high-quality data, we incorporate a self-correct mechanism into our generalization framework and develop two models to predict prompt discrimination and difficulty score, contributing valuable tools to evaluation data synthesis research. We apply the generated data to evaluate five SOTA models. Our data achieves an average score of 51.92, accompanied by a variance of 10.06. By contrast, previous works (i.e., SELF-INSTRUCT and WizardLM) obtain average scores exceeding 67, with variance below 3.2. The results demonstrate that the data generated by our framework is more challenging and discriminative compared to previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research on LLMs.
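The Item Discrimination index referenced in the abstract comes from classical test theory. The abstract does not specify the authors' exact formulation, so the following is only a minimal illustrative sketch of the conventional upper-minus-lower-group index, applied to hypothetical per-model scores on a single prompt; the function name, the group fraction, and the example numbers are assumptions, not the paper's implementation.

from statistics import mean

def discrimination_index(item_scores, overall_scores, group_fraction=0.27):
    # Classical ID index: mean item score of the top scorers minus that of the
    # bottom scorers, with groups formed by overall score on the full set.
    # item_scores    -- per-examinee (here: per-model) scores on one prompt, in [0, 1]
    # overall_scores -- per-examinee total scores on the whole evaluation set
    # group_fraction -- size of each extreme group (27% is the textbook convention)
    ranked = sorted(range(len(overall_scores)), key=lambda i: overall_scores[i])
    k = max(1, int(len(ranked) * group_fraction))
    low, high = ranked[:k], ranked[-k:]
    return mean(item_scores[i] for i in high) - mean(item_scores[i] for i in low)

# Hypothetical scores for five models: one prompt vs. the whole evaluation set.
item = [0.2, 0.4, 0.5, 0.8, 0.9]
overall = [38.0, 45.0, 52.0, 60.0, 64.0]
print(f"discrimination index: {discrimination_index(item, overall):.2f}")

Under this reading, a prompt with a high index separates strong models from weak ones, which is the property the abstract's comparison highlights: a lower average score with higher variance across models indicates a harder, more discriminative evaluation set.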