ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

DOI: 10.48550/arxiv.2410.03117 Publication Date: 2024-10-03
ABSTRACT
Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, the selection of relevant knowledge, and multi-step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi-step inference. To this end, we design a special reasoning task in which multi-step inference is specifically targeted by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions are entirely detailed within the instructions. This setup allows models to solve problems solely by following the provided directives. By constructing problems that require varying numbers of steps to solve and evaluating responses at each step, we enable a thorough assessment of state-of-the-art LLMs' ability to follow instructions. To ensure the robustness of our evaluation, we include multiple distinct tasks. Furthermore, by comparing accuracy across tasks, utilizing step-aware metrics, and applying separately defined measures of complexity, we conduct experiments that offer insights into the capabilities and limitations of LLMs. These findings have significant implications for the development of LLMs and highlight areas for future research in advancing their reasoning abilities. The dataset is available at \url{https://huggingface.co/datasets/ifujisawa/procbench} and the code at \url{https://github.com/ifujisawa/proc-bench}.
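Because the benchmark scores a model's full trace of intermediate states rather than only its final answer, a step-aware metric is needed. The snippet below is a minimal Python sketch of one such metric, a longest-matching-prefix accuracy over the predicted and gold step sequences; the function name `prefix_accuracy`, the field layout, and the exact matching rule are illustrative assumptions rather than the authors' official implementation (see the linked GitHub repository for that).

```python
from typing import List


def prefix_accuracy(predicted_steps: List[str], gold_steps: List[str]) -> float:
    """Fraction of the gold procedure reproduced before the first divergence.

    A score of 1.0 means every intermediate state was reproduced in order;
    the score drops at the first step where the model's trace diverges.
    This is an illustrative metric, not the paper's official one.
    """
    if not gold_steps:
        return 0.0
    matched = 0
    for pred, gold in zip(predicted_steps, gold_steps):
        if pred.strip() != gold.strip():
            break
        matched += 1
    return matched / len(gold_steps)


# Toy example: a four-step procedure where the model's trace diverges at step 3.
gold = ["abc", "abcd", "abcde", "abcdef"]
pred = ["abc", "abcd", "abXde", "abcdef"]
print(prefix_accuracy(pred, gold))  # 0.5
```

A prefix-based score like this penalizes early mistakes more heavily than late ones, which matches the intuition that a procedure derailed at step 2 of 20 has followed almost none of the instructions, even if later lines happen to coincide.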