ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

DOI: 10.48550/arxiv.2410.03117 Publication Date: 2024-10-03
ABSTRACT
Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, the selection of relevant knowledge, and multi-step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi-step inference. To this end, we design a special reasoning task in which multi-step inference is specifically targeted by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions are entirely detailed within the instructions. This setup allows models to solve problems solely by following the provided directives. By constructing problems that require varying numbers of steps to solve and evaluating responses at each step, we enable a thorough assessment of state-of-the-art LLMs' ability to follow instructions. To ensure the robustness of our evaluation, we include multiple distinct tasks. Furthermore, by comparing accuracy across tasks, utilizing step-aware metrics, and applying separately defined measures of complexity, we conduct experiments that offer insights into the capabilities and limitations of LLMs. These findings have significant implications for the development of LLMs and highlight areas for future research in advancing their reasoning abilities. The dataset is available at \url{https://huggingface.co/datasets/ifujisawa/procbench} and the code at \url{https://github.com/ifujisawa/proc-bench}.
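Because the benchmark scores a model's full trace of intermediate states rather than only its final answer, a step-aware metric is needed. The snippet below is a minimal Python sketch of one such metric, a longest-matching-prefix accuracy over the predicted and gold step sequences; the function name `prefix_accuracy`, the field layout, and the exact matching rule are illustrative assumptions rather than the authors' official implementation (see the linked GitHub repository for that).

```python
from typing import List


def prefix_accuracy(predicted_steps: List[str], gold_steps: List[str]) -> float:
    """Fraction of the gold procedure reproduced before the first divergence.

    A score of 1.0 means every intermediate state was reproduced in order;
    the score drops at the first step where the model's trace diverges.
    This is an illustrative metric, not the paper's official one.
    """
    if not gold_steps:
        return 0.0
    matched = 0
    for pred, gold in zip(predicted_steps, gold_steps):
        if pred.strip() != gold.strip():
            break
        matched += 1
    return matched / len(gold_steps)


# Toy example: a four-step procedure where the model's trace diverges at step 3.
gold = ["abc", "abcd", "abcde", "abcdef"]
pred = ["abc", "abcd", "abXde", "abcdef"]
print(prefix_accuracy(pred, gold))  # 0.5
```

A prefix-based score like this penalizes early mistakes more heavily than late ones, which matches the intuition that a procedure derailed at step 2 of 20 has followed almost none of the instructions, even if later lines happen to coincide.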