HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning
Commonsense reasoning
Robustness
Benchmark
DOI:
10.48550/arXiv.2502.11393
Publication Date:
2025-02-16
AUTHORS (9)
ABSTRACT
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations of questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or do they merely memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, built by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community on commonsense reasoning for LLMs.
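
To illustrate the kind of variant-consistency evaluation the abstract describes (crediting a model only when it answers an original question and all of its variants correctly), here is a minimal Python sketch. The field names (group_id, question, choices, answer) and the answer_fn interface are assumptions made for illustration; they are not the paper's actual data schema or evaluation code.

from collections import defaultdict

def robustness_score(cases, answer_fn):
    # Group each question variant with its original item, then count an item
    # as robust only if every variant in the group is answered correctly.
    groups = defaultdict(list)
    for case in cases:
        predicted = answer_fn(case["question"], case["choices"])
        groups[case["group_id"]].append(predicted == case["answer"])
    robust = sum(all(flags) for flags in groups.values())
    return robust / len(groups) if groups else 0.0

if __name__ == "__main__":
    # Toy usage with a trivial "model" that always picks the first choice.
    toy_cases = [
        {"group_id": 1, "question": "Original item", "choices": ["A", "B"], "answer": "A"},
        {"group_id": 1, "question": "Reordered variant", "choices": ["B", "A"], "answer": "A"},
    ]
    always_first = lambda question, choices: choices[0]
    print(robustness_score(toy_cases, always_first))  # 0.0: fails the reordered variant

A per-group (rather than per-question) score like this separates genuine understanding from pattern memorization, since a model that only memorized the original phrasing will fail at least one variant in the group.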