HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning
Commonsense reasoning
Robustness
Benchmark
DOI:
10.48550/arXiv.2502.11393
Publication Date:
2025-02-16
AUTHORS (9)
ABSTRACT
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations of questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or do they merely memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, built by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community on commonsense reasoning for LLMs.
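
To illustrate the kind of variant-consistency evaluation the abstract describes (crediting a model only when it answers an original question and all of its variants correctly), here is a minimal Python sketch. The field names (group_id, question, choices, answer) and the answer_fn interface are assumptions made for illustration; they are not the paper's actual data schema or evaluation code.

from collections import defaultdict

def robustness_score(cases, answer_fn):
    # Group each question variant with its original item, then count an item
    # as robust only if every variant in the group is answered correctly.
    groups = defaultdict(list)
    for case in cases:
        predicted = answer_fn(case["question"], case["choices"])
        groups[case["group_id"]].append(predicted == case["answer"])
    robust = sum(all(flags) for flags in groups.values())
    return robust / len(groups) if groups else 0.0

if __name__ == "__main__":
    # Toy usage with a trivial "model" that always picks the first choice.
    toy_cases = [
        {"group_id": 1, "question": "Original item", "choices": ["A", "B"], "answer": "A"},
        {"group_id": 1, "question": "Reordered variant", "choices": ["B", "A"], "answer": "A"},
    ]
    always_first = lambda question, choices: choices[0]
    print(robustness_score(toy_cases, always_first))  # 0.0: fails the reordered variant

A per-group (rather than per-question) score like this separates genuine understanding from pattern memorization, since a model that only memorized the original phrasing will fail at least one variant in the group.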