MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
DOI:
10.48550/arxiv.2502.14302
Publication Date:
2025-02-20
AUTHORS (7)
ABSTRACT
Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 on detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to the ground truth. Through experiments, we also show that incorporating domain-specific knowledge and introducing a "not sure" option as one of the answer categories improves precision scores by up to 38% relative to baselines.
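The bidirectional entailment clustering mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the `entails` predicate below is a hypothetical placeholder (a crude token-overlap check standing in for a real NLI model), and the greedy clustering loop is only an assumption about how such grouping could be organized.

```python
# Minimal sketch of bidirectional entailment clustering (illustrative only).
# `entails` is a hypothetical stand-in for an NLI entailment check; the
# MedHallu pipeline and models may differ.

def entails(premise: str, hypothesis: str) -> bool:
    """Placeholder entailment check based on token overlap.
    In practice this would be an NLI classifier."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1) > 0.8

def same_cluster(answer_a: str, answer_b: str) -> bool:
    # Bidirectional entailment: both directions must hold for the two
    # answers to be treated as semantically equivalent.
    return entails(answer_a, answer_b) and entails(answer_b, answer_a)

def cluster_answers(answers: list[str]) -> list[set[int]]:
    """Greedy clustering: put each answer into the first cluster whose
    representative it is bidirectionally entailed with."""
    clusters: list[set[int]] = []
    representatives: list[str] = []
    for i, ans in enumerate(answers):
        for cluster, rep in zip(clusters, representatives):
            if same_cluster(ans, rep):
                cluster.add(i)
                break
        else:
            clusters.append({i})
            representatives.append(ans)
    return clusters
```

Under such a scheme, a hallucinated answer that falls into the same cluster as the ground-truth answer corresponds to the "hard", semantically close cases the abstract describes as most difficult to detect.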