MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
DOI:
10.48550/arxiv.2502.14302
Publication Date:
2025-02-20
AUTHORS (7)
ABSTRACT
Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 on detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to the ground truth. Through experiments, we also show that incorporating domain-specific knowledge and introducing a "not sure" option as one of the answer categories improves precision scores by up to 38% relative to baselines.
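The bidirectional entailment clustering mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the `entails` predicate below is a hypothetical placeholder (a crude token-overlap check standing in for a real NLI model), and the greedy clustering loop is only an assumption about how such grouping could be organized.

```python
# Minimal sketch of bidirectional entailment clustering (illustrative only).
# `entails` is a hypothetical stand-in for an NLI entailment check; the
# MedHallu pipeline and models may differ.

def entails(premise: str, hypothesis: str) -> bool:
    """Placeholder entailment check based on token overlap.
    In practice this would be an NLI classifier."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1) > 0.8

def same_cluster(answer_a: str, answer_b: str) -> bool:
    # Bidirectional entailment: both directions must hold for the two
    # answers to be treated as semantically equivalent.
    return entails(answer_a, answer_b) and entails(answer_b, answer_a)

def cluster_answers(answers: list[str]) -> list[set[int]]:
    """Greedy clustering: put each answer into the first cluster whose
    representative it is bidirectionally entailed with."""
    clusters: list[set[int]] = []
    representatives: list[str] = []
    for i, ans in enumerate(answers):
        for cluster, rep in zip(clusters, representatives):
            if same_cluster(ans, rep):
                cluster.add(i)
                break
        else:
            clusters.append({i})
            representatives.append(ans)
    return clusters
```

Under such a scheme, a hallucinated answer that falls into the same cluster as the ground-truth answer corresponds to the "hard", semantically close cases the abstract describes as most difficult to detect.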