Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis (Preprint)

Preprint
DOI: 10.2196/preprints.53164 Publication Date: 2024-05-08T20:06:47Z
ABSTRACT
<sec> <title>BACKGROUND</title> Large language models (LLMs) have raised both interest and concern in the academic community. They offer potential for automating literature search and synthesis for systematic reviews but raise concerns regarding their reliability, as the tendency to generate unsupported (hallucinated) content persists. </sec>
<title>OBJECTIVE</title> The aim of the study is to assess the performance of LLMs such as ChatGPT and Bard (subsequently rebranded Gemini) to produce references in the context of scientific writing.
<title>METHODS</title> The performance of ChatGPT and Bard in replicating the results of human-conducted systematic reviews was assessed. Using systematic reviews pertaining to shoulder rotator cuff pathology, these LLMs were tested by providing the same inclusion criteria and comparing the results with the original systematic review references, serving as gold standards. The study used 3 key performance metrics: recall, precision, and <i>F</i><sub>1</sub>-score, alongside the hallucination rate. Papers were considered “hallucinated” if any 2 of the following were wrong: title, first author, or year of publication.
<title>RESULTS</title> In total, 11 systematic reviews across 4 fields yielded 33 prompts to LLMs (3 LLMs × 11 reviews), with 471 references analyzed. Precision rates for GPT-3.5, GPT-4, and Bard were 9.4% (13/139), 13.4% (16/119), and 0% (0/104), respectively (<i>P</i>&lt;.001). Recall rates were 11.9% (13/109) for GPT-3.5 and 13.7% (15/109) for GPT-4, with Bard failing to retrieve any relevant papers. Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard. Further analysis of the nonhallucinated papers retrieved by the GPT models revealed significant differences in identifying various criteria, such as randomized studies, participant criteria, and intervention criteria. The study also noted geographical and open-access biases in the papers retrieved by the LLMs.
<title>CONCLUSIONS</title> Given their current performance, it is not recommended that LLMs be deployed as the primary or exclusive tool for conducting systematic reviews. Any references generated by such models warrant thorough validation by researchers. The high occurrence of hallucinations in LLMs highlights the necessity of refining their training and functionality before confidently using them for rigorous academic purposes.
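The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the `Reference` class, the function names, and the case-insensitive exact-match comparison are assumptions made for illustration, and "any 2" is read here as "at least 2 of the 3 fields". The counts in the usage lines are the GPT-3.5 figures reported in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical record holding the three fields the abstract checks."""
    title: str
    first_author: str
    year: int


def is_hallucinated(generated: Reference, actual: Reference) -> bool:
    """Flag a generated reference when at least 2 of the 3 fields are wrong.

    Assumption: fields are compared by case-insensitive exact match against
    the closest real record; the abstract does not say how matching was done.
    """
    wrong = sum([
        generated.title.strip().lower() != actual.title.strip().lower(),
        generated.first_author.strip().lower() != actual.first_author.strip().lower(),
        generated.year != actual.year,
    ])
    return wrong >= 2


def precision(hits: int, generated: int) -> float:
    """Share of generated references that appear in the gold-standard list."""
    return hits / generated if generated else 0.0


def recall(hits: int, gold: int) -> float:
    """Share of gold-standard references that the model retrieved."""
    return hits / gold if gold else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def hallucination_rate(hallucinated: int, generated: int) -> float:
    """Share of generated references flagged as hallucinated."""
    return hallucinated / generated if generated else 0.0


# GPT-3.5 counts as reported in the abstract: 13 hits, 139 generated
# references, 109 gold-standard references, 55 hallucinated.
p = precision(13, 139)
r = recall(13, 109)
print(f"precision={p:.1%} recall={r:.1%} F1={f1_score(p, r):.1%}")
print(f"hallucination rate={hallucination_rate(55, 139):.1%}")
```

Guarding each ratio against a zero denominator keeps the sketch usable for a model such as Bard, whose precision and recall are both 0 and whose F1 would otherwise be undefined.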