NFDI4DS | UHH-SEMS - Publication Details

Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis (Preprint)

Preprint

DOI: 10.2196/preprints.53164

Publication Date: 2024-05-08T20:06:47Z


AUTHORS (10)

Mikaël Chelli

Jules Descamps

Vincent Lavoué

Christophe Trojani

Michel Azar

Marcel Deckert

Jean-Luc Raynier

Gilles Clowez

Pascal Boileau

Caroline Ruetsch-...

ABSTRACT

BACKGROUND: Large language models (LLMs) have raised both interest and concern in the academic community. They offer potential for automating literature search and synthesis for systematic reviews but raise concerns regarding their reliability, as the tendency to generate unsupported (hallucinated) content persists.

OBJECTIVE: The aim of the study is to assess the performance of LLMs such as ChatGPT and Bard (subsequently rebranded Gemini) in producing references in the context of scientific writing.

METHODS: The performance of ChatGPT and Bard in replicating the results of human-conducted systematic reviews was assessed. Using systematic reviews pertaining to shoulder rotator cuff pathology, these LLMs were tested by providing the same inclusion criteria and comparing the results with the original systematic review references, serving as gold standards. The study used 3 key performance metrics: recall, precision, and F1-score, alongside the hallucination rate. Papers were considered “hallucinated” if any 2 of the following pieces of information were wrong: title, first author, or year of publication.

RESULTS: In total, 11 systematic reviews across 4 fields yielded 33 prompts to LLMs (3 LLMs × 11 reviews), with 471 references analyzed. Precision rates for GPT-3.5, GPT-4, and Bard were 9.4% (13/139), 13.4% (16/119), and 0% (0/104), respectively (P<.001). Recall rates were 11.9% (13/109) for GPT-3.5 and 13.7% (15/109) for GPT-4, with Bard failing to retrieve any relevant papers. Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard. Further analysis of nonhallucinated papers retrieved by GPT models revealed significant differences in identifying various criteria, such as randomized studies, participant criteria, and intervention criteria. The study also noted geographical and open-access biases in the papers retrieved by the LLMs.

CONCLUSIONS: Given their current performance, it is not recommended for LLMs to be deployed as the primary or exclusive tool for conducting systematic reviews. Any references generated by such models warrant thorough validation by researchers. The high occurrence of hallucinations in LLMs highlights the necessity of refining their training and functionality before confidently using them for rigorous academic purposes.
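The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the `Reference` class, the function names, and the case-insensitive exact string comparison are assumptions made for illustration, since the abstract does not say how generated references were matched against real ones. The counts are the GPT-3.5 figures reported in the abstract; the F1 value is derived here from those counts and is not quoted from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical record holding the 3 fields checked in the abstract."""
    title: str
    first_author: str
    year: int


def is_hallucinated(generated: Reference, real: Reference) -> bool:
    """Flag a reference when at least 2 of the 3 fields are wrong.

    Assumption: fields are compared as case-insensitive exact strings.
    """
    wrong = [
        generated.title.strip().lower() != real.title.strip().lower(),
        generated.first_author.strip().lower() != real.first_author.strip().lower(),
        generated.year != real.year,
    ]
    return sum(wrong) >= 2


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# GPT-3.5 counts as reported in the abstract.
precision = rate(13, 139)           # relevant papers / papers generated
recall = rate(13, 109)              # relevant papers / gold-standard papers
hallucination_rate = rate(55, 139)  # hallucinated papers / papers generated
f1 = f1_score(precision, recall)    # derived here, not reported in the abstract

print(f"precision={precision:.1%} recall={recall:.1%} "
      f"hallucination={hallucination_rate:.1%} F1={f1:.1%}")
```

Under this rule, a generated reference with the right title but a wrong first author and a wrong year counts as hallucinated, while one with only the year wrong does not.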


REFERENCES (31)

CITATIONS (0)
