Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings

Keywords: Diagnosis, Differential; Humans; Reproducibility of Results; Algorithms
DOI: 10.1148/radiol.232346
Publication Date: 2024-10-15
ABSTRACT
Background: The burgeoning interest in ChatGPT as a potentially useful tool in medicine highlights the necessity for systematic evaluation of its capabilities and limitations.

Purpose: To evaluate the accuracy, reliability, and repeatability of differential diagnoses produced by ChatGPT from transcribed radiologic findings.

Materials and Methods: Cases selected from a radiology textbook series spanning a variety of imaging modalities, subspecialties, and anatomic pathologies were converted into standardized prompts that were entered into ChatGPT (GPT-3.5 and GPT-4 algorithms; April 3 to June 1, 2023). Responses were analyzed for accuracy via comparison with the final diagnosis and top differential diagnoses provided by the textbook, which served as the ground truth. Reliability, defined based on the frequency of algorithmic hallucination, was assessed through identification of factually incorrect statements and fabricated references. Comparisons were made between the algorithms using the McNemar test and a generalized estimating equation model framework. Test-retest repeatability was measured by obtaining 10 independent responses for two cases in each subspecialty and calculating the average pairwise percent agreement and Krippendorff α.

Results: A total of 339 cases were collected across multiple subspecialties. The overall accuracy of GPT-3.5 was 53.7% (182 of 339) and that of GPT-4 was 66.1% (224 of 339;
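The statistical comparisons named in the abstract map onto standard routines. The following is a minimal Python sketch of the McNemar test on paired per-case correctness and of the two repeatability metrics (average pairwise percent agreement and nominal Krippendorff α). The function names and example data are hypothetical illustrations, not the study's code or data; the exact (binomial) McNemar variant and the handling of ties are assumptions, and the generalized estimating equation framework is omitted here.

from collections import Counter
from itertools import combinations

from statsmodels.stats.contingency_tables import mcnemar


def compare_algorithms(correct_a, correct_b):
    # Paired per-case correctness (True = matched textbook ground truth).
    # The McNemar test depends only on the discordant cells of the 2x2 table.
    both = sum(a and b for a, b in zip(correct_a, correct_b))
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    neither = sum(not a and not b for a, b in zip(correct_a, correct_b))
    table = [[both, a_only], [b_only, neither]]
    return mcnemar(table, exact=True)  # exact binomial variant (assumption)


def avg_pairwise_percent_agreement(responses):
    # Agreement across all unordered pairs of repeated responses to one case.
    pairs = list(combinations(responses, 2))
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def krippendorff_alpha_nominal(units):
    # units: one list of ratings per case (e.g., 10 repeated top diagnoses).
    # Builds the coincidence matrix from ordered within-unit pairs, then
    # alpha = 1 - observed disagreement / expected disagreement.
    o = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for i, c in enumerate(ratings):
            for j, k in enumerate(ratings):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_obs = sum(w for (c, k), w in o.items() if c != k) / n
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - d_obs / d_exp


# Hypothetical data for illustration only (not the study's 339 cases):
gpt35 = [True, False, True, True, False, True]
gpt4 = [True, True, True, True, False, True]
print(compare_algorithms(gpt35, gpt4).pvalue)

runs = ["pancreatitis"] * 8 + ["cholecystitis"] * 2  # 10 repeats of one case
print(avg_pairwise_percent_agreement(runs))
print(krippendorff_alpha_nominal([runs, ["appendicitis"] * 10]))

A generalized estimating equation analysis could be layered on with statsmodels' GEE class, though the grouping structure the authors used is not recoverable from the truncated abstract.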