Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings
KEYWORDS
Diagnosis, Differential; Humans; Reproducibility of Results; Algorithms
DOI: 10.1148/radiol.232346
Publication Date: 2024-10-15
ABSTRACT
Background: The burgeoning interest in ChatGPT as a potentially useful tool in medicine highlights the necessity for systematic evaluation of its capabilities and limitations.
Purpose: To evaluate the accuracy, reliability, and repeatability of differential diagnoses produced by ChatGPT from transcribed radiologic findings.
Materials and Methods: Cases selected from a radiology textbook series spanning a variety of imaging modalities, subspecialties, and anatomic pathologies were converted into standardized prompts that were entered into ChatGPT (GPT-3.5 and GPT-4 algorithms; April 3 to June 1, 2023). Responses were analyzed for accuracy via comparison with the final diagnosis and top differential diagnoses provided in the textbook, which served as the ground truth. Reliability, defined based on the frequency of algorithmic hallucination, was assessed through identification of factually incorrect statements and fabricated references. Comparisons between algorithms were made using the McNemar test and a generalized estimating equation model framework. Test-retest repeatability was measured by obtaining 10 independent responses from both algorithms for cases in each subspecialty and calculating the average pairwise percent agreement and Krippendorff α.
Results: A total of 339 cases were collected across multiple subspecialties. The overall accuracy of GPT-3.5 was 53.7% (182 of 339) and that of GPT-4 was 66.1% (224 of 339; …
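The Materials and Methods name three statistical procedures: the McNemar test for comparing the two algorithms' paired per-case accuracy, and average pairwise percent agreement together with nominal Krippendorff α for test-retest repeatability. The Python below is a minimal sketch of how these metrics are computed; the function names, toy correctness vectors, and diagnosis labels are invented for illustration and are not the study's code or data.

# Minimal sketch of the three metrics named in the abstract.
# All data below are hypothetical; requires scipy for the chi-square tail.
from collections import Counter
from itertools import combinations
from scipy.stats import chi2

def mcnemar_test(correct_a, correct_b):
    """McNemar chi-square (continuity corrected) on paired binary outcomes."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

def pairwise_percent_agreement(responses):
    """Fraction of agreeing pairs among all C(m, 2) pairs of repeated responses."""
    pairs = list(combinations(responses, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff alpha; `units` holds one list of labels per case."""
    coincidence = Counter()  # ordered-pair coincidences, each weighted 1/(m-1)
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for i, vi in enumerate(values):
            for j, vj in enumerate(values):
                if i != j:
                    coincidence[(vi, vj)] += 1.0 / (m - 1)
    n = sum(coincidence.values())  # total pairable values
    marginals = Counter()
    for (vi, _), w in coincidence.items():
        marginals[vi] += w
    d_obs = sum(w for (vi, vj), w in coincidence.items() if vi != vj) / n
    d_exp = sum(marginals[c] * marginals[k]
                for c in marginals for k in marginals if c != k) / (n * (n - 1))
    return 1.0 - d_obs / d_exp

# Toy example: per-case correctness for two models, and 10 repeated
# diagnosis labels for two cases (values invented for illustration).
gpt35_correct = [True, False, True, False, False, True]
gpt4_correct  = [True, True,  True, False, True,  True]
print(mcnemar_test(gpt35_correct, gpt4_correct))

repeats = [["pneumonia"] * 8 + ["abscess"] * 2,
           ["meningioma"] * 9 + ["schwannoma"]]
print(sum(pairwise_percent_agreement(r) for r in repeats) / len(repeats))
print(krippendorff_alpha_nominal(repeats))

Percent agreement ignores chance, so it runs high for skewed label distributions; Krippendorff α corrects for expected disagreement, which is why repeatability studies typically report both, as this abstract does.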