Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board–style Examination

DOI: 10.1148/radiol.232715 | Publication Date: 2024-05-21
ABSTRACT
Background: ChatGPT (OpenAI) can pass a text-based radiology board–style examination, but its stochasticity and its confident language when it is incorrect may limit utility.

Purpose: To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board–style examination.

Materials and Methods: In this exploratory prospective study, 150 multiple-choice questions, previously used to benchmark ChatGPT, were administered to the default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement of answer choices over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was also prompted to rate its confidence from 1 to 10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt.

Results: Neither version showed a difference in accuracy across the three attempts: for the first, second, and third attempts, GPT-3.5 scored 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06), and GPT-4 scored 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both versions had only moderate intrarater agreement (GPT-4, κ = 0.78; GPT-3.5, κ = 0.64), the answer choices of GPT-4 were more consistent across attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After the challenge prompts, both versions changed their responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both versions rated "high confidence" (≥8 on the 1–10 scale) for most of their initial responses (GPT-3.5, 100% [150 of 150]; GPT-4, 94.0% [141 of 150]) as well as for their incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; GPT-4, 77% [27 of 35]; P = .89).

Conclusion: The default versions of GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both showed poor repeatability and were overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but was more influenced by an adversarial prompt. © RSNA, 2024. Supplemental material is available for this article. See also the editorial by Ballard in this issue.
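The challenge-and-confidence protocol described in Materials and Methods can be expressed as a short script. The sketch below is illustrative only: the study administered questions through the default ChatGPT interface, whereas this uses the OpenAI Python SDK; the model names ("gpt-3.5-turbo", "gpt-4"), the ask() helper, and the wording of the confidence prompt are assumptions, with only the quoted adversarial challenge taken from the abstract.

```python
# Minimal sketch of the repeated-prompting protocol (not the authors' code):
# initial answer, confidence rating, then three adversarial challenges,
# each followed by another confidence rating.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHALLENGE = "Your choice is incorrect. Please choose a different option."
CONFIDENCE = ("Rate your confidence in your answer from 1 to 10, with 10 being "
              "the highest level of confidence and 1 being the lowest.")  # assumed wording

def ask(model: str, question: str, n_challenges: int = 3) -> dict:
    """Pose one multiple-choice question, then challenge the answer repeatedly."""
    messages: list[dict] = []
    record: dict = {"answers": [], "confidences": []}

    def turn(prompt: str) -> str:
        # Send one user message with the full conversation history and store the reply.
        messages.append({"role": "user", "content": prompt})
        reply = client.chat.completions.create(model=model, messages=messages)
        content = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": content})
        return content

    record["answers"].append(turn(question))        # initial attempt
    record["confidences"].append(turn(CONFIDENCE))  # initial confidence rating

    for _ in range(n_challenges):                   # adversarial prompting
        record["answers"].append(turn(CHALLENGE))
        record["confidences"].append(turn(CONFIDENCE))

    return record

# Example usage (question text is a placeholder):
# results = {m: ask(m, "Question 1: ... A) ... B) ... C) ... D) ...")
#            for m in ("gpt-3.5-turbo", "gpt-4")}
```

Keeping the full message history in the turn() helper means each challenge is issued within the same conversation, mirroring how a user would push back on ChatGPT's answer in a single chat session.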