Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan
DOI: 10.1007/s11604-024-01673-6
Publication Date: 2024-10-28T09:54:45Z
AUTHORS (8)
ABSTRACT
Purpose: This study aims to investigate the effects of language selection and translation quality on the response accuracy of Generative Pre-trained Transformer-4 (GPT-4) on expert-level diagnostic radiology questions.
Materials and methods: We analyzed 146 questions from the Japan Radiology Board Examination (2020–2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and by DeepL, and into German and Chinese by GPT-4. Responses were generated five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann–Whitney U test. Scores on a selected subset translated by a professional service were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.
Results: The median scores (interquartile range) were 70 (68–72) for Japanese, 89 (84.5–95.5) for GPT-4-translated English, 64 (55.5–67) for Chinese, and 56 (46.5–67.5) for German. Significant differences from Japanese were found for the English translations (p = 0.002 and p = 0.022). The counts of correct responses across attempts for each question were significantly associated between Japanese and the translations (GPT-4 and DeepL for English; GPT-4 for the other languages). In a subset of 31 questions where the machine translations yielded fewer correct responses than the Japanese originals, professionally translated questions scored better (13 versus 8 points, p = 0.0079).
Conclusion: GPT-4 exhibits higher accuracy when responding to English translations than to the original Japanese questions, a trend not observed for the Chinese and German translations. Accuracy improves with higher-quality translations, underscoring the importance of high-quality translation for improving non-English resources and for aiding non-native English speakers in obtaining accurate responses from large language models.
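The statistical comparison described in the abstract (one-way ANOVA across language groups, followed by pairwise Mann–Whitney U tests with a Bonferroni correction) can be sketched in Python with SciPy. The scores below are illustrative placeholders, not the study's data, and the analysis layout is an assumption based on the methods summary.

```python
# Hedged sketch of the between-language comparison: placeholder scores, not study data.
from scipy import stats

# Five response-accuracy scores per language (one per repeated run).
scores = {
    "Japanese": [70, 68, 72, 69, 71],
    "English_GPT4": [89, 85, 95, 84, 90],
    "Chinese_GPT4": [64, 56, 67, 55, 63],
    "German_GPT4": [56, 47, 67, 46, 58],
}

# One-way ANOVA across the four language groups.
f_stat, p_anova = stats.f_oneway(*scores.values())

# Pairwise comparisons against Japanese with a Bonferroni correction:
# each raw p-value is multiplied by the number of comparisons (capped at 1).
langs = [k for k in scores if k != "Japanese"]
n_comp = len(langs)
for lang in langs:
    u, p_raw = stats.mannwhitneyu(scores["Japanese"], scores[lang])
    p_adj = min(1.0, p_raw * n_comp)
    print(f"{lang}: raw p={p_raw:.4f}, Bonferroni-adjusted p={p_adj:.4f}")
```

Bonferroni adjustment is applied here by hand for transparency; in practice `statsmodels.stats.multitest.multipletests(..., method="bonferroni")` does the same correction.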