Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study

Dental research
DOI: 10.1371/journal.pone.0317423 Publication Date: 2025-01-29T18:25:52Z
ABSTRACT
This study aims to evaluate the performance of the latest large language models (LLMs) in answering dental multiple choice questions (MCQs), including both text-based and image-based questions. A total of 1,490 MCQs from two board review books for the United States National Board Dental Examination were selected. Six LLMs were evaluated as of August 2024: ChatGPT 4.0 omni (OpenAI), Gemini Advanced 1.5 Pro (Google), Copilot with GPT-4 Turbo (Microsoft), Claude 3.5 Sonnet (Anthropic), Mistral Large 2 (Mistral AI), and Llama 3.1 405b (Meta). χ2 tests were performed to determine whether there were significant differences in the percentages of correct answers among the LLMs for the total sample and for each discipline (p < 0.05). Significant differences were observed in the percentage of accurate answers across the questions (p < 0.001). For the total sample, the three best-performing models demonstrated the highest accuracy (85.5%, 84.0%, and 83.8%), followed by models at 78.3% and 77.1%, with the lowest-performing model at 72.4%. Newer versions of the LLMs demonstrated superior performance compared with earlier versions. Several models, including Copilot and Claude, achieved high accuracy, although capability in handling image-based questions remained limited for some. Clinicians and students should prioritize the most up-to-date LLMs when using them to support their learning, clinical practice, and research.
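To make the statistical comparison described above concrete, the following is a minimal sketch of a χ2 test of correct-answer proportions across six models. The per-model counts below are hypothetical, back-calculated from the reported accuracy percentages applied to 1,490 questions; they are not the authors' data or code.

```python
# Minimal sketch (not from the paper): chi-square test of correct vs. incorrect
# answers across six LLMs, using illustrative counts derived from the abstract's
# reported percentages out of 1,490 questions.
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = models, columns = [correct, incorrect]
counts = [
    [1274, 216],  # ~85.5% correct
    [1252, 238],  # ~84.0% correct
    [1249, 241],  # ~83.8% correct
    [1167, 323],  # ~78.3% correct
    [1149, 341],  # ~77.1% correct
    [1079, 411],  # ~72.4% correct
]

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
# A p-value below 0.05 indicates a significant difference in accuracy among
# the models, consistent with the p < 0.001 reported in the abstract.
```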