
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential.

Authors :
Mahmoud, Reema
Shuster, Amir
Kleinman, Shlomi
Arbel, Shimrit
Ianculovici, Clariel
Peleg, Oren
Source :
Journal of Oral & Maxillofacial Surgery (02782391); Mar 2025, Vol. 83, Issue 3, p382-389, 8p
Publication Year :
2025

Abstract

While artificial intelligence has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored. This study aimed to measure and compare the accuracy of 4 leading LLMs on OMS board examination questions and to identify specific areas for improvement.

An in-silico cross-sectional study was conducted to evaluate 4 artificial intelligence chatbots on 714 OMS board examination questions. The predictor variable was the LLM used: LLM 1 (Generative Pretrained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pretrained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA). The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs' ability to correct errors on subsequent attempts and their performance across 11 OMS subject domains: medicine and anesthesia; dentoalveolar and implant surgery; maxillofacial trauma; maxillofacial infections; maxillofacial pathology; salivary glands; oncology; maxillofacial reconstruction; temporomandibular joint anatomy and pathology; craniofacial and clefts; and orthognathic surgery. No additional covariates were considered. Statistical analysis included one-way ANOVA and post hoc Tukey honestly significant difference (HSD) tests to compare performance across chatbots. χ² tests were used to assess response consistency and error correction, with statistical significance set at P < .05.

LLM 1 achieved the highest accuracy, with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, P = .002), LLM 2 (64.83%, P = .001), and LLM 4 (62.18%, P < .001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of its errors and LLM 2 corrected 93.44%, both statistically significantly higher than LLM 3 (70.71%) and LLM 4 (29.26%) (P < .001).

LLM 1 (GPT-4o) significantly outperformed the other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field. [ABSTRACT FROM AUTHOR]
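The statistical workflow described in the abstract (one-way ANOVA with post hoc Tukey HSD across the four chatbots, plus χ² tests on error correction) can be sketched with SciPy. This is a minimal illustration only: the per-domain accuracy scores and the error-correction contingency table below are hypothetical placeholders, not the study's data.

```python
# Sketch of the abstract's analysis pipeline using scipy.stats.
# All numbers are illustrative, not taken from the published study.
from scipy import stats

# Hypothetical per-domain accuracy (%) for each chatbot across
# the 11 OMS subject domains.
llm1 = [84, 82, 85, 83, 86, 81, 84, 85, 82, 83, 84]  # e.g. GPT-4o
llm2 = [65, 63, 66, 64, 67, 62, 65, 66, 63, 64, 65]
llm3 = [67, 66, 68, 65, 69, 64, 67, 68, 66, 67, 66]
llm4 = [62, 60, 63, 61, 64, 59, 62, 63, 61, 62, 62]

# One-way ANOVA: do mean accuracies differ across the four models?
f_stat, p_anova = stats.f_oneway(llm1, llm2, llm3, llm4)

# Post hoc Tukey HSD: pairwise comparisons with family-wise error control.
tukey = stats.tukey_hsd(llm1, llm2, llm3, llm4)

# Chi-square test on error correction: rows are models, columns are
# (errors corrected, errors not corrected) -- hypothetical counts.
table = [[109, 2], [171, 12], [167, 69], [79, 191]]
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(f"ANOVA p = {p_anova:.4g}, chi-square p = {p_chi:.4g}")
```

With group differences this large, both tests fall below the P < .05 threshold; `tukey.pvalue` exposes the full pairwise p-value matrix for reporting individual model-vs-model comparisons.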

Details

Language :
English
ISSN :
02782391
Volume :
83
Issue :
3
Database :
Supplemental Index
Journal :
Journal of Oral & Maxillofacial Surgery (02782391)
Publication Type :
Academic Journal
Accession number :
183034309
Full Text :
https://doi.org/10.1016/j.joms.2024.11.007