
Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings.

Authors :
Tsai, Chung-You
Hsieh, Shang-Ju
Huang, Hung-Hsiang
Deng, Juinn-Horng
Huang, Yi-You
Cheng, Pai-Yu
Source :
World Journal of Urology. 2024, Vol. 42 Issue 1, p1-9. 9p.
Publication Year :
2024

Abstract

Purpose: To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan urology board examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty-management tactics to minimize score penalties from incorrect responses across 12 urology domains.

Methods: 450 multiple-choice questions from the TUBE (2020–2022) were presented to the two models. Three urologists assessed the correctness and consistency of each response. Accuracy was defined as the proportion of correct answers; consistency as the proportion of responses with logical and coherent explanations. A penalty-reduction experiment with prompt variations was also conducted. Univariate logistic regression was applied for subgroup comparisons.

Results: ChatGPT-4 showed strengths in urology, achieving an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (33.8%; OR = 2.68, 95% CI [2.05–3.52]). Judged on accuracy alone, it would have passed the TUBE written exams, but it failed on the final score once penalties for incorrect answers were applied. ChatGPT-4's accuracy declined over time and varied across the 12 urological domains, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. A simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency toward overconfidence, which may hinder medical decision-making.

Conclusions: ChatGPT-4's high accuracy and consistent explanations on the urology board examination demonstrate its potential for medical information processing. However, its limitations in self-assessment and its overconfidence call for caution in its application, especially by inexperienced users. These insights call for ongoing development of urology-specific AI tools. [ABSTRACT FROM AUTHOR]
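As a quick check, the odds ratios quoted above can be recovered directly from the paired accuracies, using odds = p/(1 − p). This is a verification sketch by the reader, not a calculation from the paper itself:

\[
\mathrm{OR}_{\text{GPT-4 vs. GPT-3.5}} = \frac{0.578/(1-0.578)}{0.338/(1-0.338)} = \frac{1.370}{0.511} \approx 2.68,
\qquad
\mathrm{OR}_{\text{updated vs. stable domains}} = \frac{0.532/(1-0.532)}{0.622/(1-0.622)} = \frac{1.137}{1.645} \approx 0.69
\]

Both values agree with the odds ratios reported in the abstract.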

Details

Language :
English
ISSN :
0724-4983
Volume :
42
Issue :
1
Database :
Academic Search Index
Journal :
World Journal of Urology
Publication Type :
Academic Journal
Accession number :
176793569
Full Text :
https://doi.org/10.1007/s00345-024-04957-8