The Comparative Diagnostic Capability of Large Language Models in Otolaryngology
- Authors
- Warrier A, Singh R, Haleem A, Zaki H, Eloy JA
- Subjects
- Humans, Language, Otolaryngology methods
- Abstract
Objectives: To evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology.
Methods: We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases: The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared against the portfolio's diagnoses for accuracy and recorded. All queries were run in June 2023.
Results: ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing-GPT4 (74%). A chi-squared test revealed a significant difference among the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven required additional test results (e.g., biopsy, non-contrast CT) for an accurate clinical diagnosis. When these vignettes were omitted, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002).
Conclusions: ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes, compared with Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses.
Level of Evidence: 3. Laryngoscope, 134:3997-4002, 2024.
(© 2024 The Authors. The Laryngoscope published by Wiley Periodicals LLC on behalf of The American Laryngological, Rhinological and Otological Society, Inc.)
- Published
- 2024