1. Evaluating the Efficacy of Natural Language Processing Artificial Intelligence Models as a Patient Education Tool for Stature Lengthening Surgery and Reconstruction
- Author
Jorden L. Xavier, Joseph Khoury, Huai Ming Phen, Austin T. Fragomen, S. Robert Rozbruch, and Mani Kahn
- Subjects
artificial intelligence, ChatGPT, education, stature lengthening, Orthopedic surgery, RD701-811
- Abstract
Introduction: While traditionally a treatment for congenital deformities, limb-lengthening surgery has recently been used as a cosmetic solution for patients dissatisfied with their height. Despite its rise in popularity, stature-lengthening surgery carries a host of social stigmas and ethical concerns. With the emergence of natural language processing artificial intelligence (NLP AI) models such as ChatGPT, patients interested in stature-lengthening surgery may turn to these chatbots for medical information instead of primary care providers. This study evaluates the quality and readability of NLP AI model responses to common questions about stature-lengthening surgery, to ensure proper patient education and direction toward a licensed orthopedic surgeon.
Materials and Methods: Eleven questions regarding limb-lengthening surgery were posed to ChatGPT version 3.5. All named authors reviewed the AI responses independently and used the following scoring systems to assess readability and quality: DISCERN, ChatGPT Response Accuracy Scoring (CRAS) developed by Mika et al., Flesch–Kincaid Reading Ease (FRES), and Flesch–Kincaid Grade Level (FKGL). Cohen’s kappa was used to determine the inter-rater reliability of the reviewers’ assigned DISCERN and CRAS scores.
Results: The average DISCERN score across all independent reviewers was 55.2 ± 5.63, indicating good quality, with an inter-rater reliability of 0.293, indicating fair agreement. The average CRAS score across all senior reviewers was 1.60 ± 0.63, with an inter-rater reliability of 0.014, indicating none to slight agreement. The average FRES score was 14.8 ± 2.24, and the average FKGL score was 18.9 ± 11.0, indicating a required reading level of a college graduate.
Conclusions: The results of our study show that NLP AI models provide generally high-quality responses to patient questions regarding stature-lengthening surgery. However, notable flaws were identified, including the high education level required for proper comprehension and an inability to cite the source material used for each response.
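The readability and agreement statistics named in the abstract follow standard published formulas, which can be sketched as below. This is an illustrative sketch only, not the authors' analysis code: the function names, the word/sentence/syllable counts, and the example ratings are all hypothetical, and the abstract does not state how syllables were counted or how kappa was aggregated across more than two reviewers.

```python
from collections import Counter


def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical inputs, not data from the study.
ease = flesch_reading_ease(words=100, sentences=5, syllables=150)
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
kappa = cohens_kappa([1, 1, 2, 2], [1, 1, 2, 1])
```

Cohen's kappa is defined for exactly two raters, so a study with several reviewers would need either pairwise comparisons or a multi-rater statistic; which of these was used here is not specified in the abstract.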
- Published
- 2024