A Comparative Study of Large Language Models for Generating Summaries of Breast Cancer Patient-Reported Treatment Toxicities.
- Source :
- International Journal of Radiation Oncology, Biology, Physics. 2024 Supplement, Vol. 120, Issue 2, p. e666-e666. 1p.
- Publication Year :
- 2024
Abstract
- Recent advances in artificial intelligence such as large language models (LLMs) offer a promising avenue for enhancing clinical documentation and monitoring patient-reported outcomes (PRO). This study aims to compare four leading open-source and proprietary LLMs, Mixtral-8x7B, Llama-2, Qwen-1.5, and GPT-4, in generating summaries of patient-reported symptoms using an adapted Physician Documentation Quality Index (PDQI). A previously reported web-based application utilizing 35 items from the PRO-CTCAE scale was used to create an interactive form for breast cancer patients to report treatment-related symptoms. The four LLMs were used to generate natural language summaries for four hypothetical patients with non-identifiable patient data. Twelve resident physician raters evaluated the summaries using an abbreviated PDQI questionnaire, rating accuracy, usefulness, comprehensibility, and succinctness on a 5-point Likert scale. IRB approval was not required in accordance with the NIH 2018 Revised Common Rule requirements, as the study used researcher-generated, non-identifiable data. Forty-seven physician ratings were collected. A repeated measures ANOVA showed significant differences in accuracy among the models (F(2.120, 97.52) = 15.30, p < 0.0001), with Mixtral-8x7B (M = 3.60), GPT-4 (M = 3.78), and Qwen-1.5 (M = 3.62) significantly surpassing Llama-2 (M = 2.77, p ≤ 0.001) and no significant differences among the three. Mixtral-8x7B (M = 3.47) outperformed Llama-2 in usefulness (M = 2.87, p < 0.05) and outperformed Llama-2 and Qwen-1.5 in succinctness (p < 0.05). No significant differences were found in comprehensibility. Reviewers noted 6 mistakes in Llama-2 summaries and 1 mistake each in Mixtral-8x7B and Qwen-1.5 summaries. This study demonstrates that the latest open-source LLMs, such as Mixtral-8x7B and Qwen-1.5, can match the performance of their closed-source counterpart, GPT-4, on physician-rated measures of documentation quality and outperform their predecessor, Llama-2. This study highlights the narrowing gap between open-source and proprietary LLMs in medical documentation. These findings may help inform the strategic selection of cost-effective and data-safe LLMs for future clinical research and practice integration, potentially democratizing advanced AI tools for a broader healthcare audience. However, further validation in real-world clinical settings is necessary to assess the impact of these models on patient care efficiency and efficacy. [ABSTRACT FROM AUTHOR]
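The analysis reported in the abstract (a repeated measures ANOVA across the four models' ratings, followed by pairwise comparisons) can be outlined in code. The sketch below is illustrative only and is not the authors' analysis: the model names and the twelve raters come from the abstract, while the simulated ratings, the choice of statsmodels and scipy, and the Bonferroni-adjusted paired t-tests are assumptions. It also omits the sphericity correction (e.g., Greenhouse-Geisser) implied by the fractional degrees of freedom (2.120) reported above.

```python
# Minimal sketch of a repeated-measures comparison of physician ratings
# across four LLM-generated summaries, using hypothetical data.
# Requires numpy, pandas, scipy, and statsmodels.
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
models = ["Mixtral-8x7B", "Llama-2", "Qwen-1.5", "GPT-4"]

# Long-format table: one hypothetical accuracy rating (1-5 Likert) per rater x model.
ratings = pd.DataFrame(
    [
        {"rater": r, "model": m, "accuracy": int(np.clip(rng.normal(3.5, 1.0), 1, 5))}
        for r in range(12)
        for m in models
    ]
)

# Omnibus test: does mean accuracy differ across models within raters?
anova = AnovaRM(ratings, depvar="accuracy", subject="rater", within=["model"]).fit()
print(anova.anova_table)

# Post-hoc paired comparisons between models, Bonferroni-adjusted alpha.
pairs = list(combinations(models, 2))
alpha = 0.05 / len(pairs)
wide = ratings.pivot(index="rater", columns="model", values="accuracy")
for a, b in pairs:
    t, p = ttest_rel(wide[a], wide[b])
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.4f} (significant if p < {alpha:.4f})")
```

In practice, a sphericity-corrected ANOVA and the authors' actual post-hoc procedure would be needed to reproduce the reported statistics; this sketch only shows the general shape of such an analysis.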
Details
- Language :
- English
- ISSN :
- 03603016
- Volume :
- 120
- Issue :
- 2
- Database :
- Academic Search Index
- Journal :
- International Journal of Radiation Oncology, Biology, Physics
- Publication Type :
- Academic Journal
- Accession number :
- 179876380
- Full Text :
- https://doi.org/10.1016/j.ijrobp.2024.07.1461