Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine.

Authors :
Woo JJ
Yang AJ
Olsen RJ
Hasan SS
Nawabi DH
Nwachukwu BU
Williams RJ 3rd
Ramkumar PN
Source :
Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association [Arthroscopy] 2025 Mar; Vol. 41 (3), pp. 565-573.e6. Date of Electronic Publication: 2024 Nov 07.
Publication Year :
2025

Abstract

Purpose: To show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case.

Methods: A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source models (OpenAI GPT4/GPT3.5 and Anthropic's Claude3) and open-source models (Llama3 8b/70b and Mistral 8×7b) were asked questions in base form and again with AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores were calculated to assess semantic similarity in the responses.

Results: All noncustom LLMs started below 60% accuracy. Applying RAG improved the accuracy of every model by an average of 39.7%. The highest performing model with just RAG was Meta's open-source Llama3 70b (94%). The highest performing model with RAG and AI agents was OpenAI's GPT4 (95%).

Conclusions: RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% in the Meta Llama3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved the accuracy rate of ChatGPT4 to 95%. Thus, Agentic and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis.

Clinical Relevance: Despite literature surrounding the use of LLMs in medicine, there has been considerable and appropriate skepticism given the variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, online medical information commonly sought in popular LLMs, such as ChatGPT, can be standardized and provide relevant online medical information to better support shared decision making between surgeon and patient.

Competing Interests: The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: P.N.R. has received support from Pacira for consulting, Stryker for consulting, Globus for consulting and royalties, Intelligent Health Analytics, Inc., for ownership and overture for ownership. The remaining authors (J.J.W., A.J.Y., R.J.O., S.S.H., D.H.N., B.U.N., R.J.W.) declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

(Copyright © 2024 Arthroscopy Association of North America. Published by Elsevier Inc. All rights reserved.)

Details

Language :
English
ISSN :
1526-3231
Volume :
41
Issue :
3
Database :
MEDLINE
Journal :
Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association
Publication Type :
Academic Journal
Accession number :
39521391
Full Text :
https://doi.org/10.1016/j.arthro.2024.10.042