Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes

Authors :
Xinsong Du
John Novoa-Laurentiev
Joseph M. Plasek
Ya-Wen Chuang
Liqin Wang
Gad A. Marshall
Stephanie K. Mueller
Frank Chang
Surabhi Datta
Hunki Paek
Bin Lin
Qiang Wei
Xiaoyan Wang
Jingqi Wang
Hao Ding
Frank J. Manion
Jingcheng Du
David W. Bates
Li Zhou
Source :
EBioMedicine, Vol 109, Article 105401 (2024)
Publication Year :
2024
Publisher :
Elsevier, 2024.

Abstract

Background: Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records is less explored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes, comparing their error profiles with those of traditional models. The insights gained will inform strategies for performance enhancement.

Methods: This study, conducted at Mass General Brigham in Boston, MA, analysed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. We developed prompts for two LLMs, Llama 2 and GPT-4, on Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud-computing platforms using multiple approaches (e.g., hard prompting, retrieval augmented generation, and error analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed an ensemble of the three models using a majority vote approach. Confusion-matrix-based scores were used for model evaluation.

Findings: For model development, we used a random annotated sample of 4949 note sections from 1969 patients (women: 1046 [53.1%]; age: mean, 76.0 [SD, 13.3] years), filtered with keywords related to cognitive functions. For testing, we used a random annotated sample of 1996 note sections from 1161 patients (women: 619 [53.3%]; age: mean, 76.5 [SD, 10.2] years) without keyword filtering. GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models in terms of all evaluation metrics with statistical significance (p
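The majority-vote ensemble and confusion-matrix-based scores mentioned in the Methods can be illustrated with a minimal sketch. This is not the authors' code: the function names (`majority_vote`, `confusion_scores`), the three per-model prediction lists, and the toy labels are all hypothetical, and the labels are assumed to be binary (1 = cognitive decline mentioned, 0 = not).

```python
from typing import Dict, List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary predictions from an odd number of models by majority vote."""
    n_models = len(predictions)
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    # For each note section, predict 1 when more than half of the models say 1.
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


def confusion_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute scores derived from the 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels and predictions (hypothetical, not data from the study).
y_true = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 1]
model_b = [1, 1, 1, 0, 0, 0]
model_c = [0, 0, 1, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])
scores_ensemble = confusion_scores(y_true, ensemble)
scores_model_a = confusion_scores(y_true, model_a)
```

In this toy example each model makes errors on different items, so the vote cancels them out; whether that holds on real notes depends on how correlated the models' errors are.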

Details

Language :
English
ISSN :
23523964
Volume :
109
Article number :
105401
Database :
Directory of Open Access Journals
Journal :
EBioMedicine
Publication Type :
Academic Journal
Accession number :
edsdoj.4097d9dc00e4f35b8c9363f5bb0b33b
Document Type :
article
Full Text :
https://doi.org/10.1016/j.ebiom.2024.105401