
Can Language Models Extract Features from Pathology Reports and Stage Tumors like a Physician?

Authors :
Parker, M.I.
Li, B.
Green, W.R.
McBeth, R.
Source :
International Journal of Radiation Oncology, Biology, Physics. 2024 Supplement, Vol. 120, Issue 2, p. e650. 1p.
Publication Year :
2024

Abstract

Language models (LMs), such as GPT-3.5 and GPT-4, have achieved expert-level performance on medical question-answering datasets. However, few studies have compared LMs to physicians on real-world clinical tasks. Here we aimed to determine whether LMs can extract features relevant to tumor staging from pathology reports (Task 1) and whether LMs can "reason out" tumor staging from the extracted features (Task 2). We randomly selected 40 breast cancer cases with pathological T and N stages (AJCC 6th edition) from the TCGA-Reports dataset. For Task 1, we instructed the LMs (GPT-3.5 and GPT-4) and a physician (a Radiation Oncologist) to extract features relevant to tumor staging from each pathology report. We then measured percent agreement, defined as an exact match of the extracted features, between the LMs and the physician. For Task 2, we either (a) instructed the LMs and the physician to determine T and N stages from the extracted features given the AJCC staging criteria ("Reasoning Approach") or (b) passed the extracted features to a deterministic Python function for staging ("Function Approach"). Finally, we compared the Matthews correlation coefficient (MCC) of each approach between the LMs and the physician. The physician could determine pathological T and N stages for 33/40 cases. Analyzing this subset for Task 1, we found that exact-match agreement between the LMs and the physician ranged from 76-100% with GPT-3.5 and 94-100% with GPT-4 across features. Examining the few features the LMs extracted incorrectly, we found that they made repeated, explainable mistakes; for example, the LMs occasionally extracted the size of the entire mass instead of the largest focus. Comparing MCCs for final tumor staging (Task 2), we found that (a) GPT-4 performed nearly as well as the physician irrespective of approach, while (b) GPT-3.5 performed worse overall but improved slightly with the "Function Approach" (Table 1). Per case, GPT-3.5 cost $0.005 and took 4 seconds, while GPT-4 cost $0.10 and took 16 seconds. LMs can thus stage tumors comparably to a physician with minimal cost and time. In addition, we identified potential ways of improving LM performance, such as providing additional instructions for common edge cases or using deterministic Python functions to circumvent LM reasoning. Overall, this study lays the groundwork for a more extensive evaluation of other LMs on real-world clinical tasks involving larger datasets, generating clinical documents, and compiling oncologic histories.
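The "Function Approach" replaces the LM's staging reasoning with deterministic code. Below is a minimal Python sketch of such a staging function, assuming simplified AJCC 6th edition breast criteria (T and N categories driven by invasive tumor size, chest wall/skin involvement, and positive axillary node count; sub-stages and special presentations omitted). The function name, feature schema, and example values are illustrative assumptions, not the authors' published implementation.

```python
from sklearn.metrics import matthews_corrcoef  # for the MCC comparison

def stage_breast_tumor(size_cm: float, chest_wall_or_skin: bool,
                       positive_axillary_nodes: int) -> tuple:
    """Return (T, N) pathological stages from extracted report features.

    Simplified AJCC 6th edition breast cutoffs; sub-stages (T1a-c,
    N1mi, etc.) and special presentations are deliberately omitted.
    """
    # T category: chest wall/skin involvement takes precedence over size.
    if chest_wall_or_skin:
        t_stage = "T4"
    elif size_cm <= 2.0:
        t_stage = "T1"
    elif size_cm <= 5.0:
        t_stage = "T2"
    else:
        t_stage = "T3"

    # N category: count of positive axillary lymph nodes.
    if positive_axillary_nodes == 0:
        n_stage = "N0"
    elif positive_axillary_nodes <= 3:
        n_stage = "N1"
    elif positive_axillary_nodes <= 9:
        n_stage = "N2"
    else:
        n_stage = "N3"

    return t_stage, n_stage

# Example: a 2.4 cm largest focus, no chest wall/skin involvement,
# and two positive axillary nodes -> ('T2', 'N1').
print(stage_breast_tumor(2.4, False, 2))

# Hypothetical comparison of staging calls against physician labels
# using the multiclass Matthews correlation coefficient.
physician_stages = ["T2", "T1", "T3", "T2"]
model_stages     = ["T2", "T1", "T2", "T2"]
print(matthews_corrcoef(physician_stages, model_stages))
```

Because the mapping is deterministic, any staging error traces back to a mis-extracted feature rather than to the model's reasoning, which is the advantage the abstract attributes to this approach.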

Details

Language :
English
ISSN :
0360-3016
Volume :
120
Issue :
2
Database :
Academic Search Index
Journal :
International Journal of Radiation Oncology, Biology, Physics
Publication Type :
Academic Journal
Accession number :
179876347
Full Text :
https://doi.org/10.1016/j.ijrobp.2024.07.1428