1. Identification of patient demographic, clinical, and SARS-CoV-2 genomic factors associated with severe COVID-19 using supervised machine learning: a retrospective multicenter study.
- Author
Nirmalarajah, Kuganya; Aftanas, Patryk; Barati, Shiva; Chien, Emily; Crowl, Gloria; Faheem, Amna; Farooqi, Lubna; Jamal, Alainna J.; Khan, Saman; Kotwa, Jonathon D.; Li, Angel X.; Mozafarihashjin, Mohammad; Nasir, Jalees A.; Shigayeva, Altynay; Yim, Winfield; Yip, Lily; Zhong, Xi Zoe; Katz, Kevin; Kozak, Robert; and McArthur, Andrew G.
- Subjects
SUPERVISED learning, COVID-19, AMINO acid sequence, PATIENTS' attitudes, CORONAVIRUSES
- Abstract
Background: Drivers of COVID-19 severity are multifactorial and include multidimensional and potentially interacting factors encompassing viral determinants and host-related factors (i.e., demographics, pre-existing conditions and/or genetics), thus complicating the prediction of clinical outcomes for different severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants. Although millions of SARS-CoV-2 genomes have been publicly shared in global databases, linkages with detailed clinical data are scarce. Therefore, we aimed to establish a COVID-19 patient dataset with linked clinical and viral genomic data to then examine associations between SARS-CoV-2 genomic signatures and clinical disease phenotypes.
Methods: A cohort of adult patients with laboratory-confirmed SARS-CoV-2 from 11 participating healthcare institutions in the Greater Toronto Area (GTA) was recruited from March 2020 to April 2022. Supervised machine learning (ML) models were developed to predict hospitalization using SARS-CoV-2 lineage-specific genomic signatures, patient demographics, symptoms, and pre-existing comorbidities. The relative importance of these features was then evaluated.
Results: Complete clinical data and viral whole-genome-level information were obtained from 617 patients, 50.4% of whom were hospitalized. Notably, inpatients were older, with a mean age of 66.67 years (SD ± 17.64 years), whereas outpatients had a mean age of 44.89 years (SD ± 16.00 years). SHapley Additive exPlanations (SHAP) analyses revealed that underlying vascular disease, underlying pulmonary disease, and fever were the most significant clinical features associated with hospitalization. In models built on the amino acid sequences of functional regions including spike, nucleocapsid, ORF3a, and ORF8 proteins, variants preceding the emergence of variants of concern (VOCs), or pre-VOC variants, were associated with hospitalization.
Conclusions: Viral genomic features have limited utility in predicting hospitalization across SARS-CoV-2 diversity. Combining clinical and viral genomic datasets provides perspective on patient-specific and virus-related factors that impact COVID-19 disease severity. Overall, clinical features had greater discriminatory power than viral genomic features in predicting hospitalization. [ABSTRACT FROM AUTHOR]
- Published
- 2025