1. Identifying cardiovascular disease risk in the U.S. population using environmental volatile organic compounds exposure: A machine learning predictive model based on the SHAP methodology
- Author
-
Qingan Fu, Yanze Wu, Min Zhu, Yunlei Xia, Qingyun Yu, Zhekang Liu, Xiaowei Ma, and Renqiang Yang
- Subjects
Cardiovascular disease ,Volatile organic compounds ,Machine learning ,SHAP ,Environmental exposure ,Environmental pollution ,TD172-193.5 ,Environmental sciences ,GE1-350 - Abstract
Background: Cardiovascular disease (CVD) remains a leading cause of mortality globally. Environmental pollutants, specifically volatile organic compounds (VOCs), have been identified as significant risk factors. This study aims to develop a machine learning (ML) model to predict CVD risk based on VOC exposure and demographic data using SHapley Additive exPlanations (SHAP) for interpretability. Methods: We utilized data from the National Health and Nutrition Examination Survey (NHANES) from 2011 to 2018, comprising 5098 participants. VOC exposure was assessed through 15 urinary metabolite metrics. The dataset was split into a training set (70 %) and a test set (30 %). Six ML models were developed, including Random Forest (RF), Light Gradient Boosting Machine (LightGBM), Decision Tree (DT), Extreme Gradient Boosting (XGBoost), Multi-Layer Perceptron (MLP), and Support Vector Machines (SVM). Model performance was evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC), accuracy, balanced accuracy, F1 score, J-index, kappa, Matthew's correlation coefficient (MCC), positive predictive value (PPV), negative predictive value (NPV), sensitivity (sens), specificity (spec) and SHAP was applied to interpret the best-performing model. Results: The RF model exhibited the highest predictive performance with an ROC of 0.8143. SHAP analysis identified age and ATCA as the most significant predictors, with ATCA showing a protective effect against CVD, particularly in older adults and those with hypertension. The study found a significant interaction between ATCA levels and age, indicating that the protective effect of ATCA is more pronounced in older individuals due to increased oxidative stress and inflammatory responses associated with aging. E-values analysis suggested robustness to unmeasured confounders. Conclusions: This study is the first to utilize VOC exposure data to construct an ML model for predicting CVD risk. The findings highlight the potential of combining environmental exposure data with demographic information to enhance CVD risk prediction, supporting the development of personalized prevention and intervention strategies.
- Published
- 2024
- Full Text
- View/download PDF