735 results on '"shapley additive explanations"'
Search Results
2. Predicting interfacial tension in brine-hydrogen/cushion gas systems under subsurface conditions: Implications for hydrogen geo-storage.
- Author
-
Hosseini, Mostafa and Leonenko, Yuri
- Abstract
Underground hydrogen storage (UHS) critically relies on cushion gas to maintain pressure balance during injection and withdrawal cycles, prevent excessive water inflow, and expand storage capacity. Interfacial tension (IFT) between brine and hydrogen/cushion gas mixtures is a key factor affecting fluid dynamics in porous media. This study develops four machine learning models—Decision Trees (DT), Random Forests (RF), Support Vector Machines (SVM), and Multi-Layer Perceptrons (MLP)—to predict IFT under geo-storage conditions. These models incorporate variables such as pressure, temperature, molality, overall gas density, and gas composition to evaluate the impact of different cushion gases. A group-based data splitting method enhances the realism of our tests by preventing information leakage between training and testing datasets. Shapley Additive Explanations (SHAP) reveal that while the MLP model prioritizes gas composition, the RF model focuses more on operational parameters such as pressure and temperature, showing distinct predictive dynamics. The MLP model excels, achieving a coefficient of determination (R²) of 0.96, a root mean square error (RMSE) of 2.10 mN/m, and an average absolute relative deviation (AARD) of 3.25%. This robustness positions the MLP model as a reliable tool for predicting IFT values between brine and hydrogen/cushion gas(es) mixtures beyond the confines of the studied dataset. The findings present a promising approach to optimizing hydrogen geo-storage through accurate IFT predictions, with significant implications for the advancement of energy storage technologies.
• ML techniques were used to estimate IFT in H2/cushion gas(es)–brine systems.
• A novel data splitting method focused on gas composition over sample quantity.
• Inputs were pressure, temperature, molality, overall gas density, and gas composition.
• The MLP model outperformed other models.
• The Shapley Additive Explanations (SHAP) approach was used to interpret the results. [ABSTRACT FROM AUTHOR]
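The group-based splitting this abstract describes can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps every sample from a given gas composition on one side of the split. Everything below (dataset shape, group count, column meanings) is an illustrative assumption, not data from the paper:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical IFT dataset: rows are measurements; "groups" encodes which
# cushion-gas composition each sample came from (all values are synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))          # pressure, temperature, molality, density, composition
y = rng.normal(loc=30.0, scale=5.0, size=120)  # IFT in mN/m (synthetic)
groups = rng.integers(0, 8, size=120)  # 8 distinct gas compositions

# Split so that no gas composition appears in both train and test,
# preventing information leakage between the two sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
```

Because whole compositions are held out, test performance then reflects generalization to unseen gas mixtures rather than interpolation within already-seen ones.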
- Published
- 2024
- Full Text
- View/download PDF
3. Explainable machine learning model for predicting the risk of significant liver fibrosis in patients with diabetic retinopathy.
- Author
-
Zhu, Gangfeng, Yang, Na, Yi, Qiang, Xu, Rui, Zheng, Liangjian, Zhu, Yunlong, Li, Junyan, Che, Jie, Chen, Cixiang, Lu, Zenghong, Huang, Li, Xiang, Yi, and Zheng, Tianlei
- Abstract
Background: Diabetic retinopathy (DR), a prevalent complication in patients with type 2 diabetes, has attracted increasing attention. Recent studies have explored a plausible association between retinopathy and significant liver fibrosis. The aim of this investigation was to develop a machine learning (ML) model, leveraging comprehensive clinical datasets, to forecast the likelihood of significant liver fibrosis in patients with retinopathy, and to interpret the model by applying the SHapley Additive exPlanations (SHAP) method. Methods: This inquiry was based on data from the National Health and Nutrition Examination Survey 2005–2008 cohort. Using the Fibrosis-4 index (FIB-4), liver fibrosis was stratified across a spectrum of grades (F0–F4). The severity of retinopathy was determined using retinal imaging and segmented into four discrete gradations. A ten-fold cross-validation approach was used to gauge the propensity towards liver fibrosis. Eight ML methodologies were used: Extreme Gradient Boosting, Random Forest, multilayer perceptron, Support Vector Machines, Logistic Regression (LR), Naive Bayes, Decision Tree, and k-nearest neighbors. The efficacy of these models was gauged using metrics such as the area under the curve (AUC). The SHAP method was deployed to unravel feature importance and explicate the inner workings of the ML model. Results: The analysis included 5,364 participants, of whom 2,116 (39.45%) exhibited notable liver fibrosis. Following random allocation, 3,754 individuals were assigned to the training set and 1,610 to the validation cohort. Nine variables were curated for integration into the ML model. Among the eight ML models scrutinized, the LR model achieved the highest AUC (0.867, 95% CI: 0.855–0.878) and F1 score (0.749, 95% CI: 0.732–0.767). In internal validation, this model sustained its superiority, with an AUC of 0.850 and an F1 score of 0.736, surpassing all other ML models.
The SHAP method ranked the foremost factors by importance. Conclusion: Sophisticated ML models were crafted from clinical data to discern the propensity for significant liver fibrosis in patients with retinopathy, enabling early intervention. Practice implications: Improved early detection of liver fibrosis risk in retinopathy patients enhances clinical intervention outcomes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Multi-factors effects analysis of nonlinear vibration of FG-GNPRC membrane using machine learning.
- Author
-
Ni, Zhi, Yang, Jinlong, Fan, Yucheng, Hang, Ziyan, Zeng, Bowen, and Feng, Chuang
- Subjects
-
ARTIFICIAL neural networks, MACHINE learning, COMPOSITE membranes (Chemistry), NONLINEAR analysis, COUPLINGS (Gearing)
- Abstract
Functionally graded graphene nanoplatelet reinforced composites (FG-GNPRC) have exhibited significant potential for the development of high-performance and multifunctional structures. In this paper, we present a machine learning (ML) assisted uncertainty analysis of the nonlinear vibration of FG-GNPRC membranes under multi-factor coupling. Effective medium theory (EMT), the Mori-Tanaka (MT) model, and the rule of mixtures are utilized to evaluate the effective material properties of the composite membrane. Governing equations are derived via an energy method within the frameworks of hyperelastic membrane theory, the Neo-Hookean constitutive model, and the coupled dielectric theory. Randomly generated inputs, after data pre-processing, are fed into the governing equations, which are solved numerically for outputs. Three ML models, including an artificial neural network (ANN), support vector regression (SVR), and AutoGluon-Tabular (AGT), are adopted to capture the complex relationship between the system inputs (i.e., structural dimensions, attributes of GNPs and pores, external electric field, etc.), the frequency ratio, and the dimensionless amplitude of FG-GNPRC membranes. The results show that all three ML models achieve exceptional computational efficiency, with AGT offering higher prediction accuracy than the other two. Based on the Shapley additive explanations (SHAP) approach, the effects of parameter uncertainties and multi-factor coupling on the nonlinear vibration of the FG-GNPRC membrane are analyzed. The uncertainty of structural parameters is found to have the greatest impact on the nonlinear vibration of FG-GNPRC membranes, particularly when the membrane is subjected to a voltage of 10 V and a smaller stretching ratio. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. Prediction model for the compressive strength of rock based on stacking ensemble learning and shapley additive explanations.
- Author
-
Wu, Luyuan, Li, Jianhui, Zhang, Jianwei, Wang, Zifa, Tong, Jingbo, Ding, Fei, Li, Meng, Feng, Yi, and Li, Hui
- Abstract
Accurately predicting the compressive strength of rock (RCS) is crucial for the construction and maintenance of rock engineering projects. However, RCS prediction based on single machine learning (ML) algorithms often faces issues such as parameter sensitivity and inadequate generalization. To address these challenges, a new RCS prediction model based on a stacking ensemble learning method was proposed. This method combines multiple ML algorithms to achieve more accurate and stable prediction results. Firstly, 442 sets of rock mechanics experimental data were collected to form the prediction dataset, and data preprocessing techniques, including missing value imputation and normalization, were applied for data cleaning and standardization. Secondly, nine classic ML algorithms were used to establish RCS prediction models, and the optimal configurations were determined using k-fold cross-validation and Bayesian optimization. The selected base learners were LightGBM, Random Forest, and XGBoost, and the meta-learners were Ridge, Lasso, and Linear Regression. Finally, the models were verified on the test set, and the comparison showed that the proposed stacking models outperformed all single models. Notably, the Stacking-LR model exhibited the best predictive accuracy (R² = 0.946, MAE = 5.59, MAPE = 9.94%). Furthermore, the Shapley Additive exPlanations (SHAP) method was introduced to analyze the impact and dependencies of input features on the prediction results. Young's modulus and confining pressure were found to be the most critical parameters influencing RCS, both exerting a positive impact on the prediction results. This finding is consistent with domain expert knowledge, enhances the model's interpretability, and provides robust support for the predicted results. [ABSTRACT FROM AUTHOR]
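A minimal sketch of the stacking architecture described above, using scikit-learn's `StackingRegressor` on synthetic data; `RandomForestRegressor` and `GradientBoostingRegressor` stand in for the paper's LightGBM/RF/XGBoost base learners, and every dataset detail is an assumption:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 442-sample rock-mechanics dataset.
X, y = make_regression(n_samples=442, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners produce out-of-fold predictions that feed a Ridge
# meta-learner, mirroring the two-level stacking design.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_train, y_train)
r2 = stack.score(X_test, y_test)
```

The meta-learner sees only cross-validated base predictions, which is what lets the stack correct the individual models' systematic errors instead of memorizing them.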
- Published
- 2024
- Full Text
- View/download PDF
6. A model for predicting academic performance on standardised tests for lagging regions based on machine learning and Shapley additive explanations.
- Author
-
Suaza-Medina, Mario, Peñabaena-Niebles, Rita, and Jubiz-Diaz, Maria
- Subjects
-
DATA mining, STANDARDIZED tests, EDUCATIONAL quality, ACADEMIC achievement, SOCIOECONOMIC factors
- Abstract
Data are becoming more important in education since they allow for the analysis and prediction of future behaviour to improve academic performance and quality at educational institutions. However, academic performance is affected by regions' conditions, such as demographic, psychographic, socioeconomic and behavioural variables, especially in lagging regions. This paper presents a methodology based on applying nine classification algorithms and Shapley values to identify the variables that influence performance on the Colombian standardised test, the Saber 11 exam. This study is innovative because, unlike others, it applies to lagging regions and combines the use of EDM and Shapley values to predict students' academic performance and analyse the influence of each variable on it. The results show that the algorithms with the best accuracy are Extreme Gradient Boosting Machine, Light Gradient Boosting Machine, and Gradient Boosting Machine. According to the Shapley values, the most influential variables are the socioeconomic level index, gender, region, location of the educational institution, and age. For Colombia, the results showed that male students over 18 years old from urban educational institutions have the best academic performance. Moreover, there are differences in educational quality among the lagging regions: students from Nariño have advantages over those from other departments. The proposed methodology allows for generating public policies better aligned with the reality of lagging regions and achieving equity in access to education. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. Dynamic changes in hs-CRP and risk of all-cause mortality among middle-aged and elderly adults: findings from a nationwide prospective cohort and mendelian randomization.
- Author
-
Wang, Zhonghai, Xiong, Feng, Zhang, Quanbo, and Wang, Han
- Abstract
Introduction: In the general population, mortality is associated with elevated levels of high-sensitivity C-reactive protein (hs-CRP). We aimed to assess the association of longitudinal trajectories in hs-CRP levels with all-cause mortality in Chinese participants. Methods: We utilized data from the China Health and Retirement Longitudinal Study (CHARLS). The exposures were dynamic changes in hs-CRP and cumulative hs-CRP from 2012 to 2015, and the outcome was all-cause mortality. All participants were categorized into four trajectories according to hs-CRP levels. Multivariable logistic regression analysis, adjusted for potential confounders, was employed to evaluate the relationship of the different hs-CRP trajectories with mortality risk. A two-sample Mendelian randomization (TSMR) method and SHapley Additive exPlanations (SHAP) were also employed to identify determinants of mortality risk. Results: The study included 5,445 participants with 233 deaths observed, yielding a mortality proportion of 4.28%. Compared to individuals maintaining low, stable levels of hs-CRP (Class 1), individuals with sustained elevated hs-CRP (Class 4), those experiencing a progressive rise in hs-CRP (Class 2), and those transitioning from elevated to reduced hs-CRP (Class 3) all faced a significantly heightened death risk, with adjusted Odds Ratios (ORs) ranging from 2.34 to 2.47 across models. Moreover, a non-linear relationship was found between hs-CRP levels and mortality. Further TSMR analysis also supported these findings. SHAP showed that hs-CRP was the fifth most important determinant of mortality risk. Conclusions: Our study shows that all-cause mortality increases with dynamic changes in hs-CRP levels among middle-aged and elderly adults in China, and cumulative hs-CRP shows an L-shaped relationship with all-cause mortality. [ABSTRACT FROM AUTHOR]
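Adjusted odds ratios of the kind reported above come from exponentiating logistic regression coefficients. A hedged sketch on synthetic data (the features and outcome are illustrative stand-ins, not CHARLS variables):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary-outcome data; one column could play the role of an
# hs-CRP trajectory class indicator (all values are fabricated).
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# Multivariable logistic regression: each coefficient is a log-odds
# change per unit of the (standardized) covariate, so exp(coef) is the
# adjusted odds ratio for that covariate.
model = LogisticRegression(max_iter=1000).fit(X, y)
odds_ratios = np.exp(model.coef_[0])
```

An OR above 1 marks a covariate associated with higher outcome odds after adjustment for the others; an OR below 1 marks a protective association.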
- Published
- 2024
- Full Text
- View/download PDF
8. Exploring the correlation between DNA methylation and biological age using an interpretable machine learning framework.
- Author
-
Zhou, Sheng, Chen, Jing, Wei, Shanshan, Zhou, Chengxing, Wang, Die, Yan, Xiaofan, He, Xun, and Yan, Pengcheng
- Subjects
-
DNA methylation, FEATURE selection, GENETIC transcription, PREDICTION models, METHYLATION
- Abstract
DNA methylation plays a significant role in regulating transcription and exhibits systematic changes with age; these changes can be used to predict an individual's age. This study had two aims: first, to identify methylation sites associated with biological age; and second, to construct a biological age prediction model and preliminarily explore the biological significance of methylation-associated genes using machine learning. A biological age prediction model was constructed from human methylation data through data preprocessing, feature selection, statistical analysis, and machine learning techniques. Subsequently, 15 methylation data sets were subjected to in-depth analysis using SHAP, GO enrichment, and KEGG analysis. XGBoost, LightGBM, and CatBoost identified 15 groups of methylation sites associated with biological age. The cg23995914 locus was identified as the most significant contributor to predicting biological age by calculating SHAP values. Furthermore, GO enrichment and KEGG analyses were employed to initially explore the biological significance of the methylated loci. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Machine learning-based analysis of nutrient and water uptake in hydroponically grown soybeans.
- Author
-
Dhal, Sambandh Bhusan, Mahanta, Shikhadri, Moore, Janie McClurkin, and Kalafatis, Stavros
- Subjects
-
SUSTAINABLE agriculture, SUSTAINABILITY, AGRICULTURE, FEATURE selection, RANDOM forest algorithms
- Abstract
Recent advancements in sustainable agriculture have spurred interest in hydroponics as an alternative to conventional farming methods. However, the lack of data-driven approaches in hydroponic growth presents a significant challenge. This study addresses this gap by varying nitrogen, magnesium, and potassium concentrations in hydroponically grown soybeans and conducting essential nutrient profiling across the growth cycle. Statistical techniques such as linear interpolation are employed to interpolate nutrient data, and a feature selection pipeline consisting of chi-squared testing, Linear Regression with Recursive Feature Elimination (RFE), and ExtraTreesClassifier is used to select important nutrients for predicting water uptake with non-parametric regression methods. For soybeans grown in Hoagland + Nitrogen and Hoagland + Magnesium media, the Random Forest regressor outperformed other methods in predicting water uptake, achieving testing Mean Squared Error (MSE) scores of 24.55 (R² score 0.63) and 8.23 (R² score 0.81), respectively. For soybeans grown in Hoagland + Potassium media, Support Vector Regression demonstrated superior performance, with a testing MSE of 4.37 and an R² score of 0.85. SHapley Additive exPlanations (SHAP) values are examined in each case to elucidate the contributions of individual nutrients to the water uptake predictions. This research aims to provide data-driven insights to optimize hydroponic practices for sustainable food production. [ABSTRACT FROM AUTHOR]
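A feature selection pipeline along these lines can be sketched with scikit-learn. Here `ExtraTreesRegressor` replaces the paper's `ExtraTreesClassifier` because the synthetic target is continuous, and the chi-squared step is omitted since it requires non-negative, discretised inputs; all dataset details are assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic nutrient profiles (columns) predicting water uptake (target).
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=1)

# Vote 1: RFE with linear regression keeps the 4 strongest predictors.
rfe = RFE(LinearRegression(), n_features_to_select=4).fit(X, y)

# Vote 2: tree-based importances, top 4 features by impurity reduction.
tree_rank = np.argsort(
    ExtraTreesRegressor(random_state=1).fit(X, y).feature_importances_
)[::-1][:4]

# Union of the two votes gives the candidate nutrient set.
selected = sorted(set(np.flatnonzero(rfe.support_)) | set(int(i) for i in tree_rank))
```

Combining a linear and a tree-based selector hedges against either method's bias: RFE favors linearly predictive columns, while tree importances also catch non-linear effects.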
- Published
- 2024
- Full Text
- View/download PDF
10. Interpretable machine‐learning models for predicting creep recovery of concrete.
- Author
-
Mei, Shengqi, Liu, Xiaodong, Wang, Xingju, and Li, Xufeng
- Subjects
-
MACHINE learning, CREEP (Materials), FEATURE selection, DATA recovery, RANDOM forest algorithms
- Abstract
Creep recovery of concrete is essential for accurately assessing the performance of concrete structures over their service life. Existing creep recovery models exhibit low accuracy, and the factors influencing creep recovery remain inadequately elucidated. In this paper, interpretable machine learning (ML) techniques were employed to develop a prediction model for concrete creep recovery. Several ML techniques were selected, including random forest (RF), support vector regression (SVR), extreme gradient boosting (XGBoost), and light gradient boosting machine (LGBM). To maximize the sample size of the dataset, 109 sets of creep recovery data were collected from the existing literature for model training. Feature selection was used to determine the input parameters for the ML models, and 12 input variables were selected. The models were fine-tuned using Bayesian optimization. To ensure the reliability of the ML models, 10-fold cross-validation and random data splitting were implemented. The results indicate that the ML models achieved higher accuracy than the existing creep recovery model. Among the ML models, LGBM demonstrated superior accuracy, efficiency, and stability (with R² = 0.993, 0.978, and 0.973 for the training, testing, and validation sets, respectively). Shapley additive explanations (SHAP) were employed to interpret the significance of each input parameter for the ML model predictions. Duration after unloading, stress magnitude, and ambient relative humidity were the main feature variables influencing concrete creep recovery. A comparison of the influencing factors revealed a distinct difference between creep and creep recovery of concrete. [ABSTRACT FROM AUTHOR]
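A hedged sketch of the tuning protocol: 10-fold cross-validation combined with a randomized hyperparameter search, which stands in here for the paper's Bayesian optimization, applied to a `GradientBoostingRegressor` as an LGBM substitute on synthetic data (sample and feature counts mimic the abstract; everything else is assumed):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the 109 creep-recovery samples with 12 inputs.
X, y = make_regression(n_samples=109, n_features=12, noise=15.0, random_state=2)

# Each candidate configuration is scored by 10-fold CV; the search samples
# configurations at random rather than modeling the objective as Bayesian
# optimization would, but the workflow is otherwise the same.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=2),
    param_distributions={
        "n_estimators": randint(50, 201),
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 5),
    },
    n_iter=8,
    cv=KFold(n_splits=10, shuffle=True, random_state=2),
    random_state=2,
)
search.fit(X, y)
best_r2 = search.best_score_
```

With only 109 samples, 10-fold CV keeps each validation fold usable (about 11 samples) while still scoring every configuration on all of the data.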
- Published
- 2024
- Full Text
- View/download PDF
11. Analyzing risk factors and constructing a predictive model for superficial esophageal carcinoma with submucosal infiltration exceeding 200 micrometers.
- Author
-
Cui, Yutong, Luo, Zichen, Wang, Xiaobo, Liang, Shiqi, Hu, Guangbing, Chen, Xinrui, Zuo, Ji, Zhou, Lu, Guo, Haiyang, and Wang, Xianfei
- Subjects
-
MACHINE learning, PRECANCEROUS conditions, PLATELET lymphocyte ratio, LOGISTIC regression analysis, PICKLED foods, ENDOSCOPIC ultrasonography, ESOPHAGEAL cancer
- Abstract
Objective: Submucosal infiltration of less than 200 μm is considered an indication for endoscopic surgery in cases of superficial esophageal cancer and precancerous lesions. This study aims to identify the risk factors associated with submucosal infiltration exceeding 200 micrometers in early esophageal cancer and precancerous lesions, and to establish and validate an accompanying predictive model. Methods: Risk factors were identified through least absolute shrinkage and selection operator (LASSO) and multivariate logistic regression. Various machine learning (ML) classification models were tested to develop and evaluate the most effective predictive model, with Shapley Additive Explanations (SHAP) employed for model visualization. Results: Predictive factors for early esophageal invasion into the submucosa included endoscopic ultrasonography or magnifying endoscopy > SM1 (P < 0.001, OR = 3.972, 95% CI 2.161–7.478), esophageal wall thickening (P < 0.001, OR = 12.924, 95% CI 5.299–33.96), intake of pickled foods (P = 0.04, OR = 1.837, 95% CI 1.03–3.307), platelet-lymphocyte ratio (P < 0.001, OR = 0.284, 95% CI 0.137–0.556), tumor size (P < 0.027, OR = 2.369, 95% CI 1.128–5.267), the percentage of circumferential mucosal defect (P < 0.001, OR = 5.286, 95% CI 2.671–10.723), and preoperative pathological type (P < 0.001, OR = 4.079, 95% CI 2.254–7.476). The logistic regression model constructed from the identified risk factors was found to be the optimal model, demonstrating high efficacy with an area under the curve (AUC) of 0.922 in the training set, 0.899 in the validation set, and 0.850 in the test set. Conclusion: A logistic regression model complemented by SHAP visualizations effectively identifies early esophageal cancer infiltrating 200 micrometers into the submucosa. [ABSTRACT FROM AUTHOR]
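The LASSO step can be sketched as an L1-penalized logistic regression, whose penalty drives uninformative coefficients to exactly zero; the surviving features are the candidate risk factors carried into the multivariate model. The data below are synthetic stand-ins for the clinical predictors, not study data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 12 candidate clinical predictors and a binary
# outcome (infiltration > 200 um or not); all values are fabricated.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=3)
X = StandardScaler().fit_transform(X)  # L1 selection assumes comparable scales

# A small C (strong L1 penalty) shrinks weak predictors to exactly zero.
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lasso_lr.coef_[0])
```

The indices in `kept` are the features that survive the penalty; tightening `C` further would prune the set more aggressively.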
- Published
- 2024
- Full Text
- View/download PDF
12. Modeling motorcycle crash-injury severity utilizing explainable data-driven approaches.
- Author
-
Se, Chamroeun, Sunkpho, Jirapon, Wipulanusat, Warit, Tantisevi, Kevin, Champahom, Thanapong, and Ratanavaraha, Vatanavongs
- Subjects
-
ARTIFICIAL neural networks, RECURRENT neural networks, TRAFFIC signs & signals, ROAD users, SUPPORT vector machines, MOTORCYCLING accidents
- Abstract
Motorcycle crashes remain a significant public safety concern, requiring diverse analytical approaches to inform countermeasures. This study uses machine learning to analyze injury severity in motorcycle crashes in Thailand from 2018 to 2020. Traditional and advanced models were compared, including random forest (RF), support vector machine (SVM), deep neural network (DNN), recurrent neural network (RNN), long short-term memory (LSTM), and eXtreme gradient boosting (XGBoost). Hyperparameter tuning via GridSearchCV optimized performance. XGBoost, with a tradeoff score of 105.65%, outperformed the other models in predicting severe and fatal injuries. SHapley Additive exPlanations (SHAP) identified significant risk factors, including speeding, drunk driving, two-lane roads, unlit conditions, head-on and truck collisions, and nighttime crashes. Conversely, factors such as barrier medians, flashing traffic signals, sideswipes, rear-end crashes, and wet roads were associated with reduced severity. These findings suggest opportunities for integrated infrastructure improvements and expanded rider training and education programs to address behavioral risks. [ABSTRACT FROM AUTHOR]
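The tuning step can be sketched with `GridSearchCV`, which cross-validates every hyperparameter combination exhaustively; a `GradientBoostingClassifier` stands in for XGBoost, and the three-class synthetic target mimics minor/severe/fatal severity levels (all values are assumptions, not the Thai crash data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for crash records with a 3-level severity outcome.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=4)

# GridSearchCV scores each of the 2x2 combinations by 3-fold CV and
# refits the best one on the full data.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=4),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X, y)
```

Grid search scales poorly as the parameter grid grows; randomized or Bayesian search is the usual alternative when many hyperparameters are tuned at once.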
- Published
- 2024
- Full Text
- View/download PDF
13. Impact of Enterprise Supply Chain Digitalization on Cost of Debt: A Four-Flows Perspective Analysis Using Explainable Machine Learning Methodology.
- Author
-
Tang, Hongqin, Zhu, Jianping, Li, Nan, and Wu, Weipeng
- Abstract
Rising costs, complex supply chain management, and stringent regulations have created significant financial burdens on business sustainability, calling for new and rapid strategies to help enterprises transform. Supply chain digitalization (SCD) has emerged as a promising approach in the context of digitalization and globalization, with the potential to reduce an enterprise's debt costs. Developing a strategic framework for SCD that effectively reduces the cost of debt (CoD) has become a key academic challenge, critical for ensuring business sustainability. To this end, from a four-flows perspective, SCD is deconstructed into four distinct features: logistics flow digitalization (LFD), product flow digitalization (PFD), information flow digitalization (IFD), and capital flow digitalization (CFD). To precisely measure the four SCD features and the dependent variable, CoD, publicly available data from Chinese listed manufacturing enterprises, such as annual report texts and financial statement data, are collected, and various data mining technologies are used for data measurement and processing. To comprehensively investigate the impact of SCD on CoD, we employed an explainable machine learning methodology involving in-depth data discussions, cross-validation across a series of machine learning models, and the use of Shapley additive explanations (SHAP) to explain the results generated by the models. For sensitivity analysis, permutation feature importance (PFI) and partial dependence plots (PDPs) were also incorporated as supplementary explanatory methods, providing additional insight into the models' explainability. Through the aforementioned research processes, the following findings are obtained: SCD can play a role in reducing CoD, but the effects of the different SCD features are not identical.
Among the four SCD features, LFD, PFD, and IFD have the potential to significantly reduce CoD, with PFD having the most substantial impact, followed by LFD and IFD. In contrast, CFD has a relatively weak impact, and its role is challenging to discern. These findings provide significant guidance for enterprises in furthering their digitalization and supply chain development, helping them optimize SCD strategies more accurately to reduce CoD. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. Using explainable AI for enhanced understanding of winter road safety: insights with support vector machines and SHAP.
- Author
-
Shuai, Zehua, Kwon, Tae J., and Xie, Qian
- Subjects
-
MACHINE learning, SUPPORT vector machines, TRAFFIC safety, ROAD maintenance, ARTIFICIAL intelligence
- Abstract
This study investigates the utility of machine learning (ML) in understanding and mitigating winter road risks. Despite their capability in managing complex data structures, ML models often lack interpretability. We address this issue by integrating Shapley Additive exPlanations (SHAP) with a support vector machine (SVM) model. Utilizing a comprehensive dataset of 231 snowstorm events collected in the city of Edmonton across two winter seasons, the SVM model achieved an accuracy rate of 87.2%. Following model development, a SHAP summary plot was employed to identify the contribution of individual features to collision predictions—an insight not achievable through ML alone. Next, SHAP waterfall plots were used to assess the reliability of individual predictions. The findings enhanced our understanding of the complex SVM model and provided greater insights into the diverse factors affecting winter road safety. [ABSTRACT FROM AUTHOR]
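The idea behind explaining a black-box model such as an SVM with Shapley values can be sketched exactly for a tiny problem: enumerate every feature coalition and replace "absent" features with their column means (one common baseline choice). This is the quantity SHAP approximates efficiently at scale; the dataset, model, and feature count below are illustrative assumptions, not the snowstorm data:

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.svm import SVR

# Illustrative stand-in data: 4 features, continuous target.
rng = np.random.default_rng(6)
X = rng.normal(size=(80, 4))
y = 2.0 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=80)
model = SVR().fit(X, y)

baseline = X.mean(axis=0)  # "absent" features take their mean value
x = X[0]                   # the instance being explained
n = X.shape[1]

def value(subset):
    """Model output when only the features in `subset` take x's values."""
    z = baseline.copy()
    z[list(subset)] = x[list(subset)]
    return float(model.predict(z.reshape(1, -1))[0])

# Exact Shapley values: weighted marginal contributions over all coalitions.
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(others, k):
            weight = 1.0 / (n * comb(n - 1, k))
            phi[i] += weight * (value(S + (i,)) - value(S))
```

By the efficiency property, the values in `phi` sum to the prediction for `x` minus the baseline prediction, which is exactly the additivity that SHAP summary and waterfall plots rely on. The cost grows as 2^n model calls per instance, which is why practical SHAP implementations use sampling or model-specific shortcuts.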
- Published
- 2024
- Full Text
- View/download PDF
15. Interpretable multiphasic CT-based radiomic analysis for preoperatively differentiating benign and malignant solid renal tumors: a multicenter study.
- Author
-
Wu, Yaohai, Cao, Fei, Lei, Hanqi, Zhang, Shiqiang, Mei, Hongbing, Ni, Liangchao, and Pang, Jun
- Subjects
-
MACHINE learning, KIDNEY tumors, FEATURE extraction, RANDOM forest algorithms, COMPUTED tomography, NOMOGRAPHY (Mathematics)
- Abstract
Background: To develop and compare machine learning models based on triphasic contrast-enhanced CT (CECT) for distinguishing between benign and malignant renal tumors. Materials and Methods: In total, 427 patients were enrolled from two medical centers: Center 1 (serving as the training set) and Center 2 (serving as the external validation set). First, 1781 radiomic features were individually extracted from corticomedullary phase (CP), nephrographic phase (NP), and excretory phase (EP) CECT images, after which 10 features were selected by the minimum redundancy maximum relevance method. Second, random forest (RF) models were constructed from single-phase features (CP, NP, and EP) as well as from the combination of features from all three phases (TP). Third, the RF models were assessed in the training and external validation sets. Finally, the internal prediction mechanisms of the models were explained by the SHapley Additive exPlanations (SHAP) approach. Results: A total of 266 patients with renal tumors from Center 1 and 161 patients from Center 2 were included. In the training set, the AUCs of the RF models constructed from the CP, NP, EP, and TP features were 0.886, 0.912, 0.930, and 0.944, respectively. In the external validation set, the models achieved AUCs of 0.860, 0.821, 0.921, and 0.908, respectively. The "original_shape_Flatness" feature played the most important role in the prediction outcome for the RF model based on EP features according to the SHAP method. Conclusions: The four RF models efficiently differentiated benign from malignant solid renal tumors, with the EP feature-based RF model displaying the best performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. A Two-Level Machine Learning Prediction Approach for RAC Compressive Strength.
- Author
-
Qi, Fei and Li, Hangyu
- Subjects
RECYCLED concrete aggregates, MINERAL aggregates, STANDARD deviations, COMPRESSIVE strength, REINFORCED concrete
- Abstract
Through the use of recycled aggregates, the construction industry can mitigate its environmental impact. A key consideration for concrete structural engineers when designing and constructing concrete structures is compressive strength. This study aims to accurately forecast the compressive strength of recycled aggregate concrete (RAC) using machine learning techniques. We propose a simplified approach that incorporates a two-layer stacked ensemble learning model to predict RAC compressive strength. In this framework, the first layer consists of ensemble models acting as base learners, while the second layer utilizes a random forest (RF) model as the meta-learner. A comparative analysis with four other ensemble learning models demonstrates the superior performance of the proposed stacked model in effectively integrating predictions from the base learners, resulting in enhanced model accuracy. The model achieves a low mean absolute error (MAE) of 2.599 MPa, a root mean squared error (RMSE) of 3.645 MPa, and a high R-squared (R²) value of 0.964. Additionally, a Shapley additive explanations (SHAP) analysis reveals the influence and interrelationships of various input factors on the compressive strength of RAC, aiding design and construction professionals in optimizing raw material content during the RAC design and production process. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. A model for predicting academic performance on standardised tests for lagging regions based on machine learning and Shapley additive explanations
- Author
-
Mario Suaza-Medina, Rita Peñabaena-Niebles, and Maria Jubiz-Diaz
- Subjects
CRISP-DM ,Educational data mining ,Lagging region ,Machine learning ,Shapley additive explanations ,Standardised test ,Medicine ,Science - Abstract
Abstract Data are becoming more important in education since they allow for the analysis and prediction of future behaviour to improve academic performance and quality at educational institutions. However, academic performance is affected by regions’ conditions, such as demographic, psychographic, socioeconomic and behavioural variables, especially in lagging regions. This paper presents a methodology based on applying nine classification algorithms and Shapley values to identify the variables that influence the performance of the Colombian standardised test: the Saber 11 exam. This study is innovative because, unlike others, it applies to lagging regions and combines the use of EDM and Shapley values to predict students’ academic performance and analyse the influence of each variable on academic performance. The results show that the algorithms with the best accuracy are Extreme Gradient Boosting Machine, Light Gradient Boosting Machine, and Gradient Boosting Machine. According to the Shapley values, the most influential variables are the socioeconomic level index, gender, region, location of the educational institution, and age. For Colombia, the results showed that male students from urban educational institutions over 18 years have the best academic performance. Moreover, there are differences in educational quality among the lagging regions. Students from Nariño have advantages over ones from other departments. The proposed methodology allows for generating public policies better aligned with the reality of lagging regions and achieving equity in access to education.
- Published
- 2024
- Full Text
- View/download PDF
18. Exploring the correlation between DNA methylation and biological age using an interpretable machine learning framework
- Author
-
Sheng Zhou, Jing Chen, Shanshan Wei, Chengxing Zhou, Die Wang, Xiaofan Yan, Xun He, and Pengcheng Yan
- Subjects
DNA methylation ,Biological age ,GO enrichment analysis ,XGBoost ,Interpretable machine learning ,Shapley Additive exPlanations ,Medicine ,Science - Abstract
Abstract DNA methylation plays a significant role in regulating transcription and exhibits systematic changes with age. These changes can be used to predict an individual’s age. This study aimed, first, to identify methylation sites associated with biological age and, second, to construct a biological age prediction model and preliminarily explore the biological significance of methylation-associated genes using machine learning. A biological age prediction model was constructed using human methylation data through data preprocessing, feature selection procedures, statistical analysis, and machine learning techniques. Subsequently, 15 methylation data sets were subjected to in-depth analysis using SHAP, GO enrichment, and KEGG analysis. XGBoost, LightGBM, and CatBoost identified 15 groups of methylation sites associated with biological age. By calculating SHAP values, the cg23995914 locus was identified as the most significant contributor to predicting biological age. Furthermore, GO enrichment and KEGG analyses were employed to initially explore the methylated loci’s biological significance.
- Published
- 2024
- Full Text
- View/download PDF
19. Machine learning-based analysis of nutrient and water uptake in hydroponically grown soybeans
- Author
-
Sambandh Bhusan Dhal, Shikhadri Mahanta, Janie McClurkin Moore, and Stavros Kalafatis
- Subjects
Sustainable agriculture ,Hydroponics ,Non-parametric regression ,Machine learning ,Shapley additive explanations ,Medicine ,Science - Abstract
Abstract Recent advancements in sustainable agriculture have spurred interest in hydroponics as an alternative to conventional farming methods. However, the lack of data-driven approaches in hydroponic growth presents a significant challenge. This study addresses this gap by varying nitrogen, magnesium, and potassium concentrations in hydroponically grown soybeans and conducting essential nutrient profiling across the growth cycle. Statistical techniques like Linear Interpolation are employed to interpolate nutrient data and a feature selection pipeline consisting of chi-squared testing methods, Linear Regression with Recursive Feature Elimination (RFE) and ExtraTreesClassifier have been used to select important nutrients for predicting water uptake using non-parametric regression methods. For different nutrient growth media, i.e. for soybeans grown in Hoagland + Nitrogen and Hoagland + Magnesium media, the Random Forest regressor outperformed other methods in predicting water uptake, achieving testing Mean Squared Error (MSE) scores of 24.55 (R² score 0.63) and 8.23 (R² score 0.81), respectively. Similarly, for soybeans grown in Hoagland + Potassium media, Support Vector Regression demonstrated superior performance with a testing MSE of 4.37 and R² score of 0.85. SHapley Additive exPlanations (SHAP) values are examined in each case to elucidate the contributions of individual nutrients to water uptake predictions. This research aims to provide data-driven insights to optimize hydroponic practices for sustainable food production.
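The feature-selection pipeline described above combines chi-squared testing, Recursive Feature Elimination (RFE) with Linear Regression, and an ExtraTreesClassifier. A minimal sketch of the RFE step with scikit-learn on synthetic data (illustrative only, not the authors' pipeline):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((50, 6))                      # six candidate nutrient features
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.01 * rng.standard_normal(50)

# Iteratively drop the feature with the smallest coefficient until two remain
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(selector.support_)                     # boolean mask of retained features
```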
- Published
- 2024
- Full Text
- View/download PDF
20. Analyzing risk factors and constructing a predictive model for superficial esophageal carcinoma with submucosal infiltration exceeding 200 micrometers
- Author
-
Yutong Cui, Zichen Luo, Xiaobo Wang, Shiqi Liang, Guangbing Hu, Xinrui Chen, Ji Zuo, Lu Zhou, Haiyang Guo, and Xianfei Wang
- Subjects
Endoscopic submucosal dissection ,Machine learning ,Prediction model ,Shapley additive exPlanations ,Risk factors ,Diseases of the digestive system. Gastroenterology ,RC799-869 - Abstract
Abstract Objective Submucosal infiltration of less than 200 μm is considered an indication for endoscopic surgery in cases of superficial esophageal cancer and precancerous lesions. This study aims to identify the risk factors associated with submucosal infiltration exceeding 200 micrometers in early esophageal cancer and precancerous lesions, as well as to establish and validate an accompanying predictive model. Methods Risk factors were identified through least absolute shrinkage and selection operator (LASSO) and multivariate logistic regression. Various machine learning (ML) classification models were tested to develop and evaluate the most effective predictive model, with Shapley Additive Explanations (SHAP) employed for model visualization. Results Predictive factors for early esophageal invasion into the submucosa included endoscopic ultrasonography or magnifying endoscopy > SM1 (P
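LASSO-style screening, as used above, shrinks the coefficients of uninformative predictors exactly to zero, leaving a sparse set of candidate risk factors. A hedged sketch using scikit-learn's L1-penalised logistic regression on synthetic data (an illustrative stand-in, not the study's actual model or variables):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))            # eight candidate clinical predictors
logit = 2.0 * X[:, 0] - 1.5 * X[:, 1]        # only the first two are informative
y = (logit + 0.3 * rng.standard_normal(200) > 0).astype(int)

# The L1 penalty drives coefficients of uninformative predictors to exactly zero
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_[0])
```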
- Published
- 2024
- Full Text
- View/download PDF
21. Machine learning-based predictions and analyses of the creep rupture life of the Ni-based single crystal superalloy
- Author
-
Fan Zou, Pengjie Liu, Yanzhan Chen, and Yaohua Zhao
- Subjects
Creep property prediction ,XGBoost ,Shapley additive explanations ,Sparrow optimization algorithm ,Medicine ,Science - Abstract
Abstract The evaluation of creep rupture life is complex due to its variable formation mechanism. In this paper, machine learning algorithms are applied to explore the creep rupture life as a function of 27 physical properties. By training several classical machine learning models and comparing their prediction performance, XGBoost is selected as the predictive model for creep rupture life. Moreover, we introduce an interpretable method, Shapley additive explanations (SHAP), to explain the creep rupture life predicted by the XGBoost model. The SHAP values are then calculated, and the feature importance of the creep rupture life yielded by the XGBoost model is discussed. Finally, the creep rupture life is optimized using the chaotic sparrow optimization algorithm. We then show that our proposed method predicts and optimizes creep properties more cheaply and quickly than the other approaches in our experiments. The proposed method can also be used to optimize material design across various engineering domains.
- Published
- 2024
- Full Text
- View/download PDF
22. Prediction of Recidivism and Detection of Risk Factors Under Different Time Windows Using Machine Learning Techniques.
- Author
-
Mu, Di, Zhang, Simai, Zhu, Ting, Zhou, Yong, and Zhang, Wei
- Abstract
Following a comprehensive analysis of the initial three generations of prisoner risk assessment tools, the field has increasingly integrated fourth-generation tools with machine learning techniques. However, limited efforts have been made to address the explainability of data-driven prediction models and their connection with treatment recommendations. Our primary objective was to develop predictive models for assessing the likelihood of recidivism among prisoners released from their index incarceration within 1-year, 2-year, and 5-year timeframes. We aimed to enhance interpretability using SHapley Additive exPlanations (SHAP). We collected data from 20,457 in-prison records from February 10, 2005, to August 25, 2021, sourced from a Southwestern China prison's data management system. Recidivism records were officially determined through data mining from an official website and combined identification data from neighboring prisons. We employed five machine learning algorithms, considering sociodemographic, physical health, psychological assessments, criminological characteristics, crime history, social support, and in-prison behaviors as factors. For interpretability, SHAP was applied to reveal feature contributions. Findings indicated that young prisoners accused of larceny, with previous convictions, lower fines, and limited family support faced higher reoffending risk. Conversely, middle-aged and senior prisoners with no prior convictions, lower monthly supermarket expenses, and positive psychological test results had lower reoffending risk. We also explored interactions between significant predictive features, such as prisoner age at incarceration initiation and primary accusation, and the duration of current incarceration and cumulative prior incarcerations. Notably, our models consistently exhibited high performance, as shown by the AUC on the test dataset across all time windows. 
Interpretability results provided insights into evolving risk factors over time, valuable for intervention with high-risk individuals. These insights, with additional validation, could offer dynamic prisoner information for stakeholders. Moreover, interpretability results can be seamlessly integrated into prison and court management systems as a valuable risk assessment tool. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Explainable machine learning models for estimating daily dissolved oxygen concentration of the Tualatin River.
- Author
-
Shuguang Li, Qasem, Sultan Noman, Band, Shahab S., Ameri, Rasoul, Hao-Ting Pai, and Mehdizadeh, Saeid
- Subjects
MACHINE learning ,STANDARD deviations ,WATER quality monitoring ,WATER quality ,RANDOM forest algorithms - Abstract
Monitoring the quality of river water is of fundamental importance in hydrological research. In this context, the concentration of dissolved oxygen (DO) is one of the most significant indicators of river water quality. The current study aimed to estimate the minimum, maximum, and mean DO concentrations (DO min, DO max, DO mean) at a gauging station located on the Tualatin River, United States. To that end, four machine learning models, namely support vector regression (SVR), multi-layer perceptron (MLP), random forest (RF), and gradient boosting (GB), were established. Root mean square error (RMSE), mean absolute error (MAE), coefficient of correlation (R), and Nash-Sutcliffe efficiency (NSE) metrics were employed to assess the accuracies of these models. The modeling results demonstrated that the SVR and MLP surpassed the RF and GB models; of these, the SVR was the best-performing method for estimating DO min, DO max, and DO mean. The best error statistics in the testing phase were obtained by the SVR model with all four inputs when estimating DO mean concentration (RMSE = 0.663 mg/l, MAE = 0.508 mg/l, R = 0.945, NSE = 0.875). Finally, the explainability of the superior models (i.e. the SVR models) was analysed using SHapley Additive exPlanations (SHAP), applied here for the first time to DO concentration estimation. Evaluating the explainability of machine learning models provides useful information about the impact of each input on the developed models. It was concluded that specific conductance (SC), followed by water temperature (WT), contributed most to estimating the DO min, DO max, and DO mean concentrations. [ABSTRACT FROM AUTHOR]
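The four evaluation metrics used above (RMSE, MAE, R, NSE) have simple closed forms; a minimal NumPy sketch:

```python
import numpy as np

def rmse(obs, sim):
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mae(obs, sim):
    return float(np.mean(np.abs(obs - sim)))

def r(obs, sim):
    return float(np.corrcoef(obs, sim)[0, 1])

def nse(obs, sim):
    # 1 minus the ratio of residual variance to the variance of observations;
    # 1 is a perfect fit, 0 matches the observed mean, negative is worse
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))
```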
- Published
- 2024
- Full Text
- View/download PDF
24. Democratizing cheminformatics: interpretable chemical grouping using an automated KNIME workflow
- Author
-
José T. Moreira-Filho, Dhruv Ranganath, Mike Conway, Charles Schmitt, Nicole Kleinstreuer, and Kamel Mansouri
- Subjects
Chemical grouping ,KNIME workflow ,Machine learning ,Explainable artificial intelligence ,SHapley additive exPlanations ,Feature selection ,Information technology ,T58.5-58.64 ,Chemistry ,QD1-999 - Abstract
Abstract With the increased availability of chemical data in public databases, innovative techniques and algorithms have emerged for the analysis, exploration, visualization, and extraction of information from these data. One such technique is chemical grouping, where chemicals with common characteristics are categorized into distinct groups based on physicochemical properties, use, biological activity, or a combination. However, existing tools for chemical grouping often require specialized programming skills or the use of commercial software packages. To address these challenges, we developed a user-friendly chemical grouping workflow implemented in KNIME, a free, open-source, low/no-code, data analytics platform. The workflow serves as an all-encompassing tool, expertly incorporating a range of processes such as molecular descriptor calculation, feature selection, dimensionality reduction, hyperparameter search, and supervised and unsupervised machine learning methods, enabling effective chemical grouping and visualization of results. Furthermore, we implemented tools for interpretation, identifying key molecular descriptors for the chemical groups, and using natural language summaries to clarify the rationale behind these groupings. The workflow was designed to run seamlessly in both the KNIME local desktop version and KNIME Server WebPortal as a web application. It incorporates interactive interfaces and guides to assist users in a step-by-step manner. We demonstrate the utility of this workflow through a case study using an eye irritation and corrosion dataset. Scientific contributions This work presents a novel, comprehensive chemical grouping workflow in KNIME, enhancing accessibility by integrating a user-friendly graphical interface that eliminates the need for extensive programming skills. 
This workflow uniquely combines several features such as automated molecular descriptor calculation, feature selection, dimensionality reduction, and machine learning algorithms (both supervised and unsupervised), with hyperparameter optimization to refine chemical grouping accuracy. Moreover, we have introduced an innovative interpretative step and natural language summaries to elucidate the underlying reasons for chemical groupings, significantly advancing the usability of the tool and interpretability of the results.
- Published
- 2024
- Full Text
- View/download PDF
25. A Real-World Study on the Short-Term Efficacy of Amlodipine in Treating Hypertension Among Inpatients
- Author
-
Wang T, Tan J, Xiang S, Zhang Y, Jian C, Jian J, and Zhao W
- Subjects
anti-hypertensive drugs ,hypertension ,machine learning ,shapley additive explanations ,Medicine - Abstract
Tingting Wang,1 Juntao Tan,2 Tiantian Wang,2 Shoushu Xiang,2 Yang Zhang,1 Chang Jian,1 Jie Jian,1 Wenlong Zhao1,3 1College of Medical Informatics, Chongqing Medical University, Chongqing, 400016, People’s Republic of China; 2Operation Management Office, Affiliated Banan Hospital of Chongqing Medical University, Chongqing, 401320, People’s Republic of China; 3Medical Data Science Academy, Chongqing Medical University, Chongqing, People’s Republic of ChinaCorrespondence: Wenlong Zhao, College of Medical Informatics, Chongqing Medical University, 1 Yixueyuan Road, Yuzhong District, Chongqing, 400016, People’s Republic of China, Tel +86-13883163651, Email cqzhaowl@163.comPurpose: Hospitalized hypertensive patients rely on blood pressure medication, yet there is limited research on the sole use of amlodipine, despite its proven efficacy in protecting target organs and reducing mortality. This study aims to identify key indicators influencing the efficacy of amlodipine, thereby enhancing treatment outcomes.Patients and Methods: In this multicenter retrospective study, 870 hospitalized patients with primary hypertension exclusively received amlodipine for the first 5 days after admission, and their medical records contained comprehensive blood pressure records. They were categorized into success (n=479) and failure (n=391) groups based on average blood pressure control efficacy. Predictive models were constructed using six machine learning algorithms. Evaluation metrics encompassed the area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). SHapley Additive exPlanations (SHAP) analysis assessed feature contributions to efficacy.Results: All six machine learning models demonstrated superior predictive performance. 
Following variable reduction, the model predicting amlodipine efficacy was reconstructed using these algorithms, with the light gradient boosting machine (LightGBM) model achieving the highest overall performance (AUC = 0.803). Notably, amlodipine showed enhanced efficacy in patients with low platelet distribution width (PDW) values, as well as high hematocrit (HCT) and thrombin time (TT) values.Conclusion: This study utilized machine learning to predict amlodipine’s effectiveness in hypertension treatment, pinpointing key factors: HCT, PDW, and TT levels. Lower PDW, along with higher HCT and TT, correlated with enhanced treatment outcomes. This facilitates personalized treatment, particularly for hospitalized hypertensive patients undergoing amlodipine monotherapy.Keywords: anti-hypertensive drugs, hypertension, machine learning, shapley additive explanations
- Published
- 2024
26. Construction and validation of predictive models for intravenous immunoglobulin–resistant Kawasaki disease using an interpretable machine learning approach
- Author
-
Linfan Deng, Jian Zhao, Ting Wang, Bin Liu, Jun Jiang, Peng Jia, Dong Liu, and Gang Li
- Subjects
kawasaki disease ,machine learning ,intravenous immunoglobulin resistance ,shapley additive explanations ,Pediatrics ,RJ1-570 - Abstract
Background Intravenous immunoglobulin (IVIG)-resistant Kawasaki disease is associated with coronary artery lesion development. Purpose This study aimed to explore the factors associated with IVIG-resistance and construct and validate an interpretable machine learning (ML) prediction model in clinical practice. Methods Between December 2014 and November 2022, 602 patients were screened and risk factors for IVIG-resistance investigated. Five ML models are used to establish an optimal prediction model. The SHapley Additive exPlanations (SHAP) method was used to interpret the ML model. Results Na+, hemoglobin (Hb), C-reactive protein (CRP), and globulin were independent risk factors for IVIG-resistance. A nonlinear relationship was identified between globulin level and IVIG-resistance. The XGBoost model exhibited excellent performance, with an area under the receiver operating characteristic curve of 0.821, accuracy of 0.748, sensitivity of 0.889, and specificity of 0.683 in the testing set. The XGBoost model was interpreted globally and locally using the SHAP method. Conclusion Na+, Hb, CRP, and globulin levels were independently associated with IVIG-resistance. Our findings demonstrate that ML models can reliably predict IVIG-resistance. Moreover, use of the SHAP method to interpret the established XGBoost model's findings would provide evidence of IVIG-resistance and guide the individualized treatment of Kawasaki disease.
- Published
- 2024
- Full Text
- View/download PDF
27. Acute Psychological Stress Detection Using Explainable Artificial Intelligence for Automated Insulin Delivery
- Author
-
Mahmoud M. Abdel-Latif, Mudassir M. Rashid, Mohammad Reza Askari, Andrew Shahidehpour, Mohammad Ahmadasas, Minsun Park, Lisa Sharp, Lauretta Quinn, and Ali Cinar
- Subjects
acute psychological stress detection ,galvanic skin response ,type 1 diabetes ,extreme gradient boosting ,explainable machine learning ,Shapley additive explanations ,Applied mathematics. Quantitative methods ,T57-57.97 - Abstract
Acute psychological stress (APS) is a complex and multifactorial phenomenon that affects metabolism, necessitating real-time detection and interventions to mitigate its effects on glycemia in people with type 1 diabetes. This study investigates the detection of APS using physiological variables measured by the Empatica E4 wristband and employs explainable machine learning to evaluate the importance of the physiological signals. The extreme gradient boosting model is developed for classification of APS and non-stress (NS) with weighted training, achieving an overall accuracy of 99.93%. The Shapley additive explanations (SHAP) technique is employed to interpret the global importance of the physiological signals, determining the order of importance for the variables from most to least as galvanic skin response (GSR), heart rate (HR), skin temperature (ST), and motion sensors (accelerometer readings). Increases in GSR and HR are positively correlated with the occurrence of APS as indicated by high positive SHAP values. The SHAP technique is also used to explain the local signal importance for particular instances of misclassified samples. The detection of APS can inform multivariable automated insulin delivery systems to intervene to counteract the APS-induced glycemic excursions in people with type 1 diabetes.
- Published
- 2024
- Full Text
- View/download PDF
28. Interpretable machine learning for the prediction of death risk in patients with acute diquat poisoning
- Author
-
Huiyi Li, Zheng Liu, Wenming Sun, Tiegang Li, and Xuesong Dong
- Subjects
Diquat poisoning ,Risk of death ,Machine learning ,Shapley additive explanations ,Medicine ,Science - Abstract
Abstract The aim of this study was to develop and validate predictive models for assessing the risk of death in patients with acute diquat (DQ) poisoning using innovative machine learning techniques. Additionally, predictive models were evaluated through the application of SHapley Additive ExPlanations (SHAP). A total of 201 consecutive patients from the emergency departments of the First Hospital and Shengjing Hospital of China Medical University admitted for deliberate oral intake of DQ from February 2018 to August 2023 were analysed. The initial clinical data of the patients with acute DQ poisoning were collected. Machine learning methods such as logistic regression, random forest, support vector machine (SVM), and gradient boosting were applied to build the prediction models. The whole sample was split into a training set and a test set at a ratio of 8:2. The performances of these models were assessed in terms of discrimination, calibration, and clinical decision curve analysis (DCA). We also used the SHAP interpretation tool to provide an intuitive explanation of the risk of death in patients with DQ poisoning. Logistic regression, random forest, SVM, and gradient boosting models were established, and the areas under the receiver operating characteristic curves (AUCs) were 0.91, 0.98, 0.96 and 0.94, respectively. The net benefits were similar across all four models. The four machine learning models can be reliable tools for predicting death risk in patients with acute DQ poisoning. Their combination with SHAP provides explanations for individualized risk prediction, increasing the model transparency.
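The AUC reported for each model above equals the probability that a randomly chosen positive case is ranked above a randomly chosen negative one (the Mann–Whitney formulation); a minimal NumPy sketch:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the Mann–Whitney statistic; ties count one half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```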
- Published
- 2024
- Full Text
- View/download PDF
29. Epidemiological exploration of the impact of bluetooth headset usage on thyroid nodules using Shapley additive explanations method
- Author
-
Nan Zhou, Wei Qin, Jia-Jin Zhang, Yun Wang, Jian-Sheng Wen, and Yang Mooi Lim
- Subjects
Thyroid nodules ,Bluetooth headsets ,Non-ionizing radiation ,Shapley additive explanations ,Medicine ,Science - Abstract
Abstract With an increasing prevalence of thyroid nodules globally, this study investigates the potential correlation between the use of Bluetooth headsets and the incidence of thyroid nodules, considering the cumulative effects of non-ionizing radiation (NIR) emitted by these devices. In this study, we analyzed 600 valid questionnaires from the WenJuanXing platform using Propensity Score Matching (PSM) and the XGBOOST model, supplemented by SHAP analysis, to assess the risk of thyroid nodules. PSM was utilized to balance baseline characteristic differences, thereby reducing bias. The XGBOOST model was then employed to predict risk factors, with model efficacy measured by the area under the Receiver Operating Characteristic (ROC) curve (AUC). SHAP analysis helped quantify and explain the impact of each feature on the prediction outcomes, identifying key risk factors. PSM processing of the 600 questionnaires yielded a matched dataset of 96 cases for modeling analysis. The AUC value of the XGBOOST model reached 0.95, demonstrating high accuracy in differentiating thyroid nodule risks. SHAP analysis revealed age and daily Bluetooth headset usage duration as the two most significant factors affecting thyroid nodule risk. Specifically, longer daily usage durations of Bluetooth headsets were strongly linked to an increased risk of developing thyroid nodules, as indicated by the SHAP analysis outcomes. Our study highlighted a significant association between prolonged Bluetooth headset use and increased thyroid nodule risk, emphasizing the importance of considering health impacts in the use of modern technology, especially for devices like Bluetooth headsets that are frequently used daily. 
Through precise model predictions and variable importance analysis, our research provides a scientific basis for the formulation of public health policies and personal health habit choices, suggesting that attention should be paid to the duration of Bluetooth headset use in daily life to reduce the potential risk of thyroid nodules. Future research should further investigate the biological mechanisms of this relationship and consider additional potential influencing factors to offer more comprehensive health guidance and preventive measures.
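The Propensity Score Matching step above pairs each exposed respondent with the unexposed respondent whose propensity score is closest, discarding poor matches. A minimal greedy 1:1 matching sketch (the scores here are assumed to come from an upstream propensity model, e.g. logistic regression; the study's exact matching settings are not specified in the abstract):

```python
def match_nearest(treated_ps, control_ps, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity scores,
    without replacement; pairs outside the caliper are discarded."""
    available = dict(enumerate(control_ps))
    pairs = []
    for t_idx, t in sorted(enumerate(treated_ps), key=lambda kv: kv[1]):
        if not available:
            break
        c_idx = min(available, key=lambda c: abs(available[c] - t))
        if abs(available[c_idx] - t) <= caliper:
            pairs.append((t_idx, c_idx))
            del available[c_idx]    # each control is used at most once
    return pairs
```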
- Published
- 2024
- Full Text
- View/download PDF
30. Interpretable machine learning model for shear strength estimation of circular concrete‐filled steel tubes.
- Author
-
Mansouri, Ali, Mansouri, Maryam, and Mangalathu, Sujith
- Subjects
MACHINE learning ,CONCRETE-filled tubes ,SHEAR strength ,COMPOSITE columns ,IMPACT strength ,K-nearest neighbor classification ,STEEL tubes ,TRANSVERSE reinforcements - Abstract
Summary: Precise estimation of the shear strength of concrete‐filled steel tubes (CFSTs) is a crucial requirement for the design of these members. The existing design codes and empirical equations are inconsistent in predicting the shear strength of these members. This paper provides a data‐driven approach for the shear strength estimation of circular CFSTs. For this purpose, the authors evaluated and compared the performance of nine machine learning (ML) methods, namely linear regression, decision tree (DT), k‐nearest neighbors (KNN), support vector regression (SVR), random forest (RF), bagging regression (BR), adaptive boosting (AdaBoost), gradient boosting regression tree (GBRT), and extreme gradient boosting (XGBoost) in estimating the shear strength of CFSTs on an experimental database compiled from the results of 230 shear tests on CFSTs in the literature. For each model, hyperparameter tuning was performed by conducting a grid search in combination with k‐fold cross‐validation (CV). Comparing the nine methods in terms of several performance measures showed that the XGBoost model was the most accurate in predicting the shear strength of CFSTs. This model also showed superior accuracy in predicting the shear strength of CFSTs when compared to the formulas provided in design codes and the existing empirical equations. The Shapley Additive exPlanations (SHAP) technique was also used to interpret the results of the XGBoost model. Using SHAP, the features with the greatest impact on the shear strength of CFSTs were found to be the cross‐sectional area of the steel tube, the axial load ratio, and the shear span ratio, in that order. [ABSTRACT FROM AUTHOR]
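Hyperparameter tuning via grid search with k-fold cross-validation, as described above, scores every candidate setting by its average held-out error. A compact sketch for a single ridge-regression penalty (illustrative only; the paper tunes nine different model families):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

def grid_search_ridge(X, y, alphas, k=5):
    """Pick the ridge penalty with the lowest mean cross-validated MSE."""
    def fit(Xt, yt, alpha):
        d = Xt.shape[1]
        return np.linalg.solve(Xt.T @ Xt + alpha * np.eye(d), Xt.T @ yt)
    cv_mse = {}
    for alpha in alphas:
        errs = [np.mean((X[te] @ fit(X[tr], y[tr], alpha) - y[te]) ** 2)
                for tr, te in kfold_indices(len(y), k)]
        cv_mse[alpha] = float(np.mean(errs))
    return min(cv_mse, key=cv_mse.get)
```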
- Published
- 2024
- Full Text
- View/download PDF
31. Estimating Aboveground Biomass of Wetland Plant Communities from Hyperspectral Data Based on Fractional-Order Derivatives and Machine Learning.
- Author
-
Li, Huazhe, Tang, Xiying, Cui, Lijuan, Zhai, Xiajie, Wang, Junjie, Zhao, Xinsheng, Li, Jing, Lei, Yinru, Wang, Jinzhi, Wang, Rumiao, and Li, Wei
- Subjects
MACHINE learning ,WETLANDS monitoring ,CARBON sequestration ,WETLAND plants ,FIELD research - Abstract
Wetlands, as a crucial component of terrestrial ecosystems, play a significant role in global ecological services. Aboveground biomass (AGB) is a key indicator of the productivity and carbon sequestration potential of wetland ecosystems. The current research methods for remote-sensing estimation of biomass either rely on traditional vegetation indices or merely perform integer-order differential transformations on the spectra, failing to fully leverage the information complexity of hyperspectral data. To identify an effective method for estimating AGB of mixed-wetland-plant communities, we conducted field surveys of AGB from three typical wetlands within the Crested Ibis National Nature Reserve in Hanzhong, Shaanxi, and concurrently acquired canopy hyperspectral data with a portable spectrometer. The spectral features were transformed by applying fractional-order differentiation (0.0 to 2.0) to extract optimal feature combinations. AGB prediction models were built using three machine learning models, XGBoost, Random Forest (RF), and CatBoost, and the accuracy of each model was evaluated. The combination of fractional-order differentiation, vegetation indices, and feature importance effectively yielded the optimal feature combinations, and integrating vegetation indices with feature bands enhanced the predictive accuracy of the models. Among the three machine-learning models, the RF model achieved superior accuracy using the 0.8-order differential transformation of vegetation indices and feature bands (R2 = 0.673, RMSE = 23.196, RPD = 1.736). The optimal RF model was visually interpreted using Shapley Additive Explanations, which revealed that the contribution of each feature varied across individual sample predictions. Our study provides methodological and technical support for remote-sensing monitoring of wetland AGB. [ABSTRACT FROM AUTHOR]
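The fractional-order differentiation applied to the spectra above generalises integer-order differencing; the Grünwald–Letnikov form is a weighted sum of preceding samples whose weights follow a simple recurrence. A minimal NumPy sketch (illustrative; band spacing h and boundary handling would matter for real spectra):

```python
import numpy as np

def gl_fractional_diff(signal, alpha, h=1.0):
    """Grünwald–Letnikov fractional-order derivative of a 1-D signal.

    alpha = 0 returns the signal unchanged; alpha = 1 reduces to the
    first-order backward difference.
    """
    n = len(signal)
    w = np.ones(n)
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - alpha) / k   # binomial-coefficient recurrence
    out = np.empty(n)
    for j in range(n):
        out[j] = np.dot(w[: j + 1], signal[j::-1]) / h ** alpha
    return out
```

Intermediate orders such as the 0.8 found optimal above interpolate between the raw spectrum and its first derivative, which is why a fine grid of orders from 0.0 to 2.0 is typically scanned.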
- Published
- 2024
- Full Text
- View/download PDF
32. Democratizing cheminformatics: interpretable chemical grouping using an automated KNIME workflow.
- Author
-
Moreira-Filho, José T., Ranganath, Dhruv, Conway, Mike, Schmitt, Charles, Kleinstreuer, Nicole, and Mansouri, Kamel
- Subjects
MACHINE learning ,FEATURE selection ,DATA analytics ,ARTIFICIAL intelligence ,DATA mining - Abstract
With the increased availability of chemical data in public databases, innovative techniques and algorithms have emerged for the analysis, exploration, visualization, and extraction of information from these data. One such technique is chemical grouping, where chemicals with common characteristics are categorized into distinct groups based on physicochemical properties, use, biological activity, or a combination. However, existing tools for chemical grouping often require specialized programming skills or the use of commercial software packages. To address these challenges, we developed a user-friendly chemical grouping workflow implemented in KNIME, a free, open-source, low/no-code, data analytics platform. The workflow serves as an all-encompassing tool, expertly incorporating a range of processes such as molecular descriptor calculation, feature selection, dimensionality reduction, hyperparameter search, and supervised and unsupervised machine learning methods, enabling effective chemical grouping and visualization of results. Furthermore, we implemented tools for interpretation, identifying key molecular descriptors for the chemical groups, and using natural language summaries to clarify the rationale behind these groupings. The workflow was designed to run seamlessly in both the KNIME local desktop version and KNIME Server WebPortal as a web application. It incorporates interactive interfaces and guides to assist users in a step-by-step manner. We demonstrate the utility of this workflow through a case study using an eye irritation and corrosion dataset. Scientific contributions This work presents a novel, comprehensive chemical grouping workflow in KNIME, enhancing accessibility by integrating a user-friendly graphical interface that eliminates the need for extensive programming skills. 
This workflow uniquely combines several features such as automated molecular descriptor calculation, feature selection, dimensionality reduction, and machine learning algorithms (both supervised and unsupervised), with hyperparameter optimization to refine chemical grouping accuracy. Moreover, we have introduced an innovative interpretative step and natural language summaries to elucidate the underlying reasons for chemical groupings, significantly advancing the usability of the tool and interpretability of the results. [ABSTRACT FROM AUTHOR]
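The descriptor-to-grouping pipeline described above can be sketched with scikit-learn alone. The descriptor matrix below is a random stand-in (a real workflow would compute molecular descriptors with a chemistry toolkit, as the KNIME nodes do), and the feature counts and cluster number are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical descriptor matrix: 100 chemicals x 12 molecular descriptors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
X[:, 3] = 0.0                      # a constant (uninformative) descriptor

# 1) Feature selection: drop near-constant descriptors
X_sel = VarianceThreshold(threshold=1e-6).fit_transform(X)

# 2) Scale, then reduce dimensionality for clustering/visualization
X_std = StandardScaler().fit_transform(X_sel)
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_std)

# 3) Unsupervised grouping of the chemicals
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_2d)
print(X_sel.shape, len(set(labels)))
```

The interpretation step in the workflow would then ask which descriptors separate each cluster from the rest.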
- Published
- 2024
- Full Text
- View/download PDF
33. An Optimal Weighted Ensemble Machine Learning Approach to Accurate Estimate the Coastal Boundary Layer Height Using ERA5 Multi‐Variables.
- Author
-
Peng, Kecheng, Xin, Jinyuan, Zhu, Xiaoqian, Xu, Qiang, Wang, Xiaoyuan, Wang, Weifeng, Tan, Yulong, Zhao, Dandan, Jia, Danjie, Cao, Xiaoqun, Ren, Xinbing, Ma, Yongjing, Wang, Guangjie, and Wang, Zifa
- Subjects
MACHINE learning, MARINE meteorology, HUMIDITY, WIND speed, WEATHER - Abstract
The coastal boundary layer height (CBLH/Coastal‐BLH) is critical in determining the exchange of heat, momentum, and materials between the land and ocean, thereby regulating local climate and weather. However, due to the complexity of geographical characteristics and meteorological conditions, accurate estimation of the CBLH remains challenging. Herein, based on continuous high‐resolution measurements of the CBL performed from November 2019 to April 2020 in the coastal city of Ningbo in eastern China, an optimal weighted ensemble model (OWEM) integrating multiple meteorological variables from the ERA5 reanalysis data sets is constructed and validated to estimate the CBLH. The mean absolute percentage error of the CBLH derived by OWEM is as low as 3%–5%, significantly lower than the 36%–65% of the ERA5 CBLH products. Furthermore, three categories of weather scenarios, that is, sunny, cloudy, and rainy, are discussed separately, and OWEM shows greater performance and higher accuracy than the traditional Least Absolute Shrinkage and Selection Operator, Random Forest, Adaboost, LightGBM, and ensemble models; among these, OWEM on fair weather days behaves best, with a robust R2 of 0.97 and a minimum mean absolute error (MAE) of 23 m. Further training results based on wind flow classification, that is, land breeze, sea breeze, and parallel wind, also indicate the outperformance of OWEM over the other models, with a relatively large error of 50 m under parallel wind. Subsequent analysis with the Shapley Additive Explanations method, which correlates strongly with model feature importance, reveals that thermodynamic factors such as temperature (T2m) and wind velocity (10 m U) are the major factors positively related to estimation accuracy during sunny days. Nevertheless, Relative Humidity dominates on rainy and cloudy days, TP on land breeze days, and dynamic variables like 10 m U and 10 m V across all types of wind flow weather. 
In conclusion, the accurate estimation of CBLH from OWEM serves as a feasible and innovative approach, providing technical support for marine meteorology and related engineering applications, for example, onshore wind power and coastal ecological protection. Plain Language Summary: The coastal boundary layer height (CBLH) is fast becoming a key parameter in atmosphere‐ocean interaction studies, and machine learning is considered an effective way to estimate it accurately. In this work, we present a new model called the optimal weighted ensemble model (OWEM), which integrates several meteorological variables from reanalysis data sets to estimate the CBLH, and compare it with the ERA5 BLH under different weather conditions. We also analyze the feature importance that affects estimation accuracy using the Shapley Additive Explanations method and identify several variables as essential factors, to better understand the applicability and characterization of the OWEM for sunny, cloudy, rainy, and different wind flow days. [ABSTRACT FROM AUTHOR]
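The abstract does not publish the OWEM weighting scheme. One common way to build an "optimal weighted ensemble" is to learn non-negative, sum-to-one weights over base-model predictions on a validation set; the sketch below makes that assumption, with synthetic data standing in for the ERA5 variables and generic base learners standing in for the paper's model set.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor

# Synthetic stand-in for the multi-variable CBLH regression task
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

models = [Lasso(alpha=1.0),
          RandomForestRegressor(n_estimators=100, random_state=0),
          AdaBoostRegressor(random_state=0)]
# Column j holds base model j's predictions on the validation set
P = np.column_stack([m.fit(X_tr, y_tr).predict(X_va) for m in models])

def mse(w):
    return float(np.mean((P @ w - y_va) ** 2))

# Non-negative weights constrained to sum to 1, minimizing validation MSE
w0 = np.full(len(models), 1.0 / len(models))
res = minimize(mse, w0, bounds=[(0, 1)] * len(models),
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
w = res.x
print(np.round(w, 3))
```

The optimized weights then combine the base models' test-set predictions as `P_test @ w`.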
- Published
- 2024
- Full Text
- View/download PDF
34. Machine-Learning-Based Predictive Models for Punching Shear Strength of FRP-Reinforced Concrete Slabs: A Comparative Study.
- Author
-
Xu, Weidong and Shi, Xianying
- Subjects
MACHINE learning, FIBER-reinforced plastics, SHEAR strength, COMPOSITE structures, SUPPORT vector machines, CONCRETE slabs - Abstract
This study is focused on the punching strength of fiber-reinforced polymer (FRP) concrete slabs. The mechanical properties of reinforced concrete slabs are often constrained by their punching shear strength at the column connection regions. Researchers have explored fiber-reinforced polymer reinforcement as an alternative to traditional steel reinforcement to address this limitation. However, current design codes predict the punching shear strength of FRP-reinforced concrete slabs poorly. The aim of this study was to create a robust model that can accurately predict punching shear strength, thus improving the analysis and design of composite structures with FRP-reinforced concrete slabs. In this study, 189 sets of experimental data were collected, and six machine learning models, including linear regression, support vector machine, BP neural network, decision tree, random forest, and eXtreme Gradient Boosting, were constructed and evaluated based on goodness of fit, standard deviation, and root-mean-square error in order to select the most suitable model for this study. The optimal model was compared with the models proposed by design codes and by other researchers. Finally, a model explainability study was conducted using SHapley Additive exPlanations (SHAP). The results showed that random forest performed best among all machine learning models and outperformed the existing models suggested by codes and researchers. The effective depth of the FRP-reinforced concrete slabs was the most important feature and was proportional to the punching shear strength. This study not only provides guidance on the design of FRP-reinforced concrete slabs but also informs future engineering practice. [ABSTRACT FROM AUTHOR]
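The model family the study selects, a random forest regressor with impurity-based feature importances, can be sketched as follows. The 189-sample synthetic dataset and six generic inputs are stand-ins, not the paper's database of slab tests (which would include features such as effective depth and reinforcement ratio).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the 189-test punching shear database
X, y = make_regression(n_samples=189, n_features=6, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X_tr, y_tr)
pred = rf.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
top = int(np.argmax(rf.feature_importances_))   # most influential input
print(round(r2_score(y_te, pred), 3), round(rmse, 2), top)
```

In the paper's setting, the `feature_importances_` ranking is what singles out effective depth as the dominant predictor; SHAP then refines that into per-sample attributions.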
- Published
- 2024
- Full Text
- View/download PDF
35. Revolutionizing engineered cementitious composite materials (ECC): the impact of XGBoost-SHAP analysis on polyvinyl alcohol (PVA) based ECC predictions.
- Author
-
Uddin, Md Nasir, Al-Amin, and Hossain, Shameem
- Subjects
ARTIFICIAL neural networks, CEMENT composites, MACHINE learning, SUPPORT vector machines, RANDOM forest algorithms - Abstract
Copyright of Low-Carbon Materials & Green Construction is the property of Springer Nature and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
36. Interpretable machine learning for the prediction of death risk in patients with acute diquat poisoning.
- Author
-
Li, Huiyi, Liu, Zheng, Sun, Wenming, Li, Tiegang, and Dong, Xuesong
- Subjects
POISONING, DEATH forecasting, MACHINE learning, RECEIVER operating characteristic curves, DECISION making, SUPPORT vector machines - Abstract
The aim of this study was to develop and validate predictive models for assessing the risk of death in patients with acute diquat (DQ) poisoning using innovative machine learning techniques. Additionally, predictive models were evaluated through the application of SHapley Additive ExPlanations (SHAP). A total of 201 consecutive patients from the emergency departments of the First Hospital and Shengjing Hospital of China Medical University admitted for deliberate oral intake of DQ from February 2018 to August 2023 were analysed. The initial clinical data of the patients with acute DQ poisoning were collected. Machine learning methods such as logistic regression, random forest, support vector machine (SVM), and gradient boosting were applied to build the prediction models. The whole sample was split into a training set and a test set at a ratio of 8:2. The performances of these models were assessed in terms of discrimination, calibration, and clinical decision curve analysis (DCA). We also used the SHAP interpretation tool to provide an intuitive explanation of the risk of death in patients with DQ poisoning. Logistic regression, random forest, SVM, and gradient boosting models were established, and the areas under the receiver operating characteristic curves (AUCs) were 0.91, 0.98, 0.96 and 0.94, respectively. The net benefits were similar across all four models. The four machine learning models can be reliable tools for predicting death risk in patients with acute DQ poisoning. Their combination with SHAP provides explanations for individualized risk prediction, increasing the model transparency. [ABSTRACT FROM AUTHOR]
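The four-model comparison described above (logistic regression, random forest, SVM, gradient boosting, each scored by test-set AUC after an 8:2 split) can be sketched on synthetic data; the 201-sample dataset below is an invented stand-in for the patient cohort.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the 201-patient cohort; 8:2 train/test split as in the study
X, y = make_classification(n_samples=201, n_features=10, weights=[0.7],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)

models = {"logistic": LogisticRegression(max_iter=1000),
          "random_forest": RandomForestClassifier(random_state=0),
          "svm": SVC(probability=True, random_state=0),
          "gradient_boosting": GradientBoostingClassifier(random_state=0)}
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
print({k: round(v, 2) for k, v in aucs.items()})
```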
- Published
- 2024
- Full Text
- View/download PDF
37. Prediction of Cognitive Impairment Risk among Older Adults: A Machine Learning-Based Comparative Study and Model Development.
- Author
-
Li, Jianwei, Li, Jie, Zhu, Huafang, Liu, Mengyu, Li, Tengfei, He, Yeke, Xu, Yuan, Huang, Fen, and Qin, Qirong
- Subjects
COGNITION disorders diagnosis, COGNITION disorder risk factors, RISK assessment, LIFESTYLES, SELF-evaluation, COMMUNITY health services, PREDICTION models, RESEARCH funding, BODY mass index, QUESTIONNAIRES, PRIMARY health care, DESCRIPTIVE statistics, ECONOMIC status, AGE distribution, CHRONIC diseases, NEUROPSYCHOLOGICAL tests, DIASTOLIC blood pressure, MACHINE learning, COMPARATIVE studies, SOCIODEMOGRAPHIC factors, ANTHROPOMETRY, BODY movement, CONFIDENCE intervals, SYSTOLIC blood pressure, EARLY diagnosis, SENSITIVITY & specificity (Statistics), COGNITION, ALGORITHMS, EDUCATIONAL attainment, SOCIAL participation, ACTIVITIES of daily living, EVALUATION, OLD age - Abstract
Introduction: The prevalence of cognitive impairment and dementia in the older population is increasing; early detection of cognitive decline is therefore essential for effective intervention. Methods: This study included 2,288 participants with normal cognitive function from the Ma'anshan Healthy Aging Cohort Study. Forty-two potential predictors, including demographic characteristics, chronic diseases, lifestyle factors, anthropometric indices, physical function, and baseline cognitive function, were selected based on clinical importance and previous research. The dataset was partitioned into training, validation, and test sets in a 60%/20%/20% proportion. Recursive feature elimination was used for feature selection, and six machine learning algorithms were employed for model development. The performance of the models was evaluated using area under the curve (AUC), specificity, sensitivity, and accuracy. Moreover, SHapley Additive exPlanations (SHAP) was conducted to assess the interpretability of the final selected model and to gain insights into the impact of features on the prediction outcomes. SHAP force plots were constructed to show the application of the prediction model at the individual level. Results: The final predictive model based on the Naive Bayes algorithm achieved an AUC of 0.820 (95% CI, 0.773–0.887) on the test set, outperforming the other algorithms. The top ten influential features in the model included baseline Mini-Mental State Examination (MMSE), education, self-reported economic status, collective or social activities, Pittsburgh sleep quality index (PSQI), body mass index, systolic blood pressure, diastolic blood pressure, instrumental activities of daily living, and age. The model demonstrated the potential to identify older adults at a higher risk of cognitive impairment within 3 years. 
Conclusion: The predictive model developed in this study contributes to the early detection of cognitive impairment in older adults by primary healthcare staff in community settings. [ABSTRACT FROM AUTHOR]
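The recursive-feature-elimination step followed by a Naive Bayes classifier can be sketched as below. One assumption to flag: Gaussian Naive Bayes exposes no coefficients or importances, so RFE here wraps a logistic model purely to rank and prune the 42 features (the abstract does not state which estimator drove the paper's RFE); the data are synthetic stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the 2,288-participant, 42-predictor dataset
X, y = make_classification(n_samples=2288, n_features=42, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# RFE ranks features with a logistic model, then Naive Bayes is fit on the subset
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=10).fit(X_tr, y_tr)
nb = GaussianNB().fit(X_tr[:, rfe.support_], y_tr)
auc = roc_auc_score(y_te, nb.predict_proba(X_te[:, rfe.support_])[:, 1])
print(int(rfe.support_.sum()), round(auc, 3))
```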
- Published
- 2024
- Full Text
- View/download PDF
38. Unravelling Complexity: Investigating the Effectiveness of SHAP Algorithm for Improving Explainability in Network Intrusion System Across Machine and Deep Learning Models.
- Author
-
Vaswani, Lakshya, Harsha, Sai Sri, Jaiswal, Subham, and D., Aju
- Subjects
ARTIFICIAL neural networks, MACHINE learning, DEEP learning, DISTRIBUTED computing, FEATURE selection, INTRUSION detection systems (Computer security) - Abstract
According to several studies, it is feasible to significantly raise the detection engine's effectiveness and accuracy by choosing the right features for a threat detection system. New advances like Distributed Computing and Enormous Information have expanded network traffic, and the danger identification framework must proactively gain and dissect the information delivered by the approaching traffic. Nonetheless, not all elements in an enormous dataset help to portray the traffic, therefore restricting and picking few reasonable highlights might accelerate and improve the danger discovery framework's exactness. Deep neural networks enhance the detection rates of intrusion detection models, making machine learning-based intrusion detection systems (IDS's) useful recently. Consumers, however, find it more and more challenging to comprehend the reasoning behind their selections as models become more complex accuracy. Using relevant features from the NSL-KDD dataset, we apply appropriate feature selection mechanisms to implement an intrusion detection system to implement a faster system with increased accuracy. We use Explainable Model (SHAP) to interpret the results of IDS. The interpretation of findings utilizing the Explainable Model (SHAP) for machine learning (ML) and deep learning (DL) models heavily depends on its efficiency. While DL models require more resources, ML models are computationally efficient. Both models, however, gain from SHAP interpretations, which offer perceptions into the significance of features and contributions to predictions. While DL models excel in accuracy, ML models offer efficiency. The decision is based on the particular needs and resources that are available, with SHAP offering greater knowledge of model behavior and feature impact. [ABSTRACT FROM AUTHOR]
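The explain-then-prune loop described above can be sketched without the `shap` package itself: permutation importance is a lighter model-agnostic proxy used here purely to rank features before shrinking the IDS input space. The traffic data are synthetic stand-ins for NSL-KDD, and the choice of keeping eight features is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for NSL-KDD traffic features
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ids = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Model-agnostic importance: shuffle each feature, measure the score drop
imp = permutation_importance(ids, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]   # most important first
keep = ranking[:8]                                 # pruned feature set

# Retrain a slimmer detector on the selected features only
slim = RandomForestClassifier(n_estimators=200, random_state=0)
slim_acc = slim.fit(X_tr[:, keep], y_tr).score(X_te[:, keep], y_te)
print(len(keep), round(slim_acc, 3))
```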
- Published
- 2024
- Full Text
- View/download PDF
39. Automated Machine Learning and Explainable AI (AutoML-XAI) for Metabolomics: Improving Cancer Diagnostics.
- Author
-
Bifarin, Olatomiwa O. and Fernández, Facundo M.
- Abstract
Metabolomics generates complex data necessitating advanced computational methods for generating biological insight. While machine learning (ML) is promising, the challenges of selecting the best algorithms and tuning hyperparameters, particularly for nonexperts, remain. Automated machine learning (AutoML) can streamline this process; however, the issue of interpretability could persist. This research introduces a unified pipeline that combines AutoML with explainable AI (XAI) techniques to optimize metabolomics analysis. We tested our approach on two data sets: renal cell carcinoma (RCC) urine metabolomics and ovarian cancer (OC) serum metabolomics. AutoML, using Auto-sklearn, surpassed standalone ML algorithms like SVM and k-Nearest Neighbors in differentiating between RCC and healthy controls, as well as OC patients and those with other gynecological cancers. The effectiveness of Auto-sklearn is highlighted by its AUC scores of 0.97 for RCC and 0.85 for OC, obtained from the unseen test sets. Importantly, on most of the metrics considered, Auto-sklearn demonstrated a better classification performance, leveraging a mix of algorithms and ensemble techniques. Shapley Additive Explanations (SHAP) provided a global ranking of feature importance, identifying dibutylamine and ganglioside GM-(d34:1) as the top discriminative metabolites for RCC and OC, respectively. Waterfall plots offered local explanations by illustrating the influence of each metabolite on individual predictions. Dependence plots spotlighted metabolite interactions, such as the connection between hippuric acid and one of its derivatives in RCC, and between GM3-(d34:1) and GM3(18:1_16:0) in OC, hinting at potential mechanistic relationships. Through decision plots, a detailed error analysis was conducted, contrasting feature importance for correctly versus incorrectly classified samples. 
In essence, our pipeline emphasizes the importance of harmonizing AutoML and XAI, facilitating both simplified ML application and improved interpretability in metabolomics data science. [ABSTRACT FROM AUTHOR]
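Auto-sklearn itself is a heavyweight dependency, so the sketch below is a minimal "AutoML-style" stand-in: a cross-validated search over several algorithms and hyperparameter grids, keeping whichever search scores best. The metabolomics feature table is synthetic, and the candidate grids are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a metabolomics feature table (samples x metabolites)
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate algorithms and hyperparameter grids to search over
search_space = [
    (SVC(probability=True), {"C": [0.1, 1, 10]}),
    (KNeighborsClassifier(), {"n_neighbors": [3, 5, 9]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
]
# Keep the algorithm+hyperparameters with the best cross-validated AUC
best = max((GridSearchCV(est, grid, cv=3, scoring="roc_auc").fit(X_tr, y_tr)
            for est, grid in search_space), key=lambda s: s.best_score_)
auc = roc_auc_score(y_te, best.predict_proba(X_te)[:, 1])
print(type(best.best_estimator_).__name__, round(auc, 3))
```

A full AutoML system additionally searches preprocessing steps and builds ensembles of the best configurations; the XAI step would then explain whatever pipeline wins.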
- Published
- 2024
- Full Text
- View/download PDF
40. An interpretable clinical ultrasound-radiomics combined model for diagnosis of stage I cervical cancer.
- Author
-
Xianyue Yang, Chuanfen Gao, Nian Sun, Xiachuan Qin, Xiaoling Liu, and Chaoxue Zhang
- Subjects
CERVICAL cancer, MACHINE learning, TRANSVAGINAL ultrasonography, FEATURE extraction, DECISION making - Abstract
Objective: The purpose of this retrospective study was to establish a combined model based on ultrasound (US) radiomics and clinical factors to predict stage I cervical cancer (CC) in patients before surgery. Materials and methods: A total of 209 CC patients who had cervical lesions found by transvaginal sonography (TVS) at the First Affiliated Hospital of Anhui Medical University were retrospectively reviewed; patients were divided into a training set (n = 146) and an internal validation set (n = 63), and 52 CC patients from Anhui Provincial Maternity and Child Health Hospital and Nanchong Central Hospital served as the external validation set. Independent clinical predictors were selected by univariate and multivariate logistic regression analyses. US-radiomics features were extracted from US images. After selecting the most significant features by univariate analysis, Spearman's correlation analysis, and the least absolute shrinkage and selection operator (LASSO) algorithm, six machine learning (ML) algorithms were used to build the radiomics model. Next, the ability of the clinical, US-radiomics, and combined clinical US-radiomics models to diagnose stage I CC was compared. Finally, the Shapley additive explanations (SHAP) method was used to explain the contribution of each feature. Results: The long diameter of the cervical lesion (L) and squamous cell carcinoma-associated antigen (SCCa) were independent clinical predictors of stage I CC. The eXtreme Gradient Boosting (XGBoost) model performed the best among the six ML radiomics models, with area under the curve (AUC) values in the training, internal validation, and external validation sets of 0.778, 0.751, and 0.751, respectively. Among the final three models, the combined model based on clinical features and rad-score showed good discriminative power, with AUC values in the training, internal validation, and external validation sets of 0.837, 0.828, and 0.839, respectively. 
The decision curve analysis validated the clinical utility of the combined nomogram. The SHAP algorithm illustrates the contribution of each feature in the combined model. Conclusion: We established an interpretable combined model to predict stage I CC. This non-invasive prediction method may be used for the preoperative identification of patients with stage I CC. [ABSTRACT FROM AUTHOR]
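The LASSO-style feature-selection step described above can be sketched with an L1-penalized logistic model, which zeroes out uninformative radiomics features; the surviving coefficients play the role of the selected signature. The 100-feature table below is a synthetic stand-in for the radiomics matrix, and the penalty strength `C=0.1` is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for 209 patients x 100 US-radiomics features
X, y = make_classification(n_samples=209, n_features=100, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# L1 (LASSO-type) penalty drives most coefficients exactly to zero
lasso = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
lasso.fit(X_tr, y_tr)
coefs = lasso.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coefs)           # indices of the retained features
auc = roc_auc_score(y_te, lasso.predict_proba(X_te)[:, 1])
print(len(selected), round(auc, 3))
```

In the paper's pipeline, a weighted sum over the selected features yields the rad-score that is then combined with the clinical predictors.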
- Published
- 2024
- Full Text
- View/download PDF
41. On Evaluating Black-Box Explainable AI Methods for Enhancing Anomaly Detection in Autonomous Driving Systems.
- Author
-
Nazat, Sazid, Arreche, Osvaldo, and Abdallah, Mustafa
- Subjects
AUTONOMOUS vehicles, ARTIFICIAL intelligence, INTRUSION detection systems (Computer security), INTERNET security - Abstract
The recent advancements in autonomous driving come with the associated cybersecurity issue of compromised networks of autonomous vehicles (AVs), motivating the use of AI models for detecting anomalies on these networks. In this context, using explainable AI (XAI) to explain the behavior of these anomaly detection AI models is crucial. This work introduces a comprehensive framework to assess black-box XAI techniques for anomaly detection within AVs, facilitating the examination of both global and local XAI methods to elucidate the decisions made by AI models classifying anomalous AV behavior. Considering six evaluation metrics (descriptive accuracy, sparsity, stability, efficiency, robustness, and completeness), the framework evaluates two well-known black-box XAI techniques, SHAP and LIME: each technique is applied to identify the primary features crucial for anomaly classification, followed by extensive experiments assessing SHAP and LIME across the six metrics on two prevalent autonomous driving datasets, VeReMi and Sensor. This study advances the deployment of black-box XAI methods for real-world anomaly detection in autonomous driving systems, contributing valuable insights into the strengths and limitations of current black-box XAI methods within this critical domain. [ABSTRACT FROM AUTHOR]
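One of the six metrics, descriptive accuracy, is commonly measured by progressively masking the features an explainer ranks highest and tracking the accuracy drop: a faithful explanation should degrade the model quickly. The sketch below makes that assumption concrete, using impurity importance as a stand-in ranking for SHAP/LIME and synthetic data in place of VeReMi/Sensor.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an AV anomaly-detection dataset
X, y = make_classification(n_samples=800, n_features=15, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Stand-in explanation: rank features by impurity importance
ranking = np.argsort(clf.feature_importances_)[::-1]
curve = []
for k in range(0, 6):
    X_mask = X_te.copy()
    X_mask[:, ranking[:k]] = 0.0        # neutralize the top-k features
    curve.append(clf.score(X_mask, y_te))
print([round(a, 3) for a in curve])     # accuracy vs. number of masked features
```

A steeper drop in `curve` indicates a more faithful (descriptively accurate) explanation; sparsity can be read off the same ranking by how concentrated the importance mass is.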
- Published
- 2024
- Full Text
- View/download PDF
42. Employing the Interpretable Ensemble Learning Approach to Predict the Bandgaps of the Halide Perovskites.
- Author
-
Ren, Chao, Wu, Yiyuan, Zou, Jijun, and Cai, Bowen
- Subjects
PEROVSKITE, HALIDES, STANDARD deviations, SOLAR cells - Abstract
Halide perovskite materials have broad prospects for applications in fields such as solar cells, LED devices, photodetectors, fluorescence labeling, bioimaging, and photocatalysis due to their bandgap characteristics. This study compiled experimental data from the published literature and utilized the excellent predictive capability, low overfitting risk, and strong robustness of ensemble learning models to analyze the bandgaps of halide perovskite compounds. The results demonstrate the effectiveness of ensemble decision tree models, especially the gradient boosting decision tree model, with a root mean square error of 0.090 eV, a mean absolute error of 0.053 eV, and a coefficient of determination of 93.11%. Analysis of composition ratios obtained by normalizing element molar quantities indicates that ions at the X and B sites significantly influence the bandgap. Additionally, doping with iodine atoms can effectively reduce the intrinsic bandgap, while hybridization of the s and p orbitals of tin atoms can also decrease the bandgap. The accuracy of the model is validated by predicting the bandgap of the photovoltaic material MASn1−xPbxI3. In conclusion, this study emphasizes the positive impact of machine learning on material development, especially in predicting the bandgaps of halide perovskite compounds, where ensemble learning methods demonstrate significant advantages. [ABSTRACT FROM AUTHOR]
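A gradient boosting decision tree regressor reporting the same three metrics (RMSE, MAE, R2) can be sketched as below; the dataset is a synthetic stand-in, since composition-derived features such as normalized ion ratios would come from the compiled literature data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the compiled bandgap dataset
X, y = make_regression(n_samples=400, n_features=10, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gbdt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = gbdt.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
print(round(rmse, 3), round(mean_absolute_error(y_te, pred), 3),
      round(r2_score(y_te, pred), 3))
```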
- Published
- 2024
- Full Text
- View/download PDF
43. Value-at-Risk forecasting for the Chinese new energy stock market: an explainable quantile regression neural network method.
- Author
-
Wang, Xiaoxu, Liu, Hui, and Yao, Yinhong
- Abstract
This study utilizes the quantile regression neural network (QRNN) model to forecast the Value-at-Risk (VaR) of ten new energy stock markets by considering ten influencing factors. An out-of-sample analysis shows that the VaRs based on the QRNN model pass the backtesting test. The market risk of the new energy market is notably higher, accompanied by reduced losses and more consistent volatility. Within this market, methanol emerges as the most stable and solar the least stable. Compared to the traditional energy market, the new energy market faces lower market risk during the sample period. Additionally, ESG and CNY/XDR have the greatest impact on VaR prediction. This study has important practical implications for investors, policy makers, and new energy market participants. [ABSTRACT FROM AUTHOR]
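The QRNN architecture is not detailed in the abstract; any conditional-quantile learner illustrates the same VaR idea, so the sketch below substitutes quantile gradient boosting (an assumption, not the paper's model) to estimate the 5% quantile of returns and then backtests coverage: roughly 5% of out-of-sample returns should fall below the predicted VaR.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
factors = rng.normal(size=(n, 10))               # ten influencing factors (synthetic)
returns = factors[:, 0] * 0.5 + rng.normal(scale=1.0, size=n)

split = 1500
# loss="quantile" with alpha=0.05 fits the conditional 5% quantile (the VaR level)
q = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0)
q.fit(factors[:split], returns[:split])
var_5 = q.predict(factors[split:])               # predicted 5% VaR per period

# Backtest: the empirical hit rate should sit near the nominal 5%
hit_rate = float(np.mean(returns[split:] < var_5))
print(round(hit_rate, 3))
```

Formal backtests (Kupiec, Christoffersen) test whether this hit rate deviates significantly from the nominal level.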
- Published
- 2024
- Full Text
- View/download PDF
44. Research on an interpretable lithium battery state-of-health estimation method based on multi-feature extraction.
- Author
-
王奥博, 霍为炜, and 贾云旭
- Subjects
STANDARD deviations, BATTERY management systems, LITHIUM-ion batteries, PREDICTION models, ELECTRONIC data processing, FEATURE extraction - Abstract
Copyright of Journal of Chongqing University of Technology (Natural Science) is the property of Chongqing University of Technology and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
45. Exploring the Feasibility of Vision-Based Non-Contact Oxygen Saturation Estimation: Considering Critical Color Components and Individual Differences.
- Author
-
Seong, Hyeon Ah, Seok, Chae Lin, and Lee, Eui Chul
- Subjects
OXYGEN saturation, CONVOLUTIONAL neural networks, PULSE oximeters, INDIVIDUAL differences, OXYGEN in the blood, STANDARD deviations, PEARSON correlation (Statistics) - Abstract
The blood oxygen saturation, which indicates the ratio of oxygenated hemoglobin to total hemoglobin in the blood, is closely related to one's health status. Oxygen saturation is typically measured using a pulse oximeter. However, this method can cause skin irritation, and in situations where there is a risk of infectious diseases, the use of such contact-based oxygen saturation measurement devices can increase the risk of infection. Therefore, recently, methods for estimating oxygen saturation using facial or hand images have been proposed. In this paper, we propose a method for estimating oxygen saturation from facial images based on a convolutional neural network (CNN). Particularly, instead of arbitrarily calculating the AC and DC components, which are essential for measuring oxygen saturation, we directly utilized signals obtained from facial images to train the model and predict oxygen saturation. Moreover, to account for the time-consuming nature of accurately measuring oxygen saturation, we diversified the model inputs. As a result, for inputs of 10 s, the Pearson correlation coefficient was calculated as 0.570, the mean absolute error was 1.755%, the root mean square error was 2.284%, and the intraclass correlation coefficient was 0.574. For inputs of 20 s, these metrics were calculated as 0.630, 1.720%, 2.219%, and 0.681, respectively. For inputs of 30 s, they were calculated as 0.663, 2.142%, 2.612%, and 0.646, respectively. This confirms that it is possible to estimate oxygen saturation without calculating the AC and DC components, which heavily influence the prediction results. Furthermore, we analyzed how the trained model predicted oxygen saturation through 'SHapley Additive exPlanations' and found significant variations in the feature contributions among participants. This indicates that, for more accurate predictions of oxygen saturation, it may be necessary to individually select appropriate color channels for each participant. 
[ABSTRACT FROM AUTHOR]
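The evaluation metrics quoted above (Pearson correlation, MAE, RMSE) can be computed with plain NumPy; the SpO2 values below are invented stand-ins, since the CNN itself is out of scope for a short sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_spo2 = rng.uniform(90, 100, size=200)            # reference oximeter values (%)
pred_spo2 = true_spo2 + rng.normal(0, 2.0, size=200)  # simulated model estimates

# Pearson correlation coefficient between reference and estimate
pearson_r = float(np.corrcoef(true_spo2, pred_spo2)[0, 1])
# Mean absolute error and root mean square error, in percentage points
mae = float(np.mean(np.abs(pred_spo2 - true_spo2)))
rmse = float(np.sqrt(np.mean((pred_spo2 - true_spo2) ** 2)))
print(round(pearson_r, 3), round(mae, 3), round(rmse, 3))
```

Note that MAE never exceeds RMSE, and both shrink as the input window lengthens in the study's 10/20/30-second comparison.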
- Published
- 2024
- Full Text
- View/download PDF
46. Predicting temporomandibular disorders in adults using interpretable machine learning methods: a model development and validation study
- Author
-
Yuchen Cui, Fujia Kang, Xinpeng Li, Xinning Shi, Han Zhang, and Xianchun Zhu
- Subjects
temporomandibular disorders, machine learning, prediction model, shapley additive explanations, random forest, Biotechnology, TP248.13-248.65 - Abstract
Introduction: Temporomandibular disorders (TMD) have a high prevalence and complex etiology. The purpose of this study was to apply a machine learning (ML) approach to identify risk factors for the occurrence of TMD in adults and to develop and validate an interpretable predictive model for the risk of TMD in adults. Methods: A total of 949 adults who underwent oral examinations were enrolled in our study. 5 different ML algorithms were used for model development and comparison, and feature selection was performed by feature importance ranking and feature decreasing methods. Several evaluation indexes, including the area under the receiver-operating-characteristic curve (AUC), were used to compare the predictive performance. The precision-recall curve (PR), calibration curve, and decision curve analysis (DCA) further assessed the accuracy and clinical utility of the model. Results: The performance of the random forest (RF) model was the best among the 5 ML models. An interpretable RF model was developed with 7 features (gender, malocclusion, unilateral chewing, chewing hard substances, grinding teeth, clenching teeth, and anxiety). The AUCs of the final model on the training set, internal validation set, and external test set were 0.892, 0.854, and 0.857, respectively. Calibration and DCA curves showed high accuracy and clinical applicability of the model. Discussion: An efficient and interpretable TMD risk prediction model for adults was successfully developed using the ML method. The model not only has good predictive performance but also gains clinical application value through the SHAP method. This model can provide clinicians with a practical and efficient TMD risk assessment tool that can help them better predict and assess TMD risk in adults, supporting more efficient disease management and targeted medical interventions.
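The random forest plus calibration-curve check described above can be sketched as follows; the 949-sample, 7-feature dataset is a synthetic stand-in (the real features are binary clinical findings, which this toy data does not reproduce).

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the 949-adult, 7-feature dataset
X, y = make_classification(n_samples=949, n_features=7, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)

# Calibration curve: observed event rate per bin of predicted probability;
# a well-calibrated model tracks the diagonal
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=5)
print(round(auc, 3), np.round(frac_pos, 2))
```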
- Published
- 2024
- Full Text
- View/download PDF
47. An Investigation of factors Influencing electric vehicles charging Needs: Machine learning approach
- Author
-
Cuthbert Ruseruka, Judith Mwakalonge, Gurcan Comert, Saidi Siuhi, Debbie Indah, Sarah Kasomi, and Tumlumbe Juliana Chengula
- Subjects
Utility of EV Charging Networks, Electric Vehicles, Alternative Fuel Vehicles, SHapley Additive exPlanations, Transportation and communications, HE1-9990 - Abstract
Recent technological advancements that have fostered the development of electric vehicles (EVs) contribute greatly to the decarbonization of the world. EVs relieve transportation's dependence on natural fossil fuels as an energy source. More than 50% of the petroleum products produced worldwide are estimated to be used in the transportation sector, accounting for more than 90% of all transportation energy sources. Consequently, studies estimate that the transportation sector produces about 22% of global carbon dioxide emissions, posing significant environmental issues. Thus, using EVs, particularly in road transport, is expected to reduce environmental pollution. To accelerate EV development and deployment, governments worldwide invest in EV development through various initiatives to make them more affordable. This research aims to investigate the changing needs of EV users to establish the factors to be considered in predicting charging demand using machine learning, with an extreme gradient boosting model. The model reached high accuracy, with an R2-score of 0.964 to 1.000 across all predicted needs. Model performance is strongly affected by age, median income, education, and car ownership. Higher proportions of people with high income, high education, and ages 35–54 contribute positively to the model's performance, in contrast to people aged 65+ or with low income and low educational attainment. The outcomes of this research document the factors that influence EV charging needs; they therefore provide a basis for decision-makers and all stakeholders to decide where to locate EV charging stations for usability, efficiency, sustainability, and social welfare.
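Since the study reports an R2 per predicted charging need, a multi-target regression sketch fits; plain gradient boosting stands in for XGBoost (an assumption, to stay within scikit-learn), wrapped for multiple outputs, and the demographic features are invented stand-ins.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in: 8 demographic features, 3 charging-need targets
X, Y = make_regression(n_samples=500, n_features=8, n_targets=3,
                       noise=1.0, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# One boosted regressor per target, as the per-need R2 reporting implies
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X_tr, Y_tr)
pred = model.predict(X_te)
per_need_r2 = [r2_score(Y_te[:, j], pred[:, j]) for j in range(3)]
print([round(r, 3) for r in per_need_r2])
```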
- Published
- 2024
- Full Text
- View/download PDF
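The R2 scores reported in the abstract above follow the standard coefficient-of-determination formula; a minimal NumPy sketch (the demand values below are hypothetical illustrations, not data from the study):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical charging-demand targets and model predictions
y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [10.5, 11.5, 15.5, 19.5]
print(round(r2_score(y_true, y_pred), 3))  # → 0.982
```

An R2 of 1.0 means the predictions explain all variance in the target; values near the abstract's 0.964–1.000 range indicate a near-perfect fit on the evaluated needs.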
48. Quantifying seasonal variations in pollution sources with machine learning-enhanced positive matrix factorization
- Author
-
Yaotao Xu, Peng Li, Minghui Zhang, Lie Xiao, Bo Wang, Xiaoming Zhang, Yunqi Wang, and Peng Shi
- Subjects
Positive Matrix Factorization (PMF), Machine learning optimization, Seasonal water quality variations, SHapley additive exPlanations, Multi-source pollution identification, Ecology, QH540-549.5 - Abstract
As industrialization and urbanization accelerate, water quality management faces growing challenges, and traditional methods for pollutant source apportionment often prove inadequate for complex environmental data. This study enhances the precision and reliability of pollutant source identification by integrating Positive Matrix Factorization (PMF) models with diverse machine learning techniques. Using data from 17 water quality monitoring stations along the Wuding River from 2017 to 2021, we employed Random Forest (RF), Support Vector Machine (SVM), Elastic Net (EN), and Extreme Gradient Boosting (XGBoost) models to predict the Water Quality Index (WQI) during dry and wet seasons. The RF model performed best in the dry season (R2 = 0.93), while the SVM was superior in the wet season (R2 = 0.94). SHAP (SHapley Additive exPlanations) value analysis identified CODMn and NH3-N as significant influences on the WQI in the dry season, whereas COD, BOD, and TP gained prominence during the wet season. SHAP values quantify the contribution of each feature to the model output, enhancing the model's transparency and interpretability. Additionally, the feature importances identified by machine learning were used as weights to optimize the contribution rates predicted by the PMF model. The optimized model identified the contributions of domestic and farm effluent discharges more accurately in the dry season, raising the identified share significantly from 19.4% to 45.4%, and it increased the estimated contributions of agricultural non-point sources and domestic effluent in the wet season. This research offers a novel perspective on the characteristics of river water pollution and holds significant implications for formulating data-driven environmental management strategies.
- Published
- 2024
- Full Text
- View/download PDF
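The weighting step described above — using ML feature importance to adjust PMF-predicted contribution rates — could look roughly like the following sketch, under the simplifying assumption that each pollution source maps to a single scalar weight (e.g. a normalized mean |SHAP| value for its marker features). Source names and all numbers are illustrative, not the paper's:

```python
# Hypothetical PMF contribution rates (%) per pollution source
pmf_contrib = {"domestic sewage": 20.0, "agricultural runoff": 35.0, "industrial": 45.0}

# Hypothetical ML-derived importance weights per source
# (e.g. normalized mean |SHAP| of each source's marker features)
ml_weight = {"domestic sewage": 0.5, "agricultural runoff": 0.3, "industrial": 0.2}

def reweight(contrib, weight):
    """Scale each PMF contribution by its ML weight, then renormalize to 100%."""
    raw = {src: contrib[src] * weight[src] for src in contrib}
    total = sum(raw.values())
    return {src: 100.0 * v / total for src, v in raw.items()}

adjusted = reweight(pmf_contrib, ml_weight)
```

With these illustrative weights, the share attributed to domestic sewage rises after reweighting while the total still sums to 100%, mirroring the kind of shift the study reports between the raw and optimized PMF outputs.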
49. Advancing water quality assessment and prediction using machine learning models, coupled with explainable artificial intelligence (XAI) techniques like shapley additive explanations (SHAP) for interpreting the black-box nature
- Author
-
Randika K. Makumbura, Lakindu Mampitiya, Namal Rathnayake, D.P.P. Meddage, Shagufta Henna, Tuan Linh Dang, Yukinobu Hoshino, and Upaka Rathnayake
- Subjects
Water quality assessment, Machine learning, Explainable artificial intelligence, Shapley additive explanations, Prediction models, Technology - Abstract
Water quality assessment and prediction play crucial roles in ensuring the sustainability and safety of freshwater resources. This study aims to enhance water quality assessment and prediction by integrating advanced machine learning models with XAI techniques. Traditional methods, such as the water quality index, often require extensive data collection and laboratory analysis, making them resource-intensive. The weighted arithmetic water quality index is employed alongside machine learning models, specifically RF, LightGBM, and XGBoost, to predict water quality. The models' performance was evaluated using mean absolute error (MAE), root mean square error (RMSE), R2, and the correlation coefficient R. The results demonstrated high predictive accuracy, with XGBoost performing best (R2 = 0.992, R = 0.996, MAE = 0.825, and RMSE = 1.381). Additionally, SHAP values were used to interpret the model's predictions, revealing that COD and BOD are the most influential factors in determining water quality, while electrical conductivity, chloride, and nitrate had minimal impact. High dissolved oxygen levels were associated with a lower water quality index, indicative of excellent water quality, and pH consistently influenced predictions. The findings suggest that the proposed approach offers a reliable and interpretable method for water quality prediction that can significantly benefit water specialists and decision-makers.
- Published
- 2024
- Full Text
- View/download PDF
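The weighted arithmetic water quality index mentioned above follows a standard formulation, WQI = Σ(q_i·w_i) / Σ(w_i), where the unit weight w_i is inversely proportional to each parameter's permissible standard S_i and q_i is a quality rating. A minimal sketch with hypothetical standards, ideal values, and measurements (not the study's data):

```python
# Hypothetical permissible standards S_i, measured values V_i,
# and ideal values V_ideal for four water quality parameters.
standards = {"DO": 5.0,  "BOD": 5.0, "COD": 10.0, "pH": 8.5}
measured  = {"DO": 6.0,  "BOD": 3.0, "COD": 8.0,  "pH": 7.2}
ideal     = {"DO": 14.6, "BOD": 0.0, "COD": 0.0,  "pH": 7.0}

k = 1.0  # proportionality constant
weights = {p: k / s for p, s in standards.items()}  # w_i = k / S_i

def quality_rating(p):
    """q_i = 100 * (V_i - V_ideal) / (S_i - V_ideal)."""
    return 100.0 * (measured[p] - ideal[p]) / (standards[p] - ideal[p])

# Weighted arithmetic WQI: lower values indicate better water quality.
wqi = sum(quality_rating(p) * weights[p] for p in standards) / sum(weights.values())
```

A low WQI under this convention corresponds to excellent water quality, which is why the abstract associates high dissolved oxygen with a lower index value.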
50. Explainable machine learning-based fractional vegetation cover inversion and performance optimization – A case study of an alpine grassland on the Qinghai-Tibet Plateau
- Author
-
Xinhong Li, Jianjun Chen, Zizhen Chen, Yanping Lan, Ming Ling, Qinyi Huang, Hucheng Li, Xiaowen Han, and Shuhua Yi
- Subjects
Fractional vegetation cover, Machine learning, Underlying surface heterogeneity, SHapley Additive exPlanations, Optuna, Information technology, T58.5-58.64, Ecology, QH540-549.5 - Abstract
Fractional Vegetation Cover (FVC) serves as a crucial indicator in ecological sustainability and climate change monitoring. While machine learning is the primary method for FVC inversion, there are still certain shortcomings in feature selection, hyperparameter tuning, underlying surface heterogeneity, and explainability. Addressing these challenges, this study leveraged extensive FVC field data from the Qinghai-Tibet Plateau. Initially, a feature selection algorithm combining genetic algorithms and XGBoost was proposed. This algorithm was integrated with the Optuna tuning method, forming the GA-OP combination to optimize feature selection and hyperparameter tuning in machine learning. Furthermore, comparative analyses of various machine learning models for FVC inversion in alpine grassland were conducted, followed by an investigation into the impact of the underlying surface heterogeneity on inversion performance using the NDVI Coefficient of Variation (NDVI-CV). Lastly, the SHAP (Shapley Additive exPlanations) method was employed for both global and local interpretations of the optimal model. The results indicated that: (1) GA-OP combination exhibited favorable performance in terms of computational cost and inversion accuracy, with Optuna demonstrating significant potential in hyperparameter tuning. (2) Stacking model achieved optimal performance in FVC inversion for alpine grassland among the seven models (R2 = 0.867, RMSE = 0.12, RPD = 2.552, BIAS = −0.0005, VAR = 0.014), with the performance ranking as follows: Stacking > CatBoost > XGBoost > LightGBM > RFR > KNN > SVR. (3) NDVI-CV enhanced inversion performance and result reliability by excluding data from highly heterogeneous regions that tended to be either overestimated or underestimated. (4) SHAP revealed the decision-making processes of the Stacking and CatBoost models from both global and local perspectives. This allowed for a deeper exploration of the causality between features and targets. 
This study developed a high-precision FVC inversion scheme and successfully achieved accurate FVC inversion on the Qinghai-Tibet Plateau. The proposed approach provides a valuable reference for other ecological parameter inversions.
- Published
- 2024
- Full Text
- View/download PDF
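The NDVI-CV screening described above reduces to a coefficient-of-variation threshold: pixels whose local NDVI varies strongly relative to the mean are treated as heterogeneous and excluded. A minimal NumPy sketch (the window values and the 0.15 cutoff are illustrative assumptions, not the study's):

```python
import numpy as np

def ndvi_cv(window):
    """Coefficient of variation (std / mean) of NDVI over a local window."""
    window = np.asarray(window, dtype=float)
    return window.std() / window.mean()

CV_THRESHOLD = 0.15  # illustrative cutoff for "homogeneous" pixels

homogeneous = [0.62, 0.60, 0.63, 0.61]  # grassland pixels with similar cover
mixed       = [0.10, 0.65, 0.20, 0.70]  # bare soil mixed with vegetation

keep_homog = ndvi_cv(homogeneous) < CV_THRESHOLD  # retained for inversion
keep_mixed = ndvi_cv(mixed) < CV_THRESHOLD        # excluded as heterogeneous
```

Excluding the high-CV windows removes exactly the pixels whose FVC tends to be over- or underestimated, which is how the study reports NDVI-CV improving inversion reliability.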