Descriptor: "Feature Engineering" / Search Limiters: Available in Library Collection - Searchworks@Jio Institute Digital Library Search Results

Your search for "Feature Engineering" returned 3,142 results

1. Prediction of surface roughness using deep learning and data augmentation

Author: Guo, Miaoxian, Wei, Shouheng, Han, Chentong, Xia, Wanliang, Luo, Chao, and Lin, Zhijian
Published: 2024
Full Text: View/download PDF

2. Feature-based detection of breast cancer using convolutional neural network and feature engineering.

Author: Essa, Hiba Allah, Ismaiel, Ebrahim, and Hinnawi, Mhd Firas Al
Abstract: Breast cancer (BC) is a prominent cause of female mortality on a global scale. Recently, there has been growing interest in utilizing blood and tissue-based biomarkers to detect and diagnose BC, as this method offers a non-invasive approach. To improve the classification and prediction of BC using large biomarker datasets, several machine-learning techniques have been proposed. In this paper, we present a multi-stage approach that consists of computing new features and then sorting them into an input image for the ResNet50 neural network. The method involves transforming the original values into normalized values based on their membership in the Gaussian distribution of healthy and BC samples of each feature. To test the effectiveness of our proposed approach, we employed the Coimbra and Wisconsin datasets. The results demonstrate efficient performance improvement, with an accuracy of 100% and 100% using the Coimbra and Wisconsin datasets, respectively. Furthermore, the comparison with existing literature validates the reliability and effectiveness of our methodology, where the normalized value can reduce the misclassified samples of ML techniques because of its generality. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

3. Integrated Mixed Potential Gas Sensor with Efficient Structure for Discriminative Volatile Organic Compounds Detection.

Author: Lv, Siyuan, Gu, Tianyi, Pu, Qi, Wang, Bin, Jia, Xiaoteng, Sun, Peng, Wang, Lijun, Liu, Fangmeng, and Lu, Geyu
Subjects: *GAS detectors, *ACETIC acid, *GAS engineering, *PENTANE, *ISOPRENE
Abstract: Amid growing interest in the precise detection of volatile organic compounds (VOCs) in industrial field, the demand for highly effective gas sensors is at an all‐time high. However, traditional sensors with their classic single‐output signal, bulky and complex integrated structure when forming array often involve complicated technology and high cost, limiting their widespread adoption. Here, this study introduces a novel approach, employing an integrated YSZ‐based (YSZ: yttria‐stabilized zirconia) mixed potential sensor equipped with a triple‐sensing electrode array, to efficiently detect and differentiate six types of VOCs gases. This innovative sensor integrates NiSb2O6, CuSb2O6, and MgSb2O6 sensing electrodes (SEs), which are sensitive to pentane, isoprene, n‐propanol, acetone, acetic acid, and formaldehyde gases. Through feature engineering based on intuitive spike‐based response values, it accentuates the distinct characteristics of every gas. Eventually, an average classification accuracy of 98.8% and an overall R‐squared error (R2) of 99.3% for concentration regression toward six target gases can be achieved, showcasing the potential to quantitatively distinguish between industrial hazardous VOCs gases. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

4. Enhancing Fire Protection in Electric Vehicle Batteries Based on Thermal Energy Storage Systems Using Machine Learning and Feature Engineering.

Author: Kiasari, Mahmoud M. and Aly, Hamed H.
Subjects: *FIRE protection engineering, *HEAT storage, *ELECTRIC vehicle batteries, *ELECTRIC batteries, *ENERGY storage
Abstract: Thermal Energy Storage (TES) plays a pivotal role in the fire protection of Li-ion batteries, especially for the high-voltage (HV) battery systems in Electrical Vehicles (EVs). This study covers the application of TES in mitigating thermal runaway risks during different battery charging/discharging conditions known as Vehicle-to-grid (V2G) and Grid-to-vehicle (G2V). Through controlled simulations in Simulink, this research models real-world scenarios to analyze the effectiveness of TES in controlling battery conditions under various environmental conditions. This study also integrates Machine Learning (ML) techniques to utilize the produced data by the simulation model and to predict any probable thermal spikes and enhance the system reliability, focusing on crucial factors like battery temperature, current, or State of charge (SoC). Feature engineering is also employed to identify the key parameters among all features that are considered for this study. For a broad comparison among different models, three different ML techniques, logistic regression, support vector machine (SVM), and Naïve Bayes, have been used alongside their hybrid combination to determine the most accurate one for the related topic. This study concludes that SoC is the most significant factor affecting thermal management while grid power consumption has the least impact. Additionally, the findings demonstrate that logistic regression outperforms other methods, with the improving feature to be used in the hybrid models as it can increase their efficiency due to its linearity capture capability. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

5. Wearable Sensor-Based Assessments for Remotely Screening Early-Stage Parkinson's Disease.

Author: Johnson, Shane, Kantartjis, Michalis, Severson, Joan, Dorsey, Ray, Adams, Jamie L., Kangarloo, Tairmae, Kostrzebski, Melissa A., Best, Allen, Merickel, Michael, Amato, Dan, Severson, Brian, Jezewski, Sean, Polyak, Steve, Keil, Anna, Cosman, Josh, and Anderson, David
Subjects: *MACHINE learning, *PARKINSON'S disease, *MOBILE health, *MEDICAL screening, *RANDOM forest algorithms
Abstract: Prevalence estimates of Parkinson's disease (PD)—the fastest-growing neurodegenerative disease—are generally underestimated due to issues surrounding diagnostic accuracy, symptomatic undiagnosed cases, suboptimal prodromal monitoring, and limited screening access. Remotely monitored wearable devices and sensors provide precise, objective, and frequent measures of motor and non-motor symptoms. Here, we used consumer-grade wearable device and sensor data from the WATCH-PD study to develop a PD screening tool aimed at eliminating the gap between patient symptoms and diagnosis. Early-stage PD patients (n = 82) and age-matched comparison participants (n = 50) completed a multidomain assessment battery during a one-year longitudinal multicenter study. Using disease- and behavior-relevant feature engineering and multivariate machine learning modeling of early-stage PD status, we developed a highly accurate (92.3%), sensitive (90.0%), and specific (100%) random forest classification model (AUC = 0.92) that performed well across environmental and platform contexts. These findings provide robust support for further exploration of consumer-grade wearable devices and sensors for global population-wide PD screening and surveillance. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

6. Short-Term Forecasts of Energy Generation in a Solar Power Plant Using Various Machine Learning Models, along with Ensemble and Hybrid Methods.

Author: Piotrowski, Paweł and Kopyt, Marcin
Subjects: *SOLAR power plants, *MACHINE learning, *SOLAR energy, *PREDICTION models, *METHODS engineering
Abstract: High-quality short-term forecasts of electrical energy generation in solar power plants are crucial in the dynamically developing sector of renewable power generation. This article addresses the issue of selecting appropriate (preferred) methods for forecasting energy generation from a solar power plant within a 15 min time horizon. The effectiveness of various machine learning methods was verified. Additionally, the effectiveness of proprietary ensemble and hybrid methods was proposed and examined. The research also aimed to determine the appropriate sets of input variables for the predictive models. To enhance the performance of the predictive models, proprietary additional input variables (feature engineering) were constructed. The significance of individual input variables was examined depending on the predictive model used. This article concludes with findings and recommendations regarding the preferred predictive methods. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

7. Lobish: Symbolic Language for Interpreting Electroencephalogram Signals in Language Detection Using Channel-Based Transformation and Pattern.

Author: Tuncer, Turker, Dogan, Sengul, Tasci, Irem, Baygin, Mehmet, Barua, Prabal Datta, and Acharya, U. Rajendra
Subjects: *SIGNAL processing, *ENGINEERING models, *SIGNAL detection, *K-nearest neighbor classification, *MACHINE learning
Abstract: Electroencephalogram (EEG) signals contain information about the brain's state as they reflect the brain's functioning. However, the manual interpretation of EEG signals is tedious and time-consuming. Therefore, automatic EEG translation models need to be proposed using machine learning methods. In this study, we proposed an innovative method to achieve high classification performance with explainable results. We introduce channel-based transformation, a channel pattern (ChannelPat), the t algorithm, and Lobish (a symbolic language). By using channel-based transformation, EEG signals were encoded using the index of the channels. The proposed ChannelPat feature extractor encoded the transition between two channels and served as a histogram-based feature extractor. An iterative neighborhood component analysis (INCA) feature selector was employed to select the most informative features, and the selected features were fed into a new ensemble k-nearest neighbor (tkNN) classifier. To evaluate the classification capability of the proposed channel-based EEG language detection model, a new EEG language dataset comprising Arabic and Turkish was collected. Additionally, Lobish was introduced to obtain explainable outcomes from the proposed EEG language detection model. The proposed channel-based feature engineering model was applied to the collected EEG language dataset, achieving a classification accuracy of 98.59%. Lobish extracted meaningful information from the cortex of the brain for language detection. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

8. Interpretable Machine Learning‐Assisted High‐Throughput Screening for Understanding NRR Electrocatalyst Performance Modulation between Active Center and C‐N Coordination.

Author: Sun, Jinxin, Chen, Anjie, Guan, Junming, Han, Ying, Liu, Yongjun, Niu, Xianghong, He, Maoshuai, Shi, Li, Wang, Jinlan, and Zhang, Xiuyun
Subjects: CONDUCTION electrons, ELECTROLYTIC reduction, SURFACE reactions, ELECTRONIC structure, MACHINE learning
Abstract: Understanding the correlation between the fundamental descriptors and catalytic performance is meaningful to guide the design of high‐performance electrochemical catalysts. However, exploring key factors that affect catalytic performance in the vast catalyst space remains challenging for people. Herein, to accurately identify the factors that affect the performance of N2 reduction, we apply interpretable machine learning (ML) to analyze high‐throughput screening results, which is also suited to other surface reactions in catalysis. To expound on the paradigm, 33 promising catalysts are screened from 168 carbon‐supported candidates, specifically single‐atom catalysts (SACs) supported by a BC3 monolayer (TM@VB/C‐Nn = 0–3‐BC3) via high‐throughput screening. Subsequently, the hybrid sampling method and XGBoost model are selected to classify eligible and non‐eligible catalysts. Through feature interpretation using Shapley Additive Explanations (SHAP) analysis, two crucial features, that is, the number of valence electrons (Nv) and nitrogen substitution (Nn), are screened out. Combining SHAP analysis and electronic structure calculations, the synergistic effect between an active center with low valence electron numbers and reasonable C‐N coordination (a medium fraction of nitrogen substitution) can exhibit high catalytic performance. Finally, six superior catalysts with a limiting potential lower than −0.4 V are predicted. Our workflow offers a rational approach to obtaining key information on catalytic performance from high‐throughput screening results to design efficient catalysts that can be applied to other materials and reactions. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

9. XAI-PhD: Fortifying Trust of Phishing URL Detection Empowered by Shapley Additive Explanations.

Author: Al-Fayoumi, Mustafa, Alhijawi, Bushra, Al-Haija, Qasem Abu, and Armoush, Rakan
Subjects: SOCIAL engineering (Fraud), PHISHING, ARTIFICIAL intelligence, CYBERTERRORISM, MACHINE learning
Abstract: The rapid growth of the Internet has led to an increased demand for online services. However, this surge in online activity has also brought about a new threat: phishing attacks. Phishing is a type of cyberattack that utilizes social engineering techniques and technological manipulations to steal crucial information from unsuspecting individuals. Consequently, there is a rising necessity to create dependable phishing URL detection models that can effectively identify phishing URLs with enhanced accuracy and reduced prediction overhead. This study introduces XAI-PhD, an innovative phishing detection method that utilizes machine learning (ML) and Shapley additive explanation (SHAP) capabilities. Specifically, XAI-PhD utilizes SHAP to thoroughly analyze the significance of each feature in influencing the decision-making process of the classifier. By selectively incorporating input characteristics based on their SHAP values, only the most crucial attributes are assessed, enabling the development of a highly adaptable and generalized model. XAI-PhD utilizes a lightweight gradient boosting machine as its classifier, and a series of rigorous tests are conducted to assess its performance compared to established baseline methods. The empirical findings unequivocally demonstrate the exceptional effectiveness of XAI-PhD, as evidenced by its remarkable accuracy and F1-score of 99.8% and 99%, respectively. Moreover, XAI-PhD exhibits high computational efficiency, requiring only 1.47 milliseconds and 18.5 microseconds per record to generate accurate predictions. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

10. Machine Learning Enhanced by Feature Engineering for Estimating Snow Water Equivalent.

Author: Čistý, Milan, Danko, Michal, Kohnová, Silvia, Považanová, Barbora, and Trizna, Andrej
Subjects: MACHINE learning, SUPPORT vector machines, RANDOM forest algorithms, MISSING data (Statistics), MACHINE performance
Abstract: This study compares the calculation of snow water equivalent (SWE) using machine learning algorithms with the conventional degree-day method. The study uses machine learning techniques such as LASSO, Random Forest, Support Vector Machines, and CatBoost. It proposes an innovative use of feature engineering (FE) to improve the accuracy and robustness of SWE predictions by machine learning intended for interpolation, extrapolation, or imputation of missing data. The performance of machine learning approaches is evaluated against the traditional degree-day method for predicting SWE. The study emphasizes and demonstrates gains when modeling is enhanced by transforming basic, raw data through feature engineering. The results, verified in a case study from the mountainous region of Slovakia, suggest that machine learning, particularly CatBoost with feature engineering, shows better results in SWE estimation in comparison with the degree-day method, although the authors present a refined application of the degree-day method by utilizing genetic algorithms. Nevertheless, the study finds that the degree-day method achieved accuracy with a Nash–Sutcliffe coefficient of efficiency NSE = 0.59, while the CatBoost technique enhanced with the proposed FE achieved an accuracy NSE = 0.86. The results of this research contribute to refining snow hydrology modeling and optimizing SWE prediction for improved decision-making in snow-dominated regions. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
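
Note on the metric in record 10: the abstract reports accuracy as a Nash–Sutcliffe coefficient of efficiency (NSE). The sketch below shows how that coefficient is conventionally computed. It is not the authors' code. The function name and the sample values are hypothetical and chosen only for illustration.

```python
def nash_sutcliffe(observed, simulated):
    """Return the Nash-Sutcliffe efficiency of simulated vs. observed values.

    NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
    A value of 1 means a perfect fit; 0 means the model is no better
    than always predicting the mean of the observations.
    """
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("observed and simulated must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    # Squared error of the model against the observations.
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Squared deviation of the observations about their own mean.
    sst = sum((o - mean_obs) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("NSE is undefined when all observed values are equal")
    return 1.0 - sse / sst


# Illustrative values only (not taken from the study).
perfect = nash_sutcliffe([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = nash_sutcliffe([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under this definition, the reported NSE = 0.86 for CatBoost with feature engineering is closer to a perfect fit than the NSE = 0.59 reported for the degree-day method.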

11. Predicting Employee Absence from Historical Absence Profiles with Machine Learning.

Author: Zupančič, Peter and Panov, Panče
Subjects: PERSONNEL management, JOB absenteeism, TECHNOLOGICAL innovations, SICK leave, MACHINE learning
Abstract: In today's dynamic business world, organizations are increasingly relying on innovative technologies to improve the efficiency and effectiveness of their human resource (HR) management. Our study uses historical time and attendance data collected with the MojeUre time and attendance system to predict employee absenteeism, including sick and vacation leave, using machine learning methods. We integrate employee demographic data and the absence profiles on timesheets showing daily attendance patterns as fundamental elements for our analysis. We also convert the absence data into a feature-based format suitable for the machine learning methods used. Our primary goal in this paper is to evaluate how well we can predict sick leave and vacation leave over short- and long-term intervals using tree-based machine learning methods based on the predictive clustering paradigm. This paper compares the effectiveness of these methods in different learning settings and discusses their impact on improving HR decision-making processes. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

12. FN-GNN: A Novel Graph Embedding Approach for Enhancing Graph Neural Networks in Network Intrusion Detection Systems.

Author: Tran, Dinh-Hau and Park, Minho
Subjects: ARTIFICIAL neural networks, GRAPH neural networks, RECURRENT neural networks, CONVOLUTIONAL neural networks, DEEP learning, INTRUSION detection systems (Computer security)
Abstract: With the proliferation of the Internet, network complexities for both commercial and state organizations have significantly increased, leading to more sophisticated and harder-to-detect network attacks. This evolution poses substantial challenges for intrusion detection systems, threatening the cybersecurity of organizations and national infrastructure alike. Although numerous deep learning techniques such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and graph neural networks (GNNs) have been applied to detect various network attacks, they face limitations due to the lack of standardized input data, affecting model accuracy and performance. This paper proposes a novel preprocessing method for flow data from network intrusion detection systems (NIDSs), enhancing the efficacy of a graph neural network model in malicious flow detection. Our approach initializes graph nodes with data derived from flow features and constructs graph edges through the analysis of IP relationships within the system. Additionally, we propose a new graph model based on the combination of the graph neural network (GCN) model and SAGEConv, a variant of the GraphSAGE model. The proposed model leverages the strengths while addressing the limitations encountered by the previous models. Evaluations on two IDS datasets, CICIDS-2017 and UNSW-NB15, demonstrate that our model outperforms existing methods, offering a significant advancement in the detection of network threats. This work not only addresses a critical gap in the standardization of input data for deep learning models in cybersecurity but also proposes a scalable solution for improving the intrusion detection accuracy. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

13. Toward compound fault diagnosis via EMAGAN and large kernel augmented few-shot learning.

Author: Wenchang Xu, Zhexian Zhang, Zhijun Wang, Tianao Wang, Zijian He, and Shijie Dong
Subjects: FAULT diagnosis, INDUSTRIAL safety, ARTIFICIAL intelligence, INDUSTRIAL sites, INDUSTRIAL equipment
Abstract: Bearings are essential in machinery. Damage to them can cause financial losses and safety risks at industrial sites. Therefore, it is necessary to design an accurate diagnostic model. Although many bearing fault diagnosis methods have been proposed recently, they still cannot meet the requirements of high-accurate prediction of bearing faults. There are several challenges in this: 1) In practical settings, gathering sufficient and balanced sample data for training diagnostic network models proves challenging. 2) The damage to bearings in real industrial production sites is not singular, and compound faults are also a huge challenge for diagnostic networks. To address these issues, this study introduces a novel fault diagnosis model called EMALKNet that integrates DCGAN with Efficient Multi-Scale Attention (EMAGAN) and RepLKNet-XL, enhancing the detection and analysis of bearing faults in industrial machinery. This model employs EMAGAN to explore the underlying distribution of raw data, thereby enlarging the fault sample pool and enhancing the model's diagnostic capabilities; The large kernel structure of RepLKNet-XL is different from the current mainstream small kernel and has stronger representation extraction ability. The proposed method has been validated on the Paderborn University dataset and the Huazhong University of Science and Technology dataset. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

14. Radio Signal Modulation Recognition Method Based on Hybrid Feature and Ensemble Learning: For Radar and Jamming Signals.

Author: Zhou, Yu, Cao, Ronggang, Zhang, Anqi, and Li, Ping
Subjects: *MACHINE learning, *ELECTRONIC modulation, *RANDOM forest algorithms, *SIGNAL classification, *FRACTAL dimensions, *RADAR interference
Abstract: The detection performance of radar is significantly impaired by active jamming and mutual interference from other radars. This paper proposes a radio signal modulation recognition method to accurately recognize these signals, which helps in the jamming cancellation decisions. Based on the ensemble learning stacking algorithm improved by meta-feature enhancement, the proposed method adopts random forests, K-nearest neighbors, and Gaussian naive Bayes as the base-learners, with logistic regression serving as the meta-learner. It takes the multi-domain features of signals as input, which include time-domain features including fuzzy entropy, slope entropy, and Hjorth parameters; frequency-domain features, including spectral entropy; and fractal-domain features, including fractal dimension. The simulation experiment, including seven common signal types of radar and active jamming, was performed for the effectiveness validation and performance evaluation. Results proved the proposed method's performance superiority to other classification methods, as well as its ability to meet the requirements of low signal-to-noise ratio and few-shot learning. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

15. Online Handwriting Recognition Method with a Non-Inertial Reference Frame Based on the Measurement of Linear Accelerations and Differential Geometry: An Alternative to Quaternions.

Author: Abarca Jiménez, Griselda Stephany, Muñoz Garnica, Carmen Caritina, Reyes Barranca, Mario Alfredo, Mares Carreño, Jesús, Vega Blanco, Manuel Vladimir, and Gutiérrez Galicia, Francisco
Subjects: LINEAR acceleration, ACCELERATION measurements, CELL phones, LENGTH measurement, QUATERNIONS
Abstract: This work describes a mathematical model for handwriting devices without a specific reference surface (SRS). The research was carried out on two hypotheses: the first considers possible circular segments that could be made during execution for the reconstruction of the trace, and the second is the combination of lines and circles. The proposed system has no flat reference surface, since the sensor is inside the pencil that describes the trace, not on the surface as in tablets or cell phones. An inertial sensor was used for the measurements, in this case, a commercial Micro-Electro Mechanical sensor of linear acceleration. The tracking device is an IMU sensor and a processing card that allows inertial measurements of the pen during on-the-fly tracing. It is essential to highlight that the system has a non-inertial reference frame. Comparing the two proposed models shows that it is possible to construct shapes from curved lines and that the patterns obtained are similar to what is recognized; this method provides an alternative to quaternion calculus for poorly specified orientation problems. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

16. Feature-based detection of breast cancer using convolutional neural network and feature engineering

Author: Hiba Allah Essa, Ebrahim Ismaiel, and Mhd Firas Al Hinnawi
Subjects: Breast cancer, Biomarkers, Convolutional neural networks, Feature engineering, Gaussian distribution, Medicine, Science
Abstract: Breast cancer (BC) is a prominent cause of female mortality on a global scale. Recently, there has been growing interest in utilizing blood and tissue-based biomarkers to detect and diagnose BC, as this method offers a non-invasive approach. To improve the classification and prediction of BC using large biomarker datasets, several machine-learning techniques have been proposed. In this paper, we present a multi-stage approach that consists of computing new features and then sorting them into an input image for the ResNet50 neural network. The method involves transforming the original values into normalized values based on their membership in the Gaussian distribution of healthy and BC samples of each feature. To test the effectiveness of our proposed approach, we employed the Coimbra and Wisconsin datasets. The results demonstrate efficient performance improvement, with an accuracy of 100% and 100% using the Coimbra and Wisconsin datasets, respectively. Furthermore, the comparison with existing literature validates the reliability and effectiveness of our methodology, where the normalized value can reduce the misclassified samples of ML techniques because of its generality.
Published: 2024
Full Text: View/download PDF

17. REVOLUTIONIZING FEATURE ENGINEERING FOR ROBUST ENSEMBLE MACHINE LEARNING BY HYBRIDIZING MRMR INSIGHT AND CHI2 INDEPENDENCE

Author: Silpa N, Sangram Keshari Swain, and Maheswara Rao V V R
Subjects: feature engineering, minimum redundancy maximum relevance, chi square, ensemble machine learning, incremental feature selection, Engineering (General). Civil engineering (General), TA1-2040
Abstract: In the realm of data science, dealing with real-world datasets often presents a formidable challenge, primarily due to the sheer volume of features that significantly lack relevance or may be redundant. Effective feature engineering is vital in constructing robust ensemble ML models, where the choice of input features influences overall performance. Towards this, the present research presents a novel framework to feature engineering by hybridizing the MRMR insights and Chi2 independence techniques. MRMR emphasizes feature relevance and non-redundancy, while Chi2 quantifies the independence of features from the target variable. The hybrid framework adheres to the incremental feature engineering approach, with the goal of improving predictive accuracy, model robustness, and adaptability. Through extensive experimentation on employed water quality dataset, the framework illustrates the superiority of hybrid model over using MRMR and Chi2 independently. The results of the proposed HFE-EML exhibit substantial improvements, reaching approximately 99.10% in ensemble machine learning models' performance, reduced overfitting, and enhanced generalization.
Published: 2024
Full Text: View/download PDF

18. Prediction of surface roughness using deep learning and data augmentation

Author: Miaoxian Guo, Shouheng Wei, Chentong Han, Wanliang Xia, Chao Luo, and Zhijian Lin
Subjects: Multi-sensor fusion, Surface quality, Digital signal processing, Feature engineering, Neural network, Parameter optimization, Manufactures, TS1-2301
Abstract: Purpose – Surface roughness has a serious impact on the fatigue strength, wear resistance and life of mechanical products. Realizing the evolution of surface quality through theoretical modeling takes a lot of effort. To predict the surface roughness of milling processing, this paper aims to construct a neural network based on deep learning and data augmentation. Design/methodology/approach – This study proposes a method consisting of three steps. Firstly, the machine tool multisource data acquisition platform is established, which combines sensor monitoring with machine tool communication to collect processing signals. Secondly, the feature parameters are extracted to reduce the interference and improve the model generalization ability. Thirdly, for different expectations, the parameters of the deep belief network (DBN) model are optimized by the tent-SSA algorithm to achieve more accurate roughness classification and regression prediction. Findings – The adaptive synthetic sampling (ADASYN) algorithm can improve the classification prediction accuracy of DBN from 80.67% to 94.23%. After the DBN parameters were optimized by Tent-SSA, the roughness prediction accuracy was significantly improved. For the classification model, the prediction accuracy is improved by 5.77% based on ADASYN optimization. For regression models, different objective functions can be set according to production requirements, such as root-mean-square error (RMSE) or MaxAE, and the error is reduced by more than 40% compared to the original model. Originality/value – A roughness prediction model based on multiple monitoring signals is proposed, which reduces the dependence on the acquisition of environmental variables and enhances the model's applicability. Furthermore, with the ADASYN algorithm, the Tent-SSA intelligent optimization algorithm is introduced to optimize the hyperparameters of the DBN model and improve the optimization performance.
Published: 2024
Full Text: View/download PDF

19. Benchmarking computational methods for single-cell chromatin data analysis

Author: Siyuan Luo, Pierre-Luc Germain, Mark D. Robinson, and Ferdinand von Meyenn
Subjects: Benchmark, ScATAC-seq, Clustering, Feature engineering, Dimensional reduction, Biology (General), QH301-705.5, Genetics, QH426-470
Abstract: Background – Single-cell chromatin accessibility assays, such as scATAC-seq, are increasingly employed in individual and joint multi-omic profiling of single cells. As the accumulation of scATAC-seq and multi-omics datasets continues, challenges in analyzing such sparse, noisy, and high-dimensional data become pressing. Specifically, one challenge relates to optimizing the processing of chromatin-level measurements and efficiently extracting information to discern cellular heterogeneity. This is of critical importance, since the identification of cell types is a fundamental step in current single-cell data analysis practices. Results – We benchmark 8 feature engineering pipelines derived from 5 recent methods to assess their ability to discover and discriminate cell types. By using 10 metrics calculated at the cell embedding, shared nearest neighbor graph, or partition levels, we evaluate the performance of each method at different data processing stages. This comprehensive approach allows us to thoroughly understand the strengths and weaknesses of each method and the influence of parameter selection. Conclusions – Our analysis provides guidelines for choosing analysis methods for different datasets. Overall, feature aggregation, SnapATAC, and SnapATAC2 outperform latent semantic indexing-based methods. For datasets with complex cell-type structures, SnapATAC and SnapATAC2 are preferred. With large datasets, SnapATAC2 and ArchR are most scalable.
Published: 2024
Full Text: View/download PDF

20. Application of artificial intelligence for feature engineering in education sector and learning science

Author: Chao Wang, Tao Li, Zhicui Lu, Zhenqiang Wang, Tmader Alballa, Somayah Abdualziz Alhabeeb, Maryam Sulaiman Albely, and Hamiden Abd El-Wahed Khalifa
Subjects: Artificial Intelligence, Feature Engineering, Education Sector, Machine Learning, Forecasting, Data set, Engineering (General). Civil engineering (General), TA1-2040
Abstract: This study investigates the utilization of artificial intelligence (AI) for feature engineering in the education sector, highlighting its potential to enhance individualized learning and improve academic outcomes. The correlation analysis, performed using a correlation matrix of the feature set, indicated that specific pairings of characteristics exhibit a strong association, resulting in the ineffectiveness of conventional models. In order to tackle this issue, we utilized three sophisticated machine learning methodologies: Adaptive Lasso (ALasso), Artificial Neural Networks (ANN), and Support Vector Regression (SVR). The ALasso model discovered several influential characteristics, namely Gender (X5), Education (X1), Hours of Work (X4), and Marital Status (X6), that significantly affect salaries. Subsequently, a comparative evaluation of these methods was conducted using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The results demonstrated that SVR outperformed the other techniques, with the most optimal RMSE of 0.595 and MAE of 0.423. These findings emphasize the significance of using data-driven strategies in policymaking and propose further investigation into the use of AI methods in various educational contexts to improve the identification of features and the performance of models.
Published: 2025
Full Text: View/download PDF

21. Research on urban water demand prediction based on machine learning and feature engineering

Author: Dongfei Yan, Yi Tao, Jianqi Zhang, and Huijia Yang
Subjects: data cleaning, data interpolation, feature engineering, machine learning, water demand prediction, water supply system, Water supply for domestic and industrial purposes, TD201-500, River, lake, and water-supply engineering (General), TC401-506
Abstract: Urban water demand prediction is not only the foundation of water resource planning and management, but also an important component of water supply system optimization and scheduling. Therefore, predicting future water demand is of great significance. For univariate time series data, the issue of outliers can be solved through data preprocessing. Then, the data input dimension is increased through feature engineering, and finally, the LightGBM (Light Gradient Boosting Machine) model is used to predict future water demand. The results demonstrate that cubic polynomial interpolation outperforms the Prophet model and the linear method in the context of missing value interpolation tasks. In terms of predicting water demand, the LightGBM model demonstrates excellent forecasting performance and can effectively predict future water demand trends. The evaluation indicators MAPE (mean absolute percentage error) and NSE (Nash–Sutcliffe efficiency coefficient) on the test dataset are 4.28% and 0.94, respectively. These indicators can provide a scientific basis for short-term prediction of water supply enterprises. Highlights: Interpolation of raw training data may not necessarily improve the performance of predictive models. Accurate prediction of univariate data can be achieved through feature engineering and machine learning.
Published: 2024
Full Text: View/download PDF

22. A semantic-based model with a hybrid feature engineering process for accurate spam detection

Author: Chira N. Mohammed and Ayah M. Ahmed
Subjects: Spam detection, Feature engineering, TF-IDF, Word embeddings, Feature selection, SVM, Electrical engineering. Electronics. Nuclear engineering, TK1-9971, Information technology, T58.5-58.64
Abstract: Detecting spam emails is essential to maintaining the security and integrity of email communication. Existing research has made significant progress in developing effective spam detection models, but challenges remain in improving classification performance and adaptability to evolving spamming techniques. In this study, we propose a novel spam detection model with a comprehensive feature engineering approach that combines term frequency-inverse document frequency (TF-IDF) vectorizer and word embedding features to optimize the feature space. Our contribution lies in integrating semantic-based word embeddings, leveraging pre-existing knowledge to capture the semantic meaning of words and enhance the representation of email texts. To identify the most suitable word embedding technique for our model, we evaluated GloVe, Word2Vec, and FastText. GloVe was selected for its better performance, which is the result of its pre-training on a large and diverse text corpus. Furthermore, the model was evaluated without word embeddings, which did not exhibit the same effectiveness level as our word embedding-based model. Additionally, we utilized the support vector machine as a classifier and hyperparameter tuning technique to identify our model’s most effective parameter values. The proposed model was tested on two datasets. The experimental results showed that our model outperformed the other models discussed in the literature, achieving an accuracy of 99.5% on the SpamAssassin dataset, and 99.28% on the Enron-Spam dataset.
Published: 2024
Full Text: View/download PDF

23. Reading Between the Lines: Machine Learning Ensemble and Deep Learning for Implied Threat Detection in Textual Data

Author: Muhammad Owais Raza, Areej Fatemah Meghji, Naeem Ahmed Mahoto, Mana Saleh Al Reshan, Hamad Ali Abosaq, Adel Sulaiman, and Asadullah Shaikh
Subjects: Implied threat detection, Text mining, Linguistic feature analysis, Feature engineering, Social media, Text classification, Electronic computers. Computer science, QA75.5-76.95
Abstract: With the increase in the generation and spread of textual content on social media, natural language processing (NLP) has become an important area of research for detecting underlying threats, racial abuse, violence, and implied warnings in the content. The subtlety and ambiguity of language make the development of effective models for detecting threats in text a challenging task. This task is further complicated when the threat is not explicitly conveyed. This study focuses on the task of implied threat detection using an explicitly designed machine-generated dataset with both linguistic and lexical features. We evaluated the performance of different machine learning algorithms on these features including Support Vector Machines, Logistic Regression, Naive Bayes, Decision Tree, and K-nearest neighbors. The ensembling approaches of Adaboost, Random Forest, and Gradient Boosting were also explored. Deep learning modeling was performed using Long Short-Term Memory, Deep Neural Networks (DNN), and Bidirectional Long Short-Term Memory (BiLSTM). Based on the evaluation, it was observed that classical and ensemble models overfit while working with linguistic features. The performance of these models improved when working with lexical features. The model based on logistic regression exhibited superior performance with an F1 score of 77.13%. While experimenting with deep learning models, DNN achieved an F1 score of 91.49% while the BiLSTM achieved an F1 score of 91.61% while working with lexical features. The current study provides a baseline for future research in the domain of implied threat detection.
Published: 2024
Full Text: View/download PDF

24. DREAMER: a computational framework to evaluate readiness of datasets for machine learning

Author: Meysam Ahangaran, Hanzhi Zhu, Ruihui Li, Lingkai Yin, Joseph Jang, Arnav P. Chaudhry, Lindsay A. Farrer, Rhoda Au, and Vijaya B. Kolachalama
Subjects: Machine learning, Data readiness, Data quality measure, Feature engineering, Computer applications to medicine. Medical informatics, R858-859.7
Abstract: Background: Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community. Results: The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies. Conclusion: Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.
Published: 2024
Full Text: View/download PDF

25. An optimized diabetes mellitus detection model for improved prediction of accuracy and clinical decision-making

Author: Turke Althobaiti, Saad Althobaiti, and Mahmoud M. Selim
Subjects: Diabetes mellitus, GBM-DRU network, Feature engineering, Ensemble learning, Clinical decision-making, Engineering (General). Civil engineering (General), TA1-2040
Abstract: Diabetes Mellitus (DM) is an enduring metabolic illness that disturbs many individuals globally. This study addresses the global impact of Diabetes Mellitus (DM) and emphasizes the critical role of accurate DM detection in early diagnosis, effective treatment, and prevention of complications. The research introduces an optimized DM detection model, the GBM-DRU (Gradient Boosting Machine - Data Reduction Unit) network, which integrates feature engineering and ensemble learning techniques to enhance prediction accuracy and support clinical decision-making. The GBM-DRU network combines the powerful gradient boosting machine algorithm with a data reduction unit (DRU) for efficient feature selection, reducing dimensionality and improving computational efficiency. Feature engineering enhances discriminatory power, while ensemble learning methods, including bagging and boosting, improve overall model performance. Rigorous experiments on a comprehensive dataset of DM patients demonstrate that the proposed approach outperforms existing models in terms of accuracy, sensitivity, specificity, and AUC-ROC. The optimized model provides valuable insights into feature importance, aiding clinical decision-making and deepening the understanding of DM risk factors. Therefore, the GBM-DRU network, utilizing feature engineering and ensemble learning, presents a viable approach to precise diagnosis of diabetes mellitus, with favorable implications for patient outcomes, disease control, and public health campaigns. The improved prediction accuracy, feature interpretability, and clinical decision support capabilities of the model may have a beneficial effect on public health campaigns, disease management, and patient outcomes.
Published: 2024
Full Text: View/download PDF

26. An Automated Machine Learning Framework for Adaptive and Optimized Hyperspectral-Based Land Cover and Land-Use Segmentation.

Author: Vali, Ava, Comai, Sara, and Matteucci, Matteo
Subjects: *ENGINEERING models, *DEEP learning, *LAND cover, *REMOTE sensing, *WORKFLOW
Abstract: Hyperspectral imaging holds significant promise in remote sensing applications, particularly for land cover and land-use classification, thanks to its ability to capture rich spectral information. However, leveraging hyperspectral data for accurate segmentation poses critical challenges, including the curse of dimensionality and the scarcity of ground truth data, that hinder the accuracy and efficiency of machine learning approaches. This paper presents a holistic approach for adaptive optimized hyperspectral-based land cover and land-use segmentation using automated machine learning (AutoML). We address the challenges of high-dimensional hyperspectral data through a revamped machine learning pipeline, thus emphasizing feature engineering tailored to hyperspectral classification tasks. We propose a framework that dissects feature engineering into distinct steps, thus allowing for comprehensive model generation and optimization. This framework incorporates AutoML techniques to streamline model selection, hyperparameter tuning, and data versioning, thus ensuring robust and reliable segmentation results. Our empirical investigation demonstrates the efficacy of our approach in automating feature engineering and optimizing model performance, even without extensive ground truth data. By integrating automatic optimization strategies into the segmentation workflow, our approach offers a systematic, efficient, and scalable solution for hyperspectral-based land cover and land-use classification. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

27. A semantic-based model with a hybrid feature engineering process for accurate spam detection.

Author: Mohammed, Chira N. and Ahmed, Ayah M.
Subjects: SPAM email, SUPPORT vector machines, EMAIL security
Abstract: Detecting spam emails is essential to maintaining the security and integrity of email communication. Existing research has made significant progress in developing effective spam detection models, but challenges remain in improving classification performance and adaptability to evolving spamming techniques. In this study, we propose a novel spam detection model with a comprehensive feature engineering approach that combines term frequency-inverse document frequency (TF-IDF) vectorizer and word embedding features to optimize the feature space. Our contribution lies in integrating semantic-based word embeddings, leveraging pre-existing knowledge to capture the semantic meaning of words and enhance the representation of email texts. To identify the most suitable word embedding technique for our model, we evaluated GloVe, Word2Vec, and FastText. GloVe was selected for its better performance, which is the result of its pre-training on a large and diverse text corpus. Furthermore, the model was evaluated without word embeddings, which did not exhibit the same effectiveness level as our word embedding-based model. Additionally, we utilized the support vector machine as a classifier and hyperparameter tuning technique to identify our model's most effective parameter values. The proposed model was tested on two datasets. The experimental results showed that our model outperformed the other models discussed in the literature, achieving an accuracy of 99.5% on the SpamAssassin dataset, and 99.28% on the Enron-Spam dataset. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

28. Optimizing Tourism Accommodation Offers by Integrating Language Models and Knowledge Graph Technologies.

Author: Cadeddu, Andrea, Chessa, Alessandro, De Leo, Vincenzo, Fenu, Gianni, Motta, Enrico, Osborne, Francesco, Reforgiato Recupero, Diego, Salatino, Angelo, and Secchi, Luca
Subjects: *LANGUAGE models, *NATURAL language processing, *KNOWLEDGE graphs, *CLASSIFICATION, *LANGUAGE acquisition
Abstract: Online platforms have become the primary means for travellers to search, compare, and book accommodations for their trips. Consequently, online platforms and revenue managers must acquire a comprehensive understanding of these dynamics to formulate competitive and appealing offerings. Recent advancements in natural language processing, specifically through the development of large language models, have demonstrated significant progress in capturing the intricate nuances of human language. On the other hand, knowledge graphs have emerged as potent instruments for representing and organizing structured information. Nevertheless, effectively integrating these two powerful technologies remains an ongoing challenge. This paper presents an innovative deep learning methodology that combines large language models with domain-specific knowledge graphs for classification of tourism offers. The main objective of our system is to assist revenue managers in the following two fundamental dimensions: (i) comprehending the market positioning of their accommodation offerings, taking into consideration factors such as accommodation price and availability, together with user reviews and demand, and (ii) optimizing presentations and characteristics of the offerings themselves, with the intention of improving their overall appeal. For this purpose, we developed a domain knowledge graph covering a variety of information about accommodations and implemented targeted feature engineering techniques to enhance the information representation within a large language model. To evaluate the effectiveness of our approach, we conducted a comparative analysis against alternative methods on four datasets about accommodation offers in London. The proposed solution obtained excellent results, significantly outperforming alternative methods. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

29. A Novel Artificial Intelligence Prediction Process of Concrete Dam Deformation Based on a Stacking Model Fusion Method.

Author: Wu, Wenyuan, Su, Huaizhi, Feng, Yanming, Zhang, Shuai, Zheng, Sen, Cao, Wenhan, and Liu, Hongchen
Subjects: CONCRETE dams, ARTIFICIAL intelligence, HYDRAULIC structures, DEFORMATIONS (Mechanics), MULTIPLE intelligences, COMPOSITE columns, MACHINE learning
Abstract: Deformation effectively represents the structural integrity of concrete dams and acts as a clear indicator of their operational performance. Predicting deformation is critical for monitoring the safety of hydraulic structures. To this end, this paper proposes an artificial intelligence-based process for predicting concrete dam deformation. Initially, using the principles of feature engineering, the preprocessing of deformation safety monitoring data is conducted. Subsequently, employing a stacking model fusion method, a novel prediction process embedded with multiple artificial intelligence algorithms is developed. Moreover, three new performance indicators—a superiority evaluation indicator, an accuracy evaluation indicator, and a generalization evaluation indicator—are introduced to provide a comprehensive assessment of the model's effectiveness. Finally, an engineering example demonstrates that the ensemble artificial intelligence method proposed herein outperforms traditional statistical models and single machine learning models in both fitting and predictive accuracy, thereby providing a scientific and effective foundation for concrete dam deformation prediction and safety monitoring. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

30. Investigation of the Electrical Impedance Signal Behavior in Rolling Element Bearings as a New Approach for Damage Detection.

Author: Becker-Dombrowsky, Florian Michael, Schink, Johanna, Frischmuth, Julian, and Kirchner, Eckhard
Subjects: ROLLER bearings, ELECTRIC impedance, ALTERNATING currents, FATIGUE testing machines, ANGLES
Abstract: The opportunities of impedance-based condition monitoring for rolling bearings have been shown earlier by the authors: Changes in the impedance signal and the derived features enable the detection of pitting damages. Localizing and measuring the pitting length in the raceway direction is possible. Furthermore, the changes in feature behavior are physically explainable. These investigations were focused on a single bearing type and only one load condition. Different bearing types and load angles were not considered yet. Thus, the impedance signals and their features of different bearing types under different load angles are investigated and compared. The signals are generated in fatigue tests on a rolling bearing test rig with conventional integrated vibration analysis based on structure-borne sound. The rolling bearing impedance is gauged using an alternating current measurement bridge. Significant changes in the vibration signals mark the end of the fatigue tests. Therefore, the response time of the impedance can be compared to the vibration signal response time. It can be shown that the rolling bearing impedance is an instrument for condition monitoring, independently from the bearing type. In case of pure radial loads, explicit changes in the impedance signal are detectable, which indicate a pitting damage. Under combined loads, the signal changes are detectable as well, but not as significant as under radial load. Damage-indicating signal changes occur later compared to pure radial loads, but nevertheless enable an early detection. Therefore, the rolling bearing impedance is an instrument for pitting damage detection, independently from bearing type and load angle. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

31. Enhancing Stock Price Prediction Accuracy Through Ensemble Learning Strategies: A Comparative Study.

Author: Nagar, Shashikant and Mathur, Kirti
Subjects: LEARNING strategies, MARKET volatility, RANDOM forest algorithms, ECONOMIC indicators, FINANCIAL markets
Abstract: This research explores the effectiveness of ensemble learning techniques, including Random Forest, Gradient Boosting, and Stacking, in improving the accuracy and reliability of stock price predictions. Leveraging a dataset of daily trading data for Amicorp Inc. spanning three years, we conducted a comprehensive analysis to investigate the impact of feature engineering, model performance, robustness, and adaptability to varying market conditions. Our findings reveal that feature engineering significantly enhances model performance, with models incorporating additional financial indicators consistently outperforming those without. Among the ensemble methods evaluated, the Random Forest ensemble emerged as the top performer, demonstrating its superiority with the lowest prediction errors. Furthermore, the model displayed robustness in volatile market conditions and resistance to outliers. Market regime analysis highlighted the adaptability of ensemble methods, with consistent performance across bull, bear, and sideways markets. Practical implications were exemplified through a strategic trading strategy based on Random Forest predictions, achieving favorable risk-adjusted returns. These results contribute valuable insights to researchers and practitioners seeking to employ ensemble learning in stock price prediction, underlining its potential for enhancing forecasting accuracy in real-world financial markets. [ABSTRACT FROM AUTHOR]
Published: 2024

32. Effective interpretable learning for large-scale categorical data.

Author: Zhang, Yishuo, Zaidi, Nayyar, Zhou, Jiahui, Wang, Tao, and Li, Gang
Subjects: MACHINE learning, ARTIFICIAL neural networks, ENGINEERING models, BAYESIAN analysis, SKEWNESS (Probability theory), DEEP learning
Abstract: Large-scale categorical datasets are ubiquitous in machine learning, and the success of most deployed machine learning models relies on how effectively the features are engineered. For large-scale datasets, parametric methods are generally used, among which three strategies for feature engineering are quite common. The first strategy focuses on managing the breadth (or width) of a network, e.g., generalized linear models (a.k.a. wide learning). The second strategy focuses on the depth of a network, e.g., Artificial Neural Networks or ANN (a.k.a. deep learning). The third strategy relies on factorizing the interaction terms, e.g., Factorization Machines (a.k.a. factorized learning). Each of these strategies brings its own advantages and disadvantages. Recently, it has been shown that for categorical data, a combination of various strategies leads to excellent results. For example, WD-Learning, xdeepFM, etc., lead to state-of-the-art results. Following the trend, in this work, we have proposed another learning framework, WBDF-Learning, based on the combination of wide, deep, factorization, and a newly introduced component named Broad Interaction network (BIN). BIN is in the form of a Bayesian network classifier whose structure is learned a priori, and parameters are learned by optimizing a joint objective function along with wide, deep and factorized parts. We denote the learning of BIN parameters as broad learning. Additionally, the parameters of BIN are constrained to be actual probabilities; therefore, it is extremely interpretable. Furthermore, one can sample or generate data from BIN, which can facilitate learning and provides a framework for knowledge-guided machine learning. We demonstrate that our proposed framework possesses the resilience to maintain excellent classification performance when confronted with biased datasets. We evaluate the efficacy of our framework in terms of classification performance on various benchmark large-scale categorical datasets and compare against state-of-the-art methods. It is shown that the WBDF framework (a) exhibits superior performance on classification tasks, (b) boasts outstanding interpretability and (c) demonstrates exceptional resilience and effectiveness in scenarios involving skewed distributions. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

33. Improving the Automatic Detection of Dropout Risk in Middle and High School Students: A Comparative Study of Feature Selection Techniques.

Author: Zapata-Medina, Daniel, Espinosa-Bedoya, Albeiro, and Jiménez-Builes, Jovani Alberto
Subjects: *FEATURE selection, *MIDDLE school students, *HIGH school students, *METAHEURISTIC algorithms, *SCHOOL dropouts, *MACHINE learning
Abstract: The dropout rate in underdeveloped and emerging countries is a pressing social issue, as highlighted by studies conducted by The Organization for Economic Co-operation and Development. This study compares five feature selection techniques to address this challenge and improve the automatic detection of dropout risk. The methodological design involves three distinct phases: data preparation, feature selection, and model evaluation utilizing machine learning algorithms. The results demonstrate that (1) the top features identified by feature selection techniques, i.e., those constructed through feature engineering, proved to be among the most effective in classifying student dropout; (2) the F-score of the best model increased by 5% with feature selection techniques; and (3) depending on the type of feature selection, the performance of the machine learning algorithm can vary, potentially increasing or decreasing based on the sensitivity of features with higher noise. At the same time, metaheuristic algorithms demonstrated significant precision improvements, but there was a risk of increasing errors and reducing recall. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

34. Deciphering the microbial landscape of lower respiratory tract infections: insights from metagenomics and machine learning.

Author: Jiahuan Li, Anying Xiong, Junyi Wang, Xue Wu, Lingling Bai, Lei Zhang, Xiang He, and Guoping Li
Subjects: RESPIRATORY infections, MACHINE learning, ACINETOBACTER baumannii, CANDIDA albicans, ASPERGILLUS, METAGENOMICS, METHYLOBACTERIUM
Abstract: Background: Lower respiratory tract infections represent prevalent ailments. Nonetheless, current comprehension of the microbial ecosystems within the lower respiratory tract remains incomplete and necessitates further comprehensive assessment. Leveraging the advancements in metagenomic next-generation sequencing (mNGS) technology alongside the emergence of machine learning, it is now viable to compare the attributes of lower respiratory tract microbial communities among patients across diverse age groups, diseases, and infection types. Method: We collected bronchoalveolar lavage fluid samples from 138 patients diagnosed with lower respiratory tract infections and conducted mNGS to characterize the lung microbiota. Employing various machine learning algorithms, we investigated the correlation of key bacteria in patients with concurrent bronchiectasis and developed a predictive model for hospitalization duration based on these identified key bacteria. Result: We observed variations in microbial communities across different age groups, diseases, and infection types. In the elderly group, Pseudomonas aeruginosa exhibited the highest relative abundance, followed by Corynebacterium striatum and Acinetobacter baumannii. Methylobacterium and Prevotella emerged as the dominant genera at the genus level in the younger group, while Mycobacterium tuberculosis and Haemophilus influenzae were prevalent species. Within the bronchiectasis group, dominant bacteria included Pseudomonas aeruginosa, Haemophilus influenzae, and Klebsiella pneumoniae. Significant differences in the presence of Pseudomonas phage JBD93 were noted between the bronchiectasis group and the control group. In the group with concomitant fungal infections, the most abundant genera were Acinetobacter and Pseudomonas, with Acinetobacter baumannii and Pseudomonas aeruginosa as the predominant species. Notable differences were observed in the presence of Human gammaherpesvirus 4, Human betaherpesvirus 5, Candida albicans, Aspergillus oryzae, and Aspergillus fumigatus between the group with concomitant fungal infections and the bacterial group. Machine learning algorithms were utilized to select bacteria and clinical indicators associated with hospitalization duration, confirming the excellent performance of bacteria in predicting hospitalization time. Conclusion: Our study provided a comprehensive description of the microbial characteristics among patients with lower respiratory tract infections, offering insights from various perspectives. Additionally, we investigated the advanced predictive capability of microbial community features in determining the hospitalization duration of these patients. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

35. Wearable-Based Integrated System for In-Home Monitoring and Analysis of Nocturnal Enuresis.

Author: Lee, Sangyeop, Moon, Junhyung, Lee, Yong Seung, Shin, Seung-chul, and Lee, Kyoungwoo
Subjects: *ENURESIS, *PATIENTS' families, *PATIENT monitoring, *HEART beat, *MACHINE learning, *DEEP learning
Abstract: Nocturnal enuresis (NE) is involuntary bedwetting during sleep, typically appearing in young children. Despite the potential benefits of the long-term home monitoring of NE patients for research and treatment enhancement, this area remains underexplored. To address this, we propose NEcare, an in-home monitoring system that utilizes wearable devices and machine learning techniques. NEcare collects sensor data from an electrocardiogram, body impedance (BI), a three-axis accelerometer, and a three-axis gyroscope to examine bladder volume (BV), heart rate (HR), and periodic limb movements in sleep (PLMS). Additionally, it analyzes the collected NE patient data and supports NE moment estimation using heuristic rules and deep learning techniques. To demonstrate the feasibility of in-home monitoring for NE patients using our wearable system, we used our datasets from 30 in-hospital patients and 4 in-home patients. The results show that NEcare captures expected trends associated with NE occurrences, including BV increase, HR increase, and PLMS appearance. In addition, we studied the machine learning-based NE moment estimation, which could help relieve the burdens of NE patients and their families. Finally, we address the limitations and outline future research directions for the development of wearable systems for NE patients. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

36. Enhancing Personalized Recommendations: A Study on the Efficacy of Multi-Task Learning and Feature Integration.

Author: Wang, Qinyong, Jin, Enman, Zhang, Huizhong, Chen, Yumeng, Yue, Yinggao, Dorado, Danilo B., Hu, Zhongyi, and Xu, Minghai
Subjects: *RECOMMENDER systems, *STANDARD deviations
Abstract: Personalized recommender systems play a crucial role in assisting users in discovering items of interest from vast amounts of information across various domains. However, developing accurate personalized recommender systems remains challenging due to the need to balance model architectures, input feature combinations, and fusion of heterogeneous data sources. This study investigates the impacts of these factors on recommendation performance using the MovieLens and Book Recommendation datasets. Six models, including single-task neural networks, multi-task learning, and baselines, were evaluated with various input feature combinations using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The multi-task learning approach achieved significantly lower RMSE and MAE by effectively leveraging heterogeneous data sources for personalized recommendations through a shared neural network architecture. Furthermore, incorporating user data and content data progressively enhanced performance compared to using only item identifiers. The findings highlight the importance of advanced model architectures and fusing heterogeneous data sources for high-quality recommendations, providing valuable insights for designing effective recommender systems across diverse domains. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

37. Feature engineering of environmental covariates improves plant genomic-enabled prediction.

Author: Montesinos-López, Osval A., Crespo-Herrera, Leonardo, Saint Pierre, Carolina, Cano-Paez, Bernabe, Huerta-Prado, Gloria Isabel, Mosqueda-González, Brandon Alejandro, Ramos-Pulido, Sofia, Gerard, Guillermo, Alnowibet, Khalid, Fritsche-Neto, Roberto, Montesinos-López, Abelardo, and Crossa, José
Subjects: ENVIRONMENTAL engineering, PREDICTION models, FORECASTING
Abstract: Introduction: Because Genomic selection (GS) is a predictive methodology, it needs to guarantee high-prediction accuracies for practical implementations. However, since many factors affect the prediction performance of this methodology, its practical implementation still needs to be improved in many breeding programs. For this reason, many strategies have been explored to improve the prediction performance of this methodology. Methods: When environmental covariates are incorporated as inputs in the genomic prediction models, this information only sometimes helps increase prediction performance. For this reason, this investigation explores the use of feature engineering on the environmental covariates to enhance the prediction performance of genomic prediction models. Results and discussion: We found that across data sets, feature engineering helps reduce prediction error regarding only the inclusion of the environmental covariates without feature engineering by 761.625% across predictors. These results are very promising regarding the potential of feature engineering to enhance prediction accuracy. However, since a significant gain in prediction accuracy was observed in only some data sets, further research is required to guarantee a robust feature engineering strategy to incorporate the environmental covariates. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

38. A scoping review of machine learning for sepsis prediction- feature engineering strategies and model performance: a step towards explainability.

Author: Bomrah, Sherali, Uddin, Mohy, Upadhyay, Umashankar, Komorowski, Matthieu, Priya, Jyoti, Dhar, Eshita, Hsu, Shih-Chang, and Syed-Abdul, Shabbir
Abstract: Background: Sepsis, an acute and potentially fatal systemic response to infection, significantly impacts global health by affecting millions annually. Prompt identification of sepsis is vital, as treatment delays lead to increased fatalities through progressive organ dysfunction. While recent studies have delved into leveraging Machine Learning (ML) for predicting sepsis, focusing on aspects such as prognosis, diagnosis, and clinical application, there remains a notable deficiency in the discourse regarding feature engineering. Specifically, the role of feature selection and extraction in enhancing model accuracy has been underexplored. Objectives: This scoping review aims to fulfill two primary objectives: To identify pivotal features for predicting sepsis across a variety of ML models, providing valuable insights for future model development, and To assess model efficacy through performance metrics including AUROC, sensitivity, and specificity. Results: The analysis included 29 studies across diverse clinical settings such as Intensive Care Units (ICU), Emergency Departments, and others, encompassing 1,147,202 patients. The review highlighted the diversity in prediction strategies and timeframes. It was found that feature extraction techniques notably outperformed others in terms of sensitivity and AUROC values, thus indicating their critical role in improving sepsis prediction models. Conclusion: Key dynamic indicators, including vital signs and critical laboratory values, are instrumental in the early detection of sepsis. Applying feature selection methods significantly boosts model precision, with models like Random Forest and XG Boost showing promising results. Furthermore, Deep Learning models (DL) reveal unique insights, spotlighting the pivotal role of feature engineering in sepsis prediction, which could greatly benefit clinical practice. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

39. Machine Learning-Based Anomaly Detection for Securing In-Vehicle Networks.

Author: Alfardus, Asma and Rawat, Danda B.
Subjects: ANOMALY detection (Computer security), DEEP learning, ELECTRONIC equipment, ELECTRIC vehicles, MACHINERY, ENGINEERING
Abstract: In-vehicle networks (IVNs) are networks that allow communication between different electronic components in a vehicle, such as infotainment systems, sensors, and control units. As these networks become more complex and interconnected, they become more vulnerable to cyber-attacks that can compromise safety and privacy. Anomaly detection is an important tool for detecting potential threats and preventing cyber-attacks in IVNs. The proposed machine learning-based anomaly detection technique uses deep learning and feature engineering to identify anomalous behavior in real-time. Feature engineering involves selecting and extracting relevant features from the data that are useful for detecting anomalies. Deep learning involves using neural networks to learn complex patterns and relationships in the data. Our experiments show that the proposed technique achieves high accuracy in detecting anomalies and outperforms existing state-of-the-art methods. This technique can be used to enhance the security of IVNs and prevent cyber-attacks that can have serious consequences for drivers and passengers. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

40. An optimized diabetes mellitus detection model for improved prediction of accuracy and clinical decision-making.

Author: Althobaiti, Turke, Althobaiti, Saad, and Selim, Mahmoud M.
Subjects: CLINICAL decision support systems, DIABETES, DECISION making, BOOSTING algorithms, PREDICTION models, FEATURE selection, STOCHASTIC learning models
Abstract: Diabetes Mellitus (DM) is an enduring metabolic illness that disturbs many individuals globally. This study addresses the global impact of Diabetes Mellitus (DM) and emphasizes the critical role of accurate DM detection in early diagnosis, effective treatment, and prevention of complications. The research introduces an optimized DM detection model, the GBM-DRU (Gradient Boosting Machine - Data Reduction Unit) network, which integrates feature engineering and ensemble learning techniques to enhance prediction accuracy and support clinical decision-making. The GBM-DRU network combines the powerful gradient boosting machine algorithm with a data reduction unit (DRU) for efficient feature selection, reducing dimensionality and improving computational efficiency. Feature engineering enhances discriminatory power, while ensemble learning methods, including bagging and boosting, improve overall model performance. Rigorous experiments on a comprehensive dataset of DM patients demonstrate that the proposed approach outperforms existing models in terms of accuracy, sensitivity, specificity, and AUC-ROC. The optimized model provides valuable insights into feature importance, aiding clinical decision-making and deepening the understanding of DM risk factors. Therefore, the GBM-DRU network, utilizing feature engineering and ensemble learning, presents a viable approach to precise diagnosis of diabetes mellitus, with favorable implications for patient outcomes, disease control, and public health campaigns. The improved prediction accuracy, feature interpretability, and clinical decision support capabilities of the model may have a beneficial effect on public health campaigns, disease management, and patient outcomes. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

41. Heart disease prediction using ML through enhanced feature engineering with association and correlation analysis.

Author: Lakshmanarao, Annemneedi, Krishna, Thotakura Venkata Sai, Kiran, Tummala Srinivasa Ravi, Krishna, Chinta Venkata Murali, Ushanag, Samsani, and Supriya, Nandikolla
Subjects: HEART diseases, STATISTICAL correlation, MACHINE learning, SUPPORT vector machines, K-nearest neighbor classification, CLASSIFICATION algorithms
Abstract: Heart disease remains a prevalent and critical health concern globally. This paper addresses the critical task of heart disease prediction through the utilization of advanced machine learning techniques. Our approach focuses on the enhancement of feature engineering by incorporating a novel integration of association and correlation analyses. A heart disease dataset from Kaggle was used for the experiments. Association analysis was applied to the categorical and binary features in the dataset. Correlation analysis was applied to the numerical features in the dataset. Based on the insights from association analysis and correlation analysis, a new dataset was created with combinations of features. Later, newly created features are integrated with the original dataset, and classification algorithms are applied. Five machine learning (ML) classifiers, namely decision tree, k-nearest neighbors (KNN), random forest, XG-Boost, and support vector machine (SVM), were applied to the final dataset and achieved a good accuracy rate for heart disease detection. By systematically exploring associations and relationships with categorical, binary, and numerical features, this paper unveils innovative insights that contribute to a more comprehensive understanding of the heart disease dataset. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

42. Prediction of the Properties of Vibro-Centrifuged Variatropic Concrete in Aggressive Environments Using Machine Learning Methods.

Author: Beskopylny, Alexey N., Stel'makh, Sergey A., Shcherban', Evgenii M., Razveeva, Irina, Kozhakin, Alexey, Pembek, Anton, Kondratieva, Tatiana N., Elshaeva, Diana, Chernil'nik, Andrei, and Beskopylny, Nikita
Subjects: MACHINE learning, SMART structures, REINFORCED concrete, COMPOSITE columns, CONCRETE, ARTIFICIAL intelligence, COMPRESSIVE strength
Abstract: In recent years, one of the most promising areas in modern concrete science and the technology of reinforced concrete structures is the technology of vibro-centrifugation of concrete, which makes it possible to obtain reinforced concrete elements with a variatropic structure. However, this area is poorly studied and there is a serious deficiency in both scientific and practical terms, expressed in the absence of a systematic knowledge of the life cycle management processes of vibro-centrifuged variatropic concrete. Artificial intelligence methods are seen as one of the most promising methods for improving the process of managing the life cycle of such concrete in reinforced concrete structures. The purpose of the study is to develop and compare machine learning algorithms based on ridge regression, decision tree and extreme gradient boosting (XGBoost) for predicting the compressive strength of vibro-centrifuged variatropic concrete using a database of experimental values obtained under laboratory conditions. As a result of laboratory tests, a dataset of 664 samples was generated, describing the influence of aggressive environmental factors (freezing–thawing, chloride content, sulfate content and number of wetting–drying cycles) on the final strength characteristics of concrete. The use of analytical techniques to extract additional knowledge from data contributed to improving the resulting predictive properties of machine learning models. As a result, the average absolute percentage error (MAPE) for the best XGBoost algorithm was 2.72%, mean absolute error (MAE) = 1.134627, mean squared error (MSE) = 4.801390, root-mean-square error (RMSE) = 2.191208 and R2 = 0.93, which allows to conclude that it is possible to use "smart" algorithms to improve the life cycle management process of vibro-centrifuged variatropic concrete, by reducing the time required for the compressive strength assessment of new structures. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

43. UNDERSTANDING COLLEGE STUDENTS' SATISFACTION WITH CHATGPT: AN EXPLORATORY AND PREDICTIVE MACHINE LEARNING APPROACH USING FEATURE ENGINEERING.

Author: Pabreja, Kavita and Pabreja, Nishtha
Subjects: CHATGPT, MACHINE learning, SATISFACTION, COLLEGE students, ARTIFICIAL intelligence
Abstract: Artificial Intelligence (AI) technologies are continually improving and becoming more pervasive in many facets of our lives. ChatGPT is one such cutting-edge artificial intelligence application, and it has received a lot of worldwide media attention, specifically from educationists, technologists, and learners. It is imperative to understand and evaluate the impact of ChatGPT on computer science students as it directly and holistically influences them. A quantitative instrumental case study explores ChatGPT's impact on early adopters in education. A survey of undergraduate computer science students at a state university of Delhi was conducted to get insight into their opinion on adopting this revolutionising technology for their education, career, and overall satisfaction. An end-to-end data science approach is applied to encompass exploratory and predictive modelling with feature engineering solutions. Results reveal the most influential features contributing to students' satisfaction in adopting ChatGPT for their day-to-day chores concerning their social life, education, and career. The Linear Support Vector classifier, a machine learning algorithm for predicting students' satisfaction or dissatisfaction, shows an accuracy score of 72.73% and 97.72%, respectively. The AUC for this multiclass prediction model is convincing and is 0.74, 0.71, and 0.96 for satisfied, neutral, and dissatisfied classes, respectively. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

44. Integrated Mixed Potential Gas Sensor with Efficient Structure for Discriminative Volatile Organic Compounds Detection

Author: Siyuan Lv, Tianyi Gu, Qi Pu, Bin Wang, Xiaoteng Jia, Peng Sun, Lijun Wang, Fangmeng Liu, and Geyu Lu
Subjects: feature engineering, integrated gas sensor, new device structure, pattern recognition, volatile organic compounds detection, Science
Abstract: Amid growing interest in the precise detection of volatile organic compounds (VOCs) in industrial field, the demand for highly effective gas sensors is at an all‐time high. However, traditional sensors with their classic single‐output signal, bulky and complex integrated structure when forming array often involve complicated technology and high cost, limiting their widespread adoption. Here, this study introduces a novel approach, employing an integrated YSZ‐based (YSZ: yttria‐stabilized zirconia) mixed potential sensor equipped with a triple‐sensing electrode array, to efficiently detect and differentiate six types of VOCs gases. This innovative sensor integrates NiSb2O6, CuSb2O6, and MgSb2O6 sensing electrodes (SEs), which are sensitive to pentane, isoprene, n‐propanol, acetone, acetic acid, and formaldehyde gases. Through feature engineering based on intuitive spike‐based response values, it accentuates the distinct characteristics of every gas. Eventually, an average classification accuracy of 98.8% and an overall R‐squared error (R2) of 99.3% for concentration regression toward six target gases can be achieved, showcasing the potential to quantitatively distinguish between industrial hazardous VOCs gases.
Published: 2024
Full Text: View/download PDF

45. A machine learning model based on CHAT-23 for early screening of autism in Chinese children

Author: Hengyang Lu, Heng Zhang, Yi Zhong, Xiang-Yu Meng, Meng-Fei Zhang, and Ting Qiu
Subjects: autism spectrum disorder, CHAT-23, early screening, feature engineering, machine learning, Chinese children, Pediatrics, RJ1-570
Abstract: Introduction: Autism spectrum disorder (ASD) is a neurodevelopmental condition that significantly impacts the mental, emotional, and social development of children. Early screening for ASD typically involves the use of a series of questionnaires. With answers to these questionnaires, healthcare professionals can identify whether a child is at risk for developing ASD and refer them for further evaluation and diagnosis. CHAT-23 is an effective and widely used screening test in China for the early screening of ASD, which contains 23 different kinds of questions. Methods: We have collected clinical data from Wuxi, China. All the questions of CHAT-23 are regarded as different kinds of features for building machine learning models. We introduce machine learning methods into ASD screening, using the Max-Relevance and Min-Redundancy (mRMR) feature selection method to analyze the most important questions among all 23 from the collected CHAT-23 questionnaires. Seven mainstream supervised machine learning models were built and experiments were conducted. Results: Among the seven supervised machine learning models evaluated, the best-performing model achieved a sensitivity of 0.909 and a specificity of 0.922 when the number of features was reduced to 9. This demonstrates the model's ability to accurately identify children for ASD with high precision, even with a more concise set of features. Discussion: Our study focuses on the health of Chinese children, introducing machine learning methods to provide more accurate and effective early screening tests for autism. This approach not only enhances the early detection of ASD but also helps in refining the CHAT-23 questionnaire by identifying the most relevant questions for the diagnosis process.
Published: 2024
Full Text: View/download PDF

46. GastroVRG: Enhancing early screening in gastrointestinal health via advanced transfer features

Author: Mohammad Shariful Islam, Mohammad Abu Tareq Rony, and Tipu Sultan
Subjects: Gastrointestinal malignancies, CNN, Feature engineering, Transfer learning, DL, Cybernetics, Q300-390, Electronic computers. Computer science, QA75.5-76.95
Abstract: The accurate classification of endoscopic images is a challenging yet critical task in medical diagnostics, which directly affects the treatment and management of Gastrointestinal diseases. Misclassification can lead to incorrect treatment plans, adversely affecting patient outcomes. To address this challenge, our research aimed to develop a reliable computational model to improve the accuracy of classifying conditions of esophagitis and polyps. We focused on a subset of the Kvasir v1 secondary dataset, comprising 2000 endoscopic images evenly distributed across two classes: esophagitis and polyp. The goal was to leverage the strengths of both Machine Learning(ML) and Deep Learning(DL) to create a model that not only predicts with high accuracy but also integrates seamlessly into clinical workflows. To this end, we introduced a novel VRG-based ensemble image feature extraction technique, combining the powers of VGG, RF, and GB models to synthesize a robust feature set conducive to high-precision classification. The ensemble approach demonstrated a best-in-class performance with the GB model achieving an outstanding 99.73% accuracy in detecting esophagitis and polyps. The practical implications of these results are substantial, indicating that our method can significantly improve diagnostic accuracy in real-world settings, reduce the rate of misdiagnosis, and contribute to the efficient and effective treatment of patients, ultimately enhancing the quality of healthcare services. With the successful application of our proposed method to a controlled dataset, future work involves deploying the model in clinical environments and expanding its application to a broader spectrum of Gastrointestinal conditions across multi-class datasets.
Published: 2024
Full Text: View/download PDF

47. Political social media bot detection: Unveiling cutting-edge feature selection and engineering strategies in machine learning model development

Author: Zineb Ellaky and Faouzia Benabbou
Subjects: Feature selection, Feature engineering, Online social networks, Political bots detection, Machine learning, Social media security, Science
Abstract: Over time, social media bots (SMBs), specifically political SMBs, have played a crucial role in influencing and spreading misinformation, manipulating public opinion, and harassing and intimidating users of online social networks (OSNs). This article aims to study previous works on the detection and analysis of political SMB activities and address critical challenges that significantly impact the effectiveness of SMB detection models. These challenges include feature engineering, feature selection (FS), and model implementation. Over 33 features were extracted from the Twibot-20 dataset, including content, user information, network, behavior, and temporal features. Various FS techniques are explored and compared to select the optimal features, comprising basic, filter, wrapper, embedded, and hybrid. The optimal features are then employed to train multiple machine-learning algorithms. To balance the dataset, the synthetic minority oversampling technique coupled with edited nearest neighbors (Smote-ENN) is used. The results showed an improvement in model performance, from an initial Area Under the Curve (AUC) of 90.40 % and accuracy of 81.60 % using the original set to a score of 99.50 % for the test set and 100 % for the training set in all used metrics. Decision Trees, Random Forest, Gradient Boosting, Adaboost, XGB, and Extra Trees emerge as the most effective for detecting political SMBs.
Published: 2024
Full Text: View/download PDF

48. Deep learning-based electricity theft prediction in non-smart grid environments

Author: Sheikh Muhammad Saqib, Tehseen Mazhar, Muhammad Iqbal, Tariq Shahazad, Ahmad Almogren, Khmaies Ouahada, and Habib Hamam
Subjects: Deep learning, Feature engineering, Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), Random-Under-Sampler (RUS), Synthetic Minority Over-Sampling Technique (SMOTE), Science (General), Q1-390, Social sciences (General), H1-99
Abstract: In developing countries, smart grids are nonexistent, and electricity theft significantly hampers power supply. This research introduces a lightweight deep-learning model using monthly customer readings as input data. By employing careful direct and indirect feature engineering techniques, including Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), UMAP (Uniform Manifold Approximation and Projection), and resampling methods such as Random-Under-Sampler (RUS), Synthetic Minority Over-sampling Technique (SMOTE), and Random-Over-Sampler (ROS), an effective solution is proposed. Previous studies indicate that models achieve high precision, recall, and F1 score for the non-theft (0) class, but perform poorly, even achieving 0 %, for the theft (1) class. Through parameter tuning and employing Random-Over-Sampler (ROS), significant improvements in accuracy, precision (89 %), recall (94 %), and F1 score (91 %) for the theft (1) class are achieved. The results demonstrate that the proposed model outperforms existing methods, showcasing its efficacy in detecting electricity theft in non-smart grid environments.
Published: 2024
Full Text: View/download PDF

49. Interpretable machine learning boosting the discovery of targeted organometallic compounds with optimal bandgap

Author: Taehyun Park, JunHo Song, Jinyoung Jeong, Seungpyo Kang, Joonchul Kim, Joonghee Won, Jungim Han, and Kyoungmin Min
Subjects: Organometallic compounds, High-throughput calculations, Feature engineering, Machine learning, Active learning, Materials of engineering and construction. Mechanics of materials, TA401-492
Abstract: Organometallic compounds (OMCs) have attracted tremendous attention in various fields, such as photovoltaic cell and high-k dielectric application, due to their beneficial properties. Despite their potential, the progression of OMCs into industrial applications is hindered by the limited databases available for their properties and the absence of efficient surrogate models. To address this, in this study, optimally selected feature-based surrogate models for predicting the electronic properties of OMCs are constructed via various multiscale features and extensive database. To this end, high-throughput calculation was performed to obtain electronic properties of more than 18k materials generally known as organometallics, augmenting around 12k organic materials obtained from the public open data set, OMDB-GAP1. For generating features closely related to OMCs, descriptors encapsulating the information ranging from local to global, as well as other widely-used composition- and structure-based features (more than 3.5k in total), were employed. Among these descriptors, we identified 48 critical features that elucidate the physicochemical underpinnings of OMCs’ properties, suggesting their impact on the properties of OMCs. The light gradient boosting machine model achieved high-accuracy predictions across the entire database with just 1 % of the total descriptors, sufficiently compared to the entire sets (decreased of around 0.01 by R2 score and 0.01 eV by MAE). Furthermore, the efficacy of active learning process was demonstrated to find OMCs with optimal properties rapidly. As a result, expected improvement outperforms other methods by identifying 69 % of the target materials only searching 46 % of the total search space. Our constructed platform with a high-throughput calculated database can pave the way for the rapid screening of OMCs for the targeted industrial application, and suggest a comprehensive grasp of the intrinsic properties of OMCs and related compounds.
Published: 2024
Full Text: View/download PDF

50. Optimized ANN for LiFePO4 battery charge estimation using principal components based feature generation

Author: Chaitali Mehta, Amit V. Sant, and Paawan Sharma
Subjects: Battery, Feature engineering, Principal component analysis, Artificial neural networks, Optimizers, Transportation engineering, TA1001-1280, Renewable energy sources, TJ807-830
Abstract: Electric vehicles (EVs) have gained prominence in the present energy transition scenario. Widespread adoption of EVs necessitates an accurate State of Charge estimation (SoC) algorithm. Integrating predictive SoC estimations with smart charging strategies not only optimizes charging efficiency and grid reliability but also extends battery lifespan while continuously enhancing the accuracy of SoC predictions, marking a crucial milestone in sustainable electric vehicle technology. In this research study, machine learning methods, particularly Artificial Neural Networks (ANN), are employed for SoC estimation of LiFePO4 batteries, resulting in efficient and accurate estimation algorithms. The investigation first focuses on developing a custom-designed battery pack with 12 V, 4 Ah capacity with a facility for real-time data collection through a dedicated hardware setup. The voltage, current and open-circuit voltage of the battery are monitored with a computerized battery analyzer. The battery temperature is sensed with a DHT22 temperature sensor interfaced with Raspberry Pi. Principal components are derived for the collected battery data set and analyzed for feature engineering. Three principal components were generated as input parameters for the developed ANN. Early Stopping for the ANN was also implemented to achieve faster convergence of the ANN. The loss function is minimized while considering eleven combinations for ten different optimizers. Comparative analysis of hyperparameter tuning and optimizer selection revealed that the Adafactor optimizer with specific settings produced the best results with an RMSE value of 0.4083 and an R2 Score of 0.9998. The proposed algorithm was also implemented for two different types of datasets, a UDDS drive cycle and a standard cell-level dataset. The results obtained were in line with the results obtained with the ANN model developed based on the data collected from the developed experimental setup.
Published: 2024
Full Text: View/download PDF
