4,276 results for "imbalanced data"
Search Results
2. Knowledge distillation with resampling for imbalanced data classification: Enhancing predictive performance and explainability stability
- Author
-
Fujiwara, Kazuki
- Published
- 2024
- Full Text
- View/download PDF
3. Suppressed possibilistic fuzzy c-means clustering based on shadow sets for noisy data with imbalanced sizes
- Author
-
Yu, Haiyan, Li, Honglei, Xu, Xiaoyu, Gao, Qian, and Lan, Rong
- Published
- 2024
- Full Text
- View/download PDF
4. Explainable domain adaptation for imbalanced occupancy estimation
- Author
-
Mahamoodally, Naailah, Dridi, Jawher, and Amayri, Manar
- Published
- 2024
- Full Text
- View/download PDF
5. Generalization classification regularization generative adversarial network for machinery fault diagnostics under data imbalance
- Author
-
Lin, Cuiying, Kong, Yun, Huang, Guoyu, Han, Qinkai, Dong, Mingming, Liu, Hui, and Chu, Fulei
- Published
- 2025
- Full Text
- View/download PDF
6. Semi-supervised suppressed possibilistic Gustafson-Kessel clustering algorithm based on local information and knowledge propagation
- Author
-
Yu, Haiyan, Liu, Junnan, and Gong, Kaiming
- Published
- 2025
- Full Text
- View/download PDF
7. A neighborhood rough sets-based ensemble method, with application to software fault prediction
- Author
-
Jiang, Feng, Hu, Qiang, Yang, Zhiyong, Liu, Jinhuan, and Du, Junwei
- Published
- 2025
- Full Text
- View/download PDF
8. LLMOverTab: Tabular data augmentation with language model-driven oversampling
- Author
-
Isomura, Tokimasa, Shimizu, Ryotaro, and Goto, Masayuki
- Published
- 2025
- Full Text
- View/download PDF
9. Meta-task interpolation-based data augmentation for imbalanced health status recognition of complex equipment
- Author
-
Li, Jinyuan, Wan, Wenqing, Feng, Yong, and Chen, Jinglong
- Published
- 2025
- Full Text
- View/download PDF
10. Evolutionary multistage multitasking method for feature selection in imbalanced data
- Author
-
Ding, Weiping, Yao, Hongcheng, Huang, Jiashuang, Hou, Tao, and Geng, Yu
- Published
- 2025
- Full Text
- View/download PDF
11. Financial risk assessment of imbalanced data based on nonlinear causal time-series network
- Author
-
Li, Xiaoyang, Li, Weimin, Yu, Xiao, Han, Zhongming, and Jin, Qun
- Published
- 2025
- Full Text
- View/download PDF
12. Exploring the Generalizability of Transfer Learning for Camera Trap Animal Image Classification
- Author
-
Ramesh, Keshav, Darwish, Mahmoud, Zibli, Ahmed Sharafath Ahamed, Miller, Nikita Christ, Sajun, Ali Reza, Zualkernan, Imran, Habib, Altaf, Gardner, Andrew, Ghosh, Ashish, Editorial Board Member, Meo, Rosa, editor, and Silvestri, Fabrizio, editor
- Published
- 2025
- Full Text
- View/download PDF
13. Data Science for Insurance Fraud Detection: A Review
- Author
-
Banulescu-Radu, Denisa, Kougblenou, Yannick, and Dionne, Georges, editor
- Published
- 2025
- Full Text
- View/download PDF
14. Fair Latent Representation Learning with Adaptive Reweighing
- Author
-
Majumdar, Puspita, Sharma, Raghav, Bhattacharya, Rohit, Prajesh, Balraj, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Antonacopoulos, Apostolos, editor, Chaudhuri, Subhasis, editor, Chellappa, Rama, editor, Liu, Cheng-Lin, editor, Bhattacharya, Saumik, editor, and Pal, Umapada, editor
- Published
- 2025
- Full Text
- View/download PDF
15. CLIMB: Imbalanced Data Modelling Using Contrastive Learning with Limited Labels
- Author
-
Alsuhaibani, Abdullah, Razzak, Imran, Jameel, Shoaib, Wang, Xianzhi, Xu, Guandong, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Barhamgi, Mahmoud, editor, Wang, Hua, editor, and Wang, Xin, editor
- Published
- 2025
- Full Text
- View/download PDF
16. Modeling of a Novel Correlation-Weighted Elman Neural Network for Building Automation System
- Author
-
Kannan, R., Suresh, S., Bhuvanesh, A., Sivasankari, N., Nandu Krishna, S., Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Shrivastava, Vivek, editor, Bansal, Jagdish Chand, editor, and Panigrahi, B. K., editor
- Published
- 2025
- Full Text
- View/download PDF
17. Insider Threat Detection Based on User and Entity Behavior Analysis with a Hybrid Model
- Author
-
Song, Yue, Yuan, Jianting, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Mouha, Nicky, editor, and Nikiforakis, Nick, editor
- Published
- 2025
- Full Text
- View/download PDF
18. SC-WGAN: GAN-Based Oversampling Method for Network Intrusion Detection
- Author
-
Bai, Wuxia, Wang, Kailong, Chen, Kai, Li, Shenghui, Li, Bingqian, Zhang, Ning, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bai, Guangdong, editor, Ishikawa, Fuyuki, editor, Ait-Ameur, Yamine, editor, and Papadopoulos, George A., editor
- Published
- 2025
- Full Text
- View/download PDF
19. FedIBD: a federated learning framework in asynchronous mode for imbalanced data
- Author
-
Hou, Yingwei, Li, Haoyuan, Guo, Zihan, Wu, Weigang, Liu, Rui, and You, Linlin
- Abstract
With the development of edge computing and Internet of Things (IoT), the computing power of edge devices continues to increase, and the data obtained is more specific and private. Methods based on Federated Learning (FL) can help utilize the data that exists widely on edge devices in a privacy-preserving way and train a shareable global model collaboratively. However, the imbalanced data from edge devices pose a huge challenge to FL, as data features extracted from uneven, biased, and incomplete samples complicate the model aggregation process required to achieve well-performing models. To support FL on imbalanced data, a new asynchronous FL framework, named FedIBD: Federated learning framework in Asynchronous mode for Imbalanced Data, is proposed. FedIBD not only considers the temporal inconsistency in asynchronous learning but also measures the informative differences in imbalanced data to support FL in asynchronous and heterogeneous environments. Compared with the existing synchronous and asynchronous FL methods, FedIBD can achieve significantly better performance in terms of accuracy, communication time and cost on imbalanced data. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
20. Double fuzzy relaxation local information C-Means clustering
- Author
-
Gao, Yunlong, Zheng, Xingshen, Wu, Qinting, Zhang, Jiahao, Cao, Chao, and Pan, Jinyan
- Abstract
Fuzzy c-means clustering (FCM) has gained widespread application because of its ability to capture uncertain information in data effectively. However, owing to its prior assumption of identical distribution, traditional FCM is sensitive to noise and cluster size. Modified methods incorporating local spatial information can enhance robustness to noise. However, they tend to balance cluster sizes, resulting in poor performance when dealing with imbalanced data. Modified methods that learn the statistical characteristics of data can handle imbalanced data. However, they are often sensitive to noise because they ignore local information. To address the lack of a method that can simultaneously alleviate the sensitivity to noise and cluster size, a double fuzzy relaxation local information c-means clustering algorithm (DFRLICM) is proposed in this paper. Firstly, sample relaxation is introduced to explore potential clustering results and enhance inter-class separability. Secondly, to cooperate with the relaxation, we design fuzzy weights to record the imbalance situation of data clusters, enhancing the capability of the algorithm in dealing with imbalanced data. Thirdly, we introduce a fuzzy factor to account for the preservation of local structures in data and improve the robustness of the algorithm. Finally, we integrate the three elements into a unified model framework to jointly optimize robustness to noise and insensitivity to cluster size. Extensive experiments are conducted and the results demonstrate that the proposed algorithm indeed achieves a balance between robustness to noise and insensitivity to cluster size. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
21. A comprehensive case study on the performance of machine learning methods on the classification of solar panel electroluminescence images.
- Author
-
Song, Xinyi, Odongo, Kennedy, Pascual, Francis G., and Hong, Yili
- Subjects
CONVOLUTIONAL neural networks, SOLAR cells, DEEP learning, SOLAR panels, ENERGY harvesting
- Abstract
Photovoltaics (PV) are widely used to harvest solar energy, an important form of renewable energy. Photovoltaic arrays consist of multiple solar panels constructed from solar cells. Solar cells in the field are vulnerable to various defects, and electroluminescence (EL) imaging provides effective and nondestructive diagnostics to detect those defects. We use multiple traditional machine learning and modern deep learning models to classify EL solar cell images into different functional/defective categories. Because of the asymmetry in the number of functional versus defective cells, an imbalanced label problem arises in the EL image data. The current literature lacks insights on which methods and metrics to use for model training and prediction. In this article, we comprehensively compare different machine learning and deep learning methods under different performance metrics on the classification of solar cell EL images from monocrystalline and polycrystalline modules. We provide a comprehensive discussion on different metrics. Our results provide insights and guidelines for practitioners in selecting prediction methods and performance metrics. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
22. Wind turbine blade icing diagnosis based on k-means clustering and label propagation algorithm.
- Author
-
Bai, Xinjian, Xu, Yuchen, Liu, Yongqian, Meng, Hang, and Davaasuren, Tsogtgerel
- Subjects
WIND turbine blades, FATIGUE life, FAULT diagnosis, WIND power plants, PROBLEM solving
- Abstract
Wind turbine blade icing seriously affects power generation performance and fatigue life, and effective diagnosis of blade icing is critical for mitigating icing effects. Current diagnostic methods are greatly affected by unbalanced data, especially for small sample data. To effectively solve the above problems, a novel diagnostic method combining k-means clustering with a label propagation algorithm is proposed. Specifically, k-means clustering handles unlabeled SCADA data, generating initial pseudo-labels. Then, the label propagation algorithm refines these pseudo-labels, enhancing labeling accuracy and overall classification performance. Finally, the effectiveness of the proposed method is validated using four different types of classifiers for two wind farms. The results show that the proposed method improves the average diagnostic accuracy by 3.38% compared to models that eliminate unlabeled data, with a 4.2% improvement in small sample scenarios. The results demonstrate that the method exhibits high accuracy and significant generalization ability in diagnosing blade icing, offering practical benefits for data analysis and fault diagnosis. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
23. Clustering and classification for dry bean feature imbalanced data.
- Author
-
Lee, Chou-Yuan, Wang, Wei, and Huang, Jian-Qiong
- Abstract
The traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) have low classification performance. This paper proposes an algorithm for the dry bean dataset and obesity levels dataset that can balance the minority class and the majority class and has a clustering function to improve the traditional machine learning classification accuracy and various performance indicators such as precision, recall, F1-score, and area under curve (AUC) for imbalanced data. The key idea is to use the advantages of borderline-synthetic minority oversampling technique (BLSMOTE) to generate new samples using samples on the boundary of minority class samples to reduce the impact of noise on model building, and the advantages of K-means clustering to divide data into different groups according to similarities or common features. The results show that the proposed algorithm BLSMOTE + K-means + SVM is superior to other traditional machine learning methods in classification and various performance indicators. The BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and the BLSMOTE + K-means + RF ranks the importance of explanatory variables. These experimental results can provide scientific evidence for decision-makers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. AEGAN-Pathifier: a data augmentation method to improve cancer classification for imbalanced gene expression data.
- Author
-
Zhang, Qiaosheng, Wei, Yalong, Hou, Jie, Li, Hongpeng, and Zhong, Zhaoman
- Subjects
*GENERATIVE adversarial networks, *DATA augmentation, *DEEP learning, *AUTOENCODER, *TUMOR classification
- Abstract
Background: Cancer classification has consistently been a challenging problem, with the main difficulties being high-dimensional data and the collection of patient samples. Concretely, obtaining patient samples is a costly and resource-intensive process, and imbalances often exist between samples. Moreover, expression data is characterized by high dimensionality, small samples and high noise, which could easily lead to struggles such as dimensionality catastrophe and overfitting. Thus, we incorporate prior knowledge from the pathway and combine AutoEncoder and Generative Adversarial Network (GAN) to solve these difficulties. Results: In this study, we propose an effective and efficient deep learning method, named AEGAN, which combines the capabilities of AutoEncoder and GAN to generate synthetic samples of the minority class in imbalanced gene expression data. The proposed data balancing technique has been demonstrated to be useful for cancer classification and improving the performance of classifier models. Additionally, we integrate prior knowledge from the pathway and employ the pathifier algorithm to calculate pathway scores for each sample. This data augmentation approach, referred to as AEGAN-Pathifier, not only preserves the biological functionality of the data but also possesses dimensional reduction capabilities. Through validation with various classifiers, the experimental results show an improvement in classifier performance. Conclusion: AEGAN-Pathifier shows improved performance on the imbalanced datasets GSE25066, GSE20194, BRCA and Liver24. Results from various classifiers indicate that AEGAN-Pathifier has good generalization capability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. Enhancing credit card fraud detection: highly imbalanced data case.
- Author
-
Breskuvienė, Dalia and Dzemyda, Gintautas
- Subjects
ARTIFICIAL neural networks, CREDIT card fraud, FEATURE selection, INFORMATION storage & retrieval systems, FRAUD investigation, BIG data
- Abstract
Fraud is a widespread challenge in today's financial landscape, requiring innovative methods and technologies to detect and prevent losses from the sophisticated tactics used by fraudsters. This paper emphasizes the main issues in fraud detection and suggests a novel feature selection method called FID-SOM (feature selection for imbalanced data using SOM). Feature selection can significantly improve classification performance. Given the inherent imbalance in fraud detection data, feature selection must be done with an enhanced focus. To accomplish this task, we use Self-Organizing maps, which are a special type of artificial neural network. FID-SOM is designed to address the challenge of dimensionality reduction in scenarios characterized by highly imbalanced data. It has been specifically designed to efficiently process and analyze vast and complex datasets commonly encountered in the financial sector, showcasing adaptability to the dynamic nature of big data environments. The uniqueness of the proposed method is in forming a new dataset containing the Best-Matching Units of the trained SOM as vectors of attributes corresponding to the initial features. These attributes are sorted based on variance in descending order. By keeping the required number of attributes that hold the highest percentage of variability, we select features corresponding to those attributes for further analysis. The proposed FID-SOM method has demonstrated its ability to perform on par with, if not surpass, existing methodologies. It also shows innovative potential. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. A Lightweight Kernel Density Estimation and Adaptive Synthetic Sampling Method for Fault Diagnosis of Rotating Machinery with Imbalanced Data.
- Author
-
Lu, Wenhao, Wang, Wei, Qin, Xuefei, and Cai, Zhiqiang
- Abstract
Rotating machinery is widely used across various industries, making its reliable operation crucial for industrial production. However, in real-world settings, intelligent fault diagnosis faces challenges due to imbalanced fault data and the complexity of neural network models. These challenges are particularly pronounced when defining decision boundaries accurately and managing limited computational resources in real-time machine monitoring. To address these issues, this study presents KDE-ADASYN-based MobileNet with SENet (KAMS), a lightweight convolutional neural network designed for fault diagnosis in rotating machinery. KAMS effectively handles data imbalances commonly found in industrial applications and is optimized for real-time monitoring. The model employs the Kernel Density Estimation Adaptive Synthetic Sampling (KDE-ADASYN) algorithm for oversampling to balance the data, applies fast Fourier transform (FFT) to convert time-domain signals into frequency-domain signals, and utilizes a 1D-MobileNet network enhanced with a Squeeze-and-Excitation (SE) block for feature extraction and fault diagnosis. Experimental results across datasets with varying imbalance ratios demonstrate that KAMS achieves excellent performance, maintaining nearly 90% accuracy even on highly imbalanced datasets. Comparative experiments further demonstrate that KAMS not only delivers exceptional diagnostic performance but also significantly reduces network parameters and computational resource requirements. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. Detecting Aggression in Language: From Diverse Data to Robust Classifiers.
- Author
-
Wawer, Aleksander, Mykowiecka, Agnieszka, and Żuk, Bartosz
- Subjects
POLISH language, HATE speech, CYBERBULLYING, LANGUAGE & languages
- Abstract
The automatic detection of aggressive language is a difficult challenge. Currently, three datasets are available in Polish, enabling the training of machine learning models to recognise different types of linguistic aggression. In this paper, we address the issues of the transferability of knowledge between datasets and training a single model that works best on all types of aggression. Due to data imbalance, we experiment with two loss functions dedicated to training on imbalanced data: Weighted Cross-Entropy and Focal loss. Using the Polish language HerBERT model, we present the results of experiments in the Cross-dataset scenario and the model results using the combined data. Our results show that (1) combining diverse types of linguistic aggression during training leads to a better-performing classifier and (2) Weighted Cross-Entropy outperforms other tested loss functions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
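The two loss functions named in result 27 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-example formulation, and the default values of gamma and alpha are assumptions chosen for illustration, using only the standard definitions of class-weighted cross-entropy and focal loss.

```python
import math


def weighted_cross_entropy(p_true, class_weight):
    """Cross-entropy for one example, scaled by the weight of its true class.

    p_true is the predicted probability of the true class (0 < p_true <= 1).
    class_weight is typically larger for rarer classes (an assumption here).
    """
    return -class_weight * math.log(p_true)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    The factor (1 - p_true) ** gamma shrinks the loss of examples that are
    already classified confidently, so hard examples dominate training.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(p, weighted_cross_entropy(p, class_weight=1.0), focal_loss(p))
```

With gamma set to 0 the focal loss reduces to alpha-weighted cross-entropy, which is one way to see the two losses as members of the same family.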
28. Filtering Useful App Reviews Using Naïve Bayes—Which Naïve Bayes?
- Author
-
Ataei, Pouya, Regula, Sri, Staegemann, Daniel, and Malgaonkar, Saurabh
- Abstract
App reviews provide crucial feedback for software maintenance and evolution, but manually extracting useful reviews from vast volumes is time-consuming and challenging. This study investigates the effectiveness of six Naïve Bayes variants for automatically filtering useful app reviews. We evaluated these variants on datasets from five popular apps, comparing their performance in terms of accuracy, precision, recall, F-measure, and processing time. Our results show that Expectation Maximization-Multinomial Naïve Bayes with Laplace smoothing performed best overall, achieving up to 89.2% accuracy and 0.89 F-measure. Complement Naïve Bayes with Laplace smoothing demonstrated particular effectiveness for imbalanced datasets. Generally, incorporating Laplace smoothing and Expectation Maximization improved performance, albeit with increased processing time. This study also examined the impact of data imbalance on classification performance. Our findings suggest that these advanced Naïve Bayes variants hold promise for filtering useful app reviews, especially when dealing with limited labeled data or imbalanced datasets. This research contributes to the body of evidence around app review mining and provides insights for enhancing software maintenance and evolution processes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Influence of Explanatory Variable Distributions on the Behavior of the Impurity Measures Used in Classification Tree Learning.
- Author
-
Gajowniczek, Krzysztof and Dudziński, Marcin
- Subjects
*BETA distribution, *INTERACTIVE learning, *DISTRIBUTION (Probability theory), *MACHINE learning, *VALUES (Ethics), *LOGISTIC regression analysis
- Abstract
The primary objective of our study is to analyze how the nature of explanatory variables influences the values and behavior of impurity measures, including the Shannon, Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja, and Kapur entropies. Our analysis aims to use these measures in the interactive learning of decision trees, particularly in the tie-breaking situations where an expert needs to make a decision. We simulate the values of explanatory variables from various probability distributions in order to consider a wide range of variability and properties. These probability distributions include the normal, Cauchy, uniform, exponential, and two beta distributions. This research assumes that the values of the binary responses are generated from the logistic regression model. All of the six mentioned probability distributions of the explanatory variables are presented in the same graphical format. The first two graphs depict histograms of the explanatory variables values and their corresponding probabilities generated by a particular model. The remaining graphs present distinct impurity measures with different parameters. In order to examine and discuss the behavior of the obtained results, we conduct a sensitivity analysis of the algorithms with regard to the entropy parameter values. We also demonstrate how certain explanatory variables affect the process of interactive tree learning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. TMS: Ensemble Deep Learning Model for Accurate Classification of Monkeypox Lesions Based on Transformer Models with SVM.
- Author
-
Abdelrahim, Elsaid Md., Hashim, Hasan, Atlam, El-Sayed, Osman, Radwa Ahmed, and Gad, Ibrahim
- Subjects
*CONVOLUTIONAL neural networks, *CLINICAL decision support systems, *TRANSFORMER models, *MONKEYPOX, *DEEP learning
- Abstract
Background/Objectives: The emergence of monkeypox outside its endemic region in Africa has raised significant concerns within the public health community due to its rapid global dissemination. Early clinical differentiation of monkeypox from similar diseases, such as chickenpox and measles, presents a challenge. The Monkeypox Skin Lesion Dataset (MSLD) used in this study comprises monkeypox skin lesions, which were collected primarily from publicly accessible sources. The dataset contains 770 original images captured from 162 unique patients. The MSLD includes four distinct class labels: monkeypox, measles, chickenpox, and normal. Methods: This paper presents an ensemble model for classifying the monkeypox dataset, which includes transformer models and support vector machine (SVM). The model development process begins with an evaluation of seven convolutional neural network (CNN) architectures. The proposed model is developed by selecting the top four models based on evaluation metrics for performance. The top four CNN architectures, namely EfficientNetB0, ResNet50, MobileNet, and Xception, are used for feature extraction. The high-dimensional feature vectors extracted from each network are then concatenated and optimized before being inputted into the SVM classifier. Results: The proposed ensemble model, in conjunction with the SVM classifier, achieves an accuracy of 95.45%. Furthermore, the model demonstrates high precision (95.51%), recall (95.45%), and F1 score (95.46%), indicating its effectiveness in identifying monkeypox lesions. Conclusions: The results of the study show that the proposed hybrid framework achieves robust diagnostic performance in monkeypox detection, offering potential utility for enhanced disease monitoring and outbreak management. The model's high diagnostic accuracy and computational efficiency indicate that it can be used as an additional tool for clinical decision support. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
31. Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular Data.
- Author
-
Hayes, Nicole, Merkurjev, Ekaterina, and Wei, Guo-Wei
- Subjects
*TRANSFORMER models, *DRUG discovery, *CLASS size, *STATISTICAL correlation, *ALGORITHMS
- Abstract
Data sets with imbalanced class sizes, where one class size is much smaller than that of others, occur exceedingly often in many applications, including those with biological foundations, such as disease diagnosis and drug discovery. Therefore, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to do so can result in heavy costs. Nonetheless, many data classification procedures do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this work, we propose the BTDT-MBO algorithm, incorporating Merriman–Bence–Osher (MBO) approaches and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification tasks on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed technique not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer procedure based on an attention mechanism for self-supervised learning. In addition, the model implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed method is validated using six molecular data sets and compared to other related techniques. The computational experiments show that the proposed technique is superior to competing approaches even in the case of a high class imbalance ratio. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
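The general idea of decision threshold adjustment mentioned in result 31 can be sketched as follows. This is a generic illustration only and does not reproduce the BTDT-MBO algorithm: the helper names, the use of F1 as the selection criterion, and the toy scores are all assumptions made for the example.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score of the positive class when predicting 1 for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Pick the observed score that maximises F1 (hypothetical selection rule)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


if __name__ == "__main__":
    # Toy validation set: the minority class (label 1) never scores above 0.5,
    # so the default 0.5 cut-off would miss every positive example.
    scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4]
    labels = [0, 0, 0, 0, 1, 1]
    print(f1_at_threshold(scores, labels, 0.5))
    print(best_threshold(scores, labels))
```

In practice the threshold would be chosen on held-out validation data rather than on the data used for the final evaluation.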
32. Predicting crash occurrence at intersections in Texas: an opportunity for machine learning.
- Author
-
Charm, Theodore, Wang, Haoqi, Zuniga-Garcia, Natalia, Ahmed, Mostaq, and Kockelman, Kara M.
- Subjects
*STANDARD deviations, *TRAFFIC accidents, *SIGNALIZED intersections, *RANDOM forest algorithms, *MACHINE learning
- Abstract
This paper predicts the frequency of traffic crashes at intersections across Texas by employing Zero-inflated Negative Binomial (ZINB) and Negative Binomial-Lindley (NB-L) generalized linear models, as well as various tree-based machine learning (ML) methods, namely Random Forests (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Bayesian Additive Regression Trees (BART). Official crash reports from 2010 through 2019 were linked to Texas' over 700,000 intersections. RF provided the best prediction performance (using R-square and Root Mean Square Error metrics) while serving well for highly imbalanced crash data (with many zero cases). Sensitivity analysis highlights the practical significance of signalized intersection, annual average daily traffic, number of lanes at intersection approach, and other covariates. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. Porphyry-type mineral prospectivity mapping with imbalanced data via prior geological transfer learning.
- Author
-
Mantilla-Dulcey, Ana, Goyes-Peñafiel, Paul, Báez-Rodríguez, Rosana, and Khurama, Sait
- Abstract
• Implementation of geological concepts with deep learning for mineral prospectivity.
• Exploration of transfer learning techniques via prior geological knowledge.
• Weighted loss function to address class and sampling imbalanced data issues.
• Pretext geological and feature data extraction task to enhance model prediction.
Mineral prospectivity mapping is crucial for identifying areas with economically valuable minerals. Therefore, several methods based on machine learning have been applied to predict the likelihood of mineral occurrences, especially deep learning (DL), which provides a flexible and precise approach to the use of continuous data. It allows the approximation of predictive variables with probability values related to new ore targets. However, in the early stages of mineral exploration, DL-based methods face a challenge related to class and sampling imbalance due to scarce mineral deposits, which leaves too few samples for training and limits the model's predictive ability. This work proposed a detailed and systematic framework to address imbalanced data issues with prior geological transfer learning and a weighted loss function. We exploited the abundant pixel information of input variables to develop a pretext geological classification and a feature data extraction task as an initializer for the trainable variables of the neural network. The proposed workflow was tested in a porphyry-rich Yukon (Canada) region and outperformed other state-of-the-art classification algorithms such as random forest, support vector machines, and logistic regression. Moreover, our results were contrasted against different geological reports, where our mineral prospectivity map was coherent with regional and local potential assessments of porphyry-type mineral occurrences. The quantitative metrics with a validation dataset suggested that the proposed method can effectively predict mineral prospective areas in different imbalanced data scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
34. Application of data imbalance handling techniques for vacuum pad pressure anomaly detection and analysis of their effects.
- Author
-
김지연, 김기환, 하상현, 강영진, and 정석찬
- Subjects
MANUFACTURING processes, LEAD time (Supply chain management), ROBOTS, ACQUISITION of data, INDUSTRIAL robots
- Abstract
As the use of industrial robots increases globally, it is becoming increasingly important to detect potential operational issues early to ensure stable and efficient operations. In particular, since robots perform critical tasks in the production process, failure to detect abnormal operations in time can lead to decreased productivity, equipment damage, and even production shutdowns. However, the collection of abnormal data is challenging, leading to data imbalance issues, which pose a major obstacle in building effective detection models. To address this issue, this paper applies data imbalance handling techniques such as SMOTE, ADASYN, undersampling, and SMOTE-Tomek to improve the performance of the anomaly detection model for the vacuum pad pressure of a delta robot. A performance comparison using XGBoost and LightGBM models showed that the XGBoost model with the ADASYN technique exhibited significant improvements in accuracy, precision, recall, F1-Score, and the confusion matrix. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. Improved KD-tree based imbalanced big data classification and oversampling for MapReduce platforms.
- Author
-
Sleeman IV, William C., Roseberry, Martha, Ghosh, Preetam, Cano, Alberto, and Krawczyk, Bartosz
- Subjects
MACHINE learning, WEBSITES, CLASSIFICATION algorithms, SKEWNESS (Probability theory), WEB services, BIG data
- Abstract
In the era of big data, it is necessary to provide novel and efficient platforms for training machine learning models over large volumes of data. The MapReduce approach and its Apache Spark implementation are among the most popular methods that provide high-performance computing for classification algorithms. However, they require dedicated implementations that will take advantage of such architectures. Additionally, many real-world big data problems are plagued by class imbalance, posing challenges to the classifier training step. Existing solutions for alleviating skewed distributions do not work well in the MapReduce environment. In this paper, we propose a novel KD-tree based classifier, together with a variation of the SMOTE algorithm dedicated to the Spark platform. Our algorithms offer excellent predictive power and can work simultaneously with binary and multi-class imbalanced data. Exhaustive experiments conducted using the Amazon Web Service platform showcase the high efficiency and flexibility of our proposed algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. Siamese Network-Based Lightweight Framework for Tomato Leaf Disease Recognition.
- Author
-
Thuseethan, Selvarajah, Vigneshwaran, Palanisamy, Charles, Joseph, and Wimalasooriya, Chathrie
- Subjects
CROP losses, DEEP learning, PLANT diseases, TOMATOES, RECOGNITION (Psychology)
- Abstract
In this paper, a novel Siamese network-based lightweight framework is proposed for automatic tomato leaf disease recognition. This framework achieves the highest accuracy of 96.97% on the tomato subset obtained from the PlantVillage dataset and 95.48% on the Taiwan tomato leaf disease dataset. Experimental results further confirm that the proposed framework is effective with imbalanced and small data. The backbone network integrated with this framework is lightweight with approximately 2.9629 million trainable parameters, which is second to SqueezeNet and significantly lower than other lightweight deep networks. Automatic tomato disease recognition from leaf images is vital to avoid crop losses by applying control measures on time. Even though recent deep learning-based tomato disease recognition methods with classical training procedures showed promising recognition results, they demand large labeled data and involve expensive training. The traditional deep learning models proposed for tomato disease recognition also consume high memory and storage because of a high number of parameters. While lightweight networks overcome some of these issues to a certain extent, they continue to show low performance and struggle to handle imbalanced data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. Practical guideline to efficiently detect insurance fraud in the era of machine learning: A household insurance case.
- Author
-
Banulescu‐Radu, Denisa and Yankol‐Schalck, Meryem
- Subjects
INSURANCE crimes, FRAUD, INSURANCE policies, FRAUD investigation, MACHINE learning
- Abstract
Identifying insurance fraud is a difficult task due to the complex nature of the fraud itself, the diversity of techniques employed, the rarity of fraud cases observed in data sets, and the relatively limited allocation of human, financial, and time resources to carry out investigations. The aim of this paper is to provide a clean and well-structured study on modeling fraud on home insurance contracts, using real French data from 2013 to 2017. Several methods are developed to identify risk factors and unusual customer behaviors. Traditional econometric models as well as new machine‐learning algorithms with good predictive performance and high operational efficiency are tested, while maintaining method interpretability. Each methodology is evaluated on the basis of adequate performance measures and the issue of imbalanced databases is also addressed. Finally, specific methods are applied to interpret the results of the machine‐learning methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. Long-Term or Short-Term? Prediction of Ship Detention Duration Based on Machine Learning.
- Author
-
Deng, Qingyue and Wan, Zheng
- Subjects
LETTERS of intent, TANKERS, POLLUTION prevention, RANDOM forest algorithms, DATA distribution
- Abstract
The prevalence of ship deficiencies continues to be a significant issue. Data from the Tokyo Memorandum of Understanding reveals that ship detentions in 2023 surged by more than 80% compared with the previous year. The significant number of detained ships not only disrupts ships' daily operations but also strains port resources, leading to increased additional costs. In light of this issue, predicting the duration of ship detention becomes crucial, as accurate predictions can assist port managers in resource allocation and provide shipping companies with critical information for operational planning. This study is the first to predict ship detention duration, specifically distinguishing between long-term and short-term detained ships. Initially, key deficiency types influencing the ship detention duration were identified using an improved entropy weight–grey relational analysis. Subsequently, in consideration of the imbalance in data distribution between long-term and short-term detentions, a random forest model capable of handling imbalanced data was applied to classify these two types. The study found that fire safety, propulsion and auxiliary machinery, and pollution prevention are the three most critical deficiency types impacting detention duration; and the random forest model sampled and processed from the data level possessed the best model performance, achieving prediction accuracies of 0.71, 0.72, and 0.85 for bulk carriers, containers, and oil tankers, respectively. This research offers a comprehensive analysis of ship detention duration, making a significant contribution to both the theoretical understanding and practical applications in the maritime industry. Accurately predicting ship detention duration provides valuable insights for stakeholders, enabling them to anticipate potential detention scenarios and thus supporting shipping companies in effective fleet management while assisting port authorities in the optimal allocation of berth resources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. Mises-Fisher similarity-based boosted additive angular margin loss for breast cancer classification.
- Author
-
Alirezazadeh, P., Dornaika, F., and Charafeddine, J.
- Abstract
To enhance the accuracy of breast cancer diagnosis, current practices rely on biopsies and microscopic examinations. However, this approach is known for being time-consuming, tedious, and costly. While convolutional neural networks (CNNs) have shown promise for their efficiency and high accuracy, training them effectively becomes challenging in real-world learning scenarios such as class imbalance, small-scale datasets, and label noise. Angular margin-based softmax losses, which concentrate on the angle between features and classifiers embedded in cosine similarity at the classification layer, aim to regulate feature representation learning. Nevertheless, the cosine similarity’s lack of a heavy tail impedes its ability to compactly regulate intra-class feature distribution, limiting generalization performance. Moreover, these losses are constrained to target classes when margin penalties are applied, which may not always optimize effectiveness. Addressing these hurdles, we introduce an innovative approach termed MF-BAM (Mises-Fisher Similarity-based Boosted Additive Angular Margin Loss), which extends beyond traditional cosine similarity and is anchored in the von Mises-Fisher distribution. MF-BAM not only penalizes the angle between deep features and their corresponding target class weights but also considers angles between deep features and weights associated with non-target classes. Through extensive experimentation on the BreaKHis dataset, MF-BAM achieves outstanding accuracies of 99.92%, 99.96%, 100.00%, and 98.05% for magnification levels of ×40, ×100, ×200, and ×400, respectively. Furthermore, additional experiments conducted on the BACH dataset for breast cancer classification, as well as on the LFW and YTF datasets for face recognition, affirm the generalization capability of our proposed loss function. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Addressing overfitting in classification models for transport mode choice prediction: a practical application in the Aburrá Valley, Colombia.
- Author
-
Salazar-Serna, Kathleen, Barona, Sergio A., García, Isabel C., Cadavid, Lorena, and Franco, Carlos J.
- Subjects
*K-nearest neighbor classification, *DATA distribution, *RANDOM forest algorithms, *MACHINE learning, *DATA reduction, *CHOICE of transportation, *DIMENSION reduction (Statistics)
- Abstract
Overfitting poses a significant limitation in mode choice prediction using classification models, often worsened by the proliferation of features from encoding categorical variables. While dimensionality reduction techniques are widely utilized, their effects on travel-mode choice models’ performance have yet to be comparatively studied. This research compares the impact of dimensionality reduction methods (PCA, CATPCA, FAMD, LDA) on the performance of multinomial models and various supervised learning classifiers (XGBoost, Random Forest, Naive Bayes, K-Nearest Neighbors, Multinomial Logit) for predicting travel mode choice. Utilizing survey data from the Aburrá Valley in Colombia, we detail the process of analyzing derived dimensions and selecting optimal models for both overall and class-specific predictions. Results indicate that dimension reduction enhances predictive power, particularly for less common transport modes, providing a strategy to address class imbalance without modifying data distribution. This methodology deepens understanding of travel behavior, offering valuable insights for modelers and policymakers in developing regions with similar characteristics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. Compound facial expressions recognition approach using DCGAN and CNN.
- Author
-
Ullah, Sana, Ou, Jie, Xie, Yuanlun, and Tian, Wenhong
- Subjects
CONVOLUTIONAL neural networks, FACIAL expression & emotions (Psychology), EMOTION recognition, GENERATIVE adversarial networks, DATA augmentation, DEEP learning
- Abstract
Facial expression recognition (FER) technology has numerous applications in various fields such as health, entertainment and gaming, transportation, advertising and marketing, education, and many more. The recognition system based on seven basic expressions of emotion cannot satisfy the requirement for compound expression recognition, as compound expressions have complex features due to a combination of basic emotions. Deep learning-based recognition of compound emotions has recently emerged as a significant subject of academic research. However, applying deep learning models to compound emotion recognition is hampered by class imbalance among the images in the dataset. To overcome this problem, the deep convolutional generative adversarial network (DCGAN) architecture has been utilized to balance the dataset. The study proposes a novel approach for compound emotion recognition based on DCGAN and a convolutional neural network (CNN), which has been validated using the Compound Facial Expression Emotions (CFEE) and Real-world Affective Faces Database (RAF-DB) datasets. The proposed approach achieved an accuracy of 72.0% on the CFEE dataset and 65.4% on RAF-DB for compound facial expressions of emotions. The AU-ROC/AU-PR curves and ablation experiments also validated the performance of the proposed approach. The performance of the proposed model is compared with the state-of-the-art advanced models on the balanced datasets, and for both the CFEE and RAF-DB datasets, the accuracy of compound emotion recognition improved considerably. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. A Novel Method for 3D Lung Tumor Reconstruction Using Generative Models.
- Author
-
Najafi, Hamidreza, Savoji, Kimia, Mirzaeibonehkhater, Marzieh, Moravvej, Seyed Vahid, Alizadehsani, Roohallah, and Pedrammehr, Siamak
- Subjects
*GENERATIVE adversarial networks, *REINFORCEMENT learning, *LUNG tumors, *LUNG cancer, *EARLY detection of cancer
- Abstract
Background: Lung cancer remains a significant health concern, and the effectiveness of early detection significantly enhances patient survival rates. Identifying lung tumors with high precision is a challenge due to the complex nature of tumor structures and the surrounding lung tissues. Methods: To address these hurdles, this paper presents an innovative three-step approach that leverages Generative Adversarial Networks (GAN), Long Short-Term Memory (LSTM), and VGG16 algorithms for the accurate reconstruction of three-dimensional (3D) lung tumor images. The first challenge we address is the accurate segmentation of lung tissues from CT images, a task complicated by the overwhelming presence of non-lung pixels, which can lead to classifier imbalance. Our solution employs a GAN model trained with a reinforcement learning (RL)-based algorithm to mitigate this imbalance and enhance segmentation accuracy. The second challenge involves precisely detecting tumors within the segmented lung regions. We introduce a second GAN model with a novel loss function that significantly improves tumor detection accuracy. Following successful segmentation and tumor detection, the VGG16 algorithm is utilized for feature extraction, preparing the data for the final 3D reconstruction. These features are then processed through an LSTM network and converted into a format suitable for the reconstructive GAN. This GAN, equipped with dilated convolution layers in its discriminator, captures extensive contextual information, enabling the accurate reconstruction of the tumor's 3D structure. Results: The effectiveness of our method is demonstrated through rigorous evaluation against established techniques using the LIDC-IDRI dataset and standard performance metrics, showcasing its superior performance and potential for enhancing early lung cancer detection. Conclusions: This study highlights the benefits of combining GANs, LSTM, and VGG16 into a unified framework. This approach significantly improves the accuracy of detecting and reconstructing lung tumors, promising to enhance diagnostic methods and patient results in lung cancer treatment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. Research on bearing fault diagnosis method based on cjbm with semi-supervised and imbalanced data.
- Author
-
Li, Sai, Peng, Yanfeng, Bin, Guangfu, Shen, Yiping, Guo, Yong, Li, Baoqing, Jiang, Yongzheng, and Fan, Chao
- Abstract
Data-driven intelligent methods have been widely used in bearing fault diagnosis. However, it is observed that previous studies on bearing fault diagnosis always assume that the label samples are sufficient and that the number of normal and fault samples is the same or similar, which is challenging to meet in practical engineering applications. This assumption reduces the accuracy and stability of the semi-supervised imbalanced bearing data fault diagnosis model in practical working conditions. The complex training and weak interpretation problems of transfer learning methods are analyzed, and a center jumping boosting machine method for bearing intelligent fault recognition with semi-supervised and imbalanced data is proposed. First, a modified density peak clustering (DPC) algorithm is used to classify unlabeled samples and select subsamples, and aiming at the DPC problem, a γ DPC algorithm based on the γ jumping phenomenon is proposed to determine the number of clusters and intercept distance automatically. Second, combined with the synthetic minority oversampling technique, some minority class samples are added to achieve a balanced bearing dataset. Then, a few known faults are used to assign pseudo-labels to unknown samples. Finally, to diagnose the new data and reduce the amount of calculation in actual production, the balanced data after processing are used to train the bottom light gradient boosting machine model to solve intelligent classification and recognition of bearing vibration data. In addition, by using three bearing datasets with different balance ratios and comparing them with other methods, the superiority of the proposed method is verified in bearing condition identification. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. A rolling bearing fault diagnosis method for imbalanced data based on multi-scale self-attention mechanism and novel loss function.
- Author
-
Qiang Ruiru and Zhao Xiaoqiang
- Abstract
Deep learning methods are widely used in the field of rolling bearing fault diagnosis and produce good results when faced with datasets with roughly equal numbers of normal and faulty samples. However, real-world data often has a serious imbalance, with the number of fault samples being significantly less than the number of normal samples. This dataset imbalance challenges the performance of traditional deep learning methods. To address this problem, this paper proposes an efficient imbalanced data rolling bearing fault diagnosis method. The method consists of two parts: a deep learning network based on a multi-scale self-attention mechanism and a novel loss function. In terms of the deep learning network, firstly, the one-dimensional vibration signal is converted into a two-dimensional image through the Gramian angular field. This conversion maximises the inherent feature extraction capability of the network. Subsequently, the multi-scale learning capability of the network is enhanced by implementing different expansion rates for the head of the multi-scale self-attention mechanism. This nuanced approach allows the network to capture the underlying information more efficiently. Finally, the inclusion of Ghost bottlenecks and feature pyramid networks (FPNs) helps to optimise network efficiency and improve generalisation performance. A novel loss function is also proposed to make the method more suitable for imbalanced data. During the training process, the classification of samples in each class is analysed using the recall metric of imbalanced classification and the real-time recall is used as a weight to weaken the dominance of the majority class. This weighting ensures the adaptability of the method to imbalanced datasets. The proposed method is evaluated using rolling bearing datasets from Case Western Reserve University, USA, and Southeast University, China. Comparison results with other state-of-the-art deep learning methods show that the proposed method has a robust performance when dealing with imbalanced data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. nsDCC: dual-level contrastive clustering with nonuniform sampling for scRNA-seq data analysis.
- Author
-
Wang, Linjie, Li, Wei, Zhou, Fanghui, Yu, Kun, Feng, Chaolu, and Zhao, Dazhe
- Subjects
*IRREGULAR sampling (Signal processing), *GENE expression, *RNA sequencing, *DATA reduction, *DATA analysis
- Abstract
Dimensionality reduction and clustering are crucial tasks in single-cell RNA sequencing (scRNA-seq) data analysis, treated independently in the current process, hindering their mutual benefits. The latest methods jointly optimize these tasks through deep clustering. However, contrastive learning, with powerful representation capability, can bridge the gap that common deep clustering methods face, which requires pre-defined cluster centers. Therefore, a dual-level contrastive clustering method with nonuniform sampling (nsDCC) is proposed for scRNA-seq data analysis. Dual-level contrastive clustering, which combines instance-level contrast and cluster-level contrast, jointly optimizes dimensionality reduction and clustering. Multi-positive contrastive learning and unit matrix constraint are introduced in instance- and cluster-level contrast, respectively. Furthermore, the attention mechanism is introduced to capture inter-cellular information, which is beneficial for clustering. The nsDCC focuses on important samples at category boundaries and in minority categories by the proposed nearest boundary sparsest density weight assignment algorithm, making it capable of capturing comprehensive characteristics against imbalanced datasets. Experimental results show that nsDCC outperforms the six other state-of-the-art methods on both real and simulated scRNA-seq data, validating its performance on dimensionality reduction and clustering of scRNA-seq data, especially for imbalanced data. Simulation experiments demonstrate that nsDCC is insensitive to "dropout events" in scRNA-seq. Finally, cluster differential expressed gene analysis confirms the meaningfulness of results from nsDCC. In summary, nsDCC is a new way of analyzing and understanding scRNA-seq data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Enhancing Cover Management Factor Classification Through Imbalanced Data Resolution.
- Author
-
Nguyen, Kieu Anh and Chen, Walter
- Subjects
RANDOM forest algorithms, LAND cover, SOIL erosion, MACHINE learning, EDUCATIONAL outcomes
- Abstract
This study addresses the persistent challenge of class imbalance in land use and land cover (LULC) classification within the Shihmen Reservoir watershed in Taiwan, where LULC is used to map the Cover Management factor (C-factor). The dominance of forests in the LULC categories leads to an imbalanced dataset, resulting in poor prediction performance for minority classes when using machine learning techniques. To overcome this limitation, we applied the Synthetic Minority Over-sampling Technique (SMOTE) and the 90-model SMOTE-variants package in Python to balance the dataset. Due to the multi-class nature of the data and memory constraints, 42 models were successfully used to create a balanced dataset, which was then integrated with a Random Forest algorithm for C-factor classification. The results show a marked improvement in model accuracy across most SMOTE variants, with the Selected Synthetic Minority Over-sampling Technique (Selected_SMOTE) emerging as the best-performing method, achieving an overall accuracy of 0.9524 and a sensitivity of 0.6892. Importantly, the previously observed issue of poor minority class prediction was resolved using the balanced dataset. This study provides a robust solution to the class imbalance issue in C-factor classification, demonstrating the effectiveness of SMOTE variants and the Random Forest algorithm in improving model performance and addressing imbalanced class distributions. The success of Selected_SMOTE underscores the potential of balanced datasets in enhancing machine learning outcomes, particularly in datasets dominated by a majority class. Additionally, by addressing imbalance in LULC classification, this research contributes to Sustainable Development Goal 15, which focuses on the protection, restoration, and sustainable use of terrestrial ecosystems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
47. Improved proximity weighted synthetic oversampling technique.
- Author
-
邢胜, 王晓兰, 沈家星, 朱美玲, 曹永青, and 何玉林
- Abstract
Copyright of Journal of Shenzhen University Science & Engineering is the property of Editorial Department of Journal of Shenzhen University Science & Engineering and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
48. Analysis of ensemble machine learning classification comparison on the skin cancer MNIST dataset.
- Author
-
Lokapitasari Belluano, Poetri Lestari, Rahma, Reyna Aprilia, Darwis, Herdianti, and Manga, Abdul Rachman
- Subjects
SKIN cancer, MACHINE learning, STATISTICAL sampling, DATA analysis, ACCURACY
- Abstract
This study aims to analyze the performance of various ensemble machine learning methods, such as Adaboost, Bagging, and Stacking, in the context of skin cancer classification using the skin cancer MNIST dataset. We also evaluate the impact of handling dataset imbalance on the classification model's performance by applying imbalanced data methods such as random under sampling (RUS), random over sampling (ROS), synthetic minority over-sampling technique (SMOTE), and synthetic minority over-sampling technique with edited nearest neighbor (SMOTEENN). The research findings indicate that Adaboost is effective in addressing data imbalance, while imbalanced data methods can significantly improve accuracy. However, the selection of imbalanced data methods should be carefully tailored to the dataset characteristics and clinical objectives. In conclusion, addressing data imbalance can enhance skin cancer classification accuracy, with Adaboost being an exception that shows a decrease in accuracy after applying imbalanced data methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
49. Automatic classification of transportation modes using smartphone sensors: addressing imbalanced data and enhancing training with focal loss and artificial bee colony algorithm.
- Author
-
Xu, Xiaoyu
- Abstract
The growing interest in utilizing smartphone sensors to differentiate various transportation modes stems from its potential benefits across multiple sectors, including health monitoring, transportation planning, and geo-specific utilities. This research presents a model that uses smartphone accelerometers, magnetometers, and gyroscope sensor data to classify transportation and vehicular modes. To tackle the issue posed by imbalanced data, we suggest a training approach incorporating focal loss, which selectively samples minority class examples and enables the model to focus on more complex instances. Our model surpasses other machine learning models and achieves impressive results on an imbalanced dataset obtained from the HTC company. This dataset includes data from 224 volunteers collected over two years, comprising 8311 h and 100 GB of data. We suggest using the artificial bee colony (ABC) algorithm to improve the training process further. This algorithm is adept at comprehensively exploring the search space, thereby assisting in determining appropriate initial weights. Such an approach helps accelerate the convergence process during training and mitigates the issue of initialization sensitivity often linked with gradient-dependent training techniques such as backpropagation. In our research, we have conducted tests on the dataset to pinpoint the most effective values for critical parameters. Furthermore, we perform ablation studies to evaluate the effects of focal loss and the ABC algorithm on the model's efficacy, illustrating their individual and combined beneficial impacts. This research holds significant implications for the practical application of mobile sensing in transportation, offering a robust tool for enhancing various services and systems related to mobility and urban planning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. Feature Selection based Improved Seagull Optimization for Imbalanced Data Classification.
- Author
-
Somashekar, Thatikonda and Pelluri, Sudha
- Subjects
METAHEURISTIC algorithms, FEATURE selection, SUPPORT vector machines, CORONAVIRUSES, WRAPPERS
- Abstract
Feature selection is a technique that involves selecting relevant features from the original set to increase the wrapper model's performance, efficiency, and interpretability. However, imbalanced data classification is a primary challenge, where the number of instances in different classes are not evenly distributed. This imbalance leads to biased models due to the focus on dominant features of the majority class, which minimizes the effectiveness and classification accuracy for the minority class. This research proposes an Improved Seagull Optimization Algorithm (ISOA) for imbalanced data classification. The SOA is improved by employing Modified Tent (MTent), non-linear inertia weight, and double helix formula, also enhancing seagull population, convergence speed, and optimization accuracy. Initially, the University of California, Irvine (UCI) and Knowledge Extraction based on Evolutionary Learning (KEEL) are used to evaluate the ISOA performance. The min-max normalization is used to normalize data which enhances the generalization performance. Then, the Remodelled Normalized Euclidean Matrix Weight Synthetic Minority Oversampling Technique (RNED-WSMOTE) is employed for balancing the imbalanced data. Further, ISOA is employed to choose appropriate features and a Support Vector Machine (SVM) is performed to classify the different classes. When compared to the existing methods like Binary Memory-based Sand Cat Swarm Optimization (BMSCSO), Coronavirus Herd Immunity Optimizer (CHIO) with Greedy Crossover (GC), and Binary Snake Optimizer (BSO), the ISOA accomplishes a better average classification accuracy of 0.9964 and 0.9812 for Diagnostic and Cleveland datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF