1. Entropy difference and kernel-based oversampling technique for imbalanced data learning.
- Author
-
Wu, Xu, Yang, Youlong, and Ren, Lingyu
- Subjects
ENTROPY (Information theory) ,DATA mining ,SUPPORT vector machines ,DATA distribution ,MACHINE learning - Abstract
Class imbalance is often a problem in various real-world datasets, where one class contains a small number of data and the other contains a large number of data. It is notably difficult to develop an effective model using traditional data mining and machine learning algorithms without using data preprocessing techniques to balance the dataset. Oversampling is often used as a pretreatment method for imbalanced datasets. Specifically, synthetic oversampling techniques focus on balancing the number of training instances between the majority class and the minority class by generating extra artificial minority class instances. However, the current oversampling techniques simply consider the imbalance of quantity and pay no attention to whether the distribution is balanced or not. Therefore, this paper proposes an entropy difference and kernel-based SMOTE (EDKS) which considers the imbalance degree of dataset from distribution by entropy difference and overcomes the limitation of SMOTE for nonlinear problems by oversampling in the feature space of support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in kernel space, determines the majority class and minority class, and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating its retention capability. Our algorithm can effectively distinguish those datasets with the same imbalance ratio but different distribution. The experimental study evaluates and compares the performance of our method against state-of-the-art algorithms, and then demonstrates that the proposed approach is competitive with the state-of-art algorithms on multiple benchmark imbalanced datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF