1. A majority affiliation based under-sampling method for class imbalance problem.
- Author
- Xie, Ying; Huang, Xian; Qin, Feng; Li, Fagen; Ding, Xuyang
- Subjects
- *K-nearest neighbor classification, *VECTOR data, *PERFORMANCES
- Abstract
- Class imbalance makes it difficult to train a classifier that performs well on minority classes, especially when the imbalance ratio is high and the classes overlap significantly. Existing data-level methods often suffer from problems such as information loss and overfitting. To address these problems, we introduce a novel majority affiliation based under-sampling method (MAUS). MAUS employs a support vector data description model to capture the distribution of the minority class, forming a hyper-sphere that is used to establish a majority affiliation for each sample. Because the high-dimensional hyper-sphere is constructed from all minority class samples, it avoids the problem of overfitting. Combining the majority affiliation with the k-nearest neighbor algorithm, MAUS identifies regions of class overlap and then removes the majority samples within these regions that negatively impact classification performance. This selective removal limits information loss at classification boundaries while alleviating class overlap. Furthermore, by removing majority samples that lie far from the classification boundary, MAUS reduces the imbalance ratio to the expected value, yielding a balanced dataset. To validate the effectiveness of our method, we conducted extensive experiments comparing it with state-of-the-art methods on 30 publicly available datasets. The results indicate that our approach outperforms existing methods on most datasets and classifiers.
• Based on the hyper-sphere built from all minority samples, majority affiliation quantifies the probability of a sample belonging to the majority class.
• Each k-nearest neighbor majority sample within the overlapping region that adversely impacts classification performance is removed.
• Constrained by the expected imbalance ratio, the majority sample with the largest weight is removed in turn, ultimately yielding a balanced dataset.
• Two sets of experiments were conducted on 30 real-world datasets to evaluate MAUS against 13 state-of-the-art resampling methods.
[ABSTRACT FROM AUTHOR]
- Published
- 2024
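The two removal stages described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation. The abstract does not give the affiliation formula, the overlap rule, or the weights, so everything here is an assumption: `fit_sphere` stands in for the support vector data description model with a plain centroid-and-radius sphere, `majority_affiliation` is a hypothetical score equal to 0.5 on the sphere surface, and `maus_sketch`, `k`, and `target_ratio` are illustrative names.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_sphere(minority):
    """Stand-in for SVDD (assumption): centroid plus largest minority distance."""
    dim = len(minority[0])
    center = [sum(p[d] for p in minority) / len(minority) for d in range(dim)]
    radius = max(dist(p, center) for p in minority)
    return center, radius

def majority_affiliation(x, center, radius):
    """Hypothetical score in [0, 1): 0.5 on the sphere surface, higher outside."""
    d = dist(x, center)
    return d / (d + radius) if d + radius > 0 else 0.0

def maus_sketch(majority, minority, k=3, target_ratio=1.0):
    center, radius = fit_sphere(minority)
    score = [majority_affiliation(x, center, radius) for x in majority]

    # Stage 1 (assumed rule): drop a majority sample when it is among the
    # k nearest neighbors of some minority sample and lies inside the sphere.
    pool = [(p, False, j) for j, p in enumerate(minority)]
    pool += [(p, True, i) for i, p in enumerate(majority)]
    overlap = set()
    for j, q in enumerate(minority):
        others = [c for c in pool if c[1] or c[2] != j]  # exclude q itself
        others.sort(key=lambda c: dist(c[0], q))
        for _, is_majority, i in others[:k]:
            if is_majority and score[i] < 0.5:
                overlap.add(i)
    kept = [i for i in range(len(majority)) if i not in overlap]

    # Stage 2 (assumed weight = affiliation): remove the majority sample with
    # the largest score, one at a time, until the target ratio is reached.
    target = math.ceil(target_ratio * len(minority))
    while len(kept) > target:
        kept.remove(max(kept, key=lambda i: score[i]))
    return [majority[i] for i in kept]

minority = [(0, 0), (1, 0), (0, 1), (1, 1)]
majority = [(0.5, 0.6), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (10, 10)]
balanced = maus_sketch(majority, minority, k=3, target_ratio=1.0)
print(balanced)
```

In this toy data the majority point at (0.5, 0.6) sits among the minority samples and is removed in the first stage, and the two points farthest from the sphere are removed in the second stage, leaving four majority samples for four minority samples. A faithful version would replace `fit_sphere` with a trained SVDD model and use the affiliation and weight definitions from the paper itself.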