1. Relevant information undersampling to support imbalanced data classification
- Author
- Andrés Marino Álvarez-Meza, Genaro Daza-Santacoloma, J. Hoyos-Osorio, Germán Castellanos-Domínguez, and Álvaro-Ángel Orozco-Gutierrez
- Subjects
Structure (mathematical logic), 0209 industrial biotechnology, Computer science, Cognitive Neuroscience, 02 engineering and technology, Imbalanced data, Computer Science Applications, Statistical classification, 020901 industrial engineering & automation, Sampling distribution, Artificial Intelligence, Undersampling, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data mining, Cluster analysis, Relevant information
- Abstract
Traditional classification algorithms assume that the sample distribution among classes is balanced. When the data are imbalanced, this assumption biases performance toward the majority class. This paper proposes a Relevant Information-based UnderSampling (RIUS) approach that selects the most relevant examples from the majority class to improve classification performance in imbalanced data scenarios. RIUS builds on the information-preservation principle, which extracts the majority class's underlying structure with fewer samples. Additionally, we couple our RIUS approach with the well-known Clustering-based Undersampling algorithm (CBUS) to enhance the data representation, and name this enhancement CRIUS. Experimental results show that RIUS and CRIUS reveal the data's relevant structure and reduce the loss of information by selecting the most informative instances.
- Published
- 2021
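To make the general idea of clustering-based undersampling concrete, here is a minimal sketch in plain Python. It is not the RIUS or CRIUS method from the paper, and it may differ from the CBUS variant the authors use. It assumes a simple scheme: run k-means on the majority class with as many clusters as there are non-majority samples, then keep the real majority sample closest to each centroid. The names `sq_dist`, `kmeans`, and `cluster_undersample` are hypothetical, and the sketch assumes at least one non-majority sample is present.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct samples drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_undersample(X, y, majority_label, seed=0):
    """Keep one real majority sample per cluster, plus all other samples."""
    maj_idx = [i for i, label in enumerate(y) if label == majority_label]
    rest_idx = [i for i, label in enumerate(y) if label != majority_label]
    k = min(len(maj_idx), len(rest_idx))
    centroids = kmeans([X[i] for i in maj_idx], k, seed=seed)
    # Two centroids may share a nearest sample, so at most k are kept.
    nearest = {min(maj_idx, key=lambda i: sq_dist(X[i], c)) for c in centroids}
    keep = sorted(nearest | set(rest_idx))
    return [X[i] for i in keep], [y[i] for i in keep]


# Synthetic example (illustrative data only): 200 majority vs 20 minority points.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(20)]
y = [0] * 200 + [1] * 20
X_bal, y_bal = cluster_undersample(X, y, majority_label=0)
print(y_bal.count(0), y_bal.count(1))
```

In this sketch the minority class is left untouched and the majority class shrinks to at most as many samples as the minority class. Picking the real sample nearest each centroid, rather than the centroid itself, keeps the reduced set made of observed data points.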