
Multi-class imbalanced big data classification on Spark.

Authors :
Sleeman IV, William C.
Krawczyk, Bartosz
Source :
Knowledge-Based Systems. Jan2021, Vol. 212, pN.PAG-N.PAG. 1p.
Publication Year :
2021

Abstract

Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the impact of class skew are no longer feasible due to the volume of datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where majority and minority classes are well-defined. Multi-class imbalanced data poses further challenges, as the relationship between classes is much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to an understanding of what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes the spatial limitations of its predecessor in distributed environments. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced.

• First complete framework for learning from multi-class imbalanced big data.
• Informative multi-class sampling methods that use instance-level characteristics.
• Novel oversampling modification dedicated to MapReduce environments.
• Code and data repository for reproducibility and applications of proposed methods.

[ABSTRACT FROM AUTHOR]
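
The abstract does not say how instance-level difficulty is measured or how the modified SMOTE works, so the following is only a minimal single-machine sketch of the general idea, not the authors' Spark implementation. It assumes a common neighborhood-based labeling (an instance is "safe", "borderline", "rare", or "outlier" depending on how many of its 5 nearest neighbors share its class) and the classic SMOTE interpolation step. The function names, the thresholds, and the fixed k = 5 are all illustrative assumptions.

```python
import math
import random


def k_nearest(points, idx, k):
    """Indices of the k nearest neighbours of points[idx] (brute force)."""
    dists = sorted(
        (math.dist(points[idx], p), j)
        for j, p in enumerate(points)
        if j != idx
    )
    return [j for _, j in dists[:k]]


def instance_type(points, labels, idx):
    """Label one instance by its 5-neighbourhood (assumed taxonomy, k fixed at 5)."""
    neighbours = k_nearest(points, idx, 5)
    same = sum(1 for j in neighbours if labels[j] == labels[idx])
    if same >= 4:
        return "safe"
    if same >= 2:
        return "borderline"
    if same == 1:
        return "rare"
    return "outlier"


def smote_sample(x, neighbour, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbour."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]
```

In an informed resampling scheme of the kind the abstract describes, such labels could be used to decide which instances are oversampled or removed, for example by generating synthetic points only from instances of a chosen type. How the paper actually weights each type is not stated in this record.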

Details

Language :
English
ISSN :
09507051
Volume :
212
Database :
Academic Search Index
Journal :
Knowledge-Based Systems
Publication Type :
Academic Journal
Accession number :
147777295
Full Text :
https://doi.org/10.1016/j.knosys.2020.106598