Imbalanced data preprocessing techniques for machine learning: a systematic mapping study.

Authors :
Werner de Vargas, Vitor
Schneider Aranda, Jorge Arthur
dos Santos Costa, Ricardo
da Silva Pereira, Paulo Ricardo
Victória Barbosa, Jorge Luis
Source :
Knowledge & Information Systems; Jan 2023, Vol. 65, Issue 1, pp. 31-57 (27 pages)
Publication Year :
2023

Abstract

Machine Learning (ML) algorithms have been increasingly replacing people in several application domains—in which the majority suffer from data imbalance. In order to solve this problem, published studies implement data preprocessing techniques, cost-sensitive and ensemble learning. These solutions reduce the naturally occurring bias towards the majority sample through ML. This study uses a systematic mapping methodology to assess 9927 papers related to sampling techniques for ML in imbalanced data applications from 7 digital libraries. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. As a result of a thorough quantitative analysis of these papers, this study proposes two taxonomies—illustrating sampling techniques and ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models have the best performance—with potentially better results through hybrid sampling techniques. Finally, none of the 35 works apply simulation-based synthetic oversampling, indicating a path for future preprocessing solutions. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
0219-1377
Volume :
65
Issue :
1
Database :
Complementary Index
Journal :
Knowledge & Information Systems
Publication Type :
Academic Journal
Accession number :
161327873
Full Text :
https://doi.org/10.1007/s10115-022-01772-8