Back to Search Start Over

A comparative study on the effect of data imbalance on software defect prediction.

Authors :
Liu, Yanbin
Zhang, Wen
Qin, Guangjie
Zhao, Jiangpeng
Source :
Procedia Computer Science; 2022, Vol. 214, p1603-1616, 14p
Publication Year :
2022

Abstract

In the current stage, software defect prediction is suffering the imbalanced data problem. Traditional methods are insensitive to defect-prone modules and tend to predict defect-prone modules as defect-free modules. To deal with this problem, sampling techniques are adopted to rebalance the defect-prone and defect-free data to train the predictive model in order to improve the performance. However, it is not clear on the combined effect of the sampling techniques and the machine learning classifiers on the performance of software defect prediction. The intent of the paper is to study the performance impact on defect prediction incurred by different combinations of sampling techniques and machine learning classifiers. Specifically, we investigate three types of sampling techniques as resampling, spread subsampling and SMOTE (Synthetic Minority Over-sampling Technique), and five types of machine learning classifiers as C4.5, naive Bayes, logistic regression, support vector machine and deep learning to study their combined effect on defect prediction. By using the Friedman test and Nemenyi test, we find that there isn't an optimal method among all the 12 combinations in defect prediction. However, support vector machine and deep learning have produced the best performance stably among all the investigated projects. With ANOVA analysis, we find that the sampling techniques have great impact on the outcomes of defect prediction because they produce different data distributions for model training. Nevertheless, the sampling proportion has significant impacts on TPR (True Positive Ratio) and FPR (False Positive Ratio) while it can merely influence the AUC (Area under Curve) and Balance of logistic regression. We explain the experimental results in the paper. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
18770509
Volume :
214
Database :
Supplemental Index
Journal :
Procedia Computer Science
Publication Type :
Academic Journal
Accession number :
160691222
Full Text :
https://doi.org/10.1016/j.procs.2022.11.349