Formal Analysis and Estimation of Chance in Datasets Based on Their Properties.

Authors :
Taha, Abdel Aziz
Papariello, Luca
Bampoulidis, Alexandros
Knoth, Petr
Lupu, Mihai
Source :
IEEE Transactions on Knowledge & Data Engineering; Dec 2022, Vol. 34, Issue 12, pp. 5784-5795 (12 pages)
Publication Year :
2022

Abstract

Machine learning research, particularly in genomics, is often based on wide-shaped datasets, i.e., datasets with a large number of features but a small number of samples. Such configurations raise the possibility of chance influence (an increase in measured accuracy due to chance correlations) on the learning process and the evaluation results. Prior research has underlined the problem of generalizing models obtained from such data. In this paper, we investigate the influence of chance on prediction and show its significant effects on wide-shaped datasets. First, we empirically demonstrate how significant the influence of chance is in such datasets by showing that prediction models trained on thousands of randomly generated datasets can achieve high accuracy. This is the case even when using cross-validation. We then provide a formal analysis of chance influence and design formal chance influence estimators based on the dataset parameters, namely its sample size, the number of features, the number of classes, and the class distribution. Finally, we provide an in-depth discussion of the formal analysis, including applications of the findings and recommendations on chance influence mitigation. [ABSTRACT FROM AUTHOR]
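
The chance effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experimental protocol or their estimator: it assumes binary features and binary labels drawn uniformly at random, uses a deliberately trivial "model" (the single feature that best matches the labels, inverted if that helps), and reports in-sample accuracy without cross-validation. The function names and parameter values are hypothetical, chosen only for illustration.

```python
import random


def best_single_feature_accuracy(n_samples, n_features, rng):
    """Best in-sample accuracy any one random binary feature achieves
    on random binary labels (assumed setup, not the paper's method)."""
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_samples)]
        agree = sum(f == y for f, y in zip(feature, labels))
        # A feature that mostly disagrees is equally useful once inverted.
        accuracy = max(agree, n_samples - agree) / n_samples
        best = max(best, accuracy)
    return best


def mean_chance_accuracy(n_samples, n_features, n_datasets=50, seed=0):
    """Average the best-feature accuracy over many purely random datasets."""
    rng = random.Random(seed)
    total = sum(
        best_single_feature_accuracy(n_samples, n_features, rng)
        for _ in range(n_datasets)
    )
    return total / n_datasets


if __name__ == "__main__":
    # Fixed sample size, increasing width: the data contain no signal,
    # so any accuracy above 0.5 is due to chance correlations alone.
    for n_features in (1, 10, 100):
        print(n_features, round(mean_chance_accuracy(20, n_features), 3))
```

With the sample size held fixed, the measured accuracy rises as the number of features grows, even though the labels are independent of every feature. This is a simplified view of the dependence on sample size and feature count that the abstract mentions; it does not model the number of classes or the class distribution.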

Details

Language :
English
ISSN :
1041-4347
Volume :
34
Issue :
12
Database :
Complementary Index
Journal :
IEEE Transactions on Knowledge & Data Engineering
Publication Type :
Academic Journal
Accession number :
160692093
Full Text :
https://doi.org/10.1109/TKDE.2021.3068009