Understanding overfitting in random forest for probability estimation: a visualization and simulation study

Authors :
Lasai Barreñada
Paula Dhiman
Dirk Timmerman
Anne-Laure Boulesteix
Ben Van Calster
Source :
Diagnostic and Prognostic Research, Vol 8, Iss 1, Pp 1-14 (2024)
Publication Year :
2024
Publisher :
BMC, 2024.

Abstract

Background: Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) a simulation study.

Methods: For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGMs), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000).

Results: The visualizations suggested that the model learned "spikes of probability" around events in the training set. A cluster of events created a bigger peak or plateau (signal), whereas isolated events created local peaks (noise). In the simulation study, median training AUCs were between 0.97 and 1 unless there were 4 binary predictors or 16 binary predictors with a minimum node size of 20. The median discrimination loss, i.e., the difference between the median test AUC and the true AUC, was 0.025 (range 0.00 to 0.13). Median training AUCs had Spearman correlations of around 0.70 with discrimination loss. Median test AUCs were higher with higher events per variable, higher minimum node size, and binary predictors. Median training calibration slopes were always above 1 and were not correlated with median test slopes across scenarios (Spearman correlation −0.11). Median test slopes were higher with higher true AUC, higher minimum node size, and higher sample size.

Conclusions: Random forests learn local probability peaks that often yield near-perfect training AUCs without strongly affecting AUCs on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.
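
The abstract summarizes performance with two metrics, the AUC (discrimination) and the calibration slope. As a rough illustration only, and not the authors' code (the study used the ranger R package), the following pure-Python sketch shows one common way to compute both. The function names, the clipping constant `eps`, and the toy data are assumptions made for this example; the toy data are not taken from the study.

```python
import math
import random


def sigmoid(z):
    """Inverse logit."""
    return 1 / (1 + math.exp(-z))


def auc(y, p):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))


def calibration_slope(y, p, eps=1e-6, max_iter=50):
    """Slope b of the logistic regression y ~ a + b * logit(p), fit by Newton's method."""
    # eps is an assumed guard that keeps logits finite when p is exactly 0 or 1.
    clipped = [min(max(pi, eps), 1 - eps) for pi in p]
    x = [math.log(pi / (1 - pi)) for pi in clipped]
    a, b = 0.0, 1.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = sigmoid(a + b * xi)
            w = mu * (1 - mu)
            g0 += yi - mu          # gradient w.r.t. intercept
            g1 += (yi - mu) * xi   # gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return b


# Toy data (hypothetical): outcomes drawn from a one-predictor logistic model.
rng = random.Random(0)
true_logit = [rng.gauss(0, 1) for _ in range(2000)]
y = [1 if rng.random() < sigmoid(l) else 0 for l in true_logit]
p_true = [sigmoid(l) for l in true_logit]               # correctly specified risks
p_overconfident = [sigmoid(2 * l) for l in true_logit]  # risks that are too extreme

print("AUC:", auc(y, p_true), auc(y, p_overconfident))
print("Slope:", calibration_slope(y, p_true), calibration_slope(y, p_overconfident))
```

In this toy setup the two sets of predictions have the same AUC, because doubling the logit does not change the ranking, while the too-extreme predictions have half the calibration slope. A slope below 1 indicates estimates that are too extreme and a slope above 1 indicates estimates that are too modest. Note that the slope fit does not converge when predictions separate the outcomes perfectly (AUC of exactly 1).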

Details

Language :
English
ISSN :
2397-7523
Volume :
8
Issue :
1
Database :
Directory of Open Access Journals
Journal :
Diagnostic and Prognostic Research
Publication Type :
Academic Journal
Accession number :
edsdoj.0a3450ab9244d92851909742ecf816b
Document Type :
article
Full Text :
https://doi.org/10.1186/s41512-024-00177-1