Some combinatorics of data leakage induced by clusters.

Authors :
Guignard, Fabian
Ginsbourger, David
Levy Häner, Lilia
Herrera, Juan Manuel
Source :
Stochastic Environmental Research & Risk Assessment. Jul 2024, Vol. 38, Issue 7, p. 2815-2828. 14 p.
Publication Year :
2024

Abstract

Data leakage is a common issue that can lead to misleading generalisation error estimation and incorrect hyperparameter tuning. However, its mechanisms are not always well understood. In this work, we consider the case of clustered data and investigate the distribution of the number of elements in leakage when the data set is uniformly split. For both the validation and test sets, the first and second moments of the number of elements in leakage are derived analytically. Modelling consequences are investigated and exemplified on simulated data. In addition, the case of an actual agronomic feasibility study is presented. We demonstrate how data leakage can distort model performance estimation when an inadequate data splitting strategy is used. We provide an understanding of data leakage in the context of clustered data by quantifying its role in predictive modelling. This sheds light on related challenges that may impact the practice in agronomy and beyond. [ABSTRACT FROM AUTHOR]
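
Illustrative sketch (not from the article): the setting described in the abstract can be explored with a small Monte Carlo simulation. The code below assumes that a validation or test element counts as "in leakage" when at least one member of its cluster falls in the training set; this definition, the function names `count_leakage` and `empirical_moments`, and the split sizes are assumptions made for illustration and may differ from the authors' formulation.

```python
import random


def count_leakage(cluster_ids, n_train, n_val, rng):
    """Split indices uniformly at random and count leaked elements.

    Assumption: an element of the validation (or test) set is "in leakage"
    when its cluster also has at least one element in the training set.
    Returns (leaked validation elements, leaked test elements).
    """
    idx = list(range(len(cluster_ids)))
    rng.shuffle(idx)  # uniform random permutation of the data set
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    train_clusters = {cluster_ids[i] for i in train}
    leak_val = sum(cluster_ids[i] in train_clusters for i in val)
    leak_test = sum(cluster_ids[i] in train_clusters for i in test)
    return leak_val, leak_test


def empirical_moments(cluster_ids, n_train, n_val, n_rep=2000, seed=0):
    """Estimate the first and second moments of the validation leakage count."""
    rng = random.Random(seed)
    counts = [count_leakage(cluster_ids, n_train, n_val, rng)[0]
              for _ in range(n_rep)]
    first = sum(counts) / n_rep
    second = sum(c * c for c in counts) / n_rep
    return first, second


if __name__ == "__main__":
    # Hypothetical data set: 20 clusters of 5 elements each.
    clusters = [c for c in range(20) for _ in range(5)]
    m1, m2 = empirical_moments(clusters, n_train=60, n_val=20)
    print(f"mean leakage: {m1:.2f}, variance: {m2 - m1 ** 2:.2f}")
```

Such empirical estimates could be compared against closed-form moments like those the abstract reports as analytically derived; the simulation itself says nothing about the exact expressions in the article.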

Details

Language :
English
ISSN :
1436-3240
Volume :
38
Issue :
7
Database :
Academic Search Index
Journal :
Stochastic Environmental Research & Risk Assessment
Publication Type :
Academic Journal
Accession number :
178276337
Full Text :
https://doi.org/10.1007/s00477-024-02715-1