1. Optimal selection of benchmarking datasets for unbiased machine learning algorithm evaluation.
- Author
- Pereira, João Luiz Junho, Smith-Miles, Kate, Muñoz, Mario Andrés, and Lorena, Ana Carolina
- Subjects
- MACHINE learning, SUPERVISED learning, METAHEURISTIC algorithms, CLASSIFICATION algorithms, ALGORITHMS
- Abstract
- Whenever a new supervised machine learning (ML) algorithm or solution is developed, it is imperative to evaluate the predictive performance it attains on diverse datasets. This is done to stress-test the strengths and weaknesses of the new algorithm and to provide evidence for the situations in which it is most useful. A common practice is to gather datasets from public benchmark repositories for such an evaluation, but few or no specific criteria guide the selection of these datasets, which is often ad hoc. This paper investigates the importance of assembling a diverse benchmark of datasets in order to properly evaluate ML models and understand their capabilities. Leveraging meta-learning studies that evaluate the diversity of public dataset repositories, it introduces an optimization method for choosing varied classification and regression datasets from a pool of candidates. The method is based on maximum coverage, circular packing, and the meta-heuristic Lichtenberg Algorithm, ensuring that diverse datasets capable of challenging the ML algorithms more broadly are chosen. The selections were compared experimentally with a random selection of datasets and with k-medoids clustering, and proved more effective with respect to the diversity of the chosen benchmarks and their ability to challenge the ML algorithms at different levels. [ABSTRACT FROM AUTHOR]
- Published
- 2024
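The selection strategy summarized in the abstract lends itself to a small illustration. The sketch below is not the paper's method (which relies on circular packing and the Lichtenberg Algorithm); it only conveys the general idea of diversity-driven benchmark selection via greedy maximum coverage in a meta-feature space. The meta-feature coordinates, coverage radius, and number of selected datasets are all hypothetical, and the random baseline mirrors the comparison mentioned in the abstract.

```python
# Illustrative sketch only: greedy maximum-coverage selection of benchmark
# datasets in a 2-D meta-feature space. The paper's actual method uses
# circular packing and the Lichtenberg Algorithm; all values here
# (meta-feature coordinates, radius, number of picks) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical meta-feature coordinates for 200 candidate datasets
# (e.g., two projected hardness/complexity measures).
candidates = rng.uniform(0.0, 1.0, size=(200, 2))

def coverage(selected_idx, points, radius):
    """Fraction of candidate points within `radius` of any selected point."""
    if not selected_idx:
        return 0.0
    sel = points[list(selected_idx)]
    dists = np.linalg.norm(points[:, None, :] - sel[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1) <= radius))

def greedy_max_coverage(points, k, radius):
    """Greedily pick k datasets that maximize coverage of the meta-feature space."""
    selected = []
    for _ in range(k):
        base = coverage(selected, points, radius)
        best_gain, best_i = -1.0, None
        for i in range(len(points)):
            if i in selected:
                continue
            gain = coverage(selected + [i], points, radius) - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
    return selected

k, radius = 10, 0.15
chosen = greedy_max_coverage(candidates, k, radius)
random_pick = rng.choice(len(candidates), size=k, replace=False).tolist()

print("greedy coverage:", coverage(chosen, candidates, radius))
print("random coverage:", coverage(random_pick, candidates, radius))
```

On such synthetic data the greedy selection generally covers a larger share of the candidate space than a random pick of the same size, which is the kind of diversity comparison the abstract describes between the proposed method, random selection, and k-medoids clustering.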