
A fast and flexible instance selection algorithm adapted to non-trivial database sizes

Authors :
Rachid Harba
Frédéric Ros
Serge Guillaume
Marco Pintore
Laboratoire Pluridisciplinaire de Recherche en Ingénierie des Systèmes, Mécanique et Energétique (PRISME)
Université d'Orléans (UO)-Ecole Nationale Supérieure d'Ingénieurs de Bourges (ENSI Bourges)
PILA
Information – Technologies – Analyse Environnementale – Procédés Agricoles (UMR ITAP)
Institut national de recherche en sciences et technologies pour l'environnement et l'agriculture (IRSTEA)-Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro)
Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)
Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro)-Institut national de recherche en sciences et technologies pour l'environnement et l'agriculture (IRSTEA)
Source :
Intelligent Data Analysis, IOS Press, 2015, 19 (3), pp. 631-658. ⟨10.3233/IDA-150736⟩
Publication Year :
2015
Publisher :
IOS Press, 2015.

Abstract

This paper proposes a new instance selection algorithm for classification that is designed to handle non-trivial database sizes. The algorithm is hybrid and runs with only a few parameters, which directly control the balance between the three objectives of classification: errors, storage requirements and runtime. It combines several mechanisms, based on neighborhood and stratification algorithms, that reduce runtime without significantly degrading efficiency. Instead of applying an IS (Instance Selection) algorithm to the whole database, IS is applied to strata derived from regions, each region representing a set of patterns selected from the original training set. Whether IS is applied depends on the purity of each region (i.e. the extent to which different categories of patterns are mixed in the region), and the stratification strategy is adapted to the region components. For each region, the number of delivered instances is first limited by an iterative process that takes the boundary complexity into account, and then optimized by removing superfluous instances. The instance sets obtained from all the regions are merged into an intermediate instance set, which undergoes a dedicated filtering process to deliver the final set. Experiments performed with various synthetic and real data sets demonstrate the advantages of the proposed approach.
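
The region-wise idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, and everything in it is an assumption made for illustration: regions are built with a plain grid (`grid_regions`, hypothetical) rather than the paper's neighborhood mechanisms, purity is taken as the majority-class fraction, a pure region is reduced to the single pattern nearest its centroid, and the IS step applied to mixed regions is Hart's Condensed Nearest Neighbour, swapped in as a stand-in. Stratification, the boundary-complexity iteration and the final filtering stage are omitted.

```python
from collections import Counter, defaultdict
from math import dist


def purity(labels):
    """Fraction of patterns in a region that belong to its majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def grid_regions(X, cell):
    """Hypothetical region builder: bucket pattern indices into grid cells."""
    regions = defaultdict(list)
    for i, x in enumerate(X):
        regions[tuple(int(v // cell) for v in x)].append(i)
    return list(regions.values())


def nearest_label(x, X, y, selected):
    """Label of the selected instance closest to x (1-NN rule)."""
    return y[min(selected, key=lambda s: dist(x, X[s]))]


def cnn_select(X, y, idx):
    """Hart's Condensed Nearest Neighbour, restricted to the indices in idx.

    Assumption: used here as a stand-in for the IS algorithm of the paper.
    """
    selected = [idx[0]]
    changed = True
    while changed:
        changed = False
        for i in idx:
            # Keep any pattern that the current subset misclassifies.
            if i not in selected and nearest_label(X[i], X, y, selected) != y[i]:
                selected.append(i)
                changed = True
    return selected


def centroid_nearest(X, idx):
    """Index of the pattern closest to the centroid of the region."""
    dim = len(X[idx[0]])
    centre = [sum(X[i][d] for i in idx) / len(idx) for d in range(dim)]
    return min(idx, key=lambda i: dist(X[i], centre))


def select_instances(X, y, cell=1.0, purity_threshold=1.0):
    """Apply IS only to mixed regions; pure regions keep one representative."""
    selected = []
    for idx in grid_regions(X, cell):
        if purity([y[i] for i in idx]) >= purity_threshold:
            selected.append(centroid_nearest(X, idx))
        else:
            selected.extend(cnn_select(X, y, idx))
    return sorted(selected)


# Toy data: one pure cell (class "a") and one cell mixing "a" and "b".
X_demo = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
          (1.1, 0.1), (1.2, 0.2), (1.8, 0.2), (1.9, 0.1)]
y_demo = ["a", "a", "a", "a", "a", "b", "b"]
kept = select_instances(X_demo, y_demo)
print(kept)
```

In this sketch the `cell` and `purity_threshold` parameters (both hypothetical) play the role of the few control parameters mentioned in the abstract: a lower purity threshold skips the IS step for more regions, trading some accuracy for runtime and storage.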

Details

ISSN :
1571-4128 and 1088-467X
Volume :
19
Database :
OpenAIRE
Journal :
Intelligent Data Analysis
Accession number :
edsair.doi.dedup.....6b68a4067dfa67f6da9be21abf1a30f6
Full Text :
https://doi.org/10.3233/ida-150736