1. Unsupervised Variable Selection Using a Genetic Algorithm: An Application to Textual Data
- Author
- Rammal, Abbas, Ezukwoke, Kenneth, Hoayek, Anis, Batton-Hubert, Mireille, Centre de Recherche en Sciences et Technologies de l'Information et de la Communication - EA 3804 (CRESTIC), Université de Reims Champagne-Ardenne (URCA), Laboratoire d'Informatique, de Modélisation et d'Optimisation des Systèmes (LIMOS), Ecole Nationale Supérieure des Mines de St Etienne (ENSM ST-ETIENNE)-Centre National de la Recherche Scientifique (CNRS)-Université Clermont Auvergne (UCA)-Institut national polytechnique Clermont Auvergne (INP Clermont Auvergne), Université Clermont Auvergne (UCA)-Université Clermont Auvergne (UCA), École des Mines de Saint-Étienne (Mines Saint-Étienne MSE), Institut Mines-Télécom [Paris] (IMT), Département Génie mathématique et industriel (FAYOL-ENSMSE), Ecole Nationale Supérieure des Mines de St Etienne (ENSM ST-ETIENNE)-Institut Henri Fayol, Institut Henri Fayol (FAYOL-ENSMSE), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT), Institut Montpelliérain Alexander Grothendieck (IMAG), and Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)
- Subjects
- [INFO.INFO-MO] Computer Science [cs]/Modeling and Simulation
- Abstract
- Failure analysis in microelectronics production is an important step in improving product quality and development. Understanding the failure mechanisms, and therefore implementing corrective actions on the cause of the failure, depends on the results of these analyses. The analyses are stored as textual features. Such data must first be pre-processed and vectorized (converted to numeric form). Then, to overcome the curse of dimensionality caused by vectorization, a dimension reduction is applied. We first study the potential of an unsupervised variable selection technique to identify the variables that best discriminate between groups of textual data in terms of separation and compactness. Many variable (or feature) selection methods have been proposed. Some of them are not suited to large data sets or are difficult to tune, and others require additional information. This work investigates the potential of a genetic algorithm to find, in an unsupervised way, the variables that best discriminate the classes, so as to select variables correlated with particular textual groups. The proposed genetic algorithm uses a combination of K-means clustering and validity indices as the fitness function for optimization. Such a function improves both compactness and class separation. Experiments on textual datasets demonstrate the effectiveness of the proposed variable selection method, which discriminates textual classes better than K-means clustering applied to all data variables.
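The abstract does not say which validity indices, genetic operators, or parameter values the authors use, so the following is only a minimal sketch of the general idea, not their method. It assumes a binary mask over the variables as the chromosome, a plain Lloyd's K-means, and a Calinski-Harabasz-style ratio (between-cluster separation over within-cluster compactness) as a stand-in fitness. All function names, the tournament selection, one-point crossover, bit-flip mutation, and default parameters are hypothetical choices made for illustration. It expects at least two variables and more rows than clusters.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(rows, k, seed=0, iters=15):
    """Plain Lloyd's K-means with a fixed seed; returns one label per row."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def validity_index(rows, labels, k):
    """Calinski-Harabasz-style ratio (assumed stand-in for the paper's indices)."""
    n = len(rows)
    overall = [sum(col) / n for col in zip(*rows)]
    within = between = 0.0
    for c in range(k):
        members = [r for r, l in zip(rows, labels) if l == c]
        if not members:
            return 0.0  # a partition with an empty cluster gets the worst score
        center = [sum(col) / len(members) for col in zip(*members)]
        within += sum(sq_dist(r, center) for r in members)      # compactness
        between += len(members) * sq_dist(center, overall)      # separation
    return (between / (k - 1)) / (within / (n - k) + 1e-12)

def fitness(mask, data, k):
    """Cluster on the selected variables only and score the partition."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # selecting no variable is never useful
    rows = [[row[j] for j in cols] for row in data]
    return validity_index(rows, kmeans_labels(rows, k), k)

def select_variables(data, k, pop_size=20, generations=15, p_mut=0.1, seed=1):
    """Genetic search over variable masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    n_vars = len(data[0])
    cache = {}

    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, data, k)
        return cache[key]

    # Seed the population with the all-variables mask as a baseline.
    pop = [[1] * n_vars]
    pop += [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        nxt = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(ranked, 3), key=score)  # tournament selection
            b = max(rng.sample(ranked, 3), key=score)
            cut = rng.randrange(1, n_vars)             # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=score)
    return best, score(best)
```

Calling `select_variables(data, k=2)` on a list of equal-length numeric rows returns a 0/1 mask with one entry per variable, plus the fitness of that mask. Because the all-variables mask is in the initial population and the best masks are carried over each generation, the returned fitness is never lower than the score of K-means on all variables under this particular index. That only mirrors the comparison described in the abstract and says nothing about results on real textual data.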
- Published
- 2022