
A critical comparison of feature selection algorithms for improved classification accuracy

Authors :
Snyman, W.
Van Vuuren, P.A.
10732926 - Van Vuuren, Pieter Andries (Supervisor)
Publication Year :
2019
Publisher :
North-West University (South Africa), 2019.

Abstract

MEng (Computer and Electronic Engineering), North-West University, Potchefstroom Campus. Feature selection is crucial for increasing the performance of predictive models in both classification accuracy and model training time. For high-dimensional data, feature selection becomes ever more necessary to select adequate features which complement the predictive model of choice. Filter, wrapper and embedded feature selection techniques are among the most popular algorithms for solving the feature selection conundrum of irrelevant and redundant features reducing the performance of classification models. It is also common to find hybrid techniques which combine filter, wrapper and embedded techniques to construct more robust feature selection algorithms. This study is dedicated to revealing the ongoing improvement in the field of feature selection and to dissecting six different feature selection algorithms for a detailed insight into their success on high-dimensional data, specifically gene expression microarrays. The six algorithms for this study are: (i) three filter methods: mRMR (min-Redundancy Max-Relevance), FCBF# (Fast Correlation Based Filter) and ORFS (Orthogonal Relevance Feature Selection); (ii) two wrapper methods: FRBPSO (Fuzzy Rule Based Particle Swarm Optimisation) and SVM-RFE (Support Vector Machine-Recursive Feature Elimination); and (iii) one embedded method: SBMLR (Sparse Multinomial Logistic Regression via Bayesian L1 regularisation). The three filter methods are adapted into suitable hybrid techniques, and multiple associative measures are explored to determine the best performance per algorithm. All algorithms include the pre-processing techniques MDL discretisation and SIS to explore their improvements and shortcomings. The performance of each algorithm is based on its ability to improve classification accuracy with the fewest features possible, and the algorithms are compared to one another.
After comparison, the algorithms best suited for classification improvement, computation speed advantage and feature removal capability are revealed. Thereafter, a case study involving plant foliage features, where the number of features greatly outnumbers the number of samples (denoted by p >> n), is used to complement the findings. The use of pre-processing techniques proved to be crucial for improved classification accuracy and reduced computation time. Out of all six algorithms, mRMR and SVM-RFE proved the most promising. Masters
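
As an illustration of the min-Redundancy Max-Relevance idea named in the abstract, the following is a minimal sketch and not the implementation used in the thesis. It assumes features that are already discrete (for example after a discretisation step), uses mutual information as the associative measure, and scores candidates with the common "relevance minus mean redundancy" form. The function names, the toy feature names and the data are all hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi


def mrmr(features, labels, k):
    """Greedy mRMR selection (difference form).

    features: dict mapping a feature name to a list of discrete values.
    labels:   list of class labels, same length as each feature column.
    k:        number of features to select.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            # Mean mutual information with the features chosen so far
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        # Sorting makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: four classes, eight samples.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "f_a": [0, 0, 0, 0, 1, 1, 1, 1],  # relevant
    "f_b": [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of f_a (redundant)
    "f_c": [0, 0, 1, 1, 0, 0, 1, 1],  # relevant and independent of f_a
    "f_d": [0, 1, 0, 1, 0, 1, 0, 1],  # irrelevant to the labels
}

print(mrmr(features, labels, 2))  # ['f_a', 'f_c']
```

In this toy example the duplicate feature f_b is as relevant as f_a but is passed over in the second step, because its redundancy with the already selected f_a cancels its relevance.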

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.od......1399..a7316f92521fe775dca8e3fde56d0168