The revival of the Gini importance?
- Source :
- Bioinformatics, 34(21): 3711-3718
- Publication Year :
- 2018
- Publisher :
- Oxford University Press (OUP), 2018.
- Abstract :
- Motivation: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms are variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency.
- Results: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient.
- Availability and implementation: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger.
- Supplementary information: Supplementary data are available at Bioinformatics online.
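The bias described in the abstract (impurity-based measures favoring variables with many possible split points) can be illustrated with a small simulation. The sketch below is not the debiasing procedure from the paper and does not use ranger; it is a standard-library Python illustration under simple assumptions: a binary outcome that is independent of the predictor, and a single threshold split chosen to maximize the Gini impurity decrease. The function names (`gini`, `best_gini_decrease`, `mean_null_decrease`) are hypothetical and introduced only for this example.

```python
import random


def gini(pos, n):
    """Gini impurity of a binary node with `pos` positives among `n` samples."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(x, y):
    """Largest Gini impurity decrease over all threshold splits of the form x <= t."""
    n = len(y)
    total_pos = sum(y)
    parent = gini(total_pos, n)
    pairs = sorted(zip(x, y))
    best = 0.0
    left_n = 0
    left_pos = 0
    for i in range(n - 1):
        left_n += 1
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold separates two equal predictor values
        right_n = n - left_n
        right_pos = total_pos - left_pos
        child = (left_n * gini(left_pos, left_n)
                 + right_n * gini(right_pos, right_n)) / n
        best = max(best, parent - child)
    return best


def mean_null_decrease(n_levels, n_samples=200, n_reps=300, seed=0):
    """Average best impurity decrease when the predictor is pure noise.

    The predictor takes `n_levels` distinct values and is drawn independently
    of the binary outcome, so any impurity decrease is spurious.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        x = [rng.randrange(n_levels) for _ in range(n_samples)]
        y = [rng.randrange(2) for _ in range(n_samples)]
        total += best_gini_decrease(x, y)
    return total / n_reps


if __name__ == "__main__":
    # More levels means more candidate split points and a larger spurious gain.
    for k in (2, 4, 10, 20):
        print(k, round(mean_null_decrease(k), 5))
```

Under these assumptions, the average spurious impurity decrease grows with the number of levels even though no predictor carries signal, which is the kind of bias a corrected importance measure is meant to remove.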
- Subjects :
- 0301 basic medicine
Statistics and Probability
Clustering high-dimensional data
Computer science
Feature selection
Correlation and dependence
01 natural sciences
Biochemistry
Measure (mathematics)
Machine Learning
Set (abstract data type)
Reduction (complexity)
010104 statistics & probability
03 medical and health sciences
Gene Frequency
0101 mathematics
Molecular Biology
Genome
Original Papers
Computer Science Applications
Random forest
Computational Mathematics
Variable (computer science)
030104 developmental biology
Computational Theory and Mathematics
Key (cryptography)
Data mining
Data and Text Mining
Algorithms
Software
Genome-Wide Association Study
Details
- ISSN :
- 1367-4811 and 1367-4803
- Volume :
- 34
- Database :
- OpenAIRE
- Journal :
- Bioinformatics
- Accession number :
- edsair.doi.dedup.....9bdcbce8a5ebbc48b789dd71e08c2ec3