
The ability to classify patients based on gene-expression data varies by algorithm and performance metric.

Authors:
Piccolo, Stephen R.
Mecham, Avery
Golightly, Nathan P.
Johnson, Jérémie L.
Miller, Dustin B.
Source:
PLoS Computational Biology; 3/11/2022, Vol. 18 Issue 3, p1-34, 34p, 1 Diagram, 3 Charts, 5 Graphs
Publication Year:
2022

Abstract

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist—and most support diverse hyperparameters—so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross-validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
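The abstract describes combining univariate feature selection and hyperparameter optimization inside nested cross-validation. The sketch below is a minimal illustration of that general technique using scikit-learn; it is not the authors' benchmark code, and the synthetic dataset, parameter grid, and choice of a kernel-based classifier (SVC) are illustrative assumptions only.

```python
# Minimal sketch (NOT the authors' code): nested cross-validation with
# univariate feature selection and hyperparameter tuning in the inner loop.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a gene-expression matrix: many features, few samples.
X, y = make_classification(n_samples=120, n_features=500,
                           n_informative=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),  # univariate feature selection
    ("clf", SVC()),                      # kernel-based classifier
])
param_grid = {
    "select__k": [10, 50],
    "clf__C": [0.1, 1.0, 10.0],
}

# Inner loop tunes hyperparameters; outer loop estimates performance,
# so feature selection and tuning never see the outer test folds.
inner = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(round(scores.mean(), 3))
```

Fitting the selector and tuner only within each outer training fold avoids the optimistic bias that leaks in when feature selection is performed on the full dataset before cross-validation.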
Author summary: When a patient is treated in a medical setting, a clinician may extract a tissue sample and use transcriptome-profiling technologies to quantify the extent to which thousands of genes are expressed in the sample. These measurements reflect biological activity that may influence disease development, progression, and/or treatment responses. Patterns that differ between patients in distinct groups (for example, patients who do or do not have a disease or do or do not respond to a treatment) may be used to classify future patients into these groups. This study is a large-scale benchmark comparison of algorithms that can be used to perform such classifications. Additionally, we evaluated feature-selection algorithms, which can be used to identify which variables (genes and/or patient characteristics) are most relevant for classification. Through a series of analyses that build on each other, we show that classification performance varies considerably, depending on which algorithms are used, whether feature selection is used, which settings are used when executing the algorithms, and which metrics are used to evaluate the algorithms' performance. Researchers can use these findings as a resource for deciding which algorithms and settings to prioritize when deriving transcriptome-based biomarkers in future efforts. [ABSTRACT FROM AUTHOR]

Details

Language:
English
ISSN:
1553-734X
Volume:
18
Issue:
3
Database:
Complementary Index
Journal:
PLoS Computational Biology
Publication Type:
Academic Journal
Accession Number:
155690772
Full Text:
https://doi.org/10.1371/journal.pcbi.1009926