A comprehensive simulation study on classification of RNA-Seq data
- Source :
- PLoS ONE, Vol 12, Iss 8, p e0182507 (2017)
- Publication Year :
- 2017
- Publisher :
- Public Library of Science (PLoS), 2017.
Abstract
- RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging and powerful approach to diagnosis, disease classification and monitoring at the molecular level, and to identifying potential markers of disease. Most of the statistical methods proposed for the classification of gene-expression data either assume a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data, since doing so would violate both the data-structure and distributional assumptions. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performance. A comprehensive simulation study was conducted and the results were compared with those from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and differential-expression rate, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion.
We conclude that, as a count-based classifier, the power-transformed PLDA and, as microarray-based classifiers, vst- or rlog-transformed RF and SVM may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
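To make the idea of a count-based classifier concrete, the following is a minimal Python sketch of a Poisson linear discriminant rule of the kind the abstract refers to. It is an illustration under stated assumptions, not the authors' implementation and not the MLSeq API: the function names `fit_plda` and `predict_plda`, the smoothing constant `beta`, and the use of total-count size factors are all hypothetical choices made here, and the power transformation and shrinkage steps are omitted.

```python
import numpy as np

def fit_plda(X, y, beta=1.0):
    """Fit a simplified Poisson discriminant model (illustrative only).

    X: (samples x genes) matrix of raw counts; y: class labels.
    beta: assumed smoothing constant that keeps class effects positive.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    total = X.sum()
    s = X.sum(axis=1) / total      # per-sample size factors (total-count)
    g = X.sum(axis=0)              # per-gene totals
    N = np.outer(s, g)             # expected counts if classes did not differ
    # d[k, j]: ratio of observed to expected counts for gene j in class k
    d = np.vstack([
        (X[y == k].sum(axis=0) + beta) / (N[y == k].sum(axis=0) + beta)
        for k in classes
    ])
    priors = np.array([(y == k).mean() for k in classes])
    return {"classes": classes, "g": g, "d": d, "priors": priors, "total": total}

def predict_plda(model, X_new):
    """Assign each new sample to the class with the highest Poisson log-score."""
    X_new = np.asarray(X_new, dtype=float)
    s_new = X_new.sum(axis=1) / model["total"]   # size factors for new samples
    log_d = np.log(model["d"])
    # score[i, k] = sum_j x_ij * log d_kj - s_i * sum_j g_j * d_kj + log prior_k
    scores = (
        X_new @ log_d.T
        - np.outer(s_new, model["d"] @ model["g"])
        + np.log(model["priors"])
    )
    return model["classes"][scores.argmax(axis=1)]
```

The score is linear in the counts, which is why the rule is described as a linear discriminant. Because a Poisson model fixes the variance to equal the mean, strongly overdispersed data would call for a transformation or a negative binomial variant, which this sketch does not include.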
- Subjects :
- 0301 basic medicine
Support Vector Machine
Molecular biology
Microarrays
Computer science
lcsh:Medicine
Linear Discriminant Analysis
Bioinformatics
Biochemistry
Trees
Machine Learning
Sequencing techniques
Mathematical and Statistical Techniques
0302 clinical medicine
lcsh:Science
Multidisciplinary
Applied Mathematics
Simulation and Modeling
RNA sequencing
Plants
Random forest
Nucleic acids
Bioassays and Physiological Analysis
030220 oncology & carcinogenesis
Physical Sciences
Algorithms
Statistics (Mathematics)
Research Article
Computer and Information Sciences
03 medical and health sciences
Overdispersion
Artificial Intelligence
Support Vector Machines
Genetics
Humans
Statistical Methods
Non-coding RNA
Biology and life sciences
Sequence Analysis, RNA
business.industry
lcsh:R
Organisms
Pattern recognition
Linear discriminant analysis
Gene regulation
Research and analysis methods
Support vector machine
MicroRNAs
Statistical classification
Molecular biology techniques
030104 developmental biology
Sample size determination
RNA
lcsh:Q
Gene expression
Artificial intelligence
business
Mathematics
Details
- Language :
- English
- ISSN :
- 1932-6203
- Volume :
- 12
- Issue :
- 8
- Database :
- OpenAIRE
- Journal :
- PLoS ONE
- Accession number :
- edsair.doi.dedup.....f7a557b40de249d7a6e26f850e3bf0f5