Fangorn Forest (F2): a machine learning approach to classify genes and genera in the family Geminiviridae
- Source :
- BMC Bioinformatics, LOCUS Repositório Institucional da UFV, Universidade Federal de Viçosa (UFV), instacron:UFV, BMC Bioinformatics, Vol 18, Iss 1, Pp 1-14 (2017)
- Publication Year :
- 2017
- Publisher :
- Springer Science and Business Media LLC, 2017.
Abstract
- Background: Geminiviruses infect a broad range of cultivated and non-cultivated plants, causing significant economic losses worldwide. Studies of the species diversity, taxonomy, mechanisms of evolution, geographic distribution, and host interaction mechanisms of these pathogens have increased greatly in recent years. Furthermore, the use of rolling circle amplification (RCA) and advanced metagenomics approaches has enabled the elucidation of viromes and the identification of many viral agents in a large number of plant species. As a result, naming and taxonomically classifying geminiviruses have become complex tasks. In addition, the gene responsible for viral replication (particularly in viruses belonging to the genus Mastrevirus) may be spliced, because the virus uses the transcriptional/splicing machinery of the host cells. However, current tools have limitations in identifying introns.
Results: This study proposes a new method, designated Fangorn Forest (F2), based on machine learning (ML) approaches, to classify genera ab initio, i.e., using only the genomic sequence, and to predict and classify genes in the family Geminiviridae. Nine genera of the family Geminiviridae and their related satellite DNAs were selected. We obtained two training sets: one for genus classification, containing attributes extracted from the complete genomes of geminiviruses, and one for gene classification, containing attributes extracted from ORFs taken from those same complete genomes. Three ML algorithms were applied to these datasets to build the predictive models: support vector machines (trained with sequential minimal optimization), random forest (RF), and multilayer perceptron. RF demonstrated very high predictive power, achieving precision, recall, and area under the curve (AUC) of 0.966, 0.964, and 0.995, respectively, for genus classification. For gene classification, RF reached precision, recall, and AUC of 0.983, 0.983, and 0.998, respectively.
Conclusions: Fangorn Forest proved to be an efficient method for classifying genera of the family Geminiviridae with high precision, and for effective gene prediction and classification. The method is freely accessible at www.geminivirus.org:8080/geminivirusdw/discoveryGeminivirus.jsp.
Electronic supplementary material: The online version of this article (10.1186/s12859-017-1839-x) contains supplementary material, which is available to authorized users.
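The abstract says the gene-classification training set was built from attributes extracted from ORFs in complete genomes, but it does not say how the ORFs were found or which attributes were used. The Python sketch below is therefore only an illustration of the general idea, not the F2 implementation. The function names (`find_orfs`, `gc_content`, `reverse_complement`), the ATG-to-stop ORF definition, the `min_len` threshold, and the use of GC content as an attribute are all assumptions made for this example. It scans the sequence linearly on both strands, so it ignores both the circular nature of geminivirus genomes and the spliced genes the abstract mentions.

```python
# Illustrative sketch only; not the F2 implementation.
# Assumptions: ORF = ATG ... first in-frame stop codon; linear scan of both strands.

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_len=90):
    """Return (strand, frame, start, end, orf_seq) tuples.

    Coordinates are 0-based, end-exclusive, and relative to the scanned
    strand. min_len (hypothetical default) is the minimum ORF length in
    nucleotides, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_len:
                        orfs.append((strand, frame, start, end, s[start:end]))
                    start = None  # look for the next ATG in this frame
    return orfs


def gc_content(seq):
    """Fraction of G and C bases; one simple, assumed per-ORF attribute."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # toy sequence, not a real genome
    for strand, frame, start, end, orf in find_orfs(toy, min_len=15):
        print(strand, frame, start, end, orf, round(gc_content(orf), 3))
```

Each ORF's attribute values could then be collected into one row of a training table for a classifier such as a random forest.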
- Subjects :
- Geminivirus
machine learning
0301 basic medicine
Gene prediction
Genus classification
DNA, Satellite
Biology
lcsh:Computer applications to medicine. Medical informatics
Machine learning
Biochemistry
Genome
Machine Learning
Open Reading Frames
User-Computer Interface
03 medical and health sciences
Structural Biology
Genus
Multilayer perceptron
ORFS
lcsh:QH301-705.5
Molecular Biology
Internet
Random Forest
Support vector machines
Methodology Article
Applied Mathematics
Plants
Computer Science Applications
Random forest
Support vector machine
Geminiviridae
030104 developmental biology
lcsh:Biology (General)
ROC Curve
Metagenomics
Area Under Curve
lcsh:R858-859.7
Gene classification
Sequential minimal optimization
Artificial intelligence
Details
- ISSN :
- 1471-2105
- Volume :
- 18
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics
- Accession number :
- edsair.doi.dedup.....d9421e5ff7ccfc580dd02e91b0d7a3e3