Geminivirus data warehouse: a database enriched with machine learning approaches
- Source :
- BMC Bioinformatics, Vol 18, Iss 1, Pp 1-11 (2017); LOCUS Repositório Institucional da UFV, Universidade Federal de Viçosa (UFV)
- Publication Year :
- 2016
Abstract
- Background: The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine genera, according to their insect vector, host range, genome organization, and phylogeny reconstruction. Using rolling-circle amplification approaches along with high-throughput sequencing technologies, thousands of full-length geminivirus and satellite genome sequences were amplified and have become available in public databases. As a consequence, many important challenges have emerged, namely, how to classify, store, and analyze massive datasets as well as how to extract information or new knowledge. Data mining approaches, mainly supported by machine learning (ML) techniques, are a natural means for high-throughput data analysis in the context of genomics, transcriptomics, proteomics, and metabolomics.
Results: Here, we describe the development of a data warehouse enriched with ML approaches, designated geminivirus.org. We implemented search modules, bioinformatics tools, and ML methods to retrieve high-precision information, demarcate species, and create classifiers for genera and open reading frames (ORFs) of geminivirus genomes.
Conclusions: The use of data mining techniques such as ETL (Extract, Transform, Load) to feed our database, as well as algorithms based on machine learning for knowledge extraction, allowed us to obtain a database with quality data and suitable tools for bioinformatics analysis. The Geminivirus Data Warehouse (geminivirus.org) offers a simple and user-friendly environment for information retrieval and knowledge discovery related to geminiviruses.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1646-4) contains supplementary material, which is available to authorized users.
- Subjects :
- 0301 basic medicine
Computer science
Biochemistry
Genome
Transcriptome
Machine Learning
Knowledge discovery
Structural Biology
Genus
Databases, Genetic
lcsh:QH301-705.5
Phylogeny
Genomic organization
Database
biology
Applied Mathematics
Plants
Data warehouse
Computer Science Applications
Geminiviridae
lcsh:R858-859.7
Geminivirus
Algorithms
DNA, Single-Stranded
Genomics
Context (language use)
lcsh:Computer applications to medicine. Medical informatics
03 medical and health sciences
Open Reading Frames
Phylogenetics
Molecular Biology
Data mining
Random Forest
Host (biology)
Computational Biology
030104 developmental biology
chemistry
lcsh:Biology (General)
Vector (epidemiology)
DNA, Viral
Artificial intelligence
DNA
Details
- ISSN :
- 1471-2105
- Volume :
- 18
- Issue :
- 1
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics
- Accession number :
- edsair.doi.dedup.....63c23430ce1e5f022b63b298c245e758