A machine learning based framework to identify and classify long terminal repeat retrotransposons

Authors :
Claudia M. A. Carareto
Jan Ramon
Hendrik Blockeel
Celine Vens
Leander Schietgat
Eduardo P. Costa
Carlos Norberto Fischer
Ricardo Cerri
Catholic University of Leuven - Katholieke Universiteit Leuven (KU Leuven)
Machine Learning in Information Networks (MAGNET)
Inria Lille - Nord Europe
Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL)
Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)
Federal University of São Carlos (UFSCar)
Universidade Estadual Paulista Júlio de Mesquita Filho = São Paulo State University (UNESP)
Universidade de São Paulo = University of São Paulo (USP)
Declarative Languages and Artificial Intelligence (DTAI)
Université Catholique de Louvain = Catholic University of Louvain (UCL)
Institut National de Recherche en Informatique et en Automatique ( Inria ) -Ecole Centrale de Lille-Institut Mines-Télécom [Paris]-Université de Lille-Centre National de la Recherche Scientifique ( CNRS ) -Ecole Centrale de Lille-Institut Mines-Télécom [Paris]-Université de Lille-Centre National de la Recherche Scientifique ( CNRS )
Universidade de São Paulo (USP)
KU Leuven
KU Leuven Kulak
Ghent University and VIB Inflammation Research Center
Universidade Federal de São Carlos (UFSCar)
Universidade Estadual Paulista (Unesp)
INRIA Lille Nord Europe
Source :
PLoS Computational Biology, Public Library of Science, 2018, 14 (4), pp.1-21. ⟨10.1371/journal.pcbi.1006097⟩; PLoS Computational Biology, Vol 14, Iss 4, p e1006097 (2018); Scopus; Repositório Institucional da UNESP, Universidade Estadual Paulista (UNESP), instacron:UNESP
Publication Year :
2018
Publisher :
HAL CCSD, 2018.

Abstract

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a type of TE characterized by long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques; it outperforms them in terms of predictive performance while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-Learner’s predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.

Author summary

Over the years, as more biological data has been acquired, extracting knowledge from this data has become more important. Understanding how biology works is essential for improving the quality of products and services that use biological data. This directly influences companies and governments, which need to remain at the knowledge frontier of an increasingly competitive economy. Transposable elements (TEs) are an example of important biological data, and understanding their role in the genomes of organisms is important for developing products based on biological data. One example is the production of biofuels, such as those based on sugarcane. Many studies have revealed the presence of active TEs in this plant, which has gained economic importance in many countries. Understanding how TEs influence the plant should help researchers develop more resistant sugarcane varieties and thereby increase production. Thus, developing computational methods that help biologists correctly identify and classify TEs is important from both theoretical and practical perspectives.

Subjects

Subjects :
Decision Analysis
Genome, Insect
Arabidopsis
02 engineering and technology
computer.software_genre
Biochemistry
[STAT.ML]Statistics [stat]/Machine Learning [stat.ML]
Mobile Genetic Elements
lcsh:QH301-705.5
Drosophila Melanogaster
Eukaryota
Genomics
Plants
ARABIDOPSIS
Long terminal repeat
GENOME
Computational Theory and Mathematics
Modeling and Simulation
Engineering and Technology
020201 artificial intelligence & image processing
Management Engineering
Transposable element
Bioinformatics
Arabidopsis Thaliana
Sequence Databases
Brassica
CLASSIFICATION
Evolution, Molecular
03 medical and health sciences
Protein Domains
Plant and Algal Models
EUKARYOTIC TRANSPOSABLE ELEMENTS
Genetics
Molecular Biology
Genome size
Ecology, Evolution, Behavior and Systematics
LTR RETROTRANSPOSONS
IDENTIFICATION
SEQUENCES
Arabidopsis Proteins
Terminal Repeat Sequences
Organisms
Transposable Elements
Computational Biology
Biology and Life Sciences
Proteins
Invertebrates
030104 developmental biology
[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]
Software
0301 basic medicine
Computer science
Retrotransposon
Genome
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Machine Learning
Database and Informatics Methods
[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST]
Invertebrate Genomics
0202 electrical engineering, electronic engineering, information engineering
TOOL
Drosophila Proteins
Conserved Sequence
2. Zero hunger
Ecology
Animal Models
Insects
Identification (information)
Retrotransposons
Experimental Organism Systems
Drosophila
Sequence Analysis
INVERTED
Genome, Plant
[PHYS.PHYS.PHYS-DATA-AN]Physics [physics]/Physics [physics]/Data Analysis, Statistics and Probability [physics.data-an]
Research Article
Genome evolution
Computer and Information Sciences
DNA, Plant
Retroelements
Arthropoda
Machine learning
Research and Analysis Methods
RANDOM FORESTS
Cellular and Molecular Neuroscience
Annotation
Model Organisms
Genetic Elements
Artificial Intelligence
Animals
business.industry
Decision Trees
Biological Databases
lcsh:Biology (General)
Animal Genomics
REPEATS
Artificial intelligence
business
computer

Details

Language :
English
ISSN :
1553-734X and 1553-7358
Database :
OpenAIRE
Journal :
PLoS Computational Biology, Public Library of Science, 2018, 14 (4), pp.1-21. ⟨10.1371/journal.pcbi.1006097⟩; PLoS Computational Biology, Vol 14, Iss 4, p e1006097 (2018); Scopus; Repositório Institucional da UNESP, Universidade Estadual Paulista (UNESP), instacron:UNESP
Accession number :
edsair.doi.dedup.....5abd7a1ccee51677c7796ce965009f7c
Full Text :
https://doi.org/10.1371/journal.pcbi.1006097