Comparing the Statistical Fate of Paralogous and Orthologous Sequences

Authors :
Michael Sheinman
Florian Massip
Peter F. Arndt
Sophie Schbath
Statistique en grande dimension pour la génomique
Département PEGASE [LBBE] (PEGASE)
Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE)
Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE)
Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)
Source :
Genetics, Genetics Society of America, 2016, 204 (2), pp.475-482. ⟨10.1534/genetics.116.193912⟩
Publication Year :
2016
Publisher :
HAL CCSD, 2016.

Abstract

For several decades, sequence alignment has been a widely used tool in bioinformatics. For instance, finding homologous sequences with a known function in large databases is used to gain insight into the function of nonannotated genomic regions. Very efficient tools like BLAST have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model, we can estimate the probability of finding two exactly matching subsequences by chance in two unrelated sequences. For two homologous sequences, the corresponding probability is much higher, which allows us to identify them. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with an exponent α=−5. Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically and computationally that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and noncoding parts of genomes, thus providing a better understanding of statistical properties of genomic sequences and their evolution.
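
To make the quantity studied in the abstract concrete, the following is a minimal sketch of how a histogram of exact-match lengths between two sequences could be tabulated. It is an illustration only, not the authors' pipeline: the function name `maximal_exact_match_lengths`, the `min_len` parameter, and the toy sequences are assumptions introduced here, and the quadratic-time dynamic programme is practical only for short sequences (genome-scale comparisons would need suffix-based match finders).

```python
from collections import Counter


def maximal_exact_match_lengths(a: str, b: str, min_len: int = 1) -> Counter:
    """Count maximal exact matches between a and b, keyed by match length.

    A match is maximal when it cannot be extended to the left or to the
    right without hitting a mismatch or the end of either sequence.
    Hypothetical helper for illustration; runs in O(len(a) * len(b)) time.
    """
    counts = Counter()
    # prev[j] = length of the longest common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match along the diagonal; this keeps it left-maximal.
                curr[j] = prev[j - 1] + 1
                at_end = i == len(a) or j == len(b)
                # Right-maximal if the next characters differ or a sequence ends.
                if (at_end or a[i] != b[j]) and curr[j] >= min_len:
                    counts[curr[j]] += 1
        prev = curr
    return counts


if __name__ == "__main__":
    # Toy sequences (assumed, not real data).
    histogram = maximal_exact_match_lengths("AAB", "AB")
    for length in sorted(histogram):
        print(length, histogram[length])
```

Plotting such counts against match length on log-log axes is one way a power-law tail like the one reported above would show up as an approximately straight line; the sketch itself makes no claim about the exponent.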

Details

Language :
English
ISSN :
0016-6731
Database :
OpenAIRE
Journal :
Genetics, Genetics Society of America, 2016, 204 (2), pp.475-482. ⟨10.1534/genetics.116.193912⟩
Accession number :
edsair.doi.dedup.....ab92ab213e5dafd1be3a3fbe7fac5d32