Measuring Phylogenetic Information of Incomplete Sequence Data
- Source :
- Systematic Biology, Oxford University Press (OUP), 2021, pp.syab073. ⟨10.1093/sysbio/syab073⟩
- Publication Year :
- 2021
- Publisher :
- HAL CCSD, 2021.
Abstract
- Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.]
- Subjects :
- 0106 biological sciences
[SDV]Life Sciences [q-bio]
Sequence alignment
Biology
010603 evolutionary biology
01 natural sciences
insertion
Set (abstract data type)
Evolution, Molecular
03 medical and health sciences
Tree (descriptive set theory)
INDEL Mutation
Genetics
deletion
[MATH]Mathematics [math]
Fisher information
Ecology, Evolution, Behavior and Systematics
Phylogeny
030304 developmental biology
0303 health sciences
Sequence
Models, Statistical
Phylogenetic tree
Models, Genetic
gaps
Probabilistic logic
Pattern recognition
Data set
model adequacy
indel
sequence alignment
goodness-of-fit test
Artificial intelligence
Regular Articles
Details
- Language :
- English
- ISSN :
- 1063-5157 and 1076-836X
- Database :
- OpenAIRE
- Journal :
- Systematic Biology, Oxford University Press (OUP), 2021, pp.syab073. ⟨10.1093/sysbio/syab073⟩
- Accession number :
- edsair.doi.dedup.....6cc7a20c24a7277d35bd945551fe1ef7