Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.
- Source :
- PLoS ONE, Vol 16, Iss 10, p e0258693 (2021)
- Publication Year :
- 2021
- Publisher :
- Public Library of Science (PLoS), 2021.
Abstract
- Information-theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxa at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
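- Illustrative sketch (not taken from the article): the two quantities named in the abstract, sequence space coverage and pairwise k-mer Jaccard similarity, can be written down in a few lines of Python. The function names `kmers`, `jaccard`, and `coverage` are hypothetical, and the sketch assumes plain set-based counting over one strand, skipping any window that contains a non-ACGT character. The authors' actual pipeline may differ, for example by using canonical (reverse-complement-aware) k-mers or sketching for large genomes.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) in seq.

    Assumption for this sketch: windows containing characters other
    than A, C, G, T (e.g. N) are skipped, and only the given strand
    is considered.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    }


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def coverage(kmer_set, k):
    """Fraction of the 4**k possible k-mers that are present in the set."""
    return len(kmer_set) / 4 ** k


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    a = kmers("ACGTAC", 3)
    b = kmers("ACGTTT", 3)
    print(jaccard(a, b))
    print(coverage(a, 3))
```

  A matrix of such pairwise similarities (or the distances 1 - J) is the kind of input a hierarchical clustering routine would take. At k = 21 or 31 the space of 4**k possible words is far larger than any genome, so coverage is tiny and shared k-mers are unlikely to occur by chance.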
Details
- Language :
- English
- ISSN :
- 1932-6203
- Volume :
- 16
- Issue :
- 10
- Database :
- Directory of Open Access Journals
- Journal :
- PLoS ONE
- Publication Type :
- Academic Journal
- Accession number :
- edsdoj.245f2896b0d84253a7e346dc3bf9ac4d
- Document Type :
- article
- Full Text :
- https://doi.org/10.1371/journal.pone.0258693