kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison
- Source :
- Bioinformatics, Oxford University Press (OUP), 2014, 30 (14), pp.2000-2008. ⟨10.1093/bioinformatics/btu331⟩
- Publication Year :
- 2014
- Publisher :
- HAL CCSD, 2014.
Abstract
- Motivation: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach, which defines a distance measure on sequences based on the average length of the longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. Results: To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood. Availability and implementation: kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de. Supplementary information: Supplementary data are available at Bioinformatics online.
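The abstract only names the greedy heuristic, so the following is a minimal Python sketch of one plausible reading of it, not the kmacs implementation. It assumes that, for each position in the first sequence, the longest exact match in the second sequence is located first and then extended without gaps until more than k mismatches would be needed. The function names, the brute-force search (used here in place of the generalized enhanced suffix arrays mentioned in the abstract), and the tie-breaking rule (first best match wins) are all illustrative assumptions.

```python
def longest_exact_match(s1, i, s2):
    """Return (length, j) of the longest exact match between s1[i:] and some s2[j:].

    Brute-force stand-in for a suffix-array lookup; ties go to the smallest j.
    """
    best_len, best_j = 0, -1
    for j in range(len(s2)):
        length = 0
        while (i + length < len(s1) and j + length < len(s2)
               and s1[i + length] == s2[j + length]):
            length += 1
        if length > best_len:
            best_len, best_j = length, j
    return best_len, best_j


def extend_with_mismatches(s1, i, s2, j, k):
    """Length of the gap-free match starting at s1[i], s2[j] with at most k mismatches."""
    length, mismatches = 0, 0
    while i + length < len(s1) and j + length < len(s2):
        if s1[i + length] != s2[j + length]:
            if mismatches == k:
                break  # the next mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def average_k_mismatch_length(s1, s2, k):
    """Average, over positions of s1, of the greedily extended k-mismatch match length."""
    if not s1:
        return 0.0
    total = 0
    for i in range(len(s1)):
        _, j = longest_exact_match(s1, i, s2)
        if j >= 0:  # positions with no exact match contribute 0 (an assumption)
            total += extend_with_mismatches(s1, i, s2, j, k)
    return total / len(s1)
```

With k = 0 the sketch reduces to the average length of longest exact common substrings; how that average is turned into a distance is not stated in the abstract and is left out here.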
- Subjects :
- Statistics and Probability
Primates
Source code
[SDV]Life Sciences [q-bio]
Biology
Biochemistry
Sequence Analysis, Protein
Animals
Greedy algorithm
Molecular Biology
Alignment-free sequence analysis
Phylogeny
Multiple sequence alignment
Sequence Analysis, DNA
Roseobacter
Original Papers
Substring
Computer Science Applications
Computational Mathematics
Computational Theory and Mathematics
k-mer
Genome, Mitochondrial
Suffix
Algorithm
Sequence Analysis
Sequence Alignment
Word (computer architecture)
Algorithms
Genome, Bacterial
Details
- Language :
- English
- ISSN :
- 1367-4803 and 1367-4811
- Database :
- OpenAIRE
- Journal :
- Bioinformatics, Oxford University Press (OUP), 2014, 30 (14), pp.2000-2008. ⟨10.1093/bioinformatics/btu331⟩
- Accession number :
- edsair.doi.dedup.....f2f4f494e6456d74a9dc6ddbe5a49e48