How to optimally sample a sequence for rapid analysis.
- Source :
- Bioinformatics. Feb 2023, Vol. 39 Issue 2, p1-7. 7p.
- Publication Year :
- 2023
Abstract
- Motivation: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal.
- Results: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.
- Availability and implementation: Source code is freely available at https://gitlab.com/mcfrith/noverlap.
- Supplementary information: Supplementary data are available at Bioinformatics online.
- [ABSTRACT FROM AUTHOR]
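For readers unfamiliar with sequence sampling, the sketch below illustrates window minimizers, one of the previous methods the abstract names. It is not the approach presented in the paper. The function name `minimizer_positions`, the parameters `k` (word length) and `w` (window size), and the use of lexicographic order to rank words are all assumptions made for illustration; practical tools often rank words by a hash instead.

```python
def minimizer_positions(seq, k, w):
    """Return sorted start positions of the k-mers sampled as window minimizers.

    Illustrative sketch only: each window of w consecutive k-mers contributes
    its lexicographically smallest k-mer (leftmost on ties).
    """
    # All overlapping k-mers of the sequence, indexed by start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    # Slide a window over w consecutive k-mer start positions.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Rank by (k-mer, position) so ties go to the leftmost occurrence.
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add(best)
    return sorted(chosen)


print(minimizer_positions("ACGTACGT", k=3, w=3))  # [0, 1, 4]
```

In this example only three of the six 3-mer positions are kept, yet every window of three consecutive 3-mers still contains a sampled position, which is the kind of guarantee the abstract attributes to universal hitting sets.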
- Subjects :
- *SEQUENCE analysis
- *SOURCE code
- *PROTEIN-protein interactions
- *INTUITION
Details
- Language :
- English
- ISSN :
- 13674803
- Volume :
- 39
- Issue :
- 2
- Database :
- Academic Search Index
- Journal :
- Bioinformatics
- Publication Type :
- Academic Journal
- Accession number :
- 164277589
- Full Text :
- https://doi.org/10.1093/bioinformatics/btad057