How to optimally sample a sequence for rapid analysis.

Authors :
Frith MC
Shaw J
Spouge JL
Source :
Bioinformatics (Oxford, England) [Bioinformatics] 2023 Feb 03; Vol. 39 (2).
Publication Year :
2023

Abstract

Motivation: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal.

Results: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.

Availability and Implementation: Source code is freely available at https://gitlab.com/mcfrith/noverlap.

Supplementary Information: Supplementary data are available at Bioinformatics online.

(© The Author(s) 2023. Published by Oxford University Press.)

Details

Language :
English
ISSN :
1367-4811
Volume :
39
Issue :
2
Database :
MEDLINE
Journal :
Bioinformatics (Oxford, England)
Publication Type :
Academic Journal
Accession number :
36702468
Full Text :
https://doi.org/10.1093/bioinformatics/btad057