1. Overlap detection on long, error-prone sequencing reads via smooth q-gram
- Author
Haoyu Zhang, Qin Zhang, Haixu Tang, and Yan Song
- Subjects
Statistics and Probability, 0303 health sciences, Source code, Computer science, Fragment (computer graphics), High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA, Biochemistry, Computer Science Applications, Nanopores, 03 medical and health sciences, Computational Mathematics, 0302 clinical medicine, Computational Theory and Mathematics, Minion, Nanopore sequencing, Molecular Biology, Algorithm, Algorithms, Software, 030217 neurology & neurosurgery, 030304 developmental biology, Gram
- Abstract
Motivation: Third-generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, generate long, error-prone sequencing reads, which pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, which is the first and most critical step in de novo fragment assembly. We observe that none of the state-of-the-art methods achieves ideal accuracy for overlap detection (both precision and recall are relatively low) because of the high sequencing error rates, especially when the overlaps between reads are relatively short (e.g. [...]).
Results: We propose smooth q-gram, a variant of q-gram that captures q-gram pairs within small edit distances, and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results demonstrate that our algorithm outperforms the existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlaps.
Availability and implementation: The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq.
Supplementary information: Supplementary data are available at Bioinformatics online.
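To illustrate why seeds that tolerate a small edit distance can help on error-prone reads, here is a minimal brute-force sketch in Python. It is not the authors' smooth q-gram algorithm or their C++ implementation. The function names (`qgrams`, `edit_distance`, `seed_pairs`), the toy reads, and the parameters `q` and `max_dist` are all assumptions made for illustration. It only shows that a single sequencing error can destroy every exact q-gram match between two reads, while a match criterion of "within a small edit distance" still recovers seed pairs.

```python
def qgrams(seq, q):
    """Return all (position, q-gram) pairs of seq."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]


def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def seed_pairs(read_a, read_b, q, max_dist):
    """Hypothetical helper: all q-gram position pairs (i, j) whose
    q-grams are within max_dist edits. Brute force, for illustration only."""
    return [(i, j)
            for i, ga in qgrams(read_a, q)
            for j, gb in qgrams(read_b, q)
            if edit_distance(ga, gb) <= max_dist]


# Toy reads (made up): read_b differs from read_a by one substitution.
read_a = "ACGTTGCA"
read_b = "ACGATGCA"

exact = seed_pairs(read_a, read_b, q=5, max_dist=0)     # exact q-gram matches
tolerant = seed_pairs(read_a, read_b, q=5, max_dist=1)  # within one edit
```

In this toy case every 5-gram covers the substituted position, so `exact` is empty, while `tolerant` contains the aligned position pairs. Comparing all q-gram pairs takes quadratic time, so a practical tool would need a much faster way to find such near-matching pairs than this sketch uses.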
- Published
- 2020