1. Finding long tandem repeats in long noisy reads
- Author
- Kazuki Ichikawa, Eugene W. Myers, and Shinichi Morishita
- Subjects
- Statistics and Probability; AcademicSubjects/SCI01060; Computer science; Word error rate; Biochemistry; 03 medical and health sciences; 0302 clinical medicine; Tandem repeat; Humans; Molecular Biology; 030304 developmental biology; Repeat unit; 0303 health sciences; Genome, Human; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Genome Analysis; Original Papers; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; Microsatellite; Human genome; Algorithm; Algorithms; 030217 neurology & neurosurgery; Microsatellite Repeats
- Abstract
Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, although these long reads have high error rates of 10–20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (

Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method that first estimates regions that could contain a tandem repeat by analyzing the k-mer frequency distributions of fixed-size windows across the target read, and then assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder, a widely used program for finding tandem repeats, in terms of sensitivity.

Availability and implementation: https://github.com/morisUtokyo/mTR.
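
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the mTR implementation: the function names, the repeat score (the fraction of k-mer occurrences whose k-mer appears more than once in a window), the non-overlapping window layout, and the thresholds are all assumptions made for illustration, and k is assumed to be at least 2.

```python
from collections import Counter


def kmers(seq, k):
    """Return every k-mer of seq in order (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def repeat_score(window, k):
    """Fraction of k-mer occurrences whose k-mer appears more than once.

    This is an assumed, simplified stand-in for analyzing the k-mer
    frequency distribution of a window.
    """
    counts = Counter(kmers(window, k))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


def candidate_windows(read, window_size, k, min_score):
    """Start offsets of non-overlapping fixed-size windows that look repetitive."""
    return [
        start
        for start in range(0, len(read) - window_size + 1, window_size)
        if repeat_score(read[start:start + window_size], k) >= min_score
    ]


def greedy_unit(region, k):
    """Greedily walk a de Bruijn graph built from the k-mers of region.

    Nodes are (k-1)-mers and edges are k-mers weighted by their counts.
    The walk starts on the most frequent k-mer and always follows the
    heaviest unused outgoing edge, stopping when it returns to the start
    node or runs out of edges. The result is a rotation of the unit.
    """
    counts = Counter(kmers(region, k))
    if not counts:
        return ""
    # sorted() makes tie-breaking between equally frequent k-mers deterministic.
    start_kmer = max(sorted(counts), key=counts.get)
    start_node = start_kmer[:-1]
    node = start_kmer[1:]
    unit = start_kmer[-1]
    used = {start_kmer}
    while node != start_node:
        options = [
            (counts[node + base], base)
            for base in "ACGT"
            if node + base in counts and node + base not in used
        ]
        if not options:
            break
        _, base = max(options)
        used.add(node + base)
        unit += base
        node = node[1:] + base
    return unit


if __name__ == "__main__":
    read = "ACGTTGCA" + "ACAC" * 4
    print(candidate_windows(read, window_size=8, k=3, min_score=0.5))
    print(greedy_unit("ACGT" * 10, k=3))
```

On an error-free toy repeat the walk closes a cycle whose length equals the unit length. On noisy reads, the intent is that k-mers corrupted by errors are rare, so the heaviest edges tend to be the error-free ones and the greedy walk tends to follow them.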
- Published
- 2020