1. Merging of multi-string BWTs with applications
- Author
-
James Holt and Leonard McMillan
- Subjects
Statistics and Probability ,Source code ,Theoretical computer science ,Computer science ,media_common.quotation_subject ,Sequence alignment ,Biochemistry ,Longest common substring problem ,Mice ,Compression (functional analysis) ,Code (cryptography) ,Animals ,Point (geometry) ,Molecular Biology ,Throughput (business) ,media_common ,String (computer science) ,High-Throughput Nucleotide Sequencing ,Genomics ,Data Compression ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,Algorithm ,Sequence Alignment ,Hitseq Papers ,Algorithms ,Data compression - Abstract
Motivation : The throughput of genomic sequencing has increased to the point that is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows–Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k -mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis. Results : We present a novel algorithm that merges multi-string BWTs in O(LCS×N) time where LCS is the length of their longest common substring between any of the inputs, and N is the total length of all inputs combined (number of symbols) using O(N×log2(F)) bits where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared with the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications. Availability and implementation : The MSBWT package is available through PyPI with source code located at https://code.google.com/p/msbwt/ . Contact : holtjma@cs.unc.edu
- Published
- 2014