1. Computing the multi-string BWT and LCP array in external memory
- Author
-
Bonizzoni, Paola, Della Vedova, Gianluca, Pirola, Yuri, Previtali, Marco, Rizzi, Raffaella, Bonizzoni, P, Della Vedova, G, Pirola, Y, Previtali, M, and Rizzi, R
- Subjects
General Computer Science ,Computer science ,Generalization ,String (computer science) ,Search engine indexing ,LCP array ,INF/01 - INFORMATICA ,Data_CODINGANDINFORMATIONTHEORY ,0102 computer and information sciences ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Theoretical Computer Science ,External-memory algorithm ,Set (abstract data type) ,010201 computation theory & mathematics ,0202 electrical engineering, electronic engineering, information engineering ,sort ,020201 artificial intelligence & image processing ,Out-of-core algorithm ,Burrows-Wheeler transform, Longest common prefix array, External-memory algorithm ,Longest common prefix array ,Burrows–Wheeler transform ,Algorithm ,Auxiliary memory - Abstract
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows–Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O ( m k l ) time and I/O volume, using O ( k + m ) main memory, where l is the maximum value in the LCP array.
- Published
- 2020
- Full Text
- View/download PDF