Disk compression of k-mer sets.
- Source :
- Algorithms for Molecular Biology. 6/21/2021, Vol. 16 Issue 1, p1-14. 14p.
- Publication Year :
- 2021
Abstract
- K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do. [ABSTRACT FROM AUTHOR]
- Subjects :
- *HARD disks
- *DE Bruijn graph
- *ALGORITHMS
- *DATABASE searching
- *DATA compression
Details
- Language :
- English
- ISSN :
- 1748-7188
- Volume :
- 16
- Issue :
- 1
- Database :
- Academic Search Index
- Journal :
- Algorithms for Molecular Biology
- Publication Type :
- Academic Journal
- Accession number :
- 151001551
- Full Text :
- https://doi.org/10.1186/s13015-021-00192-7