Back to Search Start Over

FCLQC: fast and concurrent lossless quality scores compressor

Authors :
Minhyeok Cho
Albert No
Source :
BMC Bioinformatics, Vol 22, Iss 1, Pp 1-14 (2021)
Publication Year :
2021
Publisher :
BMC, 2021.

Abstract

Abstract Background Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data increases explosively. Since FASTQ files (standard sequencing data formats) are huge, there is a need for efficient compression of FASTQ files, especially quality scores. Several quality scores compression algorithms are recently proposed, mainly focused on lossy compression to boost the compression rate further. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity, where it takes thousands of seconds to compress a 1 GB file. Also, there are desired features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality. Results This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time based on concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors on compression and decompression at the expense of compression ratio. Compared to LCQS (baseline quality score compression algorithm), FCLQC shows at least 31x compression speed improvement in all settings, where a performance degradation in compression ratio is up to 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC shows 3x faster compression speed while having better compression ratios, at least 2.08% (4.69% on average). Moreover, the speed of random access decompression also outperforms the others. The concurrency of FCLQC is implemented using Rust; the performance gain increases near-linearly with the number of threads. Conclusion The superiority of compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is freely available for non-commercial usage.

Details

Language :
English
ISSN :
14712105
Volume :
22
Issue :
1
Database :
Directory of Open Access Journals
Journal :
BMC Bioinformatics
Publication Type :
Academic Journal
Accession number :
edsdoj.1cb52953087142fb87001e105c3d2031
Document Type :
article
Full Text :
https://doi.org/10.1186/s12859-021-04516-7