1. Reference-based data compression for genome in cloud
- Author
- Yongqing Zhu, Juniarto Samsudin, and Haixiang Shi
- Subjects
FASTQ format, Computer science, business.industry, Cloud computing, Data_CODINGANDINFORMATIONTHEORY, computer.software_genre, Genome, Compression (functional analysis), Compression ratio, Benchmark (computing), Data mining, business, Cloud storage, computer, Data compression
- Abstract
In this paper, we propose a new reference-based data compression method for efficiently compressing genome sequencing data in FASTQ format. With the advance of next-generation sequencing technology, genome data can be generated faster and more cheaply, which makes storing these data efficiently in cloud computing environments a challenge. To store this kind of genome data efficiently in the cloud, content-aware compression methods have to be developed that exploit the specific file structure. Compared with existing genome-specific compression methods, our proposed content-aware method focuses on achieving a high compression ratio by taking advantage of the repetitive nature of DNA sequences and by using reference genomes to compress the sequences inside the FASTQ files. Benchmark results on 8 datasets show that our method achieves the highest compression ratio among the FASTQ file compressors compared.
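The general idea of reference-based compression mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not detail. It assumes a brute-force alignment and hypothetical helper names (`encode_read`, `decode_read`), and it handles only the sequence line of a FASTQ record, not identifiers or quality scores. Each read is stored as an offset into the reference plus the list of bases that differ.

```python
def encode_read(read, reference):
    """Encode a read as (position, length, mismatches) against a reference.

    Illustrative brute-force search: try every offset and keep the one
    with the fewest mismatching bases (the first such offset on ties).
    """
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        # Record (index within read, read base) wherever the read differs.
        mm = [(i, b) for i, b in enumerate(read) if reference[pos + i] != b]
        if best_mm is None or len(mm) < len(best_mm):
            best_pos, best_mm = pos, mm
    return best_pos, len(read), best_mm


def decode_read(pos, length, mismatches, reference):
    """Rebuild the read from the reference slice plus the stored mismatches."""
    bases = list(reference[pos:pos + length])
    for i, b in mismatches:
        bases[i] = b
    return "".join(bases)


reference = "ACGTACGTTAGC"   # toy reference, made up for illustration
read = "CGTAAGC"             # differs from reference[5:12] in one base
encoded = encode_read(read, reference)
print(encoded)                               # (5, 7, [(3, 'A')])
print(decode_read(*encoded, reference))      # CGTAAGC
```

When reads closely match the reference, the mismatch list is short, so the encoded form is much smaller than the raw sequence. A practical compressor would use an indexed aligner rather than an exhaustive scan and would entropy-code the resulting positions and mismatches.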
- Published
- 2016