1. GABAC: an arithmetic coding solution for genomic data
- Author
Tom Paridaens, Jan Fostier, Liudmila Sergeevna Mainzer, Jan Voges, Mikel Hernaez, Fabian Müntefering, Brian Bliss, Idoia Ochoa, Jörn Ostermann, and Mingyu Yang
- Subjects
Statistics and Probability, Theoretical computer science, Technology and Engineering, Dewey Decimal Classification::500 | Naturwissenschaften::570 | Biowissenschaften, Biologie, Computer science, Genomic data, Binary number, Data_CODINGANDINFORMATIONTHEORY, MPEG-G compliant entropy codec, Biochemistry, 03 medical and health sciences, Dewey Decimal Classification::000 | Allgemeines, Wissenschaft::000 | Informatik, Wissen, Systeme::004 | Informatik, 0302 clinical medicine, International Organization for Standardization (ISO), GABAC, ddc:570, Entropy (information theory), Codec, Molecular Biology, 030304 developmental biology, coding, 0303 health sciences, Genome, High-Throughput Nucleotide Sequencing, Moving Picture Experts Group (MPEG), Genomics, Data Compression, Applications Notes, Computer Science Applications, Arithmetic coding, Computational Mathematics, Computational Theory and Mathematics, 030220 oncology & carcinogenesis, COMPRESSION, genomic data, ddc:004, Sequence Analysis, Software
- Abstract
Motivation: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data.
Results: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM.
Availability and implementation: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac.
Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2019