Sparse Project VCF: efficient encoding of population genotype matrices.

Authors :
Lin MF
Bai X
Salerno WJ
Reid JG
Source :
Bioinformatics (Oxford, England) [Bioinformatics] 2021 Apr 01; Vol. 36 (22-23), pp. 5537-5538.
Publication Year :
2021

Abstract

Summary: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts.

Availability and Implementation: Apache-licensed reference implementation: github.com/mlin/spVCF.

Supplementary Information: Supplementary data are available at Bioinformatics online.

(© The Author(s) 2020. Published by Oxford University Press.)
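Illustrative note: the Summary names run-length encoding as one ingredient of the size reduction. The sketch below shows generic run-length encoding applied to a single toy row of genotype calls, where most samples are homozygous reference. It is an assumption-laden illustration only: the function names `rle_encode` and `rle_decode` and the sample row are hypothetical, and this is not the actual spVCF encoding, which is specified in the paper and the reference implementation.

```python
from itertools import groupby


def rle_encode(cells):
    """Collapse consecutive identical cells into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(cells)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list of cells."""
    cells = []
    for value, count in pairs:
        cells.extend([value] * count)
    return cells


# Hypothetical toy row: one rare heterozygous call among reference calls.
row = ["0/0"] * 6 + ["0/1"] + ["0/0"] * 5

encoded = rle_encode(row)
print(encoded)  # [('0/0', 6), ('0/1', 1), ('0/0', 5)]

# Plain run-length encoding is lossless: decoding restores the row exactly.
assert rle_decode(encoded) == row
```

In a cohort where most variants are rare, rows like this are dominated by long runs of identical cells, which is why run-based encodings can shrink them substantially.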

Details

Language :
English
ISSN :
1367-4811
Volume :
36
Issue :
22-23
Database :
MEDLINE
Journal :
Bioinformatics (Oxford, England)
Publication Type :
Academic Journal
Accession number :
33300997
Full Text :
https://doi.org/10.1093/bioinformatics/btaa1004