Comparison of phasing strategies for whole human genomes

Authors :
Ewen F. Kirkness
Yongwook Choi
Agnes P. Chan
Nicholas J. Schork
Amalio Telenti
Source :
PLoS Genetics, Vol 14, Iss 4, p e1007308 (2018)
Publication Year :
2018
Publisher :
Public Library of Science (PLoS), 2018.

Abstract

Humans are a diploid species that inherit one set of chromosomes paternally and one homologous set of chromosomes maternally. Unfortunately, most human sequencing initiatives ignore this fact in that they do not directly delineate the nucleotide content of the maternal and paternal copies of the 23 chromosomes individuals possess (i.e., they do not ‘phase’ the genome) often because of the costs and complexities of doing so. We compared 11 different widely-used approaches to phasing human genomes using the publicly available ‘Genome-In-A-Bottle’ (GIAB) phased version of the NA12878 genome as a gold standard. The phasing strategies we compared included laboratory-based assays that prepare DNA in unique ways to facilitate phasing as well as purely computational approaches that seek to reconstruct phase information from general sequencing reads and constructs or population-level haplotype frequency information obtained through a reference panel of haplotypes. To assess the performance of the 11 approaches, we used metrics that included, among others, switch error rates, haplotype block lengths, the proportion of fully phase-resolved genes, phasing accuracy and yield between pairs of SNVs. Our comparisons suggest that a hybrid or combined approach that leverages: 1. population-based phasing using the SHAPEIT software suite, 2. either genome-wide sequencing read data or parental genotypes, and 3. a large reference panel of variant and haplotype frequencies, provides a fast and efficient way to produce highly accurate phase-resolved individual human genomes. We found that for population-based approaches, phasing performance is enhanced with the addition of genome-wide read data; e.g., whole genome shotgun and/or RNA sequencing reads. Further, we found that the inclusion of parental genotype data within a population-based phasing strategy can provide as much as a ten-fold reduction in phasing errors. We also considered a majority voting scheme for the construction of a consensus haplotype combining multiple predictions for enhanced performance and site coverage. Finally, we also identified DNA sequence signatures associated with the genomic regions harboring phasing switch errors, which included regions of low polymorphism or SNV density.

Author summary

Humans are a diploid species that inherit one set of chromosomes paternally and one set of chromosomes maternally. Separating the nucleotide content of the maternally and paternally-derived chromosomes for an individual, i.e., ‘phasing’ that individual’s genome, is not trivial with today’s sequencing technologies. This is in part due to the fact that most available sequencing technologies generate short sequencing reads that make it hard to assemble individual homologous chromosome pairs. Phase information can be crucial for putting into context the likely functional consequences of DNA sequence variants as well as certain evolutionary and population genetics phenomena. In order to assess the reliability of current sequencing-based phasing strategies, we compared 11 different approaches using a public domain reference genome as a test case. These phasing strategies included laboratory-based experimental techniques as well as purely computational approaches. Importantly, our comparisons show that a hybrid or combined approach that leverages population-based phasing via the SHAPEIT software suite works well and can be improved with the addition of genome-wide sequence read or parental genotype data.
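
As an illustration of two ideas named in the abstract, the sketch below shows one common way to count switch errors against a gold-standard haplotype and a simplified majority vote across several predictions. It is not the authors' implementation. The function names, the 0/1 encoding (the allele carried on haplotype 1 at each heterozygous site within a single phase block), and the whole-block orientation step are all assumptions made for this example.

```python
from collections import Counter


def switch_error_rate(truth, pred):
    """Count switch errors between a predicted and a gold-standard haplotype.

    Assumption: both inputs list the allele (0 or 1) on haplotype 1 at the
    same heterozygous sites, in genomic order, within one phase block.
    A switch error is counted whenever the agreement between prediction and
    truth changes from one site to the next.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must cover the same sites")
    agree = [t == p for t, p in zip(truth, pred)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    opportunities = len(agree) - 1
    rate = switches / opportunities if opportunities > 0 else 0.0
    return switches, rate


def orient(reference, hap):
    """Flip a haplotype (swap 0 and 1) if that agrees better with reference.

    Haplotype labels are arbitrary, so predictions must share an orientation
    before they can be compared site by site.
    """
    matches = sum(r == h for r, h in zip(reference, hap))
    if matches * 2 < len(hap):
        return [1 - h for h in hap]
    return list(hap)


def majority_consensus(predictions):
    """Build a consensus haplotype by per-site majority vote.

    Simplification (an assumption of this sketch): each prediction is
    oriented as a whole block against the first prediction, and every
    prediction covers every site.
    """
    reference = predictions[0]
    oriented = [orient(reference, p) for p in predictions]
    return [Counter(column).most_common(1)[0][0] for column in zip(*oriented)]
```

For example, `switch_error_rate([0, 1, 0, 1, 1, 0], [0, 1, 1, 0, 0, 0])` finds two switches across five adjacent site pairs, while a prediction that is the exact complement of the truth has none, since it describes the same pair of haplotypes with the labels swapped.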

Details

Language :
English
ISSN :
1553-7404 and 1553-7390
Volume :
14
Issue :
4
Database :
OpenAIRE
Journal :
PLoS Genetics
Accession number :
edsair.doi.dedup.....4a29f4997539217eeda800965b7a606a