Improving population scale statistical phasing with whole-genome sequencing data.

Authors :
Wertenbroek, Rick
Hofmeister, Robin J.
Xenarios, Ioannis
Thoma, Yann
Delaneau, Olivier
Source :
PLoS Genetics. 7/3/2024, Vol. 20 Issue 7, p1-22. 22p.
Publication Year :
2024

Abstract

Haplotype estimation, or phasing, has gained significant traction in large-scale projects due to its valuable contributions to population genetics, variant analysis, and the creation of reference panels for imputation and phasing of new samples. To scale with the growing number of samples, haplotype estimation methods designed for population scale rely on highly optimized statistical models to phase genotype data, and usually ignore read-level information. Statistical methods excel at resolving common variants; however, they still struggle with rare variants due to the lack of statistical information. In this study, we introduce SAPPHIRE, a new method that leverages whole-genome sequencing data to enhance the precision of haplotype calls produced by statistical phasing. SAPPHIRE achieves this by refining haplotype estimates through the realignment of sequencing reads, particularly targeting low-confidence phase calls. Our findings demonstrate that SAPPHIRE significantly enhances the accuracy of haplotypes obtained from state-of-the-art methods and also provides the subset of phase calls that are validated by sequencing reads. Finally, we show that our method scales to large data sets by its successful application to the extensive 3.6 petabytes of sequencing data of the last UK Biobank 200,031-sample release.

Author summary: Haplotype estimation, also known as phasing, is now applied to population-scale projects, typically of hundreds of thousands to millions of samples. Phasing generally relies on statistical methods, as they provide very accurate results for common variation. However, for rare and very rare variants, the lack of statistical power often results in poor phasing. The large number of rare variants discovered with whole-genome sequencing, together with the number of samples, makes the data expensive to process. We have developed the SAPPHIRE method, which leverages whole-genome sequencing data to verify and correct the phase at poorly phased variant loci. It does so by finding sequencing reads that contain both the poorly phased variant and an accurately phased common variant. SAPPHIRE scales to large data sets by specifically targeting variants where statistical phasing performed poorly; it therefore reduces the quantity of sequencing data to be processed and combines the advantages of both read-based and statistical approaches. We show the efficiency of SAPPHIRE by improving the estimated haplotypes for 200,031 samples in the UK Biobank. SAPPHIRE is free and available as open-source software. [ABSTRACT FROM AUTHOR]
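
The read-based check described in the author summary (using reads that span both a poorly phased variant and a confidently phased common variant) can be illustrated with a small sketch. This is not SAPPHIRE's implementation: the function names `vote` and `rephase`, the 0/1 allele encoding for biallelic heterozygous sites, and the simple majority rule are all assumptions made here for illustration only.

```python
# Illustrative sketch only; names, encoding and the majority rule are assumptions,
# not taken from the SAPPHIRE software.
# A phased heterozygous genotype is a tuple (allele_on_hap0, allele_on_hap1),
# e.g. (0, 1) or (1, 0). Each read is reduced to the pair of alleles it shows
# at the rare site and at the anchor (common, confidently phased) site.

def vote(rare_gt, anchor_gt, reads):
    """Count reads that agree or disagree with the statistical phase."""
    if rare_gt[0] == rare_gt[1] or anchor_gt[0] == anchor_gt[1]:
        raise ValueError("both sites must be heterozygous")
    support = contradict = 0
    for rare_allele, anchor_allele in reads:
        for hap in (0, 1):
            # The anchor allele tells us which haplotype the read comes from.
            if anchor_allele == anchor_gt[hap]:
                if rare_allele == rare_gt[hap]:
                    support += 1
                elif rare_allele == rare_gt[1 - hap]:
                    contradict += 1
    return support, contradict


def rephase(rare_gt, anchor_gt, reads):
    """Keep, flip, or leave unresolved the phase of the rare variant."""
    support, contradict = vote(rare_gt, anchor_gt, reads)
    if contradict > support:
        return (rare_gt[1], rare_gt[0]), "corrected"
    if support > contradict:
        return rare_gt, "validated"
    return rare_gt, "unresolved"


# Statistical phasing put the rare alternate allele on haplotype 1,
# but every spanning read shows it together with the haplotype-0 anchor allele.
result = rephase((0, 1), (1, 0), [(1, 1), (1, 1), (0, 0)])
```

In this toy input all three reads contradict the statistical phase, so the rare variant's alleles are swapped between haplotypes; with no spanning reads the call is left as it was and reported as unresolved.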

Details

Language :
English
ISSN :
15537390
Volume :
20
Issue :
7
Database :
Academic Search Index
Journal :
PLoS Genetics
Publication Type :
Academic Journal
Accession number :
178235262
Full Text :
https://doi.org/10.1371/journal.pgen.1011092