1. A crowdsourced set of curated structural variants for the human genome
- Author
Eric T. Dawson [0000-0001-5448-1653], Chunlin Xiao, Noah Alexander, Sree Rohit Raj Kolora [0000-0001-7839-735X], Lesley M. Chapman [0000-0001-7413-4392], Aaron M. Wenger [0000-0003-1183-0432], Christopher M. Watson [0000-0003-2371-1844], Giuseppe Narzisi [0000-0003-1118-8849], Justin M. Zook, Daniel Blankenberg, Christian Brueffer [0000-0002-3826-0989], Graeme C. Smith [0000-0002-7413-4998], Marc L. Salit [0000-0003-1624-5195], Azza E. Ahmed [0000-0002-1358-8371], Paul Wolujewicz [0000-0003-2982-9448], Saadlee Shehreen [0000-0002-4869-0747], Naoki Nariai, Patrick Pai [0000-0001-5304-788X], Christos Proukakis [0000-0001-6423-6539], Andrew Carroll [0000-0002-4824-6689], Garan Jones [0000-0002-8917-3930], Wayne E. Clarke [0000-0003-2471-0712], Noah Spies [0000-0002-6759-9842], Chun Shen Lim [0000-0001-7015-0125]; Apollo - University of Cambridge Repository
- Subjects
0301 basic medicine; Heredity; Computer science; Genome; Database and Informatics Methods; 0302 clinical medicine; INDEL Mutation; Heuristics; Genome Sequencing; Biology (General); Ecology; Genomics; Genetic Mapping; Tandem Repeats; Computational Theory and Mathematics; Modeling and Simulation; Sequence Analysis; Research Article; Bioinformatics; QH301-705.5; Concordance; Variant Genotypes; Computational biology; Research and Analysis Methods; Genome Complexity; DNA sequencing; Set (abstract data type); 03 medical and health sciences; Cellular and Molecular Neuroscience; Genetics; Humans; Repeated Sequences; Molecular Biology Techniques; Sequencing Techniques; Indel; Molecular Biology; Alleles; Ecology, Evolution, Behavior and Systematics; Genome, Human; Biology and Life Sciences; Computational Biology; Genome Analysis; 030104 developmental biology; Haplotypes; Genetic Loci; Genomic Structural Variation; Human genome; Sequence Alignment; 030217 neurology & neurosurgery; Reference genome
- Abstract
A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, developing a reliable benchmark for large indels and structural variants (SVs) is more challenging. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. ‘Expert’ curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of ‘expert’ curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high-confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
Author summary: Large genomic changes, called structural variants, can cause a variety of human diseases, but have been challenging to detect with conventional DNA sequencing methods. We are working in the Genome in a Bottle Consortium to develop authoritatively characterized genomes with benchmark structural variants that can be used by anyone to assess the accuracy of their sequencing and analysis methods.
Manual curation of the sequencing reads from multiple technologies has been essential to establish benchmark variant calls. Here, we present consensus curations from a web-based platform that displays a comprehensive set of visualizations of sequencing read support for structural variants. We use the svviz visualization tool to present evidence not only for deletions but also for insertions, which previously could not be curated. We derive consensus calls from the multiple curations of each variant, and we find that these are highly concordant with a draft Genome in a Bottle structural variant benchmark set.
- Published
- 2020