1. Towards complete and error-free genome assemblies of all vertebrate species
- Author
-
Richard Hall, Tandy Warnow, Tanya M. Lama, Oliver A. Ryder, David Haussler, Matthew T. Biegler, Klaus-Peter Koepfli, Ivo Gut, Paul Flicek, Mark Chaisson, James Torrance, Guojie Zhang, Andrew J. Crawford, Federica Di Palma, Michael Hiller, Jennifer A. Marshall Graves, Sadye Paez, Sarah E. London, Mark Wilkinson, Kateryna D. Makova, Byung June Ko, Jimin George, Farooq O. Al-Ajli, Emma C. Teeling, George F. Turner, Robert H. S. Kraus, Sonja C. Vernes, Zev N. Kronenberg, Michelle Smith, Jonas Korlach, Daryl Eason, Jonathan Wood, Simona Secomandi, Claudio V. Mello, Arkarachai Fungtammasan, Arang Rhie, Tomas Marques-Bonet, Benedict Paten, Ekaterina Osipova, Richard Durbin, M. Thomas P. Gilbert, Beth Shapiro, Ivan Sović, Bruce C. Robertson, Richard E. Green, Eugene W. Myers, Leanne Haggerty, Sergey Koren, Martin Pippel, Bettina Haase, Patrick Masterson, Jay Ghurye, Maria Simbirsky, Samantha R. Friedrich, Chul Hee Lee, Luis R Nassar, Lindsey J. Cantin, Kerstin Howe, Erich D. Jarvis, Marlys L. Houck, Jason T. Howard, Jacquelyn Mountcastle, Mark Mooney, Paolo Franchini, Giulio Formenti, Siddarth Selvaraj, Robel E. Dagnew, Brett T. Hannigan, Brian P. Walenz, Alan Tracey, Heebal Kim, Constantina Theofanopoulou, Nicholas H. Putnam, Karen Clark, Iliana Bista, H. William Detrich, Dengfeng Guan, David Iorns, Andrew Digby, Trevor Pesout, Zemin Ning, Gregory Gedman, Woori Kwak, Maximilian Wagner, Joanna Collins, Harris A. Lewin, Hannes Svardal, Milan Malinsky, Byrappa Venkatesh, Françoise Thibaud-Nissen, Joana Damas, Andreas F. Kautt, Olivier Fedrigo, Christopher Dunn, William Chow, Warren E. Johnson, Yang Zhou, Adam M. Phillippy, Taylor Edwards, Paul Medvedev, Peter V. Lovell, Joyce V. Lee, Sylke Winkler, Stephen J. O'Brien, Wesley C. Warren, Alex Hastie, Marcela Uliano-Silva, Kevin L. Howe, Sarah B. Kingan, Fergal J. Martin, Christopher N. Balakrishnan, David F. Clayton, Ying Sims, Robert W. Murphy, Axel Meyer, Dave W Burt, Shane A. 
McCarthy, Sarah Pelan, Erik Garrison, Mark Diekhans, Frank Grützner, Gavin J. P. Naylor, Robert S. Harris, Hiram Clawson, Jinna Hoffman, Ann C Misuraca, J. H. Kim, University of St Andrews. School of Biology, University of St Andrews. St Andrews Bioinformatics Unit, Rhie, Arang [0000-0002-9809-8127], Fedrigo, Olivier [0000-0002-6450-7551], Formenti, Giulio [0000-0002-7554-5991], Koren, Sergey [0000-0002-1472-8962], Uliano-Silva, Marcela [0000-0001-6723-4715], Thibaud-Nissen, Francoise [0000-0003-4957-7807], Mountcastle, Jacquelyn [0000-0003-1078-4905], Winkler, Sylke [0000-0002-0915-3316], Vernes, Sonja C. [0000-0003-0305-4584], Grutzner, Frank [0000-0002-3088-7314], Balakrishnan, Christopher N. [0000-0002-0788-0659], Burt, Dave [0000-0002-9991-1028], George, Julia M. [0000-0001-6194-6914], Digby, Andrew [0000-0002-1870-8811], Robertson, Bruce [0000-0002-5348-2731], Edwards, Taylor [0000-0002-7235-6175], Meyer, Axel [0000-0002-0888-8193], Kautt, Andreas F. [0000-0001-7792-0735], Franchini, Paolo [0000-0002-8184-1463], Detrich, H. William, III [0000-0002-0783-4505], Pippel, Martin [0000-0002-8134-5929], Malinsky, Milan [0000-0002-1462-6317], Kingan, Sarah B. [0000-0002-4900-0189], Hall, Richard [0000-0001-6490-8227], Dunn, Christopher [0000-0002-0601-3254], Lee, Joyce [0000-0002-3492-1102], Putnam, Nicholas H. [0000-0002-1315-782X], Gut, Ivo [0000-0001-7219-632X], Tracey, Alan [0000-0002-4805-9058], Guan, Dengfeng [0000-0002-6376-3940], London, Sarah E. [0000-0002-7839-2644], Clayton, David F. [0000-0002-6395-3488], Mello, Claudio V. [0000-0002-9826-8421], Friedrich, Samantha R. [0000-0003-0570-6080], Osipova, Ekaterina [0000-0002-6769-7223], Al-Ajli, Farooq O. [0000-0002-4692-7106], Secomandi, Simona [0000-0001-8597-6034], Kim, Heebal [0000-0003-3064-1303], Theofanopoulou, Constantina [0000-0003-2014-7563], Zhou, Yang [0000-0003-1247-5049], Martin, Fergal [0000-0002-1672-050X], Flicek, Paul [0000-0002-3897-7955], Walenz, Brian P. 
[0000-0001-8431-1428], Diekhans, Mark [0000-0002-0430-0989], Paten, Benedict [0000-0001-8863-3539], Crawford, Andrew J. [0000-0003-3153-6898], Gilbert, M. Thomas P. [0000-0002-5805-7195], Zhang, Guojie [0000-0001-6860-1521], Venkatesh, Byrappa [0000-0003-3620-0277], Shapiro, Beth [0000-0002-2733-7776], Johnson, Warren E. [0000-0002-5954-186X], Marques-Bonet, Tomas [0000-0002-5597-3075], Teeling, Emma C. [0000-0002-3309-1346], Ryder, Oliver A. [0000-0003-2427-763X], Haussler, David [0000-0003-1533-4575], Korlach, Jonas [0000-0003-3047-4250], Lewin, Harris A. [0000-0002-1043-7287], Howe, Kerstin [0000-0003-2237-513X], Myers, Eugene W. [0000-0002-6580-7839], Durbin, Richard [0000-0002-9130-1006], Phillippy, Adam M. [0000-0003-2983-8934], Jarvis, Erich D. [0000-0001-8931-5049], Apollo - University of Cambridge Repository, National Institutes of Health (US), National Human Genome Research Institute (US), Ministry of Health and Welfare (South Korea), Wellcome Trust, European Molecular Biology Laboratory, Howard Hughes Medical Institute, Rockefeller University, Robert and Rosabel Osborne Endowment, European Commission, National Library of Medicine (US), Korea Institute of Marine Science & Technology, Ministry of Oceans and Fisheries (South Korea), Alfred P. 
Sloan Foundation, Max Planck Society, Maine Department of Inland Fisheries & Wildlife, National Science Foundation (US), University of Queensland, Science Exchange, Northeastern University (US), Federal Ministry of Education and Research (Germany), EMBO, National Key Research and Development Program (China), Qatar Society of Al-Gannas (Algannas), Katara Cultural Village, Government of Qatar, Monash University Malaysia, Hessen State Ministry of Higher Education, Research and the Arts, Ministry of Science, Research and Art Baden-Württemberg, Agency for Science, Technology and Research A*STAR (Singapore), European Research Council, Ministerio de Ciencia, Innovación y Universidades (España), Fundación 'la Caixa', Generalitat de Catalunya, Irish Research Council, Danish National Research Foundation, and Australian Research Council
- Subjects
QH301 Biology; Genome; 0302 clinical medicine; Genome Size; Uncategorized; 64; 0303 health sciences; Sex Chromosomes; Multidisciplinary; High-Throughput Nucleotide Sequencing; Genomics; Mitochondrial; Vertebrates; Identification (biology); Engineering sciences. Technology; Sequence Analysis; Neuroinformatics; 45/23; QH426 Genetics; Biology; Article; Evolutionary genetics; 38; Birds; QH301; 03 medical and health sciences; Molecular evolution; ddc:570; Genome assembly algorithms; Animals; 631/181/735; 14. Life underwater; Genomes; QH426; Gene; Gene Library; Genome, Mitochondrial; Haplotypes; Molecular Sequence Annotation; Sequence Alignment; Sequence Analysis, DNA; 030304 developmental biology; 45/91; 631/61/212/2302; 45; Human evolutionary genetics; Haplotype; DAS; DNA; Research data; 706/648/697; 631/181/2474; Evolutionary biology; Genetics; 030217 neurology & neurosurgery; Reference genome
- Abstract
High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1,2,3,4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.
We thank them for their permission to publish. A.R., S.K., B.P.W. and A.M.P. were supported by the Intramural Research Program of the NHGRI, NIH (1ZIAHG200398). A.R. was also supported by the Korea Health Technology R&D Project through KHIDI, funded by the Ministry of Health & Welfare, Republic of Korea (HI17C2098). S.A.M., I.B. and R.D. were supported by Wellcome Trust grant WT207492; W.C., M. Smith, Z.N., Y.S., J.C., S. Pelan, J.T., A.T., J.W.
and Kerstin Howe by WT206194; L.H., F.M., Kevin Howe and P. Flicek by WT108749/Z/15/Z, WT218328/B/19/Z and the European Molecular Biology Laboratory. O.F. and E.D.J. were supported by Howard Hughes Medical Institute and Rockefeller University start-up funds for this project. J.D. and H.A.L. were supported by the Robert and Rosabel Osborne Endowment. M.U.-S. received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement (750747). F.T.-N., J. Hoffman, P. Masterson and K.C. were supported by the Intramural Research Program of the NLM, NIH. C.L., B.J.K., J. Kim and H.K. were supported by the Marine Biotechnology Program of KIMST, funded by the Ministry of Ocean and Fisheries, Republic of Korea (20180430). M.C. was supported by Sloan Research Fellowship (FG-2020-12932). S.C.V. was funded by a Max Planck Research Group award from the Max Planck Society, and a Human Frontiers Science Program (HFSP) Research grant (RGP0058/2016). T.M.L., W.E.J. and the Canada lynx genome were funded by the Maine Department of Inland Fisheries & Wildlife (F11AF01099), including when W.E.J. held a National Research Council Research Associateship Award at the Walter Reed Army Institute of Research (WRAIR). C.B. was supported by the NSF (1457541 and 1456612). D.B. was funded by The University of Queensland (HFSP - RGP0030/2015). D.I. was supported by Science Exchange Inc. (Palo Alto, CA). H.W.D. was supported by NSF grants (OPP-0132032 ICEFISH 2004 Cruise, PLR-1444167 and OPP-1955368) and the Marine Science Center at Northeastern University (416). G.J.P.N. and the thorny skate genome were funded by Lenfest Ocean Program (30884). M.P. was funded by the German Federal Ministry of Education and Research (01IS18026C). M. Malinsky was supported by an EMBO fellowship (ALTF 456-2016). The following authors’ contributions were supported by the NIH: S. Selvaraj (R44HG008118); C.V.M., S.R.F., P.V.L. 
(R21 DC014432/DC/NIDCD); K.D.M. (R01GM130691); H.C. (5U41HG002371-19); M.D. (U41HG007234); and B.P. (R01HG010485). D.G. was supported by the National Key Research and Development Program of China (2017YFC1201201, 2018YFC0910504 and 2017YFC0907503). F.O.A. was supported by Al-Gannas Qatari Society and The Cultural Village Foundation-Katara, Doha, State of Qatar and Monash University Malaysia. C.T. was supported by The Rockefeller University. M. Hiller was supported by the LOEWE-Centre for Translational Biodiversity Genomics (TBG) funded by the Hessen State Ministry of Higher Education, Research and the Arts (HMWK). H.C. was supported by the NHGRI (5U41HG002371-19). R.H.S.K. was funded by the Max Planck Society with computational resources at the bwUniCluster and BinAC funded by the Ministry of Science, Research and the Arts Baden-Württemberg and the Universities of the State of Baden-Württemberg, Germany (bwHPC-C5). B.V. was supported by the Biomedical Research Council of A*STAR, Singapore. T.M.-B. was funded by the European Research Council under the European Union’s Horizon 2020 research and innovation programme (864203), MINECO/FEDER, UE (BFU2017-86471-P), Unidad de Excelencia María de Maeztu, AEI (CEX2018-000792-M), a Howard Hughes International Early Career award, Obra Social “La Caixa” and Secretaria d’Universitats i Recerca and CERCA Programme del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2017 SGR 880). E.C.T. was supported by the European Research Council (ERC-2012-StG311000) and an Irish Research Council Laureate Award. M.T.P.G. was supported by an ERC Consolidator Award 681396-Extinction Genomics, and a Danish National Research Foundation Center Grant (DNRF143). T.W. was supported by the NSF (1458652). J. M. Graves was supported by the Australian Research Council (CEO561477). E.W.M. was partially supported by the German Federal Ministry of Education and Research (01IS18026C). 
Complementary sequencing support for the Anna’s hummingbird and several genomes was provided by Pacific Biosciences, Bionano Genomics, Dovetail Genomics, Arima Genomics, Phase Genomics, 10X Genomics, NRGene, Oxford Nanopore Technologies, Illumina, and DNAnexus. All other sequencing and assembly were conducted at the Rockefeller University, Sanger Institute, and Max Planck Institute Dresden genome labs. Part of this work used the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov). We acknowledge funding from the Wellcome Trust (108749/Z/15/Z) and the European Molecular Biology Laboratory. With funding from the Spanish government through the "Severo Ochoa Centre of Excellence" accreditation (CEX2018-000792-M).
- Published
- 2021