RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly

Authors :
Kateryna D. Makova
Rayan Chikhi
Marta Tomaszkiewicz
Monika Cechova
Samarth Rangavittal
Paul Medvedev
Robert S. Harris
Pennsylvania State University (Penn State)
Penn State System
Department of Anaesthesia
St George's Hospital
Institut de Génomique Fonctionnelle de Lyon (IGFL)
École normale supérieure - Lyon (ENS Lyon)-Institut National de la Recherche Agronomique (INRA)-Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS)
Dept. of Computer Science and Engineering
Penn State System-Penn State System
University of Pennsylvania [Philadelphia]
École normale supérieure de Lyon (ENS de Lyon)-Institut National de la Recherche Agronomique (INRA)-Université Claude Bernard Lyon 1 (UCBL)
University of Pennsylvania
Source :
Bioinformatics, Oxford University Press (OUP), 2017, ⟨10.1093/bioinformatics/btx771⟩
Publication Year :
2017
Publisher :
HAL CCSD, 2017.

Abstract

Motivation: The haploid mammalian Y chromosome is usually under-represented in genome assemblies because of its high repeat content and the low sequencing depth that results from its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies.

Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternative strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.

Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY.

Supplementary information: Supplementary data are available at Bioinformatics online.
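
The k-mer abundance idea in the abstract can be illustrated with a small Python sketch. This is not RecoverY's implementation: the function names, the quantile-based threshold rule in `pick_threshold`, and the `min_fraction` read-level cutoff are all assumptions made for illustration, and reverse complements (canonical k-mers) are ignored for brevity. The sketch counts k-mers in an enriched read set, uses k-mers shared with known Y sequences to choose an abundance threshold, and keeps reads whose k-mers are mostly at or above that threshold.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def pick_threshold(counts, known_y_seqs, k, quantile=0.05):
    """Hypothetical heuristic (an assumption, not the published rule):
    take a low quantile of the abundances observed for k-mers that
    also occur in known Y sequences."""
    y_kmers = {km for seq in known_y_seqs for km in kmers(seq, k)}
    abund = sorted(counts[km] for km in y_kmers if km in counts)
    if not abund:
        raise ValueError("no known Y k-mers found in the reads")
    return abund[int(quantile * (len(abund) - 1))]


def select_y_reads(reads, counts, k, threshold, min_fraction=0.5):
    """Keep reads in which at least min_fraction of the k-mers
    reach the abundance threshold (min_fraction is an assumed cutoff)."""
    kept = []
    for read in reads:
        kms = list(kmers(read, k))
        if not kms:
            continue  # read shorter than k: nothing to score
        hits = sum(1 for km in kms if counts[km] >= threshold)
        if hits / len(kms) >= min_fraction:
            kept.append(read)
    return kept
```

With a toy read set in which one sequence is over-represented (standing in for enriched Y material), calling `count_kmers`, then `pick_threshold` with a short known Y sequence, then `select_y_reads` retains the over-represented reads and discards the low-abundance ones. Real data would call for a dedicated k-mer counter and canonical k-mers rather than an in-memory `Counter`.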

Details

Language :
English
ISSN :
1367-4803, 1367-4811, and 1460-2059
Database :
OpenAIRE
Journal :
Bioinformatics, Oxford University Press (OUP), 2017, ⟨10.1093/bioinformatics/btx771⟩
Accession number :
edsair.doi.dedup.....3ca23ef25644c4274bc099cd6912b032