
Achieving high-quality ddRAD-like reference catalogs for non-model species: the power of overlapping paired-end reads

Authors :
Sibelle T. Vilaça
Susan Mbedi
Tomás Carrasco-Valenzuela
Camila J. Mazzoni
Benoit de Thoisy
Felix Heeger
Maximilian Driller
Damien Chevallier
Larissa Souza Arantes
Source :
bioRxiv beta
Publication Year :
2020
Publisher :
Cold Spring Harbor Laboratory, 2020.

Abstract

Reduced representation libraries (RRS) allow large-scale studies on non-model species to be performed without the need for a reference genome, by building a pseudo-reference locus catalog directly from the data. However, using closely related high-quality genomes can help maximize nucleotide variation identified from RRS libraries. While chromosome-level genomes remain unavailable for most species, researchers can still invest in building high-quality and project-specific de novo locus catalogs. Among methods that use restriction enzymes (RADSeq), those including fragment size selection to help obtain the desired number of loci - such as double-digest RAD (ddRAD) - are highly flexible but can present important technical issues. Inconsistent size selection reproducibility across libraries and variable coverage across fragment lengths can affect genotyping confidence, number of identified single nucleotide polymorphisms (SNPs), and quality and completeness of the de novo reference catalog. We have developed a strategy to optimize locus catalog building from ddRAD-like data by sequencing overlapping reads that recreate original fragments and add information about coverage per fragment size. Further in silico size selection and digestion steps limit the filtered dataset to well-covered sets of loci, and identity thresholds are estimated based on pairwise sequence comparisons. We have developed a full workflow that identifies a set of reduced-representation single-copy orthologs (R2SCOs) for any given species and that includes estimating and evaluating allelic variation in comparison with SNP calling results. We also show how to use our concept in an established RADSeq pipeline - Stacks - and confirm that our approach increases average coverage and number of SNPs called per locus in the final catalog.
We have demonstrated our full workflow using newly generated data from five sea turtle species and provided further proof-of-principle using published hybrid sea turtle and primate datasets. Finally, we showed that a project-specific set of R2SCOs performs better than a draft genome as a reference.
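
The in silico size selection and digestion steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the overlapping read pairs have already been merged into full-length fragments, and the function names, the recognition sites (EcoRI `GAATTC` and MspI `CCGG`, chosen only as examples), and the length window are all hypothetical.

```python
from collections import Counter


def filter_fragments(fragments, sites, min_len, max_len):
    """Keep merged fragments that pass in silico size selection and digestion.

    fragments: iterable of merged (full-length) fragment sequences
    sites:     restriction recognition sequences (illustrative assumption)
    min_len, max_len: inclusive length window (illustrative assumption)
    """
    kept = []
    for frag in fragments:
        # In silico size selection: drop fragments outside the window.
        if not (min_len <= len(frag) <= max_len):
            continue
        # In silico digestion: a complete recognition site in the interior
        # suggests the fragment escaped digestion, so it is dropped.
        if any(site in frag[1:-1] for site in sites):
            continue
        kept.append(frag)
    return kept


def length_histogram(fragments):
    """Count fragments per length, as a rough proxy for coverage per size."""
    return Counter(len(frag) for frag in fragments)


# Toy input: one clean fragment, one with an internal EcoRI site,
# and one that is too long for the chosen window.
example_fragments = [
    "AATTC" + "A" * 10 + "C",
    "AATTC" + "GAATTC" + "AAAAC",
    "AATTC" + "A" * 100,
]
example_kept = filter_fragments(
    example_fragments, sites=("GAATTC", "CCGG"), min_len=10, max_len=50
)
```

The length histogram of the retained fragments is one simple way to inspect how coverage is distributed across fragment sizes before a size window is fixed.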

Details

Database :
OpenAIRE
Journal :
bioRxiv beta
Accession number :
edsair.doi.dedup.....5e05b89b43de663cae447f6daa79d8d7