291 results on '"Birol, I"'
Search Results
152. Fusion-Bloom: fusion detection in assembled transcriptomes.
- Author
-
Chiu R, Nip KM, and Birol I
- Subjects
- Genomics, RNA, Sequence Analysis, RNA, Software, Transcriptome
- Abstract
Summary: Presence or absence of gene fusions is one of the most important diagnostic markers in many cancer types. Consequently, fusion detection methods using various genomics data types, such as RNA sequencing (RNA-seq) are valuable tools for research and clinical applications. While information-rich RNA-seq data have proven to be instrumental in discovery of a number of hallmark fusion events, bioinformatics tools to detect fusions still have room for improvement. Here, we present Fusion-Bloom, a fusion detection method that leverages recent developments in de novo transcriptome assembly and assembly-based structural variant calling technologies (RNA-Bloom and PAVFinder, respectively). We benchmarked Fusion-Bloom against the performance of five other state-of-the-art fusion detection tools using multiple datasets. Overall, we observed Fusion-Bloom to display a good balance between detection sensitivity and specificity. We expect the tool to find applications in translational research and clinical genomics pipelines., Availability and Implementation: Fusion-Bloom is implemented as a UNIX Make utility, available at https://github.com/bcgsc/pavfinder and released under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2019. Published by Oxford University Press.)
- Published
- 2020
153. ORCA: a comprehensive bioinformatics container environment for education and research.
- Author
-
Jackman SD, Mozgacheva T, Chen S, O'Huiginn B, Bailey L, Birol I, and Jones SJM
- Subjects
- Genome, Computational Biology, Software
- Abstract
Summary: The ORCA bioinformatics environment is a Docker image that contains hundreds of bioinformatics tools and their dependencies. The ORCA image and accompanying server infrastructure provide a comprehensive bioinformatics environment for education and research. The ORCA environment on a server is implemented using Docker containers, but without requiring users to interact directly with Docker, suitable for novices who may not yet have familiarity with managing containers. ORCA has been used successfully to provide a private bioinformatics environment to external collaborators at a large genome institute, for teaching an undergraduate class on bioinformatics targeted at biologists, and to provide a ready-to-go bioinformatics suite for a hackathon. Using ORCA eliminates time that would be spent debugging software installation issues, so that time may be better spent on education and research., Availability and Implementation: The ORCA Docker image is available at https://hub.docker.com/r/bcgsc/orca/. The source code of ORCA is available at https://github.com/bcgsc/orca under the MIT license., (© The Author(s) 2019. Published by Oxford University Press.)
- Published
- 2019
154. ntEdit: scalable genome sequence polishing.
- Author
-
Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, and Birol I
- Subjects
- Animals, Genome, Human, Haploidy, Humans, Sequence Analysis, DNA, Software, Genomics, High-Throughput Nucleotide Sequencing
- Abstract
Motivation: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes., Results: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E.coli and C.elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30-40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024., Availability and Implementation: https://github.com/bcgsc/ntedit., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2019. Published by Oxford University Press.)
- Published
- 2019
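To illustrate the Bloom filter idea named in the ntEdit abstract above, the following minimal Python sketch stores read k-mers in a Bloom filter and flags draft positions whose k-mers are not supported by the reads. It is not ntEdit's implementation: the class, the function names, the SHA-256 double hashing, and all parameter values are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (illustrative only; names and sizes are hypothetical)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unsupported_positions(draft, bloom, k):
    """Start positions of draft k-mers that are absent from the read k-mer filter."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bloom]


if __name__ == "__main__":
    k = 4
    bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
    for kmer in kmers("ACGTACGTGGTTCA", k):  # toy "read"
        bloom.add(kmer)
    # Toy draft with one substitution at index 5.
    print(unsupported_positions("ACGTAAGTGGTTCA", bloom, k))
```

A run of unsupported k-mers points at a candidate base error. Because a Bloom filter can return false positives, some erroneous k-mers may go unflagged, which is the usual trade-off for its small memory footprint.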
155. Replicated Landscape Genomics Identifies Evidence of Local Adaptation to Urbanization in Wood Frogs.
- Author
-
Homola JJ, Loftin CS, Cammen KM, Helbing CC, Birol I, Schultz TF, and Kinnison MT
- Subjects
- Animals, Biological Evolution, Genetic Variation, Genetics, Population, Maine, Quantitative Trait Loci, Selection, Genetic, Adaptation, Physiological, Genome, Genomics methods, Ranidae genetics, Urbanization
- Abstract
Native species that persist in urban environments may benefit from local adaptation to novel selection factors. We used double-digest restriction-site associated DNA (RAD) sequencing to evaluate shifts in genome-wide genetic diversity and investigate the presence of parallel evolution associated with urban-specific selection factors in wood frogs (Lithobates sylvaticus). Our replicated paired study design involved 12 individuals from each of 4 rural and urban populations to improve our confidence that detected signals of selection are indeed associated with urbanization. Genetic diversity measures were less for urban populations; however, the effect size was small, suggesting little biological consequence. Using an FST outlier approach, we identified 37 of 8344 genotyped single nucleotide polymorphisms with consistent evidence of directional selection across replicates. A genome-wide association study analysis detected modest support for an association between environment type and 12 of the 37 FST outlier loci. Discriminant analysis of principal components using the 37 FST outlier loci produced correct reassignment for 87.5% of rural samples and 93.8% of urban samples. Eighteen of the 37 FST outlier loci mapped to the American bullfrog (Rana [Lithobates] catesbeiana) genome, although none were in coding regions. This evidence of parallel evolution to urban environments provides a powerful example of the ability of urban landscapes to direct evolutionary processes. (© The American Genetic Association 2019.)
- Published
- 2019
156. The Genome of the Steller Sea Lion (Eumetopias jubatus).
- Author
-
Kwan HH, Culibrk L, Taylor GA, Leelakumari S, Tan R, Jackman SD, Tse K, MacLeod T, Cheng D, Chuah E, Kirk H, Pandoh P, Carlsen R, Zhao Y, Mungall AJ, Moore R, Birol I, Marra MA, Rosen DAS, Haulena M, and Jones SJM
- Subjects
- Animals, Genomic Library, Microfluidics methods, Nanopores, Whole Genome Sequencing, Genome, Sea Lions genetics
- Abstract
The Steller sea lion is the largest member of the Otariidae family and is found in the coastal waters of the northern Pacific Rim. Here, we present the Steller sea lion genome, determined through DNA sequencing approaches that utilized microfluidic partitioning library construction, as well as nanopore technologies. These methods constructed a highly contiguous assembly with a scaffold N50 length of over 14 megabases, a contig N50 length of over 242 kilobases and a total length of 2.404 gigabases. As a measure of completeness, 95.1% of 4104 highly conserved mammalian genes were found to be complete within the assembly. Further annotation identified 19,668 protein coding genes. The assembled genome sequence and underlying sequence data can be found at the National Center for Biotechnology Information (NCBI) under the BioProject accession number PRJNA475770.
- Published
- 2019
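The abstract above reports scaffold and contig N50 lengths. As a reminder of how that statistic is commonly defined, here is a minimal Python sketch: N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]  # hypothetical contig lengths
    print(n50(toy_lengths))
```

A higher N50 means that more of the assembly sits in long sequences, which is why it is used as a contiguity measure.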
157. Complete Chloroplast Genome Sequence of an Engelmann Spruce (Picea engelmannii, Genotype Se404-851) from Western Canada.
- Author
-
Lin D, Coombe L, Jackman SD, Gagalova KK, Warren RL, Hammond SA, McDonald H, Kirk H, Pandoh P, Zhao Y, Moore RA, Mungall AJ, Ritland C, Doerksen T, Jaquish B, Bousquet J, Jones SJM, Bohlmann J, and Birol I
- Abstract
Engelmann spruce (Picea engelmannii) is a conifer found primarily on the west coast of North America. Here, we present the complete chloroplast genome sequence of Picea engelmannii genotype Se404-851. This chloroplast sequence will benefit future conifer genomic research and contribute resources to further species conservation efforts. (Copyright © 2019 Lin et al.)
- Published
- 2019
158. Complete Chloroplast Genome Sequence of a White Spruce (Picea glauca, Genotype WS77111) from Eastern Canada.
- Author
-
Lin D, Coombe L, Jackman SD, Gagalova KK, Warren RL, Hammond SA, Kirk H, Pandoh P, Zhao Y, Moore RA, Mungall AJ, Ritland C, Jaquish B, Isabel N, Bousquet J, Jones SJM, Bohlmann J, and Birol I
- Abstract
Here, we present the complete chloroplast genome sequence of white spruce (Picea glauca, genotype WS77111), a coniferous tree widespread in the boreal forests of North America. This sequence contributes to genomic and phylogenetic analyses of the Picea genus that are part of ongoing research to understand their adaptation to environmental stress. (Copyright © 2019 Lin et al.)
- Published
- 2019
159. Base excision repair deficiency signatures implicate germline and somatic MUTYH aberrations in pancreatic ductal adenocarcinoma and breast cancer oncogenesis.
- Author
-
Thibodeau ML, Zhao EY, Reisle C, Ch'ng C, Wong HL, Shen Y, Jones MR, Lim HJ, Young S, Cremin C, Pleasance E, Zhang W, Holt R, Eirew P, Karasinska J, Kalloger SE, Taylor G, Majounie E, Bonakdar M, Zong Z, Bleile D, Chiu R, Birol I, Gelmon K, Lohrisch C, Mungall KL, Mungall AJ, Moore R, Ma YP, Fok A, Yip S, Karsan A, Huntsman D, Schaeffer DF, Laskin J, Marra MA, Renouf DJ, Jones SJM, and Schrader KA
- Subjects
- Age of Onset, DNA Glycosylases deficiency, Female, Germ-Line Mutation, Humans, Loss of Heterozygosity, Middle Aged, Proto-Oncogene Proteins p21(ras) genetics, Breast Neoplasms genetics, Carcinoma, Pancreatic Ductal genetics, DNA Glycosylases genetics, Mutation, Pancreatic Neoplasms genetics
- Abstract
We report a case of early-onset pancreatic ductal adenocarcinoma in a patient harboring biallelic MUTYH germline mutations, whose tumor featured somatic mutational signatures consistent with defective MUTYH-mediated base excision repair and the associated driver KRAS transversion mutation p.Gly12Cys. Analysis of an additional 730 advanced cancer cases (N = 731) was undertaken to determine whether the mutational signatures were also present in tumors from germline MUTYH heterozygote carriers or if instead the signatures were only seen in those with biallelic loss of function. We identified two patients with breast cancer each carrying a pathogenic germline MUTYH variant with a somatic MUTYH copy loss leading to the germline variant being homozygous in the tumor and demonstrating the same somatic signatures. Our results suggest that monoallelic inactivation of MUTYH is not sufficient for C:G>A:T transversion signatures previously linked to MUTYH deficiency to arise (N = 9), but that biallelic complete loss of MUTYH function can cause such signatures to arise even in tumors not classically seen in MUTYH-associated polyposis (N = 3). Although defective MUTYH is not the only determinant of these signatures, MUTYH germline variants may be present in a subset of patients with tumors demonstrating elevated somatic signatures possibly suggestive of MUTYH deficiency (e.g., COSMIC Signature 18, SigProfiler SBS18/SBS36, SignatureAnalyzer SBS18/SBS36). (© 2019 Thibodeau et al.; Published by Cold Spring Harbor Laboratory Press.)
- Published
- 2019
160. Antimicrobial peptides from Rana [Lithobates] catesbeiana: Gene structure and bioinformatic identification of novel forms from tadpoles.
- Author
-
Helbing CC, Hammond SA, Jackman SH, Houston S, Warren RL, Cameron CE, and Birol I
- Subjects
- Amino Acid Sequence, Animals, Anti-Infective Agents chemistry, Anti-Infective Agents metabolism, Antimicrobial Cationic Peptides metabolism, Genome, Ranidae, Sequence Homology, Amino Acid, Antimicrobial Cationic Peptides chemistry, Antimicrobial Cationic Peptides genetics, Computational Biology methods, Gene Expression Regulation, Developmental, Larva physiology, Metamorphosis, Biological genetics, Transcriptome
- Abstract
Antimicrobial peptides (AMPs) exhibit broad-spectrum antimicrobial activity, and have promise as new therapeutic agents. While the adult North American bullfrog (Rana [Lithobates] catesbeiana) is a prolific source of high-potency AMPs, the aquatic tadpole represents a relatively untapped source for new AMP discovery. The recent publication of the bullfrog genome and transcriptomic resources provides an opportune bridge between known AMPs and bioinformatics-based AMP discovery. The objective of the present study was to identify novel AMPs with therapeutic potential using a combined bioinformatics and wet lab-based approach. In the present study, we identified seven novel AMP precursor-encoding transcripts expressed in the tadpole. Comparison of their amino acid sequences with known AMPs revealed evidence of mature peptide sequence conservation with variation in the prepro sequence. Two mature peptide sequences were unique and demonstrated bacteriostatic and bactericidal activity against Mycobacteria but not Gram-negative or Gram-positive bacteria. Nine known and seven novel AMP-encoding transcripts were detected in premetamorphic tadpole back skin, olfactory epithelium, liver, and/or tail fin. Treatment of tadpoles with 10 nM 3,5,3'-triiodothyronine for 48 h did not affect transcript abundance in the back skin, and had limited impact on these transcripts in the other three tissues. Gene mapping revealed considerable diversity in size (1.6-15 kbp) and exon number (one to four) of AMP-encoding genes with clear evidence of alternative splicing leading to both prepro and mature amino acid sequence diversity. These findings verify the accuracy and utility of the bullfrog genome assembly, and set a firm foundation for bioinformatics-based AMP discovery.
- Published
- 2019
161. The Genome of the North American Brown Bear or Grizzly: Ursus arctos ssp. horribilis.
- Author
-
Taylor GA, Kirk H, Coombe L, Jackman SD, Chu J, Tse K, Cheng D, Chuah E, Pandoh P, Carlsen R, Zhao Y, Mungall AJ, Moore R, Birol I, Franke M, Marra MA, Dutton C, and Jones SJM
- Abstract
The grizzly bear (Ursus arctos ssp. horribilis) represents the largest population of brown bears in North America. Its genome was sequenced using a microfluidic partitioning library construction technique, and these data were supplemented with sequencing from a nanopore-based long read platform. The final assembly was 2.33 Gb with a scaffold N50 of 36.7 Mb, and the genome is of comparable size to that of its close relative the polar bear (2.30 Gb). An analysis using 4104 highly conserved mammalian genes indicated that 96.1% were found to be complete within the assembly. An automated annotation of the genome identified 19,848 protein coding genes. Our study shows that the combination of the two sequencing modalities that we used is sufficient for the construction of highly contiguous reference quality mammalian genomes. The assembled genome sequence and the supporting raw sequence reads are available from the NCBI (National Center for Biotechnology Information) under the bioproject identifier PRJNA493656, and the assembly described in this paper is version QXTK01000000. Competing Interests: The authors declare no conflicts of interest.
- Published
- 2018
162. Tigmint: correcting assembly errors using linked reads from large molecules.
- Author
-
Jackman SD, Coombe L, Chu J, Warren RL, Vandervalk BP, Yeo S, Xue Z, Mohamadi H, Bohlmann J, Jones SJM, and Birol I
- Subjects
- Chromosomes, Human genetics, Genome, Human, Genomics, Humans, Nanopores, Repetitive Sequences, Nucleic Acid, High-Throughput Nucleotide Sequencing methods, Software
- Abstract
Background: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap., Results: To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing., Conclusions: Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. 
Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.
- Published
- 2018
163. TAP: a targeted clinical genomics pipeline for detecting transcript variants using RNA-seq data.
- Author
-
Chiu R, Nip KM, Chu J, and Birol I
- Subjects
- Humans, INDEL Mutation, Leukemia, Myeloid, Acute genetics, Leukemia, Myeloid, Acute pathology, RNA chemistry, RNA genetics, RNA Splicing, Sequence Analysis, RNA, Genetic Variation, Genomics methods, RNA metabolism, User-Computer Interface
- Abstract
Background: RNA-seq is a powerful and cost-effective technology for molecular diagnostics of cancer and other diseases, and it can reach its full potential when coupled with validated clinical-grade informatics tools. Despite recent advances in long-read sequencing, transcriptome assembly of short reads remains a useful and cost-effective methodology for unveiling transcript-level rearrangements and novel isoforms. One of the major concerns for adopting the proven de novo assembly approach for RNA-seq data in clinical settings has been the analysis turnaround time. To address this concern, we have developed a targeted approach to expedite assembly and analysis of RNA-seq data. Results: Here we present our Targeted Assembly Pipeline (TAP), which consists of four stages: 1) alignment-free gene-level classification of RNA-seq reads using BioBloomTools, 2) de novo assembly of individual targets using Trans-ABySS, 3) alignment of assembled contigs to the reference genome and transcriptome with GMAP and BWA and 4) structural and splicing variant detection using PAVFinder. We show that PAVFinder is a robust gene fusion detection tool when compared to established methods such as Tophat-Fusion and deFuse on simulated data of 448 events. Using the Leucegene acute myeloid leukemia (AML) RNA-seq data and a set of 580 COSMIC target genes, TAP identified a wide range of hallmark molecular anomalies including gene fusions, tandem duplications, insertions and deletions in agreement with published literature results. Moreover, also in this dataset, TAP captured AML-specific splicing variants such as skipped exons and novel splice sites reported in studies elsewhere. Running time of TAP on 100-150 million read pairs and a 580-gene set is one to two hours on a 48-core machine. Conclusions: We demonstrated that TAP is a fast and robust RNA-seq variant detection pipeline that is potentially amenable to clinical applications. 
TAP is available at http://www.bcgsc.ca/platform/bioinfo/software/pavfinder.
- Published
- 2018
164. Recurrent tumor-specific regulation of alternative polyadenylation of cancer-related genes.
- Author
-
Xue Z, Warren RL, Gibb EA, MacMillan D, Wong J, Chiu R, Hammond SA, Yang C, Nip KM, Ennis CA, Hahn A, Reynolds S, and Birol I
- Subjects
- 3' Untranslated Regions, Cloud Computing, Databases, Genetic, Fibroblast Growth Factor 2 genetics, Gene Expression Regulation, Neoplastic, Humans, Neoplasm Recurrence, Local genetics, Neoplasms pathology, Polyadenylation, RNA Cleavage, RNA, Messenger metabolism, Software, Neoplasms genetics, RNA, Messenger genetics
- Abstract
Background: Alternative polyadenylation (APA) results in messenger RNA molecules with different 3' untranslated regions (3' UTRs), affecting the molecules' stability, localization, and translation. APA is pervasive and implicated in cancer. Earlier reports on APA focused on 3' UTR length modifications and commonly characterized APA events as 3' UTR shortening or lengthening. However, such characterization oversimplifies the processing of 3' ends of transcripts and fails to adequately describe the various scenarios we observe., Results: We built a cloud-based targeted de novo transcript assembly and analysis pipeline that incorporates our previously developed cleavage site prediction tool, KLEAT. We applied this pipeline to elucidate the APA profiles of 114 genes in 9939 tumor and 729 tissue normal samples from The Cancer Genome Atlas (TCGA). The full set of 10,668 RNA-Seq samples from 33 cancer types has not been utilized by previous APA studies. By comparing the frequencies of predicted cleavage sites between normal and tumor sample groups, we identified 77 events (i.e. gene-cancer type pairs) of tumor-specific APA regulation in 13 cancer types; for 15 genes, such regulation is recurrent across multiple cancers. Our results also support a previous report showing the 3' UTR shortening of FGF2 in multiple cancers. However, over half of the events we identified display complex changes to 3' UTR length that resist simple classification like shortening or lengthening., Conclusions: Recurrent tumor-specific regulation of APA is widespread in cancer. However, the regulation pattern that we observed in TCGA RNA-seq data cannot be described as straightforward 3' UTR shortening or lengthening. Continued investigation into this complex, nuanced regulatory landscape will provide further insight into its role in tumor formation and development.
- Published
- 2018
165. ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers.
- Author
-
Coombe L, Zhang J, Vandervalk BP, Chu J, Jackman SD, Birol I, and Warren RL
- Subjects
- Humans, Chromosomes, Human genetics, Genome, Human, Genomics methods, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software
- Abstract
Background: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time., Results: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13)., Conclusions: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. 
The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
- Published
- 2018
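The ARKS abstract above mentions Jaccard indices computed from linked-read barcode information. As a minimal illustration of the measure itself, the Python sketch below computes the Jaccard index of two barcode sets, for example the barcodes seen at two sequence ends. It does not reproduce the ARKS distance estimator; the function name and the barcode labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical barcodes observed at the ends of two draft sequences.
    end_one = {"bc1", "bc2", "bc3"}
    end_two = {"bc2", "bc3", "bc4"}
    print(jaccard(end_one, end_two))
```

Intuitively, two regions that lie close together on the same large molecules share more barcodes, so a higher index suggests a shorter distance between them.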
166. ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data.
- Author
-
Khan H, Mohamadi H, Vandervalk BP, Warren RL, Chu J, and Birol I
- Subjects
- Algorithms, Genome, High-Throughput Nucleotide Sequencing methods, RNA, Sequence Analysis, RNA methods, Software, Alternative Splicing, Exons, Transcriptome, Whole Genome Sequencing
- Abstract
Motivation: Sequencing studies on non-model organisms often interrogate both genomes and transcriptomes with massive amounts of short sequences. Such studies require de novo analysis tools and techniques, when the species and closely related species lack high quality reference resources. For certain applications such as de novo annotation, information on putative exons and alternative splicing may be desirable., Results: Here we present ChopStitch, a new method for finding putative exons de novo and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-Seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also accounts for base substitutions in transcript sequences that may be derived from sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are represented as splice graphs in DOT output format., Availability and Implementation: ChopStitch is written in Python and C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ChopStitch., Contact: hkhan@bcgsc.ca or ibirol@bcgsc.ca., Supplementary Information: Supplementary data are available at Bioinformatics online.
- Published
- 2018
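The ChopStitch abstract above describes locating exon-exon boundaries by checking transcript k-mers against the k-mer spectrum of genomic reads. The Python sketch below illustrates that general idea under strong simplifications: an exact Python set stands in for the Bloom filter, base substitutions are not tolerated, and the sequences are toy data. All names are hypothetical and this is not the ChopStitch algorithm itself.

```python
def kmer_set(seqs, k):
    """Collect every k-mer from the given sequences (stand-in for a Bloom filter)."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def putative_junctions(transcript, genome_kmers, k):
    """Return transcript positions where a run of genome-absent k-mers ends.

    A k-mer that spans a splice junction joins two exons that are not adjacent
    in the genome, so it is expected to be missing from the genomic k-mer set.
    """
    junctions = []
    in_absent_run = False
    for i in range(len(transcript) - k + 1):
        present = transcript[i:i + k] in genome_kmers
        if not present:
            in_absent_run = True
        elif in_absent_run:
            # First k-mer that lies fully inside the downstream exon.
            junctions.append(i)
            in_absent_run = False
    return junctions


if __name__ == "__main__":
    exon1, intron, exon2 = "ACGTTGCA", "GTAAGTTTTCAG", "CCATGGAT"  # toy data
    genome_kmers = kmer_set([exon1 + intron + exon2], 4)
    print(putative_junctions(exon1 + exon2, genome_kmers, 4))
```

In real data the boundary is ambiguous whenever bases flanking the junction also match the genome, and sequencing errors create absent k-mers as well, which is why the published method additionally accounts for base substitutions.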
167. ARCS: scaffolding genome drafts with linked reads.
- Author
-
Yeo S, Coombe L, Warren RL, Chu J, and Birol I
- Subjects
- Genomics methods, Humans, Genome, Human, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: Sequencing of human genomes is now routine, and assembly of shotgun reads is increasingly feasible. However, assemblies often fail to inform about chromosome-scale structure due to a lack of linkage information over long stretches of DNA, a shortcoming that is being addressed by new sequencing protocols, such as the GemCode and Chromium linked reads from 10x Genomics. Results: Here, we present ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H.sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts. Availability and Implementation: https://github.com/bcgsc/ARCS/. Contact: rwarren@bcgsc.ca. Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author(s) 2017. Published by Oxford University Press.)
- Published
- 2018
168. A novel approach to wildlife transcriptomics provides evidence of disease-mediated differential expression and changes to the microbiome of amphibian populations.
- Author
-
Campbell LJ, Hammond SA, Price SJ, Sharma MD, Garner TWJ, Birol I, Helbing CC, Wilfert L, and Griffiths AGF
- Subjects
- Animals, Animals, Wild microbiology, DNA Virus Infections virology, Microbiota genetics, Rana temporaria genetics, Sequence Analysis, RNA, United Kingdom, Animals, Wild genetics, DNA Virus Infections genetics, Rana temporaria virology, Ranavirus pathogenicity
- Abstract
Ranaviruses are responsible for a lethal, emerging infectious disease in amphibians and threaten their populations throughout the world. Despite this, little is known about how amphibian populations respond to ranaviral infection. In the United Kingdom, ranaviruses impact the common frog (Rana temporaria). Extensive public engagement in the study of ranaviruses in the UK has led to the formation of a unique system of field sites containing frog populations of known ranaviral disease history. Within this unique natural field system, we used RNA sequencing (RNA-Seq) to compare the gene expression profiles of R. temporaria populations with a history of ranaviral disease and those without. We have applied a RNA read-filtering protocol that incorporates Bloom filters, previously used in clinical settings, to limit the potential for contamination that comes with the use of RNA-Seq in nonlaboratory systems. We have identified a suite of 407 transcripts that are differentially expressed between populations of different ranaviral disease history. This suite contains genes with functions related to immunity, development, protein transport and olfactory reception among others. A large proportion of potential noncoding RNA transcripts present in our differentially expressed set provide first evidence of a possible role for long noncoding RNA (lncRNA) in amphibian response to viruses. Our read-filtering approach also removed significantly more bacterial reads from libraries generated from positive disease history populations. Subsequent analysis revealed these bacterial read sets to represent distinct communities of bacterial species, which is suggestive of an interaction between ranavirus and the host microbiome in the wild., (© 2018 The Authors. Molecular Ecology Published by John Wiley & Sons Ltd.)
- Published
- 2018
- Full Text
- View/download PDF
169. Genome-Enhanced Detection and Identification (GEDI) of plant pathogens.
- Author
-
Feau N, Beauseigle S, Bergeron MJ, Bilodeau GJ, Birol I, Cervantes-Arango S, Dhillon B, Dale AL, Herath P, Jones SJM, Lamarche J, Ojeda DI, Sakalidis ML, Taylor G, Tsui CKM, Uzunovic A, Yueh H, Tanguay P, and Hamelin RC
- Abstract
Plant diseases caused by fungi and Oomycetes represent worldwide threats to crops and forest ecosystems. Effective prevention and appropriate management of emerging diseases rely on rapid detection and identification of the causal pathogens. The increase in genomic resources makes it possible to generate novel genome-enhanced DNA detection assays that can exploit whole genomes to discover candidate genes for pathogen detection. A pipeline was developed to identify genome regions that discriminate taxa or groups of taxa and can be converted into PCR assays. The modular pipeline is comprised of four components: (1) selection and genome sequencing of phylogenetically related taxa, (2) identification of clusters of orthologous genes, (3) elimination of false positives by filtering, and (4) assay design. This pipeline was applied to some of the most important plant pathogens across three broad taxonomic groups: Phytophthoras (Stramenopiles, Oomycota), Dothideomycetes (Fungi, Ascomycota) and Pucciniales (Fungi, Basidiomycota). Comparison of 73 fungal and Oomycete genomes led to the discovery of 5,939 gene clusters that were unique to the targeted taxa and an additional 535 that were common at higher taxonomic levels. Approximately 28% of the 299 tested were converted into qPCR assays that met our set of specificity criteria. This work demonstrates that a genome-wide approach can efficiently identify multiple taxon-specific genome regions that can be converted into highly specific PCR assays. The possibility to easily obtain multiple alternative regions to design highly specific qPCR assays should be of great help in tackling challenging cases for which higher taxon-resolution is needed., Competing Interests: The authors declare there are no competing interests.
- Published
- 2018
- Full Text
- View/download PDF
170. The Genome of the Northern Sea Otter (Enhydra lutris kenyoni).
- Author
-
Jones SJ, Haulena M, Taylor GA, Chan S, Bilobram S, Warren RL, Hammond SA, Mungall KL, Choo C, Kirk H, Pandoh P, Ally A, Dhalla N, Tam AKY, Troussard A, Paulino D, Coope RJN, Mungall AJ, Moore R, Zhao Y, Birol I, Ma Y, Marra M, and Jones SJM
- Abstract
The northern sea otter inhabits coastal waters of the northern Pacific Ocean and is the largest member of the Mustelidae family. DNA sequencing methods that utilize microfluidic partitioned and non-partitioned library construction were used to establish the sea otter genome. The final assembly provided 2.426 Gbp of highly contiguous assembled genomic sequences with a scaffold N50 length of over 38 Mbp. We generated transcriptome data derived from a lymphoma to aid in the determination of functional elements. The assembled genome sequence and underlying sequence data are available at the National Center for Biotechnology Information (NCBI) under the BioProject accession number PRJNA388419., Competing Interests: The authors declare no competing financial interests.
- Published
- 2017
- Full Text
- View/download PDF
171. The Genome of the Beluga Whale (Delphinapterus leucas).
- Author
-
Jones SJM, Taylor GA, Chan S, Warren RL, Hammond SA, Bilobram S, Mordecai G, Suttle CA, Miller KM, Schulze A, Chan AM, Jones SJ, Tse K, Li I, Cheung D, Mungall KL, Choo C, Ally A, Dhalla N, Tam AKY, Troussard A, Kirk H, Pandoh P, Paulino D, Coope RJN, Mungall AJ, Moore R, Zhao Y, Birol I, Ma Y, Marra M, and Haulena M
- Abstract
The beluga whale is a cetacean that inhabits arctic and subarctic regions, and is the only living member of the genus Delphinapterus. The genome of the beluga whale was determined using DNA sequencing approaches that employed both microfluidic partitioning library and non-partitioned library construction. The former allowed for the construction of a highly contiguous assembly with a scaffold N50 length of over 19 Mbp and total reconstruction of 2.32 Gbp. To aid our understanding of the functional elements, transcriptome data was also derived from brain, duodenum, heart, lung, spleen, and liver tissue. Assembled sequence and all of the underlying sequence data are available at the National Center for Biotechnology Information (NCBI) under the Bioproject accession number PRJNA360851A., Competing Interests: The authors declare no competing financial interests.
- Published
- 2017
- Full Text
- View/download PDF
172. The North American bullfrog draft genome provides insight into hormonal regulation of long noncoding RNA.
- Author
-
Hammond SA, Warren RL, Vandervalk BP, Kucuk E, Khan H, Gibb EA, Pandoh P, Kirk H, Zhao Y, Jones M, Mungall AJ, Coope R, Pleasance S, Moore RA, Holt RA, Round JM, Ohora S, Walle BV, Veldhoen N, Helbing CC, and Birol I
- Subjects
- Animals, Computational Biology, Genome, Mitochondrial, Male, Molecular Sequence Annotation, North America, Phylogeny, RNA, Long Noncoding metabolism, Rana catesbeiana metabolism, Thyroid Hormones metabolism, Genome, RNA, Long Noncoding genetics, Rana catesbeiana genetics
- Abstract
Frogs play important ecological roles, and several species are important model organisms for scientific research. The globally distributed Ranidae (true frogs) are the largest frog family, and have substantial evolutionary distance from the model laboratory Xenopus frog species. Unfortunately, there are currently no genomic resources for the former, important group of amphibians. More widely applicable amphibian genomic data is urgently needed as more than two-thirds of known species are currently threatened or are undergoing population declines. We report a 5.8 Gbp (NG50 = 69 kbp) genome assembly of a representative North American bullfrog (Rana [Lithobates] catesbeiana). The genome contains over 22,000 predicted protein-coding genes and 6,223 candidate long noncoding RNAs (lncRNAs). RNA-Seq experiments show thyroid hormone causes widespread transcriptional change among protein-coding and putative lncRNA genes. This initial bullfrog draft genome will serve as a key resource with broad utility including amphibian research, developmental biology, and environmental research.
- Published
- 2017
- Full Text
- View/download PDF
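The NG50 figure quoted in the bullfrog abstract above (and the N50/NG50 values in neighbouring assembly entries) is a contiguity statistic: the length L such that sequences of length L or longer cover at least half of a reference total. For N50 that total is the assembly size; for NG50 it is an estimated genome size. The following is a minimal sketch of that definition; the function name `nx50` and the toy lengths are illustrative assumptions, not taken from any of the listed papers.

```python
from typing import Iterable, Optional


def nx50(lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Return N50 (genome_size=None) or NG50 (genome_size given).

    Hypothetical helper for illustration only.
    """
    lengths = sorted(lengths, reverse=True)
    # N50 uses the assembly total; NG50 uses the estimated genome size.
    total = sum(lengths) if genome_size is None else genome_size
    target = total / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    # The assembly covers less than half of the genome estimate.
    return 0


scaffolds = [80, 70, 50, 40, 30, 20]         # toy scaffold lengths
n50 = nx50(scaffolds)                        # half of 290 is 145
ng50 = nx50(scaffolds, genome_size=400)      # half of 400 is 200
```

Because NG50 is normalised by genome size rather than assembly size, it stays comparable between assemblies of the same genome that differ in total length.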
173. Complete Genome Sequence of Mycobacterium chimaera SJ42, a Nonoutbreak Strain from an Immunocompromised Patient with Pulmonary Disease.
- Author
-
Hasan NA, Warren RL, Epperson LE, Malecha A, Alexander DC, Turenne CY, MacMillan D, Birol I, Pleasance S, Coope R, Jones SJM, Romney MG, Ng M, Chan T, Rodrigues M, Tang P, Gardy JL, and Strong M
- Abstract
Mycobacterium chimaera, a nontuberculous mycobacterium (NTM) belonging to the Mycobacterium avium complex (MAC), is an opportunistic pathogen that can cause respiratory and disseminated disease. We report the complete genome sequence of a strain, SJ42, isolated from an immunocompromised male presenting with MAC pneumonia, assembled from Illumina and Oxford Nanopore data., (Copyright © 2017 Hasan et al.)
- Published
- 2017
- Full Text
- View/download PDF
174. Kollector: transcript-informed, targeted de novo assembly of gene loci.
- Author
-
Kucuk E, Chu J, Vandervalk BP, Hammond SA, Warren RL, and Birol I
- Subjects
- Animals, Caenorhabditis elegans genetics, Humans, Pediculus genetics, Picea genetics, Eukaryota genetics, Genetic Loci, Genomics methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: Despite considerable advancements in sequencing and computing technologies, de novo assembly of whole eukaryotic genomes is still a time-consuming task that requires a significant amount of computational resources and expertise. A targeted assembly approach to perform local assembly of sequences of interest remains a valuable option for some applications. This is especially true for gene-centric assemblies, whose resulting sequence can be readily utilized for more focused biological research. Here we describe Kollector, an alignment-free targeted assembly pipeline that uses thousands of transcript sequences concurrently to inform the localized assembly of corresponding gene loci. Kollector robustly reconstructs introns and novel sequences within these loci, and scales well to large genomes, properties that make it especially useful for researchers working on non-model eukaryotic organisms., Results: We demonstrate the performance of Kollector for assembling complete or near-complete Caenorhabditis elegans and Homo sapiens gene loci from their respective input transcripts. In a time- and memory-efficient manner, the Kollector pipeline successfully reconstructs respectively 99% and 80% (compared to 86% and 73% with standard de novo assembly techniques) of C. elegans and H. sapiens transcript targets in their corresponding genomic space using whole genome shotgun sequencing reads. We also show that Kollector outperforms both established and recently released targeted assembly tools. Finally, we demonstrate three use cases for Kollector, including comparative and cancer genomics applications., Availability and Implementation: Kollector is implemented as a bash script, and is available at https://github.com/bcgsc/kollector., Contact: ibirol@bcgsc.ca., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2017. Published by Oxford University Press.)
- Published
- 2017
- Full Text
- View/download PDF
175. ntCard: a streaming algorithm for cardinality estimation in genomics data.
- Author
-
Mohamadi H, Khan H, and Birol I
- Subjects
- Algorithms, Genome Size, Genome, Human, Genome, Plant, Humans, Models, Statistical, Picea genetics, Genomics methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task., Results: Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, with the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using a similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications., Availability and Implementation: ntCard is written in C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard., Contact: hmohamadi@bcgsc.ca or ibirol@bcgsc.ca., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2017. Published by Oxford University Press.)
- Published
- 2017
- Full Text
- View/download PDF
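The ntCard abstract above describes three steps: hash each streamed k-mer, keep only a sample of the hash values in a reduced multiplicity table, and reconstruct the population distribution from the sample. The sketch below illustrates only the general idea under loud assumptions: it uses `hashlib.blake2b` in place of ntHash, samples k-mers whose top hash bits are all zero, and replaces the paper's statistical model (which the abstract does not detail) with naive scaling by the inverse sampling rate. The function names are hypothetical and this is not ntCard's implementation.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable


def kmer_hash(kmer: str) -> int:
    """64-bit hash of a k-mer (stand-in for ntHash; an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sampled_histogram(reads: Iterable[str], k: int, sample_bits: int) -> Dict[int, int]:
    """Estimate the k-mer multiplicity histogram from a hash-based sample.

    A k-mer is kept only if the top `sample_bits` bits of its hash are zero,
    so roughly 1 in 2**sample_bits distinct k-mers is counted.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer_hash(kmer) >> (64 - sample_bits) == 0:
                counts[kmer] += 1
    # Sample distribution: multiplicity -> number of sampled distinct k-mers.
    sample_hist = Counter(counts.values())
    # Naive reconstruction: scale by the inverse sampling rate.
    # (Placeholder for the statistical model mentioned in the abstract.)
    scale = 2 ** sample_bits
    return {mult: n * scale for mult, n in sample_hist.items()}


# With sample_bits=0 every k-mer is kept, so the histogram is exact.
exact = sampled_histogram(["ACGTACGT"], k=4, sample_bits=0)
```

Sampling on hash values rather than on read positions means every occurrence of a sampled k-mer is counted, so multiplicities within the sample are exact and only the number of distinct k-mers needs to be extrapolated.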
176. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter.
- Author
-
Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, and Birol I
- Subjects
- Contig Mapping standards, Genome Size, Genomics standards, Humans, Sequence Analysis, DNA methods, Sequence Analysis, DNA standards, Contig Mapping methods, Genomics methods, Software
- Abstract
The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp., (© 2017 Jackman et al.; Published by Cold Spring Harbor Laboratory Press.)
- Published
- 2017
- Full Text
- View/download PDF
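The ABySS 2.0 abstract above mentions using a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph. As a rough illustration of the idea (not ABySS's implementation), the sketch below stores k-mers in a small Bloom filter and discovers graph edges by querying the four possible one-base extensions of a k-mer. The class and function names, the filter size and the use of `hashlib.blake2b` are all assumptions made for this example.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash functions by prefixing a seed to the item.
        for seed in range(self.num_hashes):
            data = f"{seed}:{item}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def load_kmers(bf: BloomFilter, seq: str, k: int) -> None:
    """Insert every k-mer of seq (the graph's vertices) into the filter."""
    for i in range(len(seq) - k + 1):
        bf.add(seq[i:i + k])


def successors(bf: BloomFilter, kmer: str) -> List[str]:
    """Implicit de Bruijn edges: the extensions of kmer present in the filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]


bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
load_kmers(bf, "ACGTTGCA", k=4)
next_kmers = successors(bf, "ACGT")  # contains "CGTT"; may also hold false positives
```

The memory saving comes from storing only bits rather than the k-mers themselves; the cost is that a query can occasionally report a k-mer that was never inserted, so an assembler built this way has to tolerate or prune spurious edges.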
177. Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art.
- Author
-
Chu J, Mohamadi H, Warren RL, Yang C, and Birol I
- Subjects
- Algorithms, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software
- Abstract
Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput., Contact: cjustin@bcgsc.ca, ibirol@bcgsc.ca., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2016. Published by Oxford University Press.)
- Published
- 2017
- Full Text
- View/download PDF
178. A high-resolution reference genetic map positioning 8.8 K genes for the conifer white spruce: structural genomics implications and correspondence with physical distance.
- Author
-
Pavy N, Lamothe M, Pelgas B, Gagnon F, Birol I, Bohlmann J, Mackay J, Isabel N, and Bousquet J
- Subjects
- Chromosome Mapping methods, Computational Biology methods, DNA, Plant genetics, Genomics methods, Polymorphism, Single Nucleotide genetics, Synteny, Genome, Plant genetics, Picea genetics
- Abstract
Over the last decade, extensive genetic and genomic resources have been developed for the conifer white spruce (Picea glauca, Pinaceae), which has one of the largest plant genomes (20 Gbp). Draft genome sequences of white spruce and other conifers have recently been produced, but dense genetic maps are needed to comprehend genome macrostructure, delineate regions involved in quantitative traits, complement functional genomic investigations, and assist the assembly of fragmented genomic sequences. A greatly expanded P. glauca composite linkage map was generated from a set of 1976 full-sib progeny, with the positioning of 8793 expressed genes. Regions with significant low or high gene density were identified. Gene family members tended to be mapped on the same chromosomes, with tandemly arrayed genes significantly biased towards specific functional classes. The map was integrated with transcriptome data surveyed across eight tissues. In total, 69 clusters of co-expressed and co-localising genes were identified. A high level of synteny was found with pine genetic maps, which should facilitate the transfer of structural information in the Pinaceae. Although the current white spruce genome sequence remains highly fragmented, dozens of scaffolds encompassing more than one mapped gene were identified. From these, the relationship between genetic and physical distances was examined and the genome-wide recombination rate was found to be much smaller than most estimates reported for angiosperm genomes. This gene linkage map shall assist the large-scale assembly of the next-generation white spruce genome sequence and provide a reference resource for the conifer genomics community., (© 2017 Her Majesty the Queen in Right of Canada. The Plant Journal published by John Wiley & Sons Ltd and Society for Experimental Biology.)
- Published
- 2017
- Full Text
- View/download PDF
179. NanoSim: nanopore sequence read simulator based on statistical characterization.
- Author
-
Yang C, Chu J, Warren RL, and Birol I
- Subjects
- Algorithms, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA, Computational Biology methods, Genomics methods, Software
- Abstract
Background: The MinION sequencing instrument from Oxford Nanopore Technologies (ONT) produces long read lengths from single-molecule sequencing - valuable features for detailed genome characterization. To realize the potential of this platform, a number of groups are developing bioinformatics tools tuned for the unique characteristics of its data. We note that these development efforts would benefit from a simulator software, the output of which could be used to benchmark analysis tools., Results: Here, we introduce NanoSim, a fast and scalable read simulator that captures the technology-specific features of ONT data and allows for adjustments upon improvement of nanopore sequencing technology. The first step of NanoSim is read characterization, which provides a comprehensive alignment-based analysis and generates a set of read profiles serving as the input to the next step, the simulation stage. The simulation stage uses the model built in the previous step to produce in silico reads for a given reference genome. NanoSim is written in Python and R. The source files and manual are available at the Genome Sciences Centre website: http://www.bcgsc.ca/platform/bioinfo/software/nanosim., Conclusion: In this work, we model the base-calling errors of ONT reads to inform the simulation of sequences with similar characteristics. We showcase the performance of NanoSim on publicly available datasets generated using the R7 and R7.3 chemistries and different sequencing kits and compare the resulting synthetic reads to those of other long-sequence simulators and experimental ONT reads. We expect NanoSim to have an enabling role in the field and benefit the development of scalable next-generation sequencing technologies for the long nanopore reads, including genome assembly, mutation detection, and even metagenomic analysis software., (© The Authors 2017. Published by Oxford University Press.)
- Published
- 2017
- Full Text
- View/download PDF
180. De novo assembly of the ringed seal (Pusa hispida) blubber transcriptome: A tool that enables identification of molecular health indicators associated with PCB exposure.
- Author
-
Brown TM, Hammond SA, Behsaz B, Veldhoen N, Birol I, and Helbing CC
- Subjects
- Animals, Gene Expression Profiling, Gene Ontology, Molecular Sequence Annotation, Polymerase Chain Reaction, RNA, Messenger genetics, RNA, Messenger metabolism, Reproducibility of Results, Sequence Analysis, RNA, Water Pollutants, Chemical toxicity, Animal Structures metabolism, Environmental Exposure analysis, Health Status Indicators, Polychlorinated Biphenyls toxicity, Seals, Earless genetics, Transcriptome genetics
- Abstract
The ringed seal, Pusa hispida, is a keystone species in the Arctic marine ecosystem, and is proving a useful marine mammal for linking polychlorinated biphenyl (PCB) exposure to toxic injury. We report here the first de novo assembled transcriptome for the ringed seal (342,863 transcripts, of which 53% were annotated), which we then applied to a population of ringed seals exposed to a local PCB source in Arctic Labrador, Canada. We found an indication of energy metabolism imbalance in local ringed seals (n=4), and identified five significant gene transcript targets: plasminogen receptor (Plg-R(KT)), solute carrier family 25 member 43 receptor (Slc25a43), ankyrin repeat domain-containing protein 26-like receptor (Ankrd26), HIS30 (not yet annotated) and HIS16 (not yet annotated) that may represent indicators of PCB exposure and effects in marine mammals. The abundance profiles of these five gene targets were validated in blubber samples collected from 43 ringed seals using a qPCR assay. The mRNA transcript levels for all five gene targets, (Plg-R(KT), r² = 0.43), (Slc25a43, r² = 0.51), (Ankrd26, r² = 0.43), (HIS30, r² = 0.39) and (HIS16, r² = 0.31) correlated with increasing levels of blubber PCBs. Results from the present study contribute to our understanding of PCB associated effects in marine mammals, and provide new tools for future molecular and toxicology work in pinnipeds., (Copyright © 2017 Elsevier B.V. All rights reserved.)
- Published
- 2017
- Full Text
- View/download PDF
181. Genomic and Cytogenetic Characterization of a Balanced Translocation Disrupting NUP98.
- Author
-
Thibodeau ML, Steinraths M, Brown L, Zong Z, Shomer N, Taubert S, Mungall KL, Ma YP, Mueller R, Birol I, and Lehman A
- Subjects
- Adult, Amniocentesis, Cytogenetic Analysis methods, Female, GPI-Linked Proteins genetics, Genome-Wide Association Study, Genomics methods, Humans, Intercellular Signaling Peptides and Proteins genetics, Neoplasm Proteins genetics, Promoter Regions, Genetic, Pseudogenes, Transcription Initiation Site, Tuberous Sclerosis diagnosis, Tuberous Sclerosis genetics, Angiomyolipoma genetics, Chromosomes, Human, Pair 11 genetics, Chromosomes, Human, Pair 12 genetics, Kidney Neoplasms genetics, Nuclear Pore Complex Proteins genetics, Translocation, Genetic
- Abstract
A 41-year-old Asian woman with bilateral renal angiomyolipomas (AML) was incidentally identified to have a balanced translocation, 46,XX,t(11;12)(p15.4;q15). She had no other features or family history to suggest a diagnosis of tuberous sclerosis. Her healthy daughter had the same translocation and no renal AML at the age of 3 years. Whole-genome sequencing was performed on genomic maternal DNA isolated from blood. A targeted de novo assembly was then conducted with ABySS for chromosomes 11 and 12. Sanger sequencing was used to validate the translocation breakpoints. As a result, genomic characterization of chromosomes 11 and 12 revealed that the 11p breakpoint disrupted the NUP98 gene in intron 1, causing a separation of the promoter and transcription start site from the rest of the gene. The translocation breakpoint on chromosome 12q was located in a gene desert. NUP98 has not yet been associated with renal AML pathogenesis, but somatic NUP98 alterations are recurrently implicated in hematological malignancies, most often following a gene fusion event. We also found evidence for complex structural events involving chromosome 12, which appear to disrupt the TDG gene. We identified a TDGP1 partially processed pseudogene at 12p12.1, which adds complexity to the de novo assembly. In conclusion, this is the first report of a germline constitutional structural chromosome rearrangement disrupting NUP98 that occurred in a generally healthy woman with bilateral renal AML., (© 2017 S. Karger AG, Basel.)
- Published
- 2017
- Full Text
- View/download PDF
182. ntHash: recursive nucleotide hashing.
- Author
-
Mohamadi H, Chu J, Vandervalk BP, and Birol I
- Subjects
- Animals, Humans, Sequence Alignment, Sequence Analysis, DNA, Software, Algorithms, Nucleotides
- Abstract
Motivation: Hashing has been widely used for indexing, querying and rapid similarity search in many bioinformatics applications, including sequence alignment, genome and transcriptome assembly, k-mer counting and error correction. Hence, expediting hashing operations would have a substantial impact in the field, making bioinformatics applications faster and more efficient., Results: We present ntHash, a hashing algorithm tuned for processing DNA/RNA sequences. It performs the best when calculating hash values for adjacent k-mers in an input sequence, operating an order of magnitude faster than the best performing alternatives in typical use cases., Availability and Implementation: ntHash is available online at http://www.bcgsc.ca/platform/bioinfo/software/nthash and is free for academic use., Contacts: hmohamadi@bcgsc.ca or ibirol@bcgsc.ca., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2016. Published by Oxford University Press.)
- Published
- 2016
- Full Text
- View/download PDF
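The ntHash abstract above notes that the method performs best when hashing adjacent k-mers, which is the hallmark of a rolling (recursive) hash: the hash of the next k-mer is derived from the previous one in constant time instead of rehashing all k bases. The following is a minimal sketch of a generic cyclic-polynomial rolling hash of that kind. The per-base seed constants are arbitrary 64-bit values chosen for this example, not ntHash's published seeds, and the function names are hypothetical; the sketch also ignores reverse complements.

```python
from typing import List

MASK = (1 << 64) - 1

# Arbitrary 64-bit constants, one per base (assumed for illustration only).
SEED = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xC2B2AE3D27D4EB4F,
    "G": 0x165667B19E3779F9,
    "T": 0x27D4EB2F165667C5,
}


def rol(x: int, r: int) -> int:
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def hash_kmer(kmer: str) -> int:
    """Direct hash: XOR of each base's seed rotated by its distance from the end."""
    k = len(kmer)
    h = 0
    for i, base in enumerate(kmer):
        h ^= rol(SEED[base], k - 1 - i)
    return h


def rolling_hashes(seq: str, k: int) -> List[int]:
    """Hash every k-mer of seq, updating in O(1) per step after the first."""
    if len(seq) < k:
        return []
    h = hash_kmer(seq[:k])
    out = [h]
    for i in range(k, len(seq)):
        # Rotate the old hash, cancel the outgoing base, add the incoming base.
        h = rol(h, 1) ^ rol(SEED[seq[i - k]], k) ^ SEED[seq[i]]
        out.append(h)
    return out


hashes = rolling_hashes("ACGTACGTTGCA", k=5)
```

The update works because rotation distributes over XOR: rotating the old hash shifts every base's contribution by one position, after which the outgoing base's term (now rotated k times) is cancelled by XOR-ing it in again.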
183. Genome sequences of six Phytophthora species threatening forest ecosystems.
- Author
-
Feau N, Taylor G, Dale AL, Dhillon B, Bilodeau GJ, Birol I, Jones SJ, and Hamelin RC
- Abstract
The Phytophthora genus comprises some of the most destructive plant pathogens, which attack a wide range of hosts including economically valuable tree species, both angiosperm and gymnosperm. Many known species of Phytophthora are invasive and have been introduced through nursery and agricultural trade. As part of a larger project aimed at utilizing genomic data for forest disease diagnostics, pathogen detection and monitoring (The TAIGA project: Tree Aggressors Identification using Genomic Approaches; http://taigaforesthealth.com/), we sequenced the genomes of six Phytophthora species that are important invasive pathogens of trees and a serious threat to the international trade of forest products. This genomic data was used to develop highly sensitive and specific detection assays, for genome comparisons and to make evolutionary inferences, and will be useful to the broader plant and tree health community. These WGS data have been deposited in the International Nucleotide Sequence Database Collaboration (DDBJ/ENA/GenBank) under the accession numbers AUPN01000000, AUVH01000000, AUWJ02000000, AUUF02000000, AWVV02000000 and AWVW02000000.
- Published
- 2016
- Full Text
- View/download PDF
184. Assembly of the Complete Sitka Spruce Chloroplast Genome Using 10X Genomics' GemCode Sequencing Data.
- Author
-
Coombe L, Warren RL, Jackman SD, Yang C, Vandervalk BP, Moore RA, Pleasance S, Coope RJ, Bohlmann J, Holt RA, Jones SJ, and Birol I
- Subjects
- Phylogeny, Picea classification, Chloroplasts genetics, Genome, Plant, Picea genetics
- Abstract
The linked read sequencing library preparation platform by 10X Genomics produces barcoded sequencing libraries, which are subsequently sequenced using the Illumina short read sequencing technology. In this new approach, long fragments of DNA are partitioned into separate micro-reactions, where the same index sequence is incorporated into each of the sequencing fragment inserts derived from a given long fragment. In this study, we exploited this property by using reads from index sequences associated with a large number of reads, to assemble the chloroplast genome of the Sitka spruce tree (Picea sitchensis). Here we report on the first Sitka spruce chloroplast genome assembled exclusively from P. sitchensis genomic libraries prepared using the 10X Genomics protocol. We show that the resulting 124,049 base pair long genome shares high sequence similarity with the related white spruce and Norway spruce chloroplast genomes, but diverges substantially from a previously published P. sitchensis-P. thunbergii chimeric genome. The use of reads from high-frequency indices enabled separation of the nuclear genome reads from that of the chloroplast, which resulted in the simplification of the de Bruijn graphs used at the various stages of assembly., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2016
- Full Text
- View/download PDF
185. Divergent clonal selection dominates medulloblastoma at recurrence.
- Author
-
Morrissy AS, Garzia L, Shih DJ, Zuyderduyn S, Huang X, Skowron P, Remke M, Cavalli FM, Ramaswamy V, Lindsay PE, Jelveh S, Donovan LK, Wang X, Luu B, Zayne K, Li Y, Mayoh C, Thiessen N, Mercier E, Mungall KL, Ma Y, Tse K, Zeng T, Shumansky K, Roth AJ, Shah S, Farooq H, Kijima N, Holgado BL, Lee JJ, Matan-Lithwick S, Liu J, Mack SC, Manno A, Michealraj KA, Nor C, Peacock J, Qin L, Reimand J, Rolider A, Thompson YY, Wu X, Pugh T, Ally A, Bilenky M, Butterfield YS, Carlsen R, Cheng Y, Chuah E, Corbett RD, Dhalla N, He A, Lee D, Li HI, Long W, Mayo M, Plettner P, Qian JQ, Schein JE, Tam A, Wong T, Birol I, Zhao Y, Faria CC, Pimentel J, Nunes S, Shalaby T, Grotzer M, Pollack IF, Hamilton RL, Li XN, Bendel AE, Fults DW, Walter AW, Kumabe T, Tominaga T, Collins VP, Cho YJ, Hoffman C, Lyden D, Wisoff JH, Garvin JH Jr, Stearns DS, Massimi L, Schüller U, Sterba J, Zitterbart K, Puget S, Ayrault O, Dunn SE, Tirapelli DP, Carlotti CG, Wheeler H, Hallahan AR, Ingram W, MacDonald TJ, Olson JJ, Van Meir EG, Lee JY, Wang KC, Kim SK, Cho BK, Pietsch T, Fleischhack G, Tippelt S, Ra YS, Bailey S, Lindsey JC, Clifford SC, Eberhart CG, Cooper MK, Packer RJ, Massimino M, Garre ML, Bartels U, Tabori U, Hawkins CE, Dirks P, Bouffet E, Rutka JT, Wechsler-Reya RJ, Weiss WA, Collier LS, Dupuy AJ, Korshunov A, Jones DT, Kool M, Northcott PA, Pfister SM, Largaespada DA, Mungall AJ, Moore RA, Jabado N, Bader GD, Jones SJ, Malkin D, Marra MA, and Taylor MD
- Subjects
- Animals, Cerebellar Neoplasms genetics, Cerebellar Neoplasms pathology, Cerebellar Neoplasms radiotherapy, Cerebellar Neoplasms surgery, Clone Cells pathology, Craniospinal Irradiation, DNA Mutational Analysis, Disease Models, Animal, Drosophila melanogaster cytology, Drosophila melanogaster genetics, Female, Genome, Human genetics, Humans, Male, Medulloblastoma genetics, Medulloblastoma pathology, Medulloblastoma radiotherapy, Medulloblastoma surgery, Mice, Molecular Targeted Therapy methods, Neoplasm Recurrence, Local therapy, Radiotherapy, Image-Guided, Signal Transduction, Xenograft Model Antitumor Assays, Cerebellar Neoplasms therapy, Clone Cells drug effects, Clone Cells metabolism, Medulloblastoma therapy, Neoplasm Recurrence, Local genetics, Neoplasm Recurrence, Local pathology, Selection, Genetic drug effects
- Abstract
The development of targeted anti-cancer therapies through the study of cancer genomes is intended to increase survival rates and decrease treatment-related toxicity. We treated a transposon-driven, functional genomic mouse model of medulloblastoma with 'humanized' in vivo therapy (microneurosurgical tumour resection followed by multi-fractionated, image-guided radiotherapy). Genetic events in recurrent murine medulloblastoma exhibit a very poor overlap with those in matched murine diagnostic samples (<5%). Whole-genome sequencing of 33 pairs of human diagnostic and post-therapy medulloblastomas demonstrated substantial genetic divergence of the dominant clone after therapy (<12% diagnostic events were retained at recurrence). In both mice and humans, the dominant clone at recurrence arose through clonal selection of a pre-existing minor clone present at diagnosis. Targeted therapy is unlikely to be effective in the absence of the target, therefore our results offer a simple, proximal, and remediable explanation for the failure of prior clinical trials of targeted therapy.
- Published
- 2016
- Full Text
- View/download PDF
186. Large-scale profiling of microRNAs for The Cancer Genome Atlas.
- Author
-
Chu A, Robertson G, Brooks D, Mungall AJ, Birol I, Coope R, Ma Y, Jones S, and Marra MA
- Subjects
- Computational Biology methods, Datasets as Topic, Humans, Gene Expression Profiling methods, Genomics methods, MicroRNAs genetics, Neoplasms genetics
- Abstract
The comprehensive multiplatform genomics data generated by The Cancer Genome Atlas (TCGA) Research Network is an enabling resource for cancer research. It includes an unprecedented amount of microRNA sequence data: ~11 000 libraries across 33 cancer types. Combined with initiatives like the National Cancer Institute Genomics Cloud Pilots, such data resources will make intensive analysis of large-scale cancer genomics data widely accessible. To support such initiatives, and to enable comparison of TCGA microRNA data to data from other projects, we describe the process that we developed and used to generate the microRNA sequence data, from library construction through to submission of data to repositories. In the context of this process, we describe the computational pipeline that we used to characterize microRNA expression across large patient cohorts., (© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.)
- Published
- 2016
- Full Text
- View/download PDF
187. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation.
- Author
-
Jackman SD, Warren RL, Gibb EA, Vandervalk BP, Mohamadi H, Chu J, Raymond A, Pleasance S, Coope R, Wildung MR, Ritland CE, Bousquet J, Jones SJ, Bohlmann J, and Birol I
- Subjects
- Base Sequence, Contig Mapping, Molecular Sequence Annotation, Molecular Sequence Data, Genome, Chloroplast, Genome, Mitochondrial, Genome, Plant, Picea genetics
- Abstract
The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes was accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data., (© The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.)
- Published
- 2015
- Full Text
- View/download PDF
188. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads.
- Author
-
Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJ, and Birol I
- Subjects
- Sequence Alignment, Genome
- Abstract
Background: Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value., Results: We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes., Conclusions: This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.
- Published
- 2015
- Full Text
- View/download PDF
189. Sealer: a scalable gap-closing application for finishing draft genomes.
- Author
-
Paulino D, Warren RL, Vandervalk BP, Raymond A, Jackman SD, and Birol I
- Subjects
- Algorithms, Genome, Human, Genome, Plant, High-Throughput Nucleotide Sequencing, Humans, Internet, Pinaceae genetics, Sequence Analysis, DNA, Computational Biology methods, User-Computer Interface
- Abstract
Background: While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes., Results: Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8% and 13.8% of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively - a feat that is not possible with other leading tools with the breadth of data used in our study., Conclusion: Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.
- Published
- 2015
- Full Text
- View/download PDF
190. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.
- Author
-
Warren RL, Keeling CI, Yuen MM, Raymond A, Taylor GA, Vandervalk BP, Mohamadi H, Paulino D, Chiu R, Jackman SD, Robertson G, Yang C, Boyle B, Hoffmann M, Weigel D, Nelson DR, Ritland C, Isabel N, Jaquish B, Yanchuk A, Bousquet J, Jones SJ, MacKay J, Birol I, and Bohlmann J
- Subjects
- Alkyl and Aryl Transferases metabolism, Computational Biology, Cytochrome P-450 Enzyme System metabolism, Transcriptome, Genome, Plant, Multigene Family, Phenols metabolism, Picea genetics, Terpenes metabolism
- Abstract
White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation., (© 2015 The Authors The Plant Journal © 2015 John Wiley & Sons Ltd.)
- Published
- 2015
- Full Text
- View/download PDF
191. De novo Transcriptome Assemblies of Rana (Lithobates) catesbeiana and Xenopus laevis Tadpole Livers for Comparative Genomics without Reference Genomes.
- Author
-
Birol I, Behsaz B, Hammond SA, Kucuk E, Veldhoen N, and Helbing CC
- Subjects
- Animals, Gene Expression Profiling, Gene Expression Regulation, Developmental, Gene Ontology, High-Throughput Nucleotide Sequencing, Larva genetics, Molecular Sequence Annotation, RNA, Messenger genetics, RNA, Messenger metabolism, Reference Standards, Signal Transduction genetics, Genome, Genomics, Liver metabolism, Rana catesbeiana genetics, Transcriptome genetics, Xenopus laevis genetics
- Abstract
In this work we studied the liver transcriptomes of two frog species, the American bullfrog (Rana (Lithobates) catesbeiana) and the African clawed frog (Xenopus laevis). We used high throughput RNA sequencing (RNA-seq) data to assemble and annotate these transcriptomes, and compared how their baseline expression profiles change when tadpoles of the two species are exposed to thyroid hormone. We generated more than 1.5 billion RNA-seq reads in total for the two species under two conditions as treatment/control pairs. We de novo assembled these reads using Trans-ABySS to reconstruct reference transcriptomes, obtaining over 350,000 and 130,000 putative transcripts for R. catesbeiana and X. laevis, respectively. Using available genomics resources for X. laevis, we annotated over 97% of our X. laevis transcriptome contigs, demonstrating the utility and efficacy of our methodology. Leveraging this validated analysis pipeline, we also annotated the assembled R. catesbeiana transcriptome. We used the expression profiles of the annotated genes of the two species to examine the similarities and differences between the tadpole liver transcriptomes. We also compared the gene ontology terms of expressed genes to measure how the animals react to a challenge by thyroid hormone. Our study reports three main conclusions. First, de novo assembly of RNA-seq data is a powerful method for annotating and establishing transcriptomes of non-model organisms. Second, the liver transcriptomes of the two frog species, R. catesbeiana and X. laevis, show many common features, and the distributions of their gene ontology profiles are statistically indistinguishable. Third, although they broadly respond the same way to the presence of thyroid hormone in their environment, their receptor/signal transduction pathways display marked differences.
- Published
- 2015
- Full Text
- View/download PDF
192. UniqTag: Content-Derived Unique and Stable Identifiers for Gene Annotation.
- Author
-
Jackman SD, Bohlmann J, and Birol İ
- Subjects
- Humans, Genome, Human, Molecular Sequence Annotation methods, Sequence Analysis, DNA methods, Software
- Abstract
When working on an ongoing genome sequencing and assembly project, it is rather inconvenient when gene identifiers change from one build of the assembly to the next. The gene labelling system described here, UniqTag, addresses this common challenge. UniqTag assigns a unique identifier to each gene that is a representative k-mer, a string of length k, selected from the sequence of that gene. Unlike serial numbers, these identifiers are stable between different assemblies and annotations of the same data without requiring that previous annotations be lifted over by sequence alignment. We assign UniqTag identifiers to ten builds of the Ensembl human genome spanning eight years to demonstrate this stability. The implementation of UniqTag in Ruby and an R package are available at https://github.com/sjackman/uniqtag. The R package is also available from CRAN: install.packages("uniqtag"). Supplementary material and the code to reproduce it are available at https://github.com/sjackman/uniqtag-paper.
- Published
- 2015
- Full Text
- View/download PDF
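The idea in the UniqTag abstract (entry 192), an identifier derived from the gene's own sequence instead of a serial number, can be sketched as follows. The abstract only says that a "representative k-mer" is selected; the selection rule used here (the k-mer shared by the fewest genes, ties broken alphabetically) and the names `kmers` and `uniqtags` are assumptions made for illustration, not the published implementation.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def uniqtags(genes, k):
    """Map each gene name to one representative k-mer.

    Assumed rule (not stated in the abstract): choose the k-mer that
    occurs in the fewest genes, breaking ties lexicographically.
    """
    # Count in how many genes each k-mer appears.
    counts = Counter()
    for seq in genes.values():
        counts.update(kmers(seq, k))

    tags = {}
    for name, seq in genes.items():
        candidates = kmers(seq, k)
        if not candidates:
            # Sequence shorter than k: fall back to the whole sequence.
            tags[name] = seq
            continue
        tags[name] = min(candidates, key=lambda km: (counts[km], km))
    return tags


build_1 = {"g1": "ACGTACGA", "g2": "ACGTTTTT"}
# Same sequences, but the serial names changed between builds.
build_2 = {"gene_0007": "ACGTACGA", "gene_0042": "ACGTTTTT"}
tags_1 = uniqtags(build_1, k=4)
tags_2 = uniqtags(build_2, k=4)
```

Because the tag depends only on sequence content, the two builds yield the same set of tags even though the serial names differ.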
193. DIDA: Distributed Indexing Dispatched Alignment.
- Author
-
Mohamadi H, Vandervalk BP, Raymond A, Jackman SD, Chu J, Breshears CP, and Birol I
- Subjects
- Humans, Computational Biology methods, Databases, Genetic, Sequence Alignment methods, Software
- Abstract
One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is the sequence alignment problem, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When queries are too many and/or targets are too large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as in an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency License (BCCA), and is free for academic use.
- Published
- 2015
- Full Text
- View/download PDF
194. Reduced adenosine-to-inosine miR-455-5p editing promotes melanoma growth and metastasis.
- Author
-
Shoshan E, Mobley AK, Braeuer RR, Kamiya T, Huang L, Vasquez ME, Salameh A, Lee HJ, Kim SJ, Ivan C, Velazquez-Torres G, Nip KM, Zhu K, Brooks D, Jones SJ, Birol I, Mosqueda M, Wen YY, Eterovic AK, Sood AK, Hwu P, Gershenwald JE, Robertson AG, Calin GA, Markel G, Fidler IJ, and Bar-Eli M
- Subjects
- Adenosine Deaminase genetics, Adenosine Deaminase metabolism, Animals, Base Sequence, Cell Line, Tumor, Cyclic AMP Response Element-Binding Protein genetics, Cyclic AMP Response Element-Binding Protein metabolism, Disease Progression, Female, Genes, Reporter, Humans, Luciferases genetics, Luciferases metabolism, Melanoma metabolism, Melanoma pathology, Mice, Mice, Nude, MicroRNAs, Molecular Sequence Data, Neoplasm Metastasis, Neoplasm Transplantation, RNA-Binding Proteins genetics, RNA-Binding Proteins metabolism, Skin Neoplasms metabolism, Skin Neoplasms pathology, Transcription Factors genetics, Transcription Factors metabolism, mRNA Cleavage and Polyadenylation Factors genetics, mRNA Cleavage and Polyadenylation Factors metabolism, Adenosine metabolism, Gene Expression Regulation, Neoplastic, Inosine metabolism, Melanoma genetics, RNA Editing, Skin Neoplasms genetics
- Abstract
Although recent studies have shown that adenosine-to-inosine (A-to-I) RNA editing occurs in microRNAs (miRNAs), its effects on tumour growth and metastasis are not well understood. We present evidence of CREB-mediated low expression of ADAR1 in metastatic melanoma cell lines and tumour specimens. Re-expression of ADAR1 resulted in the suppression of melanoma growth and metastasis in vivo. Consequently, we identified three miRNAs undergoing A-to-I editing in the weakly metastatic melanoma but not in strongly metastatic cell lines. One of these miRNAs, miR-455-5p, has two A-to-I RNA-editing sites. The biological function of edited miR-455-5p is different from that of the unedited form, as it recognizes a different set of genes. Indeed, wild-type miR-455-5p promotes melanoma metastasis through inhibition of the tumour suppressor gene CPEB1. Moreover, wild-type miR-455 enhances melanoma growth and metastasis in vivo, whereas the edited form inhibits these features. These results demonstrate a previously unrecognized role for RNA editing in melanoma progression.
- Published
- 2015
- Full Text
- View/download PDF
195. Spaced Seed Data Structures for De Novo Assembly.
- Author
-
Birol I, Chu J, Mohamadi H, Jackman SD, Raghavan K, Vandervalk BP, Raymond A, and Warren RL
- Abstract
De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
- Published
- 2015
- Full Text
- View/download PDF
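The spaced seeds in entry 195 are described as paired subsequences separated by a fixed distance. A minimal sketch of extracting such pairs from a read and tallying their frequencies is given below. The function names are hypothetical, and an exact `Counter` stands in for the space-efficient structures the abstract proposes.

```python
from collections import Counter


def spaced_seed_pairs(read, k, gap):
    """Yield (left, right) k-mer pairs whose k-mers are `gap` bases apart."""
    span = 2 * k + gap  # total footprint of one seed pair on the read
    for i in range(len(read) - span + 1):
        left = read[i:i + k]
        right = read[i + k + gap:i + span]
        yield left, right


def count_spaced_seeds(reads, k, gap):
    """Count how often each seed pair is observed across all reads.

    An exact dictionary is used here for clarity; it is an assumption,
    not the probabilistic design described in the abstract.
    """
    counts = Counter()
    for read in reads:
        counts.update(spaced_seed_pairs(read, k, gap))
    return counts


pairs = list(spaced_seed_pairs("ACGTACGTAC", k=2, gap=3))
counts = count_spaced_seeds(["ACGTACGTAC", "ACGTACG"], k=2, gap=3)
```

The bases inside the gap are ignored, so a pair is still observed when a sequencing error falls between its two k-mers.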
196. Kleat: cleavage site analysis of transcriptomes.
- Author
-
Birol I, Raymond A, Chiu R, Nip KM, Jackman SD, Kreitzman M, Docking TR, Ennis CA, Robertson AG, and Karsan A
- Subjects
- 3' Untranslated Regions, Binding Sites, Cell Line, Computational Biology, Gene Library, Humans, ROC Curve, Sequence Alignment statistics & numerical data, Sequence Analysis, RNA statistics & numerical data, Transcriptome
- Abstract
In eukaryotic cells, alternative cleavage of 3' untranslated regions (UTRs) can affect transcript stability, transport and translation. For polyadenylated (poly(A)) transcripts, cleavage sites can be characterized with short-read sequencing using specialized library construction methods. However, for large-scale cohort studies as well as for clinical sequencing applications, it is desirable to characterize such events using RNA-seq data, as the latter are already widely applied to identify other relevant information, such as mutations, alternative splicing and chimeric transcripts. Here we describe KLEAT, an analysis tool that uses de novo assembly of RNA-seq data to characterize cleavage sites on 3' UTRs. We demonstrate the performance of KLEAT on three cell line RNA-seq libraries constructed and sequenced by the ENCODE project, and assembled using Trans-ABySS. Validating the KLEAT predictions with matched ENCODE RNA-seq and RNA-PET libraries, we show that the tool has over 90% positive predictive value when there are at least three RNA-seq reads supporting a poly(A) tail and requiring at least three RNA-PET reads mapping within 100 nucleotides as validation. We also compare the performance of KLEAT with other popular RNA-seq analysis pipelines that reconstruct 3' UTR ends, and show that it performs favourably, based on an ROC-like curve.
- Published
- 2015
197. Konnector v2.0: pseudo-long reads from paired-end sequencing data.
- Author
-
Vandervalk BP, Yang C, Xue Z, Raghavan K, Chu J, Mohamadi H, Jackman SD, Chiu R, Warren RL, and Birol I
- Subjects
- Algorithms, DNA chemistry, High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA methods, Software
- Abstract
Background: Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool., Results: Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences., Conclusions: Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
- Published
- 2015
- Full Text
- View/download PDF
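The approach outlined in entry 197 (store the k-mer spectrum in a Bloom filter, treat membership look-ups as an implicit de Bruijn graph, and walk that graph between two flanking sequences) might look roughly like the sketch below. The class `BloomFilter`, the function `bridge`, the hash choice, and the breadth-first search with a length cap are all illustrative assumptions and do not reproduce Konnector's actual code.

```python
import hashlib
from collections import deque


class BloomFilter:
    """A small Bloom filter over strings (illustrative, not Konnector's)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def bridge(bloom, left, right, k, max_len):
    """Search the implicit de Bruijn graph for a path joining left to right.

    Neighbours are found by querying the Bloom filter for the four possible
    one-base extensions. Returns the merged sequence, or None if no path of
    at most max_len bases is found.
    """
    start, goal = left[-k:], right[:k]
    queue = deque([(start, left)])
    seen = {start}
    while queue:
        kmer, path = queue.popleft()
        if kmer == goal:
            return path + right[k:]
        if len(path) >= max_len:
            continue
        for base in "ACGT":
            nxt = kmer[1:] + base  # shares a (k-1)-base overlap with kmer
            if nxt in bloom and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + base))
    return None


fragment = "ACGTTGCAAGGCTA"
k = 4
bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for i in range(len(fragment) - k + 1):
    bloom.add(fragment[i:i + k])

merged = bridge(bloom, left="ACGTT", right="GCTA", k=k, max_len=50)
```

No graph is stored explicitly: edges exist only as positive answers to membership queries, which is why the filter's false-positive rate matters for the correctness of the walk.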
198. BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters.
- Author
-
Chu J, Sadeghi S, Raymond A, Jackman SD, Nip KM, Mar R, Mohamadi H, Butterfield YS, Robertson AG, and Birol I
- Subjects
- Algorithms, Animals, Humans, Mice, Sequence Analysis, DNA methods, Software
- Abstract
Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements. Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools., (© The Author 2014. Published by Oxford University Press.)
- Published
- 2014
- Full Text
- View/download PDF
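Entry 198 notes that Bloom filters require false positives to be controlled. The textbook approximation for the false-positive rate of a Bloom filter with m bits, n inserted items, and h hash functions is (1 - e^(-hn/m))^h. The sketch below evaluates it; the formula is the standard one and is not taken from the BioBloom Tools abstract, and the function names are hypothetical.

```python
import math


def bloom_false_positive_rate(num_bits, num_items, num_hashes):
    """Standard approximation of the Bloom filter false-positive rate."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits, num_items):
    """Number of hash functions that minimises the rate: (m / n) * ln 2."""
    return max(1, round(num_bits / num_items * math.log(2)))


# Example sizing: 10 bits per stored k-mer.
n_items = 1_000_000
n_bits = 10 * n_items
h = optimal_num_hashes(n_bits, n_items)
fpr = bloom_false_positive_rate(n_bits, n_items, h)
```

With 10 bits per item the optimum is 7 hash functions, and the estimated rate is a little under 1%.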
199. Insights into conifer giga-genomes.
- Author
-
De La Torre AR, Birol I, Bousquet J, Ingvarsson PK, Jansson S, Jones SJ, Keeling CI, MacKay J, Nilsson O, Ritland K, Street N, Yanchuk A, Zerbe P, and Bohlmann J
- Subjects
- Forests, Genome, Plant genetics, Picea genetics, Pinus genetics, Tracheophyta genetics
- Abstract
Insights from sequenced genomes of major land plant lineages have advanced research in almost every aspect of plant biology. Until recently, however, assembled genome sequences of gymnosperms have been missing from this picture. Conifers of the pine family (Pinaceae) are a group of gymnosperms that dominate large parts of the world's forests. Despite their ecological and economic importance, conifers seemed long out of reach for complete genome sequencing, due in part to their enormous genome size (20-30 Gb) and the highly repetitive nature of their genomes. Technological advances in genome sequencing and assembly enabled the recent publication of three conifer genomes: white spruce (Picea glauca), Norway spruce (Picea abies), and loblolly pine (Pinus taeda). These genome sequences revealed distinctive features compared with other plant genomes and may represent a window into the past of seed plant genomes. This Update highlights recent advances, remaining challenges, and opportunities in light of the publication of the first conifer and gymnosperm genomes., (© 2014 American Society of Plant Biologists. All Rights Reserved.)
- Published
- 2014
- Full Text
- View/download PDF
200. JAGuaR: junction alignments to genome for RNA-seq reads.
- Author
-
Butterfield YS, Kreitzman M, Thiessen N, Corbett RD, Li Y, Pang J, Ma YP, Jones SJ, and Birol İ
- Subjects
- Algorithms, Animals, Base Sequence, Exons, Gene Expression Profiling methods, High-Throughput Nucleotide Sequencing, Humans, RNA Splicing genetics, RNA genetics, Sequence Alignment methods, Sequence Analysis, RNA methods, Software
- Abstract
JAGuaR is an alignment protocol for RNA-seq reads that uses an extended reference to increase alignment sensitivity. It uses BWA to align reads to the genome and reference transcript models (including annotated exon-exon junctions) specifically allowing for the possibility of a single read spanning multiple exons. Reads aligned to the transcript models are then re-mapped on to genomic coordinates, transforming alignments that span multiple exons into large-gapped alignments on the genome. While JAGuaR does not detect novel junctions, we demonstrate how JAGuaR generates fast and accurate transcriptome alignments, which allows for both sensitive and specific SNV calling.
- Published
- 2014
- Full Text
- View/download PDF
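The re-mapping step in entry 200, where a read aligned to a transcript model is converted into a large-gapped alignment on the genome, can be illustrated with a small coordinate conversion. The sketch assumes a plus-strand transcript, 0-based half-open exon intervals, and an ungapped read; the function name and the CIGAR-style output with `N` for introns are illustrative choices, not JAGuaR's actual interface.

```python
def transcript_to_genome(exons, t_start, read_len):
    """Convert a transcript-space alignment to a genomic start and CIGAR.

    exons    -- sorted list of (genomic_start, genomic_end), half-open
    t_start  -- 0-based start of the read on the spliced transcript
    read_len -- number of aligned bases (assumed to align without indels)
    """
    cigar = []
    g_start = None
    offset = t_start       # transcript bases still to skip before the read
    remaining = read_len   # read bases still to place
    prev_end = None
    for ex_start, ex_end in exons:
        ex_len = ex_end - ex_start
        if g_start is None:
            if offset >= ex_len:
                offset -= ex_len  # read starts in a later exon
                continue
            g_start = ex_start + offset
            take = min(remaining, ex_len - offset)
        else:
            # Crossing an exon-exon junction: the intron becomes an N gap.
            cigar.append(f"{ex_start - prev_end}N")
            take = min(remaining, ex_len)
        cigar.append(f"{take}M")
        remaining -= take
        prev_end = ex_end
        if remaining == 0:
            return g_start, "".join(cigar)
    raise ValueError("read extends past the end of the transcript")


exons = [(100, 150), (200, 260), (400, 450)]
spanning = transcript_to_genome(exons, t_start=40, read_len=80)
single = transcript_to_genome(exons, t_start=60, read_len=20)
```

A read that covers two junctions in transcript space comes back as one genomic alignment with two `N` gaps, while a read contained in a single exon keeps a plain match.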