26 results for "Alkan C"
Search Results
2. New Insights into Centromere Organization and Evolution from the White-Cheeked Gibbon and Marmoset
- Author
-
Cellamare A, Catacchio CR, Alkan C, Giannuzzi G, Antonacci F, Cardone MF, Della Valle G, Malig M, Rocchi M, Eichler EE, and Ventura M
- Subjects
- Primates, Centromere, Sequence Assembly, Alpha-Satellite DNA, Genome, Cell Line, Phylogenetics, Hylobates, Genetics, Animals, Humans, Molecular Biology, Centromere Evolution, New World Monkey, Marmoset, Callithrix, Biological Evolution, DNA
- Abstract
The evolutionary history of alpha-satellite DNA, the major component of primate centromeres, remains poorly defined because of the difficulty of its sequence assembly and its rapid evolution compared with most genomic sequences. By using several approaches, we have cloned, sequenced, and characterized alpha-satellite sequences from two species representing critical nodes in the primate phylogeny: the white-cheeked gibbon, a lesser ape, and the marmoset, a New World monkey. Sequence analyses demonstrate that white-cheeked gibbon and marmoset alpha-satellite sequences are formed by units of approximately 171 and 342 bp, respectively, and that both lack the higher-order structure found in humans and great apes. Fluorescent in situ hybridization characterization shows a broad dispersal of alpha-satellite in the white-cheeked gibbon genome, including centromeric, telomeric, and interstitial chromosomal localizations. On the other hand, centromeres in marmoset appear organized in highly divergent dimers of roughly 342 bp whose monomers are much less similar to each other than in previously reported dimers, thus representing an ancient dimeric structure. All these data shed light on the evolution of centromeric sequences in Primates. Our results suggest radical differences in the structure, organization, and evolution of alpha-satellite DNA among different primate species, supporting the notions that 1) all the centromeric sequence in Primates evolved by genomic amplification, unequal crossover, and sequence homogenization using a 171 bp monomer as the basic seeding unit and 2) centromeric function is linked to relatively short repeated elements rather than to higher-order structure. Moreover, our data indicate that complex higher-order repeat structures are a peculiarity of the hominid lineage, with the most complex organization found in humans.
- Published
- 2009
3. Identification of protein-protein interaction bridges for multiple sclerosis.
- Author
-
Yazıcı G, Kurt Vatandaslar B, Aydin Canturk I, Aydinli FI, Arici Duz O, Karakoc E, Kerman BE, and Alkan C
- Subjects
- Humans, Leukocytes, Mononuclear, Oligodendroglia physiology, Neurons, Myelin Sheath, Multiple Sclerosis
- Abstract
Motivation: Identifying and prioritizing disease-related proteins is an important scientific problem for developing proper treatments, and network science has become an important discipline for prioritizing such proteins. Multiple sclerosis, an autoimmune disease for which there is still no cure, is characterized by a damaging process called demyelination: the destruction, by immune cells, of myelin, a structure facilitating fast transmission of neuron impulses, and of oligodendrocytes, the cells producing myelin. Identifying the proteins that have special features in the network formed by the proteins of oligodendrocytes and immune cells can reveal useful information about the disease., Results: Using network analysis techniques and integer programming, we investigated the most significant protein pairs, which we define as bridges among the proteins mediating the interaction between the two cell types in demyelination, in the networks formed by the oligodendrocyte and each of two immune cell types (i.e. macrophage and T-cell). The reason we investigated these specialized hubs is that a problem related to these proteins might impose bigger damage on the system. We showed that 61%-100% of the proteins our model detected, depending on parameterization, have already been associated with multiple sclerosis. We further observed that the mRNA expression levels of several proteins we prioritized were significantly decreased in human peripheral blood mononuclear cells of multiple sclerosis patients. We therefore present a model, BriFin, which can be used for analyzing processes where interactions of two cell types play an important role., Availability and Implementation: BriFin is available at https://github.com/BilkentCompGen/brifin., (© The Author(s) 2023. Published by Oxford University Press.)
- Published
- 2023
- Full Text
- View/download PDF
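The bridge notion above can be illustrated with a small, hypothetical sketch (protein names and edge lists are made up; BriFin's actual model additionally uses integer programming to prioritize candidates):

```python
# Toy illustration of "bridge" proteins connecting two cell-type
# interaction networks. All data below is hypothetical.

def find_bridges(intra_a, intra_b, cross_edges):
    """Return proteins of network A that both interact within A and
    have at least one interaction crossing into network B."""
    connected_in_a = {p for edge in intra_a for p in edge}
    bridges = set()
    for a_protein, b_protein in cross_edges:
        if a_protein in connected_in_a:
            bridges.add(a_protein)
    return bridges

# Oligodendrocyte-side network (hypothetical protein names)
intra_olig = [("MBP", "PLP1"), ("PLP1", "MOG")]
# Macrophage-side network
intra_mac = [("TNF", "IL1B")]
# Interactions spanning the two cell types
cross = [("MOG", "TNF"), ("CD47", "SIRPA")]

print(find_bridges(intra_olig, intra_mac, cross))  # {'MOG'}
```

A problem in such a bridge protein could disrupt communication between the two cell types, which is why these nodes are worth prioritizing.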
4. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis.
- Author
-
Firtina C, Park J, Alser M, Kim JS, Cali DS, Shahroodi T, Ghiasi NM, Singh G, Kanellopoulos K, Alkan C, and Mutlu O
- Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar ones. Finding only exact-matching seeds causes either (i) increased use of costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher-quality overlaps, leading to more accurate de novo assemblies, than the state-of-the-art tool minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND., (© The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.)
- Published
- 2023
- Full Text
- View/download PDF
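The SimHash idea at BLEND's core can be sketched over a seed's k-mer set. This toy uses blake2b and 32-bit fingerprints, both arbitrary choices; BLEND's actual seed construction and parameters differ:

```python
import hashlib

def kmers(seq, k):
    """The set of k-mers of a sequence (a stand-in for a seed's set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def simhash(items, bits=32):
    """SimHash: similar input sets yield identical or near-identical
    fingerprints, unlike conventional hashing, which scatters them."""
    counts = [0] * bits
    for item in items:
        h = int.from_bytes(
            hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # Each fingerprint bit is the majority vote of that bit over all items.
    return sum(1 << b for b in range(bits) if counts[b] > 0)

a = simhash(kmers("ACGTACGTACGTACGT", 4))
b = simhash(kmers("ACGTACGTACGTACGA", 4))  # one substitution
c = simhash(kmers("TTTTGGGGCCCCAAAA", 4))
# Hamming distance between fingerprints approximates set similarity
print(bin(a ^ b).count("1"), bin(a ^ c).count("1"))
```

Because similar sets vote the same way on most bits, highly similar seeds can collide to the same hash value, enabling the single-lookup fuzzy matching the abstract describes.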
5. FastRemap: a tool for quickly remapping reads between genome assemblies.
- Author
-
Kim JS, Firtina C, Cavlak MB, Senol Cali D, Alkan C, and Mutlu O
- Subjects
- Sequence Analysis, DNA methods, Genomics methods, Genome, High-Throughput Nucleotide Sequencing methods, Software
- Abstract
Motivation: A genome read dataset can be quickly and efficiently remapped from one reference to another similar reference (e.g., between two reference versions or two similar species) using a variety of tools, e.g., the commonly used CrossMap tool. With the explosion of available genomic datasets and references, high-performance remapping tools will be even more important for keeping up with the computational demands of genome assembly and analysis., Results: We provide FastRemap, a fast and efficient tool for remapping reads between genome assemblies. FastRemap provides up to a 7.82× speedup (6.47×, on average) and uses as little as 61.7% (80.7%, on average) of the peak memory of the state-of-the-art remapping tool, CrossMap., Availability and Implementation: FastRemap is written in C++. Source code and user manual are freely available at: github.com/CMU-SAFARI/FastRemap. Docker image available at: https://hub.docker.com/r/alkanlab/fastremap. Also available in Bioconda at: https://anaconda.org/bioconda/fastremap-bio., (© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2022
- Full Text
- View/download PDF
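The core remapping operation can be sketched as a coordinate liftover, assuming a precomputed chain of aligned blocks between the two assemblies (FastRemap itself remaps full BAM/SAM records; the chain and positions here are hypothetical):

```python
# Minimal sketch of coordinate remapping between two assemblies,
# given a chain of aligned blocks (old_start, new_start, length).

def remap(pos, chain):
    """Map a 0-based position from the source to the target assembly,
    or return None if it falls in an unaligned gap."""
    for old_start, new_start, length in chain:
        if old_start <= pos < old_start + length:
            return new_start + (pos - old_start)
    return None

# Hypothetical chain: 50 bp inserted in the target assembly at position 100
chain = [(0, 0, 100), (100, 150, 200)]
print(remap(50, chain), remap(120, chain), remap(300, chain))  # 50 170 None
```

A production tool would index the chain (e.g. an interval tree) instead of scanning it linearly, and would also carry over flags, CIGAR strings, and mate information.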
6. SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs.
- Author
-
Alser M, Shahroodi T, Gómez-Luna J, Alkan C, and Mutlu O
- Abstract
Motivation: We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem to SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs., Results: SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers's bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities., Availability and Implementation: https://github.com/CMU-SAFARI/SneakySnake., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2020. 
Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
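A much-simplified sketch of the obstacle-counting intuition: take the longest exact match available within ±e diagonals, count each blocked column as an obstacle, and dismiss the pair once more than e obstacles are unavoidable. The real SneakySnake solves the SNR problem exactly on a grid layout; this greedy toy only illustrates the dismissal logic:

```python
def passes_filter(read, ref, e):
    """Greedy sketch of the SneakySnake idea: if crossing the grid
    requires more than e obstacle columns, edit distance must exceed e,
    so the costly alignment step can be skipped."""
    n, obstacles, i = len(read), 0, 0
    while i < n:
        best = 0
        for s in range(-e, e + 1):          # candidate diagonals (shifts)
            run = 0
            while (i + run < n and 0 <= i + run + s < len(ref)
                   and read[i + run] == ref[i + run + s]):
                run += 1
            best = max(best, run)
        i += best
        if i < n:                            # an obstacle blocks every diagonal
            obstacles += 1
            i += 1                           # step over the obstacle column
            if obstacles > e:
                return False
    return True

print(passes_filter("ACGTACGT", "ACGTACGT", 0))  # True
print(passes_filter("ACGAACGT", "ACGTACGT", 0))  # False: 1 mismatch, e=0
print(passes_filter("ACGAACGT", "ACGTACGT", 1))  # True
```

Since the filter only dismisses pairs whose edit distance provably exceeds the threshold, the downstream aligner keeps full control over scoring.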
7. Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm.
- Author
-
Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, Alkan C, and Mutlu O
- Subjects
- High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA, Technology, Algorithms, Software
- Abstract
Motivation: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or only a small assembly. These technology and assembly-size dependencies require researchers to (i) run multiple polishing algorithms to use all available read sets and (ii) polish large genomes in small chunks, respectively., Results: We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. 
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts., Availability and Implementation: Source code is available at https://github.com/CMU-SAFARI/Apollo., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2020
- Full Text
- View/download PDF
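Apollo's decoding step is standard Viterbi over the trained pHMM. Below is a self-contained toy decoder with hypothetical states and probabilities, not Apollo's actual model (which has match/insertion/deletion states per assembly base):

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Minimal Viterbi decoder in log space: returns the most likely
    state path for an observation sequence under an HMM."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev, score = max(
                ((p, V[-1][p] + math.log(trans[p][s])) for p in states),
                key=lambda x: x[1])
            col[s] = score + math.log(emit[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):       # trace the best path backwards
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical 2-state model: does an assembly base agree with the reads?
states = ("match", "error")
start = {"match": 0.9, "error": 0.1}
trans = {"match": {"match": 0.9, "error": 0.1},
         "error": {"match": 0.8, "error": 0.2}}
emit = {"match": {"agree": 0.95, "disagree": 0.05},
        "error": {"agree": 0.1, "disagree": 0.9}}
obs = ["agree", "agree", "disagree", "agree"]
print(viterbi(obs, states, start, trans, emit))
# ['match', 'match', 'error', 'match']
```

In the polishing setting, decoding an "error" state would trigger a correction at that assembly position.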
8. Shouji: a fast and efficient pre-alignment filter for sequence alignment.
- Author
-
Alser M, Hassan H, Kumar A, Mutlu O, and Alkan C
- Subjects
- Algorithms, Genome, Sequence Alignment, Sequence Analysis, DNA, Software Design, Software
- Abstract
Motivation: The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm., Results: Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step., Availability and Implementation: https://github.com/CMU-SAFARI/Shouji., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2019. 
Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2019
- Full Text
- View/download PDF
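The common-subsequence intuition behind Shouji can be sketched under strong simplifying assumptions (fixed non-overlapping windows, plain substring lookup); the real Shouji operates bit-parallel on a neighborhood map in FPGA logic:

```python
def shouji_like_filter(read, ref, e, w=4):
    """Rough sketch of pre-alignment filtering via common subsequences:
    a read window sharing no w-mer with the corresponding reference
    neighborhood suggests at least one edit. Dismiss the pair when
    such windows exceed the edit threshold e."""
    mismatched_windows = 0
    for i in range(0, len(read) - w + 1, w):
        window = read[i:i + w]
        # The neighborhood widens by e to tolerate indel-induced shifts.
        neighborhood = ref[max(0, i - e): i + w + e]
        if window not in neighborhood:
            mismatched_windows += 1
    return mismatched_windows <= e

print(shouji_like_filter("ACGTACGT", "ACGTACGT", 0))  # True
print(shouji_like_filter("ACGTTCGT", "ACGTACGT", 0))  # False
print(shouji_like_filter("ACGTTCGT", "ACGTACGT", 1))  # True
```

As with the other filters in this list, the decision is one-sided: a pair that passes still goes through full alignment, so no aligner capability is lost.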
9. Discovery of tandem and interspersed segmental duplications using high-throughput sequencing.
- Author
-
Soylev A, Le TM, Amini H, Alkan C, and Hormozdiari F
- Subjects
- Algorithms, Genome, Human, Genomics, Humans, Software, High-Throughput Nucleotide Sequencing, Segmental Duplications, Genomic
- Abstract
Motivation: Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most of the existing approaches focus on detecting relatively simple types of SVs, such as insertions, deletions and short inversions. In fact, complex SVs are of crucial importance, and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, due to similar sequencing signatures, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions; likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve calling accuracy of simpler variants., Results: We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing datasets. We integrated these methods into our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, using 30× coverage, TARDIS achieved 96% sensitivity with only a 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state-of-the-art methods. 
Furthermore, we showed a surprisingly low false discovery rate of our approach for the prediction of tandem, direct and inverted interspersed segmental duplications on CHM1 (<5% for the top 50 predictions)., Availability and Implementation: TARDIS source code is available at https://github.com/BilkentCompGen/tardis, and a corresponding Docker image is available at https://hub.docker.com/r/alkanlab/tardis/., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2019
- Full Text
- View/download PDF
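One of the sequence signatures TARDIS combines, read-pair orientation and span, can be illustrated with a toy classifier. The orientation codes and insert-size range below are illustrative textbook interpretations, not TARDIS's actual thresholds or logic:

```python
# Hypothetical read-pair signature classification. A mapped pair's
# mate orientation and distance hint at the SV type spanning it.

EXPECTED_MIN, EXPECTED_MAX = 200, 600  # nominal insert-size range

def classify_pair(orientation, span):
    if orientation == "+-":               # normal inward-facing pair
        if span > EXPECTED_MAX:
            return "deletion"             # mates map too far apart
        if span < EXPECTED_MIN:
            return "insertion"            # mates map too close together
        return "concordant"
    if orientation in ("++", "--"):
        return "inversion"                # one mate flipped
    if orientation == "-+":
        return "tandem duplication"       # everted (outward-facing) pair
    return "unknown"

print(classify_pair("+-", 400))   # concordant
print(classify_pair("+-", 5000))  # deletion
print(classify_pair("++", 400))   # inversion
print(classify_pair("-+", 400))   # tandem duplication
```

Because duplications and inversions leave overlapping signatures, a real caller such as TARDIS must corroborate read-pair evidence with read depth and split reads, which is exactly the ambiguity the abstract describes.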
10. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions.
- Author
-
Senol Cali D, Kim JS, Ghose S, Alkan C, and Mutlu O
- Subjects
- Animals, Chromosome Mapping, Computational Biology, Escherichia coli genetics, Genome, Bacterial, Genomics statistics & numerical data, Genomics trends, Humans, Nanopore Sequencing statistics & numerical data, Nanopore Sequencing trends, Sequence Analysis, DNA, Software, Genomics methods, Nanopore Sequencing methods
- Abstract
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, guided by the bottlenecks we have identified, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of nanopore sequencing technology., (© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2019
- Full Text
- View/download PDF
11. Hercules: a profile HMM-based hybrid error correction algorithm for long reads.
- Author
-
Firtina C, Bar-Joseph Z, Alkan C, and Cicek AE
- Subjects
- Humans, Reproducibility of Results, Algorithms, Computational Biology methods, High-Throughput Nucleotide Sequencing methods, Machine Learning, Software
- Abstract
Choosing whether to use second- or third-generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads; in such cases, researchers often combine both technologies, and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph- or alignment-based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of the two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile hidden Markov model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS dataset, Hercules is one of the few scalable algorithms, and among those, it achieves the highest accuracy.
- Published
- 2018
- Full Text
- View/download PDF
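Hercules trains its per-read pHMM with Forward-Backward. The forward pass alone, on a hypothetical two-state model, looks like this (states and probabilities are made up for illustration, not Hercules' learned parameters):

```python
def forward(obs, states, start, trans, emit):
    """Forward algorithm: total likelihood of an observation sequence
    under an HMM, summing over all possible state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical 2-state toy: is a long-read base correct or erroneous?
states = ("correct", "erroneous")
start = {"correct": 0.85, "erroneous": 0.15}
trans = {"correct": {"correct": 0.9, "erroneous": 0.1},
         "erroneous": {"correct": 0.7, "erroneous": 0.3}}
emit = {"correct": {"A": 0.97, "B": 0.03},
        "erroneous": {"A": 0.25, "B": 0.75}}
print(forward(["A", "A", "B"], states, start, trans, emit))
```

The backward pass is symmetric; combining the two gives the posterior transition/emission distributions the abstract mentions. A real implementation would also rescale (or work in log space) to avoid underflow on long reads.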
12. Fast characterization of segmental duplications in genome assemblies.
- Author
-
Numanagic I, Gökkaya AS, Zhang L, Berger B, Alkan C, and Hach F
- Subjects
- Algorithms, Genomics, Humans, Genome, Segmental Duplications, Genomic
- Abstract
Motivation: Segmental duplications (SDs), or low-copy repeats, are segments of DNA >1 kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution and a common cause of genomic structural variation, and several are associated with diseases of genomic origin, including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes of various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and architecture of genomes. Despite the essential need to accurately characterize SDs in assemblies, only one tool has been developed for this purpose, Whole-Genome Assembly Comparison (WGAC), whose primary goal is SD detection. WGAC comprises several steps that employ different tools and custom scripts, which makes this strategy difficult and time-consuming to use. Thus, there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user-friendly manner., Results: Here we introduce the SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining a substantial speedup over WGAC, translating into practical run times of minutes instead of weeks. 
Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome., Availability and Implementation: SEDEF is available at https://github.com/vpc-ccg/sedef.
- Published
- 2018
- Full Text
- View/download PDF
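The Jaccard-based filtering idea can be sketched over k-mer sets (k and the example segments are arbitrary; SEDEF's real filtering stage additionally uses sketching and local chaining to stay fast at genome scale):

```python
def kmer_set(seq, k=11):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |A∩B| / |A∪B| between two k-mer sets: a cheap
    proxy for sequence identity, used to flag candidate duplicated
    segments before more expensive alignment."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if (a or b) else 1.0

seg1 = "ACGTACGTGGTTACGTACGTAACCGGTT"
seg2 = "ACGTACGTGGTTACGTACGTAACCGGTA"  # near-identical copy of seg1
seg3 = "TTTTCCCCGGGGAAAATTTTCCCCGGGG"  # unrelated segment
print(jaccard(kmer_set(seg1), kmer_set(seg2)) >
      jaccard(kmer_set(seg1), kmer_set(seg3)))  # True
```

High-Jaccard pairs become SD candidates; tolerating lower similarity is what lets an approach like this capture the larger "pairwise error" between old duplication copies mentioned above.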
13. GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping.
- Author
-
Alser M, Hassan H, Xin H, Ergin O, Mutlu O, and Alkan C
- Subjects
- Algorithms, Genome, Human, Humans, Sequence Alignment methods, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments, called short reads, that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and 'candidate' locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms., Results: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. 
The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10., Availability and Implementation: https://github.com/BilkentCompGen/GateKeeper., Contact: mohammedalser@bilkent.edu.tr or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com)
- Published
- 2017
- Full Text
- View/download PDF
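The Shifted Hamming Distance filtering that GateKeeper accelerates in hardware can be approximated in software using Python integers as bit-vectors. The acceptance rule below is a simplification (the published SHD additionally post-processes the combined mask to amend short match streaks), meant only to show the bit-parallel idea:

```python
def mismatch_mask(read, ref, shift):
    """Bit-vector with bit i set where read[i] != ref[i + shift]."""
    mask = 0
    for i in range(len(read)):
        j = i + shift
        if not (0 <= j < len(ref)) or read[i] != ref[j]:
            mask |= 1 << i
    return mask

def shd_accept(read, ref, e):
    """SHD idea: AND the mismatch masks of all shifts in [-e, e]. A
    position survives only if NO shift can explain it. Accept if the
    surviving mismatches could still be covered by e edits."""
    combined = mismatch_mask(read, ref, 0)
    for s in range(1, e + 1):
        combined &= mismatch_mask(read, ref, s)
        combined &= mismatch_mask(read, ref, -s)
    return bin(combined).count("1") <= e   # simplified acceptance rule

print(shd_accept("ACGTACGT", "ACGTACGT", 0))  # True
print(shd_accept("ACGTACGA", "ACGTACGT", 1))  # True: one substitution
print(shd_accept("AAAAAAAA", "CCCCCCCC", 2))  # False
```

In hardware, each mask is computed for all positions in parallel in a few cycles, which is where the two-orders-of-magnitude speedup comes from.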
14. Discovery and genotyping of novel sequence insertions in many sequenced individuals.
- Author
-
Kavak P, Lin YY, Numanagic I, Asghari H, Güngör T, Alkan C, and Hach F
- Subjects
- Algorithms, Genomics methods, High-Throughput Nucleotide Sequencing methods, Humans, Genome, Human, Genomic Structural Variation, Genotyping Techniques methods, INDEL Mutation, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: Despite recent advances in algorithm design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is due both to computational difficulties and to the complexities imposed by genomic repeats in generating reliable assemblies that can accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects., Results: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects., Availability and Implementation: Pamir is available at https://github.com/vpc-ccg/pamir., Contact: fhach@{sfu.ca, prostatecentre.com } or calkan@cs.bilkent.edu.tr., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com)
- Published
- 2017
- Full Text
- View/download PDF
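One ingredient of the approach, grouping one-end-anchored (OEA) pairs by the mapped mate's position before local assembly, can be sketched as follows (the record format, window size, and positions are hypothetical, not Pamir's actual data structures):

```python
from collections import defaultdict

# Sketch: cluster OEA pairs -- pairs where one mate maps to the
# reference and the other does not -- by the mapped mate's position,
# so nearby unmapped mates can be assembled together into the
# inserted sequence.

def cluster_oea(pairs, window=1000):
    clusters = defaultdict(list)
    for mapped_pos, unmapped_seq in pairs:
        clusters[mapped_pos // window].append(unmapped_seq)
    return dict(clusters)

pairs = [(10_120, "ACGT..."), (10_480, "CGTA..."), (55_300, "TTAA...")]
print(cluster_oea(pairs))
# two clusters: mates anchored near 10 kbp, and one near 55 kbp
```

Each cluster's unmapped mates would then feed a small local assembly whose contig is placed at the anchored breakpoint.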
15. On genomic repeats and reproducibility.
- Author
-
Firtina C and Alkan C
- Subjects
- Genome, Reproducibility of Results, Sequence Analysis, DNA, Genomics, High-Throughput Nucleotide Sequencing
- Abstract
Results: Here, we present a comprehensive analysis of the reproducibility of computational characterization of genomic variants using high throughput sequencing data. We reanalyzed the same datasets twice, using the same tools with the same parameters, altering only the order of reads in the input (i.e. the FASTQ file). Reshuffling caused reads from repetitive regions to be mapped to different locations in the second alignment, and we observed similar results when we only applied a scatter/gather approach for read mapping, without prior shuffling. Our results show that some of the most common variation discovery algorithms do not handle ambiguous read mappings accurately when random locations are selected. In addition, we observed that even when the exact same alignment is used, the GATK HaplotypeCaller generates slightly different call sets, which we pinpoint to the variant filtration step. We conclude that algorithms at each step of genomic variation discovery and characterization need to treat ambiguous mappings in a deterministic fashion to ensure full replication of results., Availability and Implementation: Code, scripts and the generated VCF files are available at DOI:10.5281/zenodo.32611., Contact: calkan@cs.bilkent.edu.tr, Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2016
16. Optimal seed solver: optimizing seed selection in read mapping.
- Author
-
Xin H, Nahar S, Zhu R, Emmons J, Pekhimenko G, Kingsford C, Alkan C, and Mutlu O
- Subjects
- Algorithms
- Abstract
Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines its sensitivity, while the total frequency of all selected seeds determines its speed. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or an inflexible placement scheme, both of which limit the mapper's ability to select less frequent seeds and thereby speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both individual seed lengths and seed placement, as well as derive less frequent seeds., Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently occurring set of x seeds in an L-base-pair read in [Formula: see text] operations on average and in [Formula: see text] operations in the worst case, while generating a maximum of [Formula: see text] seed frequency database lookups. We compare OSS against four state-of-the-art seed selection schemes and observe that OSS provides a 3-fold reduction in average seed frequency over the best previous seed selection optimizations., Availability and Implementation: We provide an implementation of the Optimal Seed Solver in C++ at: https://github.com/CMU-SAFARI/Optimal-Seed-Solver, Contact: hxin@cmu.edu, calkan@cs.bilkent.edu.tr or onur@cmu.edu, Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
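The flavour of the underlying dynamic program can be sketched as follows (a plain, unoptimized formulation with invented names, not the average-case-optimized solver from the paper): choose x non-overlapping, variable-length seeds of a read so that their summed database frequency is minimal.

```python
from functools import lru_cache

def optimal_seeds(read, x, freq, min_len=2):
    """Select x non-overlapping seeds of `read`, each at least min_len bases,
    minimizing the summed frequency freq(seed). Returns (cost, seeds)."""
    L = len(read)

    @lru_cache(maxsize=None)
    def best(i, j):
        # cheapest way to place j seeds entirely within read[i:]
        if j == 0:
            return (0, ())
        best_cost, best_seeds = float("inf"), None
        for s in range(i, L - min_len + 1):       # seed start
            for e in range(s + min_len, L + 1):   # seed end (exclusive)
                rest_cost, rest = best(e, j - 1)
                if rest is None:                  # no room for remaining seeds
                    continue
                cost = freq(read[s:e]) + rest_cost
                if cost < best_cost:
                    best_cost, best_seeds = cost, (read[s:e],) + rest
        return (best_cost, best_seeds)

    return best(0, x)
```

Here `freq` stands in for a seed frequency database lookup; below it simply counts occurrences in a toy reference string.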
- Published
- 2016
17. Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping.
- Author
-
Xin H, Greth J, Emmons J, Pekhimenko G, Kingsford C, Alkan C, and Mutlu O
- Subjects
- Algorithms, Base Sequence, Humans, Molecular Sequence Data, Sequence Homology, Nucleic Acid, Computational Biology methods, Sequence Alignment methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: Calculating the edit-distance (i.e. minimum number of insertions, deletions and substitutions) between short DNA sequences is the primary task performed by seed-and-extend based mappers, which compare billions of sequences. In practice, only sequence pairs with a small edit-distance provide useful scientific data. However, the majority of sequence pairs analyzed by seed-and-extend based mappers differ by significantly more errors than typically allowed. Such error-abundant sequence pairs needlessly waste resources and severely hinder the performance of read mappers. Therefore, it is crucial to develop a fast and accurate filter that can rapidly and efficiently detect error-abundant string pairs and remove them from consideration before more computationally expensive methods are used., Results: We present a simple and efficient algorithm, Shifted Hamming Distance (SHD), which accelerates the alignment verification procedure in read mapping by quickly filtering out error-abundant sequence pairs using bit-parallel and SIMD-parallel operations. SHD only filters string pairs that contain more errors than a user-defined threshold, making it fully comprehensive. It also maintains high accuracy at moderate error thresholds (up to 5% of the string length) while achieving a 3-fold speedup over the best previous algorithm (Gene Myers's bit-vector algorithm). SHD is compatible with all mappers that perform sequence alignment for verification., (© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
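The core of the filter can be sketched with Python integers standing in for SIMD bit-vectors (this keeps only the central idea; the published SHD adds mask amendment and vectorized shifts, and the function names here are invented):

```python
def shifted_mask(read, ref, s):
    # Bit i is set where read[i] mismatches ref[i+s], or falls off the reference.
    m = 0
    for i in range(len(read)):
        j = i + s
        if j < 0 or j >= len(ref) or read[i] != ref[j]:
            m |= 1 << i
    return m

def shd_passes(read, ref, e):
    """AND the mismatch masks of the read against the reference under every
    shift in [-e, +e]; a position that matches under some shift becomes 0.
    If at most e ones survive, the pair may be within edit distance e and is
    forwarded to full alignment; otherwise it is filtered out.  Since ANDing
    can only lower the count, true positives are never rejected."""
    combined = (1 << len(read)) - 1
    for s in range(-e, e + 1):
        combined &= shifted_mask(read, ref, s)
    return bin(combined).count("1") <= e
```

Each surviving 1-bit marks a position that matches under no shift, so the popcount lower-bounds the number of edits.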
- Published
- 2015
18. Annotated features of domestic cat - Felis catus genome.
- Author
-
Tamazian G, Simonov S, Dobrynin P, Makunin A, Logachev A, Komissarov A, Shevchenko A, Brukhin V, Cherkasov N, Svitin A, Koepfli KP, Pontius J, Driscoll CA, Blackistone K, Barr C, Goldman D, Antunes A, Quilez J, Lorente-Galdos B, Alkan C, Marques-Bonet T, Menotti-Raymond M, David VA, Narfström K, and O'Brien SJ
- Abstract
Background: Domestic cats enjoy extensive veterinary medical surveillance, which has described nearly 250 genetic diseases analogous to human disorders. Feline infectious agents offer powerful natural models of deadly human diseases, including feline immunodeficiency virus, feline sarcoma virus and feline leukemia virus. A rich veterinary literature of feline disease pathogenesis and the demonstration of a highly conserved ancestral mammal genome organization make the cat genome annotation a highly informative resource that facilitates multifaceted research endeavors., Findings: Here we report a preliminary annotation of the whole genome sequence of Cinnamon, a domestic cat living in Columbia (MO, USA), bisulfite sequencing of Boris, a male cat from St. Petersburg (Russia), and light 30× sequencing of Sylvester, a European wildcat progenitor of cat domestication. The annotation includes 21,865 protein-coding genes identified by a comparative approach, 217 loci of endogenous retrovirus-like elements, repetitive elements comprising about 55.7% of the whole genome, 99,494 new SNVs, 8,355 new indels, 743,326 evolutionarily constrained elements, and 3,182 microRNA homologues. Analysis of methylation sites shows that 10.5% of cat genome cytosines are methylated. An assisted assembly of a European wildcat, Felis silvestris silvestris, was performed; variants between the F. silvestris and F. catus genomes were derived and compared to F. catus., Conclusions: The presented genome annotation extends beyond earlier ones by closing sequence gaps that were unavoidable with previous low-coverage shotgun genome sequencing. The assembly and its annotation offer an important resource for connecting the rich veterinary and natural history of cats to genome discovery.
- Published
- 2014
19. mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications.
- Author
-
Hach F, Sarrafi I, Hormozdiari F, Alkan C, Eichler EE, and Sahinalp SC
- Subjects
- Genome, Human, Humans, Internet, Sequence Alignment, High-Throughput Nucleotide Sequencing methods, Polymorphism, Single Nucleotide, Software
- Abstract
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis, such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves on mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact, the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user-specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure rather than as post-processing steps, and are thus performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to recently enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, roughly half that of Bowtie2. mrsFAST-Ultra is open source and can be accessed at http://mrsfast.sourceforge.net., (© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.)
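The multi-mapping behaviour described above, reporting all loci within an error threshold rather than one 'best' hit, can be sketched with a plain k-mer hash index (a toy stand-in with invented names; the real aligner adds SNP-awareness and a cache-oblivious index layout):

```python
from collections import defaultdict

def build_index(ref, k):
    # Hash every reference k-mer to the list of positions where it occurs.
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def map_all(read, ref, index, k, max_mismatch):
    """Report ALL loci where the read aligns with at most max_mismatch
    substitutions; downstream SV callers rely on having every locus, not
    just the best one.  Seed with the read's first k-mer, then verify each
    candidate locus by direct comparison."""
    hits = []
    for pos in index.get(read[:k], []):
        cand = ref[pos:pos + len(read)]
        if len(cand) == len(read):
            mm = sum(1 for a, b in zip(read, cand) if a != b)
            if mm <= max_mismatch:
                hits.append(pos)
    return hits
```

Seeding with a single k-mer misses alignments whose first k bases contain an error; production mappers seed from several read positions to stay fully sensitive.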
- Published
- 2014
20. SCALCE: boosting sequence compression algorithms using locally consistent encoding.
- Author
-
Hach F, Numanagic I, Alkan C, and Sahinalp SC
- Subjects
- Computational Biology methods, Genome, Genomics methods, High-Throughput Nucleotide Sequencing, Humans, Pseudomonas aeruginosa genetics, Sequence Alignment, Algorithms, Data Compression methods, Software
- Abstract
Motivation: High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated worldwide. Currently, most HTS data are compressed with general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, namely its limited alphabet size and the high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis, provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a 'boosting' scheme based on the Locally Consistent Parsing technique, which reorganizes the reads in a way that yields a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome., Results: Our tests indicate that SCALCE can improve the compression rate achieved by gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE-reordered reads, gzip running time improves by a factor of 15.06 on a standard PC with a single core and 6 GB of memory. Interestingly, even the running time of SCALCE + gzip improves on that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order to improve bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores and the read names in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, using the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to a 3.34-fold improvement in compression rate and a 1.26-fold improvement in running time., Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when the gzip option is selected and the pigz binary is available. It is available at http://scalce.sourceforge.net., Contact: fhach@cs.sfu.ca or cenk@cs.sfu.ca, Supplementary Information: Supplementary data are available at Bioinformatics online.
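The boosting idea, reordering reads so that similar ones become neighbours before a general-purpose compressor runs, can be sketched as follows. As a stand-in for SCALCE's Locally Consistent Parsing cores, this toy version buckets reads by their lexicographically smallest k-mer (a minimizer); the function names are invented for illustration.

```python
import zlib

def reorder_by_core(reads, k=4):
    """Group reads sharing a 'core' substring so that near-duplicate reads
    become adjacent, which lets a downstream compressor (gzip/zlib here)
    find long repeats within its window.  Lossless as a set: the output is
    a permutation of the input."""
    def core(r):
        return min(r[i:i + k] for i in range(len(r) - k + 1))
    return sorted(reads, key=core)

def compress(reads):
    # One read per line, then a generic compressor; SCALCE itself is
    # compressor-agnostic in exactly this way.
    return zlib.compress("\n".join(reads).encode())
```

If the original read order matters (e.g. to keep mates paired), a real tool must store the permutation alongside the compressed stream.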
- Published
- 2012
21. Sensitive and fast mapping of di-base encoded reads.
- Author
-
Hormozdiari F, Hach F, Sahinalp SC, Eichler EE, and Alkan C
- Subjects
- Algorithms, Base Sequence, Chromosome Mapping, Chromosomes, Human, Pair 1, Genetic Variation, Genome, Humans, Nucleotides, Polymorphism, Single Nucleotide, Software, High-Throughput Nucleotide Sequencing methods
- Abstract
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed., Results: We present drFAST, a read mapper designed for di-base encoded 'color-space' sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie., Availability: The source code for drFAST is available at http://drfast.sourceforge.net
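The di-base ('color-space') encoding the mapper targets is easy to demonstrate: each color encodes the transition between two adjacent bases (equivalently, the XOR of their 2-bit codes), so a read is stored as a known primer base plus a color string. A true SNP changes two adjacent colors while a single sequencing error changes only one, which is what makes color-space reads attractive for variant discovery. (Small illustration with invented function names, not drFAST code.)

```python
BASES = "ACGT"  # 2-bit codes A=0, C=1, G=2, T=3; color = code(a) XOR code(b)

def encode(seq, primer="T"):
    # Prefix the known primer base, then emit one color per adjacent base pair.
    full = primer + seq
    return primer + "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(full, full[1:]))

def decode(colorspace):
    # Walk the color string forward from the primer base.
    base, out = colorspace[0], []
    for c in colorspace[1:]:
        base = BASES[BASES.index(base) ^ int(c)]
        out.append(base)
    return "".join(out)
```

Note that decoding is sequential: one uncorrected color error corrupts every downstream base, which is why color-space mappers align in color space rather than decoding first.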
- Published
- 2011
22. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery.
- Author
-
Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler EE, and Sahinalp SC
- Subjects
- Genome, Algorithms, DNA Transposable Elements, Genomics methods
- Abstract
Unlabelled: Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association with human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a class of SVs of great importance to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both the loci and the classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present 'conflict resolution' improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted VariationHunter-CR) outperforms state-of-the-art algorithms (including the original VariationHunter, BreakDancer and MoDIL) when tested on the genome of a Yoruba African individual (NA18507)., Availability: The implementation of the algorithm is available at http://compbio.cs.sfu.ca/strvar.htm., Supplementary Information: Supplementary data are available at Bioinformatics online.
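The combinatorial objective, explaining all discordant read pairs with as few SV events as possible, can be sketched with a plain greedy set-cover pass (a simplified stand-in for the paper's maximum parsimony formulation; the cluster names and data layout are made up for illustration):

```python
def min_sv_events(clusters, pairs):
    """clusters: dict mapping a candidate SV event to the set of discordant
    read-pair ids it would explain.  Greedily pick the event explaining the
    most still-unexplained pairs until every pair is covered (or no event
    helps).  Greedy set cover is only an approximation of the parsimony
    optimum, but it shows the objective's shape."""
    remaining = set(pairs)
    chosen = []
    while remaining:
        best = max(clusters, key=lambda c: len(clusters[c] & remaining))
        gain = clusters[best] & remaining
        if not gain:            # leftover pairs no candidate event explains
            break
        chosen.append(best)
        remaining -= gain
    return chosen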
- Published
- 2010
23. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing.
- Author
-
Hajirasouliha I, Hormozdiari F, Alkan C, Kidd JM, Birol I, Eichler EE, and Sahinalp SC
- Subjects
- Databases, Genetic, Genetic Variation, Genome, Human, Humans, Models, Genetic, Mutagenesis, Insertional, Polymorphism, Single Nucleotide, Genomics methods, Sequence Analysis, DNA methods
- Abstract
Motivation: In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies have been published to characterize short insertions, deletions, duplications and inversions, and to associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data; however, the 'detectable' sequence length in read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched., Results: We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.); thus it requires significantly less computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing them with the insertions discovered in the same genome using various sources of sequence data., Availability: The implementation of the NovelSeq pipeline is available at http://compbio.cs.sfu.ca/strvar.htm, Contact: eee@gs.washington.edu; cenk@cs.sfu.ca
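Why the insert size caps what read-pair analysis alone can see is worth one small worked example (a toy illustration, not a function from the NovelSeq pipeline): mates straddling a short insertion map closer together than the library insert size, so the span deficit estimates the insertion length.

```python
def estimate_insertion_length(spans, mean_insert):
    """Read-pair signature of a short novel insertion: for mates straddling
    it, observed span = insert size minus insertion length, so the insertion
    length is roughly mean_insert minus the mean observed span.  Once an
    insertion grows longer than the insert size, one mate falls entirely
    inside the novel sequence and goes unmapped, and spans alone can no
    longer size it; that is the regime NovelSeq addresses with
    one-end-anchored clustering plus assembly of unmapped reads."""
    return mean_insert - sum(spans) / len(spans)
```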
- Published
- 2010
24. New insights into centromere organization and evolution from the white-cheeked gibbon and marmoset.
- Author
-
Cellamare A, Catacchio CR, Alkan C, Giannuzzi G, Antonacci F, Cardone MF, Della Valle G, Malig M, Rocchi M, Eichler EE, and Ventura M
- Subjects
- Animals, Cell Line, Humans, Primates genetics, Biological Evolution, Callithrix genetics, Centromere genetics, Hylobates genetics
- Abstract
The evolutionary history of alpha-satellite DNA, the major component of primate centromeres, is poorly defined because of the difficulty of its sequence assembly and its rapid evolution compared with most genomic sequences. Using several approaches, we have cloned, sequenced, and characterized alpha-satellite sequences from two species representing critical nodes in the primate phylogeny: the white-cheeked gibbon, a lesser ape, and the marmoset, a New World monkey. Sequence analyses demonstrate that white-cheeked gibbon and marmoset alpha-satellite sequences are formed by units of approximately 171 and approximately 342 bp, respectively, and that both lack the higher-order structure found in humans and great apes. Fluorescence in situ hybridization characterization shows a broad dispersal of alpha-satellite in the white-cheeked gibbon genome, including centromeric, telomeric, and chromosomal interstitial localizations. On the other hand, centromeres in the marmoset appear organized in highly divergent dimers of roughly 342 bp that show a similarity between monomers much lower than previously reported dimers, thus representing an ancient dimeric structure. All these data shed light on the evolution of centromeric sequences in Primates. Our results suggest radical differences in the structure, organization, and evolution of alpha-satellite DNA among different primate species, supporting the notion that 1) all centromeric sequence in Primates evolved by genomic amplification, unequal crossover, and sequence homogenization using a 171 bp monomer as the basic seeding unit and 2) centromeric function is linked to relatively short repeated elements more than to higher-order structure. Moreover, our data indicate that complex higher-order repeat structures are a peculiarity of the hominid lineage, with the most complex organization in humans.
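The monomer-versus-dimer distinction at the heart of this analysis can be probed with a tiny periodicity check (a toy stand-in, not the alignment-based analysis of the paper): slide the sequence against itself and report the shift with the highest self-identity.

```python
def monomer_length(seq, min_p, max_p):
    """Estimate the repeat unit length of a satellite sequence as the
    smallest shift p at which seq best matches seq shifted by p.  On a
    homogeneous satellite this peaks at the monomer length (~171 bp for
    human alpha satellite); on a diverged dimeric structure like the
    marmoset's, identity at the monomer shift is low and the peak moves to
    the dimer length (~342 bp).  Values here are illustrative only."""
    def identity(p):
        pairs = list(zip(seq, seq[p:]))
        return sum(a == b for a, b in pairs) / len(pairs)
    top = min(max_p, len(seq) // 2)
    return max(range(min_p, top + 1), key=identity)
```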
- Published
- 2009
25. taveRNA: a web suite for RNA algorithms and applications.
- Author
-
Aksay C, Salari R, Karakoc E, Alkan C, and Sahinalp SC
- Subjects
- Algorithms, Base Sequence, Computer Simulation, Humans, Models, Statistical, Molecular Sequence Data, Programming Languages, Computational Biology methods, Internet, Models, Chemical, Nucleic Acid Conformation, RNA chemistry, Sequence Alignment methods, Sequence Analysis, RNA methods, Software
- Abstract
We present taveRNA, a web server package that hosts three RNA web services: alteRNA, inteRNA and pRuNA. alteRNA is a new alternative for RNA secondary structure prediction. It is based on a dynamic programming solution that minimizes the sum of the energy density and free energy of an RNA structure. inteRNA is the first RNA-RNA interaction structure prediction web service. It also employs a dynamic programming algorithm, minimizing the free energy of the resulting joint structure of the two interacting RNAs. Lastly, pRuNA is an efficient database pruning service which, given a query RNA, eliminates a significant portion of an ncRNA database and returns only a few ncRNAs as potential regulators. taveRNA is available at http://compbio.cs.sfu.ca/taverna.
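The dynamic-programming style these services build on is easiest to see in the classic Nussinov algorithm, which maximizes the number of nested base pairs; it is a deliberately simplified stand-in for the energy(-density) minimization that alteRNA actually performs.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested pairs (Watson-Crick plus GU wobble) in an RNA
    sequence, with hairpin loops forced to span at least min_loop unpaired
    bases.  dp[i][j] = best count within seq[i..j]; position j is either
    unpaired or paired with some k, which splits the interval in two."""
    pair = {("A", "U"), ("U", "A"), ("C", "G"),
            ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # j unpaired
            for k in range(i, j - min_loop):         # j pairs with k
                if (seq[k], seq[j]) in pair:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

Energy-based predictors replace the pair count with thermodynamic loop terms, but keep exactly this interval-splitting recurrence.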
- Published
- 2007
26. Manipulating multiple sequence alignments via MaM and WebMaM.
- Author
-
Alkan C, Tüzün E, Buard J, Lethiec F, Eichler EE, Bailey JA, and Sahinalp SC
- Subjects
- Exons, Internet, Phylogeny, Repetitive Sequences, Nucleic Acid, Genomics methods, Sequence Alignment methods, Software
- Abstract
MaM is a software tool that processes and manipulates multiple alignments of genomic sequence. MaM computes the exact locations of common repeat elements, exons and unique regions within aligned genomic sequences using a variety of user-specified programs, databases and/or tables. The program can extract the subalignments corresponding to these various regions of DNA, to be analyzed independently or in conjunction with other elements of genomic DNA. Graphical displays further allow an assessment of sequence variation throughout these different regions of the aligned sequence, providing separate displays for the repeat, non-repeat and coding portions of genomic DNA. The program should facilitate the phylogenetic analysis and processing of different portions of genomic sequence as part of large-scale sequencing efforts. MaM source code is freely available for non-commercial use at http://compbio.cs.sfu.ca/MAM.htm; the web interface, WebMaM, is hosted at http://atgc.lirmm.fr/mam.
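The core operation, slicing a multiple alignment column-wise into region-specific subalignments, can be sketched in one function (a minimal illustration with an invented name; MaM derives the column sets from RepeatMasker and annotation output rather than taking them directly):

```python
def extract_subalignment(alignment, keep_cols):
    """alignment: list of equal-length aligned rows (gaps as '-').
    keep_cols: alignment column indices belonging to one region class
    (e.g. the non-repeat columns).  Returns the subalignment restricted to
    those columns, preserving row order so downstream phylogenetic tools
    see the same taxa."""
    return ["".join(row[c] for c in keep_cols) for row in alignment]
```

Because columns are selected by alignment coordinate rather than sequence coordinate, gap characters stay aligned across rows in the extracted region.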
- Published
- 2005