34 results for "Marçais G"
Search Results
2. Honaïn
- Author
-
Marçais, G.
- Subjects
- Région, Oran
- Abstract
On the coast of Oranie, 52 kilometres from the Moroccan border, a bay cuts into the shore, its steep surrounding heights giving it the look of a fjord. A beach occupies its head and, behind the beach, the walls of a town can be traced climbing the slopes of the mountain. This town, or rather this shell of a town emptied of its contents, is Honaïn, whose name appears time and again in the chroniclers and geographers of the Middle Ages. No ancient settlement appears to have preceded it...
- Published
- 2012
3. Honaïn
- Author
-
Marçais, G., primary
- Published
- 2000
- Full Text
- View/download PDF
4. Parsimonious reconstruction of network evolution
- Author
-
Patro Rob, Sefer Emre, Malin Justin, Marçais Guillaume, Navlakha Saket, and Kingsford Carl
- Subjects
- Network evolution, Parsimony, Ancestral network reconstruction, Interaction networks, Regulatory networks, Biology (General), QH301-705.5, Genetics, QH426-470
- Abstract
Background: Understanding the evolution of biological networks can provide insight into how their modular structure arises and how they are affected by environmental changes. One approach to studying the evolution of these networks is to reconstruct plausible common ancestors of present-day networks, allowing us to analyze how the topological properties change over time and to posit mechanisms that drive the networks’ evolution. Further, putative ancestral networks can be used to help solve other difficult problems in computational biology, such as network alignment. Results: We introduce a combinatorial framework for encoding network histories, and we give a fast procedure that, given a set of gene duplication histories, in practice finds network histories with close to the minimum number of interaction gain or loss events to explain the observed present-day networks. In contrast to previous studies, our method does not require knowing the relative ordering of unrelated duplication events. Results on simulated histories and real biological networks both suggest that common ancestral networks can be accurately reconstructed using this parsimony approach. A software package implementing our method is available under the Apache 2.0 license at http://cbcb.umd.edu/kingsford-group/parana. Conclusions: Our parsimony-based approach to ancestral network reconstruction is both efficient and accurate. We show that considering a larger set of potential ancestral interactions by not assuming a relative ordering of unrelated duplication events can lead to improved ancestral network inference.
- Published
- 2012
- Full Text
- View/download PDF
5. k-nonical space: sketching with reverse complements.
- Author
-
Marçais G, Elder CS, and Kingsford C
- Subjects
- Computational Biology methods, Genomics methods, Sequence Analysis, DNA methods, DNA chemistry, DNA genetics, Humans, Algorithms
- Abstract
Motivation: Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space)., Results: The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space., Availability and Implementation: The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope., (© The Author(s) 2024. Published by Oxford University Press.)
- Published
- 2024
- Full Text
- View/download PDF
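The canonical-representation step described in entry 5 can be made concrete with a short sketch (a generic Python illustration, not the paper's code; the sampling rule and all names below are toy placeholders):

```python
# Minimal illustration of the canonical representation ("k-nonical space"):
# a k-mer and its reverse complement are folded into a single representative
# before a context-free sampling rule is applied. Generic sketch, not the
# paper's code; the sampling rule below is a toy placeholder.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    """Reverse complement of a DNA string."""
    return s.translate(COMP)[::-1]

def canonical(kmer: str) -> str:
    """The lexicographically smaller of a k-mer and its reverse complement."""
    rc = revcomp(kmer)
    return kmer if kmer <= rc else rc

def sketch(seq: str, k: int, keep) -> list:
    """Context-free sketching: keep position i iff keep(canonical k-mer at i).
    Applying the rule to canonical k-mers rather than to raw k-mers is the
    post-processing step whose side effects the paper studies."""
    return [i for i in range(len(seq) - k + 1)
            if keep(canonical(seq[i:i + k]))]

if __name__ == "__main__":
    toy_rule = lambda km: km.startswith(("A", "C"))   # placeholder sampling rule
    print(sketch("ACGTTGCATGTCGCATGATGCATGAGAGCT", 5, toy_rule))
```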
6. Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets.
- Author
-
Marçais G, DeBlasio D, and Kingsford C
- Subjects
- Sequence Alignment methods, Humans, Sequence Analysis, DNA methods, Algorithms, Software, Computational Biology methods
- Abstract
Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
- Published
- 2024
- Full Text
- View/download PDF
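To make the decycling-set property in entry 6 concrete, the brute-force check below tests whether removing a k-mer set leaves the de Bruijn graph acyclic (illustrative only, feasible just for tiny k; this is a check of the definition, not the paper's enumeration method):

```python
from itertools import product

# Brute-force check of the decycling property for small k: a set D of k-mers is
# a decycling set iff the de Bruijn graph restricted to the k-mers NOT in D is
# acyclic, so that every long enough sequence must contain a k-mer from D.

ALPHA = "ACGT"

def is_decycling(dset: set, k: int) -> bool:
    nodes = ["".join(p) for p in product(ALPHA, repeat=k) if "".join(p) not in dset]
    node_set = set(nodes)

    def successors(u):
        # Edge u -> v in the de Bruijn graph whenever v = u[1:] + c.
        return [u[1:] + c for c in ALPHA if u[1:] + c in node_set]

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in nodes}
    for start in nodes:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(successors(start)))]
        while stack:
            u, it = stack[-1]
            advanced = False
            for v in it:
                if color[v] == GRAY:        # back edge: a cycle avoids the set
                    return False
                if color[v] == WHITE:
                    color[v] = GRAY
                    stack.append((v, iter(successors(v))))
                    advanced = True
                    break
            if not advanced:
                color[u] = BLACK
                stack.pop()
    return True

if __name__ == "__main__":
    # Self-loop k-mers alone are not enough: the cycle AC -> CA -> AC avoids them.
    print(is_decycling({"AA", "CC", "GG", "TT"}, k=2))                    # False
    # Adding one 2-mer from each 2-cycle gives a decycling set of size 10.
    print(is_decycling({"AA", "CC", "GG", "TT",
                        "AC", "AG", "AT", "CG", "CT", "GT"}, k=2))        # True
```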
7. Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme.
- Author
-
Hoang M, Marçais G, and Kingsford C
- Subjects
- Humans, Genome, Human, Algorithms, Genomics methods
- Abstract
Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
- Published
- 2024
- Full Text
- View/download PDF
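The density and conservation metrics discussed in entry 7 can be estimated by simulation. The sketch below uses a plain random-order minimizer and a simple "fraction of selected positions that survive mutation" proxy; it is not the paper's masked-minimizer scheme, and the proxy is an assumption of ours, not the paper's exact conservation definition:

```python
import random

# Rough simulation of sketch robustness to substitutions: mutate a sequence at
# rate p and measure what fraction of the originally selected minimizer
# positions are still selected afterwards. Proxy metric only.

rng = random.Random(3)

def minimizer_positions(seq, k, w, order):
    m = len(seq) - k + 1
    vals = [order.setdefault(seq[i:i + k], rng.random()) for i in range(m)]
    return {min(range(s, s + w), key=lambda i: (vals[i], i))
            for s in range(m - w + 1)}

def mutate(seq, p):
    return "".join(rng.choice([b for b in "ACGT" if b != c]) if rng.random() < p else c
                   for c in seq)

if __name__ == "__main__":
    seq = "".join(rng.choice("ACGT") for _ in range(50_000))
    order = {}                     # one shared random order over all k-mers
    before = minimizer_positions(seq, k=15, w=10, order=order)
    after = minimizer_positions(mutate(seq, p=0.02), k=15, w=10, order=order)
    print(f"density      {len(before) / (len(seq) - 14):.4f}")
    print(f"conservation {len(before & after) / len(before):.4f}")
```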
8. Creating and Using Minimizer Sketches in Computational Genomics.
- Author
-
Zheng H, Marçais G, and Kingsford C
- Subjects
- Sequence Analysis, DNA methods, High-Throughput Nucleotide Sequencing methods, Software, Algorithms, Genomics methods
- Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
- Published
- 2023
- Full Text
- View/download PDF
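For reference, the classical minimizer construction reviewed in entry 8 fits in a few lines. This is the textbook (w, k)-minimizer with a hash standing in for a random order; the tools surveyed differ mainly in the order they use, and the names below are ours:

```python
import hashlib

# Classical (w, k)-minimizer sketch: in every window of w consecutive k-mers,
# select the k-mer with the smallest value under a fixed order (ties broken by
# leftmost position).

def order(kmer: str) -> int:
    """Pseudo-random order on k-mers via a hash (stands in for a random permutation)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def minimizer_positions(seq: str, k: int, w: int) -> list:
    """Sorted start positions of the selected k-mers."""
    n = len(seq) - k + 1                              # number of k-mers
    keys = [order(seq[i:i + k]) for i in range(n)]
    picked = set()
    for start in range(n - w + 1):                    # slide a window of w k-mers
        picked.add(min(range(start, start + w), key=lambda i: (keys[i], i)))
    return sorted(picked)

if __name__ == "__main__":
    w = 4
    pos = minimizer_positions("ACGTTGCATGTCGCATGATGCATGAGAGCT", k=5, w=w)
    print(pos)
    # Window guarantee: every w consecutive k-mers contain a selected one,
    # so consecutive selected positions are at most w apart.
    print(all(b - a <= w for a, b in zip(pos, pos[1:])))
```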
9. Sketching methods with small window guarantee using minimum decycling sets.
- Author
-
Marçais G, DeBlasio D, and Kingsford C
- Abstract
Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a Decycling Set, an unavoidable set of k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope., Competing Interests: Conflict of Interest: C.K. is a co-founder of Ocean Genomics, Inc; G.M. is VP of software engineering at Ocean Genomics, Inc.
- Published
- 2023
10. Large scale sequence alignment via efficient inference in generative models.
- Author
-
Mongia M, Shen C, Davoodi AG, Marçais G, and Mohimani H
- Subjects
- Sequence Alignment, Computational Biology methods, Probability, Sequence Analysis, DNA methods, Software, High-Throughput Nucleotide Sequencing, Algorithms, Genome
- Abstract
Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long-reads from Pacific Bioscience sequencers to genome sequences., (© 2023. The Author(s).)
- Published
- 2023
- Full Text
- View/download PDF
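A toy version of the log-likelihood-ratio score from entry 10, restricted to substitutions only. The paper's model also handles insertions and deletions and adds a bucketing scheme to avoid scoring all pairs; the uniform-base independent model and the names below are simplifying assumptions of ours:

```python
import math

# Toy log-likelihood ratio between a "jointly generated" model and an
# "independent" model for two equal-length reads, substitutions only.

def llr(x: str, y: str, mu: float = 0.1) -> float:
    """log P(x, y | joint, per-base mutation rate mu) - log P(x) P(y | uniform bases)."""
    assert len(x) == len(y)
    match = math.log(4 * (1 - mu))          # joint (1/4)(1-mu) vs independent 1/16
    mismatch = math.log(4 * mu / 3)         # joint (1/4)(mu/3) vs independent 1/16
    return sum(match if a == b else mismatch for a, b in zip(x, y))

if __name__ == "__main__":
    query = "ACGTACGTAC"
    related = "ACGTACGAAC"     # one substitution
    unrelated = "TTTGGCATCA"
    print(round(llr(query, related), 2))     # positive: likely a true match
    print(round(llr(query, unrelated), 2))   # negative: likely unrelated
```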
11. Sequence-specific minimizers via polar sets.
- Author
-
Zheng H, Kingsford C, and Marçais G
- Subjects
- Genome, Human, Genomics, Humans, Sequence Analysis, DNA, Algorithms, Software
- Abstract
Motivation: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences., Results: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers., Availability and Implementation: A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
12. HARVESTMAN: a framework for hierarchical feature learning and selection from whole genome sequencing data.
- Author
-
Frisby TS, Baker SJ, Marçais G, Hoang QM, Kingsford C, and Langmead CJ
- Subjects
- Genome, Genomics, Humans, Breast Neoplasms genetics, Deep Learning, Whole Genome Sequencing
- Abstract
Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present HARVESTMAN, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building., Results: We demonstrate that HARVESTMAN scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that HARVESTMAN selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare HARVESTMAN to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining accuracy of the resulting classifier., Conclusion: HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, HARVESTMAN is faster and selects features more parsimoniously.
- Published
- 2021
- Full Text
- View/download PDF
13. Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length.
- Author
-
Zheng H, Kingsford C, and Marçais G
- Subjects
- Algorithms, Computational Biology trends, Humans, Genome, Human genetics, High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA, Software
- Abstract
Universal hitting sets (UHS) are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between UHS and minimizer schemes, where minimizer schemes with low density (i.e., efficient schemes) correspond to UHS of small size. Local schemes are a generalization of minimizer schemes that can be used as a replacement for minimizer schemes, with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a UHS. We give bounds for the remaining path length of the Mykkeltveit UHS. In addition, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound.
- Published
- 2021
- Full Text
- View/download PDF
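The universal hitting set property central to entry 13 can be checked by brute force for tiny k and L, which makes the definition concrete (illustration only; practical UHS constructions work at much larger parameters):

```python
from itertools import product

# Brute-force check of the universal hitting set (UHS) property: a set U of
# k-mers is a UHS for (k, L) if every possible L-long string over the alphabet
# contains at least one k-mer from U. Only feasible for tiny k and L.

ALPHA = "ACGT"

def is_uhs(uset: set, k: int, L: int) -> bool:
    for word in product(ALPHA, repeat=L):
        s = "".join(word)
        if not any(s[i:i + k] in uset for i in range(L - k + 1)):
            return False          # found an L-long sequence that avoids U
    return True

if __name__ == "__main__":
    # All 16 2-mers trivially form a UHS for any L >= 2.
    all_2mers = {a + b for a in ALPHA for b in ALPHA}
    print(is_uhs(all_2mers, k=2, L=4))                    # True
    # Dropping "AA" breaks the property: the sequence "AAAA" avoids the set.
    print(is_uhs(all_2mers - {"AA"}, k=2, L=4))           # False
```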
14. Improved design and analysis of practical minimizers.
- Author
-
Zheng H, Kingsford C, and Marçais G
- Subjects
- Sequence Analysis, DNA, Algorithms, Software
- Abstract
Motivation: Minimizers are methods to sample k-mers from a string, with the guarantee that similar sets of k-mers will be chosen on similar strings. A minimizer scheme is parameterized by the k-mer length k, a window length w and an order on the k-mers. Minimizers are used in a large number of software tools and pipelines to improve computation efficiency and decrease memory usage. Despite the method's popularity, many theoretical questions regarding its performance remain open. The core metric for measuring performance of a minimizer is the density, which measures the sparsity of sampled k-mers. The theoretical optimal density for a minimizer is 1/w, provably not achievable in general. For given k and w, little is known about asymptotically optimal minimizers, that is, minimizers with density O(1/w)., Results: We derive a necessary and sufficient condition for existence of asymptotically optimal minimizers. We also provide a randomized algorithm, called the Miniception, to design minimizers with the best theoretical guarantee to date on density in practical scenarios. Constructing and using the Miniception is as easy as constructing and using a random minimizer, which allows the design of efficient minimizers that scale to the values of k and w used in current bioinformatics software programs., Availability and Implementation: Reference implementation of the Miniception and the codes for analysis can be found at https://github.com/kingsford-group/miniception., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2020. Published by Oxford University Press.)
- Published
- 2020
- Full Text
- View/download PDF
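The density metric that entry 14 optimizes can be estimated empirically for a plain random minimizer and compared with the roughly 2/(w+1) expectation and the 1/w lower bound mentioned in the literature (simulation sketch; the Miniception construction itself is not implemented here, and the parameter values are arbitrary):

```python
import random

# Estimate the density (fraction of k-mer positions selected) of a random-order
# minimizer on a random sequence and compare it with ~2/(w+1) and 1/w.

def random_minimizer_density(n: int, k: int, w: int, seed: int = 1) -> float:
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n))
    m = n - k + 1
    order = {}                                   # random key per distinct k-mer
    vals = [order.setdefault(seq[i:i + k], rng.random()) for i in range(m)]
    picked = {min(range(s, s + w), key=lambda i: (vals[i], i))
              for s in range(m - w + 1)}
    return len(picked) / m

if __name__ == "__main__":
    w = 10
    d = random_minimizer_density(n=200_000, k=21, w=w)
    print(f"empirical density   {d:.4f}")
    print(f"2/(w+1) expectation {2 / (w + 1):.4f}")
    print(f"1/w lower bound     {1 / w:.4f}")
```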
15. Locality-sensitive hashing for the edit distance.
- Author
-
Marçais G, DeBlasio D, Pandey P, and Kingsford C
- Subjects
- Software, Algorithms, Sequence Alignment
- Abstract
Motivation: Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy., Results: We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH., Availability and Implementation: The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2019. Published by Oxford University Press.)
- Published
- 2019
- Full Text
- View/download PDF
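For context on entry 15, the order-insensitive minHash estimator of the Jaccard similarity, which OMH refines by also tracking the relative order of k-mers, looks like this (standard textbook construction with our own helper names; this is not the OMH scheme itself):

```python
import hashlib

# Classical minHash estimate of the Jaccard similarity between the k-mer sets
# of two sequences, using one hash function per repetition.

def kmers(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(rep: int, kmer: str) -> int:
    return int.from_bytes(hashlib.blake2b(f"{rep}:{kmer}".encode(), digest_size=8).digest(), "big")

def minhash_sketch(seq: str, k: int, reps: int = 128) -> list:
    ks = kmers(seq, k)
    return [min(h(r, km) for km in ks) for r in range(reps)]

def jaccard_estimate(s1: list, s2: list) -> float:
    """Fraction of repetitions where the two minima agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

if __name__ == "__main__":
    a = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
    b = "ACGTTGCATGTCGCATGATGCATGAGAGGT"   # one substitution near the end
    sa, sb = minhash_sketch(a, 9), minhash_sketch(b, 9)
    print(jaccard_estimate(sa, sb))
    # True Jaccard of the 9-mer sets, for comparison:
    ka, kb = kmers(a, 9), kmers(b, 9)
    print(len(ka & kb) / len(ka | kb))
```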
16. Asymptotically optimal minimizers schemes.
- Author
-
Marçais G, DeBlasio D, and Kingsford C
- Subjects
- Algorithms, Computational Biology methods
- Abstract
Motivation: The minimizers technique is a method to sample k-mers that is used in many bioinformatics software tools to reduce computation, memory usage and run time. The number of applications using minimizers keeps on growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizers schemes, and the related local and forward schemes, will allow designing schemes with lower density and thereby making existing and future bioinformatics tools even more efficient., Results: From the analysis of the asymptotic behavior of minimizers, forward and local schemes, we show that the previously believed lower bound on minimizers schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the three types of schemes.
- Published
- 2018
- Full Text
- View/download PDF
17. MUMmer4: A fast and versatile genome alignment system.
- Author
-
Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, and Zimin A
- Subjects
- Algorithms, Animals, Arabidopsis genetics, Genome, Human, Genome, Plant, Genomics, Humans, Models, Theoretical, Pan troglodytes, Polymorphism, Single Nucleotide, Programming Languages, Sequence Analysis, DNA, Sequence Analysis, Protein, Computational Biology methods, Sequence Alignment methods, Software
- Abstract
The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
- Published
- 2018
- Full Text
- View/download PDF
18. Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing.
- Author
-
Orenstein Y, Pellow D, Marçais G, Shamir R, and Kingsford C
- Subjects
- Animals, Caenorhabditis elegans genetics, Computer Heuristics, Humans, Algorithms, Computational Biology methods, Genome, Bacterial, Genome, Human, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software
- Abstract
With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
- Published
- 2017
- Full Text
- View/download PDF
19. Improving the performance of minimizers and winnowing schemes.
- Author
-
Marçais G, Pellow D, Bork D, Orenstein Y, Shamir R, and Kingsford C
- Subjects
- Algorithms, Humans, Genome, Human, Genomics methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues., Results: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worst behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence., Availability and Implementation: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git., Contact: gmarcais@cs.cmu.edu or carlk@cs.cmu.edu., (© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com)
- Published
- 2017
- Full Text
- View/download PDF
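The effect of the k-mer order analyzed in entry 19 can be seen in a small experiment comparing the lexicographic order with a randomized order on the same random sequence (simulation sketch only; the UHS-based orders recommended by the paper are not implemented here):

```python
import random

# Compare the density of minimizers under the lexicographic k-mer order versus
# a randomized order on the same random sequence.

def density(seq: str, k: int, w: int, key) -> float:
    m = len(seq) - k + 1
    vals = [key(seq[i:i + k]) for i in range(m)]
    picked = {min(range(s, s + w), key=lambda i: (vals[i], i))
              for s in range(m - w + 1)}
    return len(picked) / m

if __name__ == "__main__":
    rng = random.Random(7)
    seq = "".join(rng.choice("ACGT") for _ in range(200_000))
    k, w = 7, 10
    lex = density(seq, k, w, key=lambda km: km)              # lexicographic order
    rand_keys = {}
    rnd = density(seq, k, w,
                  key=lambda km: rand_keys.setdefault(km, rng.random()))
    print(f"lexicographic order density {lex:.4f}")
    print(f"randomized order density    {rnd:.4f}  (~2/(w+1) = {2 / (w + 1):.4f})")
```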
20. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm.
- Author
-
Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, and Salzberg SL
- Subjects
- Contig Mapping standards, Genome Size, Genomics standards, Sequence Analysis, DNA standards, Contig Mapping methods, Genome, Plant, Genomics methods, Poaceae genetics, Repetitive Sequences, Nucleic Acid, Sequence Analysis, DNA methods, Software
- Abstract
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy., (© 2017 Zimin et al.; Published by Cold Spring Harbor Laboratory Press.)
- Published
- 2017
- Full Text
- View/download PDF
21. Sequence of the Sugar Pine Megagenome.
- Author
-
Stevens KA, Wegrzyn JL, Zimin A, Puiu D, Crepeau M, Cardeno C, Paul R, Gonzalez-Ibeas D, Koriabine M, Holtz-Morris AE, Martínez-García PJ, Sezen UU, Marçais G, Jermstad K, McGuire PE, Loopstra CA, Davis JM, Eckert A, de Jong P, Yorke JA, Salzberg SL, Neale DB, and Langley CH
- Subjects
- Basidiomycota pathogenicity, DNA Transposable Elements, Genetic Variation, Genome Size, Pinus immunology, Pinus microbiology, Plant Immunity genetics, Genome, Plant, Pinus genetics
- Abstract
Until very recently, complete characterization of the megagenomes of conifers has remained elusive. The diploid genome of sugar pine (Pinus lambertiana Dougl.) has a highly repetitive, 31 billion bp genome. It is the largest genome sequenced and assembled to date, and the first from the subgenus Strobus, or white pines, a group that is notable for having the largest genomes among the pines. The genome represents a unique opportunity to investigate genome "obesity" in conifers and white pines. Comparative analysis of P. lambertiana and P. taeda L. reveals new insights on the conservation, age, and diversity of the highly abundant transposable elements, the primary factor determining genome size. Like most North American white pines, the principal pathogen of P. lambertiana is white pine blister rust (Cronartium ribicola J.C. Fischer ex Raben.). Identification of candidate genes for resistance to this pathogen is of great ecological importance. The genome sequence afforded us the opportunity to make substantial progress on locating the major dominant gene for simple resistance hypersensitive response, Cr1. We describe new markers and gene annotation that are both tightly linked to Cr1 in a mapping population, and associated with Cr1 in unrelated sugar pine individuals sampled throughout the species' range, creating a solid foundation for future mapping. This genomic variation and annotated candidate genes characterized in our study of the Cr1 region are resources for future marker-assisted breeding efforts as well as for investigations of fundamental mechanisms of invasive disease and evolutionary response., (Copyright © 2016 by the Genetics Society of America.)
- Published
- 2016
- Full Text
- View/download PDF
22. QuorUM: An Error Corrector for Illumina Reads.
- Author
-
Marçais G, Yorke JA, and Zimin A
- Subjects
- Algorithms, Animals, Genome, Humans, Computational Biology methods, Genomics methods, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term "error correction" to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available., Results: We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated., Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu., Contact: gmarcais@umd.edu.
- Published
- 2015
- Full Text
- View/download PDF
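The principle underlying entry 22, that sequencing errors show up as rare k-mers while true k-mers recur across reads at high coverage, can be illustrated with a toy k-mer-spectrum check (generic sketch of the idea; QuorUM's actual correction procedure and its composite statistic π are more involved, and the threshold below is arbitrary):

```python
from collections import Counter

# Toy k-mer-spectrum error detection: k-mers supported by only a few reads are
# likely to contain sequencing errors, while true k-mers recur across reads.

def kmer_counts(reads, k):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def flag_errors(read, counts, k, min_count=2):
    """Positions of k-mers in `read` whose count falls below the threshold."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

if __name__ == "__main__":
    genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
    # Simulated 15 bp reads tiling the toy genome several times over.
    reads = [genome[i:i + 15] for i in range(0, 16)] * 3
    bad_read = genome[5:20][:7] + "T" + genome[5:20][8:]   # one substitution
    counts = kmer_counts(reads, k=9)
    print(flag_errors(genome[5:20], counts, k=9))   # [] : all 9-mers well supported
    print(flag_errors(bad_read, counts, k=9))       # every 9-mer overlapping the error is flagged
```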
23. A new rhesus macaque assembly and annotation for next-generation sequencing analyses.
- Author
-
Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, Pandey S, Meehan DT, Wipfler K, Bosinger SE, Johnson ZP, Tharp GK, Marçais G, Roberts M, Ferguson B, Fox HS, Treangen T, Salzberg SL, Yorke JA, and Norgren RB Jr
- Subjects
- Amino Acid Sequence, Animals, Gene Expression Profiling, High-Throughput Nucleotide Sequencing, Molecular Sequence Annotation, Molecular Sequence Data, RNA, Messenger metabolism, Sequence Alignment, Genome, Macaca mulatta genetics
- Abstract
Background: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses., Results: We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies., Conclusions: The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates., Reviewers: This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.
- Published
- 2014
- Full Text
- View/download PDF
24. Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies.
- Author
-
Neale DB, Wegrzyn JL, Stevens KA, Zimin AV, Puiu D, Crepeau MW, Cardeno C, Koriabine M, Holtz-Morris AE, Liechty JD, Martínez-García PJ, Vasquez-Gross HA, Lin BY, Zieve JJ, Dougherty WM, Fuentes-Soriano S, Wu LS, Gilbert D, Marçais G, Roberts M, Holt C, Yandell M, Davis JM, Smith KE, Dean JF, Lorenz WW, Whetten RW, Sederoff R, Wheeler N, McGuire PE, Main D, Loopstra CA, Mockaitis K, deJong PJ, Yorke JA, Salzberg SL, and Langley CH
- Subjects
- DNA, Plant genetics, Haploidy, Contig Mapping methods, Genome, Plant, Pinus taeda genetics, Sequence Analysis, DNA methods
- Abstract
Background: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination., Results: We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome., Conclusions: In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.
- Published
- 2014
- Full Text
- View/download PDF
25. Sequencing and assembly of the 22-gb loblolly pine genome.
- Author
-
Zimin A, Stevens KA, Crepeau MW, Holtz-Morris A, Koriabine M, Marçais G, Puiu D, Roberts M, Wegrzyn JL, de Jong PJ, Neale DB, Salzberg SL, Yorke JA, and Langley CH
- Subjects
- Genomics, Haploidy, Sequence Analysis, DNA, Transcriptome, Genome, Plant, Ovule genetics, Pinus taeda genetics
- Abstract
Conifers are the predominant gymnosperm. The size and complexity of their genomes has presented formidable technical challenges for whole-genome shotgun sequencing and assembly. We employed novel strategies that allowed us to determine the loblolly pine (Pinus taeda) reference genome sequence, the largest genome assembled to date. Most of the sequence data were derived from whole-genome shotgun sequencing of a single megagametophyte, the haploid tissue of a single pine seed. Although that constrained the quantity of available DNA, the resulting haploid sequence data were well-suited for assembly. The haploid sequence was augmented with multiple linking long-fragment mate pair libraries from the parental diploid DNA. For the longest fragments, we used novel fosmid DiTag libraries. Sequences from the linking libraries that did not match the megagametophyte were identified and removed. Assembly of the sequence data was aided by condensing the enormous number of paired-end reads into a much smaller set of longer "super-reads," rendering subsequent assembly with an overlap-based assembly algorithm computationally feasible. To further improve the contiguity and biological utility of the genome sequence, additional scaffolding methods utilizing independent genome and transcriptome assemblies were implemented. The combination of these strategies resulted in a draft genome sequence of 20.15 billion bases, with an N50 scaffold size of 66.9 kbp.
- Published
- 2014
- Full Text
- View/download PDF
26. The MaSuRCA genome assembler.
- Author
-
Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, and Yorke JA
- Subjects
- Algorithms, Animals, Genome, Bacterial, Mice, Rhodobacter sphaeroides genetics, Sequence Analysis, DNA methods, Software, Genomics methods
- Abstract
Motivation: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer 'super-reads'. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced 'mazurka')., Results: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads., Availability: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year., Contact: alekseyz@ipst.umd.edu., Supplementary Information: Supplementary data are available at Bioinformatics online.
- Published
- 2013
- Full Text
- View/download PDF
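A simplified sketch of the super-read idea from entry 26: extend a read base by base for as long as the table of observed k-mers offers exactly one continuation, so reads sharing that unique extension collapse onto the same longer sequence. This is a conceptual illustration under simplifying assumptions (error-free reads, no pairing information), not MaSuRCA's implementation:

```python
from collections import Counter

# Extend a read forward and backward while the k-mer table observed in the data
# admits exactly one possible next base; stop at branches or dead ends.

ALPHA = "ACGT"

def build_kmer_table(reads, k):
    t = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            t[r[i:i + k]] += 1
    return t

def extend_forward(read, table, k):
    while True:
        suffix = read[-(k - 1):]
        nxt = [c for c in ALPHA if table[suffix + c] > 0]
        if len(nxt) != 1:          # stop at a branch or a dead end
            return read
        read += nxt[0]

def super_read(read, table, k):
    fwd = extend_forward(read, table, k)
    # Extend backward by running the forward extension on reversed k-mers.
    rev_table = Counter({km[::-1]: c for km, c in table.items()})
    return extend_forward(fwd[::-1], rev_table, k)[::-1]

if __name__ == "__main__":
    genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
    reads = [genome[i:i + 12] for i in range(0, len(genome) - 11, 3)]
    table = build_kmer_table(reads, k=8)
    # With unique 8-mers covering the toy genome, every read extends to the
    # same super-read, here the full genome.
    print(super_read(reads[2], table, k=8))
```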
27. GAGE: A critical evaluation of genome assemblies and assembly algorithms.
- Author
-
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, and Yorke JA
- Subjects
- Animals, Computational Biology methods, Genome, Genome, Bacterial genetics, Humans, Internet, Reproducibility of Results, Algorithms, Genomics methods, Sequence Analysis, DNA
- Abstract
New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
- Published
- 2012
- Full Text
- View/download PDF
28. Mis-assembled "segmental duplications" in two versions of the Bos taurus genome.
- Author
-
Zimin AV, Kelley DR, Roberts M, Marçais G, Salzberg SL, and Yorke JA
- Subjects
- Animals, Humans, Reproducibility of Results, Cattle genetics, Genome genetics, Segmental Duplications, Genomic genetics
- Abstract
We analyzed the whole genome sequence coverage in two versions of the Bos taurus genome and identified all regions longer than five kilobases (Kbp) that are duplicated within chromosomes with >99% sequence fidelity in both copies. We call these regions High Fidelity Duplications (HFDs). The two assemblies were Btau 4.2, produced by the Human Genome Sequencing Center at Baylor College of Medicine, and UMD Bos taurus 3.1 (UMD 3.1), produced by our group at the University of Maryland. We found that Btau 4.2 has a far greater number of HFDs, 3111 versus only 69 in UMD 3.1. Read coverage analysis shows that 39 million base pairs (Mbp) of sequence in HFDs in Btau 4.2 appear to be a result of a mis-assembly and therefore cannot be qualified as segmental duplications. UMD 3.1 has only 0.41 Mbp of sequence in HFDs that are due to a mis-assembly.
- Published
- 2012
- Full Text
- View/download PDF
29. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.
- Author
-
Marçais G and Kingsford C
- Subjects
- Animals, Base Sequence, Genome, Humans, Sequence Alignment, Algorithms, Computational Biology methods, Sequence Analysis, DNA methods, Software
- Abstract
Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm., Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution., Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
- Published
- 2011
- Full Text
- View/download PDF
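The computation performed by Jellyfish (entry 29) is simple to state; its contribution is doing it fast and memory-efficiently with a multithreaded, lock-free hash table. The naive single-threaded baseline over canonical k-mers looks like this (generic sketch, not Jellyfish's code):

```python
from collections import Counter

# Naive, single-threaded counting of canonical k-mers across a set of reads.

COMP = str.maketrans("ACGT", "TGCA")

def canonical(km: str) -> str:
    rc = km.translate(COMP)[::-1]
    return km if km <= rc else rc

def count_kmers(reads, k: int) -> Counter:
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

if __name__ == "__main__":
    reads = ["ACGTTGCATG", "CATGCAACGT"]   # second read is the reverse complement of the first
    counts = count_kmers(reads, k=5)
    for km, c in counts.most_common(3):
        print(km, c)                       # each canonical 5-mer is counted twice
```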
30. A whole-genome assembly of the domestic cow, Bos taurus.
- Author
-
Zimin AV, Delcher AL, Florea L, Kelley DR, Schatz MC, Puiu D, Hanrahan F, Pertea G, Van Tassell CP, Sonstegard TS, Marçais G, Roberts M, Subramanian P, Yorke JA, and Salzberg SL
- Subjects
- Animals, Chromosome Mapping, Female, Genome, Human genetics, Genomics, Humans, Male, Sequence Analysis, DNA statistics & numerical data, Synteny, Y Chromosome genetics, Cattle genetics, Genome genetics, Sequence Analysis, DNA methods
- Abstract
Background: The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods., Results: We have assembled the 35 million sequence reads and applied a variety of assembly improvement techniques, creating an assembly of 2.86 billion base pairs that has multiple improvements over previous assemblies: it is more complete, covering more of the genome; thousands of gaps have been closed; many erroneous inversions, deletions, and translocations have been corrected; and thousands of single-nucleotide errors have been corrected. Our evaluation using independent metrics demonstrates that the resulting assembly is substantially more accurate and complete than alternative versions., Conclusions: By using independent mapping data and conserved synteny between the cow and human genomes, we were able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes. We constructed a new cow-human synteny map that expands upon previous maps. We also identified for the first time a portion of the B. taurus Y chromosome.
- Published
- 2009
- Full Text
- View/download PDF
31. [Atypical Addison's disease associated with erythroderma].
- Author
-
Duperrat B, Puissant A, David V, Leturque P, and Marçais G
- Subjects
- Addison Disease diagnosis, Aged, Dermatitis, Exfoliative diagnosis, Humans, Male, Addison Disease complications, Dermatitis, Exfoliative complications
- Published
- 1971
32. [Meningitis caused by coxsackie viruses. Clinical and virologic study].
- Author
-
Modaï J, Pérol Y, Vildé JL, Zribi A, Marçais G, and Domart A
- Subjects
- Age Factors, Child, Coxsackievirus Infections cerebrospinal fluid, Enterovirus isolation & purification, Female, Humans, Male, Meningitis, Viral cerebrospinal fluid, Meningitis, Viral epidemiology, Meningitis, Viral etiology, Neurologic Manifestations, Serologic Tests, Skin Manifestations, Coxsackievirus Infections diagnosis, Meningitis, Viral diagnosis
- Published
- 1970
33. [Association of a primary-secondary syphilis with a Fiessinger-Leroy-Reiter syndrome].
- Author
-
Duperrat B, Puissant A, David V, Saurat JH, and Marçais G
- Subjects
- Adult, Arthritis, Reactive drug therapy, Humans, Male, Penicillins therapeutic use, Syphilis drug therapy, Arthritis, Reactive complications, Syphilis complications
- Published
- 1971
34. [Dermatomyositis with cardiac localization].
- Author
-
Duperrat B, Puissant A, Jourdain JC, Marçais G, and Mallard D
- Subjects
- Alopecia etiology, Arthritis, Rheumatoid diagnosis, Dyspnea etiology, Edema etiology, Female, Humans, Middle Aged, Pain, Skin Manifestations, Cardiomyopathies, Dermatomyositis
- Published
- 1972