6 results on '"suffix-link tree"'
Search Results
2. Linear-time String Indexing and Analysis in Small Space.
- Author
-
BELAZZOUGUI, DJAMAL, CUNIAL, FABIO, KÄRKKÄINEN, JUHA, and MÄKINEN, VELI
- Subjects
TIME management ,GENETIC techniques ,SPACETIME ,DATA structures ,NUCLEOTIDE sequence - Abstract
The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many largescale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis: We show that the BWT of a string T â {1, . . ., σ }n can be built in deterministic O(n) time using just O(n log σ) bits of space, where σ ⤠n. Deterministic linear time is achieved by exploiting a new partial rank data structure that supports queries in constant time and that might have independent interest. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of T. Many fundamental string analysis problems, such as maximal repeats, maximal unique matches, and string kernels, can be mapped to such enumeration and can thus be solved in deterministic O(n) time and in O(n log σ) bits of space fromthe input string by tailoring the enumeration algorithm to some problem-specific computations. We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O(n) time and in O(n log σ) bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used O(n log σ) bits of space, took O(n log log σ) time for the first two structures and O(n logϵ n) time for the third, where ϵ is any positive constant smaller than one. Alternatively, the BWT could be previously built in linear time if onewas willing to spendO(n log σ log logσ n) bits of space. Contrary to the state-of-the-art, our bidirectional BWT index supports every operation in constant time per element in its output. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
3. A Framework for Space-Efficient String Kernels.
- Author
-
Belazzougui, Djamal and Cunial, Fabio
- Subjects
- *
KERNEL functions , *DATA structures , *MARKOV processes , *ENTROPY (Information theory) , *PROBABILITY theory - Abstract
String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O( nd) time and in o( n) bits of space in addition to the input, using just a $$\mathtt {rangeDistinct}$$ data structure on the Burrows-Wheeler transform of the input strings that takes O( d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O( n) using a suitable implementation of the $$\mathtt {rangeDistinct}$$ data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O( n) time and in $$O(n\log {\sigma })$$ bits of space in addition to the input, where $$\sigma $$ is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just $$3n\log {\sigma }+o(n\log {\sigma })$$ bits of space, and that can be learnt in randomized O( n) time using $$O(n\log {\sigma })$$ bits of space in addition to the input. Such model can then be used to assign a probability to a query string S of length m in O( m) time and in $$2m+o(m)$$ bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
4. A framework for space-efficient read clustering in metagenomic samples.
- Author
-
Alanko, Jarno, Cunial, Fabio, Belazzougui, Djamal, and Mäkinen, Veli
- Subjects
- *
METAGENOMICS , *DNA analysis , *CLUSTER analysis (Statistics) , *TAXONOMY , *ALGORITHMS - Abstract
Background: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. Results: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length l each, on an alphabet of total size σ, our algorithms take O(n(t + log σ)) time and just 2n + o(n) + O(max{lσ log n, K logm}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. Conclusions: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the ar [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
5. Linear-time string indexing and analysis in small space
- Author
-
Juha Kärkkäinen, Fabio Cunial, Djamal Belazzougui, Veli Mäkinen, Helsinki Institute for Information Technology, Department of Computer Science, Algorithmic Bioinformatics, Practical Algorithms and Data Structures on Strings research group / Juha Kärkkäinen, Bioinformatics, and Genome-scale Algorithmics research group / Veli Mäkinen
- Subjects
FOS: Computer and information sciences ,bidirectional BWT index ,suffix array ,compressed suffix tree ,02 engineering and technology ,01 natural sciences ,maximal repeat ,law.invention ,Mathematics (miscellaneous) ,maximal exact match ,law ,Data Structures and Algorithms (cs.DS) ,matching statistics ,SUFFIX ARRAYS ,Mathematics ,SETS ,CONSTRUCTION ,ALGORITHMS ,String (computer science) ,Suffix array ,monotone minimal perfect hash function ,string kernel ,010201 computation theory & mathematics ,TREES ,suffix-link tree ,compressed indexes ,compressed suffix array ,STORAGE ,Compressed suffix array ,Burrows–Wheeler transform ,partial rank query ,RETRIEVAL ,Suffix tree ,0206 medical engineering ,Concatenation ,suffix tree ,0102 computer and information sciences ,Data_CODINGANDINFORMATIONTHEORY ,Burrows-Wheeler transform ,Succinct data structure ,Compact data structures ,Computer Science - Data Structures and Algorithms ,Computer Science::Data Structures and Algorithms ,Time complexity ,minimal absent word ,Discrete mathematics ,113 Computer and information sciences ,maximal unique match ,SUCCINCT REPRESENTATIONS ,020602 bioinformatics - Abstract
The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T\in \{1,\ldots,\sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n\log{\sigma})$ bits of space, where $\sigma \leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\log{\sigma})$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\log{\sigma})$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\log{\sigma})$ bits of space, took $O(n\log{\log{\sigma}})$ time for the first two structures, and $O(n\log^{\epsilon}n)$ time for the third, where $\epsilon$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output., Comment: Journal submission (52 pages, 2 figures)
- Published
- 2016
6. A framework for space-efficient read clustering in metagenomic samples
- Author
-
Veli Mäkinen, Fabio Cunial, Jarno Alanko, Djamal Belazzougui, Department of Computer Science, Genome-scale Algorithmics research group / Veli Mäkinen, Bioinformatics, and Algorithmic Bioinformatics
- Subjects
0301 basic medicine ,Theoretical computer science ,Read clustering ,BINNING ALGORITHM ,Computer science ,0206 medical engineering ,Submaximal repeat ,02 engineering and technology ,DNA Fragmentation ,Text indexing ,Burrows-Wheeler transform ,Measure (mathematics) ,Biochemistry ,Set (abstract data type) ,03 medical and health sciences ,Tree (descriptive set theory) ,Redundancy (information theory) ,Structural Biology ,Cluster Analysis ,Cluster analysis ,Suffix-link tree ,Molecular Biology ,SEQUENCES ,Applied Mathematics ,Research ,Computational Biology ,Sequence Analysis, DNA ,Models, Theoretical ,Computer Science Applications ,Tree traversal ,030104 developmental biology ,Right-maximal k-mer ,1182 Biochemistry, cell and molecular biology ,Metagenomics ,Algorithm ,020602 bioinformatics ,Algorithms - Abstract
Background: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. Results: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length l each, on an alphabet of total size sigma, our algorithms take O(n(t + log sigma)) time and just 2n + o(n) + O(max{l sigma log n, K logm}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. Conclusions: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.