Author: "Juha Kärkkäinen" / Topic: 02 engineering and technology - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Juha Kärkkäinen"' showing total 47 results

1. String inference from longest-common-prefix array

Author: Juha Kärkkäinen, Marcin Piątkowski, Simon J. Puglisi, Chatzigiannakis, Ioannis, Indyk, Piotr, Kuhn, Fabian, Muscholl, Anna, Practical Algorithms and Data Structures on Strings research group / Juha Kärkkäinen, Helsinki Institute for Information Technology, Department of Computer Science, Finnish Centre of Excellence in Algorithmic Data Analysis Research (Algodan), Bioinformatics, Genome-scale Algorithmics research group / Veli Mäkinen, and Algorithmic Bioinformatics
Subjects: String inference, 000 Computer science, knowledge, general works, General Computer Science, LCP array, education, 0102 computer and information sciences, 02 engineering and technology, 113 Computer and information sciences, Quantitative Biology::Genomics, 01 natural sciences, Theoretical Computer Science, 010201 computation theory & mathematics, Computer Science, 0202 electrical engineering, electronic engineering, information engineering, NP-hardness, 020201 artificial intelligence & image processing, Computer Science::Data Structures and Algorithms, Computer Science::Formal Languages and Automata Theory
Abstract: The suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the LCPs for lexicographically adjacent suffixes of a string. Together the two arrays are roughly equivalent to the suffix tree with the LCP array representing the tree shape. In order to better understand the combinatorics of LCP arrays, we consider the problem of inferring a string from an LCP array, i.e., determining whether a given array of integers is a valid LCP array, and if it is, reconstructing some string or all strings with that LCP array. There are recent studies of inferring a string from a suffix tree shape but using significantly more information (in the form of suffix links) than is available in the LCP array. We provide two main results. (1) We describe two algorithms for inferring strings from an LCP array when we allow a generalized form of LCP array defined for a multiset of cyclic strings: a linear time algorithm for binary alphabet and a general algorithm with polynomial time complexity for a constant alphabet size. (2) We prove that determining whether a given integer array is a valid LCP array is NP-complete when we require more restricted forms of LCP array defined for a single cyclic or non-cyclic string or a multiset of non-cyclic strings. The result holds whether or not the alphabet is restricted to be binary. In combination, the two results show that the generalized form of LCP array for a multiset of cyclic strings is fundamentally different from the other more restricted forms.
Published: 2023
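
To make the abstract's definition concrete, here is a minimal sketch that builds a suffix array and the matching LCP array by brute force. The function names and the convention that the first LCP entry is 0 are assumptions made for illustration, not something the paper prescribes, and the quadratic construction is only meant for small examples.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP[k] = length of the longest common prefix of the suffixes at sa[k-1] and sa[k].

    Assumed convention: LCP[0] = 0, since the first suffix has no predecessor.
    """
    def lcp(i, j):
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n

    return [0] + [lcp(sa[k - 1], sa[k]) for k in range(1, len(sa))]


sa = suffix_array("banana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

The inference problem studied in the paper runs in the opposite direction: given only an integer array such as `[0, 1, 3, 0, 0, 2]`, decide whether some string produces it.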

2. Block trees

Author: Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez, Simon J. Puglisi, Yasuo Tabei, Department of Computer Science, Algorithmic Bioinformatics, Practical Algorithms and Data Structures on Strings research group / Juha Kärkkäinen, Helsinki Institute for Information Technology, and Bioinformatics
Subjects: Computer Networks and Communications, Applied Mathematics, Compressed data structures, 0102 computer and information sciences, 02 engineering and technology, Repetitive string collections, Lempel-Ziv compression, 113 Computer and information sciences, 01 natural sciences, STRINGS, Theoretical Computer Science, DATA-COMPRESSION, Computational Theory and Mathematics, 010201 computation theory & mathematics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, APPROXIMATION
Abstract: Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.
Published: 2021

3. Tight Upper and Lower Bounds on Suffix Tree Breadth

Author: Bella Zhukova, Simon J. Puglisi, Juha Kärkkäinen, Golnaz Badkobeh, Paweł Gawrychowski, Department of Computer Science, Practical Algorithms and Data Structures on Strings research group / Juha Kärkkäinen, Helsinki Institute for Information Technology, Bioinformatics, Algorithmic Bioinformatics, and Genome-scale Algorithmics research group / Veli Mäkinen
Subjects: General Computer Science, Suffix tree, 0102 computer and information sciences, 02 engineering and technology, String processing, 01 natural sciences, Upper and lower bounds, Theoretical Computer Science, law.invention, Combinatorics, High Energy Physics::Theory, law, Suffix array, Trie, 0202 electrical engineering, electronic engineering, information engineering, Computer Science::Data Structures and Algorithms, Mathematics, String (computer science), Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), String, Data structure, 113 Computer and information sciences, 010201 computation theory & mathematics, 020201 artificial intelligence & image processing, Suffix, Longest common prefix, Computer Science::Formal Languages and Automata Theory
Abstract: The suffix tree, the compacted trie of all the suffixes of a string, is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes $\nu_S(d)$ can there be at (string) depth d in its suffix tree? We prove $\nu(n,d)=\max_{S\in\Sigma^n}\nu_S(d)$ is $O((n/d)\log(n/d))$, and show that this bound is asymptotically tight, describing strings for which $\nu_S(d)$ is $\Omega((n/d)\log(n/d))$.
Published: 2021

4. Fixed Block Compression Boosting in FM-Indexes: Theory and Practice

Author: Simon J. Puglisi, Juha Kärkkäinen, Matthias Petri, Dominik Kempa, Simon Gog, Department of Computer Science, Practical Algorithms and Data Structures on Strings research group / Juha Kärkkäinen, Bioinformatics, Helsinki Institute for Information Technology, Genome-scale Algorithmics research group / Veli Mäkinen, and Algorithmic Bioinformatics
Subjects: Boosting (machine learning), General Computer Science, Computer science, Text indexing, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, law.invention, Compressed data structure, Wavelet, law, Suffix array, Wavelet tree, 0202 electrical engineering, electronic engineering, information engineering, Wavelet Tree, Pattern matching, FM-index, Compression boosting, Applied Mathematics, 020207 software engineering, 113 Computer and information sciences, Computer Science Applications, 010201 computation theory & mathematics, Theory of computation, Algorithm
Abstract: The FM index (Ferragina and Manzini in J ACM 52(4):552-581, 2005) is a widely-used compressed data structure that stores a string T in a compressed form and also supports fast pattern matching queries. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main theoretical result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes. We also describe several new techniques for implementing fixed-block boosting efficiently, including a new, fast, and space-efficient implementation of wavelet trees. Our extensive experiments show the new indexes to be consistently fast and small relative to the state-of-the-art, and thus they make a good off-the-shelf choice for many applications.
Published: 2018

5. Better External Memory LCP Array Construction

Author: Juha Kärkkäinen and Dominik Kempa
Subjects: Speedup, Computer science, LCP array, Suffix array, Byte, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, Parallel computing, Data structure, 01 natural sciences, Bottleneck, Theoretical Computer Science, law.invention, 010201 computation theory & mathematics, law, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Out-of-core algorithm, Auxiliary memory
Abstract: The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck, especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm. We then further engineer the two new algorithms and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Eight threads are sufficient to make the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms by making them in-place: the input (text and suffix array) is treated as read-only, and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet.
Published: 2019

6. V-Order: New combinatorial properties & a simple comparison algorithm

Author: Jacqueline W. Daykin, M. Sohel Rahman, William F. Smyth, Juha Kärkkäinen, and Ali Alatabbi
Subjects: Applied Mathematics, Computation, 0102 computer and information sciences, 02 engineering and technology, Lexicographical order, Data structure, 01 natural sciences, Lyndon words, Combinatorics, Factorization, 010201 computation theory & mathematics, Simple (abstract algebra), 0202 electrical engineering, electronic engineering, information engineering, Discrete Mathematics and Combinatorics, Order (group theory), 020201 artificial intelligence & image processing, Suffix, Mathematics
Abstract: V-order is a global order on strings related to Unique Maximal Factorization Families (UMFFs), themselves generalizations of Lyndon words. V-order has recently been proposed as an alternative to lexicographic order in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform. Efficient V-ordering of strings thus becomes a matter of considerable interest. In this paper we discover several new combinatorial properties of V-order, then explore the computational consequences; in particular, a fast, simple on-line V-order comparison algorithm that requires no auxiliary data structures.
Published: 2016

7. Tighter bounds for the sum of irreducible LCP values

Author: Marcin Piątkowski, Juha Kärkkäinen, and Dominik Kempa
Subjects: General Computer Science, Burrows–Wheeler transform, Search engine indexing, LCP array, Suffix array, Lower order, 0102 computer and information sciences, 02 engineering and technology, Lexicographical order, 01 natural sciences, Theoretical Computer Science, law.invention, Prefix, Combinatorics, 010201 computation theory & mathematics, law, Compression (functional analysis), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Mathematics
Abstract: The suffix array is frequently augmented with the longest-common-prefix (LCP) array that stores the lengths of the longest common prefixes between lexicographically adjacent suffixes of a text. While the sum of the values in the LCP array can be $\Omega(n^2)$ for a text of length n, the sum of so-called irreducible LCP values was shown to be $O(n \lg n)$ a few years ago. In this paper, we improve the bound to $O(n \lg r)$, where $r \le n$ is the number of runs in the Burrows-Wheeler transform of the text. We also show that our bound is tight up to lower order terms (unlike the previous bound). Our results and the techniques used in proving them provide new insights into the combinatorics of text indexing and compression, and have immediate applications to LCP array construction algorithms.
Published: 2016
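
The abstract does not define irreducible LCP values. The sketch below assumes the usual definition from the literature: LCP[i] is irreducible when i = 0 or when the Burrows-Wheeler transform characters at positions i and i-1 differ. It also assumes the text ends with a unique smallest sentinel `$`. The function name is hypothetical and the construction is brute force, for illustration only.

```python
def irreducible_lcp_stats(text):
    """Return (sum of irreducible LCP values, sum of all LCP values, number of BWT runs).

    Assumptions: text ends with a unique sentinel '$' that sorts before every
    other character, and LCP[i] counts as irreducible iff i == 0 or
    BWT[i] != BWT[i-1].
    """
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT[i] is the character preceding suffix sa[i]; index -1 wraps to the sentinel.
    bwt = [text[i - 1] for i in sa]

    def lcp(i, j):
        k = 0
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        return k

    lcps = [0] + [lcp(sa[k - 1], sa[k]) for k in range(1, n)]
    irreducible = sum(v for i, v in enumerate(lcps) if i == 0 or bwt[i] != bwt[i - 1])
    runs = 1 + sum(1 for i in range(1, n) if bwt[i] != bwt[i - 1])
    return irreducible, sum(lcps), runs


print(irreducible_lcp_stats("banana$"))  # (3, 6, 5)
```

For `banana$` the BWT is `annb$aa` with r = 5 runs; the irreducible values sum to 3 while the full LCP array sums to 6.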

8. Lazy Lempel-Ziv Factorization Algorithms

Author: Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi
Subjects: Theoretical computer science, Burrows–Wheeler transform, Computer science, Search engine indexing, LCP array, Suffix array, 0102 computer and information sciences, 02 engineering and technology, Data structure, 01 natural sciences, Bottleneck, Theoretical Computer Science, law.invention, Factorization, 010201 computation theory & mathematics, law, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Algorithm, Data compression
Abstract: For decades the Lempel-Ziv (LZ77) factorization has been a cornerstone of data compression and string processing algorithms, and uses for it are still being uncovered. For example, LZ77 is central to several recent text indexing data structures designed to search highly repetitive collections. However, in many applications computation of the factorization remains a bottleneck in practice. In this article, we describe a number of simple and fast LZ77 factorization algorithms, which consistently outperform all previous methods in practice, use less memory, and still offer strong worst-case performance guarantees. A common feature of the new algorithms is that they compute longest common prefix information in a lazy fashion, with the degree of laziness in preprocessing characterizing different algorithms.
Published: 2016

9. Linear-time string indexing and analysis in small space

Author: Juha Kärkkäinen, Fabio Cunial, Djamal Belazzougui, Veli Mäkinen, Helsinki Institute for Information Technology, Department of Computer Science, Algorithmic Bioinformatics, Practical Algorithms and Data Structures on Strings research group / Juha Kärkkäinen, Bioinformatics, and Genome-scale Algorithmics research group / Veli Mäkinen
Subjects: FOS: Computer and information sciences, bidirectional BWT index, suffix array, compressed suffix tree, 02 engineering and technology, 01 natural sciences, maximal repeat, law.invention, Mathematics (miscellaneous), maximal exact match, law, Data Structures and Algorithms (cs.DS), matching statistics, SUFFIX ARRAYS, Mathematics, SETS, CONSTRUCTION, ALGORITHMS, String (computer science), Suffix array, monotone minimal perfect hash function, string kernel, 010201 computation theory & mathematics, TREES, suffix-link tree, compressed indexes, compressed suffix array, STORAGE, Compressed suffix array, Burrows–Wheeler transform, partial rank query, RETRIEVAL, Suffix tree, 0206 medical engineering, Concatenation, suffix tree, 0102 computer and information sciences, Data_CODINGANDINFORMATIONTHEORY, Burrows-Wheeler transform, Succinct data structure, Compact data structures, Computer Science - Data Structures and Algorithms, Computer Science::Data Structures and Algorithms, Time complexity, minimal absent word, Discrete mathematics, 113 Computer and information sciences, maximal unique match, SUCCINCT REPRESENTATIONS, 020602 bioinformatics
Abstract: The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T\in \{1,\ldots,\sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n\log{\sigma})$ bits of space, where $\sigma \leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\log{\sigma})$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\log{\sigma})$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\log{\sigma})$ bits of space, took $O(n\log{\log{\sigma}})$ time for the first two structures, and $O(n\log^{\epsilon}n)$ time for the third, where $\epsilon$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
Comment: Journal submission (52 pages, 2 figures)
Published: 2016

10. Document Retrieval on Repetitive String Collections

Author: Kalle Karhu, Jouni Sirén, Gonzalo Navarro, Travis Gagie, Aleksi Hartikainen, Juha Kärkkäinen, Simon J. Puglisi, Department of Computer Science, Practical Algorithms and Data Structures on Strings research group / Juha Kärkkäinen, Finnish Centre of Excellence in Algorithmic Data Analysis Research (Algodan), Helsinki Institute for Information Technology, Bioinformatics, Genome-scale Algorithmics research group / Veli Mäkinen, and Algorithmic Bioinformatics
Subjects: FOS: Computer and information sciences, Computer science, Information Retrieval Efficiency, 0102 computer and information sciences, 02 engineering and technology, Library and Information Sciences, 01 natural sciences, TEXT INDEXES, Computer Science - Information Retrieval, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Suffix trees and arrays, Relevance (information retrieval), SUFFIX ARRAYS, Document retrieval, INFORMATION-RETRIEVAL, Information retrieval, SEQUENCES, String (computer science), Search engine indexing, Order (ring theory), Repetitive string collections, EFFICIENT ALGORITHMS, Data structure, 113 Computer and information sciences, 010201 computation theory & mathematics, Pattern recognition (psychology), Document retrieval on strings, Natural language, Information Retrieval (cs.IR), Information Systems
Abstract: Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-$k$ document retrieval (find the $k$ documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Accepted to the Information Retrieval Journal
Published: 2016

11. On Suffix Tree Breadth

Author: Bella Zhukova, Golnaz Badkobeh, Simon J. Puglisi, and Juha Kärkkäinen
Subjects: Suffix tree, String (computer science), Generalized suffix tree, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), 0102 computer and information sciences, 02 engineering and technology, String processing, Binary logarithm, 01 natural sciences, Longest common substring problem, law.invention, Combinatorics, High Energy Physics::Theory, 010201 computation theory & mathematics, law, 020204 information systems, Trie, 0202 electrical engineering, electronic engineering, information engineering, Suffix, Computer Science::Data Structures and Algorithms, Computer Science::Formal Languages and Automata Theory, Mathematics
Abstract: The suffix tree, the compacted trie of all the suffixes of a string, is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes $\nu_S(d)$ can there be at (string) depth d in its suffix tree? We prove $\nu(n,d)=\max_{S\in\Sigma^n}\nu_S(d)$ is $O((n/d)\log n)$, and show that this bound is almost tight, describing strings for which $\nu_S(d)$ is $\Omega((n/d)\log(n/d))$.
Published: 2017

12. Colored range queries and document retrieval

Author: Gonzalo Navarro, Juha Kärkkäinen, Simon J. Puglisi, and Travis Gagie
Subjects: Theoretical computer science, General Computer Science, Range query (data structures), Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 0102 computer and information sciences, 02 engineering and technology, Inverted index, Data structure, 01 natural sciences, Theoretical Computer Science, Combinatorics, Compressed data structure, Colored, 010201 computation theory & mathematics, 020204 information systems, Bounded function, 0202 electrical engineering, electronic engineering, information engineering, Entropy (information theory), Document retrieval
Abstract: Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper, we give improved time and space bounds for three important one-dimensional colored range queries - colored range listing, colored range top-k queries and colored range counting - and, as a consequence, new bounds for various document retrieval problems on general collections of sequences. Colored range listing is the problem of preprocessing a sequence $S[1,n]$ of colors so that, later, given an interval $[i,i+\ell-1]$, we list the different colors in $S[i,i+\ell-1]$. Colored range top-k queries ask instead for k most frequent colors in the interval. Colored range counting asks for the number of different colors in the interval. We first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the first compressed data structure (using $nH_k(S)+o(n\log\sigma)$ bits, for any $k=o(\log_\sigma n)$, where $H_k(S)$ is the k-th order empirical entropy of S and $\sigma$ the number of different colors in S) that answers colored range listing queries in constant time per returned result. We also give an efficient data structure for document listing whose size is bounded in terms of the k-th order entropy of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how modified wavelet trees can support colored range counting using $nH_0(S)+O(n)+o(nH_0(S))$ bits, and answer queries in $O(\log\ell)$ time. As far as we know, this is the first data structure in which the query time depends only on $\ell$ and not on n. We also show how our data structure can be made dynamic.
Published: 2013

13. Lempel-Ziv Decoding in External Memory

Author: Djamal Belazzougui, Juha Kärkkäinen, Simon J. Puglisi, and Dominik Kempa
Subjects: Computer science, Reading (computer), Computation, 020207 software engineering, Data_CODINGANDINFORMATIONTHEORY, 02 engineering and technology, Simple (abstract algebra), Encoding (memory), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Out-of-core algorithm, Arithmetic, Auxiliary memory, Decoding methods
Abstract: Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel-Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).
Published: 2016
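
For context, the standard in-memory decoder that the abstract contrasts with can be sketched in a few lines. The phrase encoding used here (a literal character, or a `(position, length)` pair pointing into the text decoded so far) is an assumption chosen for illustration, not the format used in the paper.

```python
def lz77_decode(phrases):
    """Decode an LZ77-style parse held entirely in memory.

    Assumed (hypothetical) encoding: each phrase is either a single literal
    character or a (pos, length) pair that copies `length` characters
    starting at 0-based position `pos` of the text decoded so far.
    """
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            pos, length = phrase
            # Copy one character at a time so a phrase may overlap itself.
            # Each out[pos + k] is a random access into earlier output,
            # which is what makes a direct external-memory port hard.
            for k in range(length):
                out.append(out[pos + k])
    return "".join(out)


print(lz77_decode(["a", "b", (0, 4), "c"]))  # abababc
```

The `(0, 4)` phrase reads positions 2 and 3 only after writing them in the same loop, so the copy source can point anywhere in the text already produced.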

14. LCP Array Construction Using O(sort(n)) (or Less) I/Os

Author: Dominik Kempa and Juha Kärkkäinen
Subjects: Physics, LCP array, Suffix array, Order (ring theory), Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, String processing, 01 natural sciences, law.invention, Combinatorics, Suffix sorting, 010201 computation theory & mathematics, law, Internal memory, Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, sort, 020201 artificial intelligence & image processing, Suffix, Computer Science::Data Structures and Algorithms, Computer Science::Formal Languages and Automata Theory
Abstract: The suffix array, one of the most important data structures in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck especially when the data is too big for internal memory. While there are external memory algorithms that construct the suffix array and the LCP array simultaneously in the optimal I/O complexity of $O(\mathrm{sort}(n))$, for several reasons it would be desirable to construct the suffix array first and then the LCP array from the suffix array in a separate stage. In this paper we describe the first algorithm that achieves $O(\mathrm{sort}(n))$ I/O complexity for the LCP array construction stage and is not an extension of a suffix sorting algorithm. As a variant, we obtain a Monte Carlo algorithm that, given a sparse suffix array containing $m < n$ suffixes in sorted order, computes the corresponding LCP array in $O(\mathrm{scan}(n)+\mathrm{sort}(m)\log(n/m))$ I/Os if the suffix positions are evenly spaced, and in $O(\mathrm{scan}(n)+\mathrm{sort}(m)\log(n))$ I/Os in general.
Published: 2016

15. Document Counting in Compressed Space

Author: Gonzalo Navarro, Travis Gagie, Jouni Sirén, Simon J. Puglisi, Juha Kärkkäinen, and Aleksi Hartikainen
Subjects: Structure (mathematical logic), Computer science, 0102 computer and information sciences, 02 engineering and technology, String searching algorithm, Space (commercial competition), Data structure, computer.software_genre, 01 natural sciences, 010201 computation theory & mathematics, Factor (programming language), Encoding (memory), 0202 electrical engineering, electronic engineering, information engineering, Compressibility, 020201 artificial intelligence & image processing, Data mining, computer, computer.programming_language, Data compression
Abstract: We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. In this paper we implement these solutions and explore compressed variants, aiming to reduce data structure size. Our main result is to uncover some unexpected compressibility properties of the fastest known data structure for the problem. By taking advantage of these properties, we can reduce the size of the structure by a factor of 5-400, depending on the dataset.
Published: 2015

16. Suffix Array Construction

Author: Juha Kärkkäinen
Subjects: 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology
Published: 2015

17. Diverse Palindromic Factorization Is NP-complete

Author: Juha Kärkkäinen, Shiho Sugimoto, Travis Gagie, Hideo Bannai, Marcin Piątkowski, Dominik Kempa, Shunsuke Inenaga, and Simon J. Puglisi
Subjects: Discrete mathematics, Boolean circuit, String (computer science), Palindrome, 020206 networking & telecommunications, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, Combinatorics, Factorization, 010201 computation theory & mathematics, Compression (functional analysis), 0202 electrical engineering, electronic engineering, information engineering, NP-complete, Mathematics
Abstract: We prove that it is NP-complete to decide whether a given string can be factored into palindromes that are each unique in the factorization.
Published: 2015
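
The decision problem in this abstract can be illustrated with a brute-force backtracking search. This is only a sketch with a hypothetical function name; it takes exponential time in the worst case, which is consistent with the NP-completeness result, and it is not a method from the paper.

```python
def diverse_palindromic_factorization(s):
    """Return a list of pairwise distinct palindromes whose concatenation is s,
    or None if no such factorization exists (exhaustive backtracking)."""

    def search(start, used):
        if start == len(s):
            return []
        for end in range(start + 1, len(s) + 1):
            factor = s[start:end]
            if factor not in used and factor == factor[::-1]:
                used.add(factor)
                rest = search(end, used)
                if rest is not None:
                    return [factor] + rest
                used.remove(factor)  # backtrack and try a longer factor
        return None

    return search(0, set())


print(diverse_palindromic_factorization("abab"))  # ['a', 'bab']
print(diverse_palindromic_factorization("abca"))  # None
```

For `abca` every palindromic factor is a single character, and the letter `a` would have to be used twice, so no diverse factorization exists.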

18. Linear work suffix array construction

Author: Juha Kärkkäinen, Stefan Burkhardt, and Peter Sanders
Subjects: Compressed suffix array, Suffix tree, Suffix array, LCP array, Generalized suffix tree, 0102 computer and information sciences, 02 engineering and technology, String searching algorithm, 01 natural sciences, law.invention, 010201 computation theory & mathematics, Artificial Intelligence, Hardware and Architecture, Control and Systems Engineering, law, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Suffix, Algorithm, Software, FM-index, Information Systems, Mathematics
Abstract: Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
Published: 2006
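
The difference-cover concept named in the abstract above can be illustrated with a small check. This is a minimal sketch, not the paper's C++ implementation: the function name `cover_shift` and the choice of the set {1, 2} modulo 3 are illustrative assumptions, following the usual textbook presentation of DC3. The property shown is that any two text positions can be shifted by the same small amount so that both land on sampled positions.

```python
def cover_shift(i, j, cover=(1, 2), v=3):
    """Return the smallest k in [0, v) such that both (i + k) % v and
    (j + k) % v lie in the cover, or None if no such k exists.
    (Hypothetical helper; names and defaults are assumptions.)"""
    members = set(cover)
    for k in range(v):
        if (i + k) % v in members and (j + k) % v in members:
            return k
    return None


# For a difference cover, every pair of positions has such a shift.
assert all(cover_shift(i, j) is not None for i in range(9) for j in range(9))
```

With a set that is not a difference cover, such as {1} modulo 3, some pairs have no common shift and the function returns None.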

19. BDD-BASED ANALYSIS OF GAPPED q-GRAM FILTERS

Author: Stefan Burkhardt, Marc Fontaine, and Juha Kärkkäinen
Subjects: 0303 health sciences, Binary decision diagram, Hamming distance, 02 engineering and technology, Filter (signal processing), Approximate string matching, Data structure, Adaptive filter, 03 medical and health sciences, Simple (abstract algebra), 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), 020201 artificial intelligence & image processing, Sensitivity (control systems), Algorithm, 030304 developmental biology, Mathematics
Abstract: Recently, there has been a surge of interest in gapped q-gram filters for approximate string matching. Important design parameters for filters are for example the value of q, the filter-threshold and in particular the shape (aka seed) of the filter. A good choice of parameters can improve the performance of a q-gram filter by orders of magnitude and optimizing these parameters is a nontrivial combinatorial problem. We describe a new method for analyzing gapped q-gram filters. This method is simple and generic. It applies to a variety of filters, overcomes many restrictions that are present in existing algorithms and can easily be extended to new filter variants. To implement our approach, we use an extended version of BDDs (Binary Decision Diagrams), a data structure that efficiently represents sets of bit-strings. In a second step, we define a new class of multi-shape filters and analyze these filters with the BDD-based approach. Experiments show that multi-shape filters can outperform the best single-shape filters, which are currently in use, in many aspects. The BDD-based algorithm is crucial for the design and analysis of these new and better multi-shape filters. Our results apply to the k-mismatches problem, i.e. approximate string matching with Hamming distance.
Published: 2005

20. Approximate string matching on Ziv–Lempel compressed text

Author: Gonzalo Navarro, Esko Ukkonen, and Juha Kärkkäinen
Subjects: Theoretical computer science, Speedup, Collage systems, Edit distance, 0102 computer and information sciences, 02 engineering and technology, Approximate string matching, Dynamic programming, 01 natural sciences, Ziv–Lempel compression, Theoretical Computer Science, Combinatorics, Computational Theory and Mathematics, 010201 computation theory & mathematics, Compression (functional analysis), 3-dimensional matching, 0202 electrical engineering, electronic engineering, information engineering, Compressed pattern matching, Discrete Mathematics and Combinatorics, 020201 artificial intelligence & image processing, Pattern matching, Mathematics
Abstract: We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions. On LZ78/LZW we need O(mkn + R) time in the worst case and O(k^2 n + mk min(n, (mσ)^k) + R) on average where σ is the alphabet size. The experimental results show a practical speedup over the basic approach of up to 2X for moderate m and small k. We extend the algorithms to more general compression formats and approximate matching models.
Published: 2003
Full Text: View/download PDF

21. Lempel-Ziv Index for q-Grams

Author: Juha Kärkkäinen and Erkki Sutinen
Subjects: Parsing, Index (economics), General Computer Science, Applied Mathematics, Structure (category theory), 0102 computer and information sciences, 02 engineering and technology, computer.software_genre, Binary logarithm, 01 natural sciences, Computer Science Applications, Combinatorics, 010201 computation theory & mathematics, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Pattern matching, computer, Algorithm, Gram, Mathematics
Abstract: We present a new sublinear-size index structure for finding all occurrences of a given q-gram in a text. Such a q-gram index is needed in many approximate pattern matching algorithms. All earlier q-gram indexes require at least O(n) space, where n is the length of the text. The new Lempel-Ziv index needs only O(n/log n) space while being as fast as previous methods. The new method takes advantage of repetitions in the text found by Lempel-Ziv parsing.
Published: 1998

22. Lempel-Ziv Parsing in External Memory

Author: Juha Kärkkäinen, Simon J. Puglisi, and Dominik Kempa
Subjects: FOS: Computer and information sciences, Parsing, Theoretical computer science, Computer science, String (computer science), Search engine indexing, 0102 computer and information sciences, 02 engineering and technology, computer.software_genre, 01 natural sciences, Factorization, 010201 computation theory & mathematics, Computer Science - Data Structures and Algorithms, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data Structures and Algorithms (cs.DS), computer, Auxiliary memory, Data compression
Abstract: For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections. Comment: 10 pages
Published: 2013

23. Near in Place Linear Time Minimum Redundancy Coding

Author: Juha Kärkkäinen and German Tischler
Subjects: Block code, Discrete mathematics, Triple modular redundancy, Polynomial code, Computer science, Concatenated error correction code, 020206 networking & telecommunications, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, Longitudinal redundancy check, 01 natural sciences, Linear code, Redundancy (information theory), 010201 computation theory & mathematics, Cyclic code, 0202 electrical engineering, electronic engineering, information engineering, Algorithm
Abstract: In this paper we discuss data structures and algorithms for linear time encoding and decoding of minimum redundancy codes. We show that a text of length n over an alphabet of cardinality σ can be encoded to minimum redundancy code and decoded from minimum redundancy code in time O(n) using only an additional space of O(σ) words (O(σ log n) bits) for handling the auxiliary data structures. The encoding process can replace the given block code by the corresponding minimum redundancy code in place. The decoding process is able to replace the minimum redundancy code given in sufficient space to store the block code by the corresponding block code.
Published: 2013

24. Lightweight Lempel-Ziv Parsing

Author: Dominik Kempa, Simon J. Puglisi, and Juha Kärkkäinen
Subjects: Parsing, Theoretical computer science, Factorization, 010201 computation theory & mathematics, Computer science, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0102 computer and information sciences, 02 engineering and technology, Alphabet, computer.software_genre, 01 natural sciences, computer
Abstract: We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.
Published: 2013

25. Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

Author: Simon J. Puglisi, Dominik Kempa, and Juha Kärkkäinen
Subjects: Parsing, Computer science, String (computer science), Search engine indexing, 0102 computer and information sciences, 02 engineering and technology, Binary logarithm, computer.software_genre, 01 natural sciences, Factorization, 010201 computation theory & mathematics, Simple (abstract algebra), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Time complexity, Algorithm, computer, Data compression
Abstract: Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algorithms to date, using n log n bits less space than any previous linear time algorithm. The algorithms are also practical, simple to implement, and very fast in practice.
Published: 2013
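
For readers unfamiliar with the LZ factorization discussed in the preceding entries, the definition can be sketched as follows. This is a deliberately naive quadratic-or-worse illustration of what is being computed, not one of the linear-time algorithms from the paper; the function name and the phrase encoding (`(position, length)` for a copy, `(character, 0)` for a first occurrence) are assumptions chosen for the example.

```python
def lz77_factorize(s):
    """Naive LZ77 factorization of s.

    Each phrase is either (char, 0) for a symbol that has not occurred
    before, or (pos, length) for the longest prefix of the remaining
    text that also starts at an earlier position pos (the two
    occurrences are allowed to overlap)."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # candidate source start, strictly before i
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            phrases.append((s[i], 0))
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases
```

For example, `lz77_factorize("abab")` yields two literal phrases followed by one copy phrase of length 2 from position 0.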

26. Slashing the Time for BWT Inversion

Author: Dominik Kempa, Simon J. Puglisi, and Juha Kärkkäinen
Subjects: Hardware_MEMORYSTRUCTURES, Speedup, Out-of-order execution, Computer science, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, Parallel computing, Cache-oblivious algorithm, 01 natural sciences, Bottleneck, 010201 computation theory & mathematics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Algorithm design, Cache, Cache algorithms, Time complexity, Algorithm
Abstract: Inverting the Burrows-Wheeler transform (BWT) is a bottleneck in BWT-based decompressors. The state-of-the-art inversion algorithm runs in linear time but is slow in practice due to CPU-cache misses. For more than a decade these cache misses have been thought to be inherent to BWT inversion. We show how to reduce the number of cache misses by a factor of nearly two, and simultaneously the cost of cache misses by another factor of two, obtaining a consistent speed up by a factor of 2.3-4. We can do even better if the data is highly repetitive. We describe an algorithm that achieves an asymptotic reduction in cache misses in theory and is the fastest algorithm in practice for such data.
Published: 2012
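
The cache behaviour described in the abstract above is easier to see against the textbook inversion procedure. The following is a minimal sketch of that standard LF-mapping approach, not the improved algorithms of the paper; the function names, the naive rotation-sorting forward transform, and the `$` sentinel (assumed to be smaller than every other symbol and absent from the text) are all assumptions made for illustration.

```python
def bwt(s, sentinel="$"):
    """Forward transform by sorting all rotations (for illustration only)."""
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(b):
    """Invert a BWT string that contains one sentinel sorting first."""
    n = len(b)
    counts = {}
    rank = []  # rank[i] = number of occurrences of b[i] in b[:i]
    for c in b:
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first = {}  # first[c] = number of symbols in b smaller than c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    out = []
    i = 0  # row 0 starts with the sentinel; b[0] is the last text symbol
    for _ in range(n - 1):
        c = b[i]
        out.append(c)
        i = first[c] + rank[i]  # LF step: an effectively random jump in b
    return "".join(reversed(out))
```

Each iteration jumps to an unpredictable index of `b`, which is the per-symbol random access that the abstract identifies as the source of cache misses.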

27. A Faster Grammar-Based Self-index

Author: Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi
Subjects: Discrete mathematics, Parsing, Grammar, Computer science, media_common.quotation_subject, String (computer science), 0102 computer and information sciences, 02 engineering and technology, computer.software_genre, 01 natural sciences, Genomic databases, Index (publishing), 010201 computation theory & mathematics, Log-log plot, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, computer, Algorithm, media_common
Abstract: To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + (m + occ) log log n) time. All previous self-indexes are either larger or slower in the worst case.
Published: 2012

28. Grammar Precompression Speeds Up Burrows–Wheeler Compression

Author: Dominik Kempa, Pekka Mikkola, and Juha Kärkkäinen
Subjects: Speedup, Burrows–Wheeler transform, Computer science, Process (computing), Inverse, Data compression ratio, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, 010201 computation theory & mathematics, 020204 information systems, Compression (functional analysis), Compression ratio, 0202 electrical engineering, electronic engineering, information engineering, Algorithm, Data compression
Abstract: Text compression algorithms based on the Burrows-Wheeler transform (BWT) typically achieve a good compression ratio but are slow compared to Lempel-Ziv type compression algorithms. The main culprit is the time needed to compute the BWT during compression and its inverse during decompression. We propose to speed up BWT-based compression by performing a grammar-based precompression before the transform. The idea is to reduce the amount of data that BWT and its inverse have to process. We have developed a very fast grammar precompressor using pair replacement. Experiments show a substantial speed up in practice without a significant effect on compression ratio.
Published: 2012

29. Indexed Multi-pattern Matching

Author: Juha Kärkkäinen, Veli Mäkinen, Travis Gagie, Jorma Tarhio, Leena Salmela, and Kalle Karhu
Subjects: Speedup, Parsing, Matching (graph theory), Computer science, String (computer science), Parse tree, Concatenation, 0102 computer and information sciences, 02 engineering and technology, computer.software_genre, 01 natural sciences, Combinatorics, Index (publishing), 010201 computation theory & mathematics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Pattern matching, Arithmetic, computer
Abstract: If we want to search sequentially for occurrences of many patterns in a given text, then we can apply any of dozens of multi-pattern matching algorithms in the literature. As far as we know, however, no one has said what to do if we are given a compressed self-index for the text instead of the text itself. In this paper we show how to take advantage of similarities between the patterns to speed up searches in an index. For example, we show how to store a string S[1..n] in nHk(S) + o(n(Hk(S) + 1)) bits such that, given the LZ77 parse of the concatenation of t patterns of total length l and maximum individual length m, we can count the occurrences of each pattern in a total of O((z + t) log l log m log^{1+ε} n) time, where z is the number of phrases in the parse.
Published: 2012

30. Cache Friendly Burrows-Wheeler Inversion

Author: Simon J. Puglisi and Juha Kärkkäinen
Subjects: Burrows–Wheeler transform, Computer science, Suffix array, Inversion (meteorology), 02 engineering and technology, Parallel computing, Electronic mail, law.invention, law, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Cache, Algorithm, Time complexity, Random access, Data compression
Abstract: The Burrows-Wheeler transform permutes the symbols of a string such that the permuted string can be compressed effectively with fast, simple techniques. Inversion of the transform is a bottleneck in practice. Inversion takes linear time, but, for each symbol decoded, folklore says that a random access into the transformed string (and so a CPU cache-miss) is necessary. In this paper we show how to mitigate cache misses and so speed inversion. Our main idea is to modify the standard inversion algorithm to detect and record repeated substrings in the original string as it is recovered. Subsequent occurrences of these repetitions are then copied in a cache friendly way from the already recovered portion of the string, shortcutting a series of random accesses by the standard inversion algorithm. We show experimentally that this approach leads to faster runtimes in general, and can drastically reduce inversion time for highly repetitive data.
Published: 2011

31. Fixed Block Compression Boosting in FM-Indexes

Author: Juha Kärkkäinen and Simon J. Puglisi
Subjects: Boosting (machine learning), business.industry, Pattern recognition, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, 010201 computation theory & mathematics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Split point, Artificial intelligence, Pattern matching, business, Algorithm, Random access, Mathematics, Fixed Block
Abstract: A compressed full-text self-index occupies space close to that of the compressed text and simultaneously allows fast pattern matching and random access to the underlying text. Among the best compressed self-indexes, in theory and in practice, are several members of the FM-index family. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes.
Published: 2011

32. Counting Colours in Compressed Strings

Author: Juha Kärkkäinen and Travis Gagie
Subjects: Range query (data structures), String (computer science), 0102 computer and information sciences, 02 engineering and technology, Binary logarithm, Data structure, 01 natural sciences, Substring, Longest repeated substring problem, Combinatorics, 010201 computation theory & mathematics, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Algorithm, Block size, Substring index, Mathematics
Abstract: Motivated by the problem of counting unique visitors to a website, we consider how to preprocess a string s[1..n] such that later, given a substring's endpoints, we can quickly count how many distinct characters that substring contains. The smallest reasonably fast previous data structure for this problem takes n log σ + O(n log log n) bits and answers queries in O(log n) time. We give a data structure for this problem that takes nH0(s) + O(n) + o(nH0(s)) bits, where H0(s) is the 0th-order empirical entropy of s, and answers queries in O(log l) time, where l is the length of the query substring. As far as we know, this is the first data structure, where the query time depends only on l and not on n. We also show how our data structure can be made partially dynamic.
Published: 2011

33. Medium-Space Algorithms for Inverse BWT

Author: Juha Kärkkäinen, Simon J. Puglisi, de Berg, Mark, Meyer, Ulrich, Finnish Centre of Excellence in Algorithmic Data Analysis Research (Algodan), Department of Computer Science, and Bioinformatics
Subjects: Computer science, education, Inverse, Inversion (meteorology), Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, 113 Computer and information sciences, Inverted index, 01 natural sciences, Bottleneck, Improved performance, 010201 computation theory & mathematics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Wavelet Tree, Algorithm, Data compression
Abstract: The Burrows-Wheeler transform is a powerful tool for data compression and has been the focus of intense research in the last decade. Little attention, however, has been paid to the inverse transform, even though it is a bottleneck in decompression. We introduce three new inversion algorithms with improved performance in a wide range of the space-time spectrum, as confirmed by both theoretical analysis and experimental comparison.
Published: 2010

34. Permuted Longest-Common-Prefix Array

Author: Giovanni Manzini, Juha Kärkkäinen, and Simon J. Puglisi
Subjects: Compressed suffix array, 0206 medical engineering, LCP array, Suffix array, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, Construct (python library), Lexicographical order, Quantitative Biology::Genomics, 01 natural sciences, Bottleneck, law.invention, Sparse array, 010201 computation theory & mathematics, law, Position (vector), Computer Science::Data Structures and Algorithms, Algorithm, 020602 bioinformatics, Mathematics
Abstract: The longest-common-prefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms including the fastest known LCP construction algorithm and some extremely space efficient algorithms. We also prove a new combinatorial property of the LCP values.
Published: 2009
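
The relationship between the PLCP array and the LCP array described above can be sketched in a few lines. This is a minimal illustration of one well-known way to compute LCP values in position order, and is not claimed to match the engineered algorithms of the paper; the function names and the naive suffix-sorting helper are assumptions made for the example.

```python
def suffix_array(s):
    """Naive suffix array by sorting suffixes (for illustration only)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def plcp_array(s, sa):
    """PLCP[i] = LCP of suffix i and its lexicographic predecessor."""
    n = len(s)
    phi = [-1] * n  # phi[i] = start of the suffix preceding suffix i
    for r in range(1, n):
        phi[sa[r]] = sa[r - 1]
    plcp = [0] * n
    length = 0
    for i in range(n):  # text (position) order, not lexicographic order
        j = phi[i]
        if j == -1:
            length = 0  # smallest suffix has no predecessor
        else:
            while i + length < n and j + length < n and s[i + length] == s[j + length]:
                length += 1
        plcp[i] = length
        length = max(length - 1, 0)  # uses PLCP[i + 1] >= PLCP[i] - 1
    return plcp


def lcp_from_plcp(plcp, sa):
    """Recover the LCP array in lexicographic order: LCP[r] = PLCP[SA[r]]."""
    return [plcp[p] for p in sa]
```

Because consecutive PLCP values drop by at most one, the inner comparison loop does linear total work once the suffix array is available.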

35. Engineering Radix Sort for Strings

Author: Juha Kärkkäinen and Tommi Rantala
Subjects: Sorting algorithm, Selection sort, Computer science, Comparison sort, Radix sort, 0102 computer and information sciences, 02 engineering and technology, Parallel computing, 01 natural sciences, 010201 computation theory & mathematics, 020204 information systems, Integer sorting, Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, Bucket sort, Counting sort, American flag sort
Abstract: We describe new implementations of MSD radix sort for efficiently sorting large collections of strings. Our implementations are significantly faster than previous MSD radix sort implementations, and in fact faster than any other string sorting algorithm on several data sets. We also describe a new variant that achieves high space-efficiency at a small additional cost on runtime.
Published: 2008
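
As background for the entry above, the basic idea of MSD radix sort for strings can be sketched as follows. This is a plain recursive illustration with dictionary buckets, assuming small inputs; it does not reflect the engineered implementations or the space-efficient variant described in the paper, and the function name and `depth` parameter are hypothetical.

```python
def msd_radix_sort(strings, depth=0):
    """Sort strings that all share the same first `depth` characters."""
    if len(strings) <= 1:
        return list(strings)
    # Strings that end exactly at this depth sort before all longer ones.
    result = [s for s in strings if len(s) == depth]
    # Distribute the remaining strings by their character at `depth`.
    buckets = {}
    for s in strings:
        if len(s) > depth:
            buckets.setdefault(s[depth], []).append(s)
    # Visit buckets in character order and recurse on the next character.
    for ch in sorted(buckets):
        result.extend(msd_radix_sort(buckets[ch], depth + 1))
    return result
```

The recursion depth grows with the length of the longest common prefix, so a production version would typically switch to another sort for small buckets.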

36. Approximate String Matching over Ziv—Lempel Compressed Text

Author: Gonzalo Navarro, Juha Kärkkäinen, and Esko Ukkonen
Subjects: Speedup, Approximation algorithm, 0102 computer and information sciences, 02 engineering and technology, String searching algorithm, Approximate string matching, 01 natural sciences, Combinatorics, Combinatorial analysis, 010201 computation theory & mathematics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Pattern matching, Alphabet, Mathematics, Data compression
Abstract: We present a solution to the problem of performing approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) time. The existence problem needs O(mkn) time. We also show that the algorithm can be adapted to run in O(k^2 n + min(mkn, m^2 (mσ)^k) + R) average time, where σ is the alphabet size. The experimental results show a speedup over the basic approach for moderate m and small k.
Published: 2000

37. Mining for similarities in aligned time series using wavelets

Author: Juha Kärkkäinen, Yka Huhtala, and Hannu Toivonen
Subjects: Sequence, Similarity (geometry), Series (mathematics), Wavelet transform, 02 engineering and technology, computer.software_genre, Geography, Wavelet, Transformation (function), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data mining, Time series, Cluster analysis, computer
Abstract: Discovery of non-obvious relationships between time series is an important problem in many domains, such as financial, sensory, and scientific data analysis. We consider data mining in aligned time series, which arise, e.g., in numerous online monitoring applications, and we are interested in finding time series which reflect the same external events. The time series can have different vertical positions, scales and overall trends, however still show related features at the same locations. The features can be short-term, such as small peaks and turns, or long-term, such as wider mountains and valleys. We propose using a wavelet transformation of a time series to produce a natural set of features for the sequence. Wavelet transformation yields features which describe properties of the sequence, both at various locations and at varying time granularities. In the proposed method, these features are processed so that they are insensitive to changes in the vertical position, scaling, and overall trend of the time series. We discuss the use of these features in data mining, in tasks such as clustering. We demonstrate how the features allow a flexible analysis of different aspects of the similarity: we show how one can examine how the similarity between time series changes as a function of time or as a function of time granularity considered. We present experimental results with real financial data sets. Experiments indicate that the proposed method can produce useful results. For instance, important similarities can be found in time series, which would be considered unrelated by visual inspection. Experiments with compression give encouraging results for the application of the method in mining massive time series data sets.
Published: 1999

38. Lempel-Ziv index for q-grams

Author: Erkki Sutinen and Juha Kärkkäinen
Subjects: Computational complexity theory, Suffix tree, Search engine indexing, Word processing, 0102 computer and information sciences, 02 engineering and technology, Binary logarithm, 01 natural sciences, law.invention, Combinatorics, 010201 computation theory & mathematics, law, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Pattern matching, Algorithm, Indexation, Mathematics, Gram
Abstract: We present a new sublinear-size index structure for finding all occurrences of a given q-gram in a text. Such a q-gram index is needed in many approximate pattern matching algorithms. All earlier q-gram indexes require at least O(n) space, where n is the length of the text. The new Lempel-Ziv index needs only O(n/log n) space while being as fast as previous methods. The new method takes advantage of repetitions in the text found by Lempel-Ziv parsing.
Published: 1996

39. Sparse suffix trees

Author: Juha Kärkkäinen and Esko Ukkonen
Subjects: Computer science, Suffix tree, Generalized suffix tree, Contrast (statistics), Random text, Data_CODINGANDINFORMATIONTHEORY, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, Longest common substring problem, law.invention, Combinatorics, 010201 computation theory & mathematics, Search algorithm, law, 020204 information systems, Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, Suffix
Abstract: A sparse suffix tree is a suffix tree that represents only a subset of the suffixes of the text. This is in contrast to the standard suffix tree that represents all suffixes. By selecting a small enough subset, a sparse suffix tree can be made to fit the available storage, unfortunately at the cost of increased search times. The idea of sparse suffix trees goes back to PATRICIA tries. Evenly spaced sparse suffix trees represent every kth suffix of the text. In the paper, we give general construction and search algorithms for evenly spaced sparse suffix trees, and present their run time analysis, both in the worst and in the average case. The algorithms are further improved by using so-called dual suffix trees.
Published: 1996

40. Suffix cactus: A cross between suffix tree and suffix array

Author: Juha Kärkkäinen
Subjects: Compressed suffix array, Suffix tree, Generalized suffix tree, Suffix array, 020207 software engineering, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, Longest common substring problem, law.invention, Combinatorics, 010201 computation theory & mathematics, law, 0202 electrical engineering, electronic engineering, information engineering, Regular expression, Suffix, FM-index, Mathematics
Abstract: The suffix cactus is a new alternative to the suffix tree and the suffix array as an index of large static texts. Its size and its performance in searches lies between those of the suffix tree and the suffix array. Structurally, the suffix cactus can be seen either as a compact variation of the suffix tree or as an augmented suffix array.
Published: 1995

41. Engineering a lightweight external memory suffix array construction algorithm

Author: Juha Kärkkäinen and Dominik Kempa
Subjects: Compressed suffix array, Computer science, Applied Mathematics, Generalized suffix tree, Suffix array, Algorithm engineering, 02 engineering and technology, law.invention, Computational Mathematics, Computational Theory and Mathematics, law, 020204 information systems, Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Out-of-core algorithm, Suffix, Algorithm, FM-index, Auxiliary memory
Abstract: We describe an external memory suffix array construction algorithm based on constructing suffix arrays for blocks of text and merging them into the full suffix array. The basic idea goes back over 20 years and there have been a couple of later improvements, but we describe several further improvements that make the algorithm much faster. In particular, we reduce the I/O volume of the algorithm by a factor $\mathcal{O}(\log_\sigma n)$. Our experiments show that the algorithm is the fastest suffix array construction algorithm when the size of the text is within a factor of about five from the size of the RAM in either direction, which is a common situation in practice.

42. Run Compressed Rank/Select for Large Alphabets

Author: Dmitry Kosolobov, Juha Kärkkäinen, José Fuentes-Sepúlveda, Simon J. Puglisi, Bilgin, Ali, Marcellin, Michael W., Serra-Sagrista, Joan, Storer, James A., Department of Computer Science, Helsinki Institute for Information Technology, Bioinformatics, and Algorithmic Bioinformatics
Subjects: FOS: Computer and information sciences, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, State of the art, Data structures, Arbitrary constants, Large alphabets, Combinatorics, Log-log plot, 020204 information systems, TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY, Computer Science - Data Structures and Algorithms, 0202 electrical engineering, electronic engineering, information engineering, Rank (graph theory), Data Structures and Algorithms (cs.DS), Run length, succinct, Data compression, String (computer science), State (functional analysis), rank select, Predecessor problems, Data structure, Binary logarithm, 113 Computer and information sciences, 010201 computation theory & mathematics, Optimal time, Alphabet, Constant (mathematics), MathematicsofComputing_DISCRETEMATHEMATICS
Abstract: Given a string of length $n$ that is composed of $r$ runs of letters from the alphabet $\{0,1,\ldots,\sigma{-}1\}$ such that $2 \le \sigma \le r$, we describe a data structure that, provided $r \le n / \log^{\omega(1)} n$, stores the string in $r\log\frac{n\sigma}{r} + o(r\log\frac{n\sigma}{r})$ bits and supports select and access queries in $O(\log\frac{\log(n/r)}{\log\log n})$ time and rank queries in $O(\log\frac{\log(n\sigma/r)}{\log\log n})$ time. We show that $r\log\frac{n(\sigma-1)}{r} - O(\log\frac{n}{r})$ bits are necessary for any such data structure and, thus, our solution is succinct. We also describe a data structure that uses $(1 + \epsilon)r\log\frac{n\sigma}{r} + O(r)$ bits, where $\epsilon > 0$ is an arbitrary constant, with the same query times but without the restriction $r \le n / \log^{\omega(1)} n$. By simple reductions to the colored predecessor problem, we show that the query times are optimal in the important case $r \ge 2^{\log^\delta n}$, for an arbitrary constant $\delta > 0$. We implement our solution and compare it with the state of the art, showing that the closest competitors consume 31-46% more space., Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. 10 pages, 1 figure, 4 tables; published in DCC'2018
Full Text: View/download PDF
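
To make the access, rank and select queries named in the abstract above concrete, here is a minimal Python sketch over a run-length encoded string. It is not the succinct structure from the paper: it stores plain Python lists and answers queries by binary search over runs, so it matches neither the space bound nor the query times quoted. The class name `RunLengthString` and its method names are hypothetical, chosen only for this illustration.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


class RunLengthString:
    """Illustrative run-length encoding with access/rank/select (not succinct)."""

    def __init__(self, text):
        self.starts = []     # start position of every run, in text order
        self.letters = []    # letter of every run, in text order
        self.by_letter = {}  # letter -> (starts of its runs, occurrences before each run)
        self.total = {}      # letter -> total number of occurrences
        pos = 0
        for letter, group in groupby(text):
            length = sum(1 for _ in group)
            self.starts.append(pos)
            self.letters.append(letter)
            run_starts, before = self.by_letter.setdefault(letter, ([], []))
            run_starts.append(pos)
            before.append(self.total.get(letter, 0))
            self.total[letter] = self.total.get(letter, 0) + length
            pos += length
        self.n = pos

    def access(self, i):
        """Letter at position i (0-based)."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.letters[bisect_right(self.starts, i) - 1]

    def rank(self, letter, i):
        """Number of occurrences of letter in positions 0 .. i-1."""
        run_starts, before = self.by_letter.get(letter, ([], []))
        k = bisect_left(run_starts, i)  # runs of this letter starting before i
        if k == 0:
            return 0
        end_count = before[k] if k < len(before) else self.total[letter]
        run_length = end_count - before[k - 1]
        return before[k - 1] + min(i - run_starts[k - 1], run_length)

    def select(self, letter, j):
        """Position of the j-th occurrence of letter (j is 1-based)."""
        if not 1 <= j <= self.total.get(letter, 0):
            raise ValueError("no such occurrence")
        run_starts, before = self.by_letter[letter]
        k = bisect_right(before, j - 1) - 1  # run containing the j-th occurrence
        return run_starts[k] + (j - 1 - before[k])
```

For example, with `s = RunLengthString("aaabbbbacc")` the string has four runs, and `s.rank("a", 8)` counts the occurrences of `a` among the first eight positions.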

43. Engineering external memory induced suffix sorting

Author: Bella Zhukova, Dominik Kempa, Simon J. Puglisi, and Juha Kärkkäinen
Subjects: Computer science, Suffix array, 0102 computer and information sciences, 02 engineering and technology, 01 natural sciences, 020202 computer hardware & architecture, law.invention, Suffix sorting, 010201 computation theory & mathematics, law, 0202 electrical engineering, electronic engineering, information engineering, Out-of-core algorithm, Arithmetic, Auxiliary memory

44. Efficient discovery of functional and approximate dependencies using partitions

Author: Juha Kärkkäinen, Pasi Porkka, Hannu Toivonen, and Yka Huhtala
Subjects: Computer science, Relational database, 02 engineering and technology, Lossless-Join Decomposition, computer.software_genre, Dependency theory (database theory), Set (abstract data type), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), 020201 artificial intelligence & image processing, Database theory, Data mining, Functional dependency, Row, Algorithm, computer
Abstract: Discovery of functional dependencies from relations has been identified as an important database analysis technique. We present a new approach for finding functional dependencies from large databases, based on partitioning the set of rows with respect to their attribute values. The use of partitions makes the discovery of approximate functional dependencies easy and efficient, and the erroneous or exceptional rows can be identified easily. Experiments show that the new algorithm is efficient in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.
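
The partition idea in the abstract above can be sketched in a few lines of Python. This is only an illustration of the principle, not the authors' algorithm: the function names are hypothetical, rows are assumed to be a non-empty list of dictionaries, and the error measure used for approximate dependencies (the smallest fraction of rows that must be removed for the dependency to hold) is an assumption, since the abstract does not say which measure is used.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into classes of rows that agree on all attrs."""
    classes = defaultdict(list)
    for idx, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(idx)
    return list(classes.values())


def holds(rows, lhs, rhs):
    """lhs -> rhs holds iff adding rhs to lhs splits no equivalence class."""
    lhs = list(lhs)
    return len(partition(rows, lhs)) == len(partition(rows, lhs + [rhs]))


def violation_fraction(rows, lhs, rhs):
    """Smallest fraction of rows to drop so that lhs -> rhs holds (assumed measure)."""
    keep = 0
    for cls in partition(rows, list(lhs)):
        counts = defaultdict(int)
        for idx in cls:
            counts[rows[idx][rhs]] += 1
        # Within one class, keep only the rows with the most common rhs value.
        keep += max(counts.values())
    return 1 - keep / len(rows)
```

The rows outside the majority value of each class are the exceptional rows, which is one way to read the abstract's remark that erroneous rows can be identified easily.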

45. Better external memory suffix array construction

Author: Juha Kärkkäinen, Roman Dementiev, Peter Sanders, and Jens Mehnert
Subjects: Compressed suffix array, Theoretical computer science, Computer science, DATA processing & computer science, Suffix array, Algorithm engineering, 0102 computer and information sciences, 02 engineering and technology, Data structure, 01 natural sciences, Theoretical Computer Science, law.invention, 010201 computation theory & mathematics, law, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, ddc:004, Suffix, Auxiliary memory, FM-index, Data compression
Abstract: Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular, in bioinformatics. However, so far, it has appeared prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on average. Our implementation can construct suffix arrays for inputs of up to 4-GB in hours on a low-cost machine. As a tool of possible independent interest, we present a systematic way to design, analyze, and implement pipelined algorithms.
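
For readers unfamiliar with the data structure, the following Python sketch shows what a suffix array is and how it serves as a full-text index. It is a naive in-memory construction that sorts the suffixes directly. It is not one of the external memory algorithms described in the abstract and would not scale to inputs that exceed main memory. The function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, pattern):
    """Start positions of pattern in text, found by binary search over sa."""
    m = len(pattern)
    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

All suffixes that start with the pattern are adjacent in the array, which is why two binary searches are enough to locate every occurrence.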

46. Impact Of The Energy Model On The Complexity Of RNA Folding With Pseudoknots

Author: Yann Ponty, Rolf Backofen, Saad Sheikh, Laboratoire d'informatique de l'École polytechnique [Palaiseau] (LIX), Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X), Algorithms and Models for Integrative Biology (AMIB ), Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X)-Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Albert-Ludwigs-Universität Freiburg, Juha Kärkkäinen, Juha Kärkkäinen and Jens Stoye, ANR-10-BLAN-0204,MAGNUM,Méthodes Algorithmiques de Génération aléatoire Non Uniforme, Modèles et applications.(2010), École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS), and École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire de Recherche en Informatique (LRI)
Subjects: Work (thermodynamics), Matching (graph theory), 0206 medical engineering, Stacking, 02 engineering and technology, ACM: G.: Mathematics of Computing/G.2: DISCRETE MATHEMATICS/G.2.2: Graph Theory/G.2.2.0: Graph algorithms, Combinatorics, 03 medical and health sciences, Hardness, ACM: F.: Theory of Computation/F.2: ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY/F.2.2: Nonnumerical Algorithms and Problems/F.2.2.1: Computations on discrete structures, RNA folding, 030304 developmental biology, Mathematics, 0303 health sciences, Contrast (statistics), ACM: F.: Theory of Computation/F.2: ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY/F.2.2: Nonnumerical Algorithms and Problems/F.2.2.3: Pattern matching, Folding (DSP implementation), [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], General pseudoknots, Rna folding, ACM: J.: Computer Applications/J.3: LIFE AND MEDICAL SCIENCES/J.3.0: Biology and genetics, [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], Parametrization, 020602 bioinformatics, Energy (signal processing), Inapproximability
Abstract: Predicting the folding of an RNA sequence, while allowing general pseudoknots (PK), consists in finding a minimal free-energy matching of its $n$ positions. Assuming independently contributing base-pairs, the problem can be solved in $\Theta(n^3)$-time using a variant of the maximal weighted matching. By contrast, the problem was previously proven NP-Hard in the more realistic nearest-neighbor energy model. In this work, we consider an intermediate model, called the stacking-pairs energy model. We extend a result by Lyngsø, showing that RNA folding with PK is NP-Hard within a large class of parametrizations for the model. We also show the approximability of the problem, by giving a practical $\Theta(n^3)$ algorithm that achieves at least a $5$-approximation for any parametrization of the stacking model. This contrasts nicely with the nearest-neighbor version of the problem, which we prove cannot be approximated within any positive ratio, unless $P=NP$.
Published: 2012

47. Hardness of Longest Common Subsequence for Sequences with Bounded Run-Lengths

Author: Minghui Jiang, Pedro J. Tejada, Laurent Bulteau, Guillaume Blin, Stéphane Vialette, Laboratoire d'Informatique Gaspard-Monge (LIGM), Université Paris-Est Marne-la-Vallée (UPEM)-École des Ponts ParisTech (ENPC)-ESIEE Paris-Fédération de Recherche Bézout-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique de Nantes Atlantique (LINA), Mines Nantes (Mines Nantes)-Université de Nantes (UN)-Centre National de la Recherche Scientifique (CNRS), Department of Mathematics and Statistics [Logan], Utah State University (USU), Juha Kärkkäinen and Jens Stoye, Centre National de la Recherche Scientifique (CNRS)-Fédération de Recherche Bézout-ESIEE Paris-École des Ponts ParisTech (ENPC)-Université Paris-Est Marne-la-Vallée (UPEM), and Centre National de la Recherche Scientifique (CNRS)-Mines Nantes (Mines Nantes)-Université de Nantes (UN)
Subjects: [INFO.INFO-CC]Computer Science [cs]/Computational Complexity [cs.CC], 0102 computer and information sciences, 02 engineering and technology, Longest increasing subsequence, 16. Peace & justice, [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], 01 natural sciences, Combinatorics, Longest common subsequence problem, 010201 computation theory & mathematics, Symbol (programming), Bounded function, Longest alternating subsequence, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], Alphabet, Mathematics
Abstract: The longest common subsequence (LCS) problem is a classic and well-studied problem in computer science with extensive applications in diverse areas ranging from spelling error corrections to molecular biology. This paper focuses on LCS for fixed alphabet size and fixed run-lengths (i.e., maximum number of consecutive occurrences of the same symbol). We show that LCS is NP-complete even when restricted to (i) alphabets of size 3 and run-length at most 1, and (ii) alphabets of size 2 and run-length at most 2 (both results are tight). For the latter case, we show that the problem is approximable within ratio 3/5.
Published: 2012
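
As a small illustration of the two notions in the abstract above, the Python sketch below computes the run-length of a sequence (its longest block of equal consecutive symbols) and the LCS length of two sequences with the textbook dynamic program. Note the scope: for two sequences LCS is solvable in polynomial time, and NP-hardness for LCS arises when the number of input sequences is unbounded, so this sketch does not touch the hard case studied in the paper. The function names are hypothetical.

```python
from itertools import groupby


def max_run_length(seq):
    """Length of the longest block of equal consecutive symbols (0 if empty)."""
    return max((sum(1 for _ in group) for _, group in groupby(seq)), default=0)


def lcs_length(a, b):
    """Length of a longest common subsequence of two sequences (classic DP)."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the prefix of a seen so far
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

A binary sequence such as `0110100` has run-length 2, which is the restriction in case (ii) of the abstract.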
