Back to Search Start Over

XML tree structure compression using RePair.

Authors :
Lohrey, Markus
Maneth, Sebastian
Mennicke, Roy
Source :
Information Systems. Nov2013, Vol. 38 Issue 8, p1150-1167. 18p.
Publication Year :
2013

Abstract

Abstract: XML tree structures can conveniently be represented using ordered unranked trees. Due to the repetitiveness of XML markup these trees can be compressed effectively using dictionary-based methods, such as minimal directed acyclic graphs (DAGs) or straight-line context-free (SLCF) tree grammars. While minimal SLCF tree grammars are in general smaller than minimal DAGs, they cannot be computed in polynomial time unless . Here, we present a new linear time algorithm for computing small SLCF tree grammars, called TreeRePair, and show that it greatly outperforms the best known previous algorithm BPLEX. TreeRePair is a generalization to trees of Larsson and Moffat's RePair string compression algorithm. SLCF tree grammars can be used as efficient memory representations of trees. Using TreeRePair, we are able to produce the smallest queryable memory representation of ordered trees that we are aware of. Our investigations over a large corpus of commonly used XML documents show that tree traversals over TreeRePair grammars are 14 times slower than over pointer structures and 5 times slower than over succinct trees, while memory consumption is only 1/43 and 1/6, respectively. With respect to file compression we are able to show that a Huffman-based coding of TreeRePair grammars gives compression ratios comparable to the best known XML file compressors. [Copyright &y& Elsevier]

Details

Language :
English
ISSN :
03064379
Volume :
38
Issue :
8
Database :
Academic Search Index
Journal :
Information Systems
Publication Type :
Academic Journal
Accession number :
90422892
Full Text :
https://doi.org/10.1016/j.is.2013.06.006