Back to Search Start Over

Similar Subtree Search Using Extended Tree Inclusion.

Authors :
Mori, Tomoya
Takasu, Atsuhiro
Jansson, Jesper
Hwang, Jaewook
Tamura, Takeyuki
Akutsu, Tatsuya
Source :
IEEE Transactions on Knowledge & Data Engineering. Dec2015, Vol. 27 Issue 12, p3360-3373. 14p.
Publication Year :
2015

Abstract

This paper considers the problem of identifying all locations of subtrees in a large tree or in a large collection of trees that are similar to a specified pattern tree, where all trees are assumed to be rooted and node-labeled. The tree edit distance is a widely-used measure of tree (dis-)similarity, but is NP-hard to compute for unordered trees. To cope with this issue, we propose a new similarity measure which extends the concept of unordered tree inclusion by taking the costs of insertion and substitution operations on the pattern tree into account, and present an algorithm for computing it. Our algorithm has the same time complexity as the original one for unordered tree inclusion, i.e., it runs in $O(<INNOPIPE>T_1<INNOPIPE><INNOPIPE>T_2<INNOPIPE>)$<alternatives> <inline-graphic xlink:type="simple" xlink:href="akutsu-ieq1-2457922.gif"/></alternatives> time, where $T_1$<alternatives><inline-graphic xlink:type="simple" xlink:href="akutsu-ieq2-2457922.gif"/></alternatives> and $T_2$<alternatives> <inline-graphic xlink:type="simple" xlink:href="akutsu-ieq3-2457922.gif"/></alternatives> denote the pattern tree and the text tree, respectively, when the maximum outdegree of $T_1$<alternatives> <inline-graphic xlink:type="simple" xlink:href="akutsu-ieq4-2457922.gif"/></alternatives> is bounded by a constant. Our experimental evaluation using synthetic and real datasets confirms that the proposed algorithm is fast and scalable and very useful for bibliographic matching, which is a typical entity resolution problem for tree-structured data. Furthermore, we extend our algorithm to also allow a constant number of deletion operations on $T_1$<alternatives><inline-graphic xlink:type="simple" xlink:href="akutsu-ieq5-2457922.gif"/></alternatives> while still running in $O(<INNOPIPE>T_1<INNOPIPE><INNOPIPE>T_2<INNOPIPE>)$<alternatives> <inline-graphic xlink:type="simple" xlink:href="akutsu-ieq6-2457922.gif"/></alternatives> time. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
10414347
Volume :
27
Issue :
12
Database :
Academic Search Index
Journal :
IEEE Transactions on Knowledge & Data Engineering
Publication Type :
Academic Journal
Accession number :
110834653
Full Text :
https://doi.org/10.1109/TKDE.2015.2457922