Algorithm for Enumerating All Maximal Frequent Tree Patterns among Words in Tree-Structured Documents and Its Application.

Authors :: Uchida, Tomoyuki; Kawamoto, Kayo
Source :: Database Theory & Application; 2009, p107-114, 8p
Publication Year :: 2009
Abstract: To extract structural features among the nodes in which characteristic words appear in tree-structured documents, we previously proposed a text mining algorithm for enumerating all frequent consecutive path patterns (CPPs) on a list W of words (PAKDD, 2004). In this paper, we first extend a CPP to a tree pattern, called a tree association pattern (TAP), over a set W of words. A TAP is an ordered rooted tree t such that the root of t has either no children or at least two children, all leaves of t are labeled with nonempty subsets of W, and all internal nodes, if any, are labeled with strings. Next, we present text mining algorithms for enumerating all maximal frequent TAPs in tree-structured documents. We then evaluate our algorithms by reporting experimental results on Reuters news-wires. Finally, as an application of CPPs, we present an algorithm for a CPP-based wrapper that uses the XSLT transformation language, and briefly demonstrate the wrapper by translating one Reuters news-wire into another XML document. [ABSTRACT FROM AUTHOR]
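
The TAP definition in the abstract can be read as three structural conditions on a labeled tree. The following is a minimal Python sketch of a checker for those conditions only; the names `Node` and `is_tap` and the sample words are hypothetical and are not taken from the paper, and the sketch does not attempt the enumeration or frequency-counting algorithms, which the abstract does not describe.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Union


@dataclass
class Node:
    """Hypothetical tree node: a string label for internal nodes,
    a frozenset of words for leaves. Children are kept in order."""
    label: Union[str, FrozenSet[str]]
    children: List["Node"] = field(default_factory=list)


def is_tap(root: Node, words: FrozenSet[str]) -> bool:
    """Check the three conditions stated in the abstract's TAP definition."""
    # Condition 1: the root has no children or at least two children.
    if len(root.children) == 1:
        return False
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Condition 3: every internal node is labeled with a string.
            if not isinstance(node.label, str):
                return False
            stack.extend(node.children)
        else:
            # Condition 2: every leaf is labeled with a nonempty subset of W.
            if not isinstance(node.label, frozenset):
                return False
            if not node.label or not node.label <= words:
                return False
    return True


# Illustrative word set and trees (assumed, not from the paper).
W = frozenset({"oil", "price", "bank"})
single_leaf = Node(frozenset({"oil"}))
two_branches = Node("article", [
    Node(frozenset({"oil"})),
    Node("body", [Node(frozenset({"price", "bank"}))]),
])
one_child_root = Node("article", [Node(frozenset({"oil"}))])
```

Under this reading, `single_leaf` and `two_branches` satisfy the definition, while `one_child_root` does not because its root has exactly one child.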
