Natural Language Compression on Edge-Guided text preprocessing

Authors :
Martínez-Prieto, Miguel A.
Adiego, Joaquín
de la Fuente, Pablo
Source :
Information Sciences. Dec 2011, Vol. 181, Issue 24, p5387-5411. 25p.
Publication Year :
2011

Abstract

This paper presents Edge-Guided (E-G), an optimized text preprocessing technique for compression purposes. It transforms the original text into a word net, which stores all relationships between adjoining words. A specific directed graph is proposed to model this transformation: words are stored in vertices, whereas edges represent word transitions. Thus, the word net has a text representation which reflects the natural word order in the text, so it can be used directly for encoding purposes. A specific coding scheme is described on top of the word net. It regards a text as a sequence of word transitions, in such a way that each word is encoded by traversing a specific edge from the vertex which stores the preceding word. This accomplishes a 1-order text preprocessing whose output is an intermediate byte representation that can be effectively encoded with universal techniques. This technique is called E-G 1 and supports some variants. This experience is used to revisit the concept of word net. It is used to identify significant 2-word symbols by performing a specific transformation on frequent edges. The resultant transformed word net appends these 2-word symbols to the original word vocabulary, and allows a specific hierarchical relationship between them and their component words. The transformed approach also enhances the original coding scheme to handle these new features. A new technique, called E-G 2, approximates a 2-order model on words and also supports specific variants. Both techniques are studied from empirical and experimental perspectives. Some compressors are also used to analyze the preprocessing ability of E-G with respect to different compression approaches. Competitive space/time trade-offs are achieved when our approaches are used to compress medium-large size texts. The best results are achieved when E-G preprocessing is coupled with high-order compressors such as Prediction by Partial Matching (PPM). [Copyright Elsevier]
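
The word-net idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' E-G 1 coder: the token kinds (`"word"`, `"vertex"`, `"edge"`), the function names `encode` and `decode`, and the use of symbolic tuples instead of the byte representation mentioned above are all assumptions made for illustration. Vertices hold words, each vertex keeps an ordered list of outgoing edges, and a word is encoded as the rank of the edge that leads to it from the preceding word's vertex.

```python
from collections import defaultdict


def encode(words):
    """Turn a word sequence into word-net tokens (illustrative token format)."""
    vocab = {}                 # word -> vertex id
    edges = defaultdict(list)  # vertex id -> successor vertex ids, in order of creation
    tokens = []
    prev = None                # None plays the role of a virtual start vertex
    for w in words:
        if w not in vocab:
            # Unseen word: create its vertex and the edge that reaches it.
            v = len(vocab)
            vocab[w] = v
            tokens.append(("word", w))
            edges[prev].append(v)
        else:
            v = vocab[w]
            succ = edges[prev]
            if v in succ:
                # Known transition: emit only the rank of the traversed edge.
                tokens.append(("edge", succ.index(v)))
            else:
                # Known word reached through a new transition: add the edge.
                tokens.append(("vertex", v))
                succ.append(v)
        prev = v
    return tokens


def decode(tokens):
    """Rebuild the word sequence by replaying the same word-net updates."""
    words_by_id = []
    edges = defaultdict(list)
    out = []
    prev = None
    for kind, value in tokens:
        if kind == "edge":
            v = edges[prev][value]
        else:
            if kind == "word":
                words_by_id.append(value)
                v = len(words_by_id) - 1
            else:  # "vertex"
                v = value
            edges[prev].append(v)
        out.append(words_by_id[v])
        prev = v
    return out


text = "to be or not to be or not to be".split()
tokens = encode(text)
assert decode(tokens) == text
```

In this toy input, once the four distinct words and the transition from "not" back to "to" have been seen, every remaining word is emitted as an `("edge", 0)` token. Such a skewed, repetitive stream is the kind of output a universal compressor can then encode compactly, which is the role the abstract assigns to the preprocessing step.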

Details

Language :
English
ISSN :
00200255
Volume :
181
Issue :
24
Database :
Academic Search Index
Journal :
Information Sciences
Publication Type :
Periodical
Accession number :
66663799
Full Text :
https://doi.org/10.1016/j.ins.2011.07.039