Back to Search Start Over

n-Gram-Based Text Compression.

Authors :
Nguyen, Vu H.
Nguyen, Hien T.
Duong, Hieu N.
Snasel, Vaclav
Source :
Computational Intelligence & Neuroscience. 11/14/2016, p1-11. 11p.
Publication Year :
2016

Abstract

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based on its corresponding n-gram dictionary. We collected 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigram to five grams and achieve dictionaries with a size of 12 GB in total. In order to evaluate our method, we collected a testing set of 10 different text files with different sizes. The experimental results indicate that our method achieves compression ratio around 90% and outperforms state-of-the-art methods. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
16875265
Database :
Academic Search Index
Journal :
Computational Intelligence & Neuroscience
Publication Type :
Academic Journal
Accession number :
119451991
Full Text :
https://doi.org/10.1155/2016/9483646