Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur.

Authors :
Tohti, Turdi
Huang, Jimmy
Hamdulla, Askar
Tan, Xing
Source :
Information (2078-2489); Aug2019, Vol. 10 Issue 8, p246-246, 1p
Publication Year :
2019

Abstract

Given its generality and its high time-efficiency on large data sets, text filtering through pattern matching has in recent years attracted increasing attention from the information retrieval and Natural Language Processing (NLP) research communities. It has yet to be seen, however, how this technique and its algorithms (e.g., Wu–Manber, which is also considered in this paper) can be applied and adapted properly and effectively to Uyghur, a low-resource language mostly spoken by the ethnic Uyghur group, with a population of more than eleven million, in Xinjiang, China. We observe that, technically, the challenge is mainly caused by two factors: (1) vowel weakening and (2) mismatching in semantics between affixes and stems. Accordingly, in this paper, we propose Wu–Manber–Uy, an improved variant of Wu–Manber dedicated to the Uyghur language. Wu–Manber–Uy implements a stem deformation-based pattern expansion strategy, specifically to reduce the pattern mismatches caused by vowel weakening and spelling errors. Wu–Manber–Uy also uses a two-way strategy that monitors and controls changes in the lexical meaning of stems during word-building. Extra considerations with respect to Word2vec and the dictionary are incorporated into the system for processing Uyghur. The experimental results we have obtained consistently demonstrate the high performance of Wu–Manber–Uy. [ABSTRACT FROM AUTHOR]
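
For orientation, the baseline multi-pattern technique named in the abstract can be sketched as follows. This is a minimal illustration of the classic Wu–Manber shift-table idea only, not the authors' Wu–Manber–Uy: it performs no stem-deformation expansion and no semantic checking. The function name `wu_manber_search`, the block size of 2, and the English sample text are assumptions made for this example.

```python
from collections import defaultdict


def wu_manber_search(text, patterns, block_size=2):
    """Return (start_index, pattern) for every occurrence of any pattern.

    Simplified sketch of classic Wu-Manber; names and defaults are
    illustrative assumptions, not taken from the paper.
    """
    patterns = [p for p in patterns if p]
    if not patterns:
        return []

    m = min(len(p) for p in patterns)   # only the first m chars drive the tables
    b = min(block_size, m)              # block length cannot exceed m
    default_shift = m - b + 1           # safe skip for blocks seen in no pattern

    shift = {}                          # block -> how far the window may jump
    buckets = defaultdict(list)         # block ending the m-prefix -> candidates
    for p in patterns:
        for end in range(b, m + 1):
            block = p[end - b:end]
            shift[block] = min(shift.get(block, default_shift), m - end)
        buckets[p[m - b:m]].append(p)

    matches = []
    pos = m - 1                         # index of the last char of the window
    while pos < len(text):
        block = text[pos - b + 1:pos + 1]
        step = shift.get(block, default_shift)
        if step == 0:
            # The block could end some pattern's m-prefix: verify candidates.
            start = pos - m + 1
            for p in buckets.get(block, ()):
                if text.startswith(p, start):
                    matches.append((start, p))
            step = 1
        pos += step
    return matches


print(wu_manber_search("the quick brown fox", ["quick", "brown", "fox"]))
```

A pattern-expansion step of the kind the abstract describes would, under this sketch, amount to adding the deformed stem variants to `patterns` before the tables are built; how those variants are generated is specific to the paper and is not reproduced here.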

Details

Language :
English
ISSN :
20782489
Volume :
10
Issue :
8
Database :
Complementary Index
Journal :
Information (2078-2489)
Publication Type :
Academic Journal
Accession number :
138318976
Full Text :
https://doi.org/10.3390/info10080246