Efficient natural language pre-processing for analyzing large data sets

Authors :: Alexsandro Fonseca
Fatiha Sadat
Belainine Billal
Source :: IEEE BigData
Publication Year :: 2016
Publisher :: IEEE, 2016.
Abstract: The phenomenon of big data is described using five Vs: Volume, Variety, Velocity, Variability and Veracity. In this paper, we are interested in analyzing and pre-processing tweets for NLP and machine learning applications such as machine translation and classification. Content collected from Twitter (tweets) is considered unstructured, highly noisy and short (140 characters). Overcoming these challenges will make it possible to learn from such data and to apply traditional NLP and machine learning techniques. We propose a pre-processing pipeline for tweets consisting of part-of-speech filtering, named entity recognition, hashtag segmentation and disambiguation. Our approach is also based on graph theory: it groups the words of tweets using the semantic relations of WordNet and the idea of connected components. Evaluations on a classification task showed promising results when using the proposed pre-processing pipeline, with classification accuracy increasing up to 87.6%.
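
The grouping idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that two words are linked when a semantic relation holds between them, and it uses a small hand-written set of pairs (the hypothetical `RELATED_PAIRS`) as a stand-in for WordNet lookups. The function name `group_words` is likewise an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for WordNet relations (synonymy, hypernymy, ...).
# A real pipeline would query WordNet instead of this toy set.
RELATED_PAIRS = {
    ("car", "automobile"),
    ("automobile", "vehicle"),
    ("dog", "puppy"),
}


def group_words(words, related_pairs):
    """Group words into connected components of the relation graph."""
    vocab = list(dict.fromkeys(words))  # drop duplicates, keep order
    parent = {w: w for w in vocab}      # union-find forest

    def find(w):
        # Follow parent links to the root, halving the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Add an edge only when both words occur in the tweet.
    for a, b in related_pairs:
        if a in parent and b in parent:
            union(a, b)

    groups = defaultdict(list)
    for w in vocab:
        groups[find(w)].append(w)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


tweet_words = ["car", "vehicle", "automobile", "dog", "puppy", "rain"]
components = group_words(tweet_words, RELATED_PAIRS)
print(components)
```

In this toy example "car" and "vehicle" fall into the same group only because both are linked to "automobile", while "rain" has no related word and stays in a group of its own.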
