Efficient natural language pre-processing for analyzing large data sets
- Source :
- IEEE BigData
- Publication Year :
- 2016
- Publisher :
- IEEE, 2016.
Abstract
- The phenomenon of big data is described using five Vs: Volume, Variety, Velocity, Variability and Veracity. In this paper, we are interested in analyzing and pre-processing tweets for NLP and machine learning applications such as machine translation and classification. Content collected from Twitter (tweets) is unstructured, highly noisy and short (140 characters). Overcoming these challenges makes it possible to learn from such data and to apply traditional NLP and machine learning techniques. We propose a pre-processing pipeline for tweets consisting of part-of-speech filtering, named entity recognition, hashtag segmentation and disambiguation. The proposed approach also draws on graph theory: it groups the words of tweets using the semantic relations of WordNet and the idea of connected components. Evaluations on a classification task showed promising results when using the proposed pre-processing pipeline, with an increase in the accuracy of the classification up to 87.6%.
- Subjects :
- Connected component
Information retrieval
Machine translation
Computer science
Feature extraction
Big data
WordNet
Engineering and technology
Semantics
Pipeline (software)
Information systems
Electrical engineering, electronic engineering, information engineering
Artificial intelligence & image processing
Artificial intelligence
Language translation
Natural language
Natural language processing
Details
- Database :
- OpenAIRE
- Journal :
- 2016 IEEE International Conference on Big Data (Big Data)
- Accession number :
- edsair.doi...........5d525cd8bf11459a68d4f434b1ec8577
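
The abstract mentions grouping the words of tweets through semantic relations and connected components. The record gives no implementation details, so the following is only a minimal sketch of that general idea, not the authors' method. The relation table `TOY_RELATIONS` and the function `group_words` are hypothetical names introduced for illustration; a small hand-written set of word pairs stands in for WordNet relations so that the example stays self-contained.

```python
from collections import defaultdict

# Hypothetical stand-in for WordNet semantic relations (assumption):
# each pair means "these two words are semantically related".
TOY_RELATIONS = {
    ("car", "automobile"),
    ("automobile", "vehicle"),
    ("happy", "glad"),
}


def group_words(words, relations):
    """Group words into connected components of the relation graph.

    Nodes are the distinct input words; an undirected edge joins two
    words when the pair appears in `relations`. Pairs involving a word
    that is not in the input are ignored.
    """
    vocab = sorted(set(words))
    vocab_set = set(vocab)

    # Build an undirected adjacency map restricted to the input words.
    graph = defaultdict(set)
    for a, b in relations:
        if a in vocab_set and b in vocab_set:
            graph[a].add(b)
            graph[b].add(a)

    # Depth-first traversal: each unvisited word starts a new component.
    seen = set()
    groups = []
    for word in vocab:
        if word in seen:
            continue
        seen.add(word)
        stack = [word]
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in graph[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    tweet_words = ["car", "vehicle", "automobile", "happy", "glad", "tweet"]
    for group in group_words(tweet_words, TOY_RELATIONS):
        print(group)
```

In this sketch, words end up in the same group only when a chain of relations links them through words that are actually present in the input, so "car" and "vehicle" are grouped here because "automobile" also occurs. Whether the paper restricts edges in this way is not stated in the abstract.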