Supervised Papers Classification on Large-Scale High-Dimensional Data with Apache Spark
- Source :
- DASC/PiCom/DataCom/CyberSciTech
- Publication Year :
- 2018
- Publisher :
- IEEE, 2018.
Abstract
- The problem of classifying a research article into one or more fields of science is of particular importance for academic search engines and digital libraries. A robust classification algorithm offers users a wide variety of useful tools, such as refinement of their search results, browsing of articles by category, and recommendation of similar articles. The approaches found in the current literature attempt to address this problem without taking into consideration important parameters such as the previous history of the authors and the categorization of the scientific journals that publish the articles. In addition, existing works overlook the huge volume of the academic data involved. In this paper, we extend an existing effective algorithm for research article classification and parallelize it on Apache Spark, a parallelization framework capable of sharing large amounts of data across the main memory of the nodes of a cluster, to enable the processing of large academic datasets. Furthermore, we present data manipulation methodologies that are useful not only for this particular problem, but also for most parallel machine learning approaches. In our experimental evaluation, we demonstrate that our proposed algorithm is considerably more accurate than the supervised learning approaches implemented within the machine learning library of Spark, and that it also outperforms them in execution speed by a significant margin.
- Subjects :
- Computer science
Data manipulation language
Supervised learning
Feature extraction
02 engineering and technology
Digital library
Machine learning
Statistical classification
Categorization
Margin (machine learning)
020204 information systems
Spark (mathematics)
0202 electrical engineering, electronic engineering, information engineering
020201 artificial intelligence & image processing
Artificial intelligence
Details
- Database :
- OpenAIRE
- Journal :
- 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech)
- Accession number :
- edsair.doi...........628d1bd08dfbd3fb4f1569f3cf52484e