Back to Search Start Over

Ensemble multi-label text categorization based on rotation forest and latent semantic indexing

Authors :
Wafa Saadaoui
Alex Aussem
Haytham Elghazel
Ouadie Gharroudi
Elghazel, Haytham
Data Mining and Machine Learning (DM2L)
Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS)
Institut National des Sciences Appliquées de Lyon (INSA Lyon)
Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-École Centrale de Lyon (ECL)
Université de Lyon-Université Lumière - Lyon 2 (UL2)-Institut National des Sciences Appliquées de Lyon (INSA Lyon)
Université de Lyon-Université Lumière - Lyon 2 (UL2)
Source :
Expert Systems with Applications, Expert Systems with Applications, Elsevier, 2016, 57, pp.1-11
Publication Year :
2016
Publisher :
Elsevier BV, 2016.

Abstract

Text categorization has gained increasing popularity in the last years due the explosive growth of multimedia documents. As a document can be associated with multiple non-exclusive categories simultaneously (e.g., Virus, Health, Sports, and Olympic Games), text categorization provides many opportunities for developing novel multi-label learning approaches devoted specifically to textual data. In this paper, we propose an ensemble multi-label classification method for text categorization based on four key ideas: (1) performing Latent Semantic Indexing based on distinct orthogonal projections on lower-dimensional spaces of concepts; (2) random splitting of the vocabulary; (3) document bootstrapping; and (4) the use of BoosTexter as a powerful multi-label base learner for text categorization to simultaneously encourage diversity and individual accuracy in the committee. Diversity of the ensemble is promoted through random splits of the vocabulary that leads to different orthogonal projections on lower-dimensional latent concept spaces. Accuracy of the committee members is promoted through the underlying latent semantic structure uncovered in the text. The combination of both rotation-based ensemble construction and Latent Semantic Indexing projection is shown to bring about significant improvements in terms of Average Precision, Coverage, Ranking loss and One error compared to five state-of-the-art approaches across 14 real-word textual data sets covering a wide variety of topics including health, education, business, science and arts.

Details

ISSN :
09574174
Volume :
57
Database :
OpenAIRE
Journal :
Expert Systems with Applications
Accession number :
edsair.doi.dedup.....b1143760295956ca13ce3e63e2d2763c
Full Text :
https://doi.org/10.1016/j.eswa.2016.03.041