Semantic-based multilingual document clustering via tensor modeling

Authors :
Andrea Tagarelli
Dino Ienco
Salvatore Romeo
Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica [Calabria] (DIMES)
Università della Calabria [Arcavacata di Rende] (Unical)
Territoires, Environnement, Télédétection et Information Spatiale (UMR TETIS)
Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad)-AgroParisTech-Institut national de recherche en sciences et technologies pour l'environnement et l'agriculture (IRSTEA)
ADVanced Analytics for data SciencE (ADVANSE)
Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM)
Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)
Source :
Scopus-Elsevier, Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct 2014, Doha, Qatar. pp. 600-609, ⟨10.3115/v1/D14-1065⟩

Abstract

EMNLP, Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25/10/2014 - 29/10/2014; International audience.

A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To cope with this issue, we propose a new document clustering approach for multilingual corpora that (i) exploits a large-scale multilingual knowledge base, (ii) takes advantage of the multi-topic nature of the text documents, and (iii) employs a tensor-based model to deal with high dimensionality and sparseness. Results show the significance of our approach and its better performance with respect to classic document clustering approaches, in both a balanced and an unbalanced corpus evaluation.

Details

Database :
OpenAIRE
Journal :
Scopus-Elsevier, Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct 2014, Doha, Qatar. pp. 600-609, ⟨10.3115/v1/D14-1065⟩
Accession number :
edsair.doi.dedup.....8c449c8345580a4f03cbe2df8549843d
Full Text :
https://doi.org/10.3115/v1/D14-1065