Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering.

Authors :: Tang, Guoyu
Xia, Yunqing
Cambria, Erik
Jin, Peng
Zheng, Thomas Fang
Source :: International Journal of Pattern Recognition & Artificial Intelligence. Mar2015, Vol. 29 Issue 2, p-1. 26p.
Publication Year :: 2015
Abstract :: Cross-lingual document clustering is the task of automatically organizing a large collection of multilingual documents into a few clusters according to their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To address them, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. In particular, the former consists in a sense-based vector space model and the latter leverages a sense-based latent Dirichlet allocation. Evaluation on the benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering. [ABSTRACT FROM AUTHOR]

Subjects :: *DOCUMENT clustering
*BILINGUALISM
*ACQUISITION of data
*TRANSLATIONS
*AMBIGUITY
*DIRICHLET problem
*WORD (Linguistics)

Language :: English
ISSN :: 02180014
Volume :: 29
Issue :: 2
Database :: Academic Search Index
Journal :: International Journal of Pattern Recognition & Artificial Intelligence
Publication Type :: Academic Journal
Accession number :: 108351166
Full Text :: https://doi.org/10.1142/S021800141559003X