Corpus-based topic diffusion for short text clustering.

Authors :
Zheng, Chu Tao
Liu, Cheng
Wong, Hau San
Source :
Neurocomputing. Jan2018, Vol. 275, p2444-2458. 15p.
Publication Year :
2018

Abstract

In this paper, we propose a novel corpus-based enrichment approach for short text clustering. Since sparseness causes insufficient word co-occurrence and a lack of context information, previous research has used external sources such as Wikipedia or WordNet to enrich the representation of short text documents, which requires extra resources and may introduce inconsistency. Corpus-based approaches, on the other hand, use no external information in mining short text data. By introducing a set of conjugate definitions to characterize the structures of topics and words, and by proposing a virtual generative procedure for short texts, we perform expansion on short text data. Specifically, new words which may not appear in a short text document are added with a virtual term frequency, and this virtual frequency is obtained from the posterior probabilities of the new words given all the words in that document. The complete procedure can be regarded as mapping data points (documents) from the original feature space to a hidden semantic space (topic space). After semantic smoothing is performed, the data points are mapped back to the original space. We conduct experiments on two short text datasets, and the results show that the proposed method can effectively address the sparseness problem. For these datasets, our method, using only a basic clustering algorithm, attains performance comparable to that of methods based on enrichment with external information sources. [ABSTRACT FROM AUTHOR]
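
The expansion step described in the abstract (documents mapped to a topic space, then back to the word space with virtual term frequencies for absent words) can be illustrated with a minimal sketch. This is not the authors' algorithm, whose definitions and formulas are not given in the abstract. It assumes a simple mixture-of-unigrams model in which the topic-word probabilities and topic priors are already known; the function name `expand_documents`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def expand_documents(X, topic_word, topic_prior, weight=1.0):
    """Add virtual term frequencies to words absent from each document.

    Assumed inputs (hypothetical, not taken from the paper):
      X           : (n_docs, n_words) raw term counts
      topic_word  : (n_topics, n_words), each row is P(word | topic)
      topic_prior : (n_topics,), P(topic)
      weight      : scale applied to the virtual frequencies
    """
    X = np.asarray(X, dtype=float)
    topic_word = np.asarray(topic_word, dtype=float)
    topic_prior = np.asarray(topic_prior, dtype=float)

    # Map documents to topic space: log P(z | d) up to an additive constant,
    # i.e. log P(z) + sum over words of tf(w, d) * log P(w | z).
    eps = 1e-12  # guards against log(0)
    log_post = np.log(topic_prior + eps) + X @ np.log(topic_word + eps).T
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)          # rows are P(z | d)

    # Map back to word space: P(w | d) = sum over z of P(z | d) * P(w | z).
    p_word_given_doc = post @ topic_word

    # Virtual frequency: posterior word probability scaled by document length.
    doc_len = X.sum(axis=1, keepdims=True)
    virtual = weight * doc_len * p_word_given_doc

    # Keep observed counts; fill only the words that do not appear.
    return np.where(X > 0, X, virtual)

# Toy example: 2 topics over a 4-word vocabulary, two one-word documents.
topic_word = np.array([[0.45, 0.45, 0.05, 0.05],
                       [0.05, 0.05, 0.45, 0.45]])
topic_prior = np.array([0.5, 0.5])
X = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0]])
expanded = expand_documents(X, topic_word, topic_prior)
```

In the toy example, the first document contains only word 0, so its posterior favors the first topic, and the absent word 1 (which shares that topic) receives a larger virtual frequency than words 2 and 3. The expanded matrix could then be passed to a basic clustering algorithm in place of the sparse original.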

Details

Language :
English
ISSN :
09252312
Volume :
275
Database :
Academic Search Index
Journal :
Neurocomputing
Publication Type :
Academic Journal
Accession number :
126959201
Full Text :
https://doi.org/10.1016/j.neucom.2017.11.019