English and Chinese Bilingual Topic Aspect Classification: Exploring Similarity Measures, Optimal LSA Dimensions, and Centroid Correction of Translated Training Examples.

Authors :
Wu, Yejun
Oard, Douglas W.
Source :
Proceedings of the Association for Information Science & Technology; 2013, Vol. 50 Issue 1, p1-12, 12p
Publication Year :
2013

Abstract

This paper explores topic aspect (i.e., subtopic or facet) classification for collections that contain more than one language (in this case, English and Chinese), and investigates several key technical issues that may affect the classification effectiveness. The evaluation model assumes a bilingual user who has found some documents on a topic and identified a few passages in each language on specific aspects of that topic that are of interest. Additional passages are then automatically labeled using a k-Nearest-Neighbor classifier and local (i.e., result set) Latent Semantic Analysis (LSA). Experiments show that when few manually annotated passages are available in either language, a classification system trained using passages from both languages can often achieve higher effectiveness than a similar system trained using passages from just one language. Using this experimental framework, this paper answers three technical research questions: whether the normalized cosine similarity measure is better than the more common unnormalized cosine similarity measure (yes), whether the number of retained LSA dimensions (which was heuristically chosen) is appropriate (yes), and whether partial corrections of the translated training examples in the LSA space can yield an improvement over no correction (no). [ABSTRACT FROM AUTHOR]
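
Illustrative sketch (not from the paper): the abstract describes labeling passages with a k-Nearest-Neighbor classifier in a local LSA space. The Python below is a minimal sketch of that general idea, not the authors' implementation. It assumes all passages are already in one language (for example, after translation), uses raw term counts, and applies plain cosine similarity; the normalization variant and the centroid correction studied in the paper are not modeled. The function names, the parameters `dims` and `k`, and the toy passages are all hypothetical.

```python
import numpy as np
from collections import Counter


def term_doc_matrix(passages):
    """Build a raw-count term-by-passage matrix (rows: terms, columns: passages)."""
    vocab = sorted({w for p in passages for w in p.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(passages)))
    for j, p in enumerate(passages):
        for w in p.lower().split():
            X[index[w], j] += 1.0
    return X


def lsa_embed(X, dims):
    """Project passages into a reduced space via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    dims = min(dims, len(s))
    # One row per passage, scaled by the retained singular values.
    return Vt[:dims].T * s[:dims]


def cosine(a, b):
    """Plain cosine similarity; returns 0.0 for a zero vector."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def classify_passages(labeled, unlabeled, dims=2, k=3):
    """Label each unlabeled passage by majority vote of its k nearest labeled passages.

    `labeled` is a list of (text, aspect) pairs; `unlabeled` is a list of texts.
    The SVD is computed over all passages together, which stands in for the
    "local" (result set) LSA space mentioned in the abstract.
    """
    texts = [t for t, _ in labeled] + list(unlabeled)
    labels = [a for _, a in labeled]
    vecs = lsa_embed(term_doc_matrix(texts), dims)
    train, queries = vecs[: len(labeled)], vecs[len(labeled):]
    predictions = []
    for q in queries:
        sims = np.array([cosine(q, v) for v in train])
        nearest = np.argsort(sims)[::-1][:k]
        votes = Counter(labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy data (invented for illustration only).
labeled = [
    ("flood damage homes river", "damage"),
    ("flood damage bridge river", "damage"),
    ("relief aid donations volunteers", "relief"),
    ("relief aid food volunteers", "relief"),
]
unlabeled = ["river flood damage roads", "donations food aid relief"]
print(classify_passages(labeled, unlabeled, dims=2, k=3))
```

In this sketch, `dims` plays the role of the number of retained LSA dimensions and `k` the neighborhood size; both values here are arbitrary and would need tuning on real data.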

Details

Language :
English
ISSN :
2373-9231
Volume :
50
Issue :
1
Database :
Complementary Index
Journal :
Proceedings of the Association for Information Science & Technology
Publication Type :
Conference
Accession number :
115251534
Full Text :
https://doi.org/10.1002/meet.14505001039