English and Chinese Bilingual Topic Aspect Classification: Exploring Similarity Measures, Optimal LSA Dimensions, and Centroid Correction of Translated Training Examples.
- Source :
- Proceedings of the Association for Information Science & Technology; 2013, Vol. 50 Issue 1, p1-12, 12p
- Publication Year :
- 2013
Abstract
- This paper explores topic aspect (i.e., subtopic or facet) classification for collections that contain more than one language (in this case, English and Chinese), and investigates several key technical issues that may affect the classification effectiveness. The evaluation model assumes a bilingual user who has found some documents on a topic and identified a few passages in each language on specific aspects of that topic that are of interest. Additional passages are then automatically labeled using a k-Nearest-Neighbor classifier and local (i.e., result set) Latent Semantic Analysis (LSA). Experiments show that when few manually annotated passages are available in either language, a classification system trained using passages from both languages can often achieve higher effectiveness than a similar system trained using passages from just one language. Using this experimental framework, this paper answers three technical research questions: whether the normalized cosine similarity measure is better than the more common unnormalized cosine similarity measure (yes), whether the number of retained LSA dimensions (which was heuristically chosen) is appropriate (yes), and whether partial corrections of the translated training examples in the LSA space can yield an improvement over no correction (no). [ABSTRACT FROM AUTHOR]
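The general technique named in the abstract (projecting passages into a reduced LSA space and labeling new passages by k-Nearest-Neighbor voting over cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy term-by-passage matrix, the labels, the function names (`lsa_project`, `cosine`, `knn_label`), and the choice of two retained dimensions are all assumptions made for illustration, and the plain cosine used here is not claimed to match either the "normalized" or "unnormalized" variant compared in the paper.

```python
import numpy as np
from collections import Counter

def lsa_project(term_doc, k):
    """Project passages (columns of term_doc) into a k-dimensional LSA space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each row of the result is one passage.
    return (np.diag(s[:k]) @ Vt[:k]).T

def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def knn_label(query, train_vecs, train_labels, k=3):
    """Label a query vector by majority vote among its k most similar neighbors."""
    sims = [cosine(query, v) for v in train_vecs]
    top = np.argsort(sims)[::-1][:k]
    votes = Counter(train_labels[i] for i in top)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: rows are terms, columns are passages.
# Passages 0-1 share one vocabulary, passages 2-3 another; passage 4 is unlabeled.
term_doc = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [0, 0, 2, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float)

vecs = lsa_project(term_doc, k=2)      # assumed number of retained dimensions
train_vecs = vecs[:4]
train_labels = ["a", "a", "b", "b"]    # hypothetical aspect labels
predicted = knn_label(vecs[4], train_vecs, train_labels, k=3)
print(predicted)
```

In this sketch the unlabeled passage is included in the matrix before the SVD, which loosely mirrors the "local" (result set) LSA idea of building the space from the documents at hand; how the paper actually constructs its space and handles translated passages is not specified in the abstract.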
Details
- Language :
- English
- ISSN :
- 23739231
- Volume :
- 50
- Issue :
- 1
- Database :
- Complementary Index
- Journal :
- Proceedings of the Association for Information Science & Technology
- Publication Type :
- Conference
- Accession number :
- 115251534
- Full Text :
- https://doi.org/10.1002/meet.14505001039