Cross-Language Text Classification.

Authors :: Olsson, J. Scott
Oard, Douglas W.
Hajič, Jan
Source :: SIGIR Forum; 2005 Proceedings, p645-646, 2p, 1 Chart
Publication Year :: 2005
Abstract :: This article presents a study that used English training data to classify Czech documents, a cross-language text classification task. The dataset is a collection of manually transcribed, spontaneous, conversational speech in English and Czech. The English documents are indexed term by term: each term is first checked against the probabilistic dictionary and, if it is already present, its frequency is incremented. Precision was calculated over the five and ten highest-ranked thesaurus labels, as well as over the five highest-ranked concept labels alone.
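
The indexing step and the precision measure mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout (an English term mapped to Czech translations with probabilities), and the toy data are all hypothetical.

```python
from collections import Counter


def index_terms(tokens, dictionary):
    """Count only the tokens that have an entry in the dictionary.

    `dictionary` is assumed to map an English term to its candidate
    Czech translations and their probabilities (hypothetical layout).
    """
    counts = Counter()
    for term in tokens:
        if term in dictionary:   # term is present in the probabilistic dictionary
            counts[term] += 1    # so its frequency is incremented
    return counts


def precision_at_k(ranked_labels, relevant_labels, k):
    """Fraction of the k highest-ranked labels that are relevant."""
    if k <= 0:
        return 0.0
    top_k = ranked_labels[:k]
    hits = sum(1 for label in top_k if label in relevant_labels)
    return hits / k


# Toy data, invented for illustration only.
toy_dictionary = {
    "school": {"škola": 0.9},
    "family": {"rodina": 0.8},
}
toy_tokens = ["school", "family", "school", "um"]
toy_counts = index_terms(toy_tokens, toy_dictionary)

toy_ranking = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
toy_relevant = {"a", "c", "j"}
p_at_5 = precision_at_k(toy_ranking, toy_relevant, 5)
p_at_10 = precision_at_k(toy_ranking, toy_relevant, 10)
```

Here tokens without a dictionary entry (such as the filler "um") are simply skipped, which is one plausible reading of the abstract; the study may treat unseen terms differently.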