1. Feature vector construction combining structure and content for document classification
- Author
Catherine Roussey, Sylvie Calabretto, Samaneh Chagheri, and Cyril Dumoulin
- Subjects
Structured support vector machine, Computer science, computer.internet_protocol, Computer Science::Information Retrieval, Document classification, Feature vector, Linear classifier, computer.software_genre, Relevance vector machine, ComputingMethodologies_PATTERNRECOGNITION, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Feature (machine learning), Vector space model, Data mining, computer, XML
- Abstract
This paper describes a representation of XML documents for the purpose of classifying them. Document classification depends on document representation techniques: the more relevant the representation, the more relevant the classification. We propose a representation model that exploits both the logical structure and the content of a document. Structure is represented by the tags of the XML document. Our approach is based on the vector space model: a document is represented by a vector of weighted features, where each feature is a (tag: term) pair. We have modified the tf∗idf formula to calculate the weight of a feature according to the structural level of its term in the document. An SVM is used as the learning algorithm. Experiments on the Reuters collection show that our proposal improves classification performance compared with the standard classification model based on term vectors.
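The representation described in the abstract can be illustrated with a minimal sketch. This record does not reproduce the paper's modified tf∗idf formula, so the structural factor used here, `1 / (1 + depth)`, is purely an assumption chosen for illustration, as are the function names `tag_term_counts` and `weighted_vectors`. Only the text directly inside each element is tokenized (tail text is ignored), and the SVM training step is left out.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_term_counts(xml_text):
    """Count (tag, term) features and record the depth at which each occurs."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    depth_of = {}

    def walk(elem, depth):
        for term in (elem.text or "").lower().split():
            feature = (elem.tag, term)
            counts[feature] += 1
            # Keep the shallowest depth at which the feature occurs.
            depth_of[feature] = min(depth, depth_of.get(feature, depth))
        for child in elem:
            walk(child, depth + 1)

    walk(root, 0)
    return counts, depth_of


def weighted_vectors(xml_docs):
    """Return one {(tag, term): weight} dict per XML document."""
    parsed = [tag_term_counts(doc) for doc in xml_docs]
    n_docs = len(parsed)

    # Document frequency of each (tag, term) feature.
    doc_freq = Counter()
    for counts, _ in parsed:
        doc_freq.update(counts.keys())

    vectors = []
    for counts, depth_of in parsed:
        vector = {}
        for feature, tf in counts.items():
            idf = math.log(n_docs / doc_freq[feature])
            # Hypothetical structural factor (not the paper's formula):
            # features found in shallower elements receive more weight.
            level = 1.0 / (1 + depth_of[feature])
            vector[feature] = tf * idf * level
        vectors.append(vector)
    return vectors


docs = [
    "<doc><title>oil prices</title><body><p>oil demand rises</p></body></doc>",
    "<doc><title>grain exports</title><body><p>oil shipments</p></body></doc>",
]
vectors = weighted_vectors(docs)
```

In this sketch the same term yields distinct features under different tags, so `("title", "oil")` and `("p", "oil")` are weighted independently. The resulting sparse vectors could then be passed to any SVM implementation.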
- Published
- 2012