Author: "Cyril Dumoulin" / Topic: computer science - Searchworks@Jio Institute Digital Library Search Results

1. Feature vector construction combining structure and content for document classification

Author: Catherine Roussey, Sylvie Calabretto, Samaneh Chagheri, and Cyril Dumoulin
Subjects: Structured support vector machine, Computer science, computer.internet_protocol, Computer Science::Information Retrieval, Document classification, Feature vector, Linear classifier, computer.software_genre, Relevance vector machine, ComputingMethodologies_PATTERNRECOGNITION, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Feature (machine learning), Vector space model, Data mining, computer, XML
Abstract: This paper describes a representation of XML documents for the purpose of classifying them. Document classification depends on document representation techniques: the more relevant the representation phase is, the more relevant the classification will be. We propose a representation model that exploits both the logical structure and the content of the document. Structure is represented by the tags of the XML document. Our approach is based on the vector space model: a document is represented by a vector of weighted features, where each feature is a (tag: term) pair. We have modified the tf∗idf formula to calculate a feature's weight according to the term's structural level in the document. An SVM is used as the learning algorithm. Experiments on the Reuters collection show that our proposal improves classification performance compared to the standard classification model based on a term vector.
Published: 2012
Full Text: View/download PDF
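
The abstract above does not give the modified tf∗idf formula, so the following Python sketch only illustrates the general idea of (tag: term) features weighted by structural level. The function names, the sample documents, and the structural factor 1 / (1 + depth) are assumptions made for illustration, not the authors' method; the SVM training step is omitted.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_term_counts(xml_text):
    """Count (tag, term) pairs and record the depth of each feature."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    depth_of = {}

    def walk(node, depth):
        # Only the element's own text is used; tail text is ignored for brevity.
        for term in (node.text or "").lower().split():
            feature = (node.tag, term)
            counts[feature] += 1
            # Remember the shallowest depth at which the feature appears.
            depth_of[feature] = min(depth, depth_of.get(feature, depth))
        for child in node:
            walk(child, depth + 1)

    walk(root, 0)
    return counts, depth_of


def weighted_vectors(docs):
    """Return one {(tag, term): weight} dict per XML document."""
    parsed = [tag_term_counts(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts, _ in parsed:
        doc_freq.update(counts.keys())

    vectors = []
    for counts, depth_of in parsed:
        vec = {}
        for feature, tf in counts.items():
            idf = math.log(n_docs / doc_freq[feature])
            # Hypothetical structural factor: shallower tags weigh more.
            level = 1.0 / (1 + depth_of[feature])
            vec[feature] = tf * idf * level
        vectors.append(vec)
    return vectors


# Two made-up documents, used only to exercise the functions.
docs = [
    "<doc><title>wheat prices</title><body><p>wheat exports rise</p></body></doc>",
    "<doc><title>oil prices</title><body><p>oil output falls</p></body></doc>",
]
vectors = weighted_vectors(docs)
```

In this sketch the same term under different tags yields distinct features, so "wheat" in a title and "wheat" in a paragraph are weighted separately. The resulting sparse vectors could then be passed to any linear classifier.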

2. Technical documents classification

Author: Sylvie Calabretto, Catherine Roussey, Cyril Dumoulin, and Samaneh Chagheri
Affiliations: Distribution, Recherche d'Information et Mobilité (DRIM); Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS); Université de Lyon; Université Lumière - Lyon 2 (UL2); École Centrale de Lyon (ECL); Université Claude Bernard Lyon 1 (UCBL); Centre National de la Recherche Scientifique (CNRS); Institut National des Sciences Appliquées de Lyon (INSA Lyon); Technologies et systèmes d'information pour les agrosystèmes (UR TSCF); Institut national de recherche en sciences et technologies pour l'environnement et l'agriculture (IRSTEA); Continew; Centre national du machinisme agricole, du génie rural, des eaux et forêts (CEMAGREF)
Subjects: Document Structure Description, Computer science, Context (language use), Engineering and technology, Technical document, Structural document, computer.software_genre, Classification, Common Source Data Base, Documentation, Information systems, Document structure, Electrical engineering, electronic engineering, information engineering, Support vector machine, Product design specification, Information retrieval, Document classification, Technical documentation, Technical communication, Environmental Sciences, Vector space model, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Artificial intelligence & image processing, computer
Abstract: This research takes place in an industrial context: the CONTINEW company, which ensures the storage and security of critical data and technical documentation. The term technical documentation refers to documents containing product-related data and information that are used and stored for different purposes, such as user manuals and product specifications. These documents are strongly structured, but different authors have used different styles and models to construct them. Managing this increasing volume of documents requires document classification, both to retrieve information quickly and to construct a standard model for each category of documents.
Published: 2011

2 results on '"Cyril Dumoulin"'
