Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge

Authors :: A. Yadrantsau
Maurizio Marchese
M. Krapivin
Yanchun Liang
Source :: ICDIM
Publication Year :: 2008
Publisher :: IEEE, 2008.
Abstract: The domain of Digital Libraries presents specific challenges for unsupervised information extraction to support both the automatic classification of documents and the enhancement of users' navigation in the digital content. In this paper, we propose a combined use of machine learning techniques (i.e. Support Vector Machines) and Natural Language Processing techniques (i.e. Stanford NLP parser) to tackle the problem of unsupervised key-phrases extraction from scientific papers. The proposed method strongly depends on the robust structural properties of a scientific paper as well as on the lexical knowledge that we are able to mine from its text. For the experimental assessment we have used a subset of ACM papers in the Computer Science domain containing 400 documents. Preliminary evaluation of the approach shows promising results: on the same data-set, it improves on the state-of-the-art Bayesian learning system KEA from a minimum of 27% to a maximum of 77%, depending on KEA parameter tuning and the specific evaluation set. Our assessment is performed by comparison with key-phrases assigned by human experts in the specific domain and freely available through the ACM portal.

Subjects :: Parsing
Information retrieval
Computer science
Digital content
Feature extraction
Domain (software engineering)
Support vector machine
Set (abstract data type)
Information extraction
Unsupervised learning
Artificial intelligence
Natural language processing

Database :: OpenAIRE
Journal :: 2008 Third International Conference on Digital Information Management
Accession number :: edsair.doi...........9fef39745a93531dddf4107adac0821f
