Author identification: using text sampling to handle the class imbalance problem

Authors :: Stamatatos, Efstathios
Source :: Information Processing & Management. March, 2008, Vol. 44 Issue 2, p790, 10 p.
Publication Year :: 2008
Abstract: Authorship analysis of electronic texts assists digital forensics and anti-terror investigation. Author identification can be seen as a single-label multi-class text categorization problem. Very often, there are extremely few training texts, at least for some of the candidate authors, or there is significant variation in text length among the available training texts of the candidate authors. Moreover, in this task there is usually no similarity between the distribution of training and test texts over the classes; that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to the size of the class, thus producing a fairer classification model. Hence, minority classes can be segmented into many short samples and majority classes into fewer and longer samples. We explore text sampling methods in order to construct a training set according to a desirable distribution over the classes. Essentially, by text sampling we provide new synthetic data that artificially increase the training size of a class. Based on two text corpora of two languages, namely, newswire stories in English and newspaper reportage in Arabic, we present a series of authorship identification experiments on various multi-class imbalanced cases that reveal the properties of the presented methods.
Keywords: Author identification; Class imbalance; Text categorization
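
The segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `segment_equal_count`, the whitespace tokenization, and the choice to give every author the same number of samples are assumptions made here for illustration. Under that choice, an author with little training text ends up with short samples and an author with much training text ends up with long ones.

```python
from typing import Dict, List


def segment_equal_count(
    texts_by_author: Dict[str, str], samples_per_author: int
) -> Dict[str, List[str]]:
    """Split each author's concatenated training text into equally many samples.

    Hypothetical helper: sample length scales with the amount of text
    available for the author, so every class contributes the same number
    of training samples.
    """
    if samples_per_author < 1:
        raise ValueError("samples_per_author must be at least 1")

    samples: Dict[str, List[str]] = {}
    for author, text in texts_by_author.items():
        words = text.split()  # assumption: simple whitespace tokenization
        if not words:
            samples[author] = []
            continue
        # Never ask for more samples than there are words.
        k = min(samples_per_author, len(words))
        # Integer boundaries that cover every word exactly once.
        bounds = [i * len(words) // k for i in range(k + 1)]
        samples[author] = [
            " ".join(words[bounds[i]:bounds[i + 1]]) for i in range(k)
        ]
    return samples


if __name__ == "__main__":
    corpus = {
        "minority_author": " ".join(f"m{i}" for i in range(10)),
        "majority_author": " ".join(f"M{i}" for i in range(100)),
    }
    for author, chunks in segment_equal_count(corpus, 5).items():
        print(author, len(chunks), [len(c.split()) for c in chunks])
```

The resulting samples could then be passed to any text classifier. The abstract also mentions constructing a training set to match a desired class distribution and generating synthetic data by sampling; neither is covered by this sketch.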

Subjects :: Authorship -- Analysis
Authorship -- Reports
Identification -- Analysis
Indexing -- Reports
Business
Computers and office automation industries

Details

Language :: English
ISSN :: 03064573
Volume :: 44
Issue :: 2
Database :: Gale General OneFile
Journal :: Information Processing & Management
Publication Type :: Periodical
Accession number :: edsgcl.177817038
