AN APPLICATION OF NEURAL NETWORKS TO SEQUENCE ANALYSIS AND GENRE IDENTIFICATION.

Authors :
BISANT, DAVID
Source :
International Journal of Pattern Recognition & Artificial Intelligence; Mar2005, Vol. 19 Issue 2, p199-215, 17p
Publication Year :
2005

Abstract

This study borrowed a technique from molecular sequence analysis and applied it to genre identification, which is the process of determining the type or family of a given document. Genres include, for example, letters, news stories, horoscopes, jokes, advertisements, and pornographic stories. Genre identification allows a computer user to further filter email and web sites in a way that is fundamentally different from topic-based methods. This study presents original research in an application of machine learning to the genre identification problem. The specific method selected for the application was neural modeling. The data for the study came from a database constructed by the author and his colleagues. The data consisted of descriptive features and the genre classification, as judged by a human, for over 5,000 different documents. Ten different genres were represented. The descriptive features consisted of 89 different measurements of each document, such as average word length, the number of numeric terms, and the proportion of present tense verbs. The data was divided into two sets, with 75% set aside for training and 25% reserved separately for testing. The first neural network applied was a very basic single layer network that achieved 79% correct classifications on the testing data. This performance was equivalent to that of the previous best method on the problem, decision trees. When more complex neural networks were applied to the problem, performance increased significantly. The best performance of 86% correct classifications was achieved by a network with a single hidden layer of 300 units. Increasing the number of hidden layers or changing the number of hidden units did not improve performance. The best score is also a significant improvement over scores obtained from topic-based filters. The neural networks were further used to determine which input features were most influential in the classifications by the networks.
The average magnitude of the weights coming from each feature was computed after training. The analysis indicated that 10% of the features were of no use to the networks. An additional 10% were very influential and were responsible for most of the performance of the networks. The remaining 80% ranged from marginally useful to useful. The analysis of the features indicated that second-order information was being exploited by the networks for better performance. This means that on this problem, neural networks will outperform statistical models or other methods that only utilize first-order information. [ABSTRACT FROM AUTHOR]
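
The feature-influence analysis described in the abstract (the average magnitude of the weights coming from each feature) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's code: the abstract does not say which layer's weights were averaged or how inputs were scaled, so the sketch assumes the input-to-hidden weights of the 89-input, 300-hidden-unit network. The names `feature_influence` and `rank_features` are hypothetical, and the weights are random placeholders rather than trained values.

```python
import random

N_FEATURES = 89   # descriptive features per document (from the abstract)
N_HIDDEN = 300    # hidden units in the best-performing network (from the abstract)


def feature_influence(weights):
    """Return the mean absolute outgoing weight for each input feature.

    weights[i][j] is assumed to be the weight from input feature i
    to hidden unit j (an assumption; the abstract does not specify the layer).
    """
    return [sum(abs(w) for w in row) / len(row) for row in weights]


def rank_features(influence):
    """Return feature indices sorted from most to least influential."""
    return sorted(range(len(influence)), key=lambda i: influence[i], reverse=True)


# Hypothetical stand-in for trained weights: random values, NOT the study's weights.
rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(N_HIDDEN)] for _ in range(N_FEATURES)]

influence = feature_influence(weights)
ranking = rank_features(influence)

# Roughly the top 10% of features by mean absolute weight.
top_decile = ranking[: round(0.10 * N_FEATURES)]
```

With real trained weights, features near the end of `ranking` would be candidates for the group the abstract describes as of no use to the networks. Comparing weight magnitudes across features is only meaningful if the input features are on comparable scales, which is a further assumption here.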

Details

Language :
English
ISSN :
02180014
Volume :
19
Issue :
2
Database :
Complementary Index
Journal :
International Journal of Pattern Recognition & Artificial Intelligence
Publication Type :
Academic Journal
Accession number :
16436048
Full Text :
https://doi.org/10.1142/S021800140500396X