1. SUBSEQUENCE COUNTING AS A MEASURE OF SIMILARITY FOR SEQUENCES.
- Author
-
WANG, HUI
- Subjects
- *
MATHEMATICAL sequences , *MATHEMATICS , *ALGORITHMS , *INFORMATION technology , *PROBABILITY theory - Abstract
The longest common subsequence is a well known and popular method for measuring similarity between sequences. It advocates the use of information contained in the longest common subsequence as an indication of similarity. In this paper we consider the count of all common subsequences as a measure of sequence similarity with the view that all common information is captured. This measure is inspired and derived from a generic similarity measure, neighborhood counting metric. The close connection of the neighborhood counting metric with probability and the Bayes classifier helps gain an insight from the probabilistic perspective into the all-common subsequences measure. We design algorithms to calculate this measure and we also carry out an experiment in the framework of k-nearest neighbors on a gene sequence classification task. The experiment shows that the all-common subsequences measure and the longest common subsequence measure have little difference for small k values, but differ significantly for large k values. The performance of the all-common subsequences measure remains steady as k gets larger, but the performance of the longest common subsequence measure drops sharply as k gets larger. Such a property may be useful for those tasks where we are interested not only in the nearest neighbor, but also in the first k nearest neighbors. The main contribution of this paper is a suite of algorithms for finding all common subsequences. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF