ASYMPTOTIC BEHAVIOR OF k-WORD MATCHES BETWEEN TWO UNIFORMLY DISTRIBUTED SEQUENCES.

Authors :
Kantorovitz, M. R.
Booth, H. S.
Burden, C. J.
Wilson, S. R.
Source :
Journal of Applied Probability; Sep 2007, Vol. 44 Issue 3, p788-805, 18p
Publication Year :
2007

Abstract

Given two sequences of length n over a finite alphabet A of size |A| = d, the D₂ statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics for EST sequence database searches. Under the assumption of independent and identically distributed letters in the sequences, Lippert, Huang and Waterman (2002) raised questions about the asymptotic behavior of D₂ when the alphabet is uniformly distributed. They expressed a concern that the commonly assumed normality may create errors in estimating significance. In this paper we answer those questions. Using Stein's method, we show that, for large enough k, the D₂ statistic is approximately normal as n gets large. When k = 1, we prove that, for large enough d, the D₂ statistic is approximately normal as n gets large. We also give a formula for the variance of D₂ in the uniform case. [ABSTRACT FROM AUTHOR]
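
To make the definition in the abstract concrete, the following is a minimal Python sketch of how the D₂ statistic could be computed, assuming overlapping k-letter words and counting every pair of positions (i, j) whose words agree. The function names `kmer_counts`, `d2`, and `expected_d2_uniform` are illustrative and do not come from the paper. The mean used in `expected_d2_uniform` is the elementary consequence of the i.i.d. uniform assumption (each pair of positions matches with probability d^(-k)); the variance formula derived in the paper is not reproduced here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Counts pairs of start positions (i, j) with
    seq_a[i:i+k] == seq_b[j:j+k], which equals the sum over words w
    of count_a(w) * count_b(w).
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(c * counts_b[w] for w, c in counts_a.items())


def expected_d2_uniform(n, k, d):
    """Mean of D2 for two i.i.d. uniform sequences of length n.

    Assumption (illustrative, not quoted from the paper): there are
    (n - k + 1)^2 position pairs, each matching with probability d^(-k).
    """
    return (n - k + 1) ** 2 / d ** k


if __name__ == "__main__":
    print(d2("ACGTACGT", "TACGTTAC", 3))
    print(expected_d2_uniform(n=8, k=3, d=4))
```

Counting words once per sequence and multiplying the counts avoids the quadratic comparison of all position pairs, which matters for sequences of realistic length.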

Details

Language :
English
ISSN :
0021-9002
Volume :
44
Issue :
3
Database :
Complementary Index
Journal :
Journal of Applied Probability
Publication Type :
Academic Journal
Accession number :
27176996
Full Text :
https://doi.org/10.1239/jap/1189717545