Abstract
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, based on the D2 statistic, relies on comparing the k-tuple content of both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call $D_2^S$ and $D_2^*$. For $D_2^S$, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, $D_2^*$, outperforms $D_2^S$ in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of $D_2^*$, we cannot provide a closed form for power calculations.
Key words: alignment-free, normal approximation, normal distribution, sequence alignment, word count statistics
1. Introduction
Comparison of the similarities between two segments of biological sequences using k-tuples (also called k-grams or k-words) arises from the need for rapid sequence comparison. Such methods are often employed in cDNA sequence comparisons. Today, next-generation sequencing methods are producing unprecedented volumes of sequence data, so we expect k-tuple methods to play an increasingly important role in molecular sequence and genome comparisons. This article explores in some detail the statistic on which one of these methods is based, along with substantially superior alternatives.
One of the most widely used statistics for sequence comparison based on k-tuples is the so-called D2 statistic, which is based on the joint k-tuple content in the two sequences. If two sequences are closely related, we would expect the k-tuple content of both sequences to be very similar.
More formally, suppose that two sequences, $A = A_1 A_2 \cdots A_n$ and $B = B_1 B_2 \cdots B_m$, say, are composed of letters that are drawn from a finite alphabet $\mathcal{A}$ of size d. For $a \in \mathcal{A}$, let $p_a$ denote the probability of letter a. For a word $w = w_1 \cdots w_k \in \mathcal{A}^k$, let

$$X_w = \sum_{i=1}^{n-k+1} \mathbf{1}\{A_i = w_1, \ldots, A_{i+k-1} = w_k\}$$

count the number of occurrences of w in A, and similarly, $Y_w$ counts the number of occurrences of w in B. Here, $\bar{n} = n - k + 1$ is the number of positions at which a k-word can start in A; similarly, we put for later use $\bar{m} = m - k + 1$. Then D2 is defined by

$$D_2 = \sum_{w \in \mathcal{A}^k} X_w Y_w.$$
The null model is typically chosen to be such that the letters are independent and identically distributed (i.i.d.), and that the two sequences are independent. Using this model, Lippert et al. (2002) derived a Poisson approximation, a compound Poisson approximation, and a normal approximation for D2; the normal approximation is only valid under the assumption that not all letters of the alphabet are equally likely. In the case that all letters are equally likely, the D2 statistic looks asymptotically like the sum of products of independent normal variables. Lippert et al. (2002) also found that the D2 statistic is dominated by background noise, in the nonuniform case.
In the work of Kantorovitz et al. (2007a), it was shown that in the regime that all letters are equally likely, the standardized statistic

$$D_2^z = \frac{D_2 - \mathbb{E}(D_2)}{\sqrt{\mathrm{Var}(D_2)}}$$

is asymptotically normally distributed when first the sequence length and then the word length tend to infinity, while the alphabet size stays fixed. In clustering some biologically related sequences, Kantorovitz et al. (2007b) found that D2z outperforms D2. The heuristic argument is that the background models for the two sequences may be different, and the D2 statistic should hence be normalized to account for the different background distributions of the sequences. Yet in the nonuniform case the issue of the variability being dominated by the noise in the single sequences remains.
In this article, we propose a new statistic, which is a self-standardized version of D2. In general, Shepp (1964) observed that, if X and Y are independent mean zero normals, X with variance $\sigma^2$ and Y with variance $\tau^2$, then $XY/\sqrt{X^2 + Y^2}$ is again normal, with variance $\sigma^2\tau^2/(\sigma + \tau)^2$. For $w = w_1 \cdots w_k \in \mathcal{A}^k$, $p_w = \prod_{i=1}^k p_{w_i}$ is the probability of occurrence of w, and the centralized count variables are denoted as

$$\tilde{X}_w = X_w - \bar{n}\, p_w, \qquad \tilde{Y}_w = Y_w - \bar{m}\, p_w.$$

We introduce the new count statistic as

$$D_2^S = \sum_{w \in \mathcal{A}^k} \frac{\tilde{X}_w \tilde{Y}_w}{\sqrt{\tilde{X}_w^2 + \tilde{Y}_w^2}}. \qquad (1)$$

Here we set 0/0 = 0. The superscript "S" stands for "Shepp," and also for "self-standardized." We shall see that, under reasonable assumptions, $D_2^S$ will be approximately normally distributed.
In practice we shall usually have to replace $p_a$, the (unobserved) letter probabilities, by $\hat{p}_a$, the relative count of letter a in the concatenation of the two sequences, based on the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution. We then estimate the probability of occurrence of $w = w_1 \cdots w_k$ by $\hat{p}_w = \prod_{i=1}^k \hat{p}_{w_i}$. In our simulations, we always estimate the letter probabilities, even when we assume that all letters are equally likely.
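To make these definitions concrete, here is a minimal Python sketch (ours, not the code released at the URL in Section 1) of the word counts, the estimated letter probabilities, and the statistics D2 and $D_2^S$; the convention 0/0 = 0 is handled explicitly.

```python
# A minimal sketch, not the authors' released code; i.i.d. null model assumed.
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Count the occurrences X_w of every k-tuple w in seq (overlapping)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def letter_probs(a, b):
    """Estimated letter probabilities: relative counts in the concatenation."""
    c = Counter(a + b)
    total = len(a) + len(b)
    return {letter: c[letter] / total for letter in c}

def word_prob(w, p):
    """Probability of word w under the i.i.d. model: product of letter probs."""
    out = 1.0
    for letter in w:
        out *= p[letter]
    return out

def d2(a, b, k):
    """The classical D2: sum over words of X_w * Y_w."""
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    return sum(x[w] * y[w] for w in x if w in y)

def d2_s(a, b, k):
    """Self-standardized D2^S of Equation (1), with 0/0 set to 0."""
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    p = letter_probs(a, b)
    nbar, mbar = len(a) - k + 1, len(b) - k + 1
    total = 0.0
    for w in map("".join, product(sorted(p), repeat=k)):
        pw = word_prob(w, p)
        xt = x.get(w, 0) - nbar * pw  # centered count in A
        yt = y.get(w, 0) - mbar * pw  # centered count in B
        denom = (xt * xt + yt * yt) ** 0.5
        if denom > 0.0:
            total += xt * yt / denom
    return total
```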
We also study the following version of the word count statistic:

$$D_2^* = \sum_{w \in \mathcal{A}^k} \frac{\tilde{X}_w \tilde{Y}_w}{\sqrt{\bar{n} p_w\, \bar{m} p_w}}, \qquad (2)$$

which in our simulations outperforms not only D2 but also $D_2^S$, in terms of power for detecting the relatedness between the two sequences. This statistic comes about by considering the standardized counts $\tilde{X}_w / \sqrt{\mathrm{Var}(X_w)}$, but as the variance is costly to compute, it is replaced by the estimated mean of the word occurrence across the two sequences when the probability of the word pattern is small; this approach can be justified by considering a Poisson approximation for the individual word counts. We justify in Section 2 that $D_2^*$ can be viewed as the sum of the products of independent normal variables, and we suggest how to simulate from its asymptotic distribution, for which we do not have a closed-form expression.
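A companion sketch of $D_2^*$ as in Equation (2), reusing `kmer_counts`, `letter_probs`, and `word_prob` from the snippet above; the denominator uses the Poisson approximation $\mathrm{Var}(X_w) \approx \bar{n} p_w$ discussed in Section 2.2.

```python
from itertools import product

def d2_star(a, b, k):
    """D2^* of Equation (2): product of Poisson-standardized centered counts."""
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    p = letter_probs(a, b)
    nbar, mbar = len(a) - k + 1, len(b) - k + 1
    total = 0.0
    for w in map("".join, product(sorted(p), repeat=k)):
        pw = word_prob(w, p)
        xt = x.get(w, 0) - nbar * pw
        yt = y.get(w, 0) - mbar * pw
        total += xt * yt / ((nbar * pw * mbar * pw) ** 0.5)
    return total
```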
To explain the problem with D2, we decompose

$$D_2 = \sum_{w} \tilde{X}_w \tilde{Y}_w + \bar{m} \sum_{w} p_w \tilde{X}_w + \bar{n} \sum_{w} p_w \tilde{Y}_w + \bar{n}\bar{m} \sum_{w} p_w^2. \qquad (3)$$

Approximately, if n and m are large, under the null model, $\sum_w p_w \tilde{X}_w$ should follow a mean zero normal distribution with variance of order n with respect to sequence length. The distribution of $\sum_w \tilde{X}_w \tilde{Y}_w$ should be approximately, for large n and m, the distribution of the sum of products of pairs of independent mean zero normal variables, with a variance of order O(nm).

If all letters are equally likely, then $p_w = d^{-k}$ for all words w, and hence,

$$\sum_w p_w \tilde{X}_w = d^{-k} \sum_w \tilde{X}_w = 0,$$

giving that $D_2 = \sum_w \tilde{X}_w \tilde{Y}_w + \bar{n}\bar{m}\, d^{-k}$. The variability in D2 is the same as the variability in $\sum_w \tilde{X}_w \tilde{Y}_w$, and indeed the same as in the centered statistic $D_2 - \mathbb{E} D_2$.

When not all letters are equally likely, as the variance of $X_w$ is of order n and the variance of $Y_w$ is of order m, the variance of $\bar{n} \sum_w p_w \tilde{Y}_w$ is of order $O(n^2 m)$, and similarly the variance of $\bar{m} \sum_w p_w \tilde{X}_w$ is of order $O(n m^2)$. Hence, the variability in D2 is dominated by the variability in $\bar{m} \sum_w p_w \tilde{X}_w$ and $\bar{n} \sum_w p_w \tilde{Y}_w$. Thus, in this case, the variability in D2 is dominated by the terms that reflect the noise in the single sequences only.
The asymptotic normality of D2 for both nucleic acid and amino acid sequences has been studied empirically by Forêt et al. (2009). However, no power study was undertaken; our argument shows that in the nonuniform case the asymptotic normality of D2 only stems from the asymptotic normality of the underlying word counts in the respective sequences.
Even in the regime that all letters are equally likely, if we only leave the "last" word $w^{(d^k)}$ (in some fixed ordering of $\mathcal{A}^k$) out, forming the statistic $D_2' = \sum_{w \ne w^{(d^k)}} X_w Y_w$, then

$$\sum_{w \ne w^{(d^k)}} p_w \tilde{X}_w = -d^{-k}\, \tilde{X}_{w^{(d^k)}},$$

which is not constant. So even if we just leave one word out of the whole set of possible words, if the sequence lengths are large and all other parameters are fixed, then the variability, now in $D_2'$, will be dominated by the variability in the single sequences. Hence, the D2 statistic is, in general, not useful for assessing whether the two underlying sequences are related.
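The domination by single-sequence noise is easy to see numerically. The following sketch (our illustration, with an arbitrary nonuniform letter distribution) compares the empirical variances of the cross term and the single-sequence terms of Equation (3), reusing `kmer_counts` and `word_prob` from the earlier snippet.

```python
# Our illustration; the letter distribution (0.2, 0.3, 0.3, 0.2) is arbitrary.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
letters, probs, k, n, reps = list("acgt"), [0.2, 0.3, 0.3, 0.2], 2, 1000, 300
p = dict(zip(letters, probs))
cross, single = [], []
for _ in range(reps):
    a = "".join(rng.choice(letters, size=n, p=probs))
    b = "".join(rng.choice(letters, size=n, p=probs))
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    nbar = n - k + 1
    c = s = 0.0
    for w in map("".join, product(letters, repeat=k)):
        pw = word_prob(w, p)
        xt, yt = x.get(w, 0) - nbar * pw, y.get(w, 0) - nbar * pw
        c += xt * yt                    # cross term of Equation (3)
        s += nbar * pw * (xt + yt)      # the two single-sequence terms
    cross.append(c)
    single.append(s)
# The single-sequence terms carry far more variance than the cross term.
print(np.var(cross), np.var(single))
```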
This article is structured as follows. In Section 2, we discuss the distributions of $D_2^S$ and $D_2^*$ under the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution, and we shall present simulation results for testing the normality of D2, $D_2^*$, and $D_2^S$.

In Section 3, we study the power of the statistics D2, $D_2^S$, $D_2^*$, and D2z under two alternative scenarios. The first scenario is that the two sequences contain a common motif, whereas the second scenario is a pattern transfer model; we pick a word in the first sequence and use it to replace a word in the second sequence. Our results illustrate not only the poor performance of D2, but also the encouraging performances of $D_2^S$ and of $D_2^*$.

Section 4 illustrates how the asymptotic normality of $D_2^S$ gives a fast method for assessing statistical significance, as only its standard deviation has to be approximated and not the empirical distribution itself.
There is a caveat: if the distribution on the alphabet is very close to uniform, and if some other conditions are satisfied which relate to having a large number of summands of products of pairs of independent normals, then D2 will behave like a sum of products of normally distributed variables, similar to the uniform case; as the deviation from uniformity increases, the asymptotic normality of D2 sets in. This phase transition is explored in Section 5.
We summarize our results in Section 6, and we briefly indicate generalizations to Markov chain models as well as to multiple sequence comparisons.
The proofs for Section 2 are presented in the Supplementary Material (see online supplementary material at www.liebertonline.com). The code for simulating from the distributions is available at www-rcf.usc.edu/∼fsun/Programs/D2/d2-all.html
2. The Distributions of $D_2^S$ and $D_2^*$ Under the Null Model
Here the null model is that the letters are i.i.d. and the two sequences are independent. We assume, as in Huang (2002), conditions on the word length k relative to the sequence length under which the word counts admit a multivariate normal approximation. For D2, Forêt et al. (2009) studied the empirical distribution via simulations, and they found that a gamma distribution outperforms the normal distribution in general. For longer sequences they showed that the normal approximation itself would be appropriate.
2.1. $D_2^S$ and asymptotic normality
First, we focus on the word counts in a single sequence. Let

$$\tilde{\mathbf{X}} = \big(\tilde{X}_w\big)_{w \in \mathcal{A}^k,\; w \ne w^{(d^k)}}$$

be the vector of centered word counts with the last word $w^{(d^k)}$ left out; note that

$$\tilde{X}_{w^{(d^k)}} = -\sum_{w \ne w^{(d^k)}} \tilde{X}_w \qquad (4)$$

can be recovered from the set $\{\tilde{X}_w : w \ne w^{(d^k)}\}$. Huang (2002) showed a multivariate normal approximation for the word count vector in a single sequence. The limiting covariance matrix C needs some notation; see section 12.1 in Waterman (1995), with results derived by Lundstrom (1990). For $v = v_1 \cdots v_k \in \mathcal{A}^k$, we define, for $1 \le j \le k - 1$,

$$p_v(j) = \prod_{i=k-j+1}^{k} p_{v_i},$$

which is the probability that v occurs, given that a word overlapping it in its first k − j letters has occurred. For words $u = u_1 \cdots u_k$ and $v = v_1 \cdots v_k$, the overlap indicator is defined as

$$\beta_j(u, v) = \mathbf{1}\{u_{j+1} \cdots u_k = v_1 \cdots v_{k-j}\}.$$

This overlap indicator equals 1 if the last k − j letters of u overlap exactly the first k − j letters of v. Then the approximating covariance matrix C is given by

$$C_{uv} = p_u \big(\mathbf{1}\{u = v\} - (2k-1)\, p_v\big) + p_u \sum_{j=1}^{k-1} \beta_j(u, v)\, p_v(j) + p_v \sum_{j=1}^{k-1} \beta_j(v, u)\, p_u(j). \qquad (5)$$

A similar normal approximation is valid for $\tilde{\mathbf{Y}}$. As we assume that A and B have the same letter probability distribution, both have the same limiting covariance matrix C.
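A Python sketch (assuming the form of Equation (5) as reconstructed above) of the covariance matrix; dropping the row and column of the last word yields the $(d^k - 1)$-dimensional matrix used in Theorem 2.1. It reuses `word_prob` from the Section 1 snippet.

```python
# Builds the limiting covariance matrix C of Equation (5) for one sequence.
import numpy as np
from itertools import product

def overlap(u, v, j):
    """beta_j(u, v): last k-j letters of u equal the first k-j letters of v."""
    return u[j:] == v[: len(v) - j]

def cov_matrix(k, p):
    """C indexed by all d**k words; drop the last row/column for Theorem 2.1."""
    words = ["".join(w) for w in product(sorted(p), repeat=k)]
    pw = {w: word_prob(w, p) for w in words}
    # p_v(j): probability of the last j letters of v.
    tail = {w: {j: word_prob(w[k - j:], p) for j in range(1, k)} for w in words}
    C = np.empty((len(words), len(words)))
    for i, u in enumerate(words):
        for l, v in enumerate(words):
            c = pw[u] * (float(u == v) - (2 * k - 1) * pw[v])
            for j in range(1, k):
                if overlap(u, v, j):
                    c += pw[u] * tail[v][j]
                if overlap(v, u, j):
                    c += pw[v] * tail[u][j]
            C[i, l] = c
    return C, words
```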
Thus, we obtain the following approximation for $D_2^S$ (for the proof and the precise bounds, see online Supplementary Material at www.liebertonline.com). We use the abbreviation MVN(μ, C) to denote a multivariate normal distribution with mean vector μ and covariance matrix C. Also, a function $h : \mathbb{R} \to \mathbb{R}$ is called Lipschitz, with Lipschitz constant 1, if for all real x and y, |h(x) − h(y)| ≤ |x − y|.
Theorem 2.1
Assume m ≥ n and that m/n → c for some constant c ≥ 1 as n → ∞. Let $\tilde{\mathbf{Z}}^{(1)} \sim \mathrm{MVN}(0, C)$ and $\tilde{\mathbf{Z}}^{(2)} \sim \mathrm{MVN}(0, C)$ be two independent $(d^k - 1)$-dimensional normal vectors. In analogy to Equation (4), put, for i = 1, 2,

$$\tilde{Z}^{(i)}_{w^{(d^k)}} = -\sum_{w \ne w^{(d^k)}} \tilde{Z}^{(i)}_w.$$

Let

$$D_{\lim} = \sum_{w \in \mathcal{A}^k} \frac{\sqrt{c}\; \tilde{Z}^{(1)}_w \tilde{Z}^{(2)}_w}{\sqrt{\big(\tilde{Z}^{(1)}_w\big)^2 + c\,\big(\tilde{Z}^{(2)}_w\big)^2}}.$$

Then, $D_{\lim}$ is mean zero normally distributed, and, for any function h which is bounded and Lipschitz with Lipschitz constant 1, as n → ∞ with d = d(n) and k = k(d, n),

$$\left| \mathbb{E}\, h\!\left(\frac{D_2^S}{\sqrt{\bar{n}}}\right) - \mathbb{E}\, h\big(D_{\lim}\big) \right| \longrightarrow 0. \qquad \blacksquare$$
The bound in Theorem 2.1 may not be optimal; indeed, it is based on a multivariate normal approximation for word counts, Corollary 6.1 (see online Supplementary Material at www.liebertonline.com), whose rate deteriorates as the number $d^k$ of possible words grows relative to the sequence length. The purpose of the bound is to illustrate the trade-off between alphabet size, word length, and sequence length. If d, the alphabet size, is very large, then even moderately long words will be rare unless the sequence is very long.

Because of the complicated dependence, we were not able to give a closed-form expression for the variance of $D_{\lim}$. Theorem 2.1, however, justifies using a z-test for the null model, based on the statistic $D_2^S$, using the estimated standard deviation.
2.2. $D_2^*$ and the product of independent normals
The statistic $D_2^*$ given in Equation (2) is motivated by estimating the standardized counts

$$\frac{\tilde{X}_w}{\sqrt{\mathrm{Var}(X_w)}}, \qquad \frac{\tilde{Y}_w}{\sqrt{\mathrm{Var}(Y_w)}},$$

approximating $\mathrm{Var}(X_w)$ by the mean $\bar{n} p_w$ (and $\mathrm{Var}(Y_w)$ by $\bar{m} p_w$), with the argument that 1 − pw will be close to 1 when k is reasonably large and the word w is relatively rare. From Corollary 6.1 (see online Supplementary Material at www.liebertonline.com), we obtain a multivariate normal approximation for the standardized count vectors $\big(\tilde{X}_w/\sqrt{\bar{n} p_w}\big)_w$ and for $\big(\tilde{Y}_w/\sqrt{\bar{m} p_w}\big)_w$. Although the covariances within the vectors will not disappear, for each w, $\tilde{X}_w$ and $\tilde{Y}_w$ are independent and would be approximated by independent univariate standard normal variables. From Stuart and Ord (1987), we know the distribution of the product of two independent standard normal variables [see also Springer and Thompson (1966)].
Lemma 2.1
Let X and Y be two independent standard normal random variables. Then the product W = XY has probability density

$$f(w) = \frac{1}{\pi} K_0(|w|), \qquad (6)$$

where $K_0(x) = \int_0^\infty e^{-x \cosh t}\, dt$ denotes the modified Bessel function of the third kind.
Thus, the distribution of each summand $\tilde{X}_w \tilde{Y}_w / \sqrt{\bar{n} p_w\, \bar{m} p_w}$ will approximately have density (6). The covariance structure will result in an approximation with a complicated distribution, which can be easily assessed by simulating many normal vectors with covariance matrix C given in Equation (5), standardizing, and taking products.
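A quick numerical check (ours) of the density in Equation (6), using scipy's `k0`; note that scipy labels $K_0$ the modified Bessel function of the second kind, which is the same function called "third kind" in the text.

```python
# Checks P(|XY| < 1) for X, Y independent standard normals against the
# density f(w) = K_0(|w|) / pi of Equation (6).
import numpy as np
from scipy.integrate import quad
from scipy.special import k0

theory = 2 * quad(lambda w: k0(w) / np.pi, 0.0, 1.0)[0]  # by symmetry
rng = np.random.default_rng(1)
sample = rng.standard_normal(10**6) * rng.standard_normal(10**6)
print(round(theory, 4), round(float(np.mean(np.abs(sample) < 1.0)), 4))
```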
2.3. The case that all letters are equally likely
In the case that all letters are equally likely, both Lippert et al. (2002) and Kantorovitz (2007) observed that D2 will not follow a normal distribution. Kantorovitz et al. (2007a) showed that
is asymptotically normal, however, when first sequence length and then word length tend to infinity, while the alphabet size stays fixed. However, when the word length is fixed, D2, D2z, and
may not tend to normal as sequence length tends to infinity. Note that in the case that all letters are equally likely, all of D2, D2z, and
agree up to constants.
2.4. Simulations
To illustrate the quality of the normal approximation, we generate a pair of independent random sequences of length n under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet {a, c, g, t}. We consider two types of distributions on the letters: the uniform distribution (1/4, 1/4, 1/4, 1/4) and a gc-rich, nonuniform distribution with $p_c = p_g > 1/4 > p_a = p_t$; the latter distribution is the same as that used in Lippert et al. (2002) and Forêt et al. (2006) to study D2. Similar to Lippert et al. (2002) and Forêt et al. (2006), for each $n = 2^j \times 10^2$, where $j = 0, 1, \ldots, 8$, and for each $k = 1, \ldots, 10$, we compute the scores for each pair of sequences for the various statistics, where k is the word size for the count statistics. Forêt et al. (2006) found k = 7 to be the optimal tuple length for n = 800, 1600, and 3200; optimal in the sense that for this choice of k, the statistic D2 will be closest to normal. All results are based on a sample size of 10,000; we use the same simulated sequences for all three scores, D2, $D_2^S$, and $D_2^*$. As D2z differs from D2 only by an additive and a multiplicative constant, we do not include D2z in these simulations.
We then use the Lilliefors test (Lilliefors, 1967) to assess whether the distributions are close to normal. The Lilliefors test is a modification of the Kolmogorov–Smirnov goodness-of-fit test, which is implemented using the sample mean and standard deviation as the mean and standard deviation of the theoretical limiting normal distribution. In contrast to the Kolmogorov–Smirnov test, statistical significance is based on the Lilliefors distribution; see also Forêt et al. (2009) for a discussion why not to use an unmodified Kolmogorov–Smirnov test when the standard deviation is estimated. A p-value of less than 0.05 indicates that we would reject the null model at the 5% significance level. Under the null model, in 100 tests we would expect about five tests resulting in a p-value of less than 5%. Precision is up to four decimal places. For easier readability, a value of 0.0000 is recorded simply as 0.
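The Lilliefors test is available, for example, in statsmodels; the following self-contained sketch (ours) shows the Monte Carlo construction: the KS statistic is computed against a normal with estimated mean and standard deviation, and its null distribution is calibrated by simulation rather than by the standard KS tables.

```python
# A Monte Carlo version (our sketch) of the Lilliefors normality test.
import numpy as np
from scipy.stats import norm

def lilliefors_pvalue(scores, n_mc=2000, seed=2):
    rng = np.random.default_rng(seed)

    def ks_stat(x):
        x = np.sort(np.asarray(x, dtype=float))
        cdf = norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
        hi = np.arange(1, len(x) + 1) / len(x)
        lo = np.arange(0, len(x)) / len(x)
        return max(np.max(hi - cdf), np.max(cdf - lo))

    observed = ks_stat(scores)
    # Location-scale invariance: the null distribution of the statistic does
    # not depend on the unknown mean and variance, so simulate N(0, 1) data.
    null = np.array([ks_stat(rng.standard_normal(len(scores)))
                     for _ in range(n_mc)])
    return float(np.mean(null >= observed))
```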
We will first discuss the nonuniform case, where asymptotic normality has been shown to hold for all three statistics D2, D2z, and $D_2^*$ when first the sequence length and then the word length tend to infinity. For short words, only $D_2^S$ has been shown to be approximately normal when the sequence length tends to infinity. The regime of interest here is that words are not too rare; for long words, with k so large that the expected word counts $\bar{n} p_w$ stay bounded, a compound Poisson approximation is more appropriate (Lippert et al., 2002).
2.4.1. The nonuniform case
In the nonuniform case, for all three statistics, the larger the sequence length and the smaller k, the closer the distribution is to normality; the performance is rather different though. Recall that the sequence length is 2j × 100; for easier readability, we denote the 2j × 100 column in the table just by the value for j.
Table 1 summarizes the p-values for the Lilliefors tests in the nonuniform case for D2, $D_2^*$, and $D_2^S$. For D2, Table 1 shows that even for k = 1 we would reject the hypothesis of normality at the 5% level as long as the sequence length is not at least 3200 bp (j = 5). For k ≥ 2, the required sequence length would be around 25,600 bp (j = 8).
Table 1.
Lilliefors Tests in the Nonuniform Case
p-values for D2:

| k | j = 0 | j = 1 | j = 2 | j = 3 | j = 4 | j = 5 | j = 6 | j = 7 | j = 8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0.0024 | 0.0938 | 0.1109 | 0.0807 | 0.3957 |
| 2 | 0 | 0 | 0 | 0 | 0.0023 | 0.0069 | 0.0106 | 0.0102 | 0.6252 |
| 3 | 0 | 0 | 0 | 0 | 0.0001 | 0.0010 | 0.0001 | 0.0131 | 0.2650 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0002 | 0.0182 | 0.1027 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0001 | 0.0063 | 0.0604 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0001 | 0.0001 | 0.1181 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0916 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0523 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0059 | 0.0221 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0002 | 0.2982 |

p-values for $D_2^*$:

| k | j = 0 | j = 1 | j = 2 | j = 3 | j = 4 | j = 5 | j = 6 | j = 7 | j = 8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0.0034 | 0.0593 | 0.0030 | 0.2612 | 0.0032 | 0.0982 | 0.0241 | 0.0210 |
| 4 | 0 | 0 | 0.0068 | 0.1399 | 0.0269 | 0.2058 | 0.1181 | 0.1160 | 0.4827 |
| 5 | 0 | 0 | 0 | 0 | 0.0219 | 0.7885 | 0.4417 | 0.4252 | 0.9629 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0.0657 | 0.0829 | 0.4727 | 0.7407 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0.0048 | 0.0049 | 0.5533 | 0.2175 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0065 | 0.2870 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0088 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0002 |

p-values for $D_2^S$:

| k | j = 0 | j = 1 | j = 2 | j = 3 | j = 4 | j = 5 | j = 6 | j = 7 | j = 8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3728 | 0.5997 | 0.5025 | 0.4173 | 0.2135 | 0.7626 | 0.6838 | 0.2939 | 0.3014 |
| 2 | 0.0200 | 0.5862 | 0.2341 | 0.6421 | 0.4381 | 0.4952 | 0.0056 | 0.0859 | 0.9934 |
| 3 | 0.0850 | 0.6381 | 0.4737 | 0.0885 | 0.2534 | 0.4759 | 0.4301 | 0.3755 | 0.7175 |
| 4 | 0.0122 | 0.6068 | 0.1088 | 0.2496 | 0.2317 | 0.8684 | 0.3738 | 0.3374 | 0.5795 |
| 5 | 0 | 0.1302 | 0.0589 | 0.4513 | 0.6518 | 0.0257 | 0.2077 | 0.0963 | 0.5495 |
| 6 | 0 | 0.0002 | 0.0443 | 0.1475 | 0.6168 | 0.7280 | 0.2860 | 0.2407 | 0.7960 |
| 7 | 0 | 0 | 0.0319 | 0.0003 | 0.0278 | 0.3482 | 0.6177 | 0.7590 | 0.6117 |
| 8 | 0 | 0 | 0 | 0 | 0.8755 | 0.0997 | 0.1405 | 0.9000 | 0.1395 |
| 9 | 0 | 0 | 0 | 0 | 0.0069 | 0.0026 | 0.5661 | 0.9682 | 0.1670 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0017 | 0.5321 | 0.5256 |
Table 1 also shows that the statistic $D_2^*$ would reject the hypothesis of normality not only for large k, but also for small k with large sequence length. This nonmonotonic behavior of $D_2^*$ indicates that, to declare statistical significance, the statistic should not be compared with a normal distribution.

In contrast, Table 1 displays that $D_2^S$ is reasonably close to normal even for a sequence of length 200 bp when k ≤ 4; for k = 8 a sequence of length 1600 bp would already look reasonably normal. Moreover, the statistic stays close to normal with increasing sequence length and with increasing word length, and it thus displays the monotonicity which makes it safe to apply.
We repeated the simulations using the Kolmogorov–Smirnov test with the known mean and variances, based on Kantorovitz et al. (2007b), instead of the Lilliefors test, for D2. Although the Kolmogorov–Smirnov test gave slightly larger p-values, thus indicating a slightly better fit to a normal distribution, the qualitative behavior remained (data not shown).
2.4.2. The uniform case
In the uniform case, our theoretical results predict that the limiting distribution of D2 would only look normal when the sequence length is large and k is large also, or at least in a moderate range. In contrast, $D_2^S$ would still be asymptotically normal even for small k when the sequence is large.

Table 2 confirms this predicted behavior. Note that from Equation (3) we can see that, in the uniform case, both $D_2^*$ and D2z are the same as D2 up to a multiplicative constant and an additive constant. Table 2 shows that the statistics D2, $D_2^*$, and D2z do not monotonically approach the normal distribution. In contrast, we find that $D_2^S$ is close to normal even for sequences of length 100 bp when k ≤ 3, and it gets closer to the normal distribution with both increasing sequence length and decreasing k.
Table 2.
Lilliefors Test in the Uniform Case
p-values for D2:

| k | j = 0 | j = 1 | j = 2 | j = 3 | j = 4 | j = 5 | j = 6 | j = 7 | j = 8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0.0033 | 0.0189 | 0.2583 | 0.0274 | 0.0456 | 0.0395 | 0.0236 |
| 4 | 0 | 0 | 0 | 0.0329 | 0.4335 | 0.0667 | 0.6742 | 0.3441 | 0.5398 |
| 5 | 0 | 0 | 0 | 0 | 0.0204 | 0.2291 | 0.0991 | 0.0023 | 0.0789 |
| 6 | 0 | 0 | 0 | 0 | 0.0002 | 0.1659 | 0.0176 | 0.1619 | 0.2122 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0.0037 | 0.0653 | 0.4166 | 0.3541 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0189 | 0.0177 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0054 | 0.0924 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0007 |

p-values for $D_2^S$:

| k | j = 0 | j = 1 | j = 2 | j = 3 | j = 4 | j = 5 | j = 6 | j = 7 | j = 8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0041 | 0.0230 | 0.0695 | 0.5939 | 0.0740 | 0.4645 | 0.3077 | 0.3840 | 0.6881 |
| 2 | 0.8228 | 0.8110 | 0.4015 | 0.2622 | 0.9471 | 0.4388 | 0.1174 | 0.9353 | 0.3598 |
| 3 | 0.5814 | 0.0457 | 0.7268 | 0.6882 | 0.9166 | 0.1910 | 0.6372 | 0.0612 | 0.9833 |
| 4 | 0.0012 | 0.1845 | 0.7225 | 0.4598 | 0.8781 | 0.1207 | 0.5530 | 0.1731 | 0.4103 |
| 5 | 0 | 0 | 0.1518 | 0.8733 | 0.4540 | 0.4149 | 0.3344 | 0.0865 | 0.1448 |
| 6 | 0 | 0 | 0 | 0.0986 | 0.6773 | 0.1933 | 0.5861 | 0.3467 | 0.5809 |
| 7 | 0 | 0 | 0 | 0 | 0.0144 | 0.1410 | 0.2999 | 0.8761 | 0.3339 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0.0002 | 0.0431 | 0.0862 | 0.0880 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0086 | 0.6396 | 0.4303 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0004 | 0.2524 |
3. Power Studies
In Section 2, we studied the distributions of D2, $D_2^S$, and $D_2^*$ under the null model that the two sequences are i.i.d. having the same distribution. In this section, we will study the power of detecting the relationships between the two sequences under two alternative models for their relationships.
Note that, as a result of estimating the letter probabilities from the concatenation of the two sequences, for k = 1 the centered counts satisfy $\tilde{Y}_w = -\tilde{X}_w$, so the term $\tilde{X}_w + \tilde{Y}_w$ vanishes and the statistics $D_2^S$ and $D_2^*$ degenerate. So we chose k ≥ 2 for a fair comparison of our statistics.
First, we generate a pair of independent random sequences of length n under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet {a, c, g, t}. We consider the same two types of distributions on the letters as earlier: the uniform distribution (1/4, 1/4, 1/4, 1/4) and the gc-rich, nonuniform distribution. For each $n = 2^j \times 10^2$, where $j = 0, 1, \ldots, 8$, and for each $k = 2, \ldots, 10$, we compute the scores for each pair of sequences for the various statistics, where k is the word size for the count statistics. All results are based on a sample size of 10,000.
The first alternative model renders the two sequences dependent through a common motif w which is randomly distributed across the two sequences. The second alternative model is inspired by horizontal gene transfer. We randomly choose a certain number of fragments in the first sequence and then replace the corresponding fragments (position-wise) in the second sequence by the letters in the first sequence. Again, as a consequence, the two sequences would no longer be independent. In more detail, the two models are chosen as follows (a code sketch of both models is given after the list):

- The "common motif" model: A motif of length L = 5 is chosen, say w = agcca. Next, Bernoulli random variables $Z_1, Z_2, \ldots$, with $P(Z_i = 1) = g$, are generated; whenever $Z_i = 1$, we insert the word w in place of the letters at positions $i, \ldots, i + L - 1$ in sequence 1. We avoid overlap by moving on to $Z_{i+L}$ whenever $Z_i = 1$. We repeat the process for sequence 2. The scores of the various statistics are then computed using the "newly" generated pair of sequences.
- The "pattern transfer" model: We first choose L = 5 as the length of the segment to be "transferred" from sequence 1 to sequence 2. Again, Bernoulli random variables $Z_1, Z_2, \ldots$, with $P(Z_i = 1) = g$, are generated. When $Z_i = 1$, we pick the L-word at positions $i, \ldots, i + L - 1$ in sequence 1 and replace the corresponding L-word in sequence 2 with it. Again, we disallow overlaps. For this model, we compute the scores of all the statistics using sequence 1 and the "new" sequence 2.
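A sketch (our reading of the description above) of the two alternative models; for the common motif model, the insertion step is applied independently to each of the two sequences.

```python
# Our reading of the two alternative models; default parameters match the text.
import numpy as np

def common_motif(seq, motif="agcca", g=0.01, rng=np.random.default_rng(3)):
    """Insert `motif` at position i whenever Z_i = 1; overlaps are skipped."""
    s, L, i = list(seq), len(motif), 0
    while i <= len(seq) - L:
        if rng.random() < g:          # Z_i = 1
            s[i:i + L] = motif
            i += L                    # move on to Z_{i+L}: no overlaps
        else:
            i += 1
    return "".join(s)

def pattern_transfer(seq1, seq2, L=5, g=0.05, rng=np.random.default_rng(3)):
    """Copy the L-word at i in sequence 1 over the same positions in seq 2."""
    s2, i = list(seq2), 0
    while i <= min(len(seq1), len(seq2)) - L:
        if rng.random() < g:          # Z_i = 1
            s2[i:i + L] = seq1[i:i + L]
            i += L
        else:
            i += 1
    return "".join(s2)
```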
The procedure described above is repeated 10,000 times, and the statistics are calculated to yield the empirical distributions of the various statistics for each triplet (k, n, g). As values for the Bernoulli parameter we chose g = 0.001, 0.005, 0.01, 0.05, and 0.1.

For each statistic, we set a type I level of α = 0.05. Using the empirical distribution of the statistic S under the null model, we find s so that P(S ≥ s) = α. For a given g value, the power of the statistic is then estimated by the proportion of times the score under the alternative model exceeds s.
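In code, this power estimate is a one-liner (our sketch):

```python
import numpy as np

def empirical_power(null_scores, alt_scores, alpha=0.05):
    s = np.quantile(null_scores, 1.0 - alpha)   # P(S >= s) = alpha under null
    return float(np.mean(np.asarray(alt_scores) >= s))
```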
We now consider the power curves of D2, $D_2^S$, and $D_2^*$ for both models, as well as a comparison between these statistics. For alternative model 1, Figure 1 shows that for k = 2, the power of D2 is even smaller than 0.05, the type I error. Further, k = 6 has the best power.
FIG. 1.
Alternative Model 1: Power curves for D2 under the gc-rich distribution; g = 0.01. For k = 2, the power of D2 is smaller than 0.05, the type I error (indicated by the horizontal dashed line).
Figure 2 shows that k = 4 has the greatest power for $D_2^S$ under the first alternative model. For $D_2^*$, Figure 3 shows that k = 5 has the greatest power under the first alternative model, which corresponds to the length of the "common" motif which we assume relates the two sequences.
FIG. 2.
Alternative Model 1: Power curves for $D_2^S$ under the gc-rich distribution; g = 0.01. Note: k = 4 has the greatest power.
FIG. 3.
Alternative Model 1: Power curves for $D_2^*$ under the gc-rich distribution; g = 0.01.
Turning to a comparison of the power of our various statistics, we find that $D_2^*$ has greater power than $D_2^S$ for each k = 2, 4, 5, 6, 10 (result not shown).
We note that although in the uniform case, D2, $D_2^*$, and D2z coincide up to multiplicative and additive constants, Figure 4 shows slight differences between D2z and $D_2^*$. These differences stem from using the estimated parameters instead of the true model parameters in the test statistic D2z.
FIG. 4.
Alternative Model 1: Power curves for D2, D2z, $D_2^S$, and $D_2^*$ under the uniform distribution; g = 0.01, k = 5. Note: For the uniform case, D2 and $D_2^*$ differ by only a constant.
Figure 5 shows a typical scenario for alternative model 1, where both $D_2^S$ and $D_2^*$ have greater power than D2 for given k and g, and the power increases as the length, n, of the sequences increases. Even for a small g, we are able to notice the difference in the power of the various statistics. We also note that here D2z has higher power than D2.
FIG. 5.
Alternative Model 1: Power curves for D2, D2z, $D_2^S$, and $D_2^*$ under the gc-rich model; k = 5, g = 0.01. Note: All of $D_2^S$, D2z, and $D_2^*$ have greater power than D2 for given k and g.
For alternative model 2, the picture changes. Figure 6 shows that the power of D2 is poor for k = 2, 4, 5, 6; but when increasing the parameter k to 10, far beyond the length of the tuple which we transfer, the power increases dramatically.
FIG. 6.
Alternative Model 2: Power curves for D2 under the gc-rich distribution; g = 0.05. This graph suggests that the power increases with k.
In contrast, Figure 7 shows that for $D_2^S$ the power is moderate for all values of k in the plot, and it does not show a marked increase with sequence length. Using k = 10, instead of k = 6, seems to decrease the power slightly.
FIG. 7.
Alternative Model 2: Power curves for $D_2^S$ under the gc-rich distribution; g = 0.05.
For $D_2^*$, Figure 8 shows that the power increases with k, and increasing sequence length slightly improves the power. For k = 10, the power approaches 1 for long sequences.
FIG. 8.
Alternative Model 2: Power curves for $D_2^*$ under the gc-rich distribution; g = 0.05.
For the alternative model 2, Figures 6–8 suggest that, under the gc-rich, nonuniform distribution, for D2 and $D_2^*$, the greater the k value, the greater the power, even if this comes with a higher computational cost. We note that for fixed k, $D_2^*$ has greater power than D2. Moreover, $D_2^S$ has smaller power than D2 for k = 10 and long sequences. Also, we need a larger g value to see the differentiation of the power between the various k values for alternative model 2. This is due to the fact that in the first alternative model, a particular motif makes a large contribution to the statistics. In the second model, however, the segment transferred from sequence 1 might be similar to the corresponding segment it replaces in sequence 2; hence, a greater g value is required before the sequences show similarity.
Under the alternative model 2, we find that for k ≤ 9 and g ≤ 0.05, D2 and D2z actually show a decrease in power as n increases, in certain intervals; this is illustrated in Figure 9.
FIG. 9.
Alternative Model 2: Power curves for D2, D2z, $D_2^S$, and $D_2^*$ under the gc-rich distribution when k = 5, g = 0.05. Note: For k = 5, D2 has the least power, and its power actually decreases as n increases.
For k = 10, D2 has higher power than $D_2^S$, but lower power than $D_2^*$; the higher power of $D_2^*$ comes at a great computational cost (result not shown).
Our findings suggest that D2 is not desirable as a statistic for sequence comparison. We conjecture that this is due to the fact that D2 is dominated by the normal components of the individual sequences and so is actually “measuring the sum of the departure of each sequence from the background” (Lippert et al., 2002) rather than the (dis)similarity between the two sequences. As n increases, D2 loses its detecting power as the two sequences become more similar.
As an aside, under the uniform distribution, in the alternative model 2, all three statistics behave similarly for k = 5, as expected; see Figure 10.
FIG. 10.
Alternative Model 2: Power curves for D2, D2z, $D_2^S$, and $D_2^*$ under the uniform distribution when k = 5, g = 0.05.
4. Using $D_2^S$ to Test for Similarity
Although in our simulations $D_2^*$ is more powerful than $D_2^S$, the statistic $D_2^S$ is still considerably more powerful than D2. For tests that would result in small p-values, as required for multiple testing, for example, simulating the empirical distribution of the test statistic under the null hypothesis can be time consuming. In contrast to $D_2^*$, the limiting distribution of $D_2^S$ is normal with mean zero, and hence, testing is straightforward; only the standard deviation needs to be estimated.
To illustrate the procedure, for fixed k and n, we generate 10,000 pairs of sequences under the gc-rich or uniform case and compute the $D_2^S$ scores for each pair. The standard deviation of $D_2^S$ is then estimated from these empirical scores.

Again, for fixed k and n, we generate 2000 pairs of sequences under the null model of no relationship between the two sequences, in both the gc-rich and uniform case, and we compute the $D_2^S$ score. Assuming asymptotic normality, we use a z-test to test the null hypothesis of no relationship, assuming mean zero and the estimated standard deviation.
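A sketch (ours) of this fast test, reusing `d2_s` from the Section 1 snippet; the null standard deviation is estimated once and then reused for every observed score.

```python
# z-test for D2^S with the standard deviation estimated by null simulation.
import numpy as np
from scipy.stats import norm

def d2s_null_sd(n, k, probs, letters=list("acgt"), reps=10_000, seed=4):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(reps):
        a = "".join(rng.choice(letters, size=n, p=probs))
        b = "".join(rng.choice(letters, size=n, p=probs))
        scores.append(d2_s(a, b, k))
    return float(np.std(scores, ddof=1))

def d2s_pvalue(score, null_sd):
    """One-sided z-test: the null mean of D2^S is zero by construction."""
    return float(norm.sf(score / null_sd))
```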
Then we generate 2000 pairs of sequences from alternative model 1, with motif insertion probabilities g = 0.001, 0.005, 0.01, 0.05, 0.1. The $D_2^S$ statistic is computed, and we carry out a z-test, based on the asymptotic normality of $D_2^S$. We repeat the procedure for the pattern transfer model, alternative model 2. We choose k = 4, 5, 6, because we know from our power simulations that $D_2^S$ works best when k is around the motif length of 5.
We compare to the results that we obtain using the empirical distribution of $D_2^S$ instead, where the empirical distribution function is based on 10,000 samples. In addition, we use the empirical distribution of $D_2^S$ based on 100,000 samples.
Tables 3 and 4 show the estimated type 1 and type 2 error rates in the gc-rich and uniform case; recall that the type 1 error is the probability of rejecting the null hypothesis although it is true. The type 2 error, the probability of accepting the null hypothesis although it is false, is estimated under alternative model 1 and under alternative model 2, with motif insertion probability and pattern transfer probability g taking on the values g = 0.005, 0.01, 0.05, 0.1. For each n and k, the first row gives the estimates from the z-test (abbreviated as z), and the second row gives the estimate from the empirical distribution function (abbreviated as e). Except for the puzzling case when n = 3200 with k = 6, the results are remarkably similar, and there is no clear advantage to using the empirical distribution function when it is based on a relatively small number of samples. The general observation is that the normal approximation for $D_2^S$ gives a fast method for assessing statistical significance.
Table 3.
The Estimated Type 1 and 2 Error Rates When Applying the z-Test Using the Estimated Variance, Using 2000 Samples
| Length | k | Method | Type 1 | Type 2: M1g1 | Type 2: M1g2 | Type 2: M2g3 | Type 2: M2g4 |
|---|---|---|---|---|---|---|---|
| 3200 | 4 | z | 0.001 | 0.966 | 0.583 | 0.965 | 0.764 |
|  |  | e | 0.001 | 0.964 | 0.555 | 0.961 | 0.745 |
|  |  | t | 0.000 | 0.971 | 0.609 | 0.972 | 0.784 |
| 3200 | 5 | z | 0.002 | 0.958 | 0.583 | 0.881 | 0.365 |
|  |  | e | 0.001 | 0.977 | 0.658 | 0.922 | 0.450 |
|  |  | t | 0.001 | 0.975 | 0.647 | 0.916 | 0.433 |
| 3200 | 6 | z | 0.388 | 0.260 | 0.020 | 0.100 | 0.001 |
|  |  | e | 0.000 | 0.973 | 0.647 | 0.910 | 0.273 |
|  |  | t | 0.001 | 0.974 | 0.654 | 0.913 | 0.282 |
| 6400 | 4 | z | 0.000 | 0.904 | 0.112 | 0.973 | 0.794 |
|  |  | e | 0.000 | 0.907 | 0.114 | 0.976 | 0.800 |
|  |  | t | 0.001 | 0.897 | 0.109 | 0.972 | 0.788 |
| 6400 | 5 | z | 0.000 | 0.910 | 0.132 | 0.921 | 0.442 |
|  |  | e | 0.000 | 0.920 | 0.148 | 0.936 | 0.480 |
|  |  | t | 0.001 | 0.914 | 0.140 | 0.927 | 0.463 |
| 6400 | 6 | z | 0.028 | 0.635 | 0.022 | 0.607 | 0.049 |
|  |  | e | 0.000 | 0.931 | 0.205 | 0.943 | 0.319 |
|  |  | t | 0.001 | 0.921 | 0.191 | 0.936 | 0.294 |
| 12,800 | 4 | z | 0.000 | 0.592 | 0.000 | 0.975 | 0.794 |
|  |  | e | 0.001 | 0.545 | 0.000 | 0.963 | 0.752 |
|  |  | t | 0.001 | 0.593 | 0.000 | 0.975 | 0.794 |
| 12,800 | 5 | z | 0.000 | 0.627 | 0.000 | 0.919 | 0.433 |
|  |  | e | 0.000 | 0.696 | 0.000 | 0.949 | 0.514 |
|  |  | t | 0.001 | 0.648 | 0.000 | 0.929 | 0.461 |
| 12,800 | 6 | z | 0.002 | 0.493 | 0.000 | 0.820 | 0.139 |
|  |  | e | 0.000 | 0.697 | 0.002 | 0.940 | 0.304 |
|  |  | t | 0.001 | 0.684 | 0.002 | 0.933 | 0.286 |
| 25,600 | 4 | z | 0.001 | 0.070 | 0.000 | 0.961 | 0.798 |
|  |  | e | 0.002 | 0.062 | 0.000 | 0.954 | 0.767 |
|  |  | t | 0.002 | 0.067 | 0.000 | 0.957 | 0.782 |
| 25,600 | 5 | z | 0.001 | 0.106 | 0.000 | 0.917 | 0.452 |
|  |  | e | 0.000 | 0.121 | 0.000 | 0.929 | 0.484 |
|  |  | t | 0.001 | 0.105 | 0.000 | 0.916 | 0.447 |
| 25,600 | 6 | z | 0.004 | 0.083 | 0.000 | 0.868 | 0.201 |
|  |  | e | 0.003 | 0.100 | 0.000 | 0.887 | 0.234 |
|  |  | t | 0.001 | 0.132 | 0.000 | 0.923 | 0.294 |
"M1/M2" refers to alternative model 1/2; "g1/g2/g3/g4" refers to the cases where g = 0.005, 0.01, 0.05, 0.1, respectively, where g is the parameter of the Bernoulli random variables $Z_i$. So "M1g3" means "alternative model 1, g = 0.05." The first row gives the estimates from the z-test (abbreviated as z), the second row gives the estimate from the empirical distribution function (abbreviated as e), both based on 10,000 samples. The third row, abbreviated as t, gives the estimate from the empirical distribution function based on 100,000 samples.
Table 4.
The Estimated Type 1 and 2 Error Rates When Applying the z-Test Using the Estimated Variance, Using 2000 Samples, but for the Uniform Case
| Length | k | Method | Type 1 | Type 2: M1g1 | Type 2: M1g2 | Type 2: M2g3 | Type 2: M2g4 |
|---|---|---|---|---|---|---|---|
| 3200 | 4 | z | 0.001 | 0.978 | 0.574 | 0.961 | 0.746 |
|  |  | e | 0.003 | 0.966 | 0.518 | 0.945 | 0.694 |
|  |  | t | 0.001 | 0.980 | 0.593 | 0.967 | 0.760 |
| 3200 | 5 | z | 0.002 | 0.977 | 0.583 | 0.889 | 0.346 |
|  |  | e | 0.001 | 0.986 | 0.639 | 0.919 | 0.404 |
|  |  | t | 0.001 | 0.984 | 0.621 | 0.909 | 0.380 |
| 3200 | 6 | z | 0.048 | 0.744 | 0.198 | 0.438 | 0.018 |
|  |  | e | 0.001 | 0.986 | 0.711 | 0.912 | 0.237 |
|  |  | t | 0.001 | 0.985 | 0.678 | 0.900 | 0.218 |
| 6400 | 4 | z | 0.001 | 0.882 | 0.080 | 0.953 | 0.743 |
|  |  | e | 0.002 | 0.848 | 0.063 | 0.934 | 0.688 |
|  |  | t | 0.001 | 0.887 | 0.084 | 0.956 | 0.752 |
| 6400 | 5 | z | 0.001 | 0.888 | 0.107 | 0.891 | 0.333 |
|  |  | e | 0.002 | 0.874 | 0.093 | 0.876 | 0.304 |
|  |  | t | 0.001 | 0.899 | 0.123 | 0.903 | 0.363 |
| 6400 | 6 | z | 0.004 | 0.824 | 0.064 | 0.801 | 0.104 |
|  |  | e | 0.002 | 0.897 | 0.113 | 0.883 | 0.177 |
|  |  | t | 0.001 | 0.910 | 0.135 | 0.894 | 0.203 |
| 12,800 | 4 | z | 0.003 | 0.534 | 0.000 | 0.951 | 0.750 |
|  |  | e | 0.002 | 0.580 | 0.000 | 0.961 | 0.783 |
|  |  | t | 0.003 | 0.539 | 0.000 | 0.952 | 0.751 |
| 12,800 | 5 | z | 0.002 | 0.578 | 0.000 | 0.886 | 0.356 |
|  |  | e | 0.002 | 0.585 | 0.000 | 0.887 | 0.365 |
|  |  | t | 0.002 | 0.590 | 0.000 | 0.891 | 0.371 |
| 12,800 | 6 | z | 0.002 | 0.491 | 0.000 | 0.823 | 0.110 |
|  |  | e | 0.001 | 0.600 | 0.000 | 0.892 | 0.171 |
|  |  | t | 0.001 | 0.605 | 0.001 | 0.895 | 0.175 |
| 25,600 | 4 | z | 0.000 | 0.059 | 0.000 | 0.963 | 0.738 |
|  |  | e | 0.002 | 0.050 | 0.000 | 0.949 | 0.706 |
|  |  | t | 0.001 | 0.059 | 0.000 | 0.960 | 0.737 |
| 25,600 | 5 | z | 0.000 | 0.086 | 0.000 | 0.909 | 0.361 |
|  |  | e | 0.000 | 0.110 | 0.000 | 0.923 | 0.418 |
|  |  | t | 0.001 | 0.082 | 0.000 | 0.901 | 0.350 |
| 25,600 | 6 | z | 0.002 | 0.093 | 0.000 | 0.865 | 0.154 |
|  |  | e | 0.002 | 0.114 | 0.000 | 0.889 | 0.183 |
|  |  | t | 0.003 | 0.113 | 0.000 | 0.888 | 0.182 |
5. Phase Transition
In this section, we explore the effect of small deviations from the uniform distribution for the D2 statistic only. We restrict attention to the alphabet {a, c, g, t} and word size k = 1; both sequences are of the same length n. Again $p_\alpha$ denotes the probability of the letter α. Then the standardized counts $\tilde{X}_\alpha / \sqrt{n p_\alpha (1 - p_\alpha)}$ and $\tilde{Y}_\alpha / \sqrt{n p_\alpha (1 - p_\alpha)}$ both tend to standard normal variables when n tends to infinity. With this notation, and noting that $\sum_\alpha \tilde{X}_\alpha = \sum_\alpha \tilde{Y}_\alpha = 0$, we obtain

$$D_2 = \sum_\alpha \tilde{X}_\alpha \tilde{Y}_\alpha + n \sum_\alpha p_\alpha \big(\tilde{X}_\alpha + \tilde{Y}_\alpha\big) + n^2 \sum_\alpha p_\alpha^2. \qquad (7)$$
When the distribution of the alphabet is uniform, so that $p_\alpha = 1/4$ for all α, the second term in Equation (7) vanishes,

$$D_2 - \frac{n^2}{4} = \sum_\alpha \tilde{X}_\alpha \tilde{Y}_\alpha,$$

and so it is asymptotically nonnormal (in fact, after scaling, a sum of products of standard normal variables).
In the situation where the letter probabilities $p_\alpha$ do not depend on n and are not all equal, the second term in Equation (7) dominates the first term; as

$$n \sum_\alpha p_\alpha \big(\tilde{X}_\alpha + \tilde{Y}_\alpha\big)$$

is asymptotically normal with standard deviation of order $n^{3/2}$, the scaled limit is a normal distribution.
Next we assume that $(p_a(n), p_g(n), p_c(n), p_t(n))$ changes with n in such a way that there exists a function f(n) → 0 and constants $C_l$, l = a, g, c, t, satisfying $C_a + C_g + C_c + C_t = 0$, such that

$$p_l(n) = \frac{1}{4} + C_l f(n)$$

for each letter $l \in \{a, g, c, t\}$. Then

$$D_2 - \mathbb{E} D_2 = \frac{\sqrt{3}}{4}\, n\, (1 + \gamma(n))\, W_n + \Theta\big(n^{3/2} f(n)\big)\, Z_n, \qquad (8)$$

where $W_n$ tends in distribution to a sum of products of independent standard normal variables, $Z_n$ tends to a standard normal variable, γ(n) → 0 as n → ∞, and Θ(f(n)) indicates a term that has the same order as f(n).
Let $f(n) = n^{-(0.5 + \varepsilon)}$. When ε < 0, the second term in (8) will dominate and D2, suitably standardized, will tend to a normal distribution. When ε > 0 the first term dominates and the limit will be nonnormal. Thus, we expect a phase transition from normal to nonnormal as ε changes from negative to positive. Intuitively, the ratio $R_n$ of the coefficient of the first term over the coefficient of the second term in Equation (8) can be thought of as a "ratio of dominance";

$$R_n = \frac{(\sqrt{3}/4)\, n}{n^{3/2} f(n)} = \frac{\sqrt{3}}{4}\, n^{\varepsilon}.$$

Table 5 shows the decrease of $R_n$ for increasing n (row labels) and decreasing ε (column labels).
Table 5.
The Ratio $R_n$ of the Coefficient of the First Term Over the Coefficient of the Second Term in Equation (8) for Different Values of n and ε

| n | ε = −0.01 | ε = −0.05 | ε = −0.10 | ε = −0.15 | ε = −0.20 |
|---|---|---|---|---|---|
| $2^{10} \times 100$ | 0.386 | 0.243 | 0.137 | 0.077 | 0.043 |
| $2^{20} \times 100$ | 0.360 | 0.172 | 0.068 | 0.027 | 0.011 |
| $2^{30} \times 100$ | 0.336 | 0.122 | 0.034 | 0.010 | 0.003 |
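As a check on the formula above, the following snippet (ours) reproduces the entries of Table 5 directly:

```python
# Reproduces Table 5 from R_n = (sqrt(3) / 4) * n**eps.
import numpy as np

for j in (10, 20, 30):
    n = 2**j * 100
    row = [np.sqrt(3) / 4 * n**eps
           for eps in (-0.01, -0.05, -0.10, -0.15, -0.20)]
    print(f"2^{j} x 100:", [round(r, 3) for r in row])
```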
To run simulations in the vicinity of this phase transition, we consider two types of probability vectors for the alphabet in the neighborhood of the uniform distribution, with $f(n) = n^{-(0.5 + \varepsilon)}$. The type I probability vector is chosen as

$$(p_a, p_c, p_g, p_t) = \Big(\tfrac{1}{4} + f(n),\; \tfrac{1}{4} - f(n),\; \tfrac{1}{4},\; \tfrac{1}{4}\Big),$$

giving $(C_a, C_c, C_g, C_t) = (1, -1, 0, 0)$. In the second scenario, type II, the probability vector perturbs all components,

$$(p_a, p_c, p_g, p_t) = \Big(\tfrac{1}{4} + f(n),\; \tfrac{1}{4} + f(n),\; \tfrac{1}{4} - f(n),\; \tfrac{1}{4} - f(n)\Big),$$

so that $(C_a, C_g, C_c, C_t) = (1, -1, 1, -1)$.
Under the type I model, we can show that the variance of $\sum_\alpha C_\alpha (\tilde{X}_\alpha + \tilde{Y}_\alpha)$ is approximately n. Thus, there exists $Z_n \to N(0, 1)$ as n tends to infinity such that

$$n \sum_\alpha p_\alpha \big(\tilde{X}_\alpha + \tilde{Y}_\alpha\big) = n^{3/2} f(n)\, Z_n.$$

Equation (8) can be rewritten as

$$D_2 - \mathbb{E} D_2 = \frac{\sqrt{3}}{4}\, n\, (1 + \gamma(n))\, W_n + n^{3/2} f(n)\, Z_n. \qquad (9)$$
Under the type II model, the variance of $\sum_\alpha C_\alpha (\tilde{X}_\alpha + \tilde{Y}_\alpha)$ is approximately 2n and

$$n \sum_\alpha p_\alpha \big(\tilde{X}_\alpha + \tilde{Y}_\alpha\big) = \sqrt{2}\, n^{3/2} f(n)\, Z_n.$$

Hence, under the type II model, Equation (8) can be rewritten as

$$D_2 - \mathbb{E} D_2 = \frac{\sqrt{3}}{4}\, n\, (1 + \gamma(n))\, W_n + \sqrt{2}\, n^{3/2} f(n)\, Z_n. \qquad (10)$$
The ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is $\sqrt{2}$-fold larger than that for Equation (10). Therefore, we expect normality to appear at smaller absolute ε values for the type II vector.
For given n and ε, we generate sequences of length n using both types of distribution vectors. The D2 scores for word size k = 1 are then tabulated. We use a Kolmogorov–Smirnov test to test the hypothesis that D2 is normally distributed, and the corresponding p-value is obtained. Here again we use the theoretical mean and variance for the test. Table 6 gives the p-values for different values of n and ε under the two models; again we report only the first four digits.
Table 6.
The p-Values of the Kolmogorov–Smirnov Test for Testing the Normality of D2 for Letter Distributions Which Are Close to Uniform; $f(n) = n^{-(0.5+\varepsilon)}$

Type I: $(p_a, p_c, p_g, p_t) = (1/4 + f(n),\; 1/4 - f(n),\; 1/4,\; 1/4)$

| ε | n = $2^5 \times 100$ | n = $2^{10} \times 100$ | n = $2^{15} \times 100$ | n = $2^{20} \times 100$ |
|---|---|---|---|---|
| 0.1 | 0 | 0 | 0 | 0 |
| 0.01 | 0 | 0 | 0 | 0 |
| 0.001 | 0 | 0 | 0 | 0 |
| 1 e−04 | 0 | 0 | 0 | 0 |
| 1 e−05 | 0 | 0 | 0 | 0 |
| −1 e−05 | 0 | 0 | 0 | 0 |
| −1 e−04 | 0 | 0 | 0 | 0 |
| −0.001 | 0.0001 | 0 | 0.0001 | 0 |
| −0.01 | 0 | 0 | 0.0002 | 0 |
| −0.05 | 0 | 0 | 0 | 0.0088 |
| −0.1 | 0.0424 | 0.1402 | 0.2754 | 0.6383 |
| −0.15 | 0.2557 | 0.1027 | 0.2041 | 0.9915 |
| −0.2 | 0.1198 | 0.6978 | 0.2258 | 0.4000 |
Type II: $(p_a, p_c, p_g, p_t) = (1/4 + f(n),\; 1/4 + f(n),\; 1/4 - f(n),\; 1/4 - f(n))$

| ε | n = $2^5 \times 100$ | n = $2^{10} \times 100$ | n = $2^{15} \times 100$ | n = $2^{20} \times 100$ |
|---|---|---|---|---|
| 0.1 | 0 | 0 | 0 | 0 |
| 0.01 | 0.0005 | 0 | 0 | 0 |
| 0.001 | 0 | 0 | 0 | 0 |
| 1 e−04 | 0 | 0.0005 | 0 | 0 |
| 1 e−05 | 0 | 0.0015 | 0 | 0 |
| −1 e−05 | 0.0049 | 0 | 0 | 0 |
| −1 e−04 | 0.0020 | 0 | 0 | 0 |
| −0.001 | 0 | 0 | 0 | 0 |
| −0.01 | 0 | 0 | 0 | 0.0199 |
| −0.05 | 0.0002 | 0.0069 | 0.0005 | 0.3162 |
| −0.1 | 0.0866 | 0.0637 | 0.3941 | 0.3159 |
| −0.15 | 0.5832 | 0.5113 | 0.0326 | 0.5910 |
| −0.2 | 0.1015 | 0.5146 | 0.9437 | 0.4827 |
Table 6 indicates that under the type II model the distribution of D2 is not significantly different from normality when ε ≤ −0.05 and n = $2^{20} \times 100$, while significantly different from normality when ε = −0.05 and n ≤ $2^{15} \times 100$. On the other hand, under the type I model, the distribution of D2 is significantly different from normality even when ε = −0.05 and n = $2^{20} \times 100$. The simulation results are consistent with our intuition. As shown in Table 5, the ratio $R_n$ of the coefficient of the first term over that of the second term in Equation (8) is less than 0.1 if ε < −0.10 when n = $2^{20} \times 100$. This can explain why normality of D2 appears if ε < −0.10 when n = $2^{20} \times 100$ for both type I and II models. Further, as the ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is $\sqrt{2}$-fold larger than that for Equation (10), normality of D2 begins to appear already when ε < −0.05 for the type II model.
6. Discussion
The commonly used statistic D2 asymptotically ignores the joint word occurrences in the two sequences unless all letters are almost equally likely; in the latter scenario, a phase transition occurs. Hence, the statistic is neither robust nor informative in the regime where it is asymptotically normal. The main advantage of D2 is that it is easy to compute.
The proposed $D_2^S$ statistic is also easy to compute, but it can be compared with a normal distribution to assess significance, and it performs well in a power study.
The $D_2^*$ statistic is more powerful than $D_2^S$ in our simulation study and is also easy to compute, but its asymptotic distribution does not have a convenient form; instead, it is best assessed using simulation, which is time consuming, as the tail of the distribution would need to be estimated.
Our recommendation is to discard D2, to use $D_2^S$ when computing time is limited, and ideally to use $D_2^*$ for sequence comparison based on k-tuple content.
Our results allow for a number of generalizations. The normal approximation for the word counts in each individual sequence does not assume that the underlying letter distribution is the same as in the other sequence. Hence, the normal approximation for $D_2^S$ also holds when the sequences do not follow the same underlying distribution on the letters.
Huang (2002) gave a related normal approximation for one sequence in the more general situation that the sequence is generated by a homogeneous Markov chain. Kantorovitz et al. (2007b) already successfully adapted D2z to the Markov case. Also for $D_2^S$ the generalization of our results to that setting should be straightforward; the error bounds would need to be adjusted. Burden et al. (2008) generalized D2z to allow for mismatches, on the four-letter alphabet {a, c, g, t} under the Bernoulli model that pa = pt and pc = pg; they called the m-neighborhood of a word w of length k the set of all words which differ by at most m letters from w. The generalized statistic then counts the number of all m-neighborhood matches of all k-words between two sequences. With our normal approximation for all word counts, $D_2^S$ could be generalized similarly to allow for a certain number of mismatches. The quality of the normal approximation will depend on the number m of permitted mismatches.
We also indicate that more than two sequences could be compared in a similar fashion. Quine (1994) stated the result that if $Z_1, \ldots, Z_p$ are independent normal random variables with zero means and variances $\sigma_1^2, \ldots, \sigma_p^2$, then

$$\frac{\prod_{i} Z_i}{\sqrt{\sum_{i} \prod_{j \ne i} Z_j^2}} \sim N\!\left(0,\; \Big(\sum_{i} \sigma_i^{-1}\Big)^{-2}\right),$$

where both sums are over all integers $1 \le i \le p$ (Melnykov and Chen, 2007). This suggests the extension of the $D_2^S$ statistic for multiple sequence comparison by taking the products of the individual word counts and standardizing them as earlier. A normal approximation is then still valid.
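A quick simulation (ours) of the quoted result for p = 3 normals with illustrative standard deviations 1, 2, and 0.5:

```python
# Checks the self-normalized product of p = 3 independent mean-zero normals.
import numpy as np

rng = np.random.default_rng(5)
sigma = np.array([1.0, 2.0, 0.5])          # illustrative standard deviations
Z = rng.standard_normal((10**6, 3)) * sigma
num = Z.prod(axis=1)
sq = Z**2
# Denominator: sqrt of the sum over i of prod_{j != i} Z_j^2.
denom = np.sqrt(sum(sq[:, [j for j in range(3) if j != i]].prod(axis=1)
                    for i in range(3)))
w = num / denom
print(w.std(), 1.0 / np.sum(1.0 / sigma))  # both close to (sum 1/sigma_i)^-1
```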
Similarly, we could extend $D_2^*$ as the sum, over all words, of the product of more than two standardized word counts. Springer and Thompson (1966) gave a formula for the density of the product of independent standard normals. Again, the covariance structure of the word counts within one sequence would make it advisable to assess the limiting distribution via simulation.
Acknowledgments
G.R. was supported in part by EPSRC grant no. GR/R52183/01, and by BBSRC and EPSRC through OCISB. D.C. was supported by an Overseas Postdoctoral Fellowship from the National University of Singapore. F.S. was supported by NIH grant no. P50 HG 002790 and R21AG032743. M.S.W. was supported by NIH grant no. P50 HG 002790 and R21AG032743.
Disclosure Statement
No competing financial interests exist.
References
- Burden, C.J., Kantorovitz, M.R., and Wilson, S.R. 2008. Approximate word matches between two random sequences. Ann. Appl. Probab. 18, 1–21.
- Forêt, S., Kantorovitz, M., and Burden, C. 2006. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformat. 7(Suppl 5), S21.
- Forêt, S., Wilson, S.R., and Burden, C.J. 2009. Empirical distribution of k-word matches in biological sequences. Pattern Recognit. 42, 539–548.
- Huang, H. 2002. Error bounds on multivariate normal approximations for word count statistics. Adv. Appl. Probab. 34, 559–586.
- Kantorovitz, M.R. 2007. An example of a stationary, triplewise independent triangular array for which the CLT fails. Statist. Probab. Lett. 77, 539–542.
- Kantorovitz, M.R., Booth, H.S., Burden, C.J., et al. 2007a. Asymptotic behavior of k-word matches between two uniformly distributed sequences. J. Appl. Probab. 44, 788–805.
- Kantorovitz, M.R., Robinson, G.E., and Sinha, S. 2007b. A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics 23, i249–i255.
- Lilliefors, H.W. 1967. On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J. Am. Stat. Assoc. 62, 399–402.
- Lippert, R.A., Huang, H., and Waterman, M.S. 2002. Distributional regimes for the number of k-word matches between two random sequences. Proc. Natl. Acad. Sci. USA 99, 13980–13989.
- Lundstrom, R. 1990. Stochastic models and statistical methods for DNA sequence data [Ph.D. thesis]. Department of Mathematics, University of Utah, Salt Lake City, UT.
- Melnykov, I., and Chen, J.T. 2007. A connection between self-normalized products and stable laws. Stat. Probab. Lett. 77, 1662–1665.
- Quine, M.P. 1994. A result of Shepp. Appl. Math. Lett. 7, 33–34.
- Shepp, L. 1964. Problem 62-9: Normal functions of normal random variables. SIAM Rev. 6, 459.
- Springer, M.D., and Thompson, W.E. 1966. The distribution of products of independent random variables. SIAM J. Appl. Math. 14, 511–526.
- Stuart, A., and Ord, J.K. 1987. Kendall's Advanced Theory of Statistics, 5th ed., Vol. 1. Griffin, London.
- Waterman, M.S. 1995. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall/CRC, Boca Raton, FL.