Journal of Computational Biology. 2009 Dec;16(12):1615–1634. doi: 10.1089/cmb.2009.0198

Alignment-Free Sequence Comparison (I): Statistics and Power

Gesine Reinert 1, David Chew 2, Fengzhu Sun 3,4, Michael S. Waterman 3,4
PMCID: PMC2818754  NIHMSID: NIHMS164900  PMID: 20001252

Abstract

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on comparing the k-tuple content of both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call D2S and D2*. For D2S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed as the sequence lengths tend to infinity, and is not dominated by the noise in the individual sequences. The second statistic, D2*, outperforms D2S in terms of power for detecting the relatedness of the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D2*, we cannot provide a closed form for power calculations.

Key words: alignment-free, normal approximation, normal distribution, sequence alignment, word count statistics

1. Introduction

Comparison of the similarities between two segments of biological sequences using k-tuples (also called k-grams or k-words) arises from the need for rapid sequence comparison. Such methods are often employed in cDNA sequence comparisons. Today, next-generation sequencing methods are producing unprecedented volumes of sequence data, so we expect the use of k-tuples to play an increasingly important role in molecular sequence and genome comparison. This article explores in some detail the statistic on which one of these methods is based, along with other, substantially superior statistics.

One of the most widely used statistics for sequence comparison based on k-tuples is the so-called D2 statistic, which is based on the joint k-tuple content in the two sequences. If two sequences are closely related, we would expect the k-tuple content of both sequences to be very similar.

More formally, suppose that two sequences, A = A1 A2 ⋯ An and B = B1 B2 ⋯ Bm, say, are composed of letters that are drawn from a finite alphabet 𝒜 of size d. For a ∈ 𝒜, let pa denote the probability of letter a. For w = (w1, …, wk) ∈ 𝒜^k, let

X_w = Σ_{i=1}^{n−k+1} 1{(A_i, A_{i+1}, …, A_{i+k−1}) = w}

count the number of occurrences of w in A, and similarly, Yw counts the number of occurrences of w in B. Here, n̄ = n − k + 1; similarly, we put for later use m̄ = m − k + 1. Then D2 is defined by

D2 = Σ_{w ∈ 𝒜^k} X_w Y_w.
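To make the definition concrete, here is a minimal Python sketch of the k-tuple counts X_w and Y_w and the resulting D2 score (the function names are ours, not from the article):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count occurrences X_w of every k-tuple w in seq (n - k + 1 windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """D2 = sum over words w of X_w * Y_w."""
    x, y = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(x[w] * y[w] for w in x if w in y)

score = d2("acgtacgt", "ttacgtaa", 2)  # -> 8 for these toy sequences
```

Only words occurring in both sequences contribute, so iterating over the keys of one counter suffices.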

The null model is typically chosen to be such that the letters are independent and identically distributed (i.i.d.), and that the two sequences are independent. Using this model, Lippert et al. (2002) derived a Poisson approximation, a compound Poisson approximation, and a normal approximation for D2; the normal approximation is only valid under the assumption that not all letters of the alphabet are equally likely. In the case that all letters are equally likely, the D2 statistic looks asymptotically like the sum of products of independent normal variables. Lippert et al. (2002) also found that the D2 statistic is dominated by background noise, in the nonuniform case.

In the work of Kantorovitz et al. (2007a), it was shown that in the regime that all letters are equally likely, the standardized statistic

D2z = (D2 − E(D2)) / √Var(D2)

is asymptotically normally distributed when first the sequence length and then the word length tend to infinity, while the alphabet size stays fixed. In clustering biologically related sequences, Kantorovitz et al. (2007b) found that D2z outperforms D2. The heuristic argument is that the background models for the two sequences may differ, and the D2 statistic should hence be normalized to account for the different background distributions of the sequences. Yet in the nonuniform case the issue remains that the variability is dominated by the noise in the single sequences.

In this article, we propose a new statistic, which is a self-standardized version of D2. In general, Shepp (1964) observed that, if X and Y are independent mean zero normals, X with variance σ² and Y with variance τ², then XY/√(X² + Y²) is again normal, with variance σ²τ²/(σ + τ)². For w = (w1, …, wk) ∈ 𝒜^k, p_w = p_{w1} ⋯ p_{wk} is the probability of occurrence of w, and the centered count variables are denoted as

X̃_w = X_w − n̄ p_w  and  Ỹ_w = Y_w − m̄ p_w.

We introduce the new count statistic as

D2S = Σ_{w ∈ 𝒜^k} X̃_w Ỹ_w / √(X̃_w² + Ỹ_w²)    (1)

Here we set 0/0 := 0. The superscript "S" stands for "Shepp," and also for "self-standardized." We shall see that, under reasonable assumptions, D2S is approximately normally distributed.

In practice, we shall usually have to replace pa, the (unobserved) letter probabilities, by p̂a, the relative count of letter a in the concatenation of the two sequences, based on the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution. We then estimate the probability of occurrence of w = (w1, …, wk) by p̂_w = p̂_{w1} ⋯ p̂_{wk}. In our simulations, we always estimate the letter probabilities, even when we assume that all letters are equally likely.
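A small sketch of this estimation step (helper names are ours):

```python
from collections import Counter

def estimate_letter_probs(seq_a, seq_b):
    """p-hat_a: relative frequency of letter a in the concatenation of both sequences."""
    joined = seq_a + seq_b
    counts = Counter(joined)
    return {a: c / len(joined) for a, c in counts.items()}

def word_prob(word, probs):
    """p-hat_w = product of the estimated letter probabilities (i.i.d. null model)."""
    p = 1.0
    for a in word:
        p *= probs[a]
    return p

probs = estimate_letter_probs("acgt", "acgg")
pw = word_prob("ag", probs)  # 0.25 * 0.375 = 0.09375
```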

We also study the following version of the word count statistic:

D2* = Σ_{w ∈ 𝒜^k} X̃_w Ỹ_w / √(n̄ p_w m̄ p_w)    (2)

which in our simulations outperforms not only D2 but also D2S, in terms of power for detecting the relatedness of the two sequences. This statistic comes about by considering the fully standardized sum Σ_w X̃_w Ỹ_w / (√Var(X_w) √Var(Y_w)); but as the variance is costly to compute, it is replaced by the estimated mean word count n̄ p_w (respectively m̄ p_w), which is justified when the probability of the word pattern is small, by a Poisson approximation for the individual word counts. We justify in Section 2 that D2* can be viewed as the sum of products of independent normal variables, and we suggest how to simulate from its asymptotic distribution, for which we do not have a closed-form expression.
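Equations (1) and (2) can be computed directly from the centered counts; the following sketch (our own helper, with letter probabilities estimated from the concatenation as described above) implements both statistics:

```python
import math
from collections import Counter
from itertools import product

def d2s_and_d2star(seq_a, seq_b, k):
    """Compute D2S (Eq. 1) and D2* (Eq. 2), estimating letter probabilities
    from the concatenation of the two sequences."""
    n_bar = len(seq_a) - k + 1
    m_bar = len(seq_b) - k + 1
    x = Counter(seq_a[i:i + k] for i in range(n_bar))
    y = Counter(seq_b[i:i + k] for i in range(m_bar))
    joined = seq_a + seq_b
    letters = Counter(joined)
    probs = {a: c / len(joined) for a, c in letters.items()}
    d2s = d2star = 0.0
    for w in map("".join, product(sorted(probs), repeat=k)):
        pw = math.prod(probs[a] for a in w)
        xt = x[w] - n_bar * pw          # centered count X~_w
        yt = y[w] - m_bar * pw          # centered count Y~_w
        denom = math.hypot(xt, yt)
        if denom > 0:                   # convention 0/0 = 0
            d2s += xt * yt / denom
        d2star += xt * yt / (math.sqrt(n_bar * m_bar) * pw)
    return d2s, d2star

d2s_val, d2star_val = d2s_and_d2star("acgtacgtggca", "ttacgcgtaacg", 3)
```

For two identical sequences every summand is nonnegative, so both scores are then nonnegative, a simple sanity check.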

To explain the problem with D2, write

D2 = Σ_w X̃_w Ỹ_w + m̄ Σ_w p_w X̃_w + n̄ Σ_w p_w Ỹ_w + n̄ m̄ Σ_w p_w².    (3)

Approximately, if n and m are large, under the null model, Σ_w p_w X̃_w should follow a mean zero normal distribution with variance of order O(n) in the sequence length. The distribution of Σ_w X̃_w Ỹ_w should be approximately, for large n and m, that of a sum of products of pairs of independent mean zero normal variables, with variance of order O(nm).

If all letters are equally likely, then p_w = d^{−k} for all words w, and hence Σ_w p_w X̃_w = d^{−k} Σ_w X̃_w = 0, and similarly Σ_w p_w Ỹ_w = 0, giving that D2 = Σ_w X̃_w Ỹ_w + n̄ m̄ d^{−k}. The variability in D2 is then the same as the variability in Σ_w X̃_w Ỹ_w, and indeed as in D2z.

When not all letters are equally likely, as the variance of Xw is of order n and the variance of Yw is of order m, the variance of m̄ Σ_w p_w X̃_w is of order O(nm²), and similarly the variance of n̄ Σ_w p_w Ỹ_w is of order O(n²m). Hence, the variability in D2 is dominated by the variability in Σ_w p_w X̃_w and Σ_w p_w Ỹ_w. Thus, in this case, the variability in D2 is dominated by the terms that reflect the noise in the single sequences only.
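This domination is easy to see in simulation. The sketch below compares the empirical variance of the cross term Σ_w X̃_w Ỹ_w with that of the single-sequence term m̄ Σ_w p_w X̃_w from decomposition (3); the nonuniform letter probabilities (0.1, 0.4, 0.4, 0.1), k = 2, and n = m = 500 are illustrative choices of ours, not the article's:

```python
import numpy as np

rng = np.random.default_rng(0)
letters, p = np.arange(4), np.array([0.1, 0.4, 0.4, 0.1])   # nonuniform letters
n = m = 500
k = 2
reps = 2000

def centered_counts(seq):
    """X~_w = X_w - (n - k + 1) p_w for all 16 2-words, encoded base 4."""
    words = seq[:-1] * 4 + seq[1:]
    counts = np.bincount(words, minlength=16).astype(float)
    pw = np.outer(p, p).ravel()
    return counts - (len(seq) - k + 1) * pw, pw

cross, single = [], []
for _ in range(reps):
    a = rng.choice(letters, size=n, p=p)
    b = rng.choice(letters, size=m, p=p)
    xt, pw = centered_counts(a)
    yt, _ = centered_counts(b)
    cross.append(np.dot(xt, yt))                 # sum_w X~_w Y~_w, Var = O(nm)
    single.append((m - k + 1) * np.dot(pw, xt))  # m-bar * sum_w p_w X~_w, Var = O(nm^2)

var_cross, var_single = np.var(cross), np.var(single)
```

With these parameters the single-sequence term's variance exceeds the cross term's by an order of magnitude, illustrating the O(nm²) versus O(nm) scaling.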

The asymptotic normality of D2 for both nucleic acid and amino acid sequences has been studied empirically by Forêt et al. (2009). However, no power study was undertaken; our argument shows that in the nonuniform case the asymptotic normality of D2 only stems from the asymptotic normality of the underlying word counts in the respective sequences.

Even in the regime that all letters are equally likely, if we only leave the "last" word w(d, k) = (d, d, …, d) out, forming the statistic D̃2 = Σ_{w ≠ w(d,k)} X_w Y_w, then Σ_{w ≠ w(d,k)} p_w X̃_w = −d^{−k} X̃_{w(d,k)}, which is not constant. So even if we just leave one word out of the whole set of possible words, if the sequence lengths are large and all other parameters are fixed, then the variability, now in D̃2, will be dominated by the variability in the single sequences. Hence, the D2 statistic is, in general, not useful for assessing whether the two underlying sequences are related.

This article is structured as follows. In Section 2, we discuss the distributions of D2S and D2* under the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution, and we present simulation results for testing the normality of D2, D2S, and D2*.

In Section 3, we study the power of the statistics D2, D2S, D2*, and D2z under two alternative scenarios. The first scenario is that the two sequences contain a common motif, whereas the second scenario is a pattern transfer model; we pick a word in the first sequence and use it to replace a word in the second sequence. Our results illustrate not only the poor performance of D2, but also the encouraging performances of D2S and of D2*.

Section 4 illustrates how the asymptotic normality of D2S gives a fast method for assessing statistical significance, as only its standard deviation has to be approximated and not the empirical distribution itself.

There is a caveat: if the distribution on the alphabet is very close to uniform, and if some other conditions are satisfied, which relate to having a large number of summands of products of pairs of independent normals, then D2 will behave like the sum of products of normally distributed variables, similar to the uniform case; when the deviation from uniformity increases, the asymptotic normality for D2 takes over. This phase transition is explored in Section 5.

We summarize our results in Section 6, and we briefly indicate generalizations to Markov chain models as well as to multiple sequence comparisons.

The proofs for Section 2 are presented in the Supplementary Material (see online supplementary material at www.liebertonline.com). The code for simulating from the distributions is available at www-rcf.usc.edu/∼fsun/Programs/D2/d2-all.html

2. The Distributions of D2S and D2* Under the Null Model

Here the null model is that the letters are i.i.d. and the two sequences are independent. We assume the same conditions on the word length k and the sequence length n as Huang (2002). For D2, Forêt et al. (2009) studied the empirical distribution via simulations, and they found that a gamma distribution outperforms the normal distribution in general. For longer sequences they found that the normal approximation itself is appropriate.

2.1. D2S and asymptotic normality

First, we focus on the word counts in a single sequence. Let

X̃ = (X̃_w, w ∈ 𝒜^k, w ≠ w(d, k)), where w(d, k) = (d, d, …, d),

be the vector of centered word counts with the last word w(d, k) left out; note that the left-out coordinate

X̃_{w(d,k)} = −Σ_{w ≠ w(d,k)} X̃_w    (4)

can be recovered from the set {X̃_w : w ≠ w(d, k)}, since Σ_w p_w = 1 implies Σ_{w ∈ 𝒜^k} X̃_w = 0. Huang (2002) showed a multivariate normal approximation for the word count vector in a single sequence. The limiting covariance matrix C needs some notation; see Section 12.1 in Waterman (1995), with results derived by Lundstrom (1990). For u, w ∈ 𝒜^k, we define, for j = 1, …, k − 1,

graphic file with name M67.gif

which is the probability that w occurs, given that u has occurred. For words u and v in 𝒜^k, the overlap indicator is defined as

β_j(u, v) = 1{(u_{j+1}, …, u_k) = (v_1, …, v_{k−j})}.

This overlap indicator equals 1 if the last k − j letters of u overlap exactly the first k − j letters of v. Then the approximating covariance matrix C is given by

graphic file with name M72.gif (5)

A similar normal approximation is valid for Inline graphic. As we assume that A and B have the same letter probability distribution, both have the same limiting covariance matrix C.
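As a quick numerical sanity check on Equation (4): because the word probabilities sum to one, the centered counts over all d^k words sum to zero, so the left-out coordinate is always recoverable from the others. A minimal sketch (uniform letter probabilities assumed for simplicity):

```python
import numpy as np
from collections import Counter

# Centered counts over ALL d^k words sum to zero because the word
# probabilities sum to one; hence the left-out coordinate in Equation (4)
# is determined by the remaining ones.
rng = np.random.default_rng(5)
n, k = 300, 2
p = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
seq = "".join(rng.choice(list(p), size=n))
counts = Counter(seq[i:i + k] for i in range(n - k + 1))
total = sum(counts[a + b] - (n - k + 1) * p[a] * p[b] for a in p for b in p)
```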

Thus, we obtain the following approximation for D2S (for the proof and the precise bounds, see online Supplementary Material at www.liebertonline.com). We use the abbreviation MVN(μ, C) to denote a multivariate normal distribution with mean vector μ and covariance matrix C. Also, a function h is called Lipschitz, with Lipschitz constant 1, if for all real x and y, |h(x) − h(y)| ≤ |x − y|.

Theorem 2.1

Assume m ≤ n. Let Z(1) = (Z(1)_w) and Z(2) = (Z(2)_w) be two independent (d^k − 1)-dimensional MVN(0, C) vectors. In analogy to Equation (4), put, for i = 1, 2,

Z(i)_{w(d,k)} = −Σ_{w ≠ w(d,k)} Z(i)_w.

Let

graphic file with name M80.gif

Then, Dlim is mean zero normally distributed, and, for any function h which is bounded and Lipschitz with Lipschitz constant 1, as n → ∞ with d = d(n) and k = k(d, n),

graphic file with name M81.gif

The bound in Theorem 2.1 may not be optimal; indeed, it is based on a multivariate normal approximation for word counts, Corollary 6.1 (see online Supplementary Material at www.liebertonline.com), which is of order Inline graphic. The purpose of the bound is to illustrate the trade-off between alphabet size, word length, and sequence length. If d, the alphabet size, is very large, then even moderately long words will be rare unless the sequence is very long.

Because of the complicated dependence, we were not able to give a closed-form expression for the variance of Dlim. Theorem 2.1, however, justifies using a z-test for the null model, based on the statistic D2S together with its estimated standard deviation.
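Such a z-test can be sketched as follows: estimate the standard deviation of D2S from simulated null pairs, then compare standardized scores with normal quantiles. The parameter values below (n = 400, k = 3, letter probabilities (0.1, 0.4, 0.4, 0.1)) are illustrative choices of ours, not the article's:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.4, 0.4, 0.1])      # nonuniform null (hypothetical values)
n, k, d = 400, 3, 4

def d2s_score(a, b):
    """D2S with letter probabilities estimated from the concatenation."""
    n_bar, m_bar = len(a) - k + 1, len(b) - k + 1
    phat = np.bincount(np.concatenate([a, b]), minlength=d) / (len(a) + len(b))
    def counts(seq, nbar):
        # encode each k-word as an integer in base d
        codes = sum(seq[i:nbar + i] * d ** (k - 1 - i) for i in range(k))
        return np.bincount(codes, minlength=d ** k).astype(float)
    pw = phat.copy()                     # p-hat_w for every word, in code order
    for _ in range(k - 1):
        pw = np.outer(pw, phat).ravel()
    xt = counts(a, n_bar) - n_bar * pw
    yt = counts(b, m_bar) - m_bar * pw
    denom = np.hypot(xt, yt)
    terms = np.divide(xt * yt, denom, out=np.zeros_like(denom), where=denom > 0)
    return terms.sum()

null = np.array([d2s_score(rng.choice(d, n, p=p), rng.choice(d, n, p=p))
                 for _ in range(500)])
sd = null.std()                          # estimated standard deviation under the null
z = np.array([d2s_score(rng.choice(d, n, p=p), rng.choice(d, n, p=p))
              for _ in range(500)]) / sd
reject_rate = np.mean(np.abs(z) > 1.96)  # should be near the nominal 5%
```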

2.2. D2* and the product of independent normals

The statistic D2* given in Equation (2) is motivated by estimating the standardized counts

X̃_w / √Var(X_w)  and  Ỹ_w / √Var(Y_w),

approximating Var(X_w) by the mean E X_w = n̄ p_w (and Var(Y_w) by m̄ p_w), with the argument that 1 − pw will be close to 1 when k is reasonably large and the word w is relatively rare. From Corollary 6.1 (see online Supplementary Material at www.liebertonline.com), we obtain a multivariate normal approximation for the standardized count vectors (X̃_w / √(n̄ p_w))_w and (Ỹ_w / √(m̄ p_w))_w. Although the covariances within each vector do not disappear, for each w, X̃_w and Ỹ_w are independent and would be approximated by independent univariate standard normal variables. From Stuart and Ord (1987), we know the distribution of the product of two independent standard normal variables [see also Springer and Thompson (1966)].

Lemma 2.1

Let X and Y be two independent standard normal random variables. Then the product W = XY has probability density

f_W(w) = (1/π) K_0(|w|),  −∞ < w < ∞,    (6)

where K_0(x) = ∫_0^∞ e^{−x cosh t} dt denotes the modified Bessel function of the third kind.

Thus, the distribution of each summand X̃_w Ỹ_w / √(n̄ p_w m̄ p_w) will approximately have density (6). The covariance structure results in a limiting distribution with no closed form, but it is easily assessed by simulation: draw many normal vectors with the covariance matrix C given in Equation (5), standardize, and take products.
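Lemma 2.1 is easy to check numerically: the density K_0(|w|)/π integrates to one, and Monte Carlo probabilities for the product of two standard normals match quadrature of the density (SciPy assumed available):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import k0

# Density of W = XY, X and Y independent standard normals (Equation (6)).
def f(w):
    return k0(abs(w)) / np.pi

# The density integrates to 1 (split at 1 to handle the log singularity at 0).
total = 2 * (quad(f, 0.0, 1.0)[0] + quad(f, 1.0, np.inf)[0])

# Monte Carlo check of P(|W| <= 1) against quadrature of the density.
rng = np.random.default_rng(2)
w = rng.standard_normal(200_000) * rng.standard_normal(200_000)
p_mc = np.mean(np.abs(w) <= 1.0)
p_quad = 2 * quad(f, 0.0, 1.0)[0]
```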

2.3. The case that all letters are equally likely

In the case that all letters are equally likely, both Lippert et al. (2002) and Kantorovitz (2007) observed that D2 will not follow a normal distribution. Kantorovitz et al. (2007a) showed that D2z is asymptotically normal, however, when first sequence length and then word length tend to infinity, while the alphabet size stays fixed. However, when the word length is fixed, D2, D2z, and D2* may not tend to normal as sequence length tends to infinity. Note that in the case that all letters are equally likely, all of D2, D2z, and D2* agree up to constants.

2.4. Simulations

To illustrate the quality of the normal approximation, we generate a pair of independent random sequences of length n under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet {a, c, g, t}. We consider two types of distributions on the letters: the uniform distribution, where each letter has probability 1/4, and a gc-rich, nonuniform distribution; the latter distribution is the same as that used in Lippert et al. (2002) and Forêt et al. (2006) to study D2. Similar to Lippert et al. (2002) and Forêt et al. (2006), for each n = 2^j × 100, where j = 0, 1, …, 8, and for each k = 1, …, 10, we compute the scores for each pair of sequences for the various statistics, where k is the word size for the count statistics. Forêt et al. (2006) found k = 7 to be the optimal tuple length for n = 800, 1600, and 3200; optimal in the sense that for this choice of k, the statistic D2 is closest to normal. All results are based on a sample size of 10,000; we use the same simulated sequences for all three scores, D2, D2S, and D2*. As D2z differs from D2 only by an additive and a multiplicative constant, we do not include D2z in these simulations.

We then use the Lilliefors test (Lilliefors, 1967) to assess whether the distributions are close to normal. The Lilliefors test is a modification of the Kolmogorov–Smirnov goodness-of-fit test in which the sample mean and standard deviation are used as the mean and standard deviation of the theoretical limiting normal distribution. In contrast to the Kolmogorov–Smirnov test, statistical significance is based on the Lilliefors distribution; see also Forêt et al. (2009) for a discussion of why an unmodified Kolmogorov–Smirnov test should not be used when the standard deviation is estimated. A p-value of less than 0.05 indicates that we would reject the null model at the 5% significance level; under the null model, in 100 tests we would expect about five p-values below 5%. p-values are reported to four decimal places, and for easier readability a value of 0.0000 is recorded simply as 0.
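A Monte Carlo version of this test is straightforward to sketch with plain NumPy: compute the Kolmogorov–Smirnov statistic with estimated mean and standard deviation, and calibrate it against the same statistic recomputed on simulated normal samples (this approximates the Lilliefors null distribution by simulation rather than by table; the function names are ours):

```python
import numpy as np
from math import erf, sqrt

def ks_stat_estimated(x):
    """Kolmogorov-Smirnov distance from a normal whose mean and standard
    deviation are estimated from the sample itself (the Lilliefors statistic)."""
    x = np.sort(np.asarray(x, dtype=float))
    z = (x - x.mean()) / x.std(ddof=1)
    cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    n = len(x)
    up = np.arange(1, n + 1) / n - cdf
    down = cdf - np.arange(0, n) / n
    return max(up.max(), down.max())

def lilliefors_pvalue(x, n_mc=500, seed=0):
    """p-value calibrated against a simulated Lilliefors null distribution."""
    rng = np.random.default_rng(seed)
    obs = ks_stat_estimated(x)
    null = [ks_stat_estimated(rng.standard_normal(len(x))) for _ in range(n_mc)]
    return float(np.mean([s >= obs for s in null]))

rng = np.random.default_rng(3)
p_normal = lilliefors_pvalue(rng.standard_normal(200))  # typically large
p_expon = lilliefors_pvalue(rng.exponential(size=200))  # clearly non-normal
```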

We will first discuss the nonuniform case, where asymptotic normality has been shown to hold for all three statistics D2, D2S, and D2* when first the sequence length and then the word length tend to infinity. For short words, only D2S has been shown to be approximately normal when the sequence length tends to infinity. The regime of interest here is that words are not too rare; for long words, occurrences become rare and a compound Poisson approximation is more appropriate (Lippert et al., 2002).

2.4.1. The nonuniform case

In the nonuniform case, for all three statistics, the larger the sequence length and the smaller k, the closer the distribution is to normality; the performance is rather different, though. Recall that the sequence length is 2^j × 100; for easier readability, we denote the 2^j × 100 column in the table just by the value of j.

Table 1 summarizes the p-values for the Lilliefors tests in the nonuniform case for D2, D2*, and D2S. For D2, Table 1 shows that even for k = 1 we would reject the hypothesis of normality at the 5% level as long as the sequence length is below 3200 bp (j = 5). For k ≥ 2, the required sequence length is around 25,600 bp (j = 8).

Table 1.

Lilliefors Tests in the Nonuniform Case

 
k       j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7     j=8

p-values for D2
1       0       0       0       0       0.0024  0.0938  0.1109  0.0807  0.3957
2       0       0       0       0       0.0023  0.0069  0.0106  0.0102  0.6252
3       0       0       0       0       0.0001  0.0010  0.0001  0.0131  0.2650
4       0       0       0       0       0       0       0.0002  0.0182  0.1027
5       0       0       0       0       0       0       0.0001  0.0063  0.0604
6       0       0       0       0       0       0       0.0001  0.0001  0.1181
7       0       0       0       0       0       0       0       0       0.0916
8       0       0       0       0       0       0       0       0       0.0523
9       0       0       0       0       0       0       0       0.0059  0.0221
10      0       0       0       0       0       0       0       0.0002  0.2982

p-values for D2*
1       0       0       0       0       0       0       0       0       0
2       0       0       0       0       0       0       0       0       0
3       0       0.0034  0.0593  0.0030  0.2612  0.0032  0.0982  0.0241  0.0210
4       0       0       0.0068  0.1399  0.0269  0.2058  0.1181  0.1160  0.4827
5       0       0       0       0       0.0219  0.7885  0.4417  0.4252  0.9629
6       0       0       0       0       0       0.0657  0.0829  0.4727  0.7407
7       0       0       0       0       0       0.0048  0.0049  0.5533  0.2175
8       0       0       0       0       0       0       0       0.0065  0.2870
9       0       0       0       0       0       0       0       0       0.0088
10      0       0       0       0       0       0       0       0       0.0002

p-values for D2S
1       0.3728  0.5997  0.5025  0.4173  0.2135  0.7626  0.6838  0.2939  0.3014
2       0.0200  0.5862  0.2341  0.6421  0.4381  0.4952  0.0056  0.0859  0.9934
3       0.0850  0.6381  0.4737  0.0885  0.2534  0.4759  0.4301  0.3755  0.7175
4       0.0122  0.6068  0.1088  0.2496  0.2317  0.8684  0.3738  0.3374  0.5795
5       0       0.1302  0.0589  0.4513  0.6518  0.0257  0.2077  0.0963  0.5495
6       0       0.0002  0.0443  0.1475  0.6168  0.7280  0.2860  0.2407  0.7960
7       0       0       0.0319  0.0003  0.0278  0.3482  0.6177  0.7590  0.6117
8       0       0       0       0       0.8755  0.0997  0.1405  0.9000  0.1395
9       0       0       0       0       0.0069  0.0026  0.5661  0.9682  0.1670
10      0       0       0       0       0       0       0.0017  0.5321  0.5256

(The sequence length is n = 2^j × 100.)

Table 1 also shows that the statistic D2* would reject the hypothesis of normality not only for large k, but also for small k with large sequence length. This nonmonotonic behavior of D2* indicates that, to declare statistical significance, the statistic should not be compared with a normal distribution.

In contrast, Table 1 displays that D2S is reasonably close to normal even for a sequence of length 200 bp when k ≤ 4; for k = 8, a sequence of length 1600 bp already looks reasonably normal. Moreover, the statistic stays close to normal with increasing sequence length and with increasing word length, and it thus displays the monotonicity that makes it safe to apply.

We repeated the simulations using the Kolmogorov–Smirnov test with the known mean and variances, based on Kantorovitz et al. (2007b), instead of the Lilliefors test, for D2. Although the Kolmogorov–Smirnov test gave slightly larger p-values, thus indicating a slightly better fit to a normal distribution, the qualitative behavior remained (data not shown).

2.4.2. The uniform case

In the uniform case, our theoretical results predict that the limiting distribution of D2 would only look normal when the sequence length is large and k is large as well, or at least in a moderate range. In contrast, D2S would still be asymptotically normal even for small k when the sequence is long.

Table 2 confirms this predicted behavior. Note that from Equation (3) we can see that, in the uniform case, both D2* and D2z are the same as D2 up to a multiplicative constant and an additive constant. Table 2 shows that the statistics D2, D2*, and D2z do not monotonically approach the normal distribution. In contrast, we find that D2S is close to normal even for sequences of length 100 bp when k = 2 or 3, and it gets closer to the normal distribution with both increasing sequence length and decreasing k.

Table 2.

Lilliefors Test in the Uniform Case

 
k       j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7     j=8

p-values for D2
1       0       0       0       0       0       0       0       0       0
2       0       0       0       0       0       0       0       0       0
3       0       0       0.0033  0.0189  0.2583  0.0274  0.0456  0.0395  0.0236
4       0       0       0       0.0329  0.4335  0.0667  0.6742  0.3441  0.5398
5       0       0       0       0       0.0204  0.2291  0.0991  0.0023  0.0789
6       0       0       0       0       0.0002  0.1659  0.0176  0.1619  0.2122
7       0       0       0       0       0       0.0037  0.0653  0.4166  0.3541
8       0       0       0       0       0       0       0       0.0189  0.0177
9       0       0       0       0       0       0       0       0.0054  0.0924
10      0       0       0       0       0       0       0       0       0.0007

p-values for D2S
1       0.0041  0.0230  0.0695  0.5939  0.0740  0.4645  0.3077  0.3840  0.6881
2       0.8228  0.8110  0.4015  0.2622  0.9471  0.4388  0.1174  0.9353  0.3598
3       0.5814  0.0457  0.7268  0.6882  0.9166  0.1910  0.6372  0.0612  0.9833
4       0.0012  0.1845  0.7225  0.4598  0.8781  0.1207  0.5530  0.1731  0.4103
5       0       0       0.1518  0.8733  0.4540  0.4149  0.3344  0.0865  0.1448
6       0       0       0       0.0986  0.6773  0.1933  0.5861  0.3467  0.5809
7       0       0       0       0       0.0144  0.1410  0.2999  0.8761  0.3339
8       0       0       0       0       0       0.0002  0.0431  0.0862  0.0880
9       0       0       0       0       0       0       0.0086  0.6396  0.4303
10      0       0       0       0       0       0       0       0.0004  0.2524

(The sequence length is n = 2^j × 100.)

3. Power Studies

In Section 2, we studied the distributions of D2, D2S, and D2* under the null model that the two sequences are independent, each consisting of i.i.d. letters from the same distribution. In this section, we study the power for detecting the relationship between the two sequences under two alternative models for their relationship.

Note that, as a result of estimating the letter probabilities from the concatenation of the two sequences, the sum X̃_w + Ỹ_w vanishes identically in the case k = 1, so that D2S and D2* degenerate. We therefore chose k ≥ 2 for a fair comparison of our statistics.

First, we generate a pair of independent random sequences of length n under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet {a, c, g, t}. We consider the same two types of distributions on the letters as earlier: the uniform distribution and a gc-rich, nonuniform distribution. For each n = 2^j × 100, where j = 0, 1, …, 8, and for each k = 2, …, 10, we compute the scores for each pair of sequences for the various statistics, where k is the word size for the count statistics. All results are based on a sample size of 10,000.

The first alternative model renders the two sequences dependent through a common motif w, which is inserted at random positions in both sequences. The second alternative model is inspired by horizontal gene transfer: we randomly choose a certain number of fragments in the first sequence and then replace the corresponding fragments (position-wise) in the second sequence by the letters from the first sequence. Again, as a consequence, the two sequences are no longer independent. In more detail, the two models are chosen as follows:

  • The “common motif” model:

    A motif of length L = 5 is chosen, say w = agcca. Next, Bernoulli random variables Zi, with P(Zi = 1) = g, are generated for i = 1, …, n − L + 1; when Zi = 1, we insert the word w in place of (A_i, …, A_{i+L−1}) in sequence 1. We avoid overlap by moving on to Z_{i+L} whenever Zi = 1. We repeat the process for sequence 2. The scores of the various statistics are then computed using the newly generated pair of sequences.

  • The “pattern transfer” model:

    We first choose L = 5 as the length of the segment to be "transferred" from sequence 1 to sequence 2. Again, Bernoulli random variables Zi, with P(Zi = 1) = g, are generated for i = 1, …, n − L + 1. When Zi = 1, we pick the L-word (A_i, …, A_{i+L−1}) in sequence 1 and replace (B_i, …, B_{i+L−1}) in sequence 2 with it. Again, we disallow overlaps. For this model, we compute the scores of all the statistics using sequence 1 and the "new" sequence 2.

The procedure described above is repeated 10,000 times, and the statistics are calculated to yield the empirical distributions of the various statistics for each triplet (k, n, g). As g values for the Bernoulli variables we chose g = 0.001, 0.005, 0.01, 0.05, and 0.1.

For each statistic, we set a type I error level of α = 0.05. Using the empirical distribution of the statistic S under the null model, we find s so that P(S ≥ s) = α. For a given g value, the power of the statistic is then estimated by the proportion of times its score under the alternative model exceeds s.
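The whole pipeline, null threshold plus power estimate under the common motif model, can be sketched as follows. We use D2* as the test statistic; the replicate count (200), n = 400, and g = 0.1 are scaled down from the article's settings for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 4, 400, 5
p = np.array([0.1, 0.4, 0.4, 0.1])       # gc-rich-style null (hypothetical values)
motif = np.array([0, 2, 1, 1, 0])        # 'agcca' with a,c,g,t -> 0,1,2,3
g, L = 0.1, 5
reps = 200

def d2star(a, b):
    """D2* (Eq. 2) with letter probabilities estimated from the concatenation."""
    n_bar = len(a) - k + 1
    phat = np.bincount(np.concatenate([a, b]), minlength=d) / (len(a) + len(b))
    pw = phat.copy()
    for _ in range(k - 1):
        pw = np.outer(pw, phat).ravel()
    def counts(seq):
        codes = sum(seq[i:len(seq) - k + 1 + i] * d ** (k - 1 - i) for i in range(k))
        return np.bincount(codes, minlength=d ** k).astype(float)
    xt = counts(a) - n_bar * pw
    yt = counts(b) - n_bar * pw
    return np.sum(xt * yt / (n_bar * pw))

def null_pair():
    return rng.choice(d, n, p=p), rng.choice(d, n, p=p)

def plant_motif(seq):
    """Common motif model: write the motif at Bernoulli(g) positions, no overlap."""
    out, i = seq.copy(), 0
    while i <= len(seq) - L:
        if rng.random() < g:
            out[i:i + L] = motif
            i += L
        else:
            i += 1
    return out

null_scores = np.array([d2star(*null_pair()) for _ in range(reps)])
threshold = np.quantile(null_scores, 0.95)      # s with P(S >= s) = 0.05
alt_scores = []
for _ in range(reps):
    a, b = null_pair()
    alt_scores.append(d2star(plant_motif(a), plant_motif(b)))
power = np.mean(np.array(alt_scores) > threshold)
```

With this many planted motifs the shared word counts dwarf the null spread, so the estimated power is close to 1.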

We now consider the power curves of D2, D2S, and D2* for both models, as well as a comparison between these statistics. For alternative model 1, Figure 1 shows that for k = 2, the power of D2 is even smaller than 0.05, the type I error. Further, k = 6 has the best power.

FIG. 1.

FIG. 1.

Alternative Model 1: Power curves for D2 under the gc-rich distribution; g = 0.01. For k = 2, the power of D2 is smaller than 0.05, the type I error (indicated by the horizontal dashed line).

Figure 2 shows that k = 4 has the greatest power for D2S under the first alternative model. For D2*, Figure 3 shows that k = 5 has the greatest power under the first alternative model, which corresponds to the length of the "common" motif that we assume relates the two sequences.

FIG. 2.

FIG. 2.

Alternative Model 1: Power curves for D2S under the gc-rich distribution; g = 0.01. Note: k = 4 has the greatest power.

FIG. 3.

FIG. 3.

Alternative Model 1: Power curves for D2* under the gc-rich distribution; g = 0.01.

Turning to a comparison of the power of our various statistics, we find that D2* has greater power than D2S for each of k = 2, 4, 5, 6, 10 (result not shown).

We note that although in the uniform case, D2, D2*, and D2z coincide up to multiplicative and additive constants, Figure 4 shows slight differences between D2z and D2*. These differences stem from using the estimated parameters instead of the true model parameters in the test statistic D2z.

FIG. 4.

FIG. 4.

Alternative Model 1: Power curves for D2, D2z, D2S, and D2* under the uniform distribution; g = 0.01, k = 5. Note: For the uniform case, D2 and D2* differ by only a constant.

Figure 5 shows a typical scenario for alternative model 1, where both D2S and D2* have greater power than D2 for given k and g, and the power increases as the length n of the sequences increases. Even for a small g, we are able to notice the difference in the power of the various statistics. We also note that here D2z has higher power than D2.

FIG. 5.

FIG. 5.

Alternative Model 1: Power curves for D2, D2z, D2S, and D2* under the gc-rich model; k = 5, g = 0.01. Note: All of D2S, D2z, and D2* have greater power than D2 for given k and g.

For alternative model 2, the picture changes. Figure 6 shows that the power of D2 is poor for k = 2, 4, 5, 6; but when increasing the parameter k to 10, far beyond the length of the tuple which we transfer, the power increases dramatically.

FIG. 6.

FIG. 6.

Alternative Model 2: Power curves for D2 under the gc-rich distribution; g = 0.05. This graph suggests that the power increases with k.

In contrast, Figure 7 shows that for D2S the power is moderate for all values of k in the plot, and it does not show a marked increase with sequence length. Using k = 10, instead of k = 6, seems to decrease the power slightly.

FIG. 7.

FIG. 7.

Alternative Model 2: Power curves for D2S under the gc-rich distribution; g = 0.05.

For D2*, Figure 8 shows that the power increases with k, and increasing sequence length slightly improves the power. For k = 10, the power approaches 1 for long sequences.

FIG. 8.

FIG. 8.

Alternative Model 2: Power curves for D2* under the gc-rich distribution; g = 0.05.

For alternative model 2, Figures 6–8 suggest that, under the gc-rich, nonuniform distribution, for D2 and D2*, the greater the k value, the greater the power, even if this comes with a higher computational cost. We note that for k fixed, D2* has greater power than D2. Moreover, D2S has smaller power than D2 for k = 10 and long sequences. Also, we need a larger g value to see the differentiation of the power between the various k values for alternative model 2. This is due to the fact that in the first alternative model, a particular motif has a large contribution to the statistics. In the second model, however, the segment transferred from sequence 1 might be similar to the corresponding segment it replaces in sequence 2; hence, a greater g value is required before the sequences show similarity.

Under the alternative model 2, we find that for k ≤ 9 and g ≤ 0.05, D2 and D2z actually show a decrease in power as n increases, in certain intervals; this is illustrated in Figure 9.

FIG. 9.

FIG. 9.

Alternative Model 2: Power curves for D2, D2z, D2S, and D2* under the gc-rich distribution when k = 5, g = 0.05. Note: For k = 5, D2 has the least power, and its power actually decreases as n increases.

For k = 10, D2 has higher power than D2S, but lower power than D2*; the gain in power over D2S comes at a great computational cost (result not shown).

Our findings suggest that D2 is not desirable as a statistic for sequence comparison. We conjecture that this is due to the fact that D2 is dominated by the normal components of the individual sequences and so is actually "measuring the sum of the departure of each sequence from the background" (Lippert et al., 2002) rather than the (dis)similarity between the two sequences. As n increases, D2 loses its power to detect the relatedness of the two sequences.

As an aside, under the uniform distribution in alternative model 2, all three statistics behave similarly for k = 5, as expected; see Figure 10.

FIG. 10.

Alternative Model 2: Power curves for D2, D2z, D2*, and D2S under the uniform distribution when k = 5, g = 0.05.

4. Using D2S to Test for Similarity

Although in our simulations D2* is more powerful than D2S, the statistic D2S is still considerably more powerful than D2. For tests that must yield very small p-values, as required for multiple testing for example, simulating the empirical distribution of the test statistic under the null hypothesis can be time consuming. In contrast to D2*, the limiting distribution of D2S is normal with mean zero; hence, testing is straightforward, and only the standard deviation needs to be estimated.

To illustrate the procedure, for fixed k and n, we generate 10,000 pairs of sequences under the gc-rich or the uniform model and compute the D2S score for each pair. The standard deviation of D2S is then estimated from these empirical scores.

Again, for fixed k and n, we generate 2000 pairs of sequences under the null model of no relationship between the two sequences, in both the gc-rich and the uniform case, and we compute the D2S score. Assuming asymptotic normality, we use a z-test of the null hypothesis of no relationship, with mean zero and the estimated standard deviation.

Then we generate 2000 pairs of sequences from alternative model 1, with motif insertion probabilities g = 0.001, 0.005, 0.01, 0.05, 0.1. The D2S statistic is computed, and we carry out a z-test based on the asymptotic normality of D2S. We repeat the procedure for the pattern transfer model, alternative model 2. We choose k = 4, 5, 6, because we know from our power simulations that D2S works best when the motif length is around 5.

We compare with the results obtained using the empirical distribution of D2S instead, where the empirical distribution function is based on 10,000 samples. In addition, we use the empirical distribution of D2S based on 100,000 samples.

Tables 3 and 4 show the estimated type 1 and type 2 error rates in the gc-rich and the uniform case; recall that the type 1 error is the probability of rejecting the null hypothesis although it is true. The type 2 error, the probability of accepting the null hypothesis although it is false, is estimated under alternative model 1 and alternative model 2, with motif insertion probability and pattern transfer probability g taking on the values g = 0.005, 0.01, 0.05, 0.1. For each n and k, the first row gives the estimates from the z-test (abbreviated as z), the second row gives the estimates from the empirical distribution function (abbreviated as e), and the third row (t) gives the estimates from the empirical distribution function based on 100,000 samples. Except for the puzzling case n = 3200 with k = 6, the results are remarkably similar, and there is no clear advantage in using the empirical distribution function when it is based on a relatively small number of samples. The general observation is that the normal approximation for D2S gives a fast method for assessing statistical significance.
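The testing pipeline above can be sketched in a few lines of Python. The sketch assumes the self-standardized form D2S = Σw X̃w Ỹw / √(X̃w² + Ỹw²), with centered counts X̃w = Xw − (n − k + 1) pw; the letter frequencies, sequence length, and number of null replicates below are illustrative choices, not the settings used in the article.

```python
import math
import random
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Occurrence counts of every k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2s(seq1, seq2, k, probs):
    """Self-standardized word-count score (assumed form):
    sum over words w of Xc_w * Yc_w / sqrt(Xc_w^2 + Yc_w^2),
    with centered counts Xc_w = X_w - (n - k + 1) * p_w."""
    x, y = kmer_counts(seq1, k), kmer_counts(seq2, k)
    n1, n2 = len(seq1) - k + 1, len(seq2) - k + 1
    score = 0.0
    for letters in product(sorted(probs), repeat=k):
        w = "".join(letters)
        pw = math.prod(probs[c] for c in letters)
        xc, yc = x.get(w, 0) - n1 * pw, y.get(w, 0) - n2 * pw
        denom = math.sqrt(xc * xc + yc * yc)
        if denom > 0:
            score += xc * yc / denom
    return score

def random_seq(rng, n, probs):
    letters, weights = zip(*sorted(probs.items()))
    return "".join(rng.choices(letters, weights=weights, k=n))

def estimate_null_sd(rng, n, k, probs, reps):
    """Standard deviation of D2S under the null of two unrelated sequences."""
    scores = [d2s(random_seq(rng, n, probs), random_seq(rng, n, probs), k, probs)
              for _ in range(reps)]
    mean = sum(scores) / reps
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (reps - 1))

rng = random.Random(0)
probs = {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15}  # hypothetical gc-rich model
sd = estimate_null_sd(rng, n=400, k=4, probs=probs, reps=100)
z = d2s(random_seq(rng, 400, probs), random_seq(rng, 400, probs), 4, probs) / sd
p_two_sided = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

In practice the null standard deviation would be estimated from far more replicates, as in the 10,000-sample runs described above.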

Table 3.

The Estimated Type 1 and 2 Error Rates When Applying the z-Test Using the Estimated Variance, Using 2000 Samples

 
 
Length  k        Type 1   Type 2: M1g1   M1g2   M2g3   M2g4
3200 4 z 0.001 0.966 0.583 0.965 0.764
    e 0.001 0.964 0.555 0.961 0.745
    t 0.000 0.971 0.609 0.972 0.784
3200 5 z 0.002 0.958 0.583 0.881 0.365
    e 0.001 0.977 0.658 0.922 0.450
    t 0.001 0.975 0.647 0.916 0.433
3200 6 z 0.388 0.260 0.020 0.100 0.001
    e 0.000 0.973 0.647 0.910 0.273
    t 0.001 0.974 0.654 0.913 0.282
6400 4 z 0.000 0.904 0.112 0.973 0.794
    e 0.000 0.907 0.114 0.976 0.800
    t 0.001 0.897 0.109 0.972 0.788
6400 5 z 0.000 0.910 0.132 0.921 0.442
    e 0.000 0.920 0.148 0.936 0.480
    t 0.001 0.914 0.140 0.927 0.463
6400 6 z 0.028 0.635 0.022 0.607 0.049
    e 0.000 0.931 0.205 0.943 0.319
    t 0.001 0.921 0.191 0.936 0.294
12,800 4 z 0.000 0.592 0.000 0.975 0.794
    e 0.001 0.545 0.000 0.963 0.752
    t 0.001 0.593 0.000 0.975 0.794
12,800 5 z 0.000 0.627 0.000 0.919 0.433
    e 0.000 0.696 0.000 0.949 0.514
    t 0.001 0.648 0.000 0.929 0.461
12,800 6 z 0.002 0.493 0.000 0.820 0.139
    e 0.000 0.697 0.002 0.940 0.304
    t 0.001 0.684 0.002 0.933 0.286
25,600 4 z 0.001 0.070 0.000 0.961 0.798
    e 0.002 0.062 0.000 0.954 0.767
    t 0.002 0.067 0.000 0.957 0.782
25,600 5 z 0.001 0.106 0.000 0.917 0.452
    e 0.000 0.121 0.000 0.929 0.484
    t 0.001 0.105 0.000 0.916 0.447
25,600 6 z 0.004 0.083 0.000 0.868 0.201
    e 0.003 0.100 0.000 0.887 0.234
    t 0.001 0.132 0.000 0.923 0.294

“M1/M2” refers to alternative model 1/2; “g1/g2/g3/g4” refers to the cases where g = 0.005, 0.01, 0.05, 0.1, respectively, where g is the parameter of the Bernoulli random variable B. So “M1g3” means “alternative model 1, g = 0.05.” The first row gives the estimates from the z-test (abbreviated as z), the second row gives the estimate from the empirical distribution function (abbreviated as e), both based on 10,000 samples. The third row, abbreviated as t, gives the estimate from the empirical distribution function based on 100,000 samples.

Table 4.

The Estimated Type 1 and 2 Error Rates When Applying the z-Test Using the Estimated Variance, Using 2000 Samples, but for the Uniform Case

 
 
Length  k        Type 1   Type 2: M1g1   M1g2   M2g3   M2g4
3200 4 z 0.001 0.978 0.574 0.961 0.746
    e 0.003 0.966 0.518 0.945 0.694
    t 0.001 0.980 0.593 0.967 0.760
3200 5 z 0.002 0.977 0.583 0.889 0.346
    e 0.001 0.986 0.639 0.919 0.404
    t 0.001 0.984 0.621 0.909 0.380
3200 6 z 0.048 0.744 0.198 0.438 0.018
    e 0.001 0.986 0.711 0.912 0.237
    t 0.001 0.985 0.678 0.900 0.218
6400 4 z 0.001 0.882 0.080 0.953 0.743
    e 0.002 0.848 0.063 0.934 0.688
    t 0.001 0.887 0.084 0.956 0.752
6400 5 z 0.001 0.888 0.107 0.891 0.333
    e 0.002 0.874 0.093 0.876 0.304
    t 0.001 0.899 0.123 0.903 0.363
6400 6 z 0.004 0.824 0.064 0.801 0.104
    e 0.002 0.897 0.113 0.883 0.177
    t 0.001 0.910 0.135 0.894 0.203
12,800 4 z 0.003 0.534 0.000 0.951 0.750
    e 0.002 0.580 0.000 0.961 0.783
    t 0.003 0.539 0.000 0.952 0.751
12,800 5 z 0.002 0.578 0.000 0.886 0.356
    e 0.002 0.585 0.000 0.887 0.365
    t 0.002 0.590 0.000 0.891 0.371
12,800 6 z 0.002 0.491 0.000 0.823 0.110
    e 0.001 0.600 0.000 0.892 0.171
    t 0.001 0.605 0.001 0.895 0.175
25,600 4 z 0.000 0.059 0.000 0.963 0.738
    e 0.002 0.050 0.000 0.949 0.706
    t 0.001 0.059 0.000 0.960 0.737
25,600 5 z 0.000 0.086 0.000 0.909 0.361
    e 0.000 0.110 0.000 0.923 0.418
    t 0.001 0.082 0.000 0.901 0.350
25,600 6 z 0.002 0.093 0.000 0.865 0.154
    e 0.002 0.114 0.000 0.889 0.183
    t 0.003 0.113 0.000 0.888 0.182

5. Phase Transition

In this section, we explore the effect of small deviations from the uniform distribution on the D2 statistic only. We restrict attention to the alphabet {a, c, g, t} and word size k = 1; both sequences are of the same length n. Again pα denotes the probability of the letter α, and Xα and Yα denote the counts of α in the first and second sequence. Then the standardized counts X̃α = (Xα − npα)/√(npα(1 − pα)) and Ỹα = (Yα − npα)/√(npα(1 − pα)) both tend to standard normal variables when n tends to infinity. With this notation, and noting that Σα √(pα(1 − pα)) X̃α = Σα √(pα(1 − pα)) Ỹα = 0 because the letter counts in each sequence sum to n, we obtain

D2 − n^2 Σα pα^2 = n Σα pα(1 − pα) X̃α Ỹα + n^(3/2) Σα pα √(pα(1 − pα)) (X̃α + Ỹα).   (7)

When the distribution of the alphabet is uniform, pα = 1/4 for all α, the second term in Equation (7) vanishes, leaving

D2 − n^2/4 = (3n/16) Σα X̃α Ỹα,

and so it is asymptotically nonnormal (in fact, a sum of products of standard normal variables).
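For k = 1 under the uniform model, the vanishing of the linear term is in fact an exact algebraic identity, not only an asymptotic statement, because the letter counts in each sequence sum to n. A quick numerical check (with illustrative random sequences):

```python
import math
import random
from collections import Counter

rng = random.Random(1)
n = 1000
alphabet = "acgt"
s1 = "".join(rng.choice(alphabet) for _ in range(n))
s2 = "".join(rng.choice(alphabet) for _ in range(n))
x, y = Counter(s1), Counter(s2)

# D2 for word size k = 1: sum over letters of X_alpha * Y_alpha.
d2 = sum(x[a] * y[a] for a in alphabet)

# Standardized letter counts under the uniform model p_alpha = 1/4.
sd = math.sqrt(n * 0.25 * 0.75)                     # sqrt(n p (1 - p))
xt = {a: (x[a] - n / 4) / sd for a in alphabet}
yt = {a: (y[a] - n / 4) / sd for a in alphabet}

# The linear term vanishes because sum_alpha (X_alpha - n/4) = 0,
# leaving only the product term:
lhs = d2 - n * n / 4
rhs = (3 * n / 16) * sum(xt[a] * yt[a] for a in alphabet)
assert abs(lhs - rhs) < 1e-6
```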

In the situation where the pα do not depend on n and are not all equal, the second term in Equation (7) dominates the first term; as

(D2 − n^2 Σα pα^2) / n^(3/2) = Σα pα √(pα(1 − pα)) (X̃α + Ỹα) + OP(n^(−1/2)),

the scaled limit is a normal distribution.

Next we assume that (pa(n), pg(n), pc(n), pt(n)) changes with n in such a way that there exists a function f(n) → 0 and constants Cl, l = a, g, c, t, satisfying Ca + Cg + Cc + Ct = 0, such that

pl(n) = 1/4 + Cl f(n)

for each letter l ∈ {a, g, c, t}. Then

D2 − n^2 Σα pα(n)^2 = n (3/16 + γ(n)) Σα X̃α Ỹα + (√3/4) n^(3/2) Θ(f(n)) Σl Cl (X̃l + Ỹl),   (8)

where γ(n) → 0 as n → ∞ and Θ(f(n)) indicates a term that has the same order as f(n).

Let f(n) = 1/n^(0.5+ε). When ε < 0, the second term in (8) will dominate and the standardized D2 will tend to a normal distribution. When ε > 0, the first term dominates and the standardized D2 will tend to be nonnormal. Thus, we expect a phase transition from normal to nonnormal as ε changes from negative to positive. Intuitively, the ratio Rn of the coefficient of the first term over the coefficient of the second term in Equation (8) can be thought of as a "ratio of dominance";

Rn = (3n/16) / ((√3/4) n^(3/2) f(n)) = (√3/4) n^ε.

Table 5 shows the decrease of Rn for increasing n (row labels) and decreasing ε (column labels; the columns list −ε, so ε itself is negative).

Table 5.

The Ratio Rn of the Coefficient of the First Term Over the Coefficient of the Second Term in Equation (8) for Different Values of n and ε

 
−ε
n            0.01    0.05    0.10    0.15    0.20
2^10 × 100   0.386   0.243   0.137   0.077   0.043
2^20 × 100   0.360   0.172   0.068   0.027   0.011
2^30 × 100   0.336   0.122   0.034   0.010   0.003
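Reading the column labels of Table 5 as −ε (so that ε is negative), the tabulated values are consistent with the closed form Rn = (√3/4) n^ε with f(n) = n^(−(0.5+ε)); under that reading, the table can be reproduced exactly:

```python
import math

def ratio(n, eps):
    """Ratio of dominance R_n = (sqrt(3)/4) * n**eps, i.e. the coefficient of
    the product term over that of the linear term when f(n) = n**-(0.5 + eps)."""
    return math.sqrt(3) / 4 * n ** eps

rows = [2**10 * 100, 2**20 * 100, 2**30 * 100]
cols = [-0.01, -0.05, -0.10, -0.15, -0.20]   # Table 5 lists |eps|
table = [[round(ratio(n, e), 3) for e in cols] for n in rows]
# table reproduces Table 5 row by row:
# [[0.386, 0.243, 0.137, 0.077, 0.043],
#  [0.36,  0.172, 0.068, 0.027, 0.011],
#  [0.336, 0.122, 0.034, 0.01,  0.003]]
```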

To run simulations in the vicinity of this phase transition, we consider two types of probability vectors on the alphabet {a, c, g, t} and f(n) = 1/n^(0.5+ε). The type I probability vector is chosen as (pa, pc, pg, pt) = (1/4 + f(n), 1/4 − f(n), 1/4, 1/4), giving (Ca, Cc, Cg, Ct) = (1, −1, 0, 0).

In the second scenario, type II, the probability vector perturbs all components, (pa, pg, pc, pt) = (1/4 + f(n), 1/4 − f(n), 1/4 + f(n), 1/4 − f(n)), so that (Ca, Cg, Cc, Ct) = (1, −1, 1, −1).

Under the type I model, we can show that the variance of Σl Cl (X̃l + Ỹl) is approximately 16/3. Thus, there exists Zn, asymptotically N(0, 1) as n tends to infinity, such that

Σl Cl (X̃l + Ỹl) = √(16/3) Zn.

Equation (8) can be rewritten as

D2 − n^2 Σα pα(n)^2 = n (3/16 + γ(n)) Σα X̃α Ỹα + n^(3/2) Θ(f(n)) Zn.   (9)

Under the type II model, the variance of Σl Cl (X̃l + Ỹl) is approximately 32/3, and

Σl Cl (X̃l + Ỹl) = √(32/3) Zn.

Hence, under the type II model, Equation (8) can be rewritten as

D2 − n^2 Σα pα(n)^2 = n (3/16 + γ(n)) Σα X̃α Ỹα + √2 n^(3/2) Θ(f(n)) Zn.   (10)

The ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is √2-fold larger than that for Equation (10). Therefore, we expect normality to appear at relatively small absolute ε values for the type II vector.

For given n and ε, we generate sequences of length n using both types of distribution vectors. The D2 scores for word size k = 1 are then tabulated. We use a Kolmogorov–Smirnov test of the hypothesis that D2 is normally distributed and obtain the corresponding p-value. Here again we use the theoretical mean and variance for the test. Table 6 gives the p-values for different values of n and ε under the two models; again we report only the first four digits.
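The one-sample Kolmogorov–Smirnov statistic against a fully specified N(0, 1), as used here with the theoretical mean and variance, is easy to sketch with the standard library (the sample below is an illustrative grid, not simulated D2 scores):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def ks_statistic(sample):
    """One-sample Kolmogorov-Smirnov statistic against N(0, 1):
    D_n = sup_x |F_n(x) - Phi(x)|, evaluated at the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = phi(x)
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d

# A roughly "normal-looking" symmetric grid gives a small D_n;
# shifting the whole sample away from mean zero inflates it.
quantiles = [-2.0, -1.5, -1.2, -1.0, -0.8, -0.6, -0.45, -0.3, -0.15, -0.05,
             0.05, 0.15, 0.3, 0.45, 0.6, 0.8, 1.0, 1.2, 1.5, 2.0]
assert ks_statistic(quantiles) < ks_statistic([q + 1.0 for q in quantiles])
```

Note that with estimated rather than theoretical moments, the Lilliefors (1967) correction would be needed instead of the plain Kolmogorov–Smirnov critical values.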

Table 6.

The p-Values of the Kolmogorov–Smirnov Test for Testing the Normality of D2 for Letter Distributions Which Are Close to Uniform; f(n) = 1/n^(0.5+ε)

 
n
ε        2^5 × 100    2^10 × 100    2^15 × 100    2^20 × 100
  Type I: (pa, pc, pg, pt) = (1/4 + f(n), 1/4 − f(n), 1/4, 1/4)
0.1 0 0 0 0
0.01 0 0 0 0
0.001 0 0 0 0
1 e−04 0 0 0 0
1 e−05 0 0 0 0
−1 e−05 0 0 0 0
−1 e−04 0 0 0 0
−0.001 0.0001 0 0.0001 0
−0.01 0 0 0.0002 0
−0.05 0 0 0 0.0088
−0.1 0.0424 0.1402 0.2754 0.6383
−0.15 0.2557 0.1027 0.2041 0.9915
−0.2 0.1198 0.6978 0.2258 0.4000
  Type II: (pa, pg, pc, pt) = (1/4 + f(n), 1/4 − f(n), 1/4 + f(n), 1/4 − f(n))
0.1 0 0 0 0
0.01 0.0005 0 0 0
0.001 0 0 0 0
1 e−04 0 0.0005 0 0
1 e−05 0 0.0015 0 0
−1 e−05 0.0049 0 0 0
−1 e−04 0.0020 0 0 0
−0.001 0 0 0 0
−0.01 0 0 0 0.0199
−0.05 0.0002 0.0069 0.0005 0.3162
−0.1 0.0866 0.0637 0.3941 0.3159
−0.15 0.5832 0.5113 0.0326 0.5910
−0.2 0.1015 0.5146 0.9437 0.4827

Table 6 indicates that under the type II model the distribution of D2 is not significantly different from normal when ε ≤ −0.05 and n = 2^20 × 100, while it is significantly different from normal when ε = −0.05 and n ≤ 2^15 × 100. On the other hand, under the type I model, the distribution of D2 is significantly different from normal even when ε = −0.05 and n = 2^20 × 100. The simulation results are consistent with our intuition. As shown in Table 5, the ratio Rn of the coefficient of the first term over that of the second term in Equation (8) is less than 0.1 if ε < −0.10 when n = 2^20 × 100. This can explain why normality of D2 appears if ε < −0.10 when n = 2^20 × 100 for both the type I and type II models. Further, as the ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is √2-fold larger than that for Equation (10), normality of D2 begins to appear already when ε < −0.05 for the type II model.

6. Discussion

The typically used statistic D2 asymptotically ignores the joint word occurrences in the two sequences unless all letters are almost equally likely; in the latter scenario, a phase transition occurs. Hence, under the normal regime, the statistic is neither robust nor informative. The main advantage of D2 is that it is easy to compute.

The proposed D2S statistic is likewise easy to compute; moreover, it can be compared with a normal distribution to assess significance, and it performs well in a power study.

The D2* statistic is more powerful than D2S in our simulation study and is also easy to compute, but its asymptotic distribution does not have a convenient form; instead, it is best assessed by simulation, which is time consuming, as the tail of the distribution needs to be estimated.

Our recommendation is to discard D2, to use D2S when computing time is limited, and ideally to use D2* for sequence comparison based on k-tuple content.

Our results allow for a number of generalizations. The normal approximation for the word counts in each individual sequence does not assume that the underlying letter distribution is the same as in the other sequence. Hence, the normal approximation for D2S also holds when the two sequences do not follow the same underlying letter distribution.

Huang (2002) gave a related normal approximation for one sequence in the more general situation that the sequence is generated by a homogeneous Markov chain. Kantorovitz et al. (2007b) already successfully adapted D2z to the Markov case. Also for D2S, the generalization of our results to that setting should be straightforward; only the error bounds would need to be adjusted. Burden et al. (2008) generalized D2z to allow for mismatches on the four-letter alphabet {a, c, g, t} under the Bernoulli model with pa = pt and pc = pg; they call the m-neighborhood of a word w of length k the set of all words that differ from w in at most m letters. The generalized statistic then counts the number of all m-neighborhood matches of all k-words between the two sequences. With our normal approximation for all word counts, D2S could be generalized similarly to allow for a certain number of mismatches. The quality of the normal approximation will depend on the number m of permitted mismatches.
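The m-neighborhood match count just described can be sketched directly from its definition; this naive version enumerates the distinct word pairs and is quadratic in the number of distinct k-words, so it is for illustration only:

```python
from collections import Counter

def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

def neighborhood_matches(seq1, seq2, k, m):
    """Count pairs of k-word occurrences (one from each sequence)
    whose words differ in at most m positions (Burden et al., 2008)."""
    x = Counter(seq1[i:i + k] for i in range(len(seq1) - k + 1))
    y = Counter(seq2[i:i + k] for i in range(len(seq2) - k + 1))
    return sum(cx * cy
               for wx, cx in x.items()
               for wy, cy in y.items()
               if hamming(wx, wy) <= m)

# With m = 0 this reduces to the exact-match count D2.
assert neighborhood_matches("aaaa", "aaca", 2, 1) == 9
```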

We also indicate that more than two sequences could be compared in a similar fashion. Quine (1994) stated the result that if X1, …, Xn and Y1, …, Yn are independent normal random variables with zero means and Var(Xi) = Var(Yi) = σi^2, then

Σi Xi Yi / √(Σi σi^2 Xi^2) ∼ N(0, 1),

where both sums run over i = 1, …, n (Melnykov and Chen, 2007). This suggests extending the D2S statistic to multiple sequence comparison by taking the products of the individual word counts and standardizing as before. Then a normal approximation is still valid.
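This self-normalization result is easy to sanity-check by simulation. We read it as: the self-normalized product Σi Xi Yi / √(Σi σi² Xi²) is standard normal when Var(Xi) = Var(Yi) = σi²; the sample sizes and variances below are illustrative:

```python
import math
import random

def self_normalized_product(rng, sds):
    """One draw of  sum_i X_i Y_i / sqrt(sum_i sigma_i^2 X_i^2),
    with X_i, Y_i independent N(0, sigma_i^2) and sigma_i = sds[i]."""
    xs = [rng.gauss(0.0, s) for s in sds]
    ys = [rng.gauss(0.0, s) for s in sds]
    num = sum(x * y for x, y in zip(xs, ys))
    den = math.sqrt(sum((s * x) ** 2 for s, x in zip(sds, xs)))
    return num / den

rng = random.Random(42)
sds = [0.5, 1.0, 2.0, 4.0]          # illustrative sigma_i values
draws = [self_normalized_product(rng, sds) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((t - mean) ** 2 for t in draws) / (len(draws) - 1)
# mean should be near 0 and variance near 1, as for a standard normal
```

Conditionally on the Xi, the numerator is normal with variance Σi σi² Xi², which is exactly the square of the denominator; this is why the ratio is standard normal regardless of the σi.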

Similarly, we could extend D2* as the sum, over all words, of the product of more than two standardized word counts. Springer and Thompson (1966) gave a formula for the density of a product of independent standard normals. Again, the covariance structure of the word counts within one sequence makes it advisable to assess the limiting distribution via simulation.

Supplementary Material

Supplemental data
Supp_Data.pdf (77.9KB, pdf)

Acknowledgments

G.R. was supported in part by EPSRC grant no. GR/R52183/01, and by BBSRC and EPSRC through OCISB. D.C. was supported by an Overseas Postdoctoral Fellowship from the National University of Singapore. F.S. was supported by NIH grant no. P50 HG 002790 and R21AG032743. M.S.W. was supported by NIH grant no. P50 HG 002790 and R21AG032743.

Disclosure Statement

No competing financial interests exist.

References

1. Burden C.J., Kantorovitz M.R., Wilson S.R. Approximate word matches between two random sequences. Ann. Appl. Probab. 2008;18:1–21.
2. Forêt S., Kantorovitz M., Burden C. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformat. 2006;7(Suppl 5):S21. doi: 10.1186/1471-2105-7-S5-S21.
3. Forêt S., Wilson S.R., Burden C.J. Empirical distribution of k-word matches in biological sequences. Pattern Recognit. 2009;42:539–548.
4. Huang H. Error bounds on multivariate normal approximations for word count statistics. Adv. Appl. Probab. 2002;34:559–586.
5. Kantorovitz M.R. An example of a stationary, triplewise independent triangular array for which the CLT fails. Statist. Probab. Lett. 2007;77:539–542.
6. Kantorovitz M.R., Booth H.S., Burden C.J., et al. Asymptotic behavior of k-word matches between two uniformly distributed sequences. J. Appl. Probab. 2007a;44:788–805.
7. Kantorovitz M.R., Robinson G.E., Sinha S. A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics. 2007b;23:i249–i255. doi: 10.1093/bioinformatics/btm211.
8. Lilliefors H.W. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Am. Stat. Assoc. 1967;62:399–402.
9. Lippert R.A., Huang H., Waterman M.S. Distributional regimes for the number of k-word matches between two random sequences. Proc. Natl. Acad. Sci. USA. 2002;99:13980–13989. doi: 10.1073/pnas.202468099.
10. Lundstrom R. Stochastic models and statistical methods for DNA sequence data [Ph.D. thesis]. Department of Mathematics, University of Utah, Salt Lake City, UT; 1990.
11. Melnykov I., Chen J.T. A connection between self-normalized products and stable laws. Stat. Probab. Lett. 2007;77:1662–1665.
12. Quine M.P. A result of Shepp. Appl. Math. Lett. 1994;7:33–34.
13. Shepp L. Problem 62-9: Normal functions of normal random variables. SIAM Rev. 1964;6:459.
14. Springer M.D., Thompson W.E. The distribution of products of independent random variables. SIAM J. Appl. Math. 1966;14:511–526.
15. Stuart A., Ord J.K. Kendall's Advanced Theory of Statistics. 5th ed. Vol. 1. Griffin, London; 1987.
16. Waterman M.S. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall/CRC, Boca Raton, FL; 1995.


Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.
