Journal of Computational Biology. 2009 Dec;16(12):1615–1634. doi: 10.1089/cmb.2009.0198

Alignment-Free Sequence Comparison (I): Statistics and Power

Gesine Reinert 1, David Chew 2, Fengzhu Sun 3,4, Michael S. Waterman 3,4
PMCID: PMC2818754  NIHMSID: NIHMS164900  PMID: 20001252

Abstract

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on comparing the k-tuple content of both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call D2S and D2*. For D2S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed as the sequence lengths tend to infinity, and is not dominated by the noise in the individual sequences. The second statistic, D2*, outperforms D2S in terms of power for detecting the relatedness of the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D2*, we cannot provide a closed form for power calculations.

Key words: alignment-free, normal approximation, normal distribution, sequence alignment, word count statistics

1. Introduction

Comparison of the similarities between two segments of biological sequences using k-tuples (also called k-grams or k-words) arises from the need for rapid sequence comparison. Such methods are often employed in cDNA sequence comparisons. Today, next-generation sequencing methods are producing unprecedented volumes of sequence data, so we expect the use of k-tuples to play an increasingly important role in molecular sequence and genome comparison. This article explores in some detail the statistic on which one of these methods is based, along with other, substantially superior statistics.

One of the most widely used statistics for sequence comparison based on k-tuples is the so-called D2 statistic, which is based on the joint k-tuple content in the two sequences. If two sequences are closely related, we would expect the k-tuple content of both sequences to be very similar.

More formally, suppose that two sequences, A = A1 A2 ⋯ An and B = B1 B2 ⋯ Bm, say, are composed of letters that are drawn from a finite alphabet 𝒜 of size d. For a ∈ 𝒜, let pa denote the probability of letter a. For w = (w1, …, wk) ∈ 𝒜^k, let

X_w = Σ_{i=1}^{n−k+1} 1{(A_i, A_{i+1}, …, A_{i+k−1}) = w}

count the number of occurrences of w in A, and similarly, Yw counts the number of occurrences of w in B. Here, n̄ = n − k + 1; similarly, we put for later use m̄ = m − k + 1. Then D2 is defined by

D2 = Σ_{w ∈ 𝒜^k} X_w Y_w.
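To make the definition concrete, here is a minimal Python sketch of the k-tuple counts X_w and Y_w and the resulting D2 score (the function names are ours, not from the article):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count occurrences X_w of every k-tuple w in seq (n - k + 1 windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """D2 = sum over words w of X_w * Y_w."""
    x, y = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(x[w] * y[w] for w in x if w in y)

score = d2("acgtacgt", "ttacgtaa", 2)  # -> 8 for these toy sequences
```

Only words occurring in both sequences contribute, so iterating over the keys of one counter suffices.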

The null model is typically chosen to be such that the letters are independent and identically distributed (i.i.d.), and that the two sequences are independent. Using this model, Lippert et al. (2002) derived a Poisson approximation, a compound Poisson approximation, and a normal approximation for D2; the normal approximation is only valid under the assumption that not all letters of the alphabet are equally likely. In the case that all letters are equally likely, the D2 statistic looks asymptotically like the sum of products of independent normal variables. Lippert et al. (2002) also found that the D2 statistic is dominated by background noise, in the nonuniform case.

In the work of Kantorovitz et al. (2007a), it was shown that in the regime that all letters are equally likely, the standardized statistic

D2z = (D2 − E(D2)) / √Var(D2)

is asymptotically normally distributed when first the sequence length and then the word length tend to infinity, while the alphabet size stays fixed. In clustering biologically related sequences, Kantorovitz et al. (2007b) found that D2z outperforms D2. The heuristic argument is that the background models for the two sequences may differ, and the D2 statistic should hence be normalized to account for the different background distributions of the sequences. Yet in the nonuniform case the issue remains that the variability is dominated by the noise in the single sequences.

In this article, we propose a new statistic, which is a self-standardized version of D2. In general, Shepp (1964) observed that, if X and Y are independent mean zero normals, X with variance σ² and Y with variance τ², then XY/√(X² + Y²) is again normal, with variance σ²τ²/(σ + τ)². For w = (w1, …, wk) ∈ 𝒜^k, p_w = p_{w1} ⋯ p_{wk} is the probability of occurrence of w, and the centered count variables are denoted as

X̃_w = X_w − n̄ p_w  and  Ỹ_w = Y_w − m̄ p_w.

We introduce the new count statistic as

D2S = Σ_{w ∈ 𝒜^k} X̃_w Ỹ_w / √(X̃_w² + Ỹ_w²)    (1)

Here we set 0/0 := 0. The superscript "S" stands for "Shepp," and also for "self-standardized." We shall see that, under reasonable assumptions, D2S is approximately normally distributed.

In practice, we shall usually have to replace pa, the (unobserved) letter probabilities, by p̂a, the relative count of letter a in the concatenation of the two sequences, based on the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution. We then estimate the probability of occurrence of w = (w1, …, wk) by p̂_w = p̂_{w1} ⋯ p̂_{wk}. In our simulations, we always estimate the letter probabilities, even when we assume that all letters are equally likely.
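A small sketch of this estimation step (helper names are ours):

```python
from collections import Counter

def estimate_letter_probs(seq_a, seq_b):
    """p-hat_a: relative frequency of letter a in the concatenation of both sequences."""
    joined = seq_a + seq_b
    counts = Counter(joined)
    return {a: c / len(joined) for a, c in counts.items()}

def word_prob(word, probs):
    """p-hat_w = product of the estimated letter probabilities (i.i.d. null model)."""
    p = 1.0
    for a in word:
        p *= probs[a]
    return p

probs = estimate_letter_probs("acgt", "acgg")
pw = word_prob("ag", probs)  # 0.25 * 0.375 = 0.09375
```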

We also study the following version of the word count statistic:

D2* = Σ_{w ∈ 𝒜^k} X̃_w Ỹ_w / √(n̄ p_w m̄ p_w)    (2)

which in our simulations outperforms not only D2 but also D2S, in terms of power for detecting the relatedness of the two sequences. This statistic comes about by considering the fully standardized sum Σ_w X̃_w Ỹ_w / (√Var(X_w) √Var(Y_w)); but as the variance is costly to compute, it is replaced by the estimated mean word count n̄ p_w (respectively m̄ p_w), which is justified when the probability of the word pattern is small, by a Poisson approximation for the individual word counts. We justify in Section 2 that D2* can be viewed as the sum of products of independent normal variables, and we suggest how to simulate from its asymptotic distribution, for which we do not have a closed-form expression.
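Equations (1) and (2) can be computed directly from the centered counts; the following sketch (our own helper, with letter probabilities estimated from the concatenation as described above) implements both statistics:

```python
import math
from collections import Counter
from itertools import product

def d2s_and_d2star(seq_a, seq_b, k):
    """Compute D2S (Eq. 1) and D2* (Eq. 2), estimating letter probabilities
    from the concatenation of the two sequences."""
    n_bar = len(seq_a) - k + 1
    m_bar = len(seq_b) - k + 1
    x = Counter(seq_a[i:i + k] for i in range(n_bar))
    y = Counter(seq_b[i:i + k] for i in range(m_bar))
    joined = seq_a + seq_b
    letters = Counter(joined)
    probs = {a: c / len(joined) for a, c in letters.items()}
    d2s = d2star = 0.0
    for w in map("".join, product(sorted(probs), repeat=k)):
        pw = math.prod(probs[a] for a in w)
        xt = x[w] - n_bar * pw          # centered count X~_w
        yt = y[w] - m_bar * pw          # centered count Y~_w
        denom = math.hypot(xt, yt)
        if denom > 0:                   # convention 0/0 = 0
            d2s += xt * yt / denom
        d2star += xt * yt / (math.sqrt(n_bar * m_bar) * pw)
    return d2s, d2star

d2s_val, d2star_val = d2s_and_d2star("acgtacgtggca", "ttacgcgtaacg", 3)
```

For two identical sequences every summand is nonnegative, so both scores are then nonnegative, a simple sanity check.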

To explain the problem with D2, write

D2 = Σ_w X̃_w Ỹ_w + m̄ Σ_w p_w X̃_w + n̄ Σ_w p_w Ỹ_w + n̄ m̄ Σ_w p_w².    (3)

Approximately, if n and m are large, under the null model, Σ_w p_w X̃_w should follow a mean zero normal distribution with variance of order O(n) in the sequence length. The distribution of Σ_w X̃_w Ỹ_w should be approximately, for large n and m, that of a sum of products of pairs of independent mean zero normal variables, with variance of order O(nm).

If all letters are equally likely, then p_w = d^{−k} for all words w, and hence Σ_w p_w X̃_w = d^{−k} Σ_w X̃_w = 0, and similarly Σ_w p_w Ỹ_w = 0, giving that D2 = Σ_w X̃_w Ỹ_w + n̄ m̄ d^{−k}. The variability in D2 is then the same as the variability in Σ_w X̃_w Ỹ_w, and indeed as in D2z.

When not all letters are equally likely, as the variance of Xw is of order n and the variance of Yw is of order m, the variance of m̄ Σ_w p_w X̃_w is of order O(nm²), and similarly the variance of n̄ Σ_w p_w Ỹ_w is of order O(n²m). Hence, the variability in D2 is dominated by the variability in Σ_w p_w X̃_w and Σ_w p_w Ỹ_w. Thus, in this case, the variability in D2 is dominated by the terms that reflect the noise in the single sequences only.
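This domination is easy to see in simulation. The sketch below compares the empirical variance of the cross term Σ_w X̃_w Ỹ_w with that of the single-sequence term m̄ Σ_w p_w X̃_w from decomposition (3); the nonuniform letter probabilities (0.1, 0.4, 0.4, 0.1), k = 2, and n = m = 500 are illustrative choices of ours, not the article's:

```python
import numpy as np

rng = np.random.default_rng(0)
letters, p = np.arange(4), np.array([0.1, 0.4, 0.4, 0.1])   # nonuniform letters
n = m = 500
k = 2
reps = 2000

def centered_counts(seq):
    """X~_w = X_w - (n - k + 1) p_w for all 16 2-words, encoded base 4."""
    words = seq[:-1] * 4 + seq[1:]
    counts = np.bincount(words, minlength=16).astype(float)
    pw = np.outer(p, p).ravel()
    return counts - (len(seq) - k + 1) * pw, pw

cross, single = [], []
for _ in range(reps):
    a = rng.choice(letters, size=n, p=p)
    b = rng.choice(letters, size=m, p=p)
    xt, pw = centered_counts(a)
    yt, _ = centered_counts(b)
    cross.append(np.dot(xt, yt))                 # sum_w X~_w Y~_w, Var = O(nm)
    single.append((m - k + 1) * np.dot(pw, xt))  # m-bar * sum_w p_w X~_w, Var = O(nm^2)

var_cross, var_single = np.var(cross), np.var(single)
```

With these parameters the single-sequence term's variance exceeds the cross term's by an order of magnitude, illustrating the O(nm²) versus O(nm) scaling.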

The asymptotic normality of D2 for both nucleic acid and amino acid sequences has been studied empirically by Forêt et al. (2009). However, no power study was undertaken; our argument shows that in the nonuniform case the asymptotic normality of D2 only stems from the asymptotic normality of the underlying word counts in the respective sequences.

Even in the regime that all letters are equally likely, if we only leave the "last" word w(d, k) = (d, d, …, d) out, forming the statistic D̃2 = Σ_{w ≠ w(d,k)} X_w Y_w, then Σ_{w ≠ w(d,k)} p_w X̃_w = −d^{−k} X̃_{w(d,k)}, which is not constant. So even if we just leave one word out of the whole set of possible words, if the sequence lengths are large and all other parameters are fixed, then the variability, now in D̃2, will be dominated by the variability in the single sequences. Hence, the D2 statistic is, in general, not useful for assessing whether the two underlying sequences are related.

This article is structured as follows. In Section 2, we discuss the distributions of D2S and D2* under the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution, and we present simulation results for testing the normality of D2, D2S, and D2*.

In Section 3, we study the power of the statistics D2, D2S, D2*, and D2z under two alternative scenarios. The first scenario is that the two sequences contain a common motif, whereas the second scenario is a pattern transfer model; we pick a word in the first sequence and use it to replace a word in the second sequence. Our results illustrate not only the poor performance of D2, but also the encouraging performances of D2S and of D2*.

Section 4 illustrates how the asymptotic normality of D2S gives a fast method for assessing statistical significance, as only its standard deviation has to be approximated and not the empirical distribution itself.

There is a caveat: if the distribution on the alphabet is very close to uniform, and if some other conditions are satisfied, which relate to having a large number of summands of products of pairs of independent normals, then D2 will behave like the sum of products of normally distributed variables, similar to the uniform case; when the deviation from uniformity increases, the asymptotic normality for D2 takes over. This phase transition is explored in Section 5.

We summarize our results in Section 6, and we briefly indicate generalizations to Markov chain models as well as to multiple sequence comparisons.

The proofs for Section 2 are presented in the Supplementary Material (see online supplementary material at www.liebertonline.com). The code for simulating from the distributions is available at www-rcf.usc.edu/∼fsun/Programs/D2/d2-all.html

2. The Distributions of D2S and D2* Under the Null Model

Here the null model is that the letters are i.i.d. and the two sequences are independent. We assume the same conditions on the word length k and the sequence length n as Huang (2002). For D2, Forêt et al. (2009) studied the empirical distribution via simulations, and they found that a gamma distribution outperforms the normal distribution in general. For longer sequences they found that the normal approximation itself is appropriate.

2.1. D2S and asymptotic normality

First, we focus on the word counts in a single sequence. Let

X̃ = (X̃_w, w ∈ 𝒜^k, w ≠ w(d, k)), where w(d, k) = (d, d, …, d),

be the vector of centered word counts with the last word w(d, k) left out; note that the left-out coordinate

X̃_{w(d,k)} = −Σ_{w ≠ w(d,k)} X̃_w    (4)

can be recovered from the set {X̃_w : w ≠ w(d, k)}, since Σ_w p_w = 1 implies Σ_{w ∈ 𝒜^k} X̃_w = 0. Huang (2002) showed a multivariate normal approximation for the word count vector in a single sequence. The limiting covariance matrix C needs some notation; see Section 12.1 in Waterman (1995), with results derived by Lundstrom (1990). For u, w ∈ 𝒜^k, we define, for j = 1, …, k − 1,

graphic file with name M67.gif

which is the probability that w occurs, given that u has occurred. For words u and v in 𝒜^k, the overlap indicator is defined as

β_j(u, v) = 1{(u_{j+1}, …, u_k) = (v_1, …, v_{k−j})}.

This overlap indicator equals 1 if the last k − j letters of u overlap exactly the first k − j letters of v. Then the approximating covariance matrix C is given by

graphic file with name M72.gif (5)

A similar normal approximation is valid for Inline graphic. As we assume that A and B have the same letter probability distribution, both have the same limiting covariance matrix C.
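As a quick numerical sanity check on Equation (4): because the word probabilities sum to one, the centered counts over all d^k words sum to zero, so the left-out coordinate is always recoverable from the others. A minimal sketch (uniform letter probabilities assumed for simplicity):

```python
import numpy as np
from collections import Counter

# Centered counts over ALL d^k words sum to zero because the word
# probabilities sum to one; hence the left-out coordinate in Equation (4)
# is determined by the remaining ones.
rng = np.random.default_rng(5)
n, k = 300, 2
p = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
seq = "".join(rng.choice(list(p), size=n))
counts = Counter(seq[i:i + k] for i in range(n - k + 1))
total = sum(counts[a + b] - (n - k + 1) * p[a] * p[b] for a in p for b in p)
```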

Thus, we obtain the following approximation for D2S (for the proof and the precise bounds, see online Supplementary Material at www.liebertonline.com). We use the abbreviation MVN(μ, C) to denote a multivariate normal distribution with mean vector μ and covariance matrix C. Also, a function h is called Lipschitz, with Lipschitz constant 1, if for all real x and y, |h(x) − h(y)| ≤ |x − y|.

Theorem 2.1

Assume m ≤ n. Let Z(1) = (Z(1)_w) and Z(2) = (Z(2)_w) be two independent (d^k − 1)-dimensional MVN(0, C) vectors. In analogy to Equation (4), put, for i = 1, 2,

Z(i)_{w(d,k)} = −Σ_{w ≠ w(d,k)} Z(i)_w.

Let

graphic file with name M80.gif

Then, Dlim is mean zero normally distributed, and, for any function h which is bounded and Lipschitz with Lipschitz constant 1, as n → ∞ with d = d(n) and k = k(d, n),

graphic file with name M81.gif

The bound in Theorem 2.1 may not be optimal; indeed, it is based on a multivariate normal approximation for word counts, Corollary 6.1 (see online Supplementary Material at www.liebertonline.com), which is of order Inline graphic. The purpose of the bound is to illustrate the trade-off between alphabet size, word length, and sequence length. If d, the alphabet size, is very large, then even moderately long words will be rare unless the sequence is very long.

Because of the complicated dependence, we were not able to give a closed-form expression for the variance of Dlim. Theorem 2.1, however, justifies using a z-test for the null model, based on the statistic D2S together with its estimated standard deviation.
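Such a z-test can be sketched as follows: estimate the standard deviation of D2S from simulated null pairs, then compare standardized scores with normal quantiles. The parameter values below (n = 400, k = 3, letter probabilities (0.1, 0.4, 0.4, 0.1)) are illustrative choices of ours, not the article's:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.4, 0.4, 0.1])      # nonuniform null (hypothetical values)
n, k, d = 400, 3, 4

def d2s_score(a, b):
    """D2S with letter probabilities estimated from the concatenation."""
    n_bar, m_bar = len(a) - k + 1, len(b) - k + 1
    phat = np.bincount(np.concatenate([a, b]), minlength=d) / (len(a) + len(b))
    def counts(seq, nbar):
        # encode each k-word as an integer in base d
        codes = sum(seq[i:nbar + i] * d ** (k - 1 - i) for i in range(k))
        return np.bincount(codes, minlength=d ** k).astype(float)
    pw = phat.copy()                     # p-hat_w for every word, in code order
    for _ in range(k - 1):
        pw = np.outer(pw, phat).ravel()
    xt = counts(a, n_bar) - n_bar * pw
    yt = counts(b, m_bar) - m_bar * pw
    denom = np.hypot(xt, yt)
    terms = np.divide(xt * yt, denom, out=np.zeros_like(denom), where=denom > 0)
    return terms.sum()

null = np.array([d2s_score(rng.choice(d, n, p=p), rng.choice(d, n, p=p))
                 for _ in range(500)])
sd = null.std()                          # estimated standard deviation under the null
z = np.array([d2s_score(rng.choice(d, n, p=p), rng.choice(d, n, p=p))
              for _ in range(500)]) / sd
reject_rate = np.mean(np.abs(z) > 1.96)  # should be near the nominal 5%
```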

2.2. D2* and the product of independent normals

The statistic D2* given in Equation (2) is motivated by estimating the standardized counts

X̃_w / √Var(X_w)  and  Ỹ_w / √Var(Y_w),

approximating Var(X_w) by the mean E X_w = n̄ p_w (and Var(Y_w) by m̄ p_w), with the argument that 1 − pw will be close to 1 when k is reasonably large and the word w is relatively rare. From Corollary 6.1 (see online Supplementary Material at www.liebertonline.com), we obtain a multivariate normal approximation for the standardized count vectors (X̃_w / √(n̄ p_w))_w and (Ỹ_w / √(m̄ p_w))_w. Although the covariances within each vector do not disappear, for each w, X̃_w and Ỹ_w are independent and would be approximated by independent univariate standard normal variables. From Stuart and Ord (1987), we know the distribution of the product of two independent standard normal variables [see also Springer and Thompson (1966)].

Lemma 2.1

Let X and Y be two independent standard normal random variables. Then the product W = XY has probability density

f_W(w) = (1/π) K_0(|w|),  −∞ < w < ∞,    (6)

where K_0(x) = ∫_0^∞ e^{−x cosh t} dt denotes the modified Bessel function of the third kind.

Thus, the distribution of each summand X̃_w Ỹ_w / √(n̄ p_w m̄ p_w) will approximately have density (6). The covariance structure results in a limiting distribution with no closed form, but it is easily assessed by simulation: draw many normal vectors with the covariance matrix C given in Equation (5), standardize, and take products.
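Lemma 2.1 is easy to check numerically: the density K_0(|w|)/π integrates to one, and Monte Carlo probabilities for the product of two standard normals match quadrature of the density (SciPy assumed available):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import k0

# Density of W = XY, X and Y independent standard normals (Equation (6)).
def f(w):
    return k0(abs(w)) / np.pi

# The density integrates to 1 (split at 1 to handle the log singularity at 0).
total = 2 * (quad(f, 0.0, 1.0)[0] + quad(f, 1.0, np.inf)[0])

# Monte Carlo check of P(|W| <= 1) against quadrature of the density.
rng = np.random.default_rng(2)
w = rng.standard_normal(200_000) * rng.standard_normal(200_000)
p_mc = np.mean(np.abs(w) <= 1.0)
p_quad = 2 * quad(f, 0.0, 1.0)[0]
```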

2.3. The case that all letters are equally likely

In the case that all letters are equally likely, both Lippert et al. (2002) and Kantorovitz (2007) observed that D2 will not follow a normal distribution. Kantorovitz et al. (2007a) showed that D2z is asymptotically normal, however, when first sequence length and then word length tend to infinity, while the alphabet size stays fixed. However, when the word length is fixed, D2, D2z, and D2* may not tend to normal as sequence length tends to infinity. Note that in the case that all letters are equally likely, all of D2, D2z, and D2* agree up to constants.

2.4. Simulations

To illustrate the quality of the normal approximation, we generate a pair of independent random sequences of length n under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet {a, c, g, t}. We consider two types of distributions on the letters: the uniform distribution, where each letter has probability 1/4, and a gc-rich, nonuniform distribution; the latter distribution is the same as that used in Lippert et al. (2002) and Forêt et al. (2006) to study D2. Similar to Lippert et al. (2002) and Forêt et al. (2006), for each n = 2^j × 100, where j = 0, 1, …, 8, and for each k = 1, …, 10, we compute the scores for each pair of sequences for the various statistics, where k is the word size for the count statistics. Forêt et al. (2006) found k = 7 to be the optimal tuple length for n = 800, 1600, and 3200; optimal in the sense that for this choice of k, the statistic D2 is closest to normal. All results are based on a sample size of 10,000; we use the same simulated sequences for all three scores, D2, D2S, and D2*. As D2z differs from D2 only by an additive and a multiplicative constant, we do not include D2z in these simulations.

We then use the Lilliefors test (Lilliefors, 1967) to assess whether the distributions are close to normal. The Lilliefors test is a modification of the Kolmogorov–Smirnov goodness-of-fit test in which the sample mean and standard deviation are used as the mean and standard deviation of the theoretical limiting normal distribution. In contrast to the Kolmogorov–Smirnov test, statistical significance is based on the Lilliefors distribution; see also Forêt et al. (2009) for a discussion of why an unmodified Kolmogorov–Smirnov test should not be used when the standard deviation is estimated. A p-value of less than 0.05 indicates that we would reject the null model at the 5% significance level; under the null model, in 100 tests we would expect about five p-values below 5%. p-values are reported to four decimal places, and for easier readability a value of 0.0000 is recorded simply as 0.
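A Monte Carlo version of this test is straightforward to sketch with plain NumPy: compute the Kolmogorov–Smirnov statistic with estimated mean and standard deviation, and calibrate it against the same statistic recomputed on simulated normal samples (this approximates the Lilliefors null distribution by simulation rather than by table; the function names are ours):

```python
import numpy as np
from math import erf, sqrt

def ks_stat_estimated(x):
    """Kolmogorov-Smirnov distance from a normal whose mean and standard
    deviation are estimated from the sample itself (the Lilliefors statistic)."""
    x = np.sort(np.asarray(x, dtype=float))
    z = (x - x.mean()) / x.std(ddof=1)
    cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    n = len(x)
    up = np.arange(1, n + 1) / n - cdf
    down = cdf - np.arange(0, n) / n
    return max(up.max(), down.max())

def lilliefors_pvalue(x, n_mc=500, seed=0):
    """p-value calibrated against a simulated Lilliefors null distribution."""
    rng = np.random.default_rng(seed)
    obs = ks_stat_estimated(x)
    null = [ks_stat_estimated(rng.standard_normal(len(x))) for _ in range(n_mc)]
    return float(np.mean([s >= obs for s in null]))

rng = np.random.default_rng(3)
p_normal = lilliefors_pvalue(rng.standard_normal(200))  # typically large
p_expon = lilliefors_pvalue(rng.exponential(size=200))  # clearly non-normal
```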

We will first discuss the nonuniform case, where asymptotic normality has been shown to hold for all three statistics D2, D2S, and D2* when first the sequence length and then the word length tend to infinity. For short words, only D2S has been shown to be approximately normal when the sequence length tends to infinity. The regime of interest here is that words are not too rare; for long words, occurrences become rare and a compound Poisson approximation is more appropriate (Lippert et al., 2002).

2.4.1. The nonuniform case

In the nonuniform case, for all three statistics, the larger the sequence length and the smaller k, the closer the distribution is to normality; the performance is rather different, though. Recall that the sequence length is 2^j × 100; for easier readability, we denote the 2^j × 100 column in the table just by the value of j.

Table 1 summarizes the p-values for the Lilliefors tests in the nonuniform case for D2, D2*, and D2S. For D2, Table 1 shows that even for k = 1 we would reject the hypothesis of normality at the 5% level as long as the sequence length is below 3200 bp (j = 5). For k ≥ 2, the required sequence length is around 25,600 bp (j = 8).

Table 1.

Lilliefors Tests in the Nonuniform Case

 
k       j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7     j=8

p-values for D2
1       0       0       0       0       0.0024  0.0938  0.1109  0.0807  0.3957
2       0       0       0       0       0.0023  0.0069  0.0106  0.0102  0.6252
3       0       0       0       0       0.0001  0.0010  0.0001  0.0131  0.2650
4       0       0       0       0       0       0       0.0002  0.0182  0.1027
5       0       0       0       0       0       0       0.0001  0.0063  0.0604
6       0       0       0       0       0       0       0.0001  0.0001  0.1181
7       0       0       0       0       0       0       0       0       0.0916
8       0       0       0       0       0       0       0       0       0.0523
9       0       0       0       0       0       0       0       0.0059  0.0221
10      0       0       0       0       0       0       0       0.0002  0.2982

p-values for D2*
1       0       0       0       0       0       0       0       0       0
2       0       0       0       0       0       0       0       0       0
3       0       0.0034  0.0593  0.0030  0.2612  0.0032  0.0982  0.0241  0.0210
4       0       0       0.0068  0.1399  0.0269  0.2058  0.1181  0.1160  0.4827
5       0       0       0       0       0.0219  0.7885  0.4417  0.4252  0.9629
6       0       0       0       0       0       0.0657  0.0829  0.4727  0.7407
7       0       0       0       0       0       0.0048  0.0049  0.5533  0.2175
8       0       0       0       0       0       0       0       0.0065  0.2870
9       0       0       0       0       0       0       0       0       0.0088
10      0       0       0       0       0       0       0       0       0.0002

p-values for D2S
1       0.3728  0.5997  0.5025  0.4173  0.2135  0.7626  0.6838  0.2939  0.3014
2       0.0200  0.5862  0.2341  0.6421  0.4381  0.4952  0.0056  0.0859  0.9934
3       0.0850  0.6381  0.4737  0.0885  0.2534  0.4759  0.4301  0.3755  0.7175
4       0.0122  0.6068  0.1088  0.2496  0.2317  0.8684  0.3738  0.3374  0.5795
5       0       0.1302  0.0589  0.4513  0.6518  0.0257  0.2077  0.0963  0.5495
6       0       0.0002  0.0443  0.1475  0.6168  0.7280  0.2860  0.2407  0.7960
7       0       0       0.0319  0.0003  0.0278  0.3482  0.6177  0.7590  0.6117
8       0       0       0       0       0.8755  0.0997  0.1405  0.9000  0.1395
9       0       0       0       0       0.0069  0.0026  0.5661  0.9682  0.1670
10      0       0       0       0       0       0       0.0017  0.5321  0.5256

(The sequence length is n = 2^j × 100.)

Table 1 also shows that the statistic D2* would reject the hypothesis of normality not only for large k, but also for small k with large sequence length. This nonmonotonic behavior of D2* indicates that, to declare statistical significance, the statistic should not be compared with a normal distribution.

In contrast, Table 1 displays that D2S is reasonably close to normal even for a sequence of length 200 bp when k ≤ 4; for k = 8, a sequence of length 1600 bp already looks reasonably normal. Moreover, the statistic stays close to normal with increasing sequence length and with increasing word length, and it thus displays the monotonicity that makes it safe to apply.

We repeated the simulations using the Kolmogorov–Smirnov test with the known mean and variances, based on Kantorovitz et al. (2007b), instead of the Lilliefors test, for D2. Although the Kolmogorov–Smirnov test gave slightly larger p-values, thus indicating a slightly better fit to a normal distribution, the qualitative behavior remained (data not shown).

2.4.2. The uniform case

In the uniform case, our theoretical results predict that the limiting distribution of D2 would only look normal when the sequence length is large and k is large as well, or at least in a moderate range. In contrast, D2S would still be asymptotically normal even for small k when the sequence is long.

Table 2 confirms this predicted behavior. Note that from Equation (3) we can see that, in the uniform case, both D2* and D2z are the same as D2 up to a multiplicative constant and an additive constant. Table 2 shows that the statistics D2, D2*, and D2z do not monotonically approach the normal distribution. In contrast, we find that D2S is close to normal even for sequences of length 100 bp when k = 2 or 3, and it gets closer to the normal distribution with both increasing sequence length and decreasing k.

Table 2.

Lilliefors Test in the Uniform Case

 
k       j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7     j=8

p-values for D2
1       0       0       0       0       0       0       0       0       0
2       0       0       0       0       0       0       0       0       0
3       0       0       0.0033  0.0189  0.2583  0.0274  0.0456  0.0395  0.0236
4       0       0       0       0.0329  0.4335  0.0667  0.6742  0.3441  0.5398
5       0       0       0       0       0.0204  0.2291  0.0991  0.0023  0.0789
6       0       0       0       0       0.0002  0.1659  0.0176  0.1619  0.2122
7       0       0       0       0       0       0.0037  0.0653  0.4166  0.3541
8       0       0       0       0       0       0       0       0.0189  0.0177
9       0       0       0       0       0       0       0       0.0054  0.0924
10      0       0       0       0       0       0       0       0       0.0007

p-values for D2S
1       0.0041  0.0230  0.0695  0.5939  0.0740  0.4645  0.3077  0.3840  0.6881
2       0.8228  0.8110  0.4015  0.2622  0.9471  0.4388  0.1174  0.9353  0.3598
3       0.5814  0.0457  0.7268  0.6882  0.9166  0.1910  0.6372  0.0612  0.9833
4       0.0012  0.1845  0.7225  0.4598  0.8781  0.1207  0.5530  0.1731  0.4103
5       0       0       0.1518  0.8733  0.4540  0.4149  0.3344  0.0865  0.1448
6       0       0       0       0.0986  0.6773  0.1933  0.5861  0.3467  0.5809
7       0       0       0       0       0.0144  0.1410  0.2999  0.8761  0.3339
8       0       0       0       0       0       0.0002  0.0431  0.0862  0.0880
9       0       0       0       0       0       0       0.0086  0.6396  0.4303
10      0       0       0       0       0       0       0       0.0004  0.2524

(The sequence length is n = 2^j × 100.)

3. Power Studies

In Section 2, we studied the distributions of D2, D2S, and D2* under the null model that the two sequences are independent, each consisting of i.i.d. letters from the same distribution. In this section, we study the power for detecting the relationship between the two sequences under two alternative models for their relationship.

Note that, as a result of estimating the letter probabilities from the concatenation of the two sequences, the sum X̃_w + Ỹ_w vanishes identically in the case k = 1, so that D2S and D2* degenerate. We therefore chose k ≥ 2 for a fair comparison of our statistics.

First, we generate a pair of independent random sequences of length n under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet {a, c, g, t}. We consider the same two types of distributions on the letters as earlier: the uniform distribution and a gc-rich, nonuniform distribution. For each n = 2^j × 100, where j = 0, 1, …, 8, and for each k = 2, …, 10, we compute the scores for each pair of sequences for the various statistics, where k is the word size for the count statistics. All results are based on a sample size of 10,000.

The first alternative model renders the two sequences dependent through a common motif w, which is inserted at random positions in both sequences. The second alternative model is inspired by horizontal gene transfer: we randomly choose a certain number of fragments in the first sequence and then replace the corresponding fragments (position-wise) in the second sequence by the letters from the first sequence. Again, as a consequence, the two sequences are no longer independent. In more detail, the two models are chosen as follows:

  • The “common motif” model:

    A motif of length L = 5 is chosen, say w = agcca. Next, Bernoulli random variables Zi, with P(Zi = 1) = g, are generated for i = 1, …, n − L + 1; when Zi = 1, we insert the word w in place of (A_i, …, A_{i+L−1}) in sequence 1. We avoid overlap by moving on to Z_{i+L} whenever Zi = 1. We repeat the process for sequence 2. The scores of the various statistics are then computed using the newly generated pair of sequences.

  • The “pattern transfer” model:

    We first choose L = 5 as the length of the segment to be "transferred" from sequence 1 to sequence 2. Again, Bernoulli random variables Zi, with P(Zi = 1) = g, are generated for i = 1, …, n − L + 1. When Zi = 1, we pick the L-word (A_i, …, A_{i+L−1}) in sequence 1 and replace (B_i, …, B_{i+L−1}) in sequence 2 with it. Again, we disallow overlaps. For this model, we compute the scores of all the statistics using sequence 1 and the "new" sequence 2.

The procedure described above is repeated 10,000 times, and the statistics are calculated to yield the empirical distributions of the various statistics for each triplet (k, n, g). As g values for the Bernoulli variables we chose g = 0.001, 0.005, 0.01, 0.05, and 0.1.

For each statistic, we set a type I error level of α = 0.05. Using the empirical distribution of the statistic S under the null model, we find s so that P(S ≥ s) = α. For a given g value, the power of the statistic is then estimated by the proportion of times its score under the alternative model exceeds s.
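The whole pipeline, null threshold plus power estimate under the common motif model, can be sketched as follows. We use D2* as the test statistic; the replicate count (200), n = 400, and g = 0.1 are scaled down from the article's settings for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 4, 400, 5
p = np.array([0.1, 0.4, 0.4, 0.1])       # gc-rich-style null (hypothetical values)
motif = np.array([0, 2, 1, 1, 0])        # 'agcca' with a,c,g,t -> 0,1,2,3
g, L = 0.1, 5
reps = 200

def d2star(a, b):
    """D2* (Eq. 2) with letter probabilities estimated from the concatenation."""
    n_bar = len(a) - k + 1
    phat = np.bincount(np.concatenate([a, b]), minlength=d) / (len(a) + len(b))
    pw = phat.copy()
    for _ in range(k - 1):
        pw = np.outer(pw, phat).ravel()
    def counts(seq):
        codes = sum(seq[i:len(seq) - k + 1 + i] * d ** (k - 1 - i) for i in range(k))
        return np.bincount(codes, minlength=d ** k).astype(float)
    xt = counts(a) - n_bar * pw
    yt = counts(b) - n_bar * pw
    return np.sum(xt * yt / (n_bar * pw))

def null_pair():
    return rng.choice(d, n, p=p), rng.choice(d, n, p=p)

def plant_motif(seq):
    """Common motif model: write the motif at Bernoulli(g) positions, no overlap."""
    out, i = seq.copy(), 0
    while i <= len(seq) - L:
        if rng.random() < g:
            out[i:i + L] = motif
            i += L
        else:
            i += 1
    return out

null_scores = np.array([d2star(*null_pair()) for _ in range(reps)])
threshold = np.quantile(null_scores, 0.95)      # s with P(S >= s) = 0.05
alt_scores = []
for _ in range(reps):
    a, b = null_pair()
    alt_scores.append(d2star(plant_motif(a), plant_motif(b)))
power = np.mean(np.array(alt_scores) > threshold)
```

With this many planted motifs the shared word counts dwarf the null spread, so the estimated power is close to 1.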

We now consider the power curves of D2, D2S, and D2* for both models, as well as a comparison between these statistics. For alternative model 1, Figure 1 shows that for k = 2, the power of D2 is even smaller than 0.05, the type I error. Further, k = 6 has the best power.

FIG. 1.

FIG. 1.

Alternative Model 1: Power curves for D2 under the gc-rich distribution; g = 0.01. For k = 2, the power of D2 is smaller than 0.05, the type I error (indicated by the horizontal dashed line).

Figure 2 shows that k = 4 has the greatest power for D2S under the first alternative model. For D2*, Figure 3 shows that k = 5 has the greatest power under the first alternative model, which corresponds to the length of the "common" motif that we assume relates the two sequences.

FIG. 2.

FIG. 2.

Alternative Model 1: Power curves for D2S under the gc-rich distribution; g = 0.01. Note: k = 4 has the greatest power.

FIG. 3.

FIG. 3.

Alternative Model 1: Power curves for D2* under the gc-rich distribution; g = 0.01.

Turning to a comparison of the power of our various statistics, we find that D2* has greater power than D2S for each of k = 2, 4, 5, 6, 10 (result not shown).

We note that although in the uniform case, D2, D2*, and D2z coincide up to multiplicative and additive constants, Figure 4 shows slight differences between D2z and D2*. These differences stem from using the estimated parameters instead of the true model parameters in the test statistic D2z.

FIG. 4.

FIG. 4.

Alternative Model 1: Power curves for D2, D2z, D2S, and D2* under the uniform distribution; g = 0.01, k = 5. Note: For the uniform case, D2 and D2* differ by only a constant.

Figure 5 shows a typical scenario for alternative model 1, where both D2S and D2* have greater power than D2 for given k and g, and the power increases as the length n of the sequences increases. Even for a small g, we are able to notice the difference in the power of the various statistics. We also note that here D2z has higher power than D2.

FIG. 5.

FIG. 5.

Alternative Model 1: Power curves for D2, D2z, D2S, and D2* under the gc-rich model; k = 5, g = 0.01. Note: All of D2S, D2z, and D2* have greater power than D2 for given k and g.

For alternative model 2, the picture changes. Figure 6 shows that the power of D2 is poor for k = 2, 4, 5, 6; but when increasing the parameter k to 10, far beyond the length of the tuple which we transfer, the power increases dramatically.

FIG. 6.

FIG. 6.

Alternative Model 2: Power curves for D2 under the gc-rich distribution; g = 0.05. This graph suggests that the power increases with k.

In contrast, Figure 7 shows that for D2S the power is moderate for all values of k in the plot, and it does not show a marked increase with sequence length. Using k = 10, instead of k = 6, seems to decrease the power slightly.

FIG. 7.

FIG. 7.

Alternative Model 2: Power curves for D2S under the gc-rich distribution; g = 0.05.

For D2*, Figure 8 shows that the power increases with k, and increasing sequence length slightly improves the power. For k = 10, the power approaches 1 for long sequences.

FIG. 8.

FIG. 8.

Alternative Model 2: Power curves for D2* under the gc-rich distribution; g = 0.05.

For alternative model 2, Figures 6–8 suggest that, under the gc-rich, nonuniform distribution, for D2 and D2*, the greater the k value, the greater the power, even if this comes with a higher computational cost. We note that for k fixed, D2* has greater power than D2. Moreover, D2S has smaller power than D2 for k = 10 and long sequences. Also, we need a larger g value to see the differentiation of the power between the various k values for alternative model 2. This is due to the fact that in the first alternative model, a particular motif has a large contribution to the statistics. In the second model, however, the segment transferred from sequence 1 might be similar to the corresponding segment it replaces in sequence 2; hence, a greater g value is required before the sequences show similarity.

Under the alternative model 2, we find that for k ≤ 9 and g ≤ 0.05, D2 and D2z actually show a decrease in power as n increases, in certain intervals; this is illustrated in Figure 9.

FIG. 9.

FIG. 9.

Alternative Model 2: Power curves for D2, D2z, D2S, and D2* under the gc-rich distribution when k = 5, g = 0.05. Note: For k = 5, D2 has the least power, and its power actually decreases as n increases.

For k = 10, D2 has higher power than D2S, but lower power than D2*; the gain in power over D2S comes at a great computational cost (result not shown).

Our findings suggest that D2 is not desirable as a statistic for sequence comparison. We conjecture that this is due to the fact that D2 is dominated by the normal components of the individual sequences and so is actually "measuring the sum of the departure of each sequence from the background" (Lippert et al., 2002) rather than the (dis)similarity between the two sequences. As n increases, D2 loses its power to detect the relatedness of the two sequences.

As an aside, under the uniform distribution in alternative model 2, all three statistics behave similarly for k = 5, as expected; see Figure 10.

FIG. 10.

Alternative Model 2: Power curves for D2, D2z, D2*, and D2S under the uniform distribution when k = 5, g = 0.05.

4. Using D2S to Test for Similarity

Although in our simulations D2* is more powerful than D2S, the statistic D2S is still considerably more powerful than D2. For tests that must yield very small p-values, as required for multiple testing for example, simulating the empirical distribution of the test statistic under the null hypothesis can be time consuming. In contrast to D2*, the limiting distribution of D2S is normal with mean zero; hence, testing is straightforward, and only the standard deviation needs to be estimated.

To illustrate the procedure, for fixed k and n, we generate 10,000 pairs of sequences under the gc-rich or the uniform model and compute the D2S score for each pair. The standard deviation of D2S is then estimated from these empirical scores.

Again, for fixed k and n, we generate 2000 pairs of sequences under the null model of no relationship between the two sequences, in both the gc-rich and the uniform case, and we compute the D2S score. Assuming asymptotic normality, we use a z-test of the null hypothesis of no relationship, with mean zero and the estimated standard deviation.

Then we generate 2000 pairs of sequences from alternative model 1, with motif insertion probabilities g = 0.001, 0.005, 0.01, 0.05, 0.1. The D2S statistic is computed, and we carry out a z-test based on the asymptotic normality of D2S. We repeat the procedure for the pattern transfer model, alternative model 2. We choose k = 4, 5, 6, because we know from our power simulations that D2S works best when the motif length is around 5.

We compare with the results obtained using the empirical distribution of D2S instead, where the empirical distribution function is based on 10,000 samples. In addition, we use the empirical distribution of D2S based on 100,000 samples.

Tables 3 and 4 show the estimated type 1 and type 2 error rates in the gc-rich and the uniform case; recall that the type 1 error is the probability of rejecting the null hypothesis although it is true. The type 2 error, the probability of accepting the null hypothesis although it is false, is estimated under alternative model 1 and alternative model 2, with motif insertion probability and pattern transfer probability g taking on the values g = 0.005, 0.01, 0.05, 0.1. For each n and k, the first row gives the estimates from the z-test (abbreviated as z), the second row gives the estimates from the empirical distribution function (abbreviated as e), and the third row (t) gives the estimates from the empirical distribution function based on 100,000 samples. Except for the puzzling case n = 3200 with k = 6, the results are remarkably similar, and there is no clear advantage in using the empirical distribution function when it is based on a relatively small number of samples. The general observation is that the normal approximation for D2S gives a fast method for assessing statistical significance.
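The testing pipeline above can be sketched in a few lines of Python. The sketch assumes the self-standardized form D2S = Σw X̃w Ỹw / √(X̃w² + Ỹw²), with centered counts X̃w = Xw − (n − k + 1) pw; the letter frequencies, sequence length, and number of null replicates below are illustrative choices, not the settings used in the article.

```python
import math
import random
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Occurrence counts of every k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2s(seq1, seq2, k, probs):
    """Self-standardized word-count score (assumed form):
    sum over words w of Xc_w * Yc_w / sqrt(Xc_w^2 + Yc_w^2),
    with centered counts Xc_w = X_w - (n - k + 1) * p_w."""
    x, y = kmer_counts(seq1, k), kmer_counts(seq2, k)
    n1, n2 = len(seq1) - k + 1, len(seq2) - k + 1
    score = 0.0
    for letters in product(sorted(probs), repeat=k):
        w = "".join(letters)
        pw = math.prod(probs[c] for c in letters)
        xc, yc = x.get(w, 0) - n1 * pw, y.get(w, 0) - n2 * pw
        denom = math.sqrt(xc * xc + yc * yc)
        if denom > 0:
            score += xc * yc / denom
    return score

def random_seq(rng, n, probs):
    letters, weights = zip(*sorted(probs.items()))
    return "".join(rng.choices(letters, weights=weights, k=n))

def estimate_null_sd(rng, n, k, probs, reps):
    """Standard deviation of D2S under the null of two unrelated sequences."""
    scores = [d2s(random_seq(rng, n, probs), random_seq(rng, n, probs), k, probs)
              for _ in range(reps)]
    mean = sum(scores) / reps
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (reps - 1))

rng = random.Random(0)
probs = {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15}  # hypothetical gc-rich model
sd = estimate_null_sd(rng, n=400, k=4, probs=probs, reps=100)
z = d2s(random_seq(rng, 400, probs), random_seq(rng, 400, probs), 4, probs) / sd
p_two_sided = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

In practice the null standard deviation would be estimated from far more replicates, as in the 10,000-sample runs described above.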

Table 3.

The Estimated Type 1 and 2 Error Rates When Applying the z-Test Using the Estimated Variance, Using 2000 Samples

 
 
Length  k        Type 1   Type 2: M1g1   M1g2   M2g3   M2g4
3200 4 z 0.001 0.966 0.583 0.965 0.764
    e 0.001 0.964 0.555 0.961 0.745
    t 0.000 0.971 0.609 0.972 0.784
3200 5 z 0.002 0.958 0.583 0.881 0.365
    e 0.001 0.977 0.658 0.922 0.450
    t 0.001 0.975 0.647 0.916 0.433
3200 6 z 0.388 0.260 0.020 0.100 0.001
    e 0.000 0.973 0.647 0.910 0.273
    t 0.001 0.974 0.654 0.913 0.282
6400 4 z 0.000 0.904 0.112 0.973 0.794
    e 0.000 0.907 0.114 0.976 0.800
    t 0.001 0.897 0.109 0.972 0.788
6400 5 z 0.000 0.910 0.132 0.921 0.442
    e 0.000 0.920 0.148 0.936 0.480
    t 0.001 0.914 0.140 0.927 0.463
6400 6 z 0.028 0.635 0.022 0.607 0.049
    e 0.000 0.931 0.205 0.943 0.319
    t 0.001 0.921 0.191 0.936 0.294
12,800 4 z 0.000 0.592 0.000 0.975 0.794
    e 0.001 0.545 0.000 0.963 0.752
    t 0.001 0.593 0.000 0.975 0.794
12,800 5 z 0.000 0.627 0.000 0.919 0.433
    e 0.000 0.696 0.000 0.949 0.514
    t 0.001 0.648 0.000 0.929 0.461
12,800 6 z 0.002 0.493 0.000 0.820 0.139
    e 0.000 0.697 0.002 0.940 0.304
    t 0.001 0.684 0.002 0.933 0.286
25,600 4 z 0.001 0.070 0.000 0.961 0.798
    e 0.002 0.062 0.000 0.954 0.767
    t 0.002 0.067 0.000 0.957 0.782
25,600 5 z 0.001 0.106 0.000 0.917 0.452
    e 0.000 0.121 0.000 0.929 0.484
    t 0.001 0.105 0.000 0.916 0.447
25,600 6 z 0.004 0.083 0.000 0.868 0.201
    e 0.003 0.100 0.000 0.887 0.234
    t 0.001 0.132 0.000 0.923 0.294

“M1/M2” refers to alternative model 1/2; “g1/g2/g3/g4” refers to the cases where g = 0.005, 0.01, 0.05, 0.1, respectively, where g is the parameter of the Bernoulli random variable B. So “M1g3” means “alternative model 1, g = 0.05.” The first row gives the estimates from the z-test (abbreviated as z), the second row gives the estimate from the empirical distribution function (abbreviated as e), both based on 10,000 samples. The third row, abbreviated as t, gives the estimate from the empirical distribution function based on 100,000 samples.

Table 4.

The Estimated Type 1 and 2 Error Rates When Applying the z-Test Using the Estimated Variance, Using 2000 Samples, but for the Uniform Case

 
 
Length  k        Type 1   Type 2: M1g1   M1g2   M2g3   M2g4
3200 4 z 0.001 0.978 0.574 0.961 0.746
    e 0.003 0.966 0.518 0.945 0.694
    t 0.001 0.980 0.593 0.967 0.760
3200 5 z 0.002 0.977 0.583 0.889 0.346
    e 0.001 0.986 0.639 0.919 0.404
    t 0.001 0.984 0.621 0.909 0.380
3200 6 z 0.048 0.744 0.198 0.438 0.018
    e 0.001 0.986 0.711 0.912 0.237
    t 0.001 0.985 0.678 0.900 0.218
6400 4 z 0.001 0.882 0.080 0.953 0.743
    e 0.002 0.848 0.063 0.934 0.688
    t 0.001 0.887 0.084 0.956 0.752
6400 5 z 0.001 0.888 0.107 0.891 0.333
    e 0.002 0.874 0.093 0.876 0.304
    t 0.001 0.899 0.123 0.903 0.363
6400 6 z 0.004 0.824 0.064 0.801 0.104
    e 0.002 0.897 0.113 0.883 0.177
    t 0.001 0.910 0.135 0.894 0.203
12,800 4 z 0.003 0.534 0.000 0.951 0.750
    e 0.002 0.580 0.000 0.961 0.783
    t 0.003 0.539 0.000 0.952 0.751
12,800 5 z 0.002 0.578 0.000 0.886 0.356
    e 0.002 0.585 0.000 0.887 0.365
    t 0.002 0.590 0.000 0.891 0.371
12,800 6 z 0.002 0.491 0.000 0.823 0.110
    e 0.001 0.600 0.000 0.892 0.171
    t 0.001 0.605 0.001 0.895 0.175
25,600 4 z 0.000 0.059 0.000 0.963 0.738
    e 0.002 0.050 0.000 0.949 0.706
    t 0.001 0.059 0.000 0.960 0.737
25,600 5 z 0.000 0.086 0.000 0.909 0.361
    e 0.000 0.110 0.000 0.923 0.418
    t 0.001 0.082 0.000 0.901 0.350
25,600 6 z 0.002 0.093 0.000 0.865 0.154
    e 0.002 0.114 0.000 0.889 0.183
    t 0.003 0.113 0.000 0.888 0.182

5. Phase Transition

In this section, we explore the effect of small deviations from the uniform distribution on the D2 statistic only. We restrict attention to the alphabet {a, c, g, t} and word size k = 1; both sequences are of the same length n. Again pα denotes the probability of the letter α, and Xα and Yα denote the counts of α in the first and second sequence. Then the standardized counts X̃α = (Xα − npα)/√(npα(1 − pα)) and Ỹα = (Yα − npα)/√(npα(1 − pα)) both tend to standard normal variables when n tends to infinity. With this notation, and noting that Σα √(pα(1 − pα)) X̃α = Σα √(pα(1 − pα)) Ỹα = 0 because the letter counts in each sequence sum to n, we obtain

D2 − n^2 Σα pα^2 = n Σα pα(1 − pα) X̃α Ỹα + n^(3/2) Σα pα √(pα(1 − pα)) (X̃α + Ỹα).   (7)

When the distribution of the alphabet is uniform, pα = 1/4 for all α, the second term in Equation (7) vanishes, leaving

D2 − n^2/4 = (3n/16) Σα X̃α Ỹα,

and so it is asymptotically nonnormal (in fact, a sum of products of standard normal variables).
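For k = 1 under the uniform model, the vanishing of the linear term is in fact an exact algebraic identity, not only an asymptotic statement, because the letter counts in each sequence sum to n. A quick numerical check (with illustrative random sequences):

```python
import math
import random
from collections import Counter

rng = random.Random(1)
n = 1000
alphabet = "acgt"
s1 = "".join(rng.choice(alphabet) for _ in range(n))
s2 = "".join(rng.choice(alphabet) for _ in range(n))
x, y = Counter(s1), Counter(s2)

# D2 for word size k = 1: sum over letters of X_alpha * Y_alpha.
d2 = sum(x[a] * y[a] for a in alphabet)

# Standardized letter counts under the uniform model p_alpha = 1/4.
sd = math.sqrt(n * 0.25 * 0.75)                     # sqrt(n p (1 - p))
xt = {a: (x[a] - n / 4) / sd for a in alphabet}
yt = {a: (y[a] - n / 4) / sd for a in alphabet}

# The linear term vanishes because sum_alpha (X_alpha - n/4) = 0,
# leaving only the product term:
lhs = d2 - n * n / 4
rhs = (3 * n / 16) * sum(xt[a] * yt[a] for a in alphabet)
assert abs(lhs - rhs) < 1e-6
```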

In the situation where the pα do not depend on n and are not all equal, the second term in Equation (7) dominates the first term; as

(D2 − n^2 Σα pα^2) / n^(3/2) = Σα pα √(pα(1 − pα)) (X̃α + Ỹα) + OP(n^(−1/2)),

the scaled limit is a normal distribution.

Next we assume that (pa(n), pg(n), pc(n), pt(n)) changes with n in such a way that there exists a function f(n) → 0 and constants Cl, l = a, g, c, t, satisfying Ca + Cg + Cc + Ct = 0, such that

pl(n) = 1/4 + Cl f(n)

for each letter l ∈ {a, g, c, t}. Then

D2 − n^2 Σα pα(n)^2 = n (3/16 + γ(n)) Σα X̃α Ỹα + (√3/4) n^(3/2) Θ(f(n)) Σl Cl (X̃l + Ỹl),   (8)

where γ(n) → 0 as n → ∞ and Θ(f(n)) indicates a term that has the same order as f(n).

Let f(n) = 1/n^(0.5+ε). When ε < 0, the second term in (8) will dominate and the standardized D2 will tend to a normal distribution. When ε > 0, the first term dominates and the standardized D2 will tend to be nonnormal. Thus, we expect a phase transition from normal to nonnormal as ε changes from negative to positive. Intuitively, the ratio Rn of the coefficient of the first term over the coefficient of the second term in Equation (8) can be thought of as a "ratio of dominance";

Rn = (3n/16) / ((√3/4) n^(3/2) f(n)) = (√3/4) n^ε.

Table 5 shows the decrease of Rn for increasing n (row labels) and decreasing ε (column labels; the columns list −ε, so ε itself is negative).

Table 5.

The Ratio Rn of the Coefficient of the First Term Over the Coefficient of the Second Term in Equation (8) for Different Values of n and ε

 
−ε
n            0.01    0.05    0.10    0.15    0.20
2^10 × 100   0.386   0.243   0.137   0.077   0.043
2^20 × 100   0.360   0.172   0.068   0.027   0.011
2^30 × 100   0.336   0.122   0.034   0.010   0.003
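Reading the column labels of Table 5 as −ε (so that ε is negative), the tabulated values are consistent with the closed form Rn = (√3/4) n^ε with f(n) = n^(−(0.5+ε)); under that reading, the table can be reproduced exactly:

```python
import math

def ratio(n, eps):
    """Ratio of dominance R_n = (sqrt(3)/4) * n**eps, i.e. the coefficient of
    the product term over that of the linear term when f(n) = n**-(0.5 + eps)."""
    return math.sqrt(3) / 4 * n ** eps

rows = [2**10 * 100, 2**20 * 100, 2**30 * 100]
cols = [-0.01, -0.05, -0.10, -0.15, -0.20]   # Table 5 lists |eps|
table = [[round(ratio(n, e), 3) for e in cols] for n in rows]
# table reproduces Table 5 row by row:
# [[0.386, 0.243, 0.137, 0.077, 0.043],
#  [0.36,  0.172, 0.068, 0.027, 0.011],
#  [0.336, 0.122, 0.034, 0.01,  0.003]]
```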

To run simulations in the vicinity of this phase transition, we consider two types of probability vectors on the alphabet {a, c, g, t} and f(n) = 1/n^(0.5+ε). The type I probability vector is chosen as (pa, pc, pg, pt) = (1/4 + f(n), 1/4 − f(n), 1/4, 1/4), giving (Ca, Cc, Cg, Ct) = (1, −1, 0, 0).

In the second scenario, type II, the probability vector perturbs all components, (pa, pg, pc, pt) = (1/4 + f(n), 1/4 − f(n), 1/4 + f(n), 1/4 − f(n)), so that (Ca, Cg, Cc, Ct) = (1, −1, 1, −1).

Under the type I model, we can show that the variance of Σl Cl (X̃l + Ỹl) is approximately 16/3. Thus, there exists Zn, asymptotically N(0, 1) as n tends to infinity, such that

Σl Cl (X̃l + Ỹl) = √(16/3) Zn.

Equation (8) can be rewritten as

D2 − n^2 Σα pα(n)^2 = n (3/16 + γ(n)) Σα X̃α Ỹα + n^(3/2) Θ(f(n)) Zn.   (9)

Under the type II model, the variance of Σl Cl (X̃l + Ỹl) is approximately 32/3, and

Σl Cl (X̃l + Ỹl) = √(32/3) Zn.

Hence, under the type II model, Equation (8) can be rewritten as

D2 − n^2 Σα pα(n)^2 = n (3/16 + γ(n)) Σα X̃α Ỹα + √2 n^(3/2) Θ(f(n)) Zn.   (10)

The ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is √2-fold larger than that for Equation (10). Therefore, we expect normality to appear at relatively small absolute ε values for the type II vector.

For given n and ε, we generate sequences of length n using both types of distribution vectors. The D2 scores for word size k = 1 are then tabulated. We use a Kolmogorov–Smirnov test of the hypothesis that D2 is normally distributed and obtain the corresponding p-value. Here again we use the theoretical mean and variance for the test. Table 6 gives the p-values for different values of n and ε under the two models; again we report only the first four digits.
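The one-sample Kolmogorov–Smirnov statistic against a fully specified N(0, 1), as used here with the theoretical mean and variance, is easy to sketch with the standard library (the sample below is an illustrative grid, not simulated D2 scores):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def ks_statistic(sample):
    """One-sample Kolmogorov-Smirnov statistic against N(0, 1):
    D_n = sup_x |F_n(x) - Phi(x)|, evaluated at the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = phi(x)
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d

# A roughly "normal-looking" symmetric grid gives a small D_n;
# shifting the whole sample away from mean zero inflates it.
quantiles = [-2.0, -1.5, -1.2, -1.0, -0.8, -0.6, -0.45, -0.3, -0.15, -0.05,
             0.05, 0.15, 0.3, 0.45, 0.6, 0.8, 1.0, 1.2, 1.5, 2.0]
assert ks_statistic(quantiles) < ks_statistic([q + 1.0 for q in quantiles])
```

Note that with estimated rather than theoretical moments, the Lilliefors (1967) correction would be needed instead of the plain Kolmogorov–Smirnov critical values.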

Table 6.

The p-Values of the Kolmogorov–Smirnov Test for Testing the Normality of D2 for Letter Distributions Which Are Close to Uniform; f(n) = 1/n^(0.5+ε)

 
n
ε        2^5 × 100    2^10 × 100    2^15 × 100    2^20 × 100
  Type I: (pa, pc, pg, pt) = (1/4 + f(n), 1/4 − f(n), 1/4, 1/4)
0.1 0 0 0 0
0.01 0 0 0 0
0.001 0 0 0 0
1 e−04 0 0 0 0
1 e−05 0 0 0 0
−1 e−05 0 0 0 0
−1 e−04 0 0 0 0
−0.001 0.0001 0 0.0001 0
−0.01 0 0 0.0002 0
−0.05 0 0 0 0.0088
−0.1 0.0424 0.1402 0.2754 0.6383
−0.15 0.2557 0.1027 0.2041 0.9915
−0.2 0.1198 0.6978 0.2258 0.4000
  Type II: (pa, pg, pc, pt) = (1/4 + f(n), 1/4 − f(n), 1/4 + f(n), 1/4 − f(n))
0.1 0 0 0 0
0.01 0.0005 0 0 0
0.001 0 0 0 0
1 e−04 0 0.0005 0 0
1 e−05 0 0.0015 0 0
−1 e−05 0.0049 0 0 0
−1 e−04 0.0020 0 0 0
−0.001 0 0 0 0
−0.01 0 0 0 0.0199
−0.05 0.0002 0.0069 0.0005 0.3162
−0.1 0.0866 0.0637 0.3941 0.3159
−0.15 0.5832 0.5113 0.0326 0.5910
−0.2 0.1015 0.5146 0.9437 0.4827

Table 6 indicates that under the type II model the distribution of D2 is not significantly different from normal when ε ≤ −0.05 and n = 2^20 × 100, while it is significantly different from normal when ε = −0.05 and n ≤ 2^15 × 100. On the other hand, under the type I model, the distribution of D2 is significantly different from normal even when ε = −0.05 and n = 2^20 × 100. The simulation results are consistent with our intuition. As shown in Table 5, the ratio Rn of the coefficient of the first term over that of the second term in Equation (8) is less than 0.1 if ε < −0.10 when n = 2^20 × 100. This can explain why normality of D2 appears if ε < −0.10 when n = 2^20 × 100 for both the type I and type II models. Further, as the ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is √2-fold larger than that for Equation (10), normality of D2 begins to appear already when ε < −0.05 for the type II model.

6. Discussion

The typically used statistic D2 asymptotically ignores the joint word occurrences in the two sequences unless all letters are almost equally likely; in the latter scenario, a phase transition occurs. Hence, under the normal regime, the statistic is neither robust nor informative. The main advantage of D2 is that it is easy to compute.

The proposed D2S statistic is likewise easy to compute; moreover, it can be compared with a normal distribution to assess significance, and it performs well in a power study.

The D2* statistic is more powerful than D2S in our simulation study and is also easy to compute, but its asymptotic distribution does not have a convenient form; instead, it is best assessed by simulation, which is time consuming, as the tail of the distribution needs to be estimated.

Our recommendation is to discard D2, to use D2S when computing time is limited, and ideally to use D2* for sequence comparison based on k-tuple content.

Our results allow for a number of generalizations. The normal approximation for the word counts in each individual sequence does not assume that the underlying letter distribution is the same as in the other sequence. Hence, the normal approximation for D2S also holds when the two sequences do not follow the same underlying letter distribution.

Huang (2002) gave a related normal approximation for one sequence in the more general situation that the sequence is generated by a homogeneous Markov chain. Kantorovitz et al. (2007b) already successfully adapted D2z to the Markov case. Also for D2S, the generalization of our results to that setting should be straightforward; only the error bounds would need to be adjusted. Burden et al. (2008) generalized D2z to allow for mismatches on the four-letter alphabet {a, c, g, t} under the Bernoulli model with pa = pt and pc = pg; they call the m-neighborhood of a word w of length k the set of all words that differ from w in at most m letters. The generalized statistic then counts the number of all m-neighborhood matches of all k-words between the two sequences. With our normal approximation for all word counts, D2S could be generalized similarly to allow for a certain number of mismatches. The quality of the normal approximation will depend on the number m of permitted mismatches.
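The m-neighborhood match count just described can be sketched directly from its definition; this naive version enumerates the distinct word pairs and is quadratic in the number of distinct k-words, so it is for illustration only:

```python
from collections import Counter

def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

def neighborhood_matches(seq1, seq2, k, m):
    """Count pairs of k-word occurrences (one from each sequence)
    whose words differ in at most m positions (Burden et al., 2008)."""
    x = Counter(seq1[i:i + k] for i in range(len(seq1) - k + 1))
    y = Counter(seq2[i:i + k] for i in range(len(seq2) - k + 1))
    return sum(cx * cy
               for wx, cx in x.items()
               for wy, cy in y.items()
               if hamming(wx, wy) <= m)

# With m = 0 this reduces to the exact-match count D2.
assert neighborhood_matches("aaaa", "aaca", 2, 1) == 9
```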

We also indicate that more than two sequences could be compared in a similar fashion. Quine (1994) stated the result that if X1, …, Xn and Y1, …, Yn are independent normal random variables with zero means and Var(Xi) = Var(Yi) = σi^2, then

Σi Xi Yi / √(Σi σi^2 Xi^2) ∼ N(0, 1),

where both sums run over i = 1, …, n (Melnykov and Chen, 2007). This suggests extending the D2S statistic to multiple sequence comparison by taking the products of the individual word counts and standardizing as before. Then a normal approximation is still valid.
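This self-normalization result is easy to sanity-check by simulation. We read it as: the self-normalized product Σi Xi Yi / √(Σi σi² Xi²) is standard normal when Var(Xi) = Var(Yi) = σi²; the sample sizes and variances below are illustrative:

```python
import math
import random

def self_normalized_product(rng, sds):
    """One draw of  sum_i X_i Y_i / sqrt(sum_i sigma_i^2 X_i^2),
    with X_i, Y_i independent N(0, sigma_i^2) and sigma_i = sds[i]."""
    xs = [rng.gauss(0.0, s) for s in sds]
    ys = [rng.gauss(0.0, s) for s in sds]
    num = sum(x * y for x, y in zip(xs, ys))
    den = math.sqrt(sum((s * x) ** 2 for s, x in zip(sds, xs)))
    return num / den

rng = random.Random(42)
sds = [0.5, 1.0, 2.0, 4.0]          # illustrative sigma_i values
draws = [self_normalized_product(rng, sds) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((t - mean) ** 2 for t in draws) / (len(draws) - 1)
# mean should be near 0 and variance near 1, as for a standard normal
```

Conditionally on the Xi, the numerator is normal with variance Σi σi² Xi², which is exactly the square of the denominator; this is why the ratio is standard normal regardless of the σi.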

Similarly, we could extend D2* as the sum, over all words, of the product of more than two standardized word counts. Springer and Thompson (1966) gave a formula for the density of a product of independent standard normals. Again, the covariance structure of the word counts within one sequence makes it advisable to assess the limiting distribution via simulation.

Supplementary Material

Supplemental data
Supp_Data.pdf (77.9KB, pdf)

Acknowledgments

G.R. was supported in part by EPSRC grant no. GR/R52183/01, and by BBSRC and EPSRC through OCISB. D.C. was supported by an Overseas Postdoctoral Fellowship from the National University of Singapore. F.S. was supported by NIH grant no. P50 HG 002790 and R21AG032743. M.S.W. was supported by NIH grant no. P50 HG 002790 and R21AG032743.

Disclosure Statement

No competing financial interests exist.

References

1. Burden C.J., Kantorovitz M.R., Wilson S.R. Approximate word matches between two random sequences. Ann. Appl. Probab. 2008;18:1–21.
2. Forêt S., Kantorovitz M., Burden C. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformat. 2006;7(Suppl 5):S21. doi: 10.1186/1471-2105-7-S5-S21.
3. Forêt S., Wilson S.R., Burden C.J. Empirical distribution of k-word matches in biological sequences. Pattern Recognit. 2009;42:539–548.
4. Huang H. Error bounds on multivariate normal approximations for word count statistics. Adv. Appl. Probab. 2002;34:559–586.
5. Kantorovitz M.R. An example of a stationary, triplewise independent triangular array for which the CLT fails. Statist. Probab. Lett. 2007;77:539–542.
6. Kantorovitz M.R., Booth H.S., Burden C.J., et al. Asymptotic behavior of k-word matches between two uniformly distributed sequences. J. Appl. Probab. 2007a;44:788–805.
7. Kantorovitz M.R., Robinson G.E., Sinha S. A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics. 2007b;23:i249–i255. doi: 10.1093/bioinformatics/btm211.
8. Lilliefors H.W. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Am. Stat. Assoc. 1967;62:399–402.
9. Lippert R.A., Huang H., Waterman M.S. Distributional regimes for the number of k-word matches between two random sequences. Proc. Natl. Acad. Sci. USA. 2002;99:13980–13989. doi: 10.1073/pnas.202468099.
10. Lundstrom R. Stochastic models and statistical methods for DNA sequence data [Ph.D. thesis]. Department of Mathematics, University of Utah, Salt Lake City, UT; 1990.
11. Melnykov I., Chen J.T. A connection between self-normalized products and stable laws. Stat. Probab. Lett. 2007;77:1662–1665.
12. Quine M.P. A result of Shepp. Appl. Math. Lett. 1994;7:33–34.
13. Shepp L. Problem 62-9: Normal functions of normal random variables. SIAM Rev. 1964;6:459.
14. Springer M.D., Thompson W.E. The distribution of products of independent random variables. SIAM J. Appl. Math. 1966;14:511–526.
15. Stuart A., Ord J.K. Kendall's Advanced Theory of Statistics. 5th ed. Vol. 1. Griffin, London; 1987.
16. Waterman M.S. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall/CRC, Boca Raton, FL; 1995.


Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.
