Skip to main content
BMC Bioinformatics logoLink to BMC Bioinformatics
. 2006 Dec 18;7(Suppl 5):S21. doi: 10.1186/1471-2105-7-S5-S21

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Sylvain Forêt 1,, Miriam R Kantorovitz 2, Conrad J Burden 1
PMCID: PMC1764478  PMID: 17254306

Abstract

Background

The number of k-words shared between two sequences is a simple and effcient alignment-free sequence comparison method. This statistic, D2, has been used for the clustering of EST sequences. Sequence comparison based on D2 is extremely fast, its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D2, and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied.

Results

We have computed the D2 optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D2 to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D2 statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA.

Conclusion

We have characterized the distribution of the D2 statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate that expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D2-based methods are most appropriate for large genomic sequences.

Background

The overwhelming amount of molecular data generated by the sequencing of whole genomes and EST libraries has triggered the development of numerous sequence comparison algorithms, aimed at detecting related sequences and at quantifying this relatedness. BLAST [1], FASTA [2] and other related algorithm are arguably the most popular programs for sequence comparison. These algorithms rely on the local alignment of the sequences under scrutiny and assume conservation of contiguity between homologous segments. This assumption, however, is often violated. Discontinuity can occur, for example, when spliced transcripts are matched to genomic sequences, when ESTs or cDNAs from different splice variants are compared or when genomic sequences are aligned that have undergone genome shuffling. Other alignment-based algorithms that do not assume conservation of contiguity, such as BLAT [3] or SIM4 [4], can compute scores, percentage similarity, but do not assess statistical significance.

Several types of alignment-free sequence comparison algorithms, reviewed in [5], can circumvent this problem. Among these alignment-free methods, techniques based on the number of k-words shared between two sequences, also known as the D2 statistic, are particularly noteworthy due to the speed of their implementation, their sensitivity and selectivity [6]. They have been extensively used to structure large collections of ESTs into clusters of similar sequences [7-9].

The rigorous study of the statistical distribution of D2 began with the computation of bounds of D2 variance and the characterization of asymptotic distributional regimes [10]. These results have been refined in a recent study [11]. Other studies [12,13] have focussed on a generalization of D2, the number of approximate k-word matches between two sequences, Inline graphic, where t is the number of mismatches per word. Bounds on thevariance and asymptotic distribution of Inline graphic have been determined [13].

The current paper summarizes the theoretical knowledge on the distribution of D2 and Inline graphic. The optimal word sizes of these statistics in a variety of conditions were computed, and the distributional regimes at optimal word size were deduced. Finally, the accuracy of Inline graphic as a measure of sequence similarity was compared with other measures using random sequences and sequences evolved from human genomic DNA.

Results and Discussion

Word matches measures

Exact matches

The theory developed for the number of exact word matches between two sequences is widely applicable to any kind of sequence with the only constraint that these are made of independent and identically distributed (i.i.d.) letters. Given an alphabet Inline graphic of d letters, let A = A1A2 An be a sequence of n i.i.d. letters of Inline graphic. Let fa be the probability of a letter taking the value a, and pk = Inline graphic.

D2 is defined as the number of k-words matches between two sequences A and B, and can be expressed

graphic file with name 1471-2105-7-S5-S21-i4.gif

where Y(i, j) is the k-word match indicator variable starting at position i in A and j in B, and the index set is

I = {(i, j) : 1 ≤ i Inline graphic, 1 ≤ j Inline graphic}

where, for convenience, (n - k + 1) and (m - k + 1) are written Inline graphic and Inline graphic respectively.

D2 can also be thought of as an inner product of the word count vectors. Let Inline graphic = {w1, w2, ⋯, wn} be the set of all k-words on Inline graphic. For w ∈ Inline graphic, let Inline graphic be the number of times the letter w appears in sequence A. The count vector for that sequence is Inline graphic. D2 can thus also be expressed as

graphic file with name 1471-2105-7-S5-S21-i10.gif

The mean of D2 can be easily computed (e.g. [14])

graphic file with name 1471-2105-7-S5-S21-i11.gif

When the letters of the alphabet are uniformly distributed, that is, fa = Inline graphic, for all a A.

graphic file with name 1471-2105-7-S5-S21-i13.gif

Lower and upper bounds for the variance of D2 were computed in [10], and an exact formula for the variance in the case of uniform letter distribution was given in [11]. Limiting distributions of D2 when n and k get large were derived in [10], namely (a) when k/Inline graphicn > 2, and E[D2] is not too small, D2 has a compound Poisson [15] asymptotic behaviour, and (b) when k/Inline graphicn < 1/6, and the letter distribution is non-uniform, D2 has a normal asymptotic behaviour. Numerical simulations in [10] showed that these theoretical bounds are not tight. The uniform letter distribution case was studied in [11], where it was proved that, for large enough k, the D2 statistic is approximately normal as n gets large.

Approximate matches

The theory for the number of approximate matches between two sequences has been developed in the more restricted framework of strand symmetric Bernoulli text [12,13]. In this framework, the letters are i.i.d. with frequencies

graphic file with name 1471-2105-7-S5-S21-i15.gif

graphic file with name 1471-2105-7-S5-S21-i16.gif

where η is the perturbation parameter, with -1 ≥ η ≥ 1. The distance between two words x and y of length k is defined as the number of character mismatches between x and y and is written δ(x, y). When δ(x, y) ≥ t, x is said to be a t-neighbour of y. If A is a strand symmetric Bernoulli text and w is a known word of the same length, n, as A, then the probability distribution of the distance δ(A, w) is:

Pr(δ(A, w) = t) =gt(n,η,c)

where gt is a perturbed binomial distribution [12], c is the GC count in w and η is the perturbation parameter of A. The cumulative distribution function of the distance is then

graphic file with name 1471-2105-7-S5-S21-i17.gif

The Inline graphic statistic is defined to be the number of t-neighbours of k-words between sequences A and B, and can be expressed as

graphic file with name 1471-2105-7-S5-S21-i18.gif

where Inline graphic is the indicator variable of t-neighbourhood for the k-words starting at position i in A and j in B. The expectation of Inline graphic can be expressed in terms of the perturbed binomial distribution [12]

graphic file with name 1471-2105-7-S5-S21-i20.gif

Upper and lower bounds for the variance of Inline graphic were computed in [13], and the following limit deduced: for large n, Inline graphic is asymptotically normal when k = αInline graphic(n) + C where 0 ≥ α <Inline graphic and C is a constant. When t = 0, this result is an improvement on the α = 1/6 result for perfect matches [10] reported above. Numerical simulations in [13] suggest that this asymptotic behaviour occurs for α as high as 2.

The results of the numerical simulations comparing the distribution of Inline graphic to a normal distribution, for various word sizes, sequences sizes and number of mismatches, in the case of non-uniform letters distribution, are shown in figure 1. Similar tables comparing D2 to the normal and compound Poisson distribution, in the case of exact word matches, for uniform and non-uniform letters distribution, can be found in [10].

Figure 1.

Figure 1

Kolmogorov-Smirnov p values for non-uniform Inline graphiccompared with normal. (adapted from [13]) The letter distribution is fA = fT=Inline graphic, fG = fC = Inline graphic, and the diagonal red line in each table is k = 2Inline graphicn + const.

Optimal word sizes

Our first goal is to characterize the distribution regime of Inline graphic when the word size is optimal. We write Inline graphic the optimal word size for sequences of size n and t mismatches. Numerical simulations were carried out to determine Inline graphic for various sequence lengths n = 2x × 102 with x = 1, 2, 3, 4, 5, and for different numbers of mismatches t = 0, 1, 2, 3, 4, 5. A summary of these simulations is given in table 1 and the detailed simulation results can be found in additional files 1 to 9. Sequences with a non-uniform letter distribution were used, with nucleotide frequencies fA = fT = Inline graphic, fG = fC = Inline graphic. Similar compositional biases are observed in several sequenced genomes, such as the honey bee Apis mellifera, the roundworm Caenorhabditis elegans or the zebra fish Danio rerio. The optimal word size was also computed in the uniform case for t = 0, 3, with very similar outcomes. We will therefore focus our discussion on the non-uniform case.

Table 1.

Optimal word sizes. Optimal word sizes for various sequence lengths and numbers of mismatches.

Mismatches Sequence length
200 400 800 1600 3200
0 6 7 7 7 7
1 8 10 10 10 10
2 10 12 12 12 12
3 12 14 14 14 14
4 14 16 16 16 16
5 16 18 18 18 18

Regardless of the number of mismatches, the optimal word size is quite stable when the lengths of the sequences vary across the range of sizes under consideration. We only noticed a slight decrease in Inline graphic for sequences smaller than 400 letters (see table 1). The behaviour of Inline graphic, however, can traverse different distribution regimes when the sequence length varies and the word size is fixed.

When our data for perfect word matches are compared to tables 1 and 3 in [10], it appears that the distribution of D2 at optimal word size is approximately a compound Poisson distribution when sequences are 400 letters long or smaller, and is approximately normal when sequences lengths are larger than 1600 letters.

In the case of approximate word matches, it can be extrapolated from figure 1 that the distribution of Inline graphic becomes normal for sequences larger than 1600 letters. The approximate behaviour of Inline graphic, when t > 0 and k/Inline graphicn > 2, however, is unknown. The distribution regime of Inline graphic at optimal word size for smaller sequences could therefore not be characterized.

Accuracy of Inline graphic measures

A previous study [16] compared the accuracy of various dissimilarity measures based on the number of words shared between 2 sequences. These authors give the values of the Spearman's rank statistic, A (see methods), obtained with these measures and compare them with that obtained using either BLAST or the Hamming distance. Similarly, we compared the efficiency of Inline graphic for different numbers of mismatches, t = 0,1, 2, 3, 4, 5 (figure 2). We used the same sequence size (600) as inthe above-cited study so as to allow comparison of the performance of Inline graphic with the measures assessed in that study. A for BLAST was computed based on the bit scores obtained with the default settings.

Figure 2.

Figure 2

Effect of the number of mismatches on the accuracy of Inline graphic. The curves show the accuracy of the Inline graphic statistics according to the word length and for various mismatches per word. In this simulation, sequence sizes of 600 letters were used. The dotted line shows the results obtained with BLAST.

Overall, the Inline graphic measures provide an accuracy similar to the dissimilarity measures computed in [16], with log (A) ranging from 9.3 to 9.6. This is better than BLAST whose log (A) is close to 9.9. It is noteworthy that, at optimal word size, the Inline graphic statistic gives better results when the number of mismatches allowed per word increases.

Application to non-iid sequences

In real biological sequences, letters are not independent and identically distributed. To test the applicability of our results to biological sequences, we conducted similar simulations, but instead of randomly generated sequences of known composition, we used sequences sampled from the human genome and made them evolve according to the K80 [17] model of nucleotide substitutions. The results are summarized in additional file 9. These results are quite encouraging in many respects. First, the optimal word sizes for these sequences are the same as or very close to the ones predicted with random sequences in the previous sections. The values of log (A) for near-optimum word sizes are lower than those computed previously, suggesting that similarity measures based on word counts may be more accurate on real sequences than on randomly generated sequences. Finally, the accuracy of the Inline graphic measures near optimal word size are consistently better than BLAST bit scores obtained with the default settings.

Conclusion

The Inline graphic statistic has different distribution regimes depending on the word size k, the sequences lengths n, the number of mismatches t and the sequence composition fATGC. We computed the optimal word size for various combinations of these parameters that influence the distribution of Inline graphic. For sequences smaller than 400 letters and when no mismatches are allowed, the distribution of Inline graphic is close to a compound Poisson. In the case of sequence larger than 1600 letters, the distribution of Inline graphic is approximately normal. When estimating the significance of sequence similarity using Inline graphic at its optimal word size, the null distribution would thus have to be adjusted according to the size of the sequences. It is worth noting that the size of typical ESTs or whole genome shotgun sequencing traces (500–800 letters) is in the transition region between these two limiting distributions.

We have also shown that the accuracy of the Inline graphic statistic is similar to that of other similarity measures based on the number of words shared between two sequences. In particular, allowing for mismatches between words increases the accuracy of the Inline graphic. This improvement, however, comes at a high computation cost. The algorithmic complexity of Inline graphic is o(kn), when t = 0, however, when t > 0, the worst case complexity is o(knm). In our simulations, Inline graphic was more accurate than BLAST (see figure 2). We expect this difference to be accentuated on real sequences where shuffling and transposition occurs, breaking the contiguity generally assumed by alignment-based sequence comparison algorithms.

The assumption of i.i.d. distributed letters made to ease the characterization of the Inline graphic distribution may not hold in real sequences. This may have an impact on the assessment of the statistical significance of the number of words shared between two sequences. The optimal word sizes computed in this study, however, seem to be valid for real sequences evolved under a relatively realistic model of nucleotides substitution.

Taken together, the statistical theory, the algorithmic complexity and the simulation results suggest that the best application of the Inline graphic statistics would be the identification of similarities between large genomic sequences using exact word matches. Comparison of such sequences using alignment-based methods is computationally expensive, since these algorithms typically have a o(n2) complexity, hence the development of heuristics speeding up these comparisons, such as BLAT [3] or MEGABLAST [18]. The linear (o(n)) nature of the Inline graphic algorithm and the normal asymptotic behavior of Inline graphic, when sequences are large compared to word size, would allow sequence similarity to be assessed in a rigorous statistical framework, with significant improvement in run time.

Finally, the known asymptotic behavior of the Inline graphic statistic could be used to improve the assessment of significant matches during the initial (search) stage of existing alignment algorithms such as BLAST or BLAT.

The use of Inline graphic in the context of biological sequences is not limited to the sequence comparison discussed in this paper. It has been proposed that it could be used to choose discriminative microarray probes [12]. Other possible applications may include the detection of transcription factor binding sites, microRNAs and dsRNA targets.

Methods

Word size optimization

The optimization of the word size, for a given sequence length and number of mismatches, was carried out according to a method similar to that introduced in [16]. In brief, a family of sequences was generated by first creating a random mother sequence. 100 sons were then derived by mutating γ% of the mother sequence, where γ = 1, 2, . . ., 100%. Only point mutations were used: substitution, insertion and deletion of a single letter. The Inline graphic statistic between the mother and each son were computed. Two rankings of the sons were then produced, one based on Inline graphic, and another based on γ. The accuracy of Inline graphic-based sequence comparison was estimated by looking at the discrepancy between these two rankings by means of the Spearman's rank statistic A. The optimal word size is that for which A is minimal. For each condition, the data presented here are the average of A for 100 to 400 families.

Similar simulations were also carried out by randomly selecting a mother sequence from the human genome chromosome 1, version NCBI36.40 (available from http://www.ensembl.org). Sons where then derived according to a K80 [17] model of evolution.

A program written in ANSI C was written to compute the spearmanRS statistic. Data were post-processed using the R environment. The simulations were carried out on a Debian Gnu/Linux desktop computer. The source code is available from our k-words website [19].

Fit to the normal distribution

Kolmogorov-Smirnov p-values [20] for the standardized statistic Inline graphic compared with the standard normal distribution for sample sizes of 2,500 sequence pairs were computed. The results are shown in figure 1. In general, samples which are a close approximation to the normal distribution have p-values distributed uniformly in the interval [0, 1], whereas samples which are a poor approximation have small p-values.

Authors contributions

SF drafted most of the manuscript and performed the numerical simulations. MK made all the theoretical calculations. CB conceived of and coordinated the project and drafted parts of the manuscript. All authors read and approved the final manuscript.

Supplementary Material

Additional file 1

Word size optimization for exact word matches, uniform sequences

Click here for file (15.5KB, eps)
Additional file 2

Word size optimization for exact word matches, non-uniform sequences

Click here for file (13.2KB, eps)
Additional file 3

Word size optimization for approximate word matches with 1 mismatch, non-uniform sequences

Click here for file (13.4KB, eps)
Additional file 4

Word size optimization for approximate word matches with 2 mismatch, non-uniform sequences

Click here for file (14.4KB, eps)
Additional file 5

Word size optimization for approximate word matches with 3 mismatch, uniform sequences

Click here for file (17.4KB, eps)
Additional file 6

Word size optimization for approximate word matches with 3 mismatch, non-uniform sequences

Click here for file (14.7KB, eps)
Additional file 7

Word size optimization for approximate word matches with 4 mismatch, non-uniform sequences

Click here for file (14.5KB, eps)
Additional file 8

Word size optimization for approximate word matches with 5 mismatch, non-uniform sequences

Click here for file (14.4KB, eps)
Additional file 9

Word size optimization for exact and approximate word matches, sequences evolving according to a Kimura80 substitution model. The dotted lines show the results obtained with blast.

Click here for file (20KB, eps)

Acknowledgments

Acknowledgements

This work was funded in part by ARC Discovery Grant DP0559260.

This article has been published as part of BMC Bioinformatics Volume 7, Supplement 5, 2006: APBioNet – Fifth International Conference on Bioinformatics (InCoB2006). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S5.

Contributor Information

Sylvain Forêt, Email: sylvain.foret@rsbs.anu.edu.au.

Miriam R Kantorovitz, Email: ruth@math.uiuc.edu.

Conrad J Burden, Email: conrad.burden@anu.edu.au.

References

  1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990;183:63–98. doi: 10.1016/0076-6879(90)83007-v. [DOI] [PubMed] [Google Scholar]
  3. Kent WJ. BLAT-the BLAST-like alignment tool. Genome Res. 2002;12:656–64. doi: 10.1101/gr.229202. 10.1101/gr.229202. Article published online before March 2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998;8:967–74. doi: 10.1101/gr.8.9.967. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Vinga S, Almeida J. Alignment-free sequence comparison-a review. Bioinformatics. 2003;19:513–23. doi: 10.1093/bioinformatics/btg005. [DOI] [PubMed] [Google Scholar]
  6. Hide W, Burke J, Davison DB. Biological evaluation of d2, an algorithm for high-performance sequence comparison. J Comput Biol. 1994;1:199–215. doi: 10.1089/cmb.1994.1.199. [DOI] [PubMed] [Google Scholar]
  7. Burke J, Davison D, Hide W. d2_cluster: a validated method for clustering EST and full-length cDNAsequences. Genome Res. 1999;9:1135–42. doi: 10.1101/gr.9.11.1135. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Christoffels A, van Gelder A, Greyling G, Miller R, Hide T, Hide W. STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res. 2001;29:234–8. doi: 10.1093/nar/29.1.234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Carpenter JE, Christoffels A, Weinbach Y, Hide WA. Assessment of the parallelization approach of d2 cluster for high-performance sequence clustering. J Comput Chem. 2002;23:755–7. doi: 10.1002/jcc.10025. [DOI] [PubMed] [Google Scholar]
  10. Lippert RA, Huang H, Waterman MS. Distributional regimes for the number of k-word matches between two random sequences. Proc Natl Acad Sci U S A. 2002;99:13980–9. doi: 10.1073/pnas.202468099. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Kantorovitz MR, Booth HS, Burden CJ, Wilson SR. Asymptotic behavior of k-word matches between two uniformly distributed sequences. preprint. 2006.
  12. Melko OM, Mushegian AR. Distribution of words with a predefined range of mismatches to a DNA probe in bacterial genomes. Bioinformatics. 2004;20:67–74. doi: 10.1093/bioinformatics/btg374. [DOI] [PubMed] [Google Scholar]
  13. Burden CJ, Kantorovitz MR, Wilson SR. Approximate word matches between two random sequences. preprint. 2006. [DOI] [PMC free article] [PubMed]
  14. Waterman MS. Introduction to Computational Biology. Chapman and Hall; 1995. [Google Scholar]
  15. Barbour A, Chryssaphinou O. Compound Poisson approximation: a user guide. Annals of Applied Probability. 2001;11:964–1002. doi: 10.1214/aoap/1015345355. [DOI] [Google Scholar]
  16. Wu TJ, Huang YH, Li LA. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics. 2005;21:4125–32. doi: 10.1093/bioinformatics/bti658. [DOI] [PubMed] [Google Scholar]
  17. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980;16:111–20. doi: 10.1007/BF01731581. [DOI] [PubMed] [Google Scholar]
  18. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000;7:203–14. doi: 10.1089/10665270050081478. [DOI] [PubMed] [Google Scholar]
  19. Source code for k-words http://dayhoff.anu.edu.au/~sf/k_words.
  20. Conover WJ. Practical Nonparametric Statistics. John Wiley and Sons; 1999. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Additional file 1

Word size optimization for exact word matches, uniform sequences

Click here for file (15.5KB, eps)
Additional file 2

Word size optimization for exact word matches, non-uniform sequences

Click here for file (13.2KB, eps)
Additional file 3

Word size optimization for approximate word matches with 1 mismatch, non-uniform sequences

Click here for file (13.4KB, eps)
Additional file 4

Word size optimization for approximate word matches with 2 mismatch, non-uniform sequences

Click here for file (14.4KB, eps)
Additional file 5

Word size optimization for approximate word matches with 3 mismatch, uniform sequences

Click here for file (17.4KB, eps)
Additional file 6

Word size optimization for approximate word matches with 3 mismatch, non-uniform sequences

Click here for file (14.7KB, eps)
Additional file 7

Word size optimization for approximate word matches with 4 mismatch, non-uniform sequences

Click here for file (14.5KB, eps)
Additional file 8

Word size optimization for approximate word matches with 5 mismatch, non-uniform sequences

Click here for file (14.4KB, eps)
Additional file 9

Word size optimization for exact and approximate word matches, sequences evolving according to a Kimura80 substitution model. The dotted lines show the results obtained with blast.

Click here for file (20KB, eps)

Articles from BMC Bioinformatics are provided here courtesy of BMC

RESOURCES