Journal of Computational Biology. 2010 Dec;17(12):1697–1709. doi: 10.1089/cmb.2008.0233

Importance Sampling of Word Patterns in DNA and Protein Sequences

Hock Peng Chan 1,*, Nancy Ruonan Zhang 3,*, Louis HY Chen 2
PMCID: PMC3787731  PMID: 21128856

Abstract

Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices (PSWMs), and co-occurrences of pairs of motifs.

Key words: importance sampling, Monte Carlo, motifs, palindromes, position-specific weight matrices

1. Introduction

Searching for matches to a word pattern, also called a motif, is an important task in computational biology. The word pattern usually represents a functional site, such as a transcription factor binding site (TFBS) in a promoter region of a DNA sequence or a ligand docking site in a protein sequence. Statistical significance of over-representation of these word patterns provides valuable clues to biologists. Consequently, much work has been done on the use of asymptotic limiting distributions to approximate these p-values (Prum et al., 1995; Reinert et al., 2000; Régnier, 2000; Robin et al., 2002; Huang et al., 2004; Leung et al., 2005; Mitrophanov and Borodovsky, 2006; Pape et al., 2008). However, the approximations may not be accurate for short words or for words consisting of repeats and most theoretical approximations work only in specific settings. String-based recursive methods can provide exact p-values (Gusfield, 1997), but they can be computationally expensive when the number of words in the word pattern is large.

Direct Monte Carlo algorithms for estimating p-values of word patterns are easy to implement but are inefficient for the estimation of very small p-values, because in such cases, almost all the simulated sequences do not contain the required number of word patterns. We propose in this article importance sampling algorithms that insert the desired word patterns, either randomly or controlled by a hidden Markov model, on the simulated sequences. The algorithms are described in Section 2 and are illustrated on several word patterns of biological interest: palindromes and inverted repeats in Section 3, high-scoring words with respect to position-specific weight matrices (PSWMs) in Section 4, and co-occurrences of motifs in Section 5. Numerical results show that variance reduction of several orders of magnitude is achieved when applying the proposed importance sampling algorithms to small p-values. The technical details are consolidated in Appendices A–D and include a proof of the asymptotic optimality of the importance sampling algorithms (see Appendix D).

2. Importance Sampling Of Word Patterns

2.1. Word counting

Let |B| denote the number of elements in a set B. By selecting randomly from a finite set B, we shall mean that each b ∈ B has probability |B|−1 of being selected. For any two sequences v = v1 ⋯ vm and u = u1 ⋯ ur, the notation vu shall denote the concatenated sequence v1 ⋯ vmu1 ⋯ ur. We also denote the length of v by ℓ(v) (= m here). Although we assume implicitly an alphabet χ = {a, c, g, t}, representing the four nucleotide bases of DNA sequences, the algorithms can be applied on any countable alphabet, for example the alphabet of 20 amino acids in protein sequences.

We will represent the word pattern of interest by a finite set of words V. Let s = s1 ⋯ sn denote a sequence of DNA bases under investigation and let Nm be the maximum number of non-overlapping words from V in s1 ⋯ sm. We say that there exists a word in V at the end of sm if sm−j+1 ⋯ sm ∈ V for some j > 0. Moreover, the smallest such j is the length of the shortest word at the end of sm. We have the recursive relations, for m ≥ 1,

Nm = max{Nm−1, Nm−j + 1} if there is a word in V at the end of sm, with j the length of the shortest such word, and Nm = Nm−1 otherwise, (2.1)

with the initialization N0 = 0. We denote Nn simply by N. It is also possible to modify (2.1) to handle the counting of possibly overlapping words.
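The recursion (2.1) with initialization N0 = 0 can be sketched as follows; this is a minimal illustrative rendering, with function and variable names that are ours rather than the paper's.

```python
# Sketch of the counting recursion (2.1): N[m] is the maximum number of
# non-overlapping words from `words` in s[0:m].  Names are illustrative.
def count_nonoverlapping(s, words):
    lengths = sorted({len(w) for w in words})
    N = [0] * (len(s) + 1)              # N[0] = 0 initializes the recursion
    for m in range(1, len(s) + 1):
        N[m] = N[m - 1]                 # case: no word ends at position m
        for j in lengths:               # smallest j = shortest word at the end
            if j <= m and s[m - j:m] in words:
                N[m] = max(N[m], N[m - j] + 1)
                break
    return N[len(s)]
```

Because Nm is non-decreasing in m, only the shortest word ending at each position needs to be considered, which is why the inner loop can stop at the first match.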

2.2. Monte Carlo evaluation of statistical significance

We begin by describing direct Monte Carlo. To evaluate the significance of observing c word patterns in an observed sequence s, we generate independent copies of the sequence from a Markov chain with transition probabilities estimated either from s or from a local neighborhood of s. The proportion of times {N ≥ c} occurs among the independent copies of s is then the direct Monte Carlo estimate of the p-value pc := P{N ≥ c}.
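As a sketch, direct Monte Carlo for p1 = P{N ≥ 1} reads as follows; pi, trans and words are our illustrative stand-ins for the estimated stationary distribution, transition probabilities, and word pattern.

```python
import random

# Direct Monte Carlo sketch for p1 = P{N >= 1}: generate K copies of the
# sequence from the fitted chain and report the hit frequency.
def direct_mc_p1(n, pi, trans, words, K, seed=0):
    rng = random.Random(seed)
    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]
    hits = 0
    for _ in range(K):
        s = [draw(pi)]                       # s1 from the stationary distribution
        for _ in range(n - 1):
            s.append(draw(trans[s[-1]]))     # Markov steps s2, ..., sn
        if any(w in "".join(s) for w in words):   # N >= 1 iff some word occurs
            hits += 1
    return hits / K
```

For very small p1, almost every generated copy misses the event {N ≥ 1}, which is exactly the inefficiency the importance sampling algorithms below are designed to remove.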

It is quite common for many sequences to be analyzed simultaneously. Hence to correct for the effect of multiple comparisons, a very small p-value is required for any one sequence before statistical significance can be concluded. Direct Monte Carlo is well-known to be very inefficient for estimating small probabilities in general and many importance sampling schemes have been proposed to overcome this drawback, for example, in sequential analysis (Siegmund, 1976), communication systems (Cottrell et al., 1983), bootstrapping (Johns, 1988; Do and Hall, 1992), signal detection (Lai and Shan, 1999), moderate deviations (Fuh and Hu, 2004), and scan statistics (Chan and Zhang, 2007). In this article, we provide change of measures that are effective for the importance sampling of word patterns.

For ease of exposition, assume that the background sequence of bases follows a first-order Markov chain with positive transition probabilities

p(x, y) = P{si+1 = y | si = x} > 0, x, y ∈ χ. (2.2)

Let π be the stationary distribution, and let σ(v) = ∏i=2ℓ(v) p(vi−1, vi) for a word v = v1 ⋯ vℓ(v). Before executing the importance sampling algorithms, we first create a word bank of the desired word pattern, with each word in the word bank taking the value v ∈ V with probability q(v) > 0. The procedure for the selection of q and the construction of the word banks will be elaborated in Sections 3–5. For completeness, we define q(v) = 0 when v ∉ V. Let β(v) = q(v)/σ(v). For ease of computation, we shall generate a dummy variable s0 before generating s and denote the extended sequence s0s1 ⋯ sn by s0. The first importance sampling algorithm, for the estimation of p1 only, is as follows.

Algorithm A (forc = 1):

  1. Select a word v randomly from the word bank. Hence the word takes the value v with probability q(v).

  2. Select i0 randomly from {0, 1, …, n − ℓ(v)}.

  3. Generate s0 from the stationary distribution and s1, …, si0 sequentially from the underlying Markov chain. Let si0+1 ⋯ si0+ℓ(v) = v and generate si0+ℓ(v)+1, …, sn sequentially from the underlying Markov chain.

Recall that β(v) = 0 for v ∉ V. Then

L(s0) = Σ{(i, v): 0 ≤ i ≤ n − ℓ(v), si+1 ⋯ si+ℓ(v) = v} β(v)/[(n − ℓ(v) + 1) p(si, v1)] (2.3)

is the likelihood ratio of generating s0 from Algorithm A and from the underlying Markov chain (with no insertion of word patterns). If Algorithm A is run independently K times, with the kth copy of s0 denoted by s0(k), then

p̂c = K−1 Σk=1K 1{N(s0(k)) ≥ c}/L(s0(k)) (2.4)

is unbiased for pc. The indicator 1{N(s0(k)) ≥ 1} is superfluous when using Algorithm A since at least one word pattern from V is generated in every copy of s0.
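Steps 1–3 and the reweighting can be sketched as follows. This is our illustrative rendering: word_bank, q, pi and trans are assumed inputs, and the likelihood-ratio expression in the code is derived directly from the ratio of the Algorithm A density to the plain-chain density, playing the role of (2.3).

```python
import random

# Sketch of Algorithm A (c = 1).  Returns the generated sequence and the
# weight 1/L(s0); averaging the weights over K runs estimates p1.
def sigma(v, trans):                       # product of transitions along v
    out = 1.0
    for a, b in zip(v, v[1:]):
        out *= trans[a][b]
    return out

def algorithm_a_sample(n, pi, trans, word_bank, q, rng):
    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]
    v = rng.choices(word_bank, weights=[q[w] for w in word_bank])[0]
    i0 = rng.randrange(n - len(v) + 1)     # uniform insertion point
    s = [draw(pi)]                         # dummy s0 from stationarity
    for _ in range(i0):
        s.append(draw(trans[s[-1]]))
    s.extend(v)                            # controlled insertion of the word
    while len(s) < n + 1:
        s.append(draw(trans[s[-1]]))
    L = 0.0                                # sum over all compatible (i, w)
    for w, qw in q.items():
        l = len(w)
        for i in range(n - l + 1):
            if "".join(s[i + 1:i + 1 + l]) == w:
                L += qw / ((n - l + 1) * trans[s[i]][w[0]] * sigma(w, trans))
    return "".join(s), 1.0 / L
```

On a toy two-letter chain this reproduces the exact probability of seeing the word at least once, which is a convenient unbiasedness check.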

We restrict Algorithm A to c = 1 because the random insertion of more than one word pattern into the simulated sequence can result in a likelihood ratio that is hard to compute. To handle more general c, we use a hidden Markov model device in Algorithm B below, with hidden states Xi taking either value 0 (do not insert word pattern) or 1 (insert word pattern), so that the likelihood ratio can be computed recursively. Let

ρi = P{Xi = 1} (2.5)

be the word insertion probability at position i + 1 along the DNA sequence. For example, the user can simply select ρi = c/n for all i so that approximately c word patterns are inserted in each generated sequence s0. Each copy of s0 is generated in the following manner.

Algorithm B (for c ≥ 1):

  1. Let i = 0, generate s0 from the stationary distribution and X0 satisfying (2.5).

  2. (a) If Xi = 1, select a word v randomly from the word bank. If ℓ(v) ≤ n − i, that is, if the word v can fit into the remaining sequence, let si+1 ⋯ si+ℓ(v) = v, generate Xi+ℓ(v) according to (2.5), increment i by ℓ(v) and go to step 3.

  (b) If the word selected in 2(a) cannot fit into the remaining sequence or if Xi = 0, generate si+1 from the underlying Markov chain and Xi+1 satisfying (2.5). Increment i by 1 and go to step 3.

  3. If i < n, repeat step 2. Otherwise, end the recursion.

Let Li be the likelihood ratio of generating s0 ⋯ si from Algorithm B and from the underlying Markov chain. Let γj be the probability that a randomly chosen word from the word bank has length not exceeding j. Then

graphic file with name M36.gif (2.6)

with Li = 0 for i ≤ 0.

The estimator (2.4), with L = Ln, is unbiased if and only if all configurations of s0 satisfying N ≥ c can be generated via Algorithm B. To ensure this, it suffices for us to impose the constraint

graphic file with name M37.gif (2.7)

so that we do not force the insertion of too many word patterns.
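The generation phase of Algorithm B, with the constant choice ρi = c/n, can be sketched as follows; word_bank, q, pi and trans are illustrative names, and the likelihood ratio accumulated via the recursion (2.6) is omitted here.

```python
import random

# Generation phase of Algorithm B: hidden states X decide whether the
# next stretch is an inserted word or an ordinary Markov step.
def algorithm_b_sequence(n, c, pi, trans, word_bank, q, rng):
    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]
    rho = c / n                        # constant insertion probability (2.5)
    s = [draw(pi)]                     # dummy s0 from the stationary distribution
    i = 0
    X = rng.random() < rho             # X0
    while i < n:
        inserted = False
        if X:
            v = rng.choices(word_bank, weights=[q[w] for w in word_bank])[0]
            if len(v) <= n - i:        # step 2(a): the word fits
                s.extend(v)
                i += len(v)
                inserted = True
        if not inserted:               # step 2(b): ordinary Markov step
            s.append(draw(trans[s[-1]]))
            i += 1
        X = rng.random() < rho         # next hidden state, per (2.5)
    return "".join(s)
```

With ρi = c/n, roughly c words are inserted per copy, which is what makes the event {N ≥ c} frequent under the sampling measure.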

3. Palindromic Patterns and Inverted Repeats

Masse et al. (1992) reported clusters of palindromic patterns near origins of replication of viruses. There has been much work done to estimate their significance, for example, using Poisson and compound Poisson approximations (Leung et al., 1994, 2005). The four nucleotides can be divided into two complementary base pairs, with a and t forming one pair and c and g forming the second pair. We denote this relation by writing ac = t, tc = a, cc = g and gc = c. For a word v = v1 ⋯ vℓ, we define its complement vc = v1c ⋯ vℓc. A palindromic pattern of length ℓ = 2m is a DNA sequence that can be expressed in the form u1 ⋯ umumc ⋯ u1c. For example, v = acgcgt is a palindromic pattern. Note that the complement of v, that is, the word obtained by replacing each letter of v by its complement, is tgcgca, which is just v read backwards. This interesting property explains the terminology “palindromic pattern.”

Inverted repeats can be derived from palindromic patterns by inserting a DNA sequence of length d in the exact middle of the pattern. The class of word patterns for inverted repeats can be expressed in the form

{u1 ⋯ umz1 ⋯ zdumc ⋯ u1c : ui, zj ∈ χ, d1 ≤ d ≤ d2}, (3.1)

with 0 ≤ d1 ≤ d2. When d1 = d2 = 0, then (3.1) is the class of all palindromic patterns of length 2m.

The construction of word banks for palindromic patterns is straightforward; it all boils down to generating um = u1 ⋯ um in some suitable manner. We advocate generating um with probability proportional to the chain probability of either the half-pattern or the full palindromic pattern, and show how this can be done in Appendix A.

Having a word bank for palindromic patterns allows us to create a word bank for inverted repeats easily. The procedure is as follows.

  1. Select u1 ⋯ umumc ⋯ u1c randomly from a word bank of palindromic patterns and d randomly from {d1, …, d2}.

  2. Let z0 = um and generate z1, …, zd sequentially from the underlying Markov chain.

  3. Store the word u1 ⋯ umz1 ⋯ zdumc ⋯ u1c into the word bank for inverted repeats.

This procedure allows γj, see (2.6), to be computed easily. In particular, since a stored word with gap length d has total length 2m + d, γj = (j − 2m − d1 + 1)/(d2 − d1 + 1) for 2m + d1 ≤ j ≤ 2m + d2, γj = 0 for j < 2m + d1 and γj = 1 for j > 2m + d2.
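Steps 1–3 can be sketched as follows, assuming a word bank of palindromic patterns is already available; palindrome_bank, trans, d1 and d2 are illustrative names.

```python
import random

COMP = {"a": "t", "t": "a", "c": "g", "g": "c"}

# One word for the inverted-repeat word bank: split a palindromic pattern
# u1..um um^c..u1^c in the middle and insert a chain-generated gap z1..zd.
def inverted_repeat_word(palindrome_bank, d1, d2, trans, rng):
    w = rng.choice(palindrome_bank)        # palindromic pattern of length 2m
    m = len(w) // 2
    u, ucomp = w[:m], w[m:]                # the two complementary halves
    assert all(COMP[a] == b for a, b in zip(u, reversed(ucomp)))  # sanity check
    d = rng.randrange(d1, d2 + 1)          # gap length, uniform on {d1,...,d2}
    z, prev = [], u[-1]                    # z0 = um starts the gap chain
    for _ in range(d):
        keys = list(trans[prev])
        prev = rng.choices(keys, weights=[trans[prev][y] for y in keys])[0]
        z.append(prev)
    return u + "".join(z) + ucomp
```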

4. Position-Specific Weight Matrix

PSWMs are commonly used to represent fixed-length word patterns, or motifs, that transcription factors bind onto; these motifs usually range from four to twenty bases long. Databases such as TRANSFAC, JASPAR, and SCPD curate PSWMs for families of transcription factors. For example, the PSWM for the SWI5 transcription factor in the yeast genome (Zhu and Zhang, 1999) is

graphic file with name M48.gif (4.1)

Let wi(v) denote the entry in a PSWM that corresponds to base v at column i and let m be the number of columns in the PSWM. For any word vm (of length m), a score

S(vm) = w1(v1) + ⋯ + wm(vm)

is computed, and words with high scores are of interest. We let V be the set of all vm with score not less than a pre-specified threshold level t. In other words,

V = {vm : S(vm) ≥ t} (4.2)

is a set of motifs for the PSWM associated with a given transcription factor. The matrix is derived from the frequencies of the four bases at various positions of known instances of the TFBS, which are usually confirmed by biological experiments. Huang et al. (2004) provide a good review of the construction of PSWMs.
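As an illustration, scoring a word against a PSWM stored as a list of per-column score dictionaries (an assumed layout, not the paper's) and testing membership in the motif set (4.2) is straightforward:

```python
# Score S(v) = w1(v1) + ... + wm(vm) of a word against a PSWM given as a
# list of per-column dictionaries mapping each base to its column score.
def pswm_score(v, W):
    return sum(col[base] for col, base in zip(W, v))

def in_motif(v, W, t):
    return pswm_score(v, W) >= t       # membership in the motif set (4.2)
```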

In principle, we can construct a word bank for V by simply generating words of length m from the underlying Markov chain and discarding words that do not belong to the motif. However, for large t, such a procedure involves discarding a large proportion of the generated words. It is more efficient to generate words with a bias towards larger scores. In Appendix B, we show how, for any given θ > 0, a tilted Markov chain can be constructed to generate words v with probability mass function

qθ(v) = eθS(v)π(v1)σ(v)/Λ(θ), (4.3)

where Λ(θ) is a computable normalizing constant. If words with scores less than t are discarded, then the probability mass function of non-discarded words is

q(v) = ξqθ(v), v ∈ V, (4.4)

where ξ is an unknown normalizing constant that can be estimated by the reciprocal of the fraction of non-discarded words. There are two conflicting demands placed on the choice of θ. As θ increases, the expected score of words generated under qθ(v) increases. We would thus like θ to be large so that the fraction of discarded words is small. However, at the same time, we would also like θ to be small, so that the variation of β(v) = q(v)/σ(v) over V is small. Since

graphic file with name M56.gif (4.5)

we suggest choosing θ to be the root of the equation Eθ[S(v)] = t. See Appendix B for more details on the computation of Λ(θ) and the numerical search for the root.

4.1. Example 1

We illustrate here the need for alternatives to analytical p-value approximations by applying Algorithm A to some special word patterns. Let Pπ denote probability with v1 following the stationary distribution π. Huang et al. (2004) suggested an approximation, which for c = 1 reduces to

graphic file with name M58.gif (4.6)

Consider s1, …, sn independent and identically distributed random variables taking values a, c, g and t with equal probabilities. Let

graphic file with name M60.gif (4.7)
graphic file with name M61.gif (4.8)

and consider counting of words with score at least t for t = 9, 10 and 11. The approximation (4.6) is the same for both (4.7) and (4.8) but we know that the p-value when the PSWM is (4.7) should be smaller due to the tendency of the word patterns to clump together. Of course, declumping corrections can be applied to this special case but this is not so straightforward for general PSWMs. Table 1 compares the analytical, direct Monte Carlo and importance sampling approximations of P{N ≥ 1} for (4.7) and (4.8) with n = 200. The simulations reveal substantial over-estimation of p-values for Wrep when using (4.6). Algorithm A is able to maintain its accuracy over the range of t considered whereas direct Monte Carlo has acceptable accuracy only for t = 9.

Table 1.

Comparisons of Analytical, Direct Monte Carlo, and Importance Sampling Approximations for P{N ≥ 1} with n = 200 in Example 1

t 9 10 11
Analytical 7.1 × 10−2 7.1 × 10−3 4.2 × 10−4
Wrep
 Direct MC (3.6 ± 0.6) × 10−2 (5 ± 2) × 10−3 0
 Algorithm A (3.0 ± 0.1) × 10−2 (4.0 ± 0.2) × 10−3 (2.7 ± 0.1) × 10−4
Wnorep
 Direct MC (6.7 ± 0.8) × 10−2 (9 ± 3) × 10−3 (1 ± 1) × 10−3
 Algorithm A (7.5 ± 0.2) × 10−2 (6.9 ± 0.2) × 10−3 (4.1 ± 0.1) × 10−4

Each Monte Carlo entry is obtained using 1000 simulation runs and is expressed in the form estimate ± standard error.

4.2. Example 2

We implement Algorithm B here with

graphic file with name M63.gif (4.9)

where x+ = max{0, x}. We choose ρi in this manner to encourage word insertion when there are few bases left to be generated and the desired number of word patterns has not yet been observed. The motif consists of all words of length 12 having score at least 50 with respect to the PSWM (4.1). The transition matrix for generating the DNA sequence is

graphic file with name M64.gif (4.10)

and the length of the sequence investigated is n = 700. We see from Table 2 a variance reduction of 10–100 times in the simulation of probabilities of order 10−1 to 10−3. For smaller probabilities, direct Monte Carlo does not provide an estimate, whereas estimates from the importance sampling algorithm retain their accuracy. Although importance sampling takes about twice the computing time of direct Monte Carlo for each simulation run, the savings in computing time to achieve the same level of accuracy are quite substantial.

Table 2.

Estimate ± Standard Error for Example 2 with 1000 Copies of s0 Generated for Both Direct Monte Carlo and Importance Sampling Using Algorithm B

c Direct MC Algorithm B
1 (9.6 ± 0.9) × 10−2 (9.1 ± 0.3) × 10−2
2 (3 ± 2) × 10−3 (4.2 ± 0.2) × 10−3
3 0 (1.3 ± 0.1) × 10−4
4 0 (2.6 ± 0.3) × 10−6

5. Co-Occurrences of Motifs

For a more detailed sequence analysis of promoter regions, one can search for cis-regulatory modules (CRMs) instead of single motifs. We define a CRM to be a collection of fixed-length motifs that are located in a fixed order in proximity to each other. They are signals for co-operative binding of transcription factors, and are important in the study of combinatorial regulation of genes. CRMs have been used successfully to gain a deeper understanding of gene regulation (Chiang et al., 2003; Zhou and Wong, 2004; Zhang et al., 2007). We focus here on the simplest type of CRM: a co-occurring pair of high-scoring words separated by a gap sequence of variable length. Let S1(·) be the score of a word of length m calculated with respect to a PSWM W1, and S2(·) the score of a word of length r calculated with respect to a PSWM W2. Let 0 ≤ d1 < d2 < ∞ be the prescribed limits of the length of the gap and t1, t2 threshold levels for W1 and W2, respectively. The family of words for the co-occurring motifs is

V = {vmz1 ⋯ zdur : S1(vm) ≥ t1, S2(ur) ≥ t2, d1 ≤ d ≤ d2, zi ∈ χ} (5.1)

In Section 4, we showed how word banks for the motifs V1 = {vm : S1(vm) ≥ t1} and V2 = {ur : S2(ur) ≥ t2} are created. Let qi be the probability mass function for the word bank of Vi. A word bank for V can then be created by repeating the following steps.

  1. Select vm and ur independently from their respective word banks.

  2. Select d randomly from {d1, …, d2}. Generate z1, …, zd sequentially from the underlying Markov chain, initialized at z0 = vm.

  3. Store w = vmzdur into the word bank.

Let q be the probability mass function of the stored words. Then

q(w) = (d2 − d1 + 1)−1q1(vm)q2(ur)σ(z0z1 ⋯ zd), where z0 = vm, (5.2)

and hence β(w) = q(w)/σ(w) = (d2 − d1 + 1)−1β1(vm)β2(ur)/p(zd, u1).

5.1. Example 3

The transcription factors SFF (with PSWM W1) and MCM1 (with PSWM W2) are regulators of the cell cycle in yeast, and are known to co-operate at close distance in the promoter regions of the genes they regulate (Spellman et al., 1998). Their PSWMs can be obtained from the database SCPD. Define V by (5.1) with t1 = 48, t2 = 110, d1 = 0 and d2 = 100. We would like to estimate the probability that a word from V appears at least once within a promoter sequence of length n = 700. The estimated probability using Algorithm A is 3.4 × 10−3, with a standard error of 3 × 10−4. The corresponding standard error for 1000 direct Monte Carlo runs would have been about 2 × 10−3, which is large relative to the underlying probability.

5.2. Structured motifs

The structured motifs considered in Robin et al. (2002) consist essentially of fixed word patterns xm and yr separated by a variable-length gap, with an allowance for the mutation of up to one base in xmyr. The motif can be expressed as

V = {vmz1 ⋯ zdur : d1 ≤ d ≤ d2, zi ∈ χ, and vmur differs from xmyr in at most one base} (5.3)

We create a word for the word bank of V in the following manner.

  1. Select k randomly from {0, 1, …, m + r}. If k = 0, then there is no mutation and we let vmur = xmyr. Otherwise, change the kth base of xmyr equally likely into one of the three other bases and denote the mutated sequence as vmur.

  2. Select d randomly from {d1, …, d2} and generate the bases of z1, …, zd sequentially from the underlying Markov chain, initialized at z0 = vm.
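The two steps above can be sketched as follows; x, y, trans, d1 and d2 are illustrative names, and z0 = vm means the gap chain starts from the last base of the (possibly mutated) first word.

```python
import random

# One word for the structured-motif word bank (5.3): the core x + y is
# mutated in at most one position (k = 0 means no mutation), then a
# chain-generated gap is inserted between the two halves.
def structured_motif_word(x, y, d1, d2, trans, rng):
    core = x + y
    k = rng.randrange(len(core) + 1)       # uniform on {0, 1, ..., m + r}
    if k > 0:                              # mutate the kth base of xy
        others = [b for b in "acgt" if b != core[k - 1]]
        core = core[:k - 1] + rng.choice(others) + core[k:]
    v, u = core[:len(x)], core[len(x):]
    d = rng.randrange(d1, d2 + 1)          # gap length, uniform on {d1,...,d2}
    z, prev = [], v[-1]                    # z0 = vm starts the gap chain
    for _ in range(d):
        keys = list(trans[prev])
        prev = rng.choices(keys, weights=[trans[prev][y_] for y_ in keys])[0]
        z.append(prev)
    return v + "".join(z) + u
```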

We perform a simulation study on eight structured motifs selected for their high frequency of occurrence in part of the Bacillus subtilis DNA dataset. We consider (d1, d2) = (16, 18) and (5, 50), with the length of the DNA sequence n = 100, and a Markov chain with transition matrix

graphic file with name M81.gif

In Table 3, we compare importance sampling estimates of P{N ≥ 1} using Algorithm A with analytical p-value estimates from Robin et al. (2002) and direct Monte Carlo p-value estimates. The analytical p-value estimates are computed numerically via recursive methods with computation time that grows exponentially with d2 − d1, and are displayed only for the case (d1, d2) = (16, 18).

Table 3.

Comparison of Direct Monte Carlo, Importance Sampling, and Analytical Estimates of P{N ≥ 1} for Structured Motifs

d1 d2 x y Direct MC Algorithm A Analytic
16 18 gttgaca atataat (2 ± 1) × 10−4 (1.038 ± 0.006) × 10−4 1.01 × 10−4
    gttgaca tataata 0 (9.00 ± 0.05) × 10−5 8.82 × 10−5
    tgttgac tataata (20 ± 10) × 10−5 (9.39 ± 0.05) × 10−5 9.20 × 10−5
    ttgaca ttataat (9 ± 3) × 10−4 (6.65 ± 0.03) × 10−4 6.55 × 10−4
    ttgacaa tacaat (4 ± 2) × 10−4 (4.64 ± 0.02) × 10−4 4.57 × 10−4
    ttgacaa tataata (2 ± 1) × 10−4 (1.798 ± 0.009) × 10−4 1.78 × 10−4
    ttgacag tataat (5 ± 2) × 10−4 (3.62 ± 0.02) × 10−4 3.59 × 10−4
    ttgacg tataat (10 ± 3) × 10−4 (9.90 ± 0.06) × 10−4 9.76 × 10−4
    combined p-value (2.0 ± 0.4) × 10−3 (2.96 ± 0.03) × 10−3  
5 50 gttgaca atataat (1 ± 0.3) × 10−3 (1.265 ± 0.008) × 10−3  
    gttgaca tataata (0.4 ± 0.2) × 10−3 (1.103 ± 0.007) × 10−3  
    tgttgac tataata (1.8 ± 0.4) × 10−3 (1.150 ± 0.007) × 10−3  
    ttgaca ttataat (7.4 ± 0.9) × 10−3 (7.88 ± 0.05) × 10−3  
    ttgacaa tacaat (5.0 ± 0.7) × 10−3 (5.50 ± 0.04) × 10−3  
    ttgacaa tataata (1.5 ± 0.4) × 10−3 (2.21 ± 0.01) × 10−3  
    ttgacag tataat (3.1 ± 0.6) × 10−3 (4.23 ± 0.03) × 10−3  
    ttgacg tataat (0.9 ± 0.1) × 10−2 (1.126 ± 0.008) × 10−2  
    combined p-value (2.7 ± 0.2) × 10−2 (3.30 ± 0.04) × 10−2  

For both direct Monte Carlo and importance sampling, 10,000 simulation runs are executed for each entry and the results are displayed in the form estimate ± standard error.

We illustrate here how the importance sampling algorithms can be modified to handle more complex situations, for example, to obtain a combined p-value for all eight motifs. Consider more generally p = P{max1≤j≤J(N(j) − cj) ≥ 0}, where N(j) is the total word count from the motif V(j) and cj is a positive integer. Let L(j) be the likelihood ratio when applying either Algorithm A or B with insertion of words from V(j). For the kth simulation run, we execute the following steps.

  1. Select jk randomly from {1, …, J}.

  2. Generate s0(k) using either Algorithm A or B, with insertion of words from V(jk).

Then

p̂ = K−1 Σk=1K J 1{N(jk)(s0(k)) ≥ cjk} / [L(jk)(s0(k)) Mk], where Mk = Σj=1J 1{N(j)(s0(k)) ≥ cj}, (5.4)

is unbiased for p (see Appendix C). The key feature in (5.4) is the correction term Mk−1. Without this term, p̂ is an unbiased estimator of the Bonferroni upper bound Σj=1J P{N(j) ≥ cj}. The correction term adjusts the estimator downwards when more than one threshold cj is exceeded.

We see from Table 3 that the variance reduction is substantial when importance sampling is used. In fact, the direct Monte Carlo estimate is often unreliable. Such savings in computation time are valuable both to the end user and to the researcher trying to test the reliability of his or her analytical estimates of small p-values. We observe, for example, that the numerical estimates for (d1, d2) = (16, 18) given in Robin et al. (2002) are quite accurate but tend to underestimate the true underlying probability.

6. Discussion

The examples given here are not meant to be exhaustive, but they do indicate how we can proceed in situations not covered here. For example, if we would like the order of the two words in a CRM to be arbitrary, we can include an additional permutation step in the construction of the word bank. In Section 5.2, we also showed how to simulate p-values of the maximum count over a set of word patterns. As we gain biological understanding, the models that we formulate for DNA and protein functional sites become more complex. Over the years, they have evolved from deterministic words to consensus sequences, to PSWMs, and then to motif modules. As probabilistic models for promoter architecture get more complex and context-specific, importance sampling methods are likely to be more widely adopted in the computation of p-values.

7. Appendix

A. Generating palindromes and inverted repeats

We first show how words vm can be generated with probability mass function

graphic file with name M92.gif

with Inline graphic a computable normalizing constant. Apply the backward recursive relations

graphic file with name M94.gif (A.1)

initialized with ηm(x) = 1 for all x. Then Inline graphic. Let Q be the desired probability measure for generating vm with probability mass function q. Then the Markovian property

graphic file with name M96.gif (A.2)

allows us to generate vi sequentially via transition matrices.

To generate words vm with probability mass function Inline graphic, let ηm(x) = σ(xxc) instead of ηm(x) = 1 and proceed with (A.1) and (A.2).

B. Generating high-scoring motifs from PSWMs

Let S be the score with respect to a given PSWM W and let θ > 0. We provide here a quick recursive algorithm for generating vm from the probability mass function

qθ(vm) = eθS(vm)π(v1)σ(vm)/Λ(θ), (A.3)

with Λ(θ) = Σvm eθS(vm)π(v1)σ(vm) a computable normalizing constant. Since log Λ(θ) is convex, the solution of Eθ[S(v)] = t can be found using a bisection search. We take note of the backward recursive relations

ηi(x) = Σy∈χ eθwi+1(y)p(x, y)ηi+1(y), i = m − 1, …, 1, initialized with ηm(x) = 1, (A.4)

from which we can compute Λ(θ) = Σx∈χ π(x)eθw1(x)η1(x). Let Q denote the desired probability measure for generating vm from qθ. By (A.3) and (A.4), we can simply generate the letters vi sequentially, using transition matrices defined by the Markovian relations

Q{v1 = x} = π(x)eθw1(x)η1(x)/Λ(θ), Q{vi+1 = y | vi = x} = eθwi+1(y)p(x, y)ηi+1(y)/ηi(x). (A.5)
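One natural implementation of the backward recursion and sequential generation described in this appendix is the following sketch; the names, the alphabet, and the PSWM layout (a list of per-column score dictionaries) are our assumptions.

```python
import math, random

ALPHA = "acgt"

# Backward weights eta_i, normalizing constant Lambda(theta), and a
# sequential sampler for the tilted distribution q_theta.
def tilted_sampler(W, theta, pi, trans):
    m = len(W)
    eta = [{x: 1.0 for x in ALPHA} for _ in range(m + 1)]  # eta[m] = 1
    for i in range(m - 1, 0, -1):                          # backward pass
        for x in ALPHA:
            eta[i][x] = sum(math.exp(theta * W[i][y]) * trans[x][y] * eta[i + 1][y]
                            for y in ALPHA)
    lam = sum(pi[x] * math.exp(theta * W[0][x]) * eta[1][x] for x in ALPHA)

    def draw(rng):
        probs = [pi[x] * math.exp(theta * W[0][x]) * eta[1][x] / lam for x in ALPHA]
        v = [rng.choices(ALPHA, weights=probs)[0]]         # first letter
        for i in range(1, m):                              # remaining letters
            wts = [math.exp(theta * W[i][y]) * trans[v[-1]][y] * eta[i + 1][y]
                   for y in ALPHA]
            v.append(rng.choices(ALPHA, weights=wts)[0])
        return "".join(v)
    return draw, lam
```

Setting θ = 0 recovers the underlying chain with Λ(0) = 1, a convenient sanity check; large θ concentrates the draws on high-scoring words.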

C. Unbiasedness of the estimator in (5.4)

We shall show here that the estimator p̂ in (5.4) is unbiased for p = P{max1≤j≤J(N(j) − cj) ≥ 0}. Let Aj = {s0 : N(j)(s0) ≥ cj} and let Qj be a probability measure such that L(j)(s0) = Qj(s0)/P(s0) > 0 for any s0 ∈ Aj. Let M(s0) = Σj=1J 1{s0 ∈ Aj}. Then, with the convention 0/0 = 0,

graphic file with name M108.gif

and hence p̂ is indeed unbiased.

D. Asymptotic optimality

To estimate p := P{N(s) ≥ c} using direct Monte Carlo, simply generate K independent copies of s, denoted by s(1), …, s(K), under the original probability measure P, and let

p̂ = K−1 Σk=1K 1{N(s(k)) ≥ c}.

To simulate p using importance sampling, we need to first select a probability measure Q ≠ P for generating s(1), …, s(K). The estimate of p is then

p̂Q = K−1 Σk=1K 1{N(s(k)) ≥ c}/L(s(k)), where L(s) = Q(s)/P(s).

We require Q(s) > 0 whenever N(s) ≥ c, so as to ensure that p̂Q is unbiased for p.

The relative error (RE) of a Monte Carlo estimator p̂ or p̂Q is its standard deviation divided by p. We say that p̂Q is asymptotically optimal if, for any ε > 0, we can satisfy RE ≤ ε with log K = o(|log p|) as p → 0 (Sadowsky and Bucklew, 1990; Dupuis and Wang, 2005). Since the RE of direct Monte Carlo is [(1 − p)/(Kp)]1/2, so that K must be of order p−1 for a bounded relative error, direct Monte Carlo is not asymptotically optimal. The question we would like to answer here is: under what conditions are Algorithms A and B asymptotically optimal?

The examples described in Sections 3–5 involve word families that can be characterized as Inline graphic. We may also include an additional subscript m in a previously defined quantity to highlight its dependence on m, for example pm, qm, βm and nm. We say that xm and ym have similar logarithmic value relative to m, and write xm ≃ ym, if |log xm − log ym| = o(m) as m → ∞. It is not hard to see that if xm ≃ ym and ym ≃ zm, then xm ≃ zm. In Algorithm A, it is assumed implicitly that Inline graphic, and we shall also assume nm ≥ cℓmax when using Algorithm B. To fix the situation, let ρi = c/nm for all i in Algorithm B. Let Inline graphic and Inline graphic. Let ⌊·⌋ denote the greatest integer function, Px denote probability conditioned on s1 = x or v1 = x, and Pπ denote probability conditioned on s1 or v1 following the stationary distribution.

In the following lemma, we provide conditions for asymptotic optimality and check them in Appendices D.1–D.3 for the word families discussed in Sections 3–5.

Lemma 1

If log nm ≃ 1 and

graphic file with name M127.gif (A.6)
graphic file with name M128.gif (A.7)
graphic file with name M129.gif (A.8)

then both Algorithms A and B are asymptotically optimal.

Proof. Let Inline graphic. Since Inline graphic, by (A.7) and (A.8),

graphic file with name M132.gif (A.9)

By (A.6), |log pm| ≥ m|log α| for all large m, and hence it suffices for us to show Km ≃ 1.

If nm ≃ 1, then by (A.9) and the inequalities Inline graphic,

graphic file with name M134.gif (A.10)

Consider next the case nm/ℓmax → ∞. Since log nm ≃ 1, there exist integers ξm such that ξm ≃ 1, ξm = o(nm) and log nm = o(ξm). Let κm = ⌊nm/(ℓmax + ξm)⌋ and Inline graphic. By (A.6), αm ≥ pm ≥ (gmσmin)c and hence gm → 0. Since the underlying Markov chain is uniformly ergodic,

graphic file with name M136.gif (A.11)

By considering the sub-cases of at least c words Inline graphic starting at positions 1, Inline graphic, it follows from (A.11) that

graphic file with name M139.gif

By (A.6), κmgm → 0 and this implies κmrm → 0. Since (ℓmax + ξm) ≃ 1, it follows that κm ≃ nm and hence, by the inequalities

graphic file with name M140.gif

(A.10) again holds. By using a subsequence argument if necessary, it follows that (A.10) holds as long as log nm ≃ 1.

For Algorithm A, by (2.3) and (2.4),

graphic file with name M141.gif

and the desired relation Km ≃ 1 follows from (A.10) with c = 1.

For Algorithm B, it follows from (2.6) that if N(s) ≥ c, then L(s) ≥ (1 − c/nm)nm[βmin/(nmσmax)]c and hence by (2.4),

graphic file with name M142.gif

and again Km ≃ 1 follows from (A.10).  ▪

D.1. Inverted repeats

Consider the word family (3.1) with d2 ≃ 1. Then (A.7) holds. Since Inline graphic, (A.6) holds when nm = O(γm) for some Inline graphic. It remains to check (A.8). Since Inline graphic,

graphic file with name M146.gif (A.12)

Let um be generated with probability proportional to Inline graphic when creating the word bank Inline graphic. Then there exists a constant C > 0 such that

graphic file with name M149.gif

Hence βmin ≃ βmax and (A.8) follows from (A.12).

D.2. Word patterns derived from PSWMs

For the word family (4.2), condition (A.7) is always satisfied. Let the entries of the PSWM be non-negative integers and assume that the column totals are fixed at some C > 0. It follows from large deviations theory (Dembo and Zeitouni, 1998) that if t( = tm) ≥ EπS(v) + ζm for some ζ > 0, then

graphic file with name M150.gif (A.13)

Since pm ≤ nmPπ{S(v) ≥ t}, (A.6) holds if nm = O(γ^m) for some γ < λ^−1.

To simplify the analysis in checking (A.8), select the tilting parameter θ (= θm) to be the root of Eθ[S(v)] = t + δm for some positive δm = o(m) satisfying m^−1/2δm → ∞ as m → ∞, instead of the root of Eθ[S(v)] = t, as suggested in the statement containing (4.5). The implicit assumption is that Inline graphic for all m. Since the entries of the transition matrices derived in Appendix B are uniformly bounded away from zero, it follows from a coupling argument that Cov(wi(vi), wj(vj)) = O(τ^|i−j|) for some 0 < τ < 1 and hence Var(S(v)) = O(m). By (4.3) and Chebyshev's inequality,

[equation (A.14): image in original]

for all large m. Since ξ > 1 in (4.4), Inline graphic and (A.8) follows from (A.12) and (A.14).
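The choice of tilting parameter above can be illustrated numerically. The sketch below assumes an i.i.d. background π rather than the Markov background of Appendix B, and solves Eθ[S(v)] = t by bisection, where each PSWM column is tilted via πθ,i(b) ∝ π(b)exp(θ·wi(b)); the names `pswm`, `pi`, and `solve_theta` are illustrative, not the paper's code:

```python
import math

def tilted_mean(pswm, pi, theta):
    """Mean of S(v) = sum_i w_i(v_i) under columnwise exponential tilting
    pi_theta,i(b) proportional to pi(b) * exp(theta * w_i(b)); i.i.d. background."""
    total = 0.0
    for col in pswm:  # col maps base -> weight w_i(b)
        z = sum(pi[b] * math.exp(theta * w) for b, w in col.items())
        total += sum(w * pi[b] * math.exp(theta * w) for b, w in col.items()) / z
    return total

def solve_theta(pswm, pi, t, lo=0.0, hi=50.0, tol=1e-9):
    """Bisection for the root of E_theta[S(v)] = t; the tilted mean is
    nondecreasing in theta, and t must lie between the untilted mean
    and the maximum attainable score."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tilted_mean(pswm, pi, mid) < t:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pswm = [{"A": 2, "C": 0, "G": 0, "T": 1}, {"A": 0, "C": 3, "G": 1, "T": 0}]
theta = solve_theta(pswm, pi, t=4.0)
```

Monotonicity of the tilted mean (its derivative in θ is the tilted variance of S) is what makes the root well defined.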

D.3. Co-occurrences of motifs

Consider the word family (5.1) with r/m bounded away from zero and infinity and d2 ≃ 1. We check that (A.7) holds. If t1 ≥ EπS1(v) + ζm for some ζ > 0, then (A.13) holds with S replaced by S1 and t by t1, and hence (A.6) holds if nm = O(γ^m) for some γ < λ^−1.

Let θj be the root of Eθj[Sj(v)] = tj + δm for some positive δm = o(m) with m^−1/2δm → ∞, j = 1 and 2, assuming that Inline graphic, where m1 = m and m2 = r. Let Inline graphic and let Λ^(1)(θ1), Λ^(2)(θ2) be their respective normalizing constants; see (4.3). By the arguments in (A.14),

[equation: image in original]

By (5.2), Inline graphic and hence (A.8) follows from (A.12).
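For intuition about the co-occurrence statistic treated in Section 5, here is a toy count of motif pairs falling within a window of length r, using exact string matching as a stand-in for the score-based criteria S1 ≥ t1 and S2 ≥ t2 (function and variable names are hypothetical):

```python
def cooccurrence_count(seq, word1, word2, r):
    """Number of pairs (i, j) with word1 starting at i, word2 starting at j,
    and 0 < j - i <= r. Exact matching stands in for the PSWM score criteria."""
    hits1 = [i for i in range(len(seq) - len(word1) + 1)
             if seq[i:i + len(word1)] == word1]
    hits2 = [j for j in range(len(seq) - len(word2) + 1)
             if seq[j:j + len(word2)] == word2]
    return sum(1 for i in hits1 for j in hits2 if 0 < j - i <= r)

print(cooccurrence_count("ACGTACGT", "ACG", "TAC", r=4))  # 1
```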

Acknowledgments

This research was partially supported by the National University of Singapore (grants C-389-000-010-101 and R-155-062-112).

Disclosure Statement

No competing financial interests exist.

References

  1. Chan H.P. Zhang N.R. Scan statistics with weighted observations. J. Am. Statist. Assoc. 2007;102:595–602. [Google Scholar]
  2. Chiang D.Y. Moses A.M. Kellis M., et al. Phylogenetically and spatially conserved word pairs associated with gene-expression changes in yeasts. Genome Biol. 2003;4:R43. doi: 10.1186/gb-2003-4-7-r43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Cottrell M. Fort J.C. Malgouyres G. Large deviations and rare events in the study of stochastic algorithms. IEEE Trans. Automat. Contr. 1983;28:907–920. [Google Scholar]
  4. Dembo A. Zeitouni O. Large Deviations: Techniques and Applications. Springer; New York: 1998. [Google Scholar]
  5. Do K.A. Hall P. Distribution estimation using concomitants of order statistics, with applications to Monte Carlo simulation for the bootstrap. J.R. Statist. Soc. B. 1992;54:595–607. [Google Scholar]
  6. Dupuis P. Wang H. Dynamic importance sampling for uniformly recurrent Markov chains. Ann. Appl. Probabil. 2005;15:1–38. [Google Scholar]
  7. Fuh C.D. Hu I. Efficient importance sampling for events of moderate deviations with applications. Biometrika. 2004;91:471–490. [Google Scholar]
  8. Gusfield D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press; London: 1997. [Google Scholar]
  9. Huang H. Kao M. Zhou X., et al. Determination of local statistical significance of patterns in Markov sequences with applications to promoter element identification. J. Comput. Biol. 2004;11:1–14. doi: 10.1089/106652704773416858. [DOI] [PubMed] [Google Scholar]
  10. Johns M.V. Importance sampling for bootstrap confidence intervals. J. Am. Statist. Assoc. 1988;83:709–714. [Google Scholar]
  11. Lai T.L. Shan J.Z. Efficient recursive algorithms for detection of abrupt changes in signals and control systems. IEEE Trans. Automat. Contr. 1999;44:952–966. [Google Scholar]
  12. Leung M.Y. Choi K.P. Xia A., et al. Nonrandom clusters of palindromes in herpesvirus genomes. J. Comput. Biol. 2005;12:331–354. doi: 10.1089/cmb.2005.12.331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Leung M.Y. Schachtel G.A. Yu H.S. Scan statistics and DNA sequence analysis: the search for an origin of replication in a virus. Nonlinear World. 1994;1:445–471. [Google Scholar]
  14. Masse M.J.O. Karlin S. Schachtel G.A., et al. Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region. Proc. Natl Acad. Sci. USA. 1992;89:5246–5250. doi: 10.1073/pnas.89.12.5246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Mitrophanov A.Y. Borodovsky M. Statistical significance in biological sequence analysis. Briefings Bioinform. 2006;7:2–24. doi: 10.1093/bib/bbk001. [DOI] [PubMed] [Google Scholar]
  16. Pape U. Rahmann S. Sun F., et al. Compound Poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands. J. Comput. Biol. 2008;15:547–564. doi: 10.1089/cmb.2007.0084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Prum B. Rodolphe F. de Turckheim E. Finding words with unexpected frequencies in deoxyribonucleic acid sequences. J.R. Statist. Soc. B. 1995;57:205–220. [Google Scholar]
  18. Régnier M. A unified approach to word occurrence probabilities. Dis. Appl. Math. 2000;104:259–280. [Google Scholar]
  19. Reinert G. Schbath S. Waterman M. Probabilistic and statistical properties of words: an overview. J. Comput. Biol. 2000;7:1–46. doi: 10.1089/10665270050081360. [DOI] [PubMed] [Google Scholar]
  20. Robin S. Daudin J. Richard H., et al. Occurrence probability of structured motifs in random sequences. J. Comput. Biol. 2002;9:761–773. doi: 10.1089/10665270260518254. [DOI] [PubMed] [Google Scholar]
  21. Sadowsky J.S. Bucklew J.A. On large deviations theory and asymptotically efficient Monte Carlo estimation. IEEE Trans. Inform. Theory. 1990;36:579–588. [Google Scholar]
  22. Siegmund D. Importance sampling in the Monte Carlo study of sequential test. Ann. Statist. 1976;4:673–684. [Google Scholar]
  23. Spellman P.T. Sherlock G. Zhang M.Q., et al. Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Zhang N.R. Wildermuth M.C. Speed T.P. Transcription factor binding site prediction with multivariate gene expression data. Ann. Appl. Statist. 2008;2:332–365. [Google Scholar]
  25. Zhou Q. Wong W. CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl Acad. Sci. USA. 2004;101:12114–12119. doi: 10.1073/pnas.0402858101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Zhu J. Zhang M.Q. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999;15:607–611. doi: 10.1093/bioinformatics/15.7.607. [DOI] [PubMed] [Google Scholar]
