Normal and Compound Poisson Approximations for Pattern Occurrences in NGS Reads

Zhiyuan Zhai; Gesine Reinert; Kai Song; Michael S Waterman; Yihui Luan; Fengzhu Sun

doi:10.1089/cmb.2012.0029

2012 Jun;19(6):839–854. doi: 10.1089/cmb.2012.0029

PMCID: PMC3375642 PMID: 22697250

Abstract

Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/∼fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).

Key words: algorithms, genome analysis, HMM, next generation sequencing, statistical models

1. Introduction

The study of the occurrences of word patterns in sequences has played an important role in molecular sequence analysis. Here, we shall use word pattern of length k and k-tuple interchangeably; often word patterns are also just referred to as words. For a given k, the frequencies of word patterns of length k form a vector, referred to as sequence signature (Campbell et al., 1999). Sequence signatures of genomic sequences of varying characteristics are usually different. For example, coding and non-coding sequences usually have different signatures and thus sequence signatures can be useful features to distinguish coding from non-coding sequences (Uberbacher and Mural, 1991). Sequence signatures within different parts of a genome tend to be similar, while they differ significantly between genomes (Karlin and Mrázek, 1997, Nekrutenko and Li, 2000). Thus, sequence signatures have been used to study the evolutionary relationship between different genomic sequences (Jun et al., 2010, Karlin and Mrázek, 1997, Sims et al., 2009, Wu et al., 2009), to study horizontal gene transfer (Dalevi et al., 2006, Dufraigne et al., 2005), and to bin sequence reads from metagenomic studies so that reads in the same bin tend to have similar sequence signatures (McHardy et al., 2006). The sequence signatures can also be employed to detect enrichment for short words. For example, the upstream regions of co-regulated genes usually share common transcription factor binding sites (TFBS) referred to as motifs, and thus motifs are usually enriched within these sequences. Finding enriched word patterns within these sequences is a powerful tool for the identification of TFBS (Pavesi et al., 2004).

Due to the many applications of sequence signatures, extensive studies have been carried out to study the distribution of the number of occurrences of word patterns in one or multiple long sequences consisting of independent identically distributed (i.i.d.) letters and sequences generated by both Markov and hidden Markov models (HMM). Several excellent reviews (Reinert et al., 2000, 2005, Schbath, 2000, Schbath and Robin, 2009) and a book (Robin et al., 2005) on this topic are available. The distribution of the number of occurrences of a pattern can also be studied using the so-called “imbedded Markov chain” techniques (Kleffe and Langbecker, 1990, Nuel, 2006, Shan and Zheng, 2009). However, the computation of p-values using these techniques can be very time consuming and impractical for long sequences. We recently studied the power of detecting enriched patterns when motifs are randomly distributed along the genome using HMM (Zhai et al., 2010).

In all these studies, one or several long sequences are available and the word pattern occurrences along these long sequences are studied. Rather than providing a few long sequences, recent developments in sequencing technologies make it possible to sequence a large number of relatively short reads (e.g., 30–80 bp for Illumina/Solexa and 300–500 bp for Roche 454) efficiently and economically. These new sequencing technologies have revolutionized current studies of many biological problems including locating genomic regions of TFBS, histone modification, and chromatin structure using ChIP-Seq, resequencing of known genomes for the identification of genetic polymorphisms, and sequencing of unknown genomes. For the applications of these NGS technologies, see recent reviews (Maclean et al., 2009, Mardis, 2008a,b). Although many computational methods have been developed to analyze NGS data, to our knowledge no studies on the distribution of the number of occurrences of word patterns in the sequence reads generated from NGS have been carried out. In this article, we fill this gap. The main challenge compared to word counts in sequences is that, in NGS, two random processes are involved: the randomness in the background genome sequence and the random sampling of the reads from the background sequence.

The study of the distribution of the numbers of occurrences of word patterns from NGS read data has several important applications, in particular, when the complete genome sequences are not available. First, such distributions are important for the comparison of genomes when NGS short reads are available for each genome (Song et al., 2012). Second, they can be used to identify enriched or depleted patterns in genomes whose complete genomes are not known. Such enriched or depleted patterns can be used to characterize the genome sequences. Third, the null distributions of the numbers of occurrences of patterns can be used to identify enriched patterns in ChIP-Seq experiments and such enriched patterns can be useful for the identification of TFBS.

In this article, we not only study the distributions of the numbers of occurrences of word patterns from NGS read data under a suitable null model, but we also address the issue of the power of the count statistics against an alternative model which assumes that there are motifs present in the sequence. Our methodology builds on Zhai et al. (2010), but differs from that article in the consideration of NGS data and the consideration of both strands of the genome sequences. In the study of word patterns for long sequences, both strands are rarely considered except in Pape et al. (2008). For NGS, the consideration of both strands is essential since the reads can come from both the forward strand and the reverse strand of the genome sequences. We provide a simpler approximate distribution for the number of occurrences of word patterns for NGS data than the approximations given in Pape et al. (2008).

The article is organized as follows. In Section 2, we first present the probability models for the background sequence and the sampling process of reads using NGS. Then the results for normal and compound Poisson approximations for the number of occurrences of patterns in NGS reads are given. As the approximations assume that both the length of the reads and the length of the background sequence go to infinity, whereas in reality they are reasonably short, we also give bounds on the approximation errors. We consider both single strand and double strand models. This section forms the core theoretical results of the article. In Section 3, we first present simulation results to show the validity of the theoretical results for both common and rare patterns, and then use the theoretical results to analyze a ChIP-Seq data set from Valouev et al. (2008). It is surprising to see that, even in the control data, some TFBS signals can be identified, indicating that some residual ChIP effects are present in the control data. Using ChIP-Seq data, we are able to identify the consensus patterns of the motif of interest. The article concludes with some discussion on the limitations of the approach and future research directions. Many of the proofs are given in Supplementary Materials (available online at www.liebertonline.com/cmb).

2. Methods

2.1. Probabilistic models for the background sequence and sampling of sequence reads using NGS

In NGS, a large number M of reads are randomly sampled from the genome. For studying the distribution of the number of occurrences of patterns among the M reads, two random processes are involved. The first randomness comes from the generation of the background genome sequence and the second randomness comes from random sampling of the reads from the background sequence.

As in previous studies reviewed in Robin et al. (2005) and Schbath and Robin (2009), the background sequence is modeled as a homogeneous ergodic Markov chain taking states in the finite alphabet $\mathcal{A}$ of size $L$ with transition probability matrix $T = (t_{ll'})_{L \times L}$. The Markov chain has a unique stationary distribution π₀. The results in this paper can also be extended to sequences generated by hidden Markov models without too much difficulty.

Next, we model the sampling of reads along the genome sequence using NGS. As it was shown in Zhang et al. (2008) that the homogeneous Lander-Waterman model (Lander and Waterman, 1988) for genomic mapping does not model the read distributions along the genome well, we use a modified version of the Lander-Waterman model to describe the distribution of reads along the genome. We assume that the sampled reads have the same length of β bp. A total of M reads are independently sampled from the genome of length n bp. Each read starts at position $i$ with probability $\lambda_i$, where $i = 1, \ldots, n - \beta + 1$, with $\sum_{i=1}^{n-\beta+1} \lambda_i = 1$.

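Let $\mathbf{w} = w_1 w_2 \cdots w_w$ be any word pattern of length $w$ with each letter $w_i$ in the alphabet $\mathcal{A}$. Then $P(\mathbf{w}) = \pi_0(w_1) \prod_{i=1}^{w-1} t_{w_i w_{i+1}}$ is the probability of $\mathbf{w}$. Let $N_{\mathbf{w}}(M)$ be the number of occurrences of $\mathbf{w}$ in these M reads. To calculate the mean of $N_{\mathbf{w}}(M)$, note first that in each read of length β, the expected number of occurrences of w is (β − w + 1)P(w). As there are M reads, we obtain that

$$\mathbb{E}[N_{\mathbf{w}}(M)] = M(\beta - w + 1)P(\mathbf{w}). \qquad (1)$$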
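The mean M(β − w + 1)P(w) described above can be checked numerically. The following is a minimal simulation sketch, assuming an i.i.d. background and uniformly sampled read starts (a special case of the model in this subsection); the function names and parameter values are illustrative and not part of the original software.

```python
import random

def expected_count(word, M, beta, freqs):
    """Mean M * (beta - w + 1) * P(w) for an i.i.d. background with letter frequencies `freqs`."""
    p_word = 1.0
    for letter in word:
        p_word *= freqs[letter]
    return M * (beta - len(word) + 1) * p_word

def simulate_mean_count(word, n, M, beta, freqs, n_sims, seed=0):
    """Average, over n_sims simulated genomes, of the total count of `word` in M reads."""
    rng = random.Random(seed)
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    w = len(word)
    total = 0
    for _ in range(n_sims):
        # First source of randomness: the background sequence.
        genome = "".join(rng.choices(letters, weights=weights, k=n))
        # prefix[i] = number of occurrences of `word` starting at 0-based positions < i
        prefix = [0] * (n - w + 2)
        for i in range(n - w + 1):
            prefix[i + 1] = prefix[i] + genome.startswith(word, i)
        # Second source of randomness: uniformly sampled read starts (assumption).
        for _ in range(M):
            start = rng.randrange(n - beta + 1)
            total += prefix[start + beta - w + 1] - prefix[start]
    return total / n_sims

if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(expected_count("TAT", 500, 80, uniform))
    print(simulate_mean_count("TAT", 5000, 500, 80, uniform, n_sims=200))
```

The prefix-sum array avoids rescanning each read; the count in a read starting at `start` is the number of occurrences whose start lies in `start, ..., start + beta - w`.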

We study the approximate distribution of $N_{\mathbf{w}}(M)$ and the approximate joint distribution of $(N_{\mathbf{w}}(M))_{\mathbf{w} \in \mathcal{W}}$, where $\mathcal{W}$ indicates the set of word patterns. We consider both single strand and double strand models. In the single strand model, we assume that the reads just come from one strand. In the double strand model, the reads can come from either strand of the genome. We allow for the occurrences to overlap. For example, if the sequence is 5′-CAATAATATAATAG-3′ and the word is ATA, then we count four occurrences in the single strand model. A clump of pattern w is a consecutive region of the sequence with overlapping occurrences of w. For the example given above, there are three clumps of occurrences, one clump (ATATA) of size two and two clumps of size one each. Counting the occurrences of ATA in the complementary sequence 5′-CTATTATATTATTG-3′ also, there are 4 + 1 = 5 occurrences of w in the double strand model. Note that we always count from the 5′ end to the 3′ end of the sequences.
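The counting conventions in the example above (overlapping occurrences, clumps, and the double strand count) can be sketched as follows. The helper names are hypothetical and chosen for illustration only.

```python
def find_occurrences(seq, word):
    """0-based start positions of all (possibly overlapping) occurrences of word."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]

def clump_sizes(seq, word):
    """Group overlapping occurrences into clumps; return the size of each clump."""
    sizes = []
    last_start = None
    for start in find_occurrences(seq, word):
        # Two occurrences overlap when their starts are less than len(word) apart.
        if last_start is not None and start - last_start < len(word):
            sizes[-1] += 1
        else:
            sizes.append(1)
        last_start = start
    return sizes

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complementary strand, read from its own 5' end to its 3' end."""
    return seq.translate(COMPLEMENT)[::-1]

def double_strand_count(seq, word):
    """Occurrences of word on the sequence plus occurrences on the complementary strand."""
    return len(find_occurrences(seq, word)) + len(find_occurrences(reverse_complement(seq), word))

seq = "CAATAATATAATAG"
print(find_occurrences(seq, "ATA"))     # [2, 5, 7, 10]
print(clump_sizes(seq, "ATA"))          # [1, 2, 1]
print(reverse_complement(seq))          # CTATTATATTATTG
print(double_strand_count(seq, "ATA"))  # 5
```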

2.2. Normal approximation for the number of occurrences of frequent patterns in randomly sampled NGS reads

In this subsection, we present our results for calculating the covariance of $N_{\mathbf{u}}(M)$ and $N_{\mathbf{v}}(M)$ under the models described in Subsection 2.1 for any two word patterns u and v. Proposition 2.1 presents a formula to calculate $\mathbb{E}[N_{\mathbf{u}}(M) N_{\mathbf{v}}(M)]$. The covariance can then be derived using Equation (1). While the covariance of word counts for a single sequence read can be found in Waterman (1995), Proposition 12.1, the following Proposition 2.1 takes the randomness in the starting positions of the sequence reads into account.

Proposition 2.1

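Consider the underlying sequence of length n. Let u and v be two word patterns of length u and v, respectively, with u ≤ v. Assume that β ≥ u + v. Randomly choose M reads of length β from a genome of length n base pairs according to the model in Subsection 2.1 and let $N_{\mathbf{u}}(M)$ and $N_{\mathbf{v}}(M)$ be the numbers of occurrences of word patterns u and v in these reads, respectively. Then $\mathbb{E}[N_{\mathbf{u}}(M) N_{\mathbf{v}}(M)]$ can be calculated by

$$\mathbb{E}[N_{\mathbf{u}}(M) N_{\mathbf{v}}(M)] = M\,\mathbb{E}\bigl[N_{\mathbf{u}}[1,\beta]\,N_{\mathbf{v}}[1,\beta]\bigr] + M(M-1) \sum_{i,j} \lambda_i \lambda_j\, \mathbb{E}\bigl[N_{\mathbf{u}}[i, i+\beta-1]\,N_{\mathbf{v}}[j, j+\beta-1]\bigr],$$

where $i, j = 1, \ldots, n - \beta + 1$, and N_w[i, i + β − 1] is the number of occurrences of word pattern w in positions i through i + β − 1 of the underlying sequence.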

Formulas for calculating $\mathbb{E}\bigl[N_{\mathbf{u}}[i, i+\beta-1]\,N_{\mathbf{v}}[j, j+\beta-1]\bigr]$ are given in the supplementary materials, Proposition A.1; they are based on a slight modification of the proof for Proposition 12.1 in Waterman (1995).

Proof of Proposition 2.1

Let C_w(m) be the number of occurrences of word pattern w in the m-th read, $m = 1, \ldots, M$. Then

Let

According to our model, it is easy to see that for word patterns u and v, the counts (C_u(m),C_v(m)) have the same distribution for all $m = 1, \ldots, M$. Similarly, for any m ≠ m′, (C_u(m),C_v(m′)) have the same distribution. Thus,

and

Since the Markovian sequence is stationary, C_u(1) has the same distribution as N_u[1,β]. Thus,

Conditioning on the locations of the first and second reads, we have

The proposition is proved. ▪
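The chain of identities behind this proof can be written out explicitly. The following is a sketch reconstructed from the verbal steps above, not a verbatim copy of the published displays; the symbol $N_{\mathbf{w}}(M)$ for the count over the M reads is an assumed notation, and the published version may group the terms differently.

```latex
\begin{align*}
N_{\mathbf{w}}(M) &= \sum_{m=1}^{M} C_{\mathbf{w}}(m), \\
\mathbb{E}\bigl[N_{\mathbf{u}}(M)\,N_{\mathbf{v}}(M)\bigr]
  &= \sum_{m=1}^{M}\sum_{m'=1}^{M} \mathbb{E}\bigl[C_{\mathbf{u}}(m)\,C_{\mathbf{v}}(m')\bigr] \\
  &= M\,\mathbb{E}\bigl[C_{\mathbf{u}}(1)\,C_{\mathbf{v}}(1)\bigr]
   + M(M-1)\,\mathbb{E}\bigl[C_{\mathbf{u}}(1)\,C_{\mathbf{v}}(2)\bigr], \\
\mathbb{E}\bigl[C_{\mathbf{u}}(1)\,C_{\mathbf{v}}(1)\bigr]
  &= \mathbb{E}\bigl[N_{\mathbf{u}}[1,\beta]\,N_{\mathbf{v}}[1,\beta]\bigr]
  \quad\text{(stationarity)}, \\
\mathbb{E}\bigl[C_{\mathbf{u}}(1)\,C_{\mathbf{v}}(2)\bigr]
  &= \sum_{i=1}^{n-\beta+1}\sum_{j=1}^{n-\beta+1} \lambda_i\,\lambda_j\,
     \mathbb{E}\bigl[N_{\mathbf{u}}[i,i+\beta-1]\,N_{\mathbf{v}}[j,j+\beta-1]\bigr]
  \quad\text{(conditioning on the two read starts)}.
\end{align*}
```

The second line splits the double sum into the M terms with m = m′ and the M(M − 1) terms with m ≠ m′, which is where the identical-distribution statements are used.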

For the special case that the background sequence is i.i.d., we have the following corollary.

Corollary 2.1

Suppose that the background sequence is i.i.d. With the same notation as in Proposition 2.1, we have

1. The covariance of $N_{\mathbf{u}}(M)$ and $N_{\mathbf{v}}(M)$ can be calculated as
2. If and M depends on n such that lim_n→∞M/n = θ, then

In particular, if the reads are uniformly sampled from the genomic sequence, i.e. λ_i = 1/(n − β + 1), then Inline graphic .

Corollary 2.1 follows by noting that for the i.i.d. case and η ≥ β,

The second part follows directly by taking the limit of $\mathrm{Cov}(N_{\mathbf{u}}(M), N_{\mathbf{v}}(M))/M$ from Part 1 as n tends to infinity.

Given the approximate mean and variance of $N_{\mathbf{w}}(M)$, it is tempting to approximate the distribution of $N_{\mathbf{w}}(M)$ using a normal distribution. The approximation is based on the heuristic that the counts in different reads are independent unless the reads overlap, and if the words are not too long, the count in each read would be approximately normally distributed.

As reads are not very long, the approximation error may not be negligible, and hence we give an upper bound on the approximation error. Our result is phrased in terms of the Kolmogorov distance

$$d_K(\mathcal{L}(X), \mathcal{L}(Z)) = \sup_{x \in \mathbb{R}} \bigl| P(X \le x) - P(Z \le x) \bigr|,$$

where Z denotes a standard normal variable. Thus $|P(\text{standardized count} \le x) - P(Z \le x)| \le d_K$ for every x, and a bound on d_K can be used to obtain a conservative p-value for the observed standardized count.
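As a small illustration of how a bound on d_K yields a conservative p-value, the sketch below adds the bound to the normal upper tail probability. It assumes d_K is the usual Kolmogorov distance between the standardized count and a standard normal variable; the function names and the numeric inputs are hypothetical.

```python
import math

def normal_upper_tail(x):
    """P(Z >= x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def conservative_p_value(count, mean, variance, dk_bound):
    """Upper bound on P(N >= count): normal tail of the standardized count plus the d_K bound."""
    x = (count - mean) / math.sqrt(variance)
    return min(1.0, normal_upper_tail(x) + dk_bound)

# Hypothetical inputs, for illustration only.
print(conservative_p_value(count=700, mean=609.375, variance=5800.0, dk_bound=0.01))
```

The bound is added rather than ignored because the true tail probability can exceed the normal tail by at most d_K; capping at 1 keeps the result a valid probability.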

Here, we employ Theorem 2.6 in Chen and Shao (2004), and assume the i.i.d. model for the underlying background sequence. Then N_w[i, i + β − 1] and N_w[j, j + β − 1] are independent unless |i − j| ≤ β, where N_w[l, l′] is the number of occurrences of word w in the interval [l, l′] along the long sequence. Using the notation Inline graphic and , the following result holds.

Theorem 2.1

Assume the i.i.d. model for the background sequence and let Z be standard normally distributed. Then for a word w of length w,

where Inline graphic .

In the case that all letters are equally likely and independent, and all $\lambda_i = 1/(n - \beta + 1)$, the bound will be of order O((ln n)⁻¹) when the word length is not too large, w < log_L(β / ln n), while the read length β = c₁L^w ln n for a constant c₁ > 1 and the number of reads M = c₂n/(ln n) for a constant c₂ > 0. This type of regime is rather specific: for example, the above regime with n = 5,000 and w = 4 on a 4-letter alphabet would require β > 2,181, while n/(ln n) = 587; if n = 20,000 and w = 7, we would need β > 162,259, while n/(ln n) = 2019.5. Moreover, the above regime would require that M/n is small. Hence, it is no surprise that the normal approximation does not work well in many situations.
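The numbers quoted for this regime follow directly from β = c₁L^w ln n with c₁ > 1 and M = c₂n/(ln n). A short check, with a hypothetical helper name:

```python
import math

def regime(n, w, L=4):
    """Return (L^w * ln n, n / ln n): the lower limit for beta as c1 -> 1 and the scale of M."""
    min_beta = L**w * math.log(n)
    reads_scale = n / math.log(n)
    return min_beta, reads_scale

for n, w in [(5000, 4), (20000, 7)]:
    min_beta, reads_scale = regime(n, w)
    print(f"n={n}, w={w}: beta > {min_beta:.1f}, n/ln(n) = {reads_scale:.1f}")
```

Both read lengths are far beyond the 30–500 bp reads mentioned in the Introduction, which is the point of the remark above.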

In particular, in many practical applications of NGS, the coverage of the sequenced reads is moderate to high depending on the biological applications. Thus, the normal approximation may not work well in these situations. The theorems also explain the poor performance of normal approximation in our simulations in Section 3. We emphasize that the bound may not be the best possible in all settings.

A similar result is available for multivariate word counts. The generalization to a Markovian sequence is straightforward, using the arguments from Huang (2002) for the joint counts starting at a specified position, and a local dependence argument as above.

Finally, note that if Mλ_i is large for some i, then $N_{\mathbf{w}}(M)$ might be better approximated by the sum of products of normally distributed variables, which is approximately normal only when the number of summands is large.

2.3. Compound Poisson approximation for the number of occurrences of rare patterns in randomly sampled NGS reads

For rare patterns along the background sequence, the normal approximation as described in Subsection 2.2 is not appropriate; instead, we present a compound Poisson approximation for the number of occurrences of such patterns. This compound Poisson approximation takes clumps of occurrences directly into account. Recall that a clump of word w is a maximum consecutive region of the background sequence with overlapping occurrences of w. For the clumps, we first introduce some notation. Let $\mathbf{w}^{(p)} = w_1 w_2 \cdots w_p$ be the p-th prefix composed of the first p letters of w. The set of periods of the word w of length w (Guibas and Odlyzko, 1981, Lothaire, 1983) is defined by

$$\mathcal{P}(\mathbf{w}) = \{\, p \in \{1, \ldots, w - 1\} : w_i = w_{i+p},\ i = 1, \ldots, w - p \,\}.$$

The set of principal periods of word pattern w, $\mathcal{P}'(\mathbf{w})$, are those periods that cannot be written as multiples of other periods. It was shown in Reinert and Schbath (1998) and Schbath (1995) that the number of clumps, $N_{c,n}$, in a Markovian sequence of length n can be approximated by a Poisson random variable with mean $(n - w + 1)\tilde{P}(\mathbf{w})$, where

$$\tilde{P}(\mathbf{w}) = P(\mathbf{w}) - \sum_{p \in \mathcal{P}'(\mathbf{w})} P(\mathbf{w}^{(p)}\mathbf{w}). \qquad (2)$$
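Periods and principal periods are easy to compute directly from their definitions. The sketch below uses the standard characterization that p is a period when the word agrees with itself shifted by p letters; the function names are illustrative.

```python
def periods(word):
    """Periods p in {1, ..., w-1}: the word agrees with itself shifted by p letters."""
    w = len(word)
    return [p for p in range(1, w) if word[: w - p] == word[p:]]

def principal_periods(word):
    """Periods that are not a proper multiple of a smaller period."""
    ps = periods(word)
    return [p for p in ps if not any(q < p and p % q == 0 for q in ps)]

for word in ["TAT", "ACGT", "CGCG", "ACGTATC", "AAGAAGAA"]:
    print(word, periods(word), principal_periods(word))
# TAT [2] [2]
# ACGT [] []
# CGCG [2] [2]
# ACGTATC [] []
# AAGAAGAA [3, 6, 7] [3, 7]
```

For AAGAAGAA, the period 6 is dropped from the principal periods because it is a multiple of the period 3.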

We also consider X_i, the number of occurrences of word pattern w in the i-th clump. Let

be the set of all possible ways a clump of size k can occur. It was shown in Reinert and Schbath (1998) and Schbath (1995) that

(3)

where

(4)

A compound Poisson approximation for $N_{\mathbf{w}}(M)$ can be motivated as follows. Let Z_i be the number of reads covering the first occurrence of w in the i-th clump. We can reasonably assume that the read will cover the whole clump as the clump size is generally not long. Then we may approximate

$$N_{\mathbf{w}}(M) \approx \sum_{i=1}^{N_{c,n}} Z_i X_i. \qquad (5)$$

We note that the above equation may slightly over-estimate the number of occurrences of w in the M reads since we only require that the reads cover the first w, not the whole clump. However, the approximation is reasonable since the sequence reads are generally much longer than the length of clumps and the reads covering the first w are most likely covering the whole clump.

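Next, we study the distribution of Z_i. If the i-th clump starts at j, then the number of reads containing the first w in the clump is a binomial random variable $\mathrm{Bin}\bigl(M, \sum_{s=j-\beta+w}^{j} \lambda_s\bigr)$ (with $\lambda_s = 0$ for s outside 1, …, n − β + 1), which is asymptotically Poisson with mean $\mu_j = M \sum_{s=j-\beta+w}^{j} \lambda_s$. Since the occurrences of clumps follow asymptotically a Poisson process, the starting point of the i-th clump is approximately uniformly distributed in [1, n − w + 1]. Thus, the independent random variables $\tilde{Z}_i$ with distribution

$$P(\tilde{Z}_i = z) = \frac{1}{n - w + 1} \sum_{j=1}^{n-w+1} e^{-\mu_j} \frac{\mu_j^{z}}{z!}, \quad z = 0, 1, 2, \ldots \qquad (6)$$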

are a reasonable approximation to the random variables Z_i.

The next theorem makes the heuristic argument precise. Recall that the total variation distance between two $\mathbb{Z}_+$-valued random variables X and Y (defined on the same probability space) is defined by

$$d_{TV}(X, Y) = \sup_{A \subseteq \mathbb{Z}_+} \bigl| P(X \in A) - P(Y \in A) \bigr|.$$

Thus, if the total variation distance between X and Y is small, then for any subset, A, of the nonnegative integers, the difference between the probability for X to be in A and that for Y is also small. A bound in total variation distance can be applied to get conservative p-values for counts via the formula

$$P(X \ge x) \le P(Y \ge x) + d_{TV}(X, Y).$$

To state the results we need some more notation. Let α = α₂ be the second-largest eigenvalue in modulus of the transition matrix T; the Perron-Frobenius Theorem ensures that |α| < 1. Let D be the diagonal matrix with the eigenvalues of T on the diagonal, ordered such that the first entry is α₁ = 1. Then we decompose T = PDP⁻¹ such that the first column of P is $(1, \ldots, 1)^{\top}$. For all $t = 1, \ldots, L$, let J_t denote the L × L matrix such that all its entries are equal to 0 except J_t(t, t) = 1, and we define

Let π(x) be the probability of letter x and

Let π_min be the smallest value of Inline graphic . Put

(7)

Theorem 2.2

Let Inline graphic have Poisson distribution with mean , let have distribution (6), let have distribution (3) and assume that all these variables are independent. Then

Here B(T, w, n) is given in (7).

Let Inline graphic . Then the probability can be calculated using the recursion (Panjer, 1981, Willmot and Panjer, 1987)

(8)

with initial value Inline graphic and .
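For readers who want to evaluate a compound Poisson distribution numerically, the following is a minimal sketch of the Panjer recursion for a Poisson number of i.i.d. nonnegative integer summands. It is a generic implementation, not the authors' software; the argument name `severity` is hypothetical, and treating it as the probability mass function of a single summand (here, presumably the product of the read multiplicity and the clump size) is an assumption.

```python
import math

def compound_poisson_pmf(lam, severity, s_max):
    """pmf of S = Y_1 + ... + Y_N on 0..s_max, with N ~ Poisson(lam) and
    Y_i i.i.d. with P(Y = j) = severity[j], j = 0, 1, ..., len(severity) - 1."""
    f = list(severity)
    g = [0.0] * (s_max + 1)
    # P(S = 0): summands equal to zero do not contribute to the total.
    g[0] = math.exp(-lam * (1.0 - f[0]))
    for s in range(1, s_max + 1):
        acc = 0.0
        for j in range(1, min(s, len(f) - 1) + 1):
            acc += j * f[j] * g[s - j]
        g[s] = lam / s * acc
    return g

# Sanity check: summands identically 1 give back the Poisson(lam) pmf.
pmf = compound_poisson_pmf(2.0, [0.0, 1.0], 30)
print(pmf[3], math.exp(-2.0) * 2.0**3 / 6)
```

The recursion costs O(s_max × len(severity)) operations, which is far cheaper than convolving the summand distribution with itself for each possible number of clumps.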

While B(T, w, n) has a complicated expression, when the Markov chain is reasonably well mixed then its leading term will be of the order Inline graphic . The bound in Theorem 2.2 will be small when, firstly, the compound Poisson approximation for intervals of length β is good; secondly, the distribution of starting points of reads is relatively homogeneous; thirdly, the number M of reads is not too large compared to n; and fourthly, Inline graphic is small.

Theorem 6.6.4 in Reinert et al. (2005) gives the analogous approximation for counts of different words $\mathbf{w}_1, \ldots, \mathbf{w}_r$, where r is an integer. The bounds are of similar flavor but involve more notation which considers the possible overlaps between different words. We omit the result here.

2.4. Extending the approximations to the double-strand model

In the above subsections, we assume that the sequence under study is single-stranded for simplicity. However, DNA sequences are double-stranded, the sequence reads from NGS can come from either strand, and it is not known which strand the reads come from. To take both strands into consideration, we consider both the reads and their complements. Among the M pairs of reads, the number of occurrences of w is equal to the number of occurrences of the complement of w, denoted $\bar{\mathbf{w}}$, because we consider the complement of each read. Next we study the distribution of this count for any word pattern w by considering the following scenarios.

We first assume a palindrome such that $\mathbf{w} = \bar{\mathbf{w}}$, where $\bar{\mathbf{w}}$ denotes the complement of w, for example, w = ACGT or CGCG. For such word patterns, it is obvious that the count in the M pairs of reads is twice the count of w in the M forward reads. Next we assume $\mathbf{w} \ne \bar{\mathbf{w}}$. For each pair of complementary reads, we consider the one from the forward strand. Thus, we have a new set of M reads all from the forward strand. Note that the word pattern w occurs in one of the strands of a pair of complementary reads if the forward read contains either w or $\bar{\mathbf{w}}$. Thus, the total number of occurrences of w in the M pairs of complementary reads equals the number of occurrences of either w or $\bar{\mathbf{w}}$ along the forward reads. Thus, we are interested in the joint word counts of w and its complement $\bar{\mathbf{w}}$, but in contrast to Reinert and Schbath (1998) we allow for mixed clumps of occurrences, that is, the clumps can be composed of combinations of w and its complement. A compound Poisson approximation, with bounds, for the joint count of w and its complement can be found in Roquain and Schbath (2007). The approximation is valid only for non-palindromes; it also requires that the word and its complement have non-zero probability of appearing in the sequence. Here we illustrate how such a compound Poisson approximation for word counts in single reads can be combined for NGS data.

Let

where Inline graphic can be either w or , and

and Inline graphic is a subset of by removing those that are multiple of other numbers. By the definition, we have, for any word pattern w,

Let C_k(w) be the subset of C_k such that the first word Inline graphic , and be the subset of C_k such that the first word Define and .

Given the above notation, we consider the distribution of clumps starting with w and Inline graphic , respectively. Here a clump is defined as a maximum region with overlapping occurrences of either w or . Note that a clump starting with w occurs at a position i if 1) w occurs at position i, 2) sequences in do not occur just before i. Thus, a w-clump starts at a particular position with probability

(9)

Similarly, the probability that a Inline graphic -clump starts at a particular position

(10)

We refer to the clumps starting with w the w-clumps and those starting with Inline graphic the -clumps. Both the w-clumps and -clumps can be approximated by a Poisson process with parameters and , respectively. The joint of the two approximate Poisson processes can again be approximated by a Poisson process with parameters . Thus, the number of clumps including both the w-clumps and the Inline graphic -clumps, , can be approximated by a Poisson random variable with mean .

We order the w- and Inline graphic -clumps from the 5′-end to the 3′-end. Let I_i = 1 and I_i = 0 be the events that the i-th clump is a w-clump and -clump, respectively. We have

(11)

Next we study the distribution of the number of occurrences of w or Inline graphic in w-clumps and -clumps, separately. A k-clump starting with w is referred as a k-w-clump. Similarly, a k-clump starting with is referred as a k--clump. Then a k-w-clump occurs at a specific position i if (1) C_k(w) occurs at position i, (2) sequences in do not occur just before i, and (3) C_k₊₁(w) does not occur at position i. Note that when we deduct the probability of the second and the third events, we deduct the probability of the event SC_k₊₁(w) twice. Thus, we need to add the probability of this event, giving

(12)

Similarly, we have for Inline graphic

(13)

Let X_i and Inline graphic be the numbers of occurrences of w or in a w-clump and -clump, respectively. Then the distributions of X_i and are

(14)

Similar as above, let Inline graphic be the number of reads covering the first occurrence of w or in the i-th clump.

The distribution of Inline graphic can be calculated by

The number of occurrences of w along the M pairs of reads can be approximated by

(15)

The argument is made precise in the next theorem.

Theorem 2.3

Let Inline graphic have a Poisson distribution with mean let have distribution (6), let have distribution (14), let have distribution (11) and assume that all these variables are independent. Then

Here

where C > 0 and C′ > 0 are two explicit constants that only depend on the transition matrix T, and α is the second largest eigenvalue in modulus of T.

2.5. The approximate power of detecting enriched patterns using the compound Poisson approximation for the distribution of the number of occurrences of word patterns

The framework for the normal and compound Poisson approximations for the number of occurrences of word patterns is equally applicable to sequences generated by hidden Markov models. In particular, a regulatory sequence with many instances of transcription factor binding sites can be modeled by a hidden Markov model as in Zhai et al. (2010). Specifically, the long background sequence is modeled as an i.i.d. sequence. In addition, instances of a motif with a given position weight matrix are randomly inserted into the background sequence with probability 1 − ρ at each position. We refer to 1 − ρ as motif density. Next generation sequencing is then used to sample M reads from the long sequence as modeled in Subsection 2.1. Based on the reads, we want to test the null hypothesis H₀ : ρ = 1, i.e., no motif instances are inserted, versus the alternative hypothesis H₁ : ρ < 1, i.e., motif instances are inserted in the underlying background sequence. Consider the dominant pattern w^(d) in the motif consisting of the nucleotide with the highest probability in each position. We can use $N_{\mathbf{w}^{(d)}}(M)$ as a statistic to test the hypotheses. For a given type I error α, we can obtain a threshold t_α, that is, the smallest value of t such that

$$P_0\bigl(N_{\mathbf{w}^{(d)}}(M) \ge t\bigr) \le \alpha, \qquad (16)$$

based on the theory for $N_{\mathbf{w}^{(d)}}(M)$ developed above, where $P_0$ is the approximate probability distribution of $N_{\mathbf{w}^{(d)}}(M)$ when no motifs are inserted, i.e., ρ = 1.

The approximate power of the test statistic when ρ < 1 is given by

$$P_1\bigl(N_{\mathbf{w}^{(d)}}(M) \ge t_\alpha\bigr), \qquad (17)$$

where $P_1$ is the probability distribution under the alternative model. The approximate power for detecting the enrichment of certain patterns under the double-strand model can be calculated similarly. In the following, for convenience we use the term “power” to mean approximate power.
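The threshold and power computations can be sketched as follows for any pair of probability mass functions on 0, 1, 2, …. The sketch assumes the rejection region has the form "count ≥ t" (the direction of the inequality is an assumption here), and the toy distributions are made up for illustration.

```python
def upper_tails(pmf):
    """tails[t] = P(N >= t) for t = 0..len(pmf); tails[len(pmf)] = 0."""
    tails = [0.0] * (len(pmf) + 1)
    for k in range(len(pmf) - 1, -1, -1):
        tails[k] = tails[k + 1] + pmf[k]
    return tails

def threshold(null_pmf, alpha):
    """Smallest t with P_0(N >= t) <= alpha under the (approximate) null pmf."""
    for t, tail in enumerate(upper_tails(null_pmf)):
        if tail <= alpha:
            return t

def power(alt_pmf, t):
    """P_1(N >= t) under the alternative pmf."""
    tails = upper_tails(alt_pmf)
    return tails[t] if t < len(tails) else 0.0

# Toy distributions (hypothetical), only to show the mechanics.
null_pmf = [0.5, 0.3, 0.15, 0.04, 0.01]
alt_pmf = [0.1, 0.2, 0.3, 0.2, 0.2]
t_alpha = threshold(null_pmf, alpha=0.025)
print(t_alpha, power(alt_pmf, t_alpha))
```

In practice the null and alternative mass functions would come from the compound Poisson approximation, truncated at a count beyond which the remaining mass is negligible.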

3. Results: Simulation Studies and Biological Applications

3.1. Evaluation of the accuracy of the normal and compound Poisson approximations using simulations

We study nucleotide sequences consisting of four states (A, C, G, T) and consider three relatively short patterns (“TAT,” “ACGT,” and “CGCG”) and two relatively long patterns (“ACGTATC” and “AAGAAGAA”). The pattern “ACGT” does not have any periods and the patterns “TAT” and “CGCG” have a period 2. The pattern “ACGTATC” does not have any periods and the pattern “AAGAAGAA” has three periods {3, 6, 7} with principal periods {3, 7}. For each pattern, we compare the histogram of the simulated number of occurrences of a pattern with the theoretical compound Poisson approximation probability mass function given in Section 2. For patterns having a relatively high number of occurrences, e.g., mean at least 50 in the cases we consider, we also plot the density function of the normal approximation.

In all our simulations, we use the following parameters: the nucleotide frequencies of (A, C, G, T) are (a) GC-rich (0.15, 0.35, 0.35, 0.15), (b) uniform (0.25, 0.25, 0.25, 0.25), and (c) GC-poor (0.35, 0.15, 0.15, 0.35). These settings allow us to see the effect of nucleotide frequency on the accuracy of the theoretical approximations. The sequence length n is chosen as either 5,000 for the short patterns or 20,000 for the long patterns. The number of sequence reads M is either 500 for the short patterns or 4,000 for the long patterns. The read length is β = 80. We assume that the sequence reads are either homogeneously or heterogeneously chosen from the long sequence. In heterogeneous sampling of the reads, we divide the sequence into 100 consecutive blocks. For each block, we sample a random number from the gamma distribution Γ(1, 20) and the sampling probability λ_i for each position in the block is proportional to the chosen number (Zhang et al., 2008). We consider both the single- and double-strand models in our simulation studies. The number of simulations for each case is 10,000. Note that the sequence length and the number of reads simulated here are much smaller than the corresponding values in real studies. These numbers are chosen to save computational time. The qualitative results should hold for much longer sequences and higher numbers of reads.
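The heterogeneous sampling scheme described above can be sketched as follows. Reading Γ(1, 20) as shape 1 and scale 20 is an assumption; with shape 1 the choice between scale and rate does not affect the result, because the λ_i are normalized to sum to one. The function names are illustrative.

```python
import random

def heterogeneous_start_probs(n, beta, n_blocks=100, shape=1.0, scale=20.0, rng=None):
    """Read-start probabilities lambda_i, constant within each of n_blocks blocks,
    proportional to one gamma draw per block (parametrization is an assumption)."""
    rng = rng or random.Random()
    n_starts = n - beta + 1
    block_weights = [rng.gammavariate(shape, scale) for _ in range(n_blocks)]
    block_len = n / n_blocks
    raw = [block_weights[min(int(i / block_len), n_blocks - 1)] for i in range(n_starts)]
    total = sum(raw)
    return [x / total for x in raw]

def sample_read_starts(lam, M, rng=None):
    """Draw M independent 0-based read start positions with probabilities lam."""
    rng = rng or random.Random()
    return rng.choices(range(len(lam)), weights=lam, k=M)

rng = random.Random(2012)
lam = heterogeneous_start_probs(n=5000, beta=80, rng=rng)
starts = sample_read_starts(lam, M=500, rng=rng)
print(len(lam), round(sum(lam), 6), min(starts), max(starts))
```

Homogeneous sampling corresponds to replacing `lam` with a constant vector of 1/(n − β + 1).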

Due to page limitations, we present the figures for the simulation results as Figures S1–S9 in the Supplementary Material (see www.liebertonline.com/cmb for Supplementary Material). We make the following observations. First, when the average number of occurrences of the pattern of interest is relatively large, for example, greater than 500, both the normal and compound Poisson approximations work well. However, for most of the cases we considered in this study, the compound Poisson approximation outperforms the normal approximation. Second, when the sequence reads are heterogeneously sampled from the long background sequence, the range of the number of occurrences of the patterns will be larger than that under the homogeneous sampling scheme. Third, the distribution of the number of occurrences of patterns can have multiple peaks under some situations and the compound Poisson approximation can accurately capture such features.

3.2. Power studies using simulations

We next study the power of detecting enriched patterns when such patterns indeed occur more frequently than expected. For this, we consider sequences with random instances of a motif on an i.i.d. background sequence, as described in Subsection 2.5. Under this model, a motif instance is inserted at each position of the background sequence with probability 1 − ρ (the motif density). Thus, the consensus pattern of the motif is enriched compared to the background. The background sequence models and the inserted motifs “TAT,” “ACGT,” “CGCG,” “ACGTATC,” and “AAGAAGAA” are all the same as in Subsection 3.1. For a type I error α = 0.025, we first use Equation 16 to find the threshold t_α based on the approximate distribution of the number of occurrences of the pattern in the reads under the null model ρ = 1. The theoretical power under the alternative model ρ < 1 is calculated using Equation 17. We run 10,000 simulations based on the alternative model and record the number of occurrences of the corresponding pattern in each simulation. The simulated power is the fraction of simulations in which the number of occurrences of w^(d) in the reads falls in the rejection region defined by t_α, where w^(d) is the consensus pattern of the motif, which is also the inserted pattern in our simulations.
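The simulated power can be sketched as below. This is an illustrative sketch under stated assumptions, not the procedure of Equations 16 and 17: the function names are hypothetical, occurrences are counted with overlaps allowed, and a simulation is assumed to reject the null model when its count is at least the threshold.

```python
def count_overlapping(text, word):
    """Count occurrences of word in text, allowing overlapping occurrences."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def total_count(reads, word):
    """Total number of occurrences of word over a collection of reads."""
    return sum(count_overlapping(read, word) for read in reads)


def empirical_power(counts, threshold):
    """Fraction of simulated counts that reach the threshold (assumed rule: count >= threshold)."""
    return sum(c >= threshold for c in counts) / len(counts)
```

Applied to counts simulated under the null model ρ = 1, the same function gives the simulated type I error rate rather than the power.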

Tables 1 and 2 compare the theoretical and the simulated power of the occurrence count for detecting the corresponding enriched patterns at different values of the motif density 1 − ρ for the patterns “TAT,” “ACGTATC,” and “AAGAAGAA” under the single- and double-strand GC-poor models, respectively. The patterns “ACGT” and “CGCG” are identical to their own reverse complements, so their counts under the double-strand model are twice the counts under the single-strand model, and the power of detecting them is the same under both models. The power results for these two patterns under the GC-poor model are given in Table S1, and the complete results for the GC-rich and uniform background models are given in Tables S2–S7 in the Supplementary Material. The following conclusions can be drawn from the tables. First, the threshold value calculated from Equation 16 is conservative in that the simulated type I error rate is smaller than the specified type I error α in most situations. Second, the theoretical power given in Equation 17 is very close to the simulated power when the theoretical power is relatively large (e.g., greater than 50%). Third, the power of detecting enriched patterns under heterogeneous read sampling is smaller than that under homogeneous read sampling.

Table 1.

Comparison of the Simulated and Theoretical Power (×100) for Patterns TAT, ACGTATC, and AAGAAGAA Under the Single-Strand GC-Poor Model

			Scaled motif density
		Threshold	0	1	2	3	4	5	6	7	8	9
Homogeneous read sampling
TAT	Simulation	1966	1.2	2.2	4.7	8.9	14.5	21.9	33.3	45.1	55.2	67.1
	Theory	1966	2.4	4.8	8.4	13.7	20.7	29.3	39.1	49.4	59.5	68.9
ACGTATC	Simulation	54	1.8	14.8	32.0	51.8	70.8	83.5	89.3	93.5	97.1	98.4
	Theory	54	2.4	14.9	34.9	55.3	71.8	83.3	90.6	94.9	97.4	98.7
AAGAAGAA	Simulation	49	2.2	15.6	38.8	56.3	74.2	87.0	92.3	95.3	97.0	99.2
	Theory	49	2.3	16.6	38.2	58.9	74.9	85.6	92.1	95.9	97.9	99.0
Heterogeneous read sampling
TAT	Simulation	2026	1.3	2.1	5.1	7.9	11.2	16.5	24.2	32.3	40.3	49.6
	Theory	2026	2.4	4.2	6.7	10.1	14.6	20.1	26.5	33.8	41.6	49.6
ACGTATC	Simulation	70	2.4	9.7	19.6	33.6	46.0	62.6	72.7	81.3	87.9	93.3
	Theory	70	2.4	9.7	21.4	35.4	49.5	62.3	72.9	81.2	87.4	91.7
AAGAAGAA	Simulation	63	2.0	12.2	27.5	38.7	51.8	67.7	76.0	85.3	89.8	93.0
	Theory	63	2.4	11.2	24.4	39.5	54.1	66.7	76.8	84.4	89.8	93.5

For the pattern TAT, the sequence length is n = 5000, the number of reads is M = 500, and the scaled motif density is 1000(1 − ρ). For the two long patterns, the sequence length is n = 20,000, the number of reads is M = 4000, and the scaled motif density is 20,000(1 − ρ). The read length is β = 80 and the type I error is α = 2.5%. The “Threshold” is obtained using Equation 16 based on the theoretical approximate distribution of the number of occurrences of a pattern under the null model. The number of simulations is 10,000.

Table 2.

Comparison of the Simulated and Theoretical Power (×100) for Patterns TAT, ACGTATC, and AAGAAGAA Under the Double-Strand GC-Poor Model

			Scaled motif density
		Threshold	0	1	2	3	4	5	6	7	8	9
Homogeneous read sampling
TAT	Simulation	3875	0.9	1.5	2.9	5.7	8.8	13.9	21.5	30.1	37.9	48.2
	Theory	3875	2.49	4.3	6.9	10.6	15.4	21.4	28.3	36.0	44.2	52.4
ACGTATC	Simulation	82	2.3	13.8	37.1	53.8	73.6	83.5	91.8	94.7	98.1	98.9
	Theory	82	2.4	16.1	37.4	58.3	74.5	85.4	92.0	95.8	97.9	99.0
AAGAAGAA	Simulation	73	1.9	10.5	26.3	43.3	60.7	72.9	83.5	89.3	93.3	97.0
	Theory	73	2.3	11.2	26.4	44.2	60.9	74.3	84.0	90.4	94.5	97.0
Heterogeneous read sampling
TAT	Simulation	3982	1.2	1.9	2.9	4.5	6.9	10.6	16.8	21.5	27.5	33.4
	Theory	3982	2.49	3.8	5.7	8.1	11.2	14.9	19.4	24.4	30.0	36.0
ACGTATC	Simulation	101	2.2	7.7	12.9	27.7	37.5	51.3	60.0	68.9	79.2	83.7
	Theory	101	2.46	7.4	15.5	26.0	37.8	49.7	60.7	70.4	78.3	84.5
AAGAAGAA	Simulation	91	2.48	7.1	15.6	26.1	38.3	53.5	66.2	71.9	79.1	86.2
	Theory	91	2.4	7.9	16.9	28.2	40.6	52.8	63.8	73.2	80.7	86.4

The parameters are the same as in Table 1.

3.3. Applications to a ChIP-Seq data set in Valouev et al. (2008)

Now we apply the theory to a ChIP-Seq data set using transcription factor GABP in Valouev et al. (2008). We consider the promoter region of the Nuclear Matrix Transcription Factor 4 gene (ZNF384) between position 6,667,900 and position 6,669,500 (a total of 1600 bp) on human chromosome 12, NCBI build 36. The gene ZNF384 has been shown to be a regulatory target of GABP, and the region is enriched with ChIP-Seq reads, as shown in the supplementary materials of Valouev et al. (2008). Our objective is to show the applicability of the theory developed in this article, not to compare it with other computational methods of peak calling for ChIP-Seq data. The position weight matrix (PWM) of the GABP binding site is given in Table S8 (JASPAR, http://jaspar.cgb.ki.se/; ID number MA0062.2) in the Supplementary Material. The consensus sequence formed by the dominant nucleotide at each position is “CCGGAAGTGGC”.

In a typical ChIP-Seq experiment, DNA regions of interest are sheared into short fragments, and the specific DNA fragments interacting with the protein of interest are isolated by immunoprecipitation. Then NGS is used to sequence either end of each fragment. These end sequences are referred to as tag sequences. In Valouev et al. (2008), the tag sequences are of length 25 bp. Since the tag sequences from ChIP-Seq can come from either the forward or the reverse strand of the selected fragments, and the tag sequences may not contain the GABP binding sites, we extend both the forward and reverse strands to the whole sequence fragments as follows. It was estimated in Valouev et al. (2008) that the median read length in the GABP data set is 56 bp, with mean around 57 bp. So we extend each forward-strand tag by 31 bp in the forward direction, and each reverse-strand tag by 31 bp in the reverse direction, so that each tag is associated with a read of 56 bp.
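The tag extension can be illustrated with a short sketch. The coordinate convention is an assumption made for illustration only: `start` is taken to be the leftmost forward-strand coordinate of the aligned 25-bp tag, intervals are half-open, and the function name is hypothetical.

```python
def extend_tag(start, strand, tag_len=25, fragment_len=56):
    """Return the half-open interval [begin, end) covered by the extended read.

    start  -- leftmost forward-strand coordinate of the aligned tag (assumed convention)
    strand -- "+" for a forward-strand tag, "-" for a reverse-strand tag
    """
    if strand == "+":
        # Extend downstream: the read begins where the tag begins.
        return start, start + fragment_len
    if strand == "-":
        # Extend toward lower coordinates: the read ends where the tag ends.
        end = start + tag_len
        return end - fragment_len, end
    raise ValueError("strand must be '+' or '-'")
```

With the default values, each tag is extended by 56 − 25 = 31 bp in its own direction of synthesis.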

We analyze three different read data sets: reads mapped to the forward strand only, reads mapped to the reverse strand only, and reads mapped to the forward and reverse strands combined. For each k-tuple (k = 6), we first approximate the p-value corresponding to the k-tuple using the control (Rx-noIP) data and the compound Poisson approximation. In such calculations, we use the nucleotide frequencies calculated from the extended reads. The distribution of the reads along the genomic region is estimated empirically by the fraction of reads starting from each position as follows. Since the number of reads starting from individual positions is generally small, the estimate of λ_i based only on the number of reads starting at the i-th position is not reliable, so we estimate λ_i within a window. For a given window size S, we estimate λ_i by the average number of reads starting at the positions within the window of size S centered at i, divided by the total number of reads; in other words,

$$\hat{\lambda}_i = \frac{1}{S\,M} \sum_{j} M_j ,$$

where the sum runs over the S positions j in the window centered at i, M_j is the number of reads starting at position j, and M is the total number of reads. Using these estimated parameters, we can approximate the p-value corresponding to each k-tuple.
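A minimal sketch of the windowed estimate of λ_i follows. The function name is hypothetical, and two details are assumptions not stated in the text: the window for position i is taken to start S // 2 positions to its left, and windows that overrun the ends of the region are truncated and averaged over the positions that remain.

```python
def estimate_lambda(start_counts, window):
    """Windowed estimate of read-start probabilities.

    start_counts -- list where start_counts[i] is the number of reads starting at position i
    window       -- window size S (number of positions averaged around each i)
    """
    total_reads = sum(start_counts)
    n = len(start_counts)
    half = window // 2
    lam = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)  # exclusive end of the window
        mean_count = sum(start_counts[lo:hi]) / (hi - lo)
        lam.append(mean_count / total_reads)
    return lam
```

With S = 1 the estimate reduces to the raw fraction of reads starting at each position. Because of the truncation at the boundaries, the estimates need not sum exactly to one, and a final renormalization may be appropriate.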

We use our approach to analyze both the control and the ChIP-Seq data. We expect that no k-tuples are significant for the control data, while the dominant patterns in the motif should be enriched in the ChIP-Seq data set. Window sizes S from 1 to 50 in steps of 5 are used, and the results are similar. Table 3 presents the 10 k-tuples (k = 6) with the smallest p-values when S = 20, using the reads mapped to the forward strand, the reverse strand, and both strands for the control data, and Table 4 presents the corresponding results based on the ChIP-Seq data.

Table 3.

Top 10 k-Tuples (k = 6) with Smallest p-Values Using Reads Mapped to the Forward, Reverse, and Both Forward and Reverse Strands for the Control Data Set with Transcription Factor GABP

Forward		Reverse		Combined
6-tuple	p-value	6-tuple	p-value	6-tuple	Complement	p-value
CACTTC	5.82E-06	GAAGTG	9.53E-05	CACTTC	GAAGTG	6.01E-07
ACTTCC	0.000349	GTGAGT	0.000584	CTATAG	CTATAG	5.76E-05
CTTCTG	0.000956	AGTCCT	0.000678	ACTTCC	GGAAGT	6.98E-05
TCCTTG	0.000956	GGAGGG	0.000731	CCCTCC	GGGAGG	1.11E-04
CCACTT	0.001367	CCGGAA	0.000961	CCGGAA	TTCCGG	1.56E-04
CCTTCC	0.001672	GTCCTC	0.001143	CAGAAG	CTTCTG	2.24E-04
CCTTGC	0.001743	AAGTGG	0.001221	ACTTCT	AGAAGT	3.26E-04
TTCCGG	0.001816	GAAGAA	0.001353	GCTATA	TATAGC	6.02E-04
CTCCTT	0.001828	CTATAG	0.001355	CGGAAG	CTTCCG	8.47E-04
CCTTGT	0.001899	GCTATA	0.001355	AAGTGG	CCACTT	1.02E-03

Table 4.

Top 10 k-Tuples (k = 6) with Smallest p-Values Using Reads Mapped to the Forward, Reverse, and Both Forward and Reverse Strands for the ChIP-Seq Data Set with Transcription Factor GABP

Forward		Reverse		Combined
6-tuple	p-value	6-tuple	p-value	6-tuple	Complement	p-value
CACTTC	5.55E-11	CGGAAG	2.06E-08	ACTTCC	GGAAGT	1.78E-15
ACTTCC	1.75E-10	CCGGAA	2.1E-08	CAGAAG	CTTCTG	1.78E-15
TTCCGG	4.97E-10	CTTCCG	2.75E-07	CGGAAG	CTTCCG	2.33E-15
CTTCCG	6.47E-10	TAGCGG	3.26E-07	CACTTC	GAAGTG	3.55E-15
CAGAAG	8.82E-07	GAAGCT	5.42E-07	GCAGAA	TTCTGC	7.44E-15
GCAGAA	1.16E-06	CTAGCG	1.2E-06	CCGGAA	TTCCGG	1.42E-14
AAATAG	1.25E-05	CACTTC	1.31E-06	CTATAG	CTATAG	2.80E-13
GAAATA	1.3E-05	ACTTCC	1.48E-06	AAGTGA	TCACTT	4.21E-12
TCACTT	1.32E-05	GAAGTG	1.53E-06	GCGGAA	TTCCGC	6.54E-12
ACACTT	2.45E-05	GGAAGT	1.77E-06	CCGCTA	TAGCGG	1.46E-10

For a family-wise type I error of 0.05, using the Bonferroni correction, only 6-tuples with p-value less than 0.05/4⁶ ≈ 1.22 × 10⁻⁵ are declared significant. With this criterion, one 6-tuple, “CACTTC,” was identified as significant using the reads mapped to the forward strand based on the control data. The tuple “CACTTC” is complementary to the dominant pattern at positions [4,9]. The pattern with the next smallest p-value, although not significant by our criterion, is “ACTTCC,” which is complementary to the dominant pattern at positions [3,8]. From these two patterns, it is possible to construct a consensus sequence of seven nucleotides, “CACTTCC,” which is complementary to the dominant pattern at positions [3,9]. We therefore see some GABP motif signals in the control data set.
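For illustration, an upper-tail p-value under a compound Poisson distribution can be computed with the Panjer recursion (Panjer, 1981) and compared with the Bonferroni cutoff. This is a generic sketch, not the authors' algorithm: the Poisson rate and the clump-size distribution are hypothetical inputs (in the paper they come from the compound Poisson approximation, which is not reproduced here), and all function names are assumptions.

```python
import math


def compound_poisson_pmf(rate, clump_pmf, max_count):
    """P(N = 0), ..., P(N = max_count) for a compound Poisson variable N.

    rate      -- Poisson rate of the number of clumps
    clump_pmf -- dict {clump size >= 1: probability}, summing to 1
    """
    pmf = [math.exp(-rate)] + [0.0] * max_count
    for k in range(1, max_count + 1):
        acc = 0.0
        for size, q in clump_pmf.items():
            if size <= k:
                acc += size * q * pmf[k - size]
        # Panjer recursion for the Poisson case: p_k = (rate / k) * sum_j j * q_j * p_{k-j}
        pmf[k] = rate / k * acc
    return pmf


def upper_tail_pvalue(rate, clump_pmf, observed):
    """P(N >= observed) under the compound Poisson distribution."""
    if observed <= 0:
        return 1.0
    pmf = compound_poisson_pmf(rate, clump_pmf, observed - 1)
    return max(0.0, 1.0 - sum(pmf))


def bonferroni_significant(pvalue, k, family_alpha=0.05):
    """Compare a p-value with the Bonferroni cutoff over all 4**k k-tuples."""
    return pvalue < family_alpha / 4 ** k
```

Computing the tail as one minus a cumulative sum loses precision for very small p-values, so a production implementation would sum the upper tail directly or work on the log scale.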

For the ChIP-Seq data, six 6-tuples are significant based on reads mapped to the forward strand. The top four 6-tuples with p-value at most 6.47 × 10⁻¹⁰—“CACTTC,” “ACTTCC,” “TTCCGG,” and “CTTCCG”—are complementary to the dominant patterns at positions [4,9], [3,8], [1,6], and [2,7], respectively. From these four patterns, we are able to construct a consensus sequence of 9 nucleotides “CACTTCCGG.” Similar observations can be made based on reads mapped to the reverse strand and both strands. We also carried out similar studies using k = 4 and 5, and the results are similar (data not shown).
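The correspondence between the top forward-strand 6-tuples and positions of the consensus sequence “CCGGAAGTGGC” can be checked with a short sketch. The helper names are hypothetical; positions are reported 1-based and inclusive, matching the bracket notation used in the text.

```python
GABP_CONSENSUS = "CCGGAAGTGGC"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def consensus_positions(word, consensus=GABP_CONSENSUS):
    """1-based inclusive positions where the reverse complement of word
    first matches the consensus, or None if it does not occur."""
    idx = consensus.find(reverse_complement(word))
    if idx < 0:
        return None
    return idx + 1, idx + len(word)


if __name__ == "__main__":
    for word in ("CACTTC", "ACTTCC", "TTCCGG", "CTTCCG"):
        print(word, consensus_positions(word))
```

The same check applied to the 9-nucleotide sequence “CACTTCCGG” locates it at positions [1,9] of the consensus.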

4. Discussion

In this article, we study the distribution of the number of occurrences of patterns in sequence reads randomly sampled from long Markovian sequences. This problem comes naturally from the analysis of sequence reads generated from NGS including ChIP-Seq and RNA-Seq. In this article, we first develop probabilistic models for the background sequences and the sampling of sequence reads using NGS. The background sequence is modeled as a Markovian sequence. Each sequence read starts from the i-th position of the background sequence with probability λ_i and the sampling of sequence reads from the background sequence is assumed to be independent. Based on the model, we study the limit distribution of the number of occurrences of any k-tuple patterns. Two approximate distributions are considered. We assume throughout the paper that both the background sequence length and the number of sequence reads are large and that the sequence reads do not concentrate on particular regions of the background sequence. We first give a normal approximation for the number of occurrences of frequent patterns and provide formulas to calculate the mean and variance of the approximate normal distribution. For relatively rare patterns, we provide a new compound Poisson approximation for the number of occurrences. Simulation studies are first used to evaluate the theoretical results, and it is shown that the compound Poisson approximation seems to work well in most of the situations. The compound Poisson approximation is then used to analyze ChIP-Seq data mapped to the promoter region of the gene ZNF384 using transcription factor GABP. Surprisingly, we found GABP binding motif signals in the control data set indicating some ChIP residue effect even within the control data. With the ChIP-Seq data, we are able to recover the consensus patterns of the motif.

Despite the usefulness of the models and the approximations, there are some limitations. First, we assume that the background sequence follows a homogeneous Markov chain. In reality, the sequence to be sequenced may be heterogeneous, with different regions following different Markov models. If we have some idea about the composition of the nucleotides at different parts of the sequence, hidden Markov models can potentially be used to model such sequences. Second, we use an empirical method to estimate the distribution of sequence reads along the genome sequence. Due to the relatively low number of reads starting at particular positions, the empirically estimated read distribution may not be accurate, resulting in less reliable estimated p-values for each pattern. Several investigators studied the distribution of sequence reads from NGS technologies based on the sequence content surrounding a specific location (Hansen et al., 2010; Li et al., 2010) and showed that the local sequence content can predict the read distribution well. These results can potentially be used to model the read distribution along the genome sequence. The effect of such dependency on the distribution of the number of occurrences of patterns needs further study. Third, we assume that the fragments from NGS are of the same length. In reality, their length can vary and follow some distribution. This is another topic for future studies.

Supplementary Material

Supplemental data

Supp_Data.pdf (1 MB, pdf)

Acknowledgments

Z.Y.Z. was supported by NSFC 11071146 and Graduate Independent Innovation Foundation of Shandong University (GIIFSDU). G.R. was supported in part by the Institute for Mathematical Sciences of the National University of Singapore and US R21AG032743. M.S.W. was supported by NIH 1R21HG006199. Y.H.L. was supported by China NSFC 11071146. F.S. was supported by NIH P50 HG 002790 and 1R21HG006199; NSF DMS-1043075 and OCE 1136818; and NSFC 60805010.

Disclosure Statement

No competing financial interests exist.

References

Campbell A. Mrázek J. Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl. Acad. Sci. USA. 1999;96:9184–9189. doi: 10.1073/pnas.96.16.9184.
Chen L. Shao Q.-M. Normal approximation under local dependence. Ann. Probabil. 2004;32:1985–2028.
Dalevi D. Dubhashi D. Hermansson M. Bayesian classifiers for detecting HGT using fixed and variable order markov models of genomic signatures. Bioinformatics. 2006;22:517–522. doi: 10.1093/bioinformatics/btk029.
Dufraigne C. Fertil B. Lespinats S., et al. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res. 2005;33:e6. doi: 10.1093/nar/gni004.
Guibas L. Odlyzko A. Periods in strings. J. Combin. Theory Ser. A. 1981;30:19–42.
Hansen K. Brenner S. Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38:e131. doi: 10.1093/nar/gkq224.
Huang H. Error bounds on multivariate normal approximations for word count statistics. Adv. Appl. Probabil. 2002;34:559–586.
Jun S. Sims G. Wu G., et al. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution. Proc. Natl. Acad. Sci. USA. 2010;107:133–138. doi: 10.1073/pnas.0913033107.
Karlin S. Mrázek J. Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. USA. 1997;94:10227–10232. doi: 10.1073/pnas.94.19.10227.
Kleffe J. Langbecker U. Exact computation of pattern probabilities in random sequences generated by Markov chains. Comput. Appl. Biosci. 1990;6:347–353. doi: 10.1093/bioinformatics/6.4.347.
Lander E. Waterman M. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988;2:231–239. doi: 10.1016/0888-7543(88)90007-9.
Li J. Jiang H. Wong W. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50. doi: 10.1186/gb-2010-11-5-r50.
Lothaire M. Algebraic Combinatorics on Words (Encyclopedia of Mathematics and Its Applications). Cambridge University Press; New York: 1983.
Maclean D. Jones J. Studholme D. Application of “next-generation” sequencing technologies to microbial genetics. Nat. Rev. Microbiol. 2009;7:287–296. doi: 10.1038/nrmicro2122.
Mardis E. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 2008a;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
Mardis E. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008b;24:133–141. doi: 10.1016/j.tig.2007.12.007.
McHardy A. Martín H. Tsirigos A., et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods. 2006;4:63–72. doi: 10.1038/nmeth976.
Nekrutenko A. Li W. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 2000;10:1986–1995. doi: 10.1101/gr.10.12.1986.
Nuel G. Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics. Algorithms Mol. Biol. 2006;1:1–14. doi: 10.1186/1748-7188-1-5.
Panjer H. Recursive evaluation of a family of compound distributions. Astin Bull. 1981;12:22–26.
Pape U.J. Rahmann S. Sun F.Z., et al. Compound poisson approximation of the number of occurrences of a Position Frequency Matrix (PFM) on both strands. J. Comput. Biol. 2008;15:547–564. doi: 10.1089/cmb.2007.0084.
Pavesi G. Mereghetti P. Mauri G., et al. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004;32:W199–W203. doi: 10.1093/nar/gkh465.
Reinert G. Schbath S. Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J. Comput. Biol. 1998;5:223–253. doi: 10.1089/cmb.1998.5.223.
Reinert G. Schbath S. Waterman M. Probabilistic and statistical properties of words: an overview. J. Comput. Biol. 2000;7:1–46. doi: 10.1089/10665270050081360.
Reinert G. Schbath S. Waterman M. Statistics on words with applications to biological sequences, 268–346. In Applied Combinatorics on Words (Encyclopedia of Mathematics and Its Applications). Cambridge University Press; New York: 2005.
Robin S. Rodolphe F. Schbath S. DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press; New York: 2005.
Roquain E. Schbath S. Improved compound Poisson approximation for the number of occurrences of any rare word family in a stationary Markov chain. Adv. Appl. Probabil. 2007;39:128–140.
Schbath S. Compound Poisson approximation of word counts in DNA sequences. ESAIM Probabil. Stat. 1995;1:1–16.
Schbath S. An overview on the distribution of word counts in Markov chains. J. Comput. Biol. 2000;7:193–201. doi: 10.1089/10665270050081469.
Schbath S. Robin S. How can pattern statistics be useful for DNA motif discovery?, 319–350. In Scan Statistics: Methods and Applications (Statistics for Industry and Technology). Birkhäuser; Boston: 2009.
Shan G. Zheng W.M. Counting of oligomers in sequences generated by Markov chains for DNA motif discovery. J. Bioinform. Comput. Biol. 2009;7:39–54. doi: 10.1142/s0219720009003935.
Sims G. Jun S. Wu G., et al. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. USA. 2009;106:2677–2682. doi: 10.1073/pnas.0813249106.
Song K. Ren J. Zhai Z.Y., et al. Alignment-free sequence comparison based on next generation sequencing reads. Proc. RECOMB 2012. 2012:272–285. doi: 10.1089/cmb.2012.0228.
Uberbacher E. Mural R. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA. 1991;88:11261–11265. doi: 10.1073/pnas.88.24.11261.
Valouev A. Johnson D. Sundquist A., et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods. 2008;5:829–834. doi: 10.1038/nmeth.1246.
Waterman M. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall; London: 1995.
Willmot G. Panjer H. Difference equation approaches in evaluation of compound distributions. Insurance Math. Econ. 1987;6:43–56.
Wu G. Jun S. Sims G., et al. Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method. Proc. Natl. Acad. Sci. USA. 2009;106:12826–12831. doi: 10.1073/pnas.0905115106.
Zhai Z.Y. Ku S. Luan Y.H., et al. The power of detecting enriched patterns: an HMM approach. J. Comput. Biol. 2010;17:581–592. doi: 10.1089/cmb.2009.0218.
Zhang Z.D. Rozowsky J. Snyder, et al. Modeling ChIP sequencing in silico with applications. PLoS Comput. Biol. 2008;4:e1000158. doi: 10.1371/journal.pcbi.1000158.
