Abstract
Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/∼fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
Key words: algorithms, genome analysis, HMM, next generation sequencing, statistical models
1. Introduction
The study of the occurrences of word patterns in sequences has played an important role in molecular sequence analysis. Here, we shall use word pattern of length k and k-tuple interchangeably; often word patterns are also just referred to as words. For a given k, the frequencies of word patterns of length k form a vector, referred to as sequence signature (Campbell et al., 1999). Sequence signatures of genomic sequences of varying characteristics are usually different. For example, coding and non-coding sequences usually have different signatures and thus sequence signatures can be useful features to distinguish coding from non-coding sequences (Uberbacher and Mural, 1991). Sequence signatures within different parts of a genome tend to be similar, while they differ significantly between genomes (Karlin and Mrázek, 1997, Nekrutenko and Li, 2000). Thus, sequence signatures have been used to study the evolutionary relationship between different genomic sequences (Jun et al., 2010, Karlin and Mrázek, 1997, Sims et al., 2009, Wu et al., 2009), to study horizontal gene transfer (Dalevi et al., 2006, Dufraigne et al., 2005), and to bin sequence reads from metagenomic studies so that reads in the same bin tend to have similar sequence signatures (McHardy et al., 2006). The sequence signatures can also be employed to detect enrichment for short words. For example, the upstream regions of co-regulated genes usually share common transcription factor binding sites (TFBS) referred to as motifs, and thus motifs are usually enriched within these sequences. Finding enriched word patterns within these sequences is a powerful tool for the identification of TFBS (Pavesi et al., 2004).
Due to the many applications of sequence signatures, extensive studies have been carried out to study the distribution of the number of occurrences of word patterns in one or multiple long sequences consisting of independent identically distributed (i.i.d.) letters and sequences generated by both Markov and hidden Markov models (HMM). Several excellent reviews (Reinert et al., 2000, 2005, Schbath, 2000, Schbath and Robin, 2009) and a book (Robin et al., 2005) on this topic are available. The distribution of the number of occurrences of a pattern can also be studied using the so-called “imbedded Markov chain” techniques (Kleffe and Langbecker, 1990, Nuel, 2006, Shan and Zheng, 2009). However, the computation of p-values using these techniques can be very time consuming and impractical for long sequences. We recently studied the power of detecting enriched patterns when motifs are randomly distributed along the genome using HMM (Zhai et al., 2010).
In all these studies, one or several long sequences are available and the word pattern occurrences along these long sequences are studied. Rather than providing a few long sequences, recent developments in sequencing technologies make it possible to sequence a large number of relatively short reads (e.g., 30–80 bp for Illumina/Solexa and 300–500 bp for Roche 454) efficiently and economically. These new sequencing technologies have revolutionized current studies of many biological problems including locating genomic regions of TFBS, histone modification, and chromatin structure using ChIP-Seq, resequencing of known genomes for the identification of genetic polymorphisms, and sequencing of unknown genomes. For the applications of these NGS technologies, see recent reviews (Maclean et al., 2009, Mardis, 2008a,b). Although many computational methods have been developed to analyze NGS data, to our knowledge no studies on the distribution of the number of occurrences of word patterns in the sequence reads generated from NGS have been carried out. In this article, we fill this gap. The main challenge compared to word counts in long sequences is that, in NGS, two random processes are involved: the randomness in the background genome sequence and the random sampling of the reads from that sequence.
The study of the distribution of the numbers of occurrences of word patterns from NGS read data has several important applications, in particular, when the complete genome sequences are not available. First, such distributions are important for the comparison of genomes when NGS short reads are available for each genome (Song et al., 2012). Second, they can be used to identify enriched or depleted patterns in genomes whose complete genomes are not known. Such enriched or depleted patterns can be used to characterize the genome sequences. Third, the null distributions of the numbers of occurrences of patterns can be used to identify enriched patterns in ChIP-Seq experiments and such enriched patterns can be useful for the identification of TFBS.
In this article, we not only study the distributions of the numbers of occurrences of word patterns from NGS read data under a suitable null model, but we also address the issue of the power of the count statistics against an alternative model which assumes that there are motifs present in the sequence. Our methodology builds on Zhai et al. (2010), but differs from that article in the consideration of NGS data and the consideration of both strands of the genome sequences. In the study of word patterns for long sequences, both strands are rarely considered except in Pape et al. (2008). For NGS, the consideration of both strands is essential since the reads can come from both the forward strand and the reverse strand of the genome sequences. We provide a simpler approximate distribution for the number of occurrences of word patterns for NGS data than the approximations given in Pape et al. (2008).
The article is organized as follows. In Section 2, we first present the probability models for the background sequence and the sampling process of reads using NGS. Then the results for normal and compound Poisson approximations for the number of occurrences of patterns in NGS reads are given. As the approximations assume that both the length of the reads and the length of the background sequence go to infinity, whereas in reality they are reasonably short, we also give bounds on the approximation errors. We consider both single strand and double strand models. This section forms the core theoretical results of the article. In Section 3, we first present simulation results to show the validity of the theoretical results for both common and rare patterns, and then use the theoretical results to analyze a ChIP-Seq data set from Valouev et al. (2008). It is surprising to see that, even in the control data, some TFBS signals can be identified, indicating that some residue ChIP effects are present in the control data. Using ChIP-Seq data, we are able to identify the consensus patterns of the motif of interest. The article concludes with some discussion on the limitations of the approach and future research directions. Many of the proofs are given in Supplementary Materials (available online at www.liebertonline.com/cmb).
2. Methods
2.1. Probabilistic models for the background sequence and sampling of sequence reads using NGS
In NGS, a large number, M, of reads are randomly sampled from the genome. For studying the distribution of the number of occurrences of patterns among the M reads, two random processes are involved. The first source of randomness is the generation of the background genome sequence; the second is the random sampling of the reads from the background sequence.
As in previous studies reviewed in Robin et al. (2005) and Schbath and Robin (2009), the background sequence is modeled as a homogeneous ergodic Markov chain taking states in a finite alphabet of $L$ letters,
with transition probability matrix $T = (t_{ll'})_{L \times L}$. The Markov chain has a unique stationary distribution $\pi_0$. The results in this paper can also be extended to sequences generated by hidden Markov models without too much difficulty.
Next, we model the sampling of reads along the genome sequence using NGS. As it was shown in Zhang et al. (2008) that the homogeneous Lander-Waterman model (Lander and Waterman, 1988) for genomic mapping does not model the read distributions along the genome well, we use a modified version of the Lander-Waterman model to describe the distribution of reads along the genome. We assume that the sampled reads have the same length of β bp. A total of M reads are independently sampled from the genome of length n bp. Each read starts at position $i$ with probability $\lambda_i$, where $i = 1, \ldots, n - \beta + 1$, with $\sum_{i=1}^{n-\beta+1} \lambda_i = 1$.
Let
be any word pattern of length w with
. Then
is the probability of w. Let
be the number of occurrences of w in these M reads. To calculate the mean of
, note first that in each read of length β, the expected number of occurrences of w is (β − w + 1)P(w). As there are M reads, the expected number of occurrences of w in the M reads is

$$M(\beta - w + 1)\,P(\mathbf{w}). \tag{1}$$
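As a quick numerical check of Equation (1), the following minimal sketch computes the probability of a word under a stationary first-order Markov background and then the expected count M(β − w + 1)P(w). The function names and the dictionary encoding of the stationary distribution and transition matrix are illustrative assumptions, not part of the authors' software; the values M = 500 and β = 80 are borrowed from the simulation settings in Section 3.

```python
def word_probability(word, pi0, T):
    """P(w) = pi0[w_1] * prod_i T[w_i][w_{i+1}] for a stationary first-order chain."""
    p = pi0[word[0]]
    for a, b in zip(word, word[1:]):
        p *= T[a][b]
    return p


def expected_read_count(word, pi0, T, M, beta):
    """Expected number of occurrences of `word` in M reads of length beta (Equation 1)."""
    if beta < len(word):
        return 0.0
    return M * (beta - len(word) + 1) * word_probability(word, pi0, T)


# Special case: i.i.d. uniform letters, written as a Markov chain whose rows all equal pi0.
letters = "ACGT"
pi0 = {a: 0.25 for a in letters}
T = {a: dict(pi0) for a in letters}

mean_acgt = expected_read_count("ACGT", pi0, T, M=500, beta=80)
print(mean_acgt)
```

For the i.i.d. uniform case this reduces to 500 × 77 × 0.25⁴, so the dependence on the read length enters only through the factor β − w + 1.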
We study the approximate distribution of
and the approximate joint distribution of (
), where
indicates the set of word patterns. We consider both single strand and double strand models. In the single strand model, we assume that the reads just come from one strand. In the double strand model, the reads can come from either strand of the genome. We allow for the occurrences to overlap. For example if the sequence is 5′-CAATAATATAATAG-3′ and the word is ATA, then we count four occurrences in the single strand model. A clump of pattern w is a consecutive region of the sequence with overlapping occurrences of w. For the example given above, there are three clumps of occurrences, one clump (ATATA) of size two and two clumps of size one each. Counting the occurrences of ATA in the complementary sequence 5′-CTATTATATTATTG-3′ also, there are 4 + 1 = 5 occurrences of w in the double strand model. Note that we always count from the 5′ end to the 3′ end of the sequences.
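The counting conventions in this example (overlapping occurrences, clumps, and the double strand count) can be reproduced with a short sketch. The helper names below are hypothetical and chosen only for illustration; positions are 0-based in the code.

```python
def occurrences(seq, word):
    """0-based start positions of all (possibly overlapping) occurrences of word."""
    w = len(word)
    return [i for i in range(len(seq) - w + 1) if seq[i:i + w] == word]


def clump_sizes(seq, word):
    """Sizes of the clumps, i.e. maximal runs of overlapping occurrences."""
    sizes = []
    last = None
    for pos in occurrences(seq, word):
        if last is not None and pos < last + len(word):
            sizes[-1] += 1      # overlaps the previous occurrence: same clump
        else:
            sizes.append(1)     # starts a new clump
        last = pos
    return sizes


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complementary strand, read from its own 5' end to its 3' end."""
    return seq.translate(_COMPLEMENT)[::-1]


def double_strand_count(seq, word):
    """Occurrences of word on the given strand plus on the complementary strand."""
    return len(occurrences(seq, word)) + len(occurrences(reverse_complement(seq), word))


seq = "CAATAATATAATAG"
print(len(occurrences(seq, "ATA")))     # single strand count
print(clump_sizes(seq, "ATA"))          # clump sizes along the sequence
print(double_strand_count(seq, "ATA"))  # double strand count
```

Applied to the sequence in the text, the sketch finds four occurrences grouped into clumps of sizes one, two, and one, and five occurrences in the double strand model.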
2.2. Normal approximation for the number of occurrences of frequent patterns in randomly sampled NGS reads
In this subsection, we present our results for calculating the covariance of
and
under the models described in Subsection 2.1 for any two word patterns u and v. Proposition 2.1 presents a formula to calculate
. The covariance can then be derived using Equation (1). While the covariance of word counts for a single sequence read can be found in Waterman (1995), Proposition 12.1, the following Proposition 2.1 takes the randomness in the starting positions of the sequence reads into account.
Proposition 2.1
Let
be the underlying sequence of length n. Let
u
and
v
be two word patterns of length u and v, respectively, with u ≤ v. Assume that β ≥ u + v. Randomly choose M reads of length β from a genome of length n base pairs according to the model in Subsection 2.1 and let
and
be the numbers of occurrences of word patterns
u
and
v
in these reads, respectively. Then
can be calculated by
![]() |
where
, and
Nw[i, i + β − 1] is the number of occurrences of word pattern
w
in
.
Formulas for calculating
are given in the supplementary materials, Proposition A.1; they are based on a slight modification of the proof of Proposition 12.1 in Waterman (1995).
Proof of Proposition 2.1
Let Cw(m) be the number of occurrences of word pattern w in the m-th read,
. Then
![]() |
Let
![]() |
According to our model, it is easy to see that for word patterns u and v, the counts (Cu(m),Cv(m)) have the same distribution for all
. Similarly, for any m ≠ m′, (Cu(m),Cv(m′)) have the same distribution. Thus,
![]() |
and
![]() |
Since the Markovian sequence is stationary, Cu(1) has the same distribution as Nu[1,β]. Thus,
![]() |
Conditioning on the locations of the first and second reads, we have
![]() |
The proposition is proved. ▪
For the special case that the background sequence is i.i.d., we have the following corollary.
Corollary 2.1
Suppose that the background sequence is i.i.d. With the same notation as in Proposition 2.1, we have
1. The covariance of
and
can be calculated as

2. If
and M depends on n such that limn→∞M/n = θ, then

In particular, if the reads are uniformly sampled from the genomic sequence, i.e. λi = 1/(n − β + 1), then
.
Corollary 2.1 follows by noting that for the i.i.d. case and η ≥ β,
![]() |
The second part follows directly by taking the limit of Cov(
in Part 1) over M as n tends to infinity.
Given the approximate mean and variance of
, it is tempting to approximate the distributions of
using a normal distribution. The approximation is based on the heuristic that the counts in different reads are independent unless the reads overlap, and if the words are not too long, the count in each read would be approximately normally distributed.
As reads are not very long, the approximation error may not be negligible, and hence we give an upper bound on the approximation error. Our result is phrased in terms of
![]() |
where Z denotes a standard normal variable. Thus
(standardized count ≤ x)
, and a bound on dK can be used to obtain a conservative p-value for the observed standardized count.
Here, we employ Theorem 2.6 in Chen and Shao (2004), and assume the i.i.d. model for the underlying background sequence. Then Nw[i, i + β − 1] and Nw[j, j + β − 1] are independent unless |i − j| ≤ β, where Nw[l, l′] is the number of occurrences of word w in the interval [l, l′] along the long sequence. Using the notation
and
, the following result holds.
Theorem 2.1
Assume the i.i.d. model for the background sequence and let Z be standard normally distributed. Then for a word w of length w,
![]() |
where
.
In the case that all letters are equally likely and independent, and all
, the bound will be of order $O((\ln n)^{-1})$ when the word length is not too large, $w < \log_L(\beta/\ln n)$, while the read length is $\beta = c_1 L^w \ln n$ for a constant $c_1 > 1$ and the number of reads is $M = c_2 n/\ln n$ for a constant $c_2 > 0$. This type of regime is rather specific: for example, with n = 5,000 and w = 4 on a 4-letter alphabet it would require β > 2,181, while n/ln n ≈ 587; with n = 20,000 and w = 7 we would need β > 162,259, while n/ln n ≈ 2,019.5. Moreover, the above regime would require that M/n is small. Hence, it is no surprise that the normal approximation does not work well in many situations.
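The numbers quoted for this regime follow from the two scalings β = c1 L^w ln n (with c1 > 1) and M = c2 n / ln n. The short sketch below recomputes them; the function name is hypothetical, and taking c1 = 1 simply gives the lower bound on β.

```python
import math


def normal_regime(n, w, L=4):
    """Return (L**w * ln n, n / ln n): the lower bound on beta for c1 = 1
    and the scale of the number of reads M for c2 = 1."""
    ln_n = math.log(n)
    return (L ** w) * ln_n, n / ln_n


beta_min_short, reads_scale_short = normal_regime(5000, 4)
beta_min_long, reads_scale_long = normal_regime(20000, 7)

print(math.ceil(beta_min_short), round(reads_scale_short))
print(math.ceil(beta_min_long), round(reads_scale_long, 1))
```

Rounding the lower bounds up to the next integer reproduces the read lengths quoted above, which are far longer than the reads produced by current NGS platforms.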
In particular, in many practical applications of NGS, the coverage of the sequenced reads is moderate to high depending on the biological applications. Thus, the normal approximation may not work well in these situations. The theorems also explain the poor performance of normal approximation in our simulations in Section 3. We emphasize that the bound may not be the best possible in all settings.
A similar result is available for multivariate word counts. The generalization to a Markovian sequence is straightforward, using the arguments from Huang (2002) for the joint counts starting at a specified position, and a local dependence argument as above.
Finally, note that if Mλi is large for some i, then
might be better approximated by the sum of products of normally distributed variables, which is approximately normal only when the number of summands is large.
2.3. Compound Poisson approximation for the number of occurrences of rare patterns in randomly sampled NGS reads
For rare patterns along the background sequence, the normal approximation as described in Subsection 2.2 is not appropriate; instead, we present a compound Poisson approximation for the number of occurrences of such patterns. This compound Poisson approximation takes clumps of occurrences directly into account. Recall that a clump of word w is a maximum consecutive region of the background sequence with overlapping occurrences of w. For the clumps, we first introduce some notation. Let
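$\mathbf{w}^{(p)} = w_1 w_2 \cdots w_p$ be the p-th prefix composed of the first p letters of w. The set of periods of the word w of length w (Guibas and Odlyzko, 1981, Lothaire, 1983) is defined by

$$\mathcal{P}(\mathbf{w}) = \{\, p \in \{1, \ldots, w - 1\} : w_i = w_{i+p} \text{ for all } i = 1, \ldots, w - p \,\}.$$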
The set of principal periods of word pattern w,
, are those periods that cannot be written as multiples of other periods. It was shown in Reinert and Schbath (1998) and Schbath (1995) that the number of clumps, Nc,n, in a Markovian sequence of length n can be approximated by a Poisson random variable with mean
, where
![]() |
(2) |
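The periods and principal periods that enter the clump construction can be computed directly from a word. The sketch below assumes the standard definition (p is a period of w when the i-th and (i + p)-th letters agree for every valid i) and treats a period as principal when it is not a multiple of a smaller period; the function names are illustrative only.

```python
def periods(word):
    """All p in {1, ..., w-1} such that word[i] == word[i+p] for every valid i."""
    w = len(word)
    return [p for p in range(1, w)
            if all(word[i] == word[i + p] for i in range(w - p))]


def principal_periods(word):
    """Periods that are not multiples of a smaller period."""
    ps = periods(word)
    return [p for p in ps if not any(q < p and p % q == 0 for q in ps)]


for pattern in ("TAT", "ACGT", "CGCG", "ACGTATC", "AAGAAGAA"):
    print(pattern, periods(pattern), principal_periods(pattern))
```

A word without periods, such as ACGT, cannot overlap itself, so every clump of such a word has size one.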
We also consider Xi, the number of occurrences of word pattern w in the i-th clump. Let
![]() |
be the set of all possible ways a clump of size k can occur. It was shown in Reinert and Schbath (1998) and Schbath (1995) that
![]() |
(3) |
where
![]() |
(4) |
A compound Poisson approximation for
can be motivated as follows. Let Zi be the number of reads covering the first occurrence of w in the i-th clump. We can reasonably assume that the read will cover the whole clump as the clump size is generally not long. Then we may approximate
![]() |
(5) |
We note that the above equation may slightly over-estimate the number of occurrences of w in the M reads since we only require that the reads cover the first w, not the whole clump. However, the approximation is reasonable since the sequence reads are generally much longer than the length of clumps and the reads covering the first w are most likely covering the whole clump.
Next, we study the distribution of Zi. If the i-th clump starts at j, then the number of reads containing the first w in the clump is a binomial random variable
which is asymptotically Poisson with mean
. Since the occurrences of clumps follow asymptotically a Poisson process, the starting point of the i-th clump is approximately uniformly distributed in [1, n − w + 1]. Thus, the independent random variables
with distribution
![]() |
(6) |
are a reasonable approximation to the random variables Zi.
The next theorem makes the heuristic argument precise. Recall that the total variation distance between two
-valued random variables X and Y (defined on the same probability space) is defined by
![]() |
Thus, if the total variation distance between X and Y is small, then for any subset, A, of the nonnegative integers, the difference between the probability for X to be in A and that for Y is also small. A bound in total variation distance can be applied to get conservative p-values for counts via the formula
![]() |
To state the results we need some more notation. Let α = α2 be the second-largest eigenvalue in modulus of the transition matrix T; the Perron-Frobenius Theorem ensures that |α| < 1. Let D be the diagonal matrix with the eigenvalues of T on the diagonal, ordered such that the first entry is α1 = 1. Then we decompose $T = PDP^{-1}$ such that the first column of P is
. For all
, let Jt denote the L × L matrix such that all its entries are equal to 0 except Jt(t, t) = 1, and we define
![]() |
Let π(x) be the probability of letter x and
![]() |
Let πmin be the smallest value of
. Put
![]() |
(7) |
Theorem 2.2
Let
have Poisson distribution with mean
, let
have distribution (6), let
have distribution (3) and assume that all these variables are independent. Then
![]() |
Here B(T, w, n) is given in (7).
Let
. Then the probability
can be calculated using the recursion (Panjer, 1981, Willmot and Panjer, 1987)
![]() |
(8) |
with initial value
and
.
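For readers who want to experiment with such a recursion, the sketch below implements the textbook Panjer recursion for a compound Poisson sum S = Y_1 + ... + Y_N, where N is Poisson with mean `lam` and the Y_i are i.i.d. with probability mass function `f` on the nonnegative integers. Whether this matches the exact parametrization of Equation (8) is an assumption; the function and argument names are hypothetical.

```python
import math


def compound_poisson_pmf(lam, f, s_max):
    """Return [P(S = 0), ..., P(S = s_max)] via the Panjer recursion.

    lam : mean of the Poisson number of summands
    f   : list with f[j] = P(Y = j) for j = 0, 1, ..., len(f) - 1
    """
    f0 = f[0] if f else 0.0
    g = [math.exp(-lam * (1.0 - f0))]            # P(S = 0)
    for s in range(1, s_max + 1):
        upper = min(s, len(f) - 1)
        total = sum(j * f[j] * g[s - j] for j in range(1, upper + 1))
        g.append(lam * total / s)                # P(S = s)
    return g


# Sanity check: if every summand equals 1, S is Poisson with mean lam.
pmf = compound_poisson_pmf(2.0, [0.0, 1.0], 10)
print(pmf[:4])
```

The recursion needs on the order of `s_max` times `len(f)` operations, which is why it is practical even when the counts are large.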
While B(T, w, n) has a complicated expression, when the Markov chain is reasonably well mixed then its leading term will be of the order
. The bound in Theorem 2.2 will be small when, firstly, the compound Poisson approximation for intervals of length β is good; secondly, the distribution of starting points of reads is relatively homogeneous; thirdly, the number M of reads is not too large compared to n; and fourthly,
is small.
Theorem 6.6.4 in Reinert et al. (2005) gives the analogous approximation for counts of different words
, where r is an integer. The bounds are of similar flavor but involve more notation which considers the possible overlaps between different words. We omit the result here.
2.4. Extending the approximations to the double-strand model
In the above subsections, we assume that the sequence under study is single-stranded for simplicity. However, DNA sequences are double-stranded and the sequence reads from NGS can come from either strand and it is not known which strand the reads come from. To take both strands into consideration, we consider both the reads and their complements. Among the M pairs of reads, the number of occurrences of w,
, is equal to the number of occurrences of the complement of w,
, because we consider the complement of each read. Next we study the distribution of
for any word pattern w by considering the following scenarios.
We first assume a palindrome such that
, for example, w = ACGT or CGCG. For such word patterns, it is obvious that
. Next we assume
. For each pair of complementary reads, we consider the one from the forward strand. Thus, we have a new set of M reads all from the forward strand. Note that the word pattern w occurs in one of the strands of a pair of complementary reads if the forward read contains either w or
. Thus, the total number of occurrences of w in the M pairs of complementary reads equals the number of occurrences of either w or
along the forward reads. Thus, we are interested in the joint word counts of w and its complement
, but in contrast to Reinert and Schbath (1998) we allow for mixed clumps of occurrences, that is, the clumps can be composed of combinations of w and its complement. A compound Poisson approximation, with bounds, for the joint count of w and its complement can be found in Roquain and Schbath (2007). The approximation is valid only for non-palindromes; it also requires that the word and its complement have non-zero probability of appearing in the sequence. Here we illustrate how such a compound Poisson approximation for word counts in single reads can be combined for NGS data.
Let
![]() |
where
can be either w or
, and
![]() |
and
is a subset of
by removing those that are multiples of other numbers. By definition, we have, for any word pattern w,
![]() |
Let Ck(w) be the subset of Ck such that the first word
, and
be the subset of Ck such that the first word
Define
and
.
Given the above notation, we consider the distribution of clumps starting with w and
, respectively. Here a clump is defined as a maximum region with overlapping occurrences of either w or
. Note that a clump starting with w occurs at a position i if (1) w occurs at position i and (2) sequences in
do not occur just before i. Thus, a w-clump starts at a particular position with probability
![]() |
(9) |
Similarly, the probability that a
-clump starts at a particular position is
![]() |
(10) |
We refer to the clumps starting with w as the w-clumps and those starting with
as the
-clumps. Both the w-clumps and
-clumps can be approximated by a Poisson process with parameters
and
, respectively. The superposition of the two approximate Poisson processes can again be approximated by a Poisson process with parameter
. Thus, the number of clumps including both the w-clumps and the
-clumps,
, can be approximated by a Poisson random variable with mean
.
We order the w- and
-clumps from the 5′-end to the 3′-end. Let Ii = 1 and Ii = 0 be the events that the i-th clump is a w-clump and
-clump, respectively. We have
![]() |
(11) |
Next we study the distribution of the number of occurrences of w or
in w-clumps and
-clumps, separately. A k-clump starting with w is referred to as a k-w-clump. Similarly, a k-clump starting with
is referred to as a k-
-clump. Then a k-w-clump occurs at a specific position i if (1) Ck(w) occurs at position i, (2) sequences in
do not occur just before i, and (3) Ck+1(w) does not occur at position i. Note that when we deduct the probability of the second and the third events, we deduct the probability of the event SCk+1(w) twice. Thus, we need to add the probability of this event, giving
![]() |
(12) |
Similarly, we have for
![]() |
(13) |
Let Xi and
be the numbers of occurrences of w or
in a w-clump and
-clump, respectively. Then the distributions of Xi and
are
![]() |
(14) |
As above, let
be the number of reads covering the first occurrence of w or
in the i-th clump.
The distribution of
can be calculated by
![]() |
The number of occurrences of w along the M pairs of reads can be approximated by
![]() |
(15) |
The argument is made precise in the next theorem.
Theorem 2.3
Let
have a Poisson distribution with mean
let
have distribution (6), let
have distribution (14), let
have distribution (11) and assume that all these variables are independent. Then
![]() |
Here
![]() |
where C > 0 and C′ > 0 are two explicit constants that only depend on the transition matrix T, and α is the second largest eigenvalue in modulus of T.
2.5. The approximate power of detecting enriched patterns using the compound Poisson approximation for the distribution of the number of occurrences of word patterns
The framework for the normal and compound Poisson approximations for the number of occurrences of word patterns is equally applicable to sequences generated by hidden Markov models. In particular, a regulatory sequence with many instances of transcription factor binding sites can be modeled by a hidden Markov model as in Zhai et al. (2010). Specifically, the long background sequence is modeled as an i.i.d. sequence. In addition, instances of a motif with a given position weight matrix are randomly inserted into the background sequence with probability 1 − ρ at each position. We refer to 1 − ρ as the motif density. Next generation sequencing is then used to sample M reads from the long sequence as modeled in Subsection 2.1. Based on the reads, we want to test the null hypothesis H0 : ρ = 1, i.e., no motif instances are inserted, versus the alternative hypothesis H1 : ρ < 1, i.e., motif instances are inserted in the underlying background sequence. Consider the dominant pattern w(d) in the motif consisting of the nucleotide with the highest probability in each position. We can use
as a statistic to test the hypotheses. For a given type I error α, we can obtain a threshold tα, that is, the smallest value of t such that
![]() |
(16) |
based on the theory for
developed above, where
is the approximate probability distribution of
when no motifs are inserted, i.e. ρ = 1.
The approximate power of the test statistic when ρ < 1 is given by
![]() |
(17) |
where
is the probability distribution under the alternative model. The approximate power for detecting the enrichment of certain patterns under the double-strand model can be calculated similarly. In the following, for convenience we use the term “power” to mean approximate power.
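Once the approximate null and alternative distributions of the count are available as probability mass functions, the threshold and the power can be read off from their upper tails. The sketch below assumes that the threshold is the smallest t whose null upper-tail probability P(count ≥ t) does not exceed α; this reading of Equation (16), the function names, and the toy distributions are all assumptions made for illustration.

```python
def upper_tail(pmf, t):
    """P(N >= t) for a pmf given as a list indexed by the count."""
    return sum(pmf[max(t, 0):])


def threshold(null_pmf, alpha):
    """Smallest t such that P_0(N >= t) <= alpha."""
    t = 0
    while upper_tail(null_pmf, t) > alpha:
        t += 1
    return t


def power(alt_pmf, t_alpha):
    """Probability under the alternative that the count reaches the threshold."""
    return upper_tail(alt_pmf, t_alpha)


# Toy distributions on {0, ..., 4}; these are not taken from the article.
null_pmf = [0.50, 0.30, 0.15, 0.04, 0.01]
alt_pmf = [0.10, 0.20, 0.30, 0.20, 0.20]

t_alpha = threshold(null_pmf, alpha=0.025)
print(t_alpha, power(alt_pmf, t_alpha))
```

Because the count is discrete, the attained type I error is usually strictly below α, which makes the test conservative.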
3. Results: Simulation Studies and Biological Applications
3.1. Evaluation of the accuracy of the normal and compound Poisson approximations using simulations
We study nucleotide sequences consisting of four states (A, C, G, T) and consider three relatively short patterns (“TAT,” “ACGT,” and “CGCG”) and two relatively long patterns (“ACGTATC” and “AAGAAGAA”). The pattern “ACGT” does not have any periods and the patterns “TAT” and “CGCG” have a period 2. The pattern “ACGTATC” does not have any periods and the pattern “AAGAAGAA” has three periods {3, 6, 7} with principal periods {3, 7}. For each pattern, we compare the histogram of the simulated number of occurrences of the pattern with the theoretical compound Poisson approximation probability mass function given in Section 2. For patterns having a relatively high number of occurrences, e.g., a mean of at least 50 in the cases we consider, we also plot the density function of the normal approximation.
In all our simulations, we use the following parameters: the nucleotide frequencies of (A, C, G, T) are (a) GC-rich (0.15, 0.35, 0.35, 0.15), (b) uniform (0.25, 0.25, 0.25, 0.25), and (c) GC-poor (0.35, 0.15, 0.15, 0.35). These settings allow us to see the effect of nucleotide frequency on the accuracy of the theoretical approximations. The sequence length n is chosen as either 5,000 for the three short patterns or 20,000 for the two long patterns. The number of sequence reads M is either 500 for the three short patterns or 4,000 for the two long patterns. The read length is β = 80. We assume that the sequence reads are either homogeneously or heterogeneously chosen from the long sequence. In heterogeneous sampling of the reads, we divide the sequence into 100 consecutive blocks. For each block, we sample a random number from the gamma distribution Γ(1, 20) and the sampling probability λi for each position in the block is proportional to the chosen number (Zhang et al., 2008). We consider both the single- and double-strand models in our simulation studies. The number of simulations for each case is 10,000. Note that the sequence length and the number of reads simulated here are much smaller than the corresponding values in real studies. These numbers are chosen to save computational time. The qualitative results should hold for much longer sequences and larger numbers of reads.
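One replicate of such a simulation can be sketched with the Python standard library. The code below is a simplified illustration under stated assumptions: an i.i.d. background, the single-strand model, blocks assigned by integer division of the start position, and Γ(1, 20) read as shape 1 and scale 20 (the scale cancels when the weights are normalized, so the parametrization does not affect λi). All function names are hypothetical.

```python
import random


def block_start_probs(n, beta, n_blocks=100, shape=1.0, scale=20.0, rng=random):
    """Heterogeneous start probabilities: one gamma weight per block, normalized
    over the n - beta + 1 admissible start positions."""
    block_weights = [rng.gammavariate(shape, scale) for _ in range(n_blocks)]
    raw = [block_weights[i * n_blocks // n] for i in range(n - beta + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def count_in_read(read, word):
    """Number of (possibly overlapping) occurrences of word in one read."""
    w = len(word)
    return sum(read[i:i + w] == word for i in range(len(read) - w + 1))


def simulate_read_count(word, n, M, beta, freqs, lam=None, rng=random):
    """One replicate: draw a background sequence, sample M reads, count word."""
    seq = "".join(rng.choices("ACGT", weights=freqs, k=n))
    # lam=None gives homogeneous sampling of the start positions.
    starts = rng.choices(range(n - beta + 1), weights=lam, k=M)
    return sum(count_in_read(seq[s:s + beta], word) for s in starts)


rng = random.Random(0)
gc_poor = (0.35, 0.15, 0.15, 0.35)
lam = block_start_probs(5000, 80, rng=rng)
print(simulate_read_count("TAT", 5000, 500, 80, gc_poor, lam, rng))
```

Repeating the last call many times and tabulating the counts gives the kind of histogram that is compared with the approximate distributions.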
Due to page limitations, we present the figures for the simulation results as Figures S1–S9 in the Supplementary Material (see www.liebertonline.com/cmb for Supplementary Material). We make the following observations. First, when the average number of occurrences of the pattern of interest is relatively large, for example, greater than 500, both the normal and compound Poisson approximations work well. However, for most of the cases we considered in this study, the compound Poisson approximation outperforms the normal approximation. Second, when the sequence reads are heterogeneously sampled from the long background sequence, the range of the number of occurrences of the patterns will be larger than that under the homogeneous sampling scheme. Third, the distribution of the number of occurrences of patterns can have multiple peaks under some situations and the compound Poisson approximation can accurately capture such features.
3.2. Power studies using simulations
We next study the power of detecting enriched patterns when such patterns indeed occur more frequently than expected. For this we consider sequences with random instances of a motif on an i.i.d. background sequence as described in Subsection 2.5. Under this model, a motif instance is inserted at each position of the background sequence with probability 1 − ρ (motif density). Thus, the consensus pattern of the motif is enriched compared to the background. The background sequence models and the inserted motifs “TAT,” “ACGT,” “CGCG,” “ACGTATC,” and “AAGAAGAA” are all the same as in Subsection 3.1. For a type I error α = 0.025, we first use Equation 16 to find the threshold tα based on the approximate distribution for
under the null model ρ = 1. The theoretical power under the alternative model that ρ < 1 is calculated using Equation 17. We run 10,000 simulations based on the alternative model and record the number of occurrences of the corresponding pattern. The simulated power is approximated by the fraction of times that
, where w(d) is the consensus pattern of the motif and it is the inserted pattern in our simulations.
Tables 1 and 2 compare the theoretical and the simulated power of
for detecting the corresponding enriched patterns for different values of motif density 1 − ρ for the patterns: “TAT,” “ACGTATC,” and “AAGAAGAA” under the single- and double-strand GC-poor models, respectively. The power of detecting the patterns “ACGT” and “CGCG” using the single- and double-strand model is the same because the counts for the double-strand model is twice the count for the single-strand model. The power results for these two patterns under the GC-poor model are given as Table S1 and the complete results for the GC-rich and uniform background models are given as Tables S2–S7 in Supplementary Material. The following conclusions can be obtained from the tables. First, the threshold value calculated from Equation 16 is conservative in that the simulated type I error rate is smaller than the specified type I error α in most of the situations. Second, the theoretical power given in Equation 17 is very close to the simulated power when the theoretical power is relatively large (e.g, greater than 50%). Third, the power of detecting enriched patterns under heterogeneous read sampling is smaller than that under homogeneous read sampling.
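The threshold-and-power logic described above can be illustrated with a purely empirical stand-in. The sketch below is an assumption-laden illustration: it derives the threshold from simulated null counts rather than from the approximate distribution of Equation 16, it assumes an upper-tail rejection region of the form "count ≥ t", and the function names are hypothetical.

```python
def upper_tail_threshold(null_counts, alpha):
    """Smallest t such that the empirical P(N >= t) under the null is at most alpha."""
    n = len(null_counts)
    for t in sorted(set(null_counts)):
        tail = sum(c >= t for c in null_counts) / n
        if tail <= alpha:
            return t
    # No observed value has a small enough tail: reject only beyond the maximum.
    return max(null_counts) + 1

def empirical_power(alt_counts, threshold):
    """Fraction of alternative-model simulations whose count reaches the threshold."""
    return sum(c >= threshold for c in alt_counts) / len(alt_counts)
```

With counts simulated under ρ = 1 passed as `null_counts` and counts simulated under some ρ < 1 passed as `alt_counts`, `empirical_power` plays the role of the "Simulation" rows in the tables, under the assumptions stated above.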
Table 1.
Comparison of the Simulated and Theoretical Power (×100) for Patterns TAT, ACGTATC, and AAGAAGAA Under the Single-Strand GC-Poor Model
| Pattern | Source | Threshold | Scaled motif density: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Homogeneous read sampling | | | | | | | | | | | | |
| TAT | Simulation | 1966 | 1.2 | 2.2 | 4.7 | 8.9 | 14.5 | 21.9 | 33.3 | 45.1 | 55.2 | 67.1 |
| | Theory | 1966 | 2.4 | 4.8 | 8.4 | 13.7 | 20.7 | 29.3 | 39.1 | 49.4 | 59.5 | 68.9 |
| ACGTATC | Simulation | 54 | 1.8 | 14.8 | 32.0 | 51.8 | 70.8 | 83.5 | 89.3 | 93.5 | 97.1 | 98.4 |
| | Theory | 54 | 2.4 | 14.9 | 34.9 | 55.3 | 71.8 | 83.3 | 90.6 | 94.9 | 97.4 | 98.7 |
| AAGAAGAA | Simulation | 49 | 2.2 | 15.6 | 38.8 | 56.3 | 74.2 | 87.0 | 92.3 | 95.3 | 97.0 | 99.2 |
| | Theory | 49 | 2.3 | 16.6 | 38.2 | 58.9 | 74.9 | 85.6 | 92.1 | 95.9 | 97.9 | 99.0 |
| Heterogeneous read sampling | | | | | | | | | | | | |
| TAT | Simulation | 2026 | 1.3 | 2.1 | 5.1 | 7.9 | 11.2 | 16.5 | 24.2 | 32.3 | 40.3 | 49.6 |
| | Theory | 2026 | 2.4 | 4.2 | 6.7 | 10.1 | 14.6 | 20.1 | 26.5 | 33.8 | 41.6 | 49.6 |
| ACGTATC | Simulation | 70 | 2.4 | 9.7 | 19.6 | 33.6 | 46.0 | 62.6 | 72.7 | 81.3 | 87.9 | 93.3 |
| | Theory | 70 | 2.4 | 9.7 | 21.4 | 35.4 | 49.5 | 62.3 | 72.9 | 81.2 | 87.4 | 91.7 |
| AAGAAGAA | Simulation | 63 | 2.0 | 12.2 | 27.5 | 38.7 | 51.8 | 67.7 | 76.0 | 85.3 | 89.8 | 93.0 |
| | Theory | 63 | 2.4 | 11.2 | 24.4 | 39.5 | 54.1 | 66.7 | 76.8 | 84.4 | 89.8 | 93.5 |
For the pattern TAT, the sequence length n = 5000, the number of reads M = 500, and the scaled motif density = 1000(1 − ρ). For the two long patterns, the sequence length n = 20,000, the number of reads M = 4000, and the scaled motif density = 20,000(1 − ρ). The read length β = 80. Type I error α = 2.5%. The “Threshold” is obtained using Equation 16 based on the theoretical approximate distribution of the number of occurrences of a pattern under the null model. The number of simulations is 10,000.
Table 2.
Comparison of the Simulated and Theoretical Power (×100) for Patterns TAT, ACGTATC, and AAGAAGAA Under the Double-Strand GC-Poor Model
| Pattern | Source | Threshold | Scaled motif density: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Homogeneous read sampling | | | | | | | | | | | | |
| TAT | Simulation | 3875 | 0.9 | 1.5 | 2.9 | 5.7 | 8.8 | 13.9 | 21.5 | 30.1 | 37.9 | 48.2 |
| | Theory | 3875 | 2.49 | 4.3 | 6.9 | 10.6 | 15.4 | 21.4 | 28.3 | 36.0 | 44.2 | 52.4 |
| ACGTATC | Simulation | 82 | 2.3 | 13.8 | 37.1 | 53.8 | 73.6 | 83.5 | 91.8 | 94.7 | 98.1 | 98.9 |
| | Theory | 82 | 2.4 | 16.1 | 37.4 | 58.3 | 74.5 | 85.4 | 92.0 | 95.8 | 97.9 | 99.0 |
| AAGAAGAA | Simulation | 73 | 1.9 | 10.5 | 26.3 | 43.3 | 60.7 | 72.9 | 83.5 | 89.3 | 93.3 | 97.0 |
| | Theory | 73 | 2.3 | 11.2 | 26.4 | 44.2 | 60.9 | 74.3 | 84.0 | 90.4 | 94.5 | 97.0 |
| Heterogeneous read sampling | | | | | | | | | | | | |
| TAT | Simulation | 3982 | 1.2 | 1.9 | 2.9 | 4.5 | 6.9 | 10.6 | 16.8 | 21.5 | 27.5 | 33.4 |
| | Theory | 3982 | 2.49 | 3.8 | 5.7 | 8.1 | 11.2 | 14.9 | 19.4 | 24.4 | 30.0 | 36.0 |
| ACGTATC | Simulation | 101 | 2.2 | 7.7 | 12.9 | 27.7 | 37.5 | 51.3 | 60.0 | 68.9 | 79.2 | 83.7 |
| | Theory | 101 | 2.46 | 7.4 | 15.5 | 26.0 | 37.8 | 49.7 | 60.7 | 70.4 | 78.3 | 84.5 |
| AAGAAGAA | Simulation | 91 | 2.48 | 7.1 | 15.6 | 26.1 | 38.3 | 53.5 | 66.2 | 71.9 | 79.1 | 86.2 |
| | Theory | 91 | 2.4 | 7.9 | 16.9 | 28.2 | 40.6 | 52.8 | 63.8 | 73.2 | 80.7 | 86.4 |
The parameters are the same as in Table 1.
3.3. Applications to a ChIP-Seq data set in Valouev et al. (2008)
Now we apply the theory to a ChIP-Seq data set using transcription factor GABP in Valouev et al. (2008). We consider the promoter region of the Nuclear Matrix Transcription Factor 4 gene (ZNF384) between position 6,667,900 and position 6,669,500 (a total of 1600 bp) on human chromosome 12, NCBI build 36. The gene ZNF384 has been shown to be a regulatory target of GABP, and the region is enriched with ChIP-Seq reads, as shown in the supplementary materials of Valouev et al. (2008). Our objective is to show the applicability of the theory developed in this article, not to compare it with other computational methods of peak calling for ChIP-Seq data. The position weight matrix (PWM) of the GABP binding site is given in Table S8 (JASPAR, http://jaspar.cgb.ki.se/; ID number MA0062.2) in the Supplementary Material. The consensus sequence formed by the dominant nucleotide at each position is “CCGGAAGTGGC”.
In a typical ChIP-Seq experiment, DNA regions of interest are sheared into short fragments, and the specific DNA fragments interacting with the protein of interest are isolated by immunoprecipitation. NGS is then used to sequence either end of each fragment. These end sequences are referred to as tag sequences. In Valouev et al. (2008), the tag sequences are 25 bp long. Since the tag sequences from ChIP-Seq can come from either the forward or the reverse strand of the selected fragments, and the tag sequences may not contain the GABP binding sites, we extend both the forward and reverse strands to the whole sequence fragments as follows. It was estimated in Valouev et al. (2008) that the median read length in the GABP data set is 56 bp, with a mean of around 57 bp. We therefore extend each forward-strand tag by 31 bp in the forward direction and each reverse-strand tag by 31 bp in the reverse direction, so that each tag is associated with a read of 56 bp.
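The tag extension described above amounts to simple interval arithmetic. The following sketch assumes a coordinate convention that the text does not specify: each tag is identified by its leftmost genomic coordinate, and intervals are half-open. The function name and arguments are hypothetical.

```python
def extend_tag(start, strand, tag_len=25, fragment_len=56):
    """Return the half-open interval [begin, end) covered by the extended read.

    start  -- leftmost genomic coordinate of the mapped tag (assumed convention)
    strand -- '+' for a forward-strand tag, '-' for a reverse-strand tag
    """
    extra = fragment_len - tag_len  # 31 bp for 25 bp tags and 56 bp reads
    if strand == "+":
        # Forward tags grow toward larger coordinates.
        return start, start + tag_len + extra
    if strand == "-":
        # Reverse tags grow toward smaller coordinates.
        return start - extra, start + tag_len
    raise ValueError("strand must be '+' or '-'")
```

Under this convention, a forward tag at coordinate 100 is extended to [100, 156) and a reverse tag at the same coordinate to [69, 125); both intervals span 56 bp.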
We analyze three different read data sets mapped to the forward strand only, the reverse strand only, and both the forward and the reverse strands combined. For each k-tuple (k = 6), we first approximate the p-value corresponding to the k-tuple using the control (Rx-noIP) data and the compound Poisson approximation. In such calculations, we use the nucleotide frequencies calculated from the extended reads. The distribution of the reads along the genomic region is estimated empirically by the fraction of reads starting from each position as follows. Since the number of reads starting from individual positions is generally small and the estimated distribution of reads λi using the number of reads starting at the i-th position is not reliable, we estimate λi within a window using the following approach. For a given window size S, we estimate λi by the average number of reads starting at the positions within the window of size S centered at i, in other words,
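$$\hat{\lambda}_i = \frac{1}{S\,M} \sum_{j \in W_i(S)} M_j,$$
where W_i(S) denotes the window of S positions centered at i, M_j is the number of reads starting at position j, and M is the total number of reads. Using these estimated parameters, we can approximate the p-value corresponding to each k-tuple.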
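The window estimate of the read start distribution described above might be computed as follows. This is a sketch under assumptions the text leaves open: how the window is placed for even S, how windows are truncated at the ends of the region, and a final renormalization so that the estimates sum to one. The function name is hypothetical.

```python
def windowed_read_distribution(start_counts, S):
    """Estimate lambda_i by averaging M_j / M over a window of S positions around i.

    start_counts -- list where entry j is the number of reads starting at position j
    S            -- window size (number of positions)
    """
    L = len(start_counts)
    M = sum(start_counts)
    left = (S - 1) // 2      # positions taken to the left of i (assumed split)
    right = S - 1 - left     # positions taken to the right of i
    lam = []
    for i in range(L):
        lo, hi = max(0, i - left), min(L, i + right + 1)
        lam.append(sum(start_counts[lo:hi]) / (S * M))
    total = sum(lam)
    # Windows are truncated at the region boundaries, so renormalize.
    return [x / total for x in lam]
```

With S = 1 the estimate reduces to the raw fraction of reads starting at each position; larger S trades positional resolution for stability when per-position counts are small.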
We use our approach to analyze both the control and the ChIP-Seq data. We expect that no k-tuples are significant for the control data, while the dominant patterns in the motif should be enriched in the ChIP-Seq data set. Different window sizes S from 1 to 50 in steps of 5 were used, and the results are similar. Table 3 presents the top 10 k-tuples (k = 6) with the smallest p-values when S = 20, using the reads mapped to the forward strand, the reverse strand, and both strands for the control data; Table 4 presents the corresponding results for the ChIP-Seq data.
Table 3.
Top 10 k-Tuples (k = 6) with Smallest p-Values Using Reads Mapped to the Forward, Reverse, and Both Forward and Reverse Strands for the Control Data Set with Transcription Factor GABP
| Forward 6-tuple | Forward p-value | Reverse 6-tuple | Reverse p-value | Combined 6-tuple | Combined complement | Combined p-value |
|---|---|---|---|---|---|---|
| CACTTC | 5.82E-06 | GAAGTG | 9.53E-05 | CACTTC | GAAGTG | 6.01E-07 |
| ACTTCC | 0.000349 | GTGAGT | 0.000584 | CTATAG | CTATAG | 5.76E-05 |
| CTTCTG | 0.000956 | AGTCCT | 0.000678 | ACTTCC | GGAAGT | 6.98E-05 |
| TCCTTG | 0.000956 | GGAGGG | 0.000731 | CCCTCC | GGGAGG | 1.11E-04 |
| CCACTT | 0.001367 | CCGGAA | 0.000961 | CCGGAA | TTCCGG | 1.56E-04 |
| CCTTCC | 0.001672 | GTCCTC | 0.001143 | CAGAAG | CTTCTG | 2.24E-04 |
| CCTTGC | 0.001743 | AAGTGG | 0.001221 | ACTTCT | AGAAGT | 3.26E-04 |
| TTCCGG | 0.001816 | GAAGAA | 0.001353 | GCTATA | TATAGC | 6.02E-04 |
| CTCCTT | 0.001828 | CTATAG | 0.001355 | CGGAAG | CTTCCG | 8.47E-04 |
| CCTTGT | 0.001899 | GCTATA | 0.001355 | AAGTGG | CCACTT | 1.02E-03 |
Table 4.
Top 10 k-Tuples (k = 6) with Smallest p-Values Using Reads Mapped to the Forward, Reverse, and Both Forward and Reverse Strands for the ChIP-Seq Data Set with Transcription Factor GABP
| Forward 6-tuple | Forward p-value | Reverse 6-tuple | Reverse p-value | Combined 6-tuple | Combined complement | Combined p-value |
|---|---|---|---|---|---|---|
| CACTTC | 5.55E-11 | CGGAAG | 2.06E-08 | ACTTCC | GGAAGT | 1.78E-15 |
| ACTTCC | 1.75E-10 | CCGGAA | 2.1E-08 | CAGAAG | CTTCTG | 1.78E-15 |
| TTCCGG | 4.97E-10 | CTTCCG | 2.75E-07 | CGGAAG | CTTCCG | 2.33E-15 |
| CTTCCG | 6.47E-10 | TAGCGG | 3.26E-07 | CACTTC | GAAGTG | 3.55E-15 |
| CAGAAG | 8.82E-07 | GAAGCT | 5.42E-07 | GCAGAA | TTCTGC | 7.44E-15 |
| GCAGAA | 1.16E-06 | CTAGCG | 1.2E-06 | CCGGAA | TTCCGG | 1.42E-14 |
| AAATAG | 1.25E-05 | CACTTC | 1.31E-06 | CTATAG | CTATAG | 2.80E-13 |
| GAAATA | 1.3E-05 | ACTTCC | 1.48E-06 | AAGTGA | TCACTT | 4.21E-12 |
| TCACTT | 1.32E-05 | GAAGTG | 1.53E-06 | GCGGAA | TTCCGC | 6.54E-12 |
| ACACTT | 2.45E-05 | GGAAGT | 1.77E-06 | CCGCTA | TAGCGG | 1.46E-10 |
For a family-wise type I error of 0.05, using the Bonferroni correction, only 6-tuples with p-value less than 0.05/4⁶ ≈ 1.22 × 10⁻⁵ are declared significant. With this criterion, one 6-tuple, “CACTTC,” was identified as significant using the reads mapped to the forward strand based on the control data. The tuple “CACTTC” is complementary to the dominant pattern at positions [4,9]. The next pattern with a relatively small p-value, although not significant by our criterion, is “ACTTCC,” which is complementary to the dominant pattern at positions [3,8]. From the two patterns, it is possible to construct a consensus sequence of seven nucleotides, “CACTTCC,” which is complementary to the dominant pattern at positions [3,9]. We therefore see some GABP motif signal in the control data set.
For the ChIP-Seq data, six 6-tuples are significant based on reads mapped to the forward strand. The top four 6-tuples, “CACTTC,” “ACTTCC,” “TTCCGG,” and “CTTCCG,” all with p-value at most 6.47 × 10⁻¹⁰, are complementary to the dominant patterns at positions [4,9], [3,8], [1,6], and [2,7], respectively. From these four patterns, we are able to construct a consensus sequence of nine nucleotides, “CACTTCCGG.” Similar observations can be made based on reads mapped to the reverse strand and to both strands. We also carried out similar studies using k = 4 and 5, and the results are similar (data not shown).
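The Bonferroni cutoff and the position ranges discussed in this subsection can be checked with a short script. The sketch assumes that "complementary" means the reverse complement of the 6-tuple, and that positions are 1-based and inclusive along the consensus “CCGGAAGTGGC”; the helper names are hypothetical.

```python
CONSENSUS = "CCGGAAGTGGC"  # dominant nucleotide at each PWM position, as given in the text

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def consensus_positions(word, consensus=CONSENSUS):
    """1-based inclusive span where the reverse complement of word occurs, or None."""
    target = reverse_complement(word)
    i = consensus.find(target)
    return None if i < 0 else (i + 1, i + len(word))

# Bonferroni cutoff for a family-wise error of 0.05 over all 4**6 = 4096 possible 6-tuples.
bonferroni_cutoff = 0.05 / 4 ** 6
```

For instance, `consensus_positions("CACTTC")` returns `(4, 9)` and `consensus_positions("ACTTCC")` returns `(3, 8)`, while `bonferroni_cutoff` is about 1.22 × 10⁻⁵.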
4. Discussion
In this article, we study the distribution of the number of occurrences of patterns in sequence reads randomly sampled from long Markovian sequences. This problem comes naturally from the analysis of sequence reads generated from NGS including ChIP-Seq and RNA-Seq. In this article, we first develop probabilistic models for the background sequences and the sampling of sequence reads using NGS. The background sequence is modeled as a Markovian sequence. Each sequence read starts from the i-th position of the background sequence with probability λi and the sampling of sequence reads from the background sequence is assumed to be independent. Based on the model, we study the limit distribution of the number of occurrences of any k-tuple patterns. Two approximate distributions are considered. We assume throughout the paper that both the background sequence length and the number of sequence reads are large and that the sequence reads do not concentrate on particular regions of the background sequence. We first give a normal approximation for the number of occurrences of frequent patterns and provide formulas to calculate the mean and variance of the approximate normal distribution. For relatively rare patterns, we provide a new compound Poisson approximation for the number of occurrences. Simulation studies are first used to evaluate the theoretical results, and it is shown that the compound Poisson approximation seems to work well in most of the situations. The compound Poisson approximation is then used to analyze ChIP-Seq data mapped to the promoter region of the gene ZNF384 using transcription factor GABP. Surprisingly, we found GABP binding motif signals in the control data set indicating some ChIP residue effect even within the control data. With the ChIP-Seq data, we are able to recover the consensus patterns of the motif.
Despite the usefulness of the models and the approximations, there are some limitations. First, we assume that the background sequence follows a homogeneous Markov chain. In reality, the sequence to be sequenced may be heterogeneous, with different regions following different Markov models. If we have some idea about the composition of the nucleotides in different parts of the sequence, hidden Markov models can potentially be used to model such sequences. Second, in our study, an empirical method is used to estimate the distribution of sequence reads along the genome sequence. Due to the relatively low number of reads starting at particular positions, the empirically estimated read distribution may not be accurate, resulting in less reliable estimated p-values for each pattern. Several investigators have studied the distribution of sequence reads from NGS technologies based on the sequence content surrounding a specific location (Hansen et al., 2010; Li et al., 2010) and showed that the local sequence content can predict the read distribution well. These results can potentially be used to model the read distribution along the genome sequence. The effect of such dependency on the distribution of the number of occurrences of patterns needs further study. Third, we assume that the fragments from NGS are all of the same length. In reality, their lengths can vary and follow some distribution. This is another topic for future studies.
Acknowledgments
Z.Y.Z. was supported by NSFC 11071146 and Graduate Independent Innovation Foundation of Shandong University (GIIFSDU). G.R. was supported in part by the Institute for Mathematical Sciences of the National University of Singapore and US R21AG032743. M.S.W. was supported by NIH 1R21HG006199. Y.H.L. was supported by China NSFC 11071146. F.S. was supported by NIH P50 HG 002790 and 1R21HG006199; NSF DMS-1043075 and OCE 1136818; and NSFC 60805010.
Disclosure Statement
No competing financial interests exist.
References
- Campbell A. Mrázek J. Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl. Acad. Sci. USA. 1999;96:9184–9189. doi: 10.1073/pnas.96.16.9184.
- Chen L. Shao Q.-M. Normal approximation under local dependence. Ann. Probabil. 2004;32:1985–2028.
- Dalevi D. Dubhashi D. Hermansson M. Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures. Bioinformatics. 2006;22:517–522. doi: 10.1093/bioinformatics/btk029.
- Dufraigne C. Fertil B. Lespinats S., et al. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res. 2005;33:e6. doi: 10.1093/nar/gni004.
- Guibas L. Odlyzko A. Periods in strings. J. Combin. Theory Ser. A. 1981;30:19–42.
- Hansen K. Brenner S. Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38:e131. doi: 10.1093/nar/gkq224.
- Huang H. Error bounds on multivariate normal approximations for word count statistics. Adv. Appl. Probabil. 2002;34:559–586.
- Jun S. Sims G. Wu G., et al. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution. Proc. Natl. Acad. Sci. USA. 2010;107:133–138. doi: 10.1073/pnas.0913033107.
- Karlin S. Mrázek J. Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. USA. 1997;94:10227–10232. doi: 10.1073/pnas.94.19.10227.
- Kleffe J. Langbecker U. Exact computation of pattern probabilities in random sequences generated by Markov chains. Comput. Appl. Biosci. 1990;6:347–353. doi: 10.1093/bioinformatics/6.4.347.
- Lander E. Waterman M. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988;2:231–239. doi: 10.1016/0888-7543(88)90007-9.
- Li J. Jiang H. Wong W. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50. doi: 10.1186/gb-2010-11-5-r50.
- Lothaire M. Algebraic Combinatorics on Words (Encyclopedia of Mathematics and Its Applications). Cambridge University Press; New York: 1983.
- Maclean D. Jones J. Studholme D. Application of “next-generation” sequencing technologies to microbial genetics. Nat. Rev. Microbiol. 2009;7:287–296. doi: 10.1038/nrmicro2122.
- Mardis E. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 2008a;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
- Mardis E. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008b;24:133–141. doi: 10.1016/j.tig.2007.12.007.
- McHardy A. Martín H. Tsirigos A., et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods. 2006;4:63–72. doi: 10.1038/nmeth976.
- Nekrutenko A. Li W. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 2000;10:1986–1995. doi: 10.1101/gr.10.12.1986.
- Nuel G. Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics. Algorithms Mol. Biol. 2006;1:1–14. doi: 10.1186/1748-7188-1-5.
- Panjer H. Recursive evaluation of a family of compound distributions. Astin Bull. 1981;12:22–26.
- Pape U.J. Rahmann S. Sun F.Z., et al. Compound Poisson approximation of the number of occurrences of a Position Frequency Matrix (PFM) on both strands. J. Comput. Biol. 2008;15:547–564. doi: 10.1089/cmb.2007.0084.
- Pavesi G. Mereghetti P. Mauri G., et al. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004;32:W199–W203. doi: 10.1093/nar/gkh465.
- Reinert G. Schbath S. Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J. Comput. Biol. 1998;5:223–253. doi: 10.1089/cmb.1998.5.223.
- Reinert G. Schbath S. Waterman M. Probabilistic and statistical properties of words: an overview. J. Comput. Biol. 2000;7:1–46. doi: 10.1089/10665270050081360.
- Reinert G. Schbath S. Waterman M. Statistics on words with applications to biological sequences, 268–346. In Applied Combinatorics on Words (Encyclopedia of Mathematics and Its Applications). Cambridge University Press; New York: 2005.
- Robin S. Rodolphe F. Schbath S. DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press; New York: 2005.
- Roquain E. Schbath S. Improved compound Poisson approximation for the number of occurrences of any rare word family in a stationary Markov chain. Adv. Appl. Probabil. 2007;39:128–140.
- Schbath S. Compound Poisson approximation of word counts in DNA sequences. ESAIM Probabil. Stat. 1995;1:1–16.
- Schbath S. An overview on the distribution of word counts in Markov chains. J. Comput. Biol. 2000;7:193–201. doi: 10.1089/10665270050081469.
- Schbath S. Robin S. How can pattern statistics be useful for DNA motif discovery?, 319–350. In Scan Statistics: Methods and Applications (Statistics for Industry and Technology). Birkhäuser; Boston: 2009.
- Shan G. Zheng W.M. Counting of oligomers in sequences generated by Markov chains for DNA motif discovery. J. Bioinform. Comput. Biol. 2009;7:39–54. doi: 10.1142/s0219720009003935.
- Sims G. Jun S. Wu G., et al. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. USA. 2009;106:2677–2682. doi: 10.1073/pnas.0813249106.
- Song K. Ren J. Zhai Z.Y., et al. Alignment-free sequence comparison based on next generation sequencing reads. Proc. RECOMB 2012. 2012:272–285. doi: 10.1089/cmb.2012.0228.
- Uberbacher E. Mural R. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA. 1991;88:11261–11265. doi: 10.1073/pnas.88.24.11261.
- Valouev A. Johnson D. Sundquist A., et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods. 2008;5:829–834. doi: 10.1038/nmeth.1246.
- Waterman M. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall; London: 1995.
- Willmot G. Panjer H. Difference equation approaches in evaluation of compound distributions. Insurance Math. Econ. 1987;6:43–56.
- Wu G. Jun S. Sims G., et al. Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method. Proc. Natl. Acad. Sci. USA. 2009;106:12826–12831. doi: 10.1073/pnas.0905115106.
- Zhai Z.Y. Ku S. Luan Y.H., et al. The power of detecting enriched patterns: an HMM approach. J. Comput. Biol. 2010;17:581–592. doi: 10.1089/cmb.2009.0218.
- Zhang Z.D. Rozowsky J. Snyder, et al. Modeling ChIP sequencing in silico with applications. PLoS Comput. Biol. 2008;4:e1000158. doi: 10.1371/journal.pcbi.1000158.