Abstract
Background:
Markov chains (MC) have been widely used to model molecular sequences. The estimations of MC transition matrix and confidence intervals of the transition probabilities from long sequence data have been intensively studied in the past decades. In next generation sequencing (NGS), a large amount of short reads are generated. These short reads can overlap and some regions of the genome may not be sequenced resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid.
Methods:
In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximate to the same distributions as the corresponding statistics for long sequences.
Results:
We apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated data and real data sets, and compare the results with those by the parametric bootstrap method.
Conclusions:
We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
Keywords: Markov chains, next generation sequencing, transition probabilities, confidence intervals
INTRODUCTION
Markov chains (MC) have been widely used to model molecular sequences [1,2]. They have been used to study the dependencies between the bases [3], the enrichment and depletion of certain word patterns [4], prediction of occurrences of long word patterns from short patterns [5,6], and the detection of signals in introns [7]. Narlikar et al. [8] studied the effect of the order of MCs on several biological problems including phylogenetic analysis, assignment of sequence fragments to different genomes in meta-genomic studies, motif discovery, and functional classification of promoters. Usually the transition probabilities of the Markov chain are estimated from data, and hence the uncertainty of the estimated transition probabilities has to be taken into account. This is done for example in [2] for word counts in long sequences.
Often DNA samples do not consist of a few long sequences, but of many short sequences, for example based on next generation sequencing (NGS) technology. For short sequences the standard asymptotic results do not apply, see for example [9]. In [9] it was shown that the word count standardization includes an additional scaling factor, the effective coverage d. While [9] considered word count statistics under a Markov chain with estimated transition probabilities, in this study we are interested in finding confidence regions for the transition probabilities. We derive a normal and a chi-square approximation for several statistics related to MC based on many short sequences (also called reads) from a long underlying sequence that is assumed to be a realization of a Markov chain; and again the effective coverage d appears.
Consider an r-th order Markov chain A = A1A2…An of length n. The maximum likelihood estimate of transition probability from a word w of length r to a letter b is given by
where fw is the number of occurrences of word w within the sequence. Many investigators have studied the limit distribution of and other related statistics when the sequence length n tends to infinity [10,11]. In the classical papers [10,11], Billingsley studied many important problems related to MC. In particular, he showed the following important results.
Suppose the long sequence follows a simple first order Markov chain, with finite state space of size s. We call the elements of letters, as we will be thinking of a Markov chain on the set of nucleotides or amino acids. Denote the transition probability from letter i to letter j by pij , and let
| (1) |
Theorem 3.1 in [11] shows that if the Markov chain is stationary and ergodic, then as the sequence length n → ∞ the distribution of the s2-dimensional random vector ξ = (ξi,j) converges to the normal distribution with mean 0 and covariance matrix (λi,j;k,l), where
| (2) |
where δu,v is the Kronecker delta that is 1 when u = v, and 0 otherwise. Let ζ = (ζ1, …, ζs) be the random vector with components
| (3) |
Theorem 3.3 in [11] also shows that the distribution of the random vector ζ converges to the normal distribution with covariance matrix , where
In [11] it is also shown that the statistic
| (4) |
is asymptotically chi-square in distribution, and the number of degrees of freedom is s(s − 1) (assuming the transition probabilities pij are all positive), where s is the size of the alphabet set .
These results were derived under the assumption of one sample consisting of a long realisation of the Markov chain. With the development of next generation sequencing technologies, instead of one long sequence, typically short fragments from the long genome sequence are sampled, while assembly of these genome fragments as in [12,13] can be challenging.
In this study, we investigate the asymptotic distributions of similar statistics described above as in [10,11], but under the new model proposed in [9,14] for the NGS short reads data (see Fig. 1 and Section of “Methods” for the details of the model). We demonstrate that, after scaling by the effective coverage d proposed in [9], these statistics converge to the same corresponding distributions as those statistics for long sequences. We further apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate these asymptotic distributions and our theoretical confidence intervals using simulated as well as real data sets. In addition, we propose a parametric bootstrap method, and compare the bootstrap confidence intervals with our theoretical confidence intervals. We find that, the asymptotic properties of these statistics and the theoretical confidence intervals given in this study are highly accurate for the NGS short reads data, providing a powerful tool for NGS data analysis.
Figure 1. The model of generating NGS short reads data.
Two random processes are involved in the generation of the NGS short reads data: i) the underlying genome sequence with length G in bp is generated by a stationary and ergodic Markov chain with transition probability matrix P = [pij ]; ii) short reads of length β are randomly sampled from the genome by the Poisson process with rate c(x) = c(x, G) at position x. The Poisson process is independent of the Markov chain.
RESULTS
Our theoretical results are mainly presented in Section of “Methods”: we first introduce the probabilistic model for NGS data based on a MC sequence and random sampling of short reads; we then derive the statistics of interest based on NGS reads and the effective coverage d; we then develop two propositions on the asymptotic distributions of these statistics, and finally we propose the theoretical confidence intervals as well as the parametric bootstrap confidence intervals for MC transition probabilities based on NGS short reads data. Although we cannot rigorously prove the major propositions developed in this study, we validate statistical properties of these NGS related statistics and theoretical confidence intervals of transition probabilities using simulated NGS data sets as well as a real viral genome sequence data in this section.
Validating the theoretical results using simulated NGS data sets
We validate Proposition 1, Proposition 2 and the theoretical confidence intervals for MC transition probabilities of Eq. (16) in Section of “Methods” using simulated NGS short reads data sets.
For each simulated NGS data set, we first simulate the underlying genome sequence with MC using the same set of parameters of transition probabilities P = [pij] as in [9], which are listed in Supplementary Table S1. The initial position of the simulated genome is sampled based on the stationary distribution of the MC. We set the genome length G = 100, 000 for all simulated data sets. We then randomly sample M reads of length β bps from the simulated underlying genome. The reads are sampled based on the Poisson process as described in Section of “Methods”. The Poisson process can be homogeneous or inhomogeneous.
We simulate three NGS data sets with reads sampled by homogeneous Poisson processes. The parameters of the three homogeneous data sets are (1) M = 500, β = 100 (denoted as H1 thereafter); (2) M = 1000, β = 200 (denoted as H2 thereafter); and (3) M = 1000, β = 300 (denoted as H3 thereafter).
For each of these homogeneous data sets, the rate c of Poisson processes is calculated by
Noting Remark 1, the sequencing coverage C = βc = βM/G. We follow Eq. (13) and calculate the corresponding effective coverage d as
For real NGS data, the reads are generally generated inhomogeneously from the genome. To test our results for inhomogeneous cases, we also simulate three NGS data sets with reads sampled by inhomogeneous Poisson processes. To achieve this, we divide the long underlying genome sequence into 100 consecutive non-overlapped bins b1,…, bt,…, b100 with the same size; the sampling rates along the positions within the same bin bt are equal; and the rates of the different bins are proportional to 100 random variables drawn from the gamma distribution Γ(1, 20) [15]. The simulated three inhomogeneous data sets of different read numbers and read lengths are (1) M = 500, β = 100 (denoted as IH1 thereafter); (2) M = 1000, β = 200 (denoted as IH2 thereafter); and (3) M = 1000, β = 300 (denoted as IH3 thereafter).
Validating the approximate normal distribution of in Proposition 1
To validate the approximate normal distribution of by Proposition 1, we first standardize the as
where the estimated values of d and pi,j are used in the standardization of . For each of the six data sets (H1, H2, H3, IH1, IH2, and IH3), we repeat the simulating processes by the same parameters 2,000 times to generate 2,000 samples; then calculate the value for each sample; and check whether 2,000 values of have the standard normal distribution by the Kolmogorov-Smirnov (KS) test.
For the homogeneous data sets, for all ij, ′s have approximately mean of 0 and variance of 1, with p-values being uniformly distributed in the region of (0, 1) (Fig. 2), indicating that all ′s follow the standard normal distribution. We obtain similar results for the inhomogeneous data sets (Supplementary Fig. S1). These results show strong validations of Proposition 1.
Figure 2.
The mean, variance and p-value of statistic for the simulated homogeneous data sets of H1, H2, and H3. The p-value is calculated by Kolmogorov-Smirnov test to whether approximates a standard normal distribution.
Validating the approximate distributions of and S(R) in Proposition 2
To validate the approximate distributions of and S(R) in Proposition 2, we first standardize them as follows:
and
We then check on the simulated data sets, whether the new statistic has the standard normal distribution for all i ∈ A, and the new statistic has the chi-square distribution with df = s(s − 1) = 4 × 3 = 12, respectively.
Our simulation results show that: 1) for all i ∈ A, approximates to the standard normal distribution quite well for both homogeneous data sets (Supplementary Fig. S2) and inhomogeneous data sets (Supplementary Fig. S3); 2) the approximates a chi-square distribution with df = 12 (Mean = 12 and Variance = 24) on both homogeneous data sets and inhomogeneous data sets as well: the calculated values of (Mean, Variance and p-value) are (12.02, 25.15, 0.402) for H1, (12.00, 24.75, 0.690) for H2, (12.14, 24.82, 0.556) for H3, (11.81, 22.92, 0.058) for IH1, (12.15, 23.65, 0.064) for IH2, and (11.88, 24.18, 0.453) for IH3.
Validating the theoretical confidence intervals of MC transition probabilities using NGS data
One of the major applications of Proposition 1 is to construct the theoretical confidence intervals of MC transition probabilities using NGS data. We validate the derived theoretical level 1 − α confidence interval Eq. (16) by simulation study.
For each of the simulated homogeneous (H1, H2 and H3) and inhomogeneous (IH1, IH2 and IH3) data sets, we
1. Simulate one sample, estimate ;
2. Calculate theoretical level 1 − α (α ∈ {0.01, 0.05, 0.1}) confidence interval for pij by Eq. (16);
3. Repeat Steps 1 and 2 for 1,000 times, and calculate the fraction of times that the theoretical level 1 − α confidence intervals cover the true pij.
The simulation results in Fig. 3 and Supplementary Fig. S4 show that, the fractions of times that the theoretical confidence intervals cover the true pij are very close to 1 − α (α = 0.01, 0.05, 0.1) for both homogeneous and inhomogeneous data sets.
Figure 3.
The fraction of times that the theoretical confident intervals of pij cover the true transition probability pij on homogeneous data sets H1, H2, and H3.
Calculating theoretical confidence intervals of pij based the estimated
In real study, the effective coverage d is generally not available for the NGS data sets. To deal with data sets with unknown d, we adopt the approach proposed in [9], and estimate the effective coverage d as follows:
| (5) |
where is given in Eq. (14).
We find that, by plug-in the estimated effective coverage to the theoretical confidence intervals for pij in Eq. (16), we can still achieve similar accuracies in calculating the theoretical confidence intervals (Supplementary Figs. S5, S6).
Validating the theoretical confidence interval for pij by parametric bootstrap
For each of the homogeneous NGS data sets (H1, H2, H3), we calculate the confidence intervals by parametric bootstrap method. We repeat 100 bootstrap calculation for each data set. We find that the fraction of times that the bootstrap confidence intervals covering the true transition probabilities approximate the level 1 − α well (Supplementary Figs. S7-S9). Meanwhile, the theoretical confidence intervals by Eq. (16) achieve similar results as the parametric bootstrap confidence intervals both in 1) the fraction of time cover the true transition probability (Supplementary Figs. S7-S9), and 2) lengths of confidence intervals (Supplementary Fig. S10).
Application to real viral genome sequence data
We apply the theoretical results developed in this study on a real viral genome sequence Bacillus phage SP-beta. The length G of the genomic sequence, which was downloaded from NCBI, is 134, 416 in bp. The frequencies of A, G, C and T are 45,580 (33.91%), 21,078 (15.68%), 25,477 (18.95%) and 42,281 (31.46%), respectively. We model the Bacillus genome sequence with a 2nd-order MC after estimating MC order by method developed in [9]. We estimate the transition probability based on the genome sequence as
which are shown in Supplementary Table S2. The stationary distribution π of the 2nd-order MC is calculated based on the estimated transition probabilities (Supplementary Table S2). Based on Theorem 3.1 in Billingsley [11], we derive the theoretical confidence for long genome sequence as follows
| (6) |
See Table 1 for an example of the confidence intervals for pAA,A by Eq. (6).
Table 1.
The estimated transition probability of pAA,A, the theoretical confidence intervals as well as the parametric bootstrap confidence intervals on simulated homogeneous NGS data sets of Bacillus phage SP-beta genome.
| C = βc | pAA,A | 99% |
95% |
90% |
||||
|---|---|---|---|---|---|---|---|---|
| Lower | Upper | Lower | Upper | Lower | Upper | |||
| by Eq. (6) | 0.3739 | 0.3644 | 0.3834 | 0.3667 | 0.3812 | 0.3678 | 0.3800 | |
| by Eq. (7) | 0.5 | 0.3714 | 0.3551 | 0.3884 | 0.3590 | 0.3844 | 0.3611 | 0.3823 |
| 1.0 | 0.3771 | 0.3640 | 0.3911 | 0.3672 | 0.3879 | 0.3689 | 0.3862 | |
| 2.0 | 0.3728 | 0.3610 | 0.3845 | 0.3638 | 0.3817 | 0.3653 | 0.3803 | |
| 5.0 | 0.3707 | 0.3375 | 0.4044 | 0.3454 | 0.3964 | 0.3496 | 0.3923 | |
| 10.0 | 0.3734 | 0.3633 | 0.3835 | 0.3657 | 0.3811 | 0.3670 | 0.3798 | |
| by Eq. (19) | 0.5 | 0.3781 | 0.3570 | 0.3903 | 0.3618 | 0.3869 | 0.3635 | 0.3852 |
| 1.0 | 0.3736 | 0.3591 | 0.3885 | 0.3632 | 0.3847 | 0.3649 | 0.3831 | |
| 2.0 | 0.3723 | 0.3619 | 0.3858 | 0.3641 | 0.3825 | 0.3660 | 0.3814 | |
| 5.0 | 0.3758 | 0.3629 | 0.3835 | 0.3654 | 0.3814 | 0.3670 | 0.3807 | |
| 10.0 | 0.3753 | 0.3631 | 0.3834 | 0.3661 | 0.3811 | 0.3677 | 0.3801 | |
We then simulate NGS data sets by generating short reads from the Bacillus phage SP-beta genome sequence based on both homogeneous and inhomogeneous Poisson processes. The read length is β = 200 in bp and we generate NGS data sets using coverage C = βc ∈ {0.5, 1, 2, 5, 10}. We then estimate the MC transition matrix using NGS reads as
and extend the theoretical confidence intervals of Eq. (16) for NGS to the 2nd-order MC transition probabilities pij,k as follows
| (7) |
We calculate the theoretical confidence intervals by Eq. (7) as well as the bootstrap confidence intervals by Eq. (19) for transition probability pAA,A on homogeneous NGS data sets (Table 1). Although we do not have true transition probabilities to validate these confidence intervals, we find that the theoretical confidence intervals by Eq. (7) as well as the parametric bootstrap by Eq. (19) are very close to the confidence intervals estimated from the long sequence by Eq. (6) (Table 1).
The theoretical confidence intervals by Eq. (7) for inhomogeneous NGS data sets are listed in Supplementary Table S3.
DISCUSSION
In this study, we propose statistics for estimations of MC transition probabilities and their corresponding confidence intervals based on next generation sequence reads. Although we did not rigorously prove the two propositions (Propositions 1 and 2) on the asymptotic distributions of , and S(R), our simulation studies show that the propositions and the derived formulas for the confidence intervals of the transition probabilities perform decently well on simulated NGS datasets with wide regions of sequencing coverage.
In addition, we propose a parametric bootstrap method for the confidence intervals of MC transition probabilities based on NGS reads data, and show that both theoretical confidence intervals derived from the two propositions and the bootstrap methods achieve similar results with high accuracies, further supporting the validity of Propositions 1 and 2. However, it is worth noting that the parametric bootstrap based approach is computationally expensive in sampling data sets many times; while the theoretical confidence intervals bypass the need of the time consuming sampling step, thereby making it highly applicable to large-scale NGS datasets.
The developed theory and methods implicitly assume error-free for the NGS short reads data. Since the sequencing error rates of bulk sequencing are lower than 1% for Illumina sequencing machines, the error-free assumption on NGS data will not be a big issue. The rapid advances in single-cell sequencing (SCS) [16,17] provide unprecedented insights of cell heterogeneity. The statistical theory and methods developed in this study will be promising for the quantitative characterizations of SCS data. However, due to limited molecular materials within a single cell as well as high dropout events, the SCS data are error-prone. For SCS data analysis, the sequencing errors are non-negligible and will blur the estimators, making their asymptotic distributions biased. We will study the asymptotic distribution theory of statistics based on the error-prone NGS data in our future work.
METHODS
Probabilistic modeling of a MC sequence and random sampling of the reads using NGS
In NGS, a large number of reads are randomly sampled from the genome. Hence two random processes are involved in the generation of the short reads data: i) the generation of the underlying genome sequence with genome length G base pairs (bps), and ii) random sampling of the M reads with length β from the simulated genome in i). See Fig. 1 for an illustration.
In this study, we assume the underlying genome sequence is generated by a Markov chain. We treat mainly the case of a first-order Markov chain, as the results can be easily adapted to higher order Markov chains. We thus use a first-order stationary and ergodic MC to model the underlying genome sequence with each letter taking values in a finite alphabet set of size s. Since our study is based on genomic sequences, = {A, C, G, T} and s = 4. We denote the transition matrix of MC as P = [pij]4×4.
We model the random sampling of the reads using a Poisson process as in the Lander-Waterman model [18]. The Poisson process is a process on [0, ∞) and is assumed to be independent of the Markov chain. As in [14,18-21], we assume that the genome is a sequence of contiguous bases and that the distribution of reads along the genome follows a potentially inhomogeneous Poisson process with rate c(x) = c(x, G) at position x. We suppress the argument G unless we take limits. If c(x) = c for all x, we refer to the sampling of the reads as homogeneous. Allowing for inhomogeneity in the sampling Poisson process reflects observations about NGS data in [22] that the sampling rates of reads at different positions are not homogeneous.
We assume that all sampled reads have the same length of β bps. This length β does not depend on the length of the underlying sequence, nor on the parameters of the Markov chain or the Poisson process. For NGS data, β is given by the technology which is applied to obtain the reads.
A total of M reads are independently sampled from the genome of length G bps. The M reads are our data; they are a realisation of the above Poisson process and the Markov chain sample A, as follows. Suppose that an event in occurs at t. Let s be the closest integer ⩽t. Then the Poisson event results in a read R = (As, As + 1, …, As + β) from the underlying Markov chain A = (A1,…, AG).
Statistics related to Markov chains based on NGS data
We extend the statistics ξij, ζi and S for long sequence in [11] to , and S(R) for the NGS short sequence/reads data accordingly as follows.
Let be the number of occurrences of dinucleotide (ij) within the M reads and . Define
| (8) |
Note that the moment estimation of pij from NGS data is , thus can also be written as
Similarly, we define
| (9) |
and
| (10) |
where
with M being the number of reads.
To see the centering in the above equations, we first calculate the expectation of and of . For this purpose we introduce some notations. For a read Rk = (rk, 1,…, rk,β ), we denote its starting position on genome G by S(Rk ), rounded to the lower integer. The left-hand end of the read is a Poisson process which creates the read starts at t such that s⩽t< s + 1 for a nonnegative integer s, then we set S(Rk) = s. Let fi be the number of occurrences of letter i in the genome, we have that
where 1 denotes the indicator function which is 1 when the argument is true, and 0 otherwise.
To calculate the counts of letter i in the M reads, the can be written in the form of S(Rk ) as follows
By the independence of the Poisson process and the Markov chain, we have
where is the averaged coverage, which is calculated as follows
This explains the centering in Eq. (9).
For the centering in Eq. (8) we first calculate
Note that
we then have
This conditional expectation calculation makes the centering in Eq. (8) reasonable. Note that we do not center by . The centering in Eq. (10) follows the centering in Eq. (8).
The effective coverage d of NGS data
In order to state our results, we need to introduce some more notations. For a position s ∈ {1, 2,…, G}, let L(s) denote the (random) number of reads which cover position s. For n = 1, 2,…,
denote the set of positions that is covered by exactly n reads. As in [9], we also call C(n) the region of coverage n. Note that a region does not need to be contiguous; we group all the parts of the genome with coverage n into the n-th region.
We define
| (11) |
which is the expected percentage of the genome that has coverage n. We then define the effective coverage d as follows
| (12) |
To calculate d, we note that
can be with an explicit expression, since (s − β,s] has a Poisson distribution with parameter c(x)dx. In the homogeneous case, (s − β,s] has Poisson distribution with parameter βc. Then dn = P(Poi(βc) = n), where Poi(βc) denotes a Poisson random variable with parameter βc. Thus in the homogeneous case, we have
and
Thus, in the homogeneous case, d has a simple expression as
| (13) |
Remark 1.
The results in [9] also include d, but the notation is slightly different. The rate c(x) in this paper translates into the rate c(x)/β in [9] so that βc in this paper is c in [9]. We argue that in the setup, it is more intuitive to let the rate of the Poisson process not be scaled by the length of the reads, and hence we change in notation.
Asymptotic distributions of statistics related to Markov chains based on NGS data
The previous study [9] showed that under the same NGS short reads generative model as in the Section of “Methods” which models the underlying long genome sequence as a MC and uses the LanderWaterman model for random sampling of the reads, the traditional standardization of word count as
| (14) |
where , for NGS short reads does not converge to the standard normal distribution as that for a long sequence. Thus [9], proposed the concept of effective coverage d for NGS short reads data given in Eq. (12), and proved that approximates to the standard normal distribution.
In this study, we show through simulations that the effective coverage d is also critical in the asymptotic distribution theory of MC transition probability estimators based on NGS short reads. We have the following two propositions for the asymptotic distributions of , and S(R), all of which are related to MC for NGS short reads data.
Proposition 1.
Assuming the underlying genomic sequence follows a stationary and ergodic MC with transition probability matrix P = [pij]4×4. We also assume that a set of reads by NGS are randomly sampled according to the Poisson process, possibly inhomogeneous, with rate function c(x). Then, as the genome length G tends to infinity, converge in distribution to a mean zero normal vector with covariance Σ with entries where d is the scaling factor given in Eq. (12) and λi,j;k,l is given in Eq. (2).
Proposition 2.
Under the same assumptions of Proposition 1, as G → ∞,
1. The vector converges in distribution to a mean zero normal vector with covariance matrix (αij)i,j∈A.
2. The statistic S(R)/d has an approximate χ2 -distribution with s(s − 1) degrees of freedom.
Remark 2.
In this study, we only show the validity of the two propositions by simulation studies. Currently, we cannot provide rigorous mathematical proofs for the two propositions, due to the complex intercorrelations of words within reads. We leave them as our conjectures for future mathematical justifications.
Theoretical confidence intervals for MC transition probabilities based on NGS reads data
We show in Proposition 1 that, has an asymptotic normal distribution
We apply this property to obtain the theoretical confidence interval for pij using NGS data. By dividing the numerator and denominator by , we have
where, , and hence,
For 0 < α < 1, let z1-a be the number such that the area under the standard normal density function to the left of z1-a is 1 − α. Thus, we have
| (15) |
Manipulation of the inequalities gives
wher . Therefore, we obtain the level 1 − α confidence interval for pij as
In practice, can be very large, making and γ ≈ 0. We thus discard terms of resulting in a simplified confidence interval for pij as
| (16) |
Parametric bootstrap for obtaining the confidence intervals for transition probabilities and the stationary distribution of MC
Let X be a stationary and ergodic Markov chain with transition matrix P = [pij]s×s. Assume that a set of reads are randomly sampled from X according to the Poisson process, denoted by {R1, R2, …, RM}. We can estimate the transition matrix P and stationary distribution π by and , where,
| (17) |
| (18) |
where ∣Rm∣ is the length of read Rm, m = 1, 2,…, M . Moreover,
When G → ∞,
We provide a parametric bootstrap method for finding the confidence intervals for pij and πi as follows:
B1 Simulate the NGS data set. Simulate a genome sequence of length G by the stationary and ergodic Markov chain P, and the sequence starts from the stationary distribution π generate NGS data set by sampling M reads {R1,R2, …, RM} from the simulated genome by the Poisson process. Note that, for a real data set, we skip Step B1 and go directly to B2.
B2 Given the set of reads of the NGS data set (either a simulated one from Step B1 or a real one), estimate parameters of transition matrix and stationary distribution by Eqs. (17) and (18).
B3 Generate bootstrap data set. Simulate a new Markov genome sequence of length G based on the estimated parameters and in B2; and generate a bootstrap data set with M reads {,, …, } by the Poisson process.
B4 Estimate parameters of transition matrix and stationary distribution from the simulated bootstrap data set in B3 by Eqs. (17) and (18).
B5 Repeat Steps B3–B4 B times to get B estimators and ; For each dinucleotide, order from the smallest to largest values as (1), (2), …, (B), and the level 1 − α confidence interval for pij is given by
| (19) |
For each i, order from the smallest to largest values as (1), (2), …, (B), and the level 1 − α confidence interval for πi is given by
| (20) |
Supplementary Material
Author summary:
Markov chains (MC) have been widely used to model molecular sequences. We present the asymptotic distributions of several statistics related to MC based on next generation sequencing (NGS) short reads data. We show that, after scaling by an effective coverage d proposed by the authors, these statistics based on NGS data approximate to the same distributions as the corresponding statistics for long sequences. The asymptotic properties of these statistics can be applied to find the theoretical confidence regions for MC transition probabilities based on NGS short reads data with high accuracy.
ACKNOWLEDGEMENTS
We thank Dr. Gesine Reinert of Oxford University for discussions and help related to the problems studied in this paper. LW is supported by NSFC grants (Nos.11571349 and 91630314), the National Key R&D Program of China under Grant 2018YFB0704304, NCMIS of CAS, LSC of CAS, and the Youth Innovation Promotion Association of CAS. JR and FS were supported by US National Science Foundation (NSF) (DMS-1518001) and National Institutes of Health (NIH) (R01GM120624, 1R01GM131407).
Footnotes
SUPPLEMENTARY MATERIALS
The supplementary materials can be found online with this article at http://10.1007/s40484-020-0200-y.
COMPLIANCE WITH ETHICS GUIDELINES
The authors Lin Wan, Xin Kang, Jie Ren and Fengzhu Sun declare that they have no conflict of interests.
This article does not contain any studies with human or animal subjects performed by any of the authors.
REFERENCES
- 1.Almagor H (1983) A Markov analysis of DNA sequences. J. Theor. Biol, 104, 633–645 [DOI] [PubMed] [Google Scholar]
- 2.Reinert G, Schbath S and Waterman MS (2005) Statistics on words with applications to biological sequences. In: Applied Combinatorics on Words, Lothaire M (ed.), ch. 6, pp. 268–352 New York: Cambridge University Press [Google Scholar]
- 3.Blaisdell BE (1985) Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J. Mol. Evol, 21, 278–288 [DOI] [PubMed] [Google Scholar]
- 4.Pevzner PA, Borodovsky MY, and Mironov AA (1989) Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dyn, 6, 1013–1026 [DOI] [PubMed] [Google Scholar]
- 5.Hong J (1990) Prediction of oligonucleotide frequencies based upon dinucleotide frequencies obtained from the nearest neighbor analysis. Nucleic Acids Res., 18, 1625–1628 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Arnold J, Cuticchia AJ, Newsome DA, Jennings WW III and Ivarie R (1988) Mono-through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res., 16, 7145–7158 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Avery PJ (1987) The analysis of intron data and their use in the detection of short signals. J. Mol. Evol, 26, 335–340 [DOI] [PubMed] [Google Scholar]
- 8.Narlikar L, Mehta N, Galande S and Arjunwadkar M (2013) One size does not fit all: on how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res., 41, 1416–1424 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Ren J, Song K, Deng M, Reinert G, Cannon CH and Sun F (2016) Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics, 32, 993–1000 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Billingsley P (1961) Statistical Inference for Markov Processes, vol. 2. Chicago: University of Chicago Press Chicago [Google Scholar]
- 11.Billingsley P (1961) Statistical methods in Markov chains. Ann. Math. Stat, 32, 12–40 [Google Scholar]
- 12.Pevzner PA, Tang H and Waterman MS (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Zerbino DR and Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Zhai Z, Reinert G, Song K, Waterman MS, Luan Y and Sun F (2012) Normal and compound Poisson approximations for pattern occurrences in NGS reads. J. Comput. Biol, 19, 839–854 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Song K, Ren J, Zhai Z, Liu X, Deng M and Sun F (2013) Alignment-free sequence comparison based on next-generation sequencing reads. J. Comput. Biol, 20, 64–79 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Sun F, Arnheim N and Waterman MS (1995) Whole genome amplification of single cells: mathematical analysis of PEP and tagged PCR. Nucleic Acids Res., 23, 3034–3040 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Daley Tand Smith AD (2014) Modeling genome coverage in single-cell sequencing. Bioinformatics, 30, 22, 3159–3165 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Lander ES and Waterman MS (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2, 231–239 [DOI] [PubMed] [Google Scholar]
- 19.Zhang ZD, Rozowsky J, Snyder M, Chang J and Gerstein M (2008) Modeling ChIP sequencing in silico with applications. PLOS Comput. Biol, 4, e1000158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Daley T and Smith AD (2013) Predicting the molecular complexity of sequencing libraries. Nat. Methods, 10, 325–327 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Simpson JT (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Schwartz S, Oren R and Ast G (2011) Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS One, 6, e16685. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.



