Summary Statistics of Neutral Mutations in Longitudinal DNA Samples

Xiaoming Liu; Yun-Xin Fu

doi:10.1016/j.tpb.2008.04.006

Author manuscript; available in PMC: 2011 Mar 18.

Published in final edited form as: Theor Popul Biol. 2008 May 5;74(1):56–67. doi: 10.1016/j.tpb.2008.04.006

PMCID: PMC3060710 NIHMSID: NIHMS60606 PMID: 18547598

Abstract

Longitudinal samples of DNA sequences are DNA sequences sampled from the same population at different time points. For fast-evolving organisms, e.g. RNA viruses, this kind of sample has increasingly been used to study the evolutionary process in action. Longitudinal samples provide some interesting new summary statistics of genetic variation, such as the frequency of mutations of size i in one sample and size j in another, the average number of mutations accumulated since the common ancestor of two sequences each from a different sample, and the numbers of private, shared and fixed mutations within samples. To make the results more applicable, we used in this study a general two-sample model, which assumes that two longitudinal samples were taken from the same measurably evolving population. Inspired by HIV studies, we also studied a two-sample-two-stage model, which is a special case of the two-sample model and assumes that a treatment after the first sampling instantaneously changes the population size. We derived formulas for calculating statistical properties, e.g. expectations, variances and covariances, of these new summary statistics under the two models. Potential applications of these results were discussed.

Keywords: longitudinal sample, coalescent theory, summary statistics, measurably evolving population

1 Introduction

One fundamental goal of population genetics is to understand how the genetic variation of a population changes in the process of evolution. Longitudinal samples, or serial samples, are samples taken at a series of time points from the same population. For fast-evolving organisms (e.g. RNA viruses), longitudinal DNA/RNA samples are often used to study the change of the population. Other applications of longitudinal DNA/RNA samples include ancient DNA studies (Willerslev and Cooper, 2005) and artificial evolution experiments on RNA enzymes (Johns and Joyce, 2005). Several new statistical methods have been developed for analyzing longitudinal DNA samples (reviewed by Drummond et al., 2003). Most of the methods are based on likelihood frameworks (e.g. Rodrigo and Felsenstein, 1999; Rambaut, 2000; Drummond et al., 2001; Seo et al., 2002; Drummond et al., 2002; Rodrigo et al., 2003), which try to capture all the information in the data and thus to be more powerful. However, those methods face some difficulties, such as high computational intensity and high model dependency. In particular, calculating the likelihood of haplotypes with recombination is so computationally intensive that unrealistic assumptions, such as no recombination or free recombination between markers, are often made. Such assumptions may cause large biases in parameter estimates. Summary statistics, on the other hand, may suffer from some loss of information. But with carefully chosen summary statistics, it is possible to capture the essential information in a sample effectively while avoiding the bias and some of the difficulties faced by likelihood-based methods.

In this paper we studied some new summary statistics of mutation number provided by longitudinal samples, namely the frequency of mutations of size i in one sample and size j in another, the average number of mutations accumulated since the common ancestor of two sequences each from a different sample, and the numbers of private, shared and fixed mutations within samples (see definitions below). Assuming no selection and no recombination, we derived formulas for calculating the expectations, variances and covariances of these summary statistics. The statistical properties of these summary statistics are independent of the mutation model. Under the infinite site model, the number of mutations and the number of segregating sites are equal, so the results of this paper can be applied directly to sequence data. Under a finite site model, the number of mutations and the number of segregating sites are not equal. But as long as the mutation model is given, the number of mutations can be recovered adequately in most cases, so these summary statistics of mutations remain widely applicable. While the expectations of these summary statistics are not affected by recombination events, their variances will be smaller when there is recombination. Therefore, when applying estimators or test statistics built on the variances of these summary statistics, caution is needed in interpreting the results if recombination cannot be ignored.

In the remainder of this paper we first introduced the definitions of the different summary statistics together with our two-sample model and its special case, the two-sample-two-stage model, which is inspired by HIV treatment. Then we showed the formulas for calculating the expectations, variances and covariances of these summary statistics. Finally, we discussed their potential applications. Most technical details can be found in the Appendix.

2 Summary statistics for longitudinal samples

Among the best-known summary statistics for DNA polymorphism (segregating sites) are the number of mutations of various sizes (ξ_i), the number of segregating sites (K) and the average difference between two sequences (Π) (see definitions below). The properties of these statistics have been well studied for the case of a single sample under the infinite site model (Watterson, 1975; Tajima, 1983; Tavaré, 1984; Fu, 1995). Since in the infinite site model each new mutation produces a new segregating site, segregating sites and mutations are interchangeable. In this paper, we borrowed the notation of these classic summary statistics for the new statistics studied, but note that our new summary statistics are for mutations. They can be applied to sequence analysis directly under the infinite site model. Under a finite site model, a transformation to statistics of segregating sites is needed.

We begin with the definition of these summary statistics under the infinite site model. A mutation on the genealogy of a DNA sample divides the sample into two groups. One consists of the sequences that inherit the mutation and the other of the sequences that do not. If there are i descendants of the mutation in the sample, the mutation is said to be of size i. Let ξ_i be the total number (or frequency) of mutations of size i in that sample; i ranges from 1 to n − 1, where n is the sample size. The total number of segregating sites, K, is just the total number of mutations on the genealogy of the sample after their most recent common ancestor (MRCA). It can also be regarded as the sum of all ξ_i, i.e.

K = \sum_{i = 1}^{n - 1} ξ_{i} .

(1)

The average number of differences between two sequences, Π, is defined as

Π = \frac{2}{n (n - 1)} \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} d_{ij}

(2)

where d_ij is the number of differences between the i-th and j-th sequences of the sample of n sequences. Again, Π can be written as a linear combination of the ξ_i, i.e.

Π = \frac{2}{n (n - 1)} \sum_{i = 1}^{n - 1} i (n - i) ξ_{i} .

(3)

An example of a genealogy of five sequences was presented in Figure 1. The sample sequences and their MRCA sequence were shown at the bottom and top of the tree, respectively. There were four mutations (shown as gray dots) on the genealogy, with the mutant ancestor sequences shown on the left-hand side. The t_i (i ≥ 2) shown on the right-hand side of the tree was the i-th coalescent time. If we define state i as the state in which there are exactly i lines on the genealogy (or i ancestral sequences of the sample), the i-th coalescent time is the time spent in state i. Given these sequences, it is easy to calculate that ξ₁ = 2, ξ₂ = ξ₃ = 1, ξ₄ = 0, K = 4, Π = 2.
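The Figure 1 values can be checked numerically from equations (1) and (3). The following is a minimal sketch; the function names and the dictionary representation of the ξ_i are illustrative choices, not part of the original text.

```python
def segregating_sites(xi):
    """Equation (1): K is the sum of xi_i over i = 1..n-1."""
    return sum(xi.values())


def mean_pairwise_difference(xi, n):
    """Equation (3): Pi as a linear combination of the xi_i."""
    weighted = sum(i * (n - i) * count for i, count in xi.items())
    return 2 * weighted / (n * (n - 1))


# Frequency spectrum quoted in the text for the five sequences of Figure 1.
xi = {1: 2, 2: 1, 3: 1, 4: 0}
K = segregating_sites(xi)
Pi = mean_pairwise_difference(xi, n=5)
print(K, Pi)
```

With the spectrum above, the weighted sum is 1·4·2 + 2·3·1 + 3·2·1 = 20, so Π = 2·20/(5·4) = 2, in agreement with the text.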

The above summary statistics describe the mutations in a single sample. Because longitudinal samples consist of more than one correlated sample, there are some interesting summary statistics describing the variation between samples. The purpose of this study is to investigate the statistical properties of these summary statistics, on which methods for analyzing longitudinal samples can be built.

The type of summary statistics we were interested in is best illustrated in Figure 2. It showed a genealogy of three longitudinal samples s₁, s₂ and s₃ taken at three different time points, each with three sequences. There were five mutations, numbered 1 to 5, on the genealogy, shown as gray dots. Consider a pair of longitudinal samples first. Define ξ_ij to be the number of mutations that are of size i in the earlier sample and size j in the later sample. If sample s₁ is taken as the earlier sample and sample s₂ as the later sample, we can see that mutation 1 is of size 1 in both samples; mutation 2 is of size 1 in sample s₁ and size 0 in sample s₂; and mutation 3 is of size 0 in sample s₁ and size 1 in sample s₂. Thus, for samples s₁ and s₂, ξ₁₁ = ξ₁₀ = ξ₀₁ = 1 and all other ξ_ij equal 0. Now look at samples s₂ and s₃. Mutation 3 is of size 1 in sample s₂ and size 3 in sample s₃; mutation 4 is of size 0 in sample s₂ and size 3 in sample s₃; and mutation 5 is of size 0 in sample s₂ and size 2 in sample s₃. Taking sample s₂ as the earlier sample and sample s₃ as the later sample, ξ₁₃ = ξ₀₃ = ξ₀₂ = 1 and all other ξ_ij equal 0.

Figure 2. A genealogy of three longitudinal samples each with three sequences.
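Counting the ξ_ij from data can be sketched as follows. The 0/1 matrices below are a hypothetical encoding (1 = mutant state, one column per mutation) chosen only to reproduce the sizes described for mutations 1 to 3 in samples s₁ and s₂; they are not the actual sequences of Figure 2, and the function name is an assumption.

```python
from collections import Counter


def joint_spectrum(sample1, sample2):
    """Count mutations by (size in earlier sample, size in later sample).

    Each sample is a list of sequences; each sequence is a list of 0/1
    states, with 1 marking the mutant state at that site.
    """
    n1, n2 = len(sample1), len(sample2)
    xi = Counter()
    for site in range(len(sample1[0])):
        i = sum(seq[site] for seq in sample1)
        j = sum(seq[site] for seq in sample2)
        # Mutations absent from, or fixed in, both samples carry no signal.
        if (i, j) not in ((0, 0), (n1, n2)):
            xi[(i, j)] += 1
    return xi


# Hypothetical encoding: columns are mutations 1, 2 and 3.
s1 = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
s2 = [[1, 0, 0], [0, 0, 1], [0, 0, 0]]
xi_12 = joint_spectrum(s1, s2)
print(dict(xi_12))
```

For this encoding the counter holds one mutation each of types (1, 1), (1, 0) and (0, 1), matching ξ₁₁ = ξ₁₀ = ξ₀₁ = 1.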

We can divide the ξ_ij into different groups to form new summary statistics. Define the number of private mutations, K_p, as the number of mutations that are polymorphic in only one of the two samples. K_p can be further divided into K_p₁ and K_p₂, the numbers of mutations that are polymorphic only in the earlier sample or only in the later sample, respectively. Define the number of shared mutations, K_s, as the number of mutations that are polymorphic in both samples. Define the number of fixed mutations, K_f, as the number of mutations that are monomorphic in both samples but for which the monomorphic type of the earlier sample differs from that of the later sample. Written as functions of the ξ_ij:

K_{p_{1}} = \sum_{i = 1}^{n_{1} - 1} (ξ_{i 0} + ξ_{i n_{2}})

(4)

K_{p_{2}} = \sum_{i = 1}^{n_{2} - 1} (ξ_{0 i} + ξ_{n_{1} i})

(5)

K_{s} = \sum_{i = 1}^{n_{1} - 1} \sum_{j = 1}^{n_{2} - 1} ξ_{ij}

(6)

K_{f} = ξ_{0 n_{2}} + ξ_{n_{1} 0}

(7)

where n₁ and n₂ are the sizes of the first and the second sample, respectively. Looking at samples s₁ and s₂ again, we can see that mutation 1 is a shared mutation of samples s₁ and s₂, mutation 2 is a private mutation of sample s₁ and mutation 3 is a private mutation of sample s₂. For samples s₂ and s₃, mutation 3 is a private mutation of sample s₂, mutation 4 is a fixed mutation and mutation 5 is a private mutation of sample s₃. Thus, for s₁ and s₂, K_p₁ = K_p₂ = K_s = 1 and K_f = 0. For s₂ and s₃, K_p₁ = K_p₂ = K_f = 1 and K_s = 0. In fact, if we consider only two longitudinal samples, shared mutations and fixed mutations are mutually exclusive, that is, either K_s or K_f must equal 0 (Wakeley and Hey, 1997). This fact is easy to understand by considering the possible positions of mutations on the genealogy.

For the average number of mutations accumulated since the common ancestor of two sequences (i.e. the average difference between two sequences under the infinite site model), in addition to the cases in which both sequences come from the same sample, we can now define Π₁₂ as the average number of mutations accumulated since the common ancestor of two sequences, one from the earlier sample and one from the later sample:
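Equations (4)–(7) translate directly into sums over a table of ξ_ij. The sketch below uses a plain dictionary keyed by (i, j), which is an illustrative representation rather than anything prescribed by the text, and checks both sample pairs of Figure 2.

```python
def mutation_classes(xi, n1, n2):
    """Private, shared and fixed mutation counts from equations (4)-(7)."""
    def get(i, j):
        return xi.get((i, j), 0)

    k_p1 = sum(get(i, 0) + get(i, n2) for i in range(1, n1))   # (4)
    k_p2 = sum(get(0, j) + get(n1, j) for j in range(1, n2))   # (5)
    k_s = sum(get(i, j) for i in range(1, n1) for j in range(1, n2))  # (6)
    k_f = get(0, n2) + get(n1, 0)                               # (7)
    return {"Kp1": k_p1, "Kp2": k_p2, "Ks": k_s, "Kf": k_f}


# xi_ij values stated in the text for the two sample pairs of Figure 2.
pair_12 = mutation_classes({(1, 1): 1, (1, 0): 1, (0, 1): 1}, 3, 3)
pair_23 = mutation_classes({(1, 3): 1, (0, 3): 1, (0, 2): 1}, 3, 3)
print(pair_12)
print(pair_23)
```

In both pairs at most one of K_s and K_f is nonzero, consistent with the mutual exclusivity noted above.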

Π_{12} = \frac{1}{n_{1} n_{2}} \sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{2}} d_{ij}^{*}

(8)

where d*_ij is the number of mutations accumulated on the i-th sequence of the earlier sample and the j-th sequence of the later sample since their MRCA, and again n₁ and n₂ are the sizes of the two samples. Written as a function of the ξ_ij:

Π_{12} = \frac{1}{n_{1} n_{2}} \sum_{i = 0}^{n_{1}} \sum_{j = 0}^{n_{2}} [(n_{1} - i) j + (n_{2} - j) i] ξ_{ij} .

(9)
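Equation (9) can be evaluated from the same kind of ξ_ij table. This is a small sketch with an assumed function name; the inputs are the ξ_ij values given in Section 2 for the sample pairs of Figure 2.

```python
def pi_between(xi, n1, n2):
    """Equation (9): average number of mutations separating one sequence
    of the earlier sample from one sequence of the later sample."""
    total = sum(((n1 - i) * j + (n2 - j) * i) * count
                for (i, j), count in xi.items())
    return total / (n1 * n2)


pi_12 = pi_between({(1, 1): 1, (1, 0): 1, (0, 1): 1}, 3, 3)
pi_23 = pi_between({(1, 3): 1, (0, 3): 1, (0, 2): 1}, 3, 3)
print(pi_12, pi_23)
```

The weight (n₁ − i)j + (n₂ − j)i is the number of between-sample sequence pairs that differ at a mutation of size i in the earlier sample and size j in the later one; for s₁ and s₂ the weights are 4, 3 and 3, giving Π₁₂ = 10/9.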

3 Two-sample model and two-sample-two-stage model

One of the simplest models for longitudinal samples is the two-sample model, in which only two longitudinal samples are considered, so that ξ_ij, K_p, K_s, K_f and Π₁₂ are all summary statistics for two longitudinal samples. In this model, we assumed that the population followed the commonly used Wright-Fisher model for haploids (Fisher, 1930; Wright, 1931) and that the population size changes according to a deterministic function

N (t) = N (0) ω (t)

(10)

where ω (t) is a function of time t; N (0) and N (t) are the population sizes at time 0 and time t, respectively. The second sample (sample 2, size n₂) is taken at time 0 and the first sample (sample 1, size n₁) is taken g generations earlier. (We followed the retrospective view of coalescent theory and defined time 0 as the time point when sample 2 was taken. However, “before” and “after” in this paper still refer to the usual prospective view.) Unless otherwise mentioned, time in this paper was scaled in units of N (0) generations, and we let T = g/N (0). Let N₁ and N₂ be the population sizes at the time points of the first and the second sampling, respectively. That is, N₁ = N (T) and N₂ = N (0). We further assumed that all mutations were neutral; that the mutation rate was μ per sequence per generation and remained constant over time; and that the number of mutations accumulated in a given period of τ generations follows the Poisson distribution with parameter μτ. Some other parameters are defined as follows: λ = N₁/N₂, θ₁ = 2N₁μ, θ₂ = 2N₂μ.

In HIV studies, longitudinal samples can be used to study the effect of a treatment, for example by giving a treatment after the first sampling and taking a second sample some time later. In this case we may assume a two-sample-two-stage model, which is a special case of the general two-sample model. Figure 3 illustrated the model with two longitudinal samples each with three sequences. In this model, we assumed that before the earlier sample (sample 1) was taken, the effective population size, N₁, remained constant. Right after sample 1 was taken, a treatment was given and the effective population size changed instantaneously to N₂, after which it remained constant. The later sample, sample 2, was taken g generations after sample 1. This two-sample-two-stage model is closely related to the isolation model and size-change model of Wakeley and Hey (1997). In fact, some of the expectations of the new summary statistics in this study were derived from their results. However, they provided no formulas for calculating the variances and covariances of the statistics, and they did not give some other important expectations. That largely limited the application of these statistics, and we hope this study can fill these gaps.

The basis of the two-sample model is coalescent times. Let t_m (T) and t_m (0) be the m-th coalescent times of samples taken at time T and time 0, respectively. Let t_ma and t_mb be the parts of t_m (0) after and before T, respectively. Suppose that at T there are k ancestral sequences of sample 2. Then, given k, t_ka and t_kb are independent; if m < k, t_ma = 0 and t_mb = t_m (0); if m > k, t_ma = t_m (0) and t_mb = 0. If k = 1, as shown in Figure 3, define t₁ (0) = t_1a as the time interval between T and the MRCA of sample 2, that is,

t_{1} (0) = max (0, T - \sum_{m = 2}^{n_{2}} t_{m} (0))

(11)

where max (·) stands for maximum.

With the probability densities of coalescent times under the two-sample model or the two-sample-two-stage model (see Appendix A and B) we can calculate E (t_m (T)), E (t_ma) and E (t_mb). All the means of the summary statistics studied in this paper can then be calculated directly (see details in the following sections). Variances and covariances of the summary statistics can also be derived under the two-sample model; these require E (t_m (T) t_m′ (T)), E (t_m² (T)), E (t_ma²) and E (t_ma t_m′a), where m′ ≠ m, and the mathematics becomes more complicated. For a simpler model, such as the two-sample-two-stage model, it is possible to calculate them directly (see Appendix B). For more complicated demographic changes, simulation may be used to obtain those quantities. Since only coalescent times need to be simulated, the simulation is very fast and accurate estimates of these quantities can be obtained in practice.
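The simulation of coalescent times mentioned above can be sketched for the two-sample-two-stage model. This is only an illustration under stated assumptions: time is in units of N₂ generations, m lineages coalesce at rate m(m − 1)/2 while the population size is N₂ (after T) and at rate m(m − 1)/(2λ) while it is N₁ = λN₂ (before T). The function name and the return layout are hypothetical.

```python
import math
import random


def simulate_sample2_times(n2, T, lam, rng):
    """Simulate the state durations of sample 2, split at T.

    Returns (t_a, t_b): t_a[m] is the part of t_m(0) after T (between the
    two samplings), t_b[m] the part before T; t_a[1] follows equation (11).
    """
    t_a = {m: 0.0 for m in range(1, n2 + 1)}
    t_b = {m: 0.0 for m in range(2, n2 + 1)}
    elapsed = 0.0  # time travelled backwards from time 0
    m = n2
    while m >= 2:
        rate = m * (m - 1) / 2.0
        if elapsed < T:
            wait = rng.expovariate(rate)
            if elapsed + wait < T:
                t_a[m] = wait
                elapsed += wait
                m -= 1
                continue
            # State m straddles T; by memorylessness the remainder is
            # redrawn below at the rate of the earlier population size.
            t_a[m] = T - elapsed
            elapsed = T
        wait = rng.expovariate(rate / lam)
        t_b[m] = wait
        elapsed += wait
        m -= 1
    t_a[1] = max(0.0, T - sum(t_a[k] for k in range(2, n2 + 1)))
    return t_a, t_b


rng = random.Random(1)
reps = 20000
mean_t1a = sum(simulate_sample2_times(2, 1.0, 0.5, rng)[0][1]
               for _ in range(reps)) / reps
print(mean_t1a, math.exp(-1.0))  # simulated and exact E(t_1a) for n2 = 2, T = 1
```

For n₂ = 2 the exact value is E (t_1a) = T − (1 − e^(−T)), which equals e^(−1) at T = 1, so the Monte Carlo mean offers a quick sanity check. Products and squares of the simulated times can be averaged in the same way to estimate the second moments.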

4 Number of mutations of various sizes in two samples

4.1 Calculate E (ξ_ij)

As defined in Section 2, ξ_ij is the number of mutations which are size i in sample 1 and size j in sample 2. Further define ξ_ijb and ξ_ija as the parts of ξ_ij that occurred before and after T, respectively. Assume there are n′ (n′ ≤ n₂) ancestral sequences of sample 2 at time T, then

E (ξ_{ijb} | n') = \begin{cases} \sum_{k' = 1}^{\min (n' - 1, j)} P (k' \to j | n_{2}, n') \frac{\binom{n_{1}}{i} \binom{n'}{k'}}{\binom{n_{1} + n'}{i + k'}} E (ξ_{i + k' | n_{1} + n'} (T)) & \text{if } 0 < j < n_{2} \\ \frac{\binom{n_{1}}{i}}{\binom{n_{1} + n'}{i}} E (ξ_{i | n_{1} + n'} (T)) & \text{if } j = 0 \\ \frac{\binom{n_{1}}{i}}{\binom{n_{1} + n'}{i + n'}} E (ξ_{i + n' | n_{1} + n'} (T)) & \text{if } j = n_{2} \\ 0 & \text{otherwise} \end{cases}

(12)

E (ξ_{ija} | n') = \begin{cases} \frac{θ_{2}}{2} \sum_{k = \max (2, n')}^{n_{2}} k P (k, j | n_{2}) E (t_{ka} | n') & \text{if } i = 0, 0 < j < n_{2} \\ \frac{θ_{2}}{2} E (t_{1 a} | 1) & \text{if } i = 0, j = n_{2}, n' = 1 \\ 0 & \text{otherwise} \end{cases}

(13)

where E (·|n′) means the expectation of some statistic, specified by ·, given that there are n′ ancestral sequences of sample 2 at time T; min (·) stands for the minimum; ξ_i|n (T) is the number of mutations of size i in a sample of size n taken at time T; P (k′ → j|n, k) is a function of k′, j, n, k, whose meaning and formula can be found in Appendix C.

The properties of ξ_i|n (T) have been studied by Fu (1995); for 1 ≤ i < n,

E (ξ_{i} (T)) = \sum_{k = 2}^{n} k P (k, i | n) E (ζ_{k} (T))

(14)

where P (k, i|n) is a function of k, i and n, whose meaning and formula can be found in Appendix C; ζ_k (T) is the number of mutations accumulated on a line of state k.

E (ζ_{k} (T)) = \frac{θ_{2}}{2} E (t_{k} (T))

(15)

Under the two-sample-two-stage model, it can be simplified (Fu, 1995) as

E_{TT} (ζ_{k} (T)) = \frac{θ_{1}}{k (k - 1)}

(16)

where “E_TT” means that the expectation is applicable only to the two-sample-two-stage model. Substituting (16) into (14) and (12), we have

E_{TT} (ξ_{i} (T)) = \frac{θ_{1}}{i}

(17)

(Fu, 1995) and

E_{TT} (ξ_{ijb} | n') = \begin{cases} θ_{1} \sum_{k' = 1}^{\min (n' - 1, j)} P (k' \to j | n_{2}, n') \frac{\binom{n_{1}}{i} \binom{n'}{k'}}{\binom{n_{1} + n'}{i + k'}} \frac{1}{i + k'} & \text{if } 0 < j < n_{2} \\ θ_{1} \frac{\binom{n_{1}}{i}}{\binom{n_{1} + n'}{i}} \frac{1}{i} & \text{if } j = 0 \\ θ_{1} \frac{\binom{n_{1}}{i}}{\binom{n_{1} + n'}{i + n'}} \frac{1}{i + n'} & \text{if } j = n_{2} \\ 0 & \text{otherwise} \end{cases} .

(18)

For the shared mutations, the expectation of ξ_ij can be derived from the results of Wakeley and Hey (1997):

E (ξ_{ij}) = E (ξ_{ijb})

(19)

= \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ \sum_{k' = 1}^{k - 1} P (k' \to j | n_{2}, k) \frac{\binom{n_{1}}{i} \binom{k}{k'}}{\binom{n_{1} + k}{i + k'}} E (ξ_{i + k'} (T)) \right]

(20)

E_{TT} (ξ_{ij}) = θ_{1} \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ \sum_{k' = 1}^{k - 1} P (k' \to j | n_{2}, k) \frac{\binom{n_{1}}{i} \binom{k}{k'}}{\binom{n_{1} + k}{i + k'}} \frac{1}{i + k'} \right]

(21)

with 0 < i ≤ n₁, 0 < j < n₂, where P_n,k (T) is a function of n, k and T, whose meaning and formula can be found in Appendix C.

Similarly, we can compute the expectations for the fixed or private mutations:

E (ξ_{i 0}) = E (ξ_{i 0 b})

(22)

= \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{\binom{n_{1}}{i}}{\binom{n_{1} + k}{i}} E (ξ_{i} (T))

(23)

E_{TT} (ξ_{i 0}) = θ_{1} \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{\binom{n_{1}}{i}}{\binom{n_{1} + k}{i}} \frac{1}{i}

(24)

E (ξ_{i n_{2}}) = E (ξ_{i n_{2} b})

(25)

= \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{\binom{n_{1}}{i}}{\binom{n_{1} + k}{i + k}} E (ξ_{i + k} (T))

(26)

E_{TT} (ξ_{i n_{2}}) = θ_{1} \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{\binom{n_{1}}{i}}{\binom{n_{1} + k}{i + k}} \frac{1}{i + k}

(27)

E (ξ_{0 n_{2} b}) = \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{1}{\binom{n_{1} + k}{k}} E (ξ_{k} (T))

(28)

E_{TT} (ξ_{0 n_{2} b}) = θ_{1} \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{1}{\binom{n_{1} + k}{k}} \frac{1}{k}

(29)

E (ξ_{0 n_{2} a}) = \frac{θ_{2}}{2} E (t_{1 a})

(30)

E (ξ_{0 n_{2}}) = E (ξ_{0 n_{2} b}) + E (ξ_{0 n_{2} a})

(31)

E (ξ_{0 jb}) = \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ \sum_{k' = 1}^{\min (k - 1, j)} P (k' \to j | n_{2}, k) \frac{\binom{k}{k'}}{\binom{n_{1} + k}{k'}} E (ξ_{k'} (T)) \right]

(32)

E_{TT} (ξ_{0 jb}) = θ_{1} \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ \sum_{k' = 1}^{\min (k - 1, j)} P (k' \to j | n_{2}, k) \frac{\binom{k}{k'}}{\binom{n_{1} + k}{k'}} \frac{1}{k'} \right]

(33)

E (ξ_{0 ja}) = \frac{θ_{2}}{2} \sum_{k' = 2}^{n_{2}} k' P (k', j | n_{2}) E (t_{k' a})

(34)

= \frac{θ_{2}}{2} \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) [\sum_{k' = max (k, 2)}^{n_{2}} k' P (k', j | n_{2}) E (t_{k' a} | k)]

(35)

E (ξ_{0 j}) = E (ξ_{0 jb}) + E (ξ_{0 ja})

(36)

where 0 < i ≤ n₁, 0 < j < n₂, except 0 < i < n₁ in (25).
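As a numerical illustration of (24), the sketch below evaluates E_TT (ξ_i0). It rests on two assumptions that are not taken from Appendix C: P_n,k (T) is read as the probability that n lineages sampled at time 0 have k ancestors at time T, and, as in the two-sample-two-stage model, the population size stays at N₂ between the two samplings, so that k lineages coalesce at rate k(k − 1)/2. The distribution is obtained by uniformization of this pure-death process rather than by the closed-form expression of Appendix C; all function names are hypothetical.

```python
import math


def ancestors_distribution(n, T, terms=400):
    """Return a list P with P[k] = probability of k ancestors at time T.

    Uniformization: a Poisson number of jumps of a discrete chain that
    moves from k to k-1 with probability (k choose 2) / (n choose 2).
    Intended for moderate values of (n choose 2) * T.
    """
    top_rate = n * (n - 1) / 2.0 or 1.0  # avoids division by zero when n = 1
    state = [0.0] * (n + 1)
    state[n] = 1.0
    result = [0.0] * (n + 1)
    weight = math.exp(-top_rate * T)     # Poisson probability of 0 jumps
    for r in range(terms):
        for k in range(1, n + 1):
            result[k] += weight * state[k]
        nxt = [0.0] * (n + 1)
        for k in range(1, n + 1):
            p_down = (k * (k - 1) / 2.0) / top_rate
            nxt[k] += state[k] * (1.0 - p_down)
            if k > 1:
                nxt[k - 1] += state[k] * p_down
        state = nxt
        weight *= top_rate * T / (r + 1)  # Poisson probability of r+1 jumps
    return result


def expected_xi_i0_TT(i, n1, n2, T, theta1):
    """Equation (24) under the assumptions stated in the lead-in."""
    P = ancestors_distribution(n2, T)
    return theta1 * sum(
        P[k] * math.comb(n1, i) / math.comb(n1 + k, i) / i
        for k in range(1, n2 + 1)
    )


print(ancestors_distribution(2, 1.0)[1:])       # [P_{2,1}(1), P_{2,2}(1)]
print(expected_xi_i0_TT(1, 2, 2, 0.0, 1.0))     # both samples taken together
```

At T = 0 all n₂ lineages are still distinct, so for n₁ = n₂ = 2 and i = 1 the sum reduces to θ₁ · 2/4, i.e. half of the expected singletons of a pooled sample of four fall in sample 1.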

4.2 Calculate E (ξ_ij²) and E (ξ_ij ξ_lm)

Again, assume there are n′ (n′ ≤ n₂) ancestral sequences of sample 2 at time T. E (ξ_ijb ξ_lmb | n′) can be computed as

\begin{aligned}
E (ξ_{ijb} ξ_{lmb} | n') ={} & δ_{(i = l, j = m)} \sum_{k = 2}^{n_{1} + n'} \sum_{j' = 0}^{n'} k P (k, i + j' | n_{1} + n') \frac{\binom{n_{1}}{i} \binom{n'}{j'}}{\binom{n_{1} + n'}{i + j'}} P (j' \to j | n_{2}, n') E (ζ_{k}^{2} (T)) \\
& + δ_{(i + l \leq n_{1}, j + m \leq n_{2})} \sum_{k = 2}^{n_{1} + n'} \sum_{j' = 0}^{n'} \sum_{m' = 0}^{n' - j'} k (k - 1) P (k, i + j'; k, l + m' | n_{1} + n') \frac{\binom{n_{1}}{i} \binom{n'}{j'} \binom{n_{1} - i}{l} \binom{n' - j'}{m'}}{\binom{n_{1} + n'}{i + j'} \binom{n_{1} + n' - i - j'}{l + m'}} P (j' \to j, m' \to m | n_{2}, n') E (ζ_{k} (T) ζ_{k}^{'} (T)) \\
& + δ_{(i \geq l, j \geq m)} \sum_{k = 2}^{n_{1} + n' - 1} \sum_{k' = k + 1}^{n_{1} + n'} \sum_{j' = 0}^{n'} \sum_{m' = 0}^{j'} k k' P_{a} (k, i + j'; k', l + m' | n_{1} + n') \frac{\binom{n_{1}}{i} \binom{n'}{j'} \binom{i}{l} \binom{j'}{m'}}{\binom{n_{1} + n'}{i + j'} \binom{i + j'}{l + m'}} P (j' - m' \to j - m, m' \to m | n_{2}, n') E (ζ_{k} (T) ζ_{k'} (T)) \\
& + δ_{(l \geq i, m \geq j)} \sum_{k = 2}^{n_{1} + n' - 1} \sum_{k' = k + 1}^{n_{1} + n'} \sum_{m' = 0}^{n'} \sum_{j' = 0}^{m'} k k' P_{a} (k, l + m'; k', i + j' | n_{1} + n') \frac{\binom{n_{1}}{l} \binom{n'}{m'} \binom{l}{i} \binom{m'}{j'}}{\binom{n_{1} + n'}{l + m'} \binom{l + m'}{i + j'}} P (m' - j' \to m - j, j' \to j | n_{2}, n') E (ζ_{k} (T) ζ_{k'} (T)) \\
& + δ_{(i + l \leq n_{1}, j + m \leq n_{2})} \sum_{k = 2}^{n_{1} + n' - 1} \sum_{k' = k + 1}^{n_{1} + n'} \sum_{j' = 0}^{n'} \sum_{m' = 0}^{n' - j'} k k' [P_{b} (k, i + j'; k', l + m' | n_{1} + n') + P_{b} (k, l + m'; k', i + j' | n_{1} + n')] \frac{\binom{n_{1}}{i} \binom{n'}{j'} \binom{n_{1} - i}{l} \binom{n' - j'}{m'}}{\binom{n_{1} + n'}{i + j'} \binom{n_{1} + n' - i - j'}{l + m'}} P (j' \to j, m' \to m | n_{2}, n') E (ζ_{k} (T) ζ_{k'} (T))
\end{aligned}

(37)

and

\begin{aligned}
E_{TT} (ξ_{ijb} ξ_{lmb} | n') ={} & δ_{(i = l, j = m)} \sum_{j' = 0}^{\min (n', j)} \frac{\binom{n_{1}}{i} \binom{n'}{j'}}{\binom{n_{1} + n'}{i + j'}} P (j' \to j | n_{2}, n') \left( \frac{θ_{1}}{i + j'} + β_{n_{1} + n'} (i + j') θ_{1}^{2} \right) \\
& + δ_{(i + l = n_{1}, j + m = n_{2})} θ_{1}^{2} \sum_{j' = 0}^{\min (n', j)} \frac{\binom{n_{1}}{i} \binom{n'}{j'}}{\binom{n_{1} + n'}{i + j'}} P (j' \to j | n_{2}, n') \left( \frac{a_{n_{1} + n'} - a_{i + j'}}{n_{1} + n' - i - j'} + \frac{a_{n_{1} + n'} - a_{l + n' - j'}}{n_{1} + j' - l} - \frac{β_{n_{1} + n'} (i + j') + β_{n_{1} + n'} (l + n' - j')}{2} \right) \\
& + δ_{(i + l < n_{1}, j + m = n_{2})} θ_{1}^{2} \sum_{j' = 0}^{\min (n', j)} \frac{\binom{i + j'}{i} \binom{l + n' - j'}{l}}{\binom{n_{1} + n'}{n_{1}}} P (j' \to j | n_{2}, n') \left( \frac{1}{(i + j') (l + n' - j')} - \frac{γ_{n_{1} + n'} (i + j') + γ_{n_{1} + n'} (l + n' - j')}{2} \right) \\
& + δ_{(i + l \leq n_{1}, j + m < n_{2})} θ_{1}^{2} \sum_{j' = 0}^{\min (n' - 1, j)} \sum_{m' = 0}^{\min (n' - 1 - j', m)} \frac{\binom{i + j'}{i} \binom{l + m'}{l} \binom{n_{1} + n' - i - j' - l - m'}{n_{1} - i - l}}{\binom{n_{1} + n'}{n_{1}}} P (j' \to j, m' \to m | n_{2}, n') \left( \frac{1}{(i + j') (l + m')} - \frac{γ_{n_{1} + n'} (i + j') + γ_{n_{1} + n'} (l + m')}{2} \right) \\
& + δ_{(i > l, j = m)} θ_{1}^{2} \sum_{j' = 0}^{\min (n', j)} \frac{\binom{l + j'}{l} \binom{n_{1} + n' - i - j'}{n_{1} - i}}{\binom{n_{1} + n'}{n_{1}}} P (j' \to j | n_{2}, n') \frac{γ_{n_{1} + n'} (l + j')}{2} \\
& + δ_{(i \geq l, j > m)} θ_{1}^{2} \sum_{j' = 1}^{\min (n', j)} \sum_{m' = 0}^{\min (m, j' - 1)} \frac{\binom{n_{1} + n' - i - j'}{n_{1} - i} \binom{l + m'}{l} \binom{i + j' - l - m'}{i - l}}{\binom{n_{1} + n'}{n_{1}}} P (j' - m' \to j - m, m' \to m | n_{2}, n') \frac{γ_{n_{1} + n'} (l + m')}{2} \\
& + δ_{(l > i, m = j)} θ_{1}^{2} \sum_{j' = 0}^{\min (n', j)} \frac{\binom{i + j'}{i} \binom{n_{1} + n' - l - j'}{n_{1} - l}}{\binom{n_{1} + n'}{n_{1}}} P (j' \to j | n_{2}, n') \frac{γ_{n_{1} + n'} (i + j')}{2} \\
& + δ_{(l \geq i, m > j)} θ_{1}^{2} \sum_{m' = 1}^{\min (n', m)} \sum_{j' = 0}^{\min (m' - 1, j)} \frac{\binom{n_{1} + n' - l - m'}{n_{1} - l} \binom{i + j'}{i} \binom{l + m' - i - j'}{l - i}}{\binom{n_{1} + n'}{n_{1}}} P (m' - j' \to m - j, j' \to j | n_{2}, n') \frac{γ_{n_{1} + n'} (i + j')}{2}
\end{aligned}

(38)

where

E (ζ_{k}^{2} (T)) = \frac{θ_{2}}{2} E (t_{k} (T)) + {(\frac{θ_{2}}{2})}^{2} E (t_{k}^{2} (T))

(39)

E_{TT} (ζ_{k}^{2} (T)) = \frac{1}{k (k - 1)} θ_{1} + \frac{2}{k^{2} {(k - 1)}^{2}} θ_{1}^{2};

(40)

E (ζ_{k}^{'} (T) ζ_{k} (T)) = {(\frac{θ_{2}}{2})}^{2} E (t_{k}^{2} (T))

(41)

E_{TT} (ζ_{k}^{'} (T) ζ_{k} (T)) = \frac{2}{k^{2} {(k - 1)}^{2}} θ_{1}^{2};

(42)

E (ζ_{k} (T) ζ_{k'} (T)) = {(\frac{θ_{2}}{2})}^{2} E (t_{k} (T) t_{k'} (T))

(43)

E_{TT} (ζ_{k} (T) ζ_{k'} (T)) = \frac{1}{k (k - 1) k' (k' - 1)} θ_{1}^{2};

(44)

δ_(r) is an index function that takes the value 1 if the relationship r is true and 0 otherwise; a_n is a function of n; β_n (i) and γ_n (i) are functions of n and i; P (j′ → j, m′ → m|n, n′) is a function of j′, j, m′, m, n′ and n; P (k, i; k, j|n) is a function of i, j, k and n; P_a (k, i; k′, j|n) and P_b (k, i; k′, j|n) are functions of i, j, k, k′ and n; the meanings and formulas of these functions can be found in Appendix C.
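The two-stage expressions (40), (42) and (44) can be checked against the general expressions (39), (41) and (43). The sketch below assumes, as in the standard constant-size coalescent, that before T the state durations t_k (T) are independent exponential variables with mean 2λ/(k(k − 1)) in units of N₂ generations, so that E (t_k²) is twice the squared mean. The function names are illustrative only.

```python
import math


def zeta_moments_general(k, k2, theta2, lam):
    """Evaluate (39), (41) and (43) with exponential state durations."""
    mean_k = 2.0 * lam / (k * (k - 1))
    mean_k2 = 2.0 * lam / (k2 * (k2 - 1))
    second_k = 2.0 * mean_k ** 2          # E(t^2) of an exponential variable
    half = theta2 / 2.0
    same_line = half * mean_k + half ** 2 * second_k    # (39)
    two_lines = half ** 2 * second_k                    # (41)
    two_states = half ** 2 * mean_k * mean_k2           # (43), independence
    return same_line, two_lines, two_states


def zeta_moments_TT(k, k2, theta1):
    """Closed forms (40), (42) and (44)."""
    d = k * (k - 1)
    d2 = k2 * (k2 - 1)
    return (theta1 / d + 2.0 * theta1 ** 2 / d ** 2,
            2.0 * theta1 ** 2 / d ** 2,
            theta1 ** 2 / (d * d2))


theta2, lam = 3.0, 0.4
general = zeta_moments_general(4, 6, theta2, lam)
closed = zeta_moments_TT(4, 6, theta1=lam * theta2)  # theta1 = lambda * theta2
print(general)
print(closed)
```

The two tuples agree because θ₁ = λθ₂, which is why the closed forms contain θ₁ although the general expressions are written with θ₂.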

Then

E (ξ_{ijb} ξ_{lmb}) = \sum_{n' = 1}^{n_{2}} P_{n_{2}, n'} (T) E (ξ_{ijb} ξ_{lmb} | n')

(45)

For E (ξ_ija ξ_lma), consider the following conditions:

1	i = l = 0, j < n₂, m < n₂
2	i = l = 0, j < n₂, m = n₂
3	i = l = 0, j = m = n₂
4	otherwise

Condition 1: i = l = 0, j < n₂, m < n₂

E (ξ_{0 ja} ξ_{0 ma}) = δ_{(j = m)} \sum_{k = 2}^{n_{2}} kP (k, j | n_{2}) E (ζ_{ka}^{2}) + δ_{(j + m \leq n_{2})} \sum_{k = 2}^{n_{2}} k (k - 1) P (k, j; k, m | n_{2}) E (ζ_{ka} ζ_{ka}^{'}) + \sum_{k = 2}^{n_{2} - 1} \sum_{k' = k + 1}^{n_{2}} kk' [P (k, j; k', m | n_{2}) + P (k, m; k', j | n_{2})] E (ζ_{ka} ζ_{k' a}) .

(46)

Condition 2: i = l = 0, j < n₂, m = n₂

E (ξ_{0 ja} ξ_{0 n_{2} a}) = \sum_{k = 2}^{n_{2}} kP (k, j | n_{2}) E (ζ_{ka} ζ_{1 a})

(47)

Condition 3: i = l = 0, j = m = n₂

E (ξ_{0 n_{2} a}^{2}) = E (ζ_{1 a}^{2})

(48)

Condition 4: otherwise

E (ξ_{ija} ξ_{lma}) = 0

(49)

where ζ_ka and ζ′_ka are the numbers of mutations accumulated on two different lines of state k after T;

E (ζ_{ka}^{2}) = \frac{θ_{2}}{2} E (t_{ka}) + {(\frac{θ_{2}}{2})}^{2} E (t_{ka}^{2})

(50)

E (ζ_{ka} ζ_{ka}^{'}) = {(\frac{θ_{2}}{2})}^{2} E (t_{ka}^{2})

(51)

E (ζ_{ka} ζ_{k' a}) = {(\frac{θ_{2}}{2})}^{2} E (t_{ka} t_{k' a}) .

(52)

Since the numbers of mutations before and after T are independent of each other given the state at T,

E (ξ_{ij} ξ_{lm} | n') = \begin{cases} E (ξ_{ijb} ξ_{lmb} | n') & \text{if } i > 0, l > 0 \\ E (ξ_{ijb} ξ_{0 mb} | n') + E (ξ_{ijb} | n') E (ξ_{0 ma} | n') & \text{if } i > 0, l = 0 \\ E (ξ_{0 jb} ξ_{lmb} | n') + E (ξ_{lmb} | n') E (ξ_{0 ja} | n') & \text{if } i = 0, l > 0 \\ E (ξ_{0 jb} ξ_{0 mb} | n') + E (ξ_{0 ja} ξ_{0 ma} | n') + E (ξ_{0 jb} | n') E (ξ_{0 ma} | n') + E (ξ_{0 mb} | n') E (ξ_{0 ja} | n') & \text{if } i = l = 0 \end{cases}

(53)

E (ξ_{ij} ξ_{lm}) = \sum_{n' = 1}^{n_{2}} P_{n_{2}, n'} (T) E (ξ_{ij} ξ_{lm} | n') .

(54)

Finally,

Var (ξ_{ij}) = E (ξ_{ij} ξ_{ij}) - E^{2} (ξ_{ij})

(55)

Cov (ξ_{ij}, ξ_{lm}) = E (ξ_{ij} ξ_{lm}) - E (ξ_{ij}) E (ξ_{lm}) .

(56)

5 Number of private, shared and fixed mutations

The expectations E (K_f), E (K_s), E (K_p₁) and E (K_p₂) can be derived from the results of Wakeley and Hey (1997). That is,

E (K_{f} \text{ before } T) = \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{E (ξ_{n_{1}} (T)) + E (ξ_{k} (T))}{\binom{n_{1} + k}{n_{1}}}

(57)

E_{TT} (K_{f} \text{ before } T) = θ_{1} \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \frac{\frac{1}{n_{1}} + \frac{1}{k}}{\binom{n_{1} + k}{n_{1}}}

(58)

E (K_{f} \text{ after } T) = \frac{θ_{2}}{2} E (t_{1 a})

(59)

E_{TT} (K_{f} \text{ after } T) = \frac{θ_{2} T}{2} - \frac{θ_{2}}{2} \sum_{i = 2}^{n_{2}} \frac{1 - \exp \left( - \binom{i}{2} T \right)}{\binom{i}{2} \prod_{j = 2, j \neq i}^{n_{2}} \left[ 1 - \frac{i (i - 1)}{j (j - 1)} \right]}

(60)
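Equation (60) is simple to evaluate numerically. The sketch below is a direct transcription with an assumed function name; the two printed cases use limits that follow from (11): for n₂ = 2 the sum has a single term, and for large T the expression approaches (θ₂/2)(T − 2(1 − 1/n₂)), since 2(1 − 1/n₂) is the expected time to the MRCA of sample 2 when the population size stays at N₂.

```python
import math


def expected_kf_after_TT(n2, T, theta2):
    """Equation (60): expected number of fixed mutations arising after T."""
    total = 0.0
    for i in range(2, n2 + 1):
        pairs = i * (i - 1) / 2.0
        prod = 1.0
        for j in range(2, n2 + 1):
            if j != i:
                prod *= 1.0 - (i * (i - 1)) / (j * (j - 1))
        total += (1.0 - math.exp(-pairs * T)) / (pairs * prod)
    return theta2 / 2.0 * (T - total)


print(expected_kf_after_TT(2, 1.0, 2.0))    # exp(-1) for n2 = 2, T = 1, theta2 = 2
print(expected_kf_after_TT(5, 50.0, 2.0))   # close to 50 - 2 * (1 - 1/5) = 48.4
```

The product terms alternate in sign and grow quickly with n₂, so for large samples this direct evaluation may lose precision to cancellation; the sketch is meant for small n₂.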

E (K_{f}) = E (K_{f} before T) + E (K_{f} after T)

(61)

E (K_{s}) = \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ \frac{E (ξ_{n_{1}} (T)) + E (ξ_{k} (T))}{\binom{n_{1} + k}{n_{1}}} - \sum_{k' = k}^{n_{1} + k - 1} E (ξ_{k'} (T)) \right] + (1 - P_{n_{2}, 1} (T)) \sum_{k' = 1}^{n_{1} - 1} E (ξ_{k'} (T))

(62)

E_{TT} (K_{s}) = θ_{1} \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ a_{n_{1}} + a_{k} - a_{n_{1} + k} + \frac{\frac{1}{n_{1}} + \frac{1}{k}}{\binom{n_{1} + k}{n_{1}}} \right]

(63)

E (K_{p_{1}}) = \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \left[ \sum_{k' = k}^{n_{1} + k - 1} E (ξ_{k'} (T)) - \frac{E (ξ_{n_{1}} (T)) + E (ξ_{k} (T))}{\binom{n_{1} + k}{n_{1}}} \right]

(64)

E_{TT} (K_{p_{1}}) = θ_{1} \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) \left[ a_{n_{1} + k} - a_{k} - \frac{\frac{1}{n_{1}} + \frac{1}{k}}{\binom{n_{1} + k}{n_{1}}} \right]

(65)

E (K_{p_{2}} \text{ before } T) = \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ \sum_{k' = n_{1}}^{n_{1} + k - 1} E (ξ_{k'} (T)) - \frac{E (ξ_{n_{1}} (T)) + E (ξ_{k} (T))}{\binom{n_{1} + k}{n_{1}}} \right]

(66)

E_{TT} (K_{p_{2}} \text{ before } T) = θ_{1} \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ a_{n_{1} + k} - a_{n_{1}} - \frac{\frac{1}{n_{1}} + \frac{1}{k}}{\binom{n_{1} + k}{n_{1}}} \right]

(67)

E (K_{p_{2}} \text{ after } T) = \sum_{k' = 1}^{n_{2} - 1} E (ξ_{k'} (0)) - \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) \left[ \sum_{k' = 1}^{k - 1} E (ξ_{k'} (T)) \right]

(68)

E_{TT} (K_{p_{2}} \text{ after } T) = θ_{2} \left[ a_{n_{2}} - \sum_{k = 2}^{n_{2}} P_{n_{2}, k} (T) a_{k} \right]

(69)

E (K_{p_{2}}) = E (K_{p_{2}} before T) + E (K_{p_{2}} after T)

(70)
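As a consistency check on (58), (63), (65) and (67): for each number k of sample-2 ancestors at T, the four "before T" terms should add up to θ₁ a_{n₁+k}, the expected number of segregating sites in a single sample of n₁ + k sequences. The Python sketch below evaluates the closed forms and confirms this numerically. The function names are illustrative, T is assumed to be in the time units expected by P_{n,k} (T) in (C.3), and the parameter values are made up.

```python
import math

def c2(i):
    """Binomial coefficient (i choose 2)."""
    return i * (i - 1) / 2

def phi(n, k):
    # (C.4); the empty product gives 1 when n == k.
    return math.prod(c2(s) for s in range(k + 1, n + 1))

def psi(n, k, r):
    # (C.5); the empty product gives 1 when r == n and k == n - 1.
    return math.prod(c2(s) - c2(r) for s in range(k + 1, n + 1) if s != r)

def p_coalesce(n, k, T):
    """P_{n,k}(T) of (C.3); k == n is tested first (assumption of this sketch)."""
    if k == n:
        return math.exp(-c2(n) * T)
    if k == 1:
        return phi(n, 1) * sum((1 - math.exp(-c2(i) * T)) / (c2(i) * psi(n, 1, i))
                               for i in range(2, n + 1))
    return phi(n, k) * sum(math.exp(-c2(i) * T) / psi(n, k - 1, i)
                           for i in range(k, n + 1))

def a(n):
    # (C.6): 1 + 1/2 + ... + 1/(n - 1), and 0 for n <= 1.
    return sum(1 / i for i in range(1, n))

def before_T(n1, n2, theta1, T):
    """E_TT of K_f, K_s, K_p1 and K_p2 for mutations arising before T."""
    kf = ks = kp1 = kp2 = 0.0
    for k in range(1, n2 + 1):
        p = p_coalesce(n2, k, T)
        tail = (1 / n1 + 1 / k) / math.comb(n1 + k, n1)
        kf += p * tail                                   # (58)
        kp1 += p * (a(n1 + k) - a(k) - tail)             # (65)
        if k >= 2:
            ks += p * (a(n1) + a(k) - a(n1 + k) + tail)  # (63)
            kp2 += p * (a(n1 + k) - a(n1) - tail)        # (67)
    return tuple(theta1 * x for x in (kf, ks, kp1, kp2))

# Made-up parameter values, for illustration only.
n1, n2, theta1, T = 5, 4, 10.0, 0.5
parts = before_T(n1, n2, theta1, T)
total = theta1 * sum(p_coalesce(n2, k, T) * a(n1 + k) for k in range(1, n2 + 1))
print(parts, sum(parts), total)
```

The same helper for P_{n,k} (T) could be reused for the "after T" terms (60) and (69), which are not covered here.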

In Section 2 we showed that K_f, K_s, K_p₁ and K_p₂ can also be expressed as sums of ξ_ij using (4–7). Their expectations can therefore also be calculated from the E (ξ_ij), i.e.,

E (K_{p_{1}}) = \sum_{i = 1}^{n_{1} - 1} [E (ξ_{i 0}) + E (ξ_{i n_{2}})]

(71)

E (K_{p_{2}}) = \sum_{i = 1}^{n_{2} - 1} [E (ξ_{0 i}) + E (ξ_{n_{1} i})]

(72)

E (K_{s}) = \sum_{i = 1}^{n_{1} - 1} \sum_{j = 1}^{n_{2} - 1} E (ξ_{ij})

(73)

E (K_{f}) = E (ξ_{0 n_{2}}) + E (ξ_{n_{1} 0})

(74)
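The sums behind (71)–(74) can be illustrated on an observed joint frequency table. In the Python sketch below, the layout `xi[i][j]` (number of mutations carried by i sequences of sample 1 and j sequences of sample 2), the function name and the toy counts are all assumptions made for this example.

```python
def classify_mutations(xi):
    """Sum a joint frequency table into private, shared and fixed counts.

    xi[i][j] is assumed to hold the number of mutations carried by i
    sequences of sample 1 and j sequences of sample 2, for
    0 <= i <= n1 and 0 <= j <= n2 (hypothetical layout).
    """
    n1 = len(xi) - 1
    n2 = len(xi[0]) - 1
    # As in (71): polymorphic in sample 1, absent from or fixed in sample 2.
    k_p1 = sum(xi[i][0] + xi[i][n2] for i in range(1, n1))
    # As in (72): polymorphic in sample 2, absent from or fixed in sample 1.
    k_p2 = sum(xi[0][j] + xi[n1][j] for j in range(1, n2))
    # As in (73): polymorphic in both samples.
    k_s = sum(xi[i][j] for i in range(1, n1) for j in range(1, n2))
    # As in (74): fixed in one sample and absent from the other.
    k_f = xi[0][n2] + xi[n1][0]
    return {"K_p1": k_p1, "K_p2": k_p2, "K_s": k_s, "K_f": k_f}


# Toy table for n1 = 2, n2 = 3 (made-up counts, for illustration only).
xi = [
    [0, 4, 1, 2],  # i = 0
    [3, 5, 6, 1],  # i = 1
    [2, 1, 0, 0],  # i = 2
]
counts = classify_mutations(xi)
print(counts)
```

Every cell except `xi[0][0]` and `xi[n1][n2]` falls into exactly one of the four classes, so the four counts add up to the number of mutations that vary within or between the samples.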

We can also use the summation of E (ξ_ijξ_lm) to calculate the expectations of the products of K_f, K_s, K_p₁ and K_p₂:

E (K_{f}^{2}) = E (ξ_{0 n_{2}}^{2}) + E (ξ_{n_{1} 0}^{2}) + 2 E (ξ_{0 n_{2}} ξ_{n_{1} 0})

(75)

E (K_{s}^{2}) = \sum_{i = 1}^{n_{1} - 1} \sum_{j = 1}^{n_{2} - 1} \sum_{l = 1}^{n_{1} - 1} \sum_{m = 1}^{n_{2} - 1} E (ξ_{ij} ξ_{lm})

(76)

E (K_{p_{1}}^{2}) = \sum_{i = 1}^{n_{1} - 1} \sum_{j = 1}^{n_{1} - 1} [E (ξ_{i 0} ξ_{j 0}) + E (ξ_{i n_{2}} ξ_{j n_{2}}) + 2 E (ξ_{i 0} ξ_{j n_{2}})]

(77)

E (K_{p_{2}}^{2}) = \sum_{i = 1}^{n_{2} - 1} \sum_{j = 1}^{n_{2} - 1} [E (ξ_{0 i} ξ_{0 j}) + E (ξ_{n_{1} i} ξ_{n_{1} j}) + 2 E (ξ_{0 i} ξ_{n_{1} j})]

(78)

E (K_{f} K_{s}) = 0

(79)

E (K_{f} K_{p_{1}}) = \sum_{i = 1}^{n_{1} - 1} [E (ξ_{i 0} ξ_{0 n_{2}}) + E (ξ_{i 0} ξ_{n_{1} 0}) + E (ξ_{i n_{2}} ξ_{0 n_{2}}) + E (ξ_{i n_{2}} ξ_{n_{1} 0})]

(80)

E (K_{f} K_{p_{2}}) = \sum_{i = 1}^{n_{2} - 1} [E (ξ_{0 i} ξ_{0 n_{2}}) + E (ξ_{0 i} ξ_{n_{1} 0}) + E (ξ_{n_{1} i} ξ_{0 n_{2}}) + E (ξ_{n_{1} i} ξ_{n_{1} 0})]

(81)

E (K_{s} K_{p_{1}}) = \sum_{i = 1}^{n_{1} - 1} \sum_{j = 1}^{n_{2} - 1} \sum_{k = 1}^{n_{1} - 1} [E (ξ_{ij} ξ_{k 0}) + E (ξ_{ij} ξ_{k n_{2}})]

(82)

E (K_{s} K_{p_{2}}) = \sum_{i = 1}^{n_{1} - 1} \sum_{j = 1}^{n_{2} - 1} \sum_{k = 1}^{n_{2} - 1} [E (ξ_{ij} ξ_{0 k}) + E (ξ_{ij} ξ_{n_{1} k})]

(83)

E (K_{p_{1}} K_{p_{2}}) = \sum_{i = 1}^{n_{1} - 1} \sum_{j = 1}^{n_{2} - 1} [E (ξ_{i 0} ξ_{0 j}) + E (ξ_{i 0} ξ_{n_{1} j}) + E (ξ_{i n_{2}} ξ_{0 j}) + E (ξ_{i n_{2}} ξ_{n_{1} j})],

(84)

with which variances and covariances of K_f, K_s, K_p₁ and K_p₂ can be calculated easily.

6 Average number of mutations since the MRCA of two sequences between samples

It is easy to see that

E (Π_{12}) = 2 E (ζ_{2} (T)) + μ g

(85)

E_{TT} (Π_{12}) = θ_{1} + \frac{θ_{2} T}{2} .

(86)

and

E (Π_{12}^{2} | n_{1}, n_{2}) = \frac{1}{n_{1}^{2} n_{2}^{2}} E [{(\sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{2}} d_{ij}^{*})}^{2}]

(87)

= \frac{1}{n_{1} n_{2}} E (d_{ij}^{* 2}) + \frac{n_{2} - 1}{n_{1} n_{2}} E (d_{ij}^{*} d_{ik}^{*}) + \frac{n_{1} - 1}{n_{1} n_{2}} E (d_{ij}^{*} d_{lj}^{*}) + \frac{(n_{1} - 1) (n_{2} - 1)}{n_{1} n_{2}} E (d_{ij}^{*} d_{lk}^{*}),

(88)

where i ≠ l and j ≠ k. Since $E (Π_{12}^{2} | n_{1}, n_{2})$ can also be expressed and calculated as

E (Π_{12}^{2} | n_{1}, n_{2}) = \frac{1}{n_{1}^{2} n_{2}^{2}} \sum_{i = 0}^{n_{1}} \sum_{j = 0}^{n_{2}} \sum_{l = 0}^{n_{1}} \sum_{m = 0}^{n_{2}} (n_{1} j + n_{2} i - 2 ij) (n_{1} m + n_{2} l - 2 lm) E (ξ_{ij} ξ_{lm}),

(89)

$E (d_{ij}^{* 2}), E (d_{ij}^{*} d_{ik}^{*}), E (d_{ij}^{*} d_{lj}^{*}) and E (d_{ij}^{*} d_{lk}^{*})$ can be obtained by solving the equations for $E (Π_{12}^{2} | 1, 1), E (Π_{12}^{2} | 1, 2), E (Π_{12}^{2} | 2, 1) and E (Π_{12}^{2} | 2, 2)$. That is,

{\begin{matrix} E (d_{ij}^{* 2}) = E (Π_{12}^{2} | 1, 1) \\ E (d_{ij}^{*} d_{ik}^{*}) = 2 E (Π_{12}^{2} | 1, 2) - E (Π_{12}^{2} | 1, 1) \\ E (d_{ij}^{*} d_{lj}^{*}) = 2 E (Π_{12}^{2} | 2, 1) - E (Π_{12}^{2} | 1, 1) \\ E (d_{ij}^{*} d_{lk}^{*}) = 4 E (Π_{12}^{2} | 2, 2) + E (Π_{12}^{2} | 1, 1) - 2 E (Π_{12}^{2} | 1, 2) - 2 E (Π_{12}^{2} | 2, 1) \end{matrix} .

(90)
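For illustration, (90) can be coded directly and checked by a round trip through (88). In the Python sketch below the function names and the numerical moments are hypothetical.

```python
def pair_moments(e11, e12, e21, e22):
    """Solve (90): e_ab stands for E(Pi_12^2 | a, b)."""
    same_pair = e11                                # E(d*_ij^2)
    shared_first = 2 * e12 - e11                   # E(d*_ij d*_ik)
    shared_second = 2 * e21 - e11                  # E(d*_ij d*_lj)
    disjoint = 4 * e22 + e11 - 2 * e12 - 2 * e21   # E(d*_ij d*_lk)
    return same_pair, shared_first, shared_second, disjoint


def second_moment(n1, n2, moments):
    """Right-hand side of (88) for given pair moments."""
    a, b, c, d = moments
    return (a + (n2 - 1) * b + (n1 - 1) * c
            + (n1 - 1) * (n2 - 1) * d) / (n1 * n2)


# Hypothetical pair moments, pushed through (88) and recovered with (90).
true_moments = (5.0, 3.0, 2.5, 2.0)
e = {(a, b): second_moment(a, b, true_moments) for a in (1, 2) for b in (1, 2)}
recovered = pair_moments(e[1, 1], e[1, 2], e[2, 1], e[2, 2])
print(recovered)
```

Once the four pair moments are known, `second_moment` gives E (Π₁₂² | n₁, n₂) for any sample sizes, which avoids the four-fold sum in (89) for large samples.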

With $E (Π_{12}^{2} | n_{1}, n_{2})$ and E (Π₁₂),

Var (Π_{12} | n_{1}, n_{2}) = E (Π_{12}^{2} | n_{1}, n_{2}) - E^{2} (Π_{12})

(91)
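The coefficient n₁j + n₂i − 2ij in (89) counts the between-sample sequence pairs that differ at a site carried by i sequences of sample 1 and j sequences of sample 2, namely i(n₂ − j) + (n₁ − i)j. The Python sketch below checks this on made-up 0/1 data by computing the between-sample average pairwise difference both from the sequences and from the joint frequency table. It assumes that Π₁₂ is computed as this average pairwise difference; the function names and the table layout `xi[i][j]` are illustrative, not taken from the paper.

```python
import random


def joint_table(sample1, sample2):
    """Count sites by (copies in sample 1, copies in sample 2)."""
    n1, n2 = len(sample1), len(sample2)
    xi = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for site in range(len(sample1[0])):
        i = sum(seq[site] for seq in sample1)
        j = sum(seq[site] for seq in sample2)
        xi[i][j] += 1
    return xi


def pi12_from_table(xi):
    """Weighted sum of the table with the coefficients used in (89)."""
    n1, n2 = len(xi) - 1, len(xi[0]) - 1
    total = sum((n1 * j + n2 * i - 2 * i * j) * xi[i][j]
                for i in range(n1 + 1) for j in range(n2 + 1))
    return total / (n1 * n2)


def pi12_from_sequences(sample1, sample2):
    """Average number of differences over all between-sample pairs."""
    diffs = sum(sum(x != y for x, y in zip(s1, s2))
                for s1 in sample1 for s2 in sample2)
    return diffs / (len(sample1) * len(sample2))


# Random 0/1 sequences (1 = mutant allele); purely synthetic data.
rng = random.Random(1)
sample1 = [[rng.randint(0, 1) for _ in range(30)] for _ in range(4)]
sample2 = [[rng.randint(0, 1) for _ in range(30)] for _ in range(5)]
xi = joint_table(sample1, sample2)
print(pi12_from_table(xi), pi12_from_sequences(sample1, sample2))
```

The table-based form only needs the counts ξ_ij, so it can be applied when the individual sequences are no longer available.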

7 Discussion

Longitudinal samples of DNA sequences are DNA sequences sampled from the same population at different time points. For fast evolving organisms, e.g. RNA viruses, this kind of sample has increasingly been used to study the evolutionary process in action. Longitudinal samples provide some interesting new summary statistics of genetic variation, such as the frequency of mutations of size i in one sample and size j in another (ξ_ij), the average number of mutations accumulated since the MRCA of two sequences each from a different sample (Π₁₂), and the number of private, shared and fixed mutations (K_p₁, K_p₂, K_s and K_f). All these statistics were studied under the two-sample model, which assumes that two longitudinal samples were taken from the same measurably evolving population. Because, in applications to longitudinal samples of RNA viruses, a sampling is often followed by medical treatment, we also studied a two-sample-two-stage model, which further assumes an instantaneous population size change right after the first sampling and no population size change before or after that event. We derived the formulas for calculating statistical properties, e.g. expectations, variances and covariances, of these new summary statistics, which can be used to develop methods for better analysis of longitudinal samples.

The results developed in this paper can be utilized in several ways. One is to estimate various parameters. For example, θ is a very important parameter in population genetics. Fu and Li (1993) showed that a better estimator than the widely used estimator of Watterson (1975) can be found by partitioning the total number of mutations K into non-overlapping categories and building an estimator on the more detailed information. The best linear unbiased estimators (BLUEs) were developed on that rationale (Fu, 1994a,b). One natural partition for longitudinal samples is ξ_ij, which provides a much more detailed partition than that of ξ in a single sample (n₁n₂ + n₁ + n₂ − 4 possible partitions vs. n − 1 possible partitions). Based on ξ_ij it is therefore possible to build a better estimator of θ. The construction of a BLUE requires the expectations, variances and covariances of these summary statistics, which can be computed using the formulas derived in this paper. One advantage of a BLUE based on the ξs over estimators based on the likelihood of a genealogy is its robustness when the sequences have undergone recombination. Recombination does not affect the expectations of the ξs, but it reduces their variances, which actually improves the accuracy of the estimates. Therefore, for studying organisms with a relatively high recombination rate, such as HIV, the BLUE is preferred. For parameters that have no linear or quadratic relationship with the summary statistics, e.g. growth rate, the BLUE cannot be applied. One alternative is to use the χ² estimator developed by Fu and Chakraborty (1998). Another alternative is to use moment-based estimators obtained by solving a group of equations numerically, since the expectations of the summary statistics can be expressed as functions of these parameters. For instance, Wakeley and Hey (1997) showed an example of using K_p₁, K_p₂, K_s and K_f to estimate θ₁, θ₂ and T under the infinite site model.

Hypothesis tests can also be constructed with the summary statistics studied in this paper. One of the tests of greatest interest in empirical population genetics is the test of the hypothesis of neutral mutations. Significant progress has been made over the last decade in developing such statistical tests (see reviews by Kreitman, 2000; Ford, 2002). One type of test uses the difference between two canonical summary statistics of genetic variation within a single population (Tajima, 1989; Fu and Li, 1993; Fu, 1997; Fay and Wu, 2000). Most tests of this type are of the form

\frac{L_{1} - L_{2}}{\sqrt{Var (L_{1} - L_{2})}}

(92)

where L₁ and L₂ are summary statistics such that the expectation of L₁ − L₂ equals zero under the null hypothesis that all mutations are selectively neutral. Since longitudinal samples provide more summary statistics than a single sample, there is potential to find more powerful tests of form (92). Other hypothesis tests useful for longitudinal samples include testing the significance of the genetic difference between two longitudinal samples and testing for a change in population size or mutation rate. One example was given by Liu and Fu (2007), who proposed several methods for testing genetical isochronism or detecting significant genetical heterochronism between two longitudinal samples. The two best tests, named T_c₂ and T_c₃, achieved 75% power at the 5% significance level if μg = 1, θ₁ = 10 and n₁ = n₂ = 20. With the same μg = 1 but a much larger θ₁ = 40, they can still achieve about 65% power at the 5% significance level if n₁ = n₂ = 40. These methods can be used to determine the necessary sample size and sampling interval in experimental design or to combine genetically isochronic samples for better data analysis.

Acknowledgments

This research was supported by National Institutes of Health grants GM60777 and GM50428. We thank Sara Barton for preparing the manuscript. We thank three anonymous reviewers for their comments and suggestions. Part of the work was done during an academic visit to the Institute of Biostatistics, Fudan University, Shanghai, China. We thank Dr. Zewei Luo for his invitation.

Appendix

A Probability density of coalescent time under two-sample model

Assume that the population size changes according to a deterministic function

N (t) = N (0) ω (t)

where N (0) and N (t) are the population sizes at time 0 and time t respectively. Here time is scaled in N (0) generations.

For sample 1,

f (t_{m} (T) = t) = {\begin{matrix} f_{m}^{*} (t | T) & if m = n_{1} \\ \int_{0}^{\infty} f_{n_{1}, m}^{*} (s | T) f_{m}^{*} (t | s) ds & if m < n_{1} \end{matrix}

(A.1)

where

f_{m}^{*} (t | s) = \frac{(\begin{matrix} m \\ 2 \end{matrix})}{ω (s + t)} exp (- (\begin{matrix} m \\ 2 \end{matrix}) \int_{s}^{s + t} \frac{d σ}{ω (σ)})

(A.2)

is the conditional probability density function of the m-th coalescent time given at time s there are m ancestral sequences (Griffiths and Tavaré, 1994).

f_{n, k}^{*} (t | s) = ϕ (n, k) \sum_{i = k + 1}^{n} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) \int_{s}^{s + t} \frac{d σ}{ω (σ)})}{ω (s + t) ψ (n, k, i)}

is the conditional probability density of convolution of $f_{n}^{*}, f_{n - 1}^{*}, \dots, f_{k + 1}^{*}$ given at time s there are n ancestral sequences (Polanski et al., 2003).

For sample 2,

f (t_{m} (0) = t) = {\begin{matrix} f_{m}^{*} (t | 0) & if m = n_{2} \\ \int_{0}^{\infty} f_{n_{2}, m}^{*} (s | 0) f_{m}^{*} (t | s) ds & if m < n_{2} \end{matrix}

(A.3)

f^{*} (t_{ma} = t | k) = {\begin{matrix} \frac{f_{n_{2}, k}^{*} (T - t | 0) P_{k, k}^{*} (t | T - t)}{P_{n_{2}, k}^{*} (T)} & if 1 \neq m = k \neq n_{2} \\ δ_{(t = T)} & if m = k = n_{2} \\ \frac{f_{n_{2}, 1}^{*} (T - t | 0)}{P_{n_{2}, 1}^{*} (T)} & if m = k = 1 \\ 0 & if m < k \\ \frac{\int_{0}^{T - t} f_{n_{2}, m}^{*} (x | 0) f_{m}^{*} (t | x) P_{m, 1}^{*} (T - t - x | x + t) dx}{P_{n_{2}, k}^{*} (T)} & if m > k \end{matrix}

(A.4)

f^{*} (t_{mb} = t | k) = {\begin{matrix} f^{*} (t_{m} (T) = t) & if m \leq k \\ 0 & if m > k \end{matrix},

(A.5)

where

P_{n, k}^{*} (t | s) = P_{n, k} (\int_{s}^{s + t} \frac{d σ}{λ (σ)})

(A.6)

is the conditional probability that n sequences coalesce to k sequences within time t, given that there are n sequences at time s (Griffiths and Tavaré, 1994).

B Coalescent time under two-sample-two-stage model

The properties of the coalescent times of sample 1 are the same as those of a single sample, which are well known under the assumptions of a large population size and a continuous approximation of time (Kingman, 1982a,b). Further, if N₁ = N₂ or λ = 1, the properties of t_m (0) are the same as those of a single sample. Here we only consider t_m (0) when λ ≠ 1. Since this section concerns the two-sample-two-stage model, we temporarily drop the “TT” in “E_TT”.

B.1 Calculate E (t_m (0)) and $E (t_{m}^{2} (0))$

Given that there are k sequences at T, for all m ≤ k, t_mb has the same properties as the m-th coalescent time of a single sample. Obviously, we have

E (t_{mb} | k) = {\begin{matrix} \frac{2 λ}{m (m - 1)} & if m \leq k \\ 0 & if m > k \end{matrix}

(B.1)

E (t_{mb}^{2} | k) = {\begin{matrix} 2 E^{2} (t_{mb} | k) & if m \leq k \\ 0 & if m > k \end{matrix} .

(B.2)

We consider the following conditions:

1	k = 1, m = k
2	k = 1, m ≠ k
3	k ≠ 1, m = k
4	k ≠ 1, m ≠ k

Condition 1: k = m = 1

As shown in Hey (1991),

\begin{matrix} f (t_{ma} = t | k) & = \frac{f_{n_{2}, 1} (T - t)}{P_{n_{2}, 1} (T)} \\ & = \frac{ϕ (n_{2}, 1) \sum_{i = 2}^{n_{2}} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) (T - t))}{ψ (n_{2}, 1, i)}}{P_{n_{2}, 1} (T)} \end{matrix}

(B.3)

E (t_{m} (0) | k) = E (t_{ma} | k)

(B.4)

= \frac{T + \sum_{i = 2}^{n_{2}} [\frac{ϕ (n_{2}, 1) exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{{(\begin{matrix} i \\ 2 \end{matrix})}^{2} ψ (n_{2}, 1, i)} - \frac{1}{(\begin{matrix} i \\ 2 \end{matrix})}]}{P_{n_{2}, 1} (T)}

(B.5)

E (t_{m}^{2} (0) | k) = E (t_{ma}^{2} | k)

(B.6)

= \frac{{(T - \sum_{i = 2}^{n_{2}} \frac{1}{(\begin{matrix} i \\ 2 \end{matrix})})}^{2} + \sum_{i = 2}^{n_{2}} [\frac{1}{{(\begin{matrix} i \\ 2 \end{matrix})}^{2}} - \frac{2 ϕ (n_{2}, 1) exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{{(\begin{matrix} i \\ 2 \end{matrix})}^{3} ψ (n_{2}, 1, i)}]}{P_{n_{2}, 1} (T)} .

(B.7)
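The closed forms (B.5) and (B.7) can be checked numerically against the density (B.3). The Python sketch below integrates t and t² against f_{n₂,1} (T − t) / P_{n₂,1} (T) with a midpoint rule and compares the results with the closed forms, reading each sum over i as covering all the terms that depend on i. The function names, the quadrature rule and the parameter values are choices made for this illustration only.

```python
import math

def c2(i):
    """Binomial coefficient (i choose 2)."""
    return i * (i - 1) / 2

def phi(n, k):
    # (C.4)
    return math.prod(c2(s) for s in range(k + 1, n + 1))

def psi(n, k, r):
    # (C.5)
    return math.prod(c2(s) - c2(r) for s in range(k + 1, n + 1) if s != r)

def p_n1(n, T):
    """P_{n,1}(T), the k = 1 case of (C.3)."""
    return phi(n, 1) * sum((1 - math.exp(-c2(i) * T)) / (c2(i) * psi(n, 1, i))
                           for i in range(2, n + 1))

def f_n1(n, t):
    """f_{n,1}(t) of (C.20): density of the time for n lines to reach 1."""
    return phi(n, 1) * sum(math.exp(-c2(i) * t) / psi(n, 1, i)
                           for i in range(2, n + 1))

def conditional_moments(n, T):
    """Closed forms (B.5) and (B.7)."""
    mean_tmrca = sum(1 / c2(i) for i in range(2, n + 1))
    first = T + sum(
        phi(n, 1) * math.exp(-c2(i) * T) / (c2(i) ** 2 * psi(n, 1, i)) - 1 / c2(i)
        for i in range(2, n + 1))
    second = (T - mean_tmrca) ** 2 + sum(
        1 / c2(i) ** 2
        - 2 * phi(n, 1) * math.exp(-c2(i) * T) / (c2(i) ** 3 * psi(n, 1, i))
        for i in range(2, n + 1))
    return first / p_n1(n, T), second / p_n1(n, T)

def by_quadrature(n, T, power, steps=20000):
    """Midpoint-rule integral of t**power times the density (B.3)."""
    h = T / steps
    total = sum(((j + 0.5) * h) ** power * f_n1(n, T - (j + 0.5) * h)
                for j in range(steps))
    return total * h / p_n1(n, T)

n2, T = 5, 0.8  # made-up values
mean, second = conditional_moments(n2, T)
print(mean, by_quadrature(n2, T, 1))
print(second, by_quadrature(n2, T, 2))
```

For n₂ = 2 the closed forms reduce to (T − 1 + e^{−T}) / (1 − e^{−T}) and ((T − 1)² + 1 − 2e^{−T}) / (1 − e^{−T}), which can be verified by hand.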

Condition 2: k = 1, m > 1

\begin{matrix} f (t_{ma} = t | k) & = \frac{f_{m} (t) P_{n_{2}, 1, - m} (T - t)}{P_{n_{2}, 1} (T)} \\ & = \frac{(\begin{matrix} m \\ 2 \end{matrix}) e^{- (\begin{matrix} m \\ 2 \end{matrix}) t} (1 - ϕ_{- m} (n_{2}, 1) \sum_{i = 2, i \neq m}^{n_{2}} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) (T - t))}{ψ_{- m} (n_{2}, 1, i) (\begin{matrix} i \\ 2 \end{matrix})})}{P_{n_{2}, 1} (T)} \end{matrix}

(B.8)

\begin{matrix} E (t_{m} (0) | k) & = E (t_{ma} | k) \\ & = \frac{g_{1} (m, T) - ϕ (n_{2}, 1) \sum_{i = 2, i \neq m}^{n_{2}} \frac{h_{1} (m, i, T)}{(\begin{matrix} i \\ 2 \end{matrix}) ψ_{- m} (n_{2}, 1, i)}}{P_{n_{2}, 1} (T)} \end{matrix}

(B.9)

\begin{matrix} E (t_{m}^{2} (0) | k) & = E (t_{ma}^{2} | k) \\ & = \frac{g_{2} (m, T) - ϕ (n_{2}, 1) \sum_{i = 2, i \neq m}^{n_{2}} \frac{h_{2} (m, i, T)}{(\begin{matrix} i \\ 2 \end{matrix}) ψ_{- m} (n_{2}, 1, i)}}{P_{n_{2}, 1} (T)} . \end{matrix}

(B.10)

Condition 3: k ≠ 1, m = k

if k = m = n₂

f (t_{ma} = t | k) = δ_{(t = T)}

(B.11)

E (t_{ma} | k) = T

(B.12)

E (t_{ma}^{2} | k) = T^{2}

(B.13)

otherwise

f (t_{ma} = t | k) = \frac{f_{n_{2}, k} (T - t) P_{k, k} (t)}{P_{n_{2}, k} (T)}

(B.14)

= \frac{ϕ (n_{2}, k) \sum_{i = k + 1}^{n_{2}} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) T - ((\begin{matrix} k \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})) t)}{ψ (n_{2}, k, i)}}{P_{n_{2}, k} (T)}

(B.15)

E (t_{ma} | k) = \frac{ϕ (n_{2}, k) \sum_{i = k + 1}^{n_{2}} \frac{h_{1} (k, i, T)}{ψ (n_{2}, k, i)}}{P_{n_{2}, k} (T)}

(B.16)

E (t_{ma}^{2} | k) = \frac{ϕ (n_{2}, k) \sum_{i = k + 1}^{n_{2}} \frac{h_{2} (k, i, T)}{ψ (n_{2}, k, i)}}{P_{n_{2}, k} (T)} .

(B.17)

Then

E (t_{m} (0) | k) = E (t_{mb} | k) + E (t_{ma} | k)

(B.18)

E (t_{m}^{2} (0) | k) = E (t_{mb}^{2} | k) + E (t_{ma}^{2} | k) + 2 E (t_{mb} | k) E (t_{ma} | k) .

(B.19)

Condition 4: k ≠ 1, m ≠ k

if m < k

E (t_{m} (0) | k) = E (t_{mb} | k)

(B.20)

= \frac{2 λ}{m (m - 1)}

(B.21)

E (t_{m}^{2} (0) | k) = E (t_{mb}^{2} | k)

(B.22)

= 2 E^{2} (t_{mb} | k)

(B.23)

if m > k

f (t_{ma} = t | k) = \frac{f_{m} (t) P_{n_{2}, k, - m} (T - t)}{P_{n_{2}, k} (T)}

(B.24)

= \frac{ϕ (n_{2}, k) \sum_{i = k, i \neq m}^{n_{2}} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) T - ((\begin{matrix} m \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})) t)}{ψ_{- m} (n_{2}, k - 1, i)}}{P_{n_{2}, k} (T)}

(B.25)

E (t_{m} (0) | k) = E (t_{ma} | k)

(B.26)

= \frac{ϕ (n_{2}, k) \sum_{i = k, i \neq m}^{n_{2}} \frac{h_{1} (m, i, T)}{ψ_{- m} (n_{2}, k - 1, i)}}{P_{n_{2}, k} (T)}

(B.27)

E (t_{m}^{2} (0) | k) = E (t_{ma}^{2} | k)

(B.28)

= \frac{ϕ (n_{2}, k) \sum_{i = k, i \neq m}^{n_{2}} \frac{h_{2} (m, i, T)}{ψ_{- m} (n_{2}, k - 1, i)}}{P_{n_{2}, k} (T)}

(B.29)

where

f_{m} (t) = (\begin{matrix} m \\ 2 \end{matrix}) e^{- (\begin{matrix} m \\ 2 \end{matrix}) t}

(B.30)

is the probability density of the m-th coalescent time; the details of the functions f_n,k (t), f_n,k,−m (t), P_n,k (T), P_n,k,−m (T), ϕ (n, k), ψ (n, k, r), ϕ_−m (n, k), ψ_−m (n, k, r), g₁ (n, T), g₂ (n, T), h₁ (n, i, T) and h₂ (n, i, T) can be found in Appendix C.

All together,

E (t_{m} (0)) = \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) E (t_{m} (0) | k)

(B.31)

E (t_{m}^{2} (0)) = \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) E (t_{m}^{2} (0) | k) .

(B.32)

B.2 Calculate E (t_m (0) t_m′ (0))

If λ ≠ 1, the coalescent times may not be independent. Given that there are k sequences at T, t_m (0) and t_m′ (0) are dependent only if m, m′ ≥ k. Here we show how to compute E (t_ma t_m′a | k), assuming m > m′ ≥ k without loss of generality.

We consider the following conditions:

1	k = 1, k ≠ n₂ − 1, m′ = k
2	k = 1, k ≠ n₂ − 1, m′ > k
3	k = n₂ − 1, m′ = k, m = n₂
4	1 < k < n₂ − 1, m′ = k
5	1 < k < n₂ − 1, m′ > k

Condition 1: k = 1, k ≠ n₂ − 1, m′ = k

f (t_{ma} = t, t_{m' a} = t' | k) = \frac{f_{m} (t) f_{n_{2}, 1, - m} (T - t - t')}{P_{n_{2}, 1} (T)}

(B.33)

E (t_{ma} t_{m' a} | k) = \frac{ϕ (n_{2}, 1) \sum_{i = 2, i \neq m}^{n_{2}} \frac{w_{1} (m, i, T)}{ψ_{- m} (n_{2}, 1, i)}}{P_{n_{2}, 1} (T)} .

(B.34)

Condition 2: k = 1, k ≠ n₂ − 1, m′ > k

f (t_{ma} = t, t_{m' a} = t' | k) = \frac{f_{m} (t) f_{m'} (t') P_{n_{2}, 1, - m, - m'} (T - t - t')}{P_{n_{2}, 1} (T)}

(B.35)

E (t_{ma} t_{m' a} | k) = \frac{w_{1} (m, m', T)}{P_{n_{2}, 1} (T)} - \frac{ϕ (n_{2}, 1) \sum_{i = 2, i \neq m, i \neq m'}^{n_{2}} \frac{w_{2} (m, m', i, T)}{ψ_{- m, - m'} (n_{2}, 1, i) (\begin{matrix} i \\ 2 \end{matrix})}}{P_{n_{2}, 1} (T)} .

(B.36)

Condition 3: k = n₂ − 1, m′ = n₂ − 1, m = n₂

\begin{matrix} E (t_{ma} t_{m' a} | k) & = E (t_{n_{2}} (0) (T - t_{n_{2}} (0))) \\ & = T \cdot E (t_{n_{2}} (0) | n_{2} - 1) - E (t_{n_{2}}^{2} (0) | n_{2} - 1) . \end{matrix}

(B.37)

Condition 4: n₂ − 1 > k > 1, m′ = k

f (t_{ma} = t, t_{m' a} = t' | k) = \frac{f_{m} (t) f_{n_{2}, k, - m} (T - t - t') P_{k, k} (t')}{P_{n_{2}, k} (T)}

(B.38)

E (t_{ma} t_{m' a} | k) = \frac{ϕ (n_{2}, k) \sum_{i = k + 1, i \neq m}^{n_{2}} \frac{w_{2} (m, m', i, T)}{ψ_{- m} (n_{2}, k, i)}}{P_{n_{2}, k} (T)} .

(B.39)

Condition 5: n₂ − 1 > k > 1, m′ ≠ k

f (t_{ma} = t, t_{m' a} = t' | k) = \frac{f_{m} (t) f_{m'} (t') P_{n_{2}, k, - m, - m'} (T - t - t')}{P_{n_{2}, k} (T)}

(B.40)

E (t_{ma} t_{m' a} | k) = \frac{ϕ (n_{2}, k) \sum_{i = k, i \neq m, i \neq m'}^{n_{2}} \frac{w_{2} (m, m', i, T)}{ψ_{- m, - m'} (n_{2}, k - 1, i)}}{P_{n_{2}, k} (T)}

(B.41)

where details of functions f_{n,k,−m,−m′} (T), P_{n,k,−m,−m′} (T), ϕ_−m,−m′ (n, k), ψ_−m,−m′ (n, k, r), w₁ (m, i, T) and w₂ (m, m′, i, T) can be found in Appendix C.

In general, given m > m′

E (t_{m} (0) t_{m'} (0) | k) = {\begin{matrix} E (t_{ma} t_{m' a} | k) + E (t_{m' b} | k) E (t_{ma} | k) & if m' = k \\ [E (t_{ma} | k) + E (t_{mb} | k)] E (t_{m' b} | k) & if m = k \\ E (t_{ma} t_{m' a} | k) & if m' > k \\ E (t_{mb} | k) E (t_{m' b} | k) & if m < k \\ E (t_{ma} | k) E (t_{m' b} | k) & otherwise \end{matrix} .

(B.42)

All together,

E (t_{m} (0) t_{m'} (0)) = \sum_{k = 1}^{n_{2}} P_{n_{2}, k} (T) E (t_{m} (0) t_{m'} (0) | k) .

(B.43)

C Functions

P (k, i | n) = {\begin{matrix} \frac{(\begin{matrix} n - i - 1 \\ k - 2 \end{matrix})}{(\begin{matrix} n - 1 \\ k - 1 \end{matrix})} & if k \geq 2, 1 \leq i < n \\ 0 & otherwise \end{matrix}

(C.1)

is the probability that a randomly chosen line of state k is of size i at state n;

P (k' \to j | n, k) = \frac{(\begin{matrix} j - 1 \\ k' - 1 \end{matrix}) (\begin{matrix} n - j - 1 \\ k - k' - 1 \end{matrix})}{(\begin{matrix} n - 1 \\ k - 1 \end{matrix})}

(C.2)

is the probability that k′ lines in state k grow to j lines in state n;

P_{n, k} (T) = {\begin{matrix} ϕ (n, 1) \sum_{i = 2}^{n} \frac{1 - exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{(\begin{matrix} i \\ 2 \end{matrix}) ψ (n, 1, i)} & if k = 1 \\ ϕ (n, k) \sum_{i = k}^{n} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{ψ (n, k - 1, i)} & if 2 \leq k < n \\ exp (- (\begin{matrix} n \\ 2 \end{matrix}) T) & if k = n \end{matrix}

(C.3)

is the probability that n sequences coalesce to k sequences within time T;
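A direct transcription of (C.3), using ϕ and ψ as defined in (C.4) and (C.5), can be sketched in Python as follows. The function names are illustrative, and testing the case k = n first (so that n = 1 returns 1) is an assumption of this sketch. Because the sums alternate in sign, floating-point evaluation may lose precision for large n; the example uses a small n.

```python
import math

def c2(i):
    """Binomial coefficient (i choose 2)."""
    return i * (i - 1) / 2

def phi(n, k):
    # (C.4); math.prod of an empty sequence is 1, which covers n == k.
    return math.prod(c2(s) for s in range(k + 1, n + 1))

def psi(n, k, r):
    # (C.5); the empty product covers the case r == n, k == n - 1.
    return math.prod(c2(s) - c2(r) for s in range(k + 1, n + 1) if s != r)

def p_coalesce(n, k, T):
    """Probability that n sequences have k ancestors a time T earlier."""
    if k == n:
        return math.exp(-c2(n) * T)
    if k == 1:
        return phi(n, 1) * sum((1 - math.exp(-c2(i) * T)) / (c2(i) * psi(n, 1, i))
                               for i in range(2, n + 1))
    return phi(n, k) * sum(math.exp(-c2(i) * T) / psi(n, k - 1, i)
                           for i in range(k, n + 1))

# Distribution of the number of ancestors of 6 sequences at T = 0.4
# (made-up values); the probabilities should sum to one.
n, T = 6, 0.4
dist = [p_coalesce(n, k, T) for k in range(1, n + 1)]
print(dist, sum(dist))
```

For n = 3 and k = 2 the formula reduces to (3/2)(e^{−T} − e^{−3T}), which follows from integrating the first coalescent time against the probability of no further coalescence.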

ϕ (n, k) = {\begin{matrix} 1 & if n = k \\ \prod_{s = k + 1}^{n} (\begin{matrix} s \\ 2 \end{matrix}) & otherwise \end{matrix};

(C.4)

ψ (n, k, r) = {\begin{matrix} 1 & if r = n, k = n - 1 \\ \prod_{s = k + 1, s \neq r}^{n} ((\begin{matrix} s \\ 2 \end{matrix}) - (\begin{matrix} r \\ 2 \end{matrix})) & otherwise \end{matrix};

(C.5)

a_{n} = {\begin{matrix} 1 + \frac{1}{2} + \dots + \frac{1}{n - 1} & if n > 1 \\ 0 & otherwise \end{matrix};

(C.6)

β_{n} (i) = {\begin{matrix} \frac{2 n}{(n - i + 1) (n - i)} (a_{n + 1} - a_{i}) - \frac{2}{n - i} & if i < n \\ 0 & otherwise \end{matrix};

(C.7)

γ_{n} (i) = β_{n} (i) - β_{n} (i + 1);

(C.8)

P (j' \to j, m' \to m | n, n') = {\begin{matrix} 1 & if j' = j = m' = m = 0 \\ P (m' \to m | n, n') & if j' = j = 0, 0 < m' \leq m \\ P (j' \to j | n, n') & if m' = m = 0, 0 < j' \leq j \\ P (j' \to j | n, n') & \begin{matrix} if j' + m' = n', j + m = n, \\ 0 < j' \leq j, 0 < m' \leq m \end{matrix} \\ \frac{(\begin{matrix} j - 1 \\ j' - 1 \end{matrix}) (\begin{matrix} m - 1 \\ m' - 1 \end{matrix}) (\begin{matrix} n - j - m - 1 \\ n' - j' - m' - 1 \end{matrix})}{(\begin{matrix} n - 1 \\ n' - 1 \end{matrix})} & \begin{matrix} if j' + m' < n', j + m < n, \\ 0 < j' \leq j, 0 < m' \leq m, \\ n' - j' - m' \leq n - j - m \end{matrix} \\ 0 & otherwise \end{matrix}

(C.9)

is the joint probability of mutations of size j′ at state n′ which grow to size j at state n and mutations of size m′ at state n′ which grow to size m at state n (Johnson and Kotz, 1977);

P (k, i; k, j | n) = {\begin{matrix} \frac{(\begin{matrix} n - i - j - 1 \\ k - 3 \end{matrix})}{(\begin{matrix} n - 1 \\ k - 1 \end{matrix})} & if i + j < n, k > 2, i \geq 1, j \geq 1 \\ \frac{1}{n - 1} & if i + j = n, k = 2, i \geq 1, j \geq 1 \\ 0 & otherwise \end{matrix}

(C.10)

is the probability that two randomly chosen lines of state k are of size i and j at state n;

P (k, i; k', j | n) = δ_{(i \geq j)} P_{a} (k, i; k', j | n) + δ_{(i + j \leq n)} P_{b} (k, i; k', j | n)

is the probability that a line of state k and a line of state k′ are of size i and j at state n respectively (Fu, 1995), where

P_{a} (k, i; k', j | n) = {\begin{matrix} \sum_{t = 2}^{min (k' - k + 1, i - j + 1)} \frac{(\begin{matrix} k' - k \\ t - 1 \end{matrix})}{(\begin{matrix} k' - 1 \\ t \end{matrix})} \frac{k - 1}{k'} \frac{(\begin{matrix} i - j - 1 \\ t - 2 \end{matrix}) (\begin{matrix} n - i - 1 \\ k' - t - 1 \end{matrix})}{(\begin{matrix} n - 1 \\ k' - 1 \end{matrix})} & if i > j \\ \frac{k - 1}{(k' - 1) k'} \frac{(\begin{matrix} n - i - 1 \\ k' - 2 \end{matrix})}{(\begin{matrix} n - 1 \\ k' - 1 \end{matrix})} & if i = j \\ 0 & otherwise \end{matrix}

(C.11)

is the probability for the case that the line of state k′ is a descendant of the line of state k and

P_{b} (k, i; k', j | n) = {\begin{matrix} \sum_{t = 1}^{min (k' - 2, k' - k + 1, i)} \frac{(\begin{matrix} k' - k \\ t - 1 \end{matrix})}{(\begin{matrix} k' - 1 \\ t \end{matrix})} \frac{(k - 1) (k' - t)}{tk'} \frac{(\begin{matrix} i - 1 \\ t - 1 \end{matrix}) (\begin{matrix} n - i - j - 1 \\ k' - t - 2 \end{matrix})}{(\begin{matrix} n - 1 \\ k' - 1 \end{matrix})} & if i + j < n, k \geq 2 \\ \frac{1}{jk'} \frac{(\begin{matrix} n - k' \\ j - 1 \end{matrix})}{(\begin{matrix} n - 1 \\ j \end{matrix})} & if i + j = n, k = 2 \\ 0 & otherwise \end{matrix}

(C.12)

is the probability for the case that the line of state k′ is not a descendant of the line of state k;

P_{n, k, - m} (T) = {\begin{matrix} ϕ_{- m} (n, k) \sum_{i = k, i \neq m}^{n} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{ψ_{- m} (n, k - 1, i)} & if 2 \leq k < n \\ ϕ_{- m} (n, 1) \sum_{i = 2, i \neq m}^{n} \frac{1 - exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{(\begin{matrix} i \\ 2 \end{matrix}) ψ_{- m} (n, 1, i)} & if k = 1 \end{matrix};

(C.13)

ϕ_{- m} (n, k) = {\begin{matrix} 1 & if n = k \\ \prod_{s = k + 1, s \neq m}^{n} (\begin{matrix} s \\ 2 \end{matrix}) & otherwise \end{matrix};

(C.14)

ψ_{- m} (n, k, r) = {\begin{matrix} 1 & if r = n, k = n - 1 \\ \prod_{s = k + 1, s \neq r, s \neq m}^{n} ((\begin{matrix} s \\ 2 \end{matrix}) - (\begin{matrix} r \\ 2 \end{matrix})) & otherwise \end{matrix};

(C.15)

g_{1} (n, T) = - T e^{- (\begin{matrix} n \\ 2 \end{matrix}) T} - \frac{e^{- (\begin{matrix} n \\ 2 \end{matrix}) T} - 1}{(\begin{matrix} n \\ 2 \end{matrix})};

(C.16)

g_{2} (n, T) = - T^{2} e^{- (\begin{matrix} n \\ 2 \end{matrix}) T} + \frac{2}{(\begin{matrix} n \\ 2 \end{matrix})} [- T e^{- (\begin{matrix} n \\ 2 \end{matrix}) T} - \frac{e^{- (\begin{matrix} n \\ 2 \end{matrix}) T} - 1}{(\begin{matrix} n \\ 2 \end{matrix})}];

(C.17)

h_{1} (n, i, T) = - \frac{T e^{- (\begin{matrix} n \\ 2 \end{matrix}) T}}{(\begin{matrix} n \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})} - \frac{e^{- (\begin{matrix} n \\ 2 \end{matrix}) T} - e^{- (\begin{matrix} i \\ 2 \end{matrix}) T}}{{((\begin{matrix} n \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix}))}^{2}};

(C.18)

h_{2} (n, i, T) = - \frac{T^{2} e^{- (\begin{matrix} n \\ 2 \end{matrix}) T}}{(\begin{matrix} n \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})} - \frac{2 T e^{- (\begin{matrix} n \\ 2 \end{matrix}) T}}{{((\begin{matrix} n \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix}))}^{2}} - \frac{2 (e^{- (\begin{matrix} n \\ 2 \end{matrix}) T} - e^{- (\begin{matrix} i \\ 2 \end{matrix}) T})}{{((\begin{matrix} n \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix}))}^{3}};

(C.19)
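The functions (C.16)–(C.19) can be read as closed forms of simple integrals over (0, T): g₁ and g₂ as the integrals of t and t² against the exponential density with rate (n choose 2), and h₁ and h₂ as the integrals of t and t² against exp (−(n choose 2) t − (i choose 2) (T − t)). This reading is our interpretation rather than a statement from the text; the Python sketch below checks it with a midpoint rule. Function names and parameter values are illustrative.

```python
import math

def c2(i):
    """Binomial coefficient (i choose 2)."""
    return i * (i - 1) / 2

def g1(n, T):
    c = c2(n)
    return -T * math.exp(-c * T) - (math.exp(-c * T) - 1) / c      # (C.16)

def g2(n, T):
    c = c2(n)
    return -T ** 2 * math.exp(-c * T) + (2 / c) * g1(n, T)         # (C.17)

def h1(n, i, T):
    d = c2(n) - c2(i)
    en, ei = math.exp(-c2(n) * T), math.exp(-c2(i) * T)
    return -T * en / d - (en - ei) / d ** 2                        # (C.18)

def h2(n, i, T):
    d = c2(n) - c2(i)
    en, ei = math.exp(-c2(n) * T), math.exp(-c2(i) * T)
    return (-T ** 2 * en / d - 2 * T * en / d ** 2
            - 2 * (en - ei) / d ** 3)                              # (C.19)

def midpoint(func, T, steps=20000):
    """Midpoint-rule approximation of the integral of func over (0, T)."""
    h = T / steps
    return h * sum(func((j + 0.5) * h) for j in range(steps))

n, i, T = 4, 2, 0.7  # made-up values
cn, ci = c2(n), c2(i)
checks = {
    "g1": (g1(n, T), midpoint(lambda t: t * cn * math.exp(-cn * t), T)),
    "g2": (g2(n, T), midpoint(lambda t: t * t * cn * math.exp(-cn * t), T)),
    "h1": (h1(n, i, T), midpoint(lambda t: t * math.exp(-cn * t - ci * (T - t)), T)),
    "h2": (h2(n, i, T), midpoint(lambda t: t * t * math.exp(-cn * t - ci * (T - t)), T)),
}
for name, (closed, numeric) in checks.items():
    print(name, closed, numeric)
```

The closed forms require n ≠ i, which matches the sums in Appendix B where the index i skips m.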

f_{n, k} (t) = ϕ (n, k) \sum_{i = k + 1}^{n} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) t)}{ψ (n, k, i)}

(C.20)

is the probability density of the convolution of f_n,f_n−1, …, f_k+1;

f_{n, k, - m} (t) = ϕ_{- m} (n, k) \sum_{i = k + 1, i \neq m}^{n} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) t)}{ψ_{- m} (n, k, i)}

(C.21)

is the probability density of the convolution of f_n,f_n−1, …, f_m+1, f_m−1, …, f_k+1;

f_{n, k, - m, - m'} (T) = ϕ_{- m, - m'} (n, k) \sum_{i = k + 1, i \neq m, i \neq m'}^{n} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{ψ_{- m, - m'} (n, k, i)}

(C.22)

is the probability density of the convolution of f_n,f_n−1, …, f_m+1, f_m−1, …, f_m′+1, f_m′−1, …, f_k+1;

P_{n, k, - m, - m'} (T) = {\begin{matrix} ϕ_{- m, - m'} (n, k) \sum_{i = k, i \neq m, i \neq m'}^{n} \frac{exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{ψ_{- m, - m'} (n, k - 1, i)} & if 2 \leq k \\ ϕ_{- m, - m'} (n, 1) \sum_{i = 2, i \neq m, i \neq m'}^{n} \frac{1 - exp (- (\begin{matrix} i \\ 2 \end{matrix}) T)}{(\begin{matrix} i \\ 2 \end{matrix}) ψ_{- m, - m'} (n, 1, i)} & if k = 1 \end{matrix};

(C.23)

ϕ_{- m, - m'} (n, k) = {\begin{matrix} 1 & if n = k \\ \prod_{s = k + 1, s \neq m, s \neq m'}^{n} (\begin{matrix} s \\ 2 \end{matrix}) & otherwise \end{matrix};

(C.24)

ψ_{- m, - m'} (n, k, r) = {\begin{matrix} 1 & if r = n, k = n - 1 \\ \prod_{s = k + 1, s \neq r, s \neq m, s \neq m'}^{n} ((\begin{matrix} s \\ 2 \end{matrix}) - (\begin{matrix} r \\ 2 \end{matrix})) & otherwise \end{matrix};

(C.25)

w_{1} (m, i, T) = (T - \frac{1}{(\begin{matrix} i \\ 2 \end{matrix})}) \frac{g_{1} (m, T)}{(\begin{matrix} i \\ 2 \end{matrix}) (\begin{matrix} m \\ 2 \end{matrix})} - \frac{g_{2} (m, T)}{(\begin{matrix} i \\ 2 \end{matrix}) (\begin{matrix} m \\ 2 \end{matrix})} + \frac{h_{1} (m, i, T)}{{(\begin{matrix} i \\ 2 \end{matrix})}^{2}};

(C.26)

w_{2} (m, m', i, T) = - (T + \frac{1}{[(\begin{matrix} m' \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})]}) \frac{h_{1} (m, m', T)}{[(\begin{matrix} m' \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})]} + \frac{h_{2} (m, m', T)}{[(\begin{matrix} m' \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})]} + \frac{h_{1} (m, i, T)}{{[(\begin{matrix} m' \\ 2 \end{matrix}) - (\begin{matrix} i \\ 2 \end{matrix})]}^{2}} .

(C.27)

References

Drummond A, Forsberg R, Rodrigo AG. The inference of stepwise changes in substitution rates using serial sequence samples. Mol Biol Evol. 2001;18(7):1365–1371. doi: 10.1093/oxfordjournals.molbev.a003920.
Drummond AJ, Nicholls GK, Rodrigo AG, Solomon W. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics. 2002;161(3):1307–1320. doi: 10.1093/genetics/161.3.1307.
Drummond AJ, Pybus OG, Rambaut A, Forsberg R, Rodrigo AG. Measurably evolving populations. Trends Ecol Evol. 2003;18(9):481–488.
Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics. 2000;155(3):1405–1413. doi: 10.1093/genetics/155.3.1405.
Fisher RA. The Genetical Theory of Natural Selection. Oxford: Clarendon; 1930.
Ford MJ. Applications of selective neutrality tests to molecular ecology. Mol Ecol. 2002;11(8):1245–1262. doi: 10.1046/j.1365-294X.2002.01536.x.
Fu YX. Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics. 1994a;138(4):1375–1386. doi: 10.1093/genetics/138.4.1375.
Fu YX. A phylogenetic estimator of effective population-size or mutation-rate. Genetics. 1994b;136(2):685–692. doi: 10.1093/genetics/136.2.685.
Fu YX. Statistical properties of segregating sites. Theor Popul Biol. 1995;48(2):172–197. doi: 10.1006/tpbi.1995.1025.
Fu YX. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics. 1997;147(2):915–925. doi: 10.1093/genetics/147.2.915.
Fu YX, Chakraborty R. Simultaneous estimation of all the parameters of a stepwise mutation model. Genetics. 1998;150(1):487–497. doi: 10.1093/genetics/150.1.487.
Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics. 1993;133(3):693–709. doi: 10.1093/genetics/133.3.693.
Griffiths RC, Tavaré S. Sampling theory for neutral alleles in a varying environment. Philos Trans R Soc Lond B Biol Sci. 1994;344(1310):403–410. doi: 10.1098/rstb.1994.0079.
Hey J. The structure of genealogies and the distribution of fixed differences between DNA sequence samples from natural populations. Genetics. 1991;128(4):831–840. doi: 10.1093/genetics/128.4.831.
Johns G, Joyce G. The promise and peril of continuous in vitro evolution. J Mol Evol. 2005;61:253–263. doi: 10.1007/s00239-004-0307-1.
Johnson NL, Kotz S. Urn Models and Their Application. New York: Wiley; 1977.
Kingman JFC. The coalescent. Stoch Process Appl. 1982a;13:235–248.
Kingman JFC. On the genealogy of large populations. J Appl Prob. 1982b;19A:27–43.
Kreitman M. Methods to detect selection in populations with applications to the human. Annu Rev Genomics Hum Genet. 2000;1:539–559. doi: 10.1146/annurev.genom.1.1.539.
Liu X, Fu Y-X. Test of genetical isochronism for longitudinal samples of DNA sequences. Genetics. 2007;176:327–342. doi: 10.1534/genetics.106.065037.
Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63(1):33–40. doi: 10.1016/s0040-5809(02)00010-2.
Rambaut A. Estimating the rate of molecular evolution: Incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics. 2000;16(4):395–399. doi: 10.1093/bioinformatics/16.4.395.
Rodrigo AG, Felsenstein J. Coalescent approaches to HIV-1 population genetics. Ch. 8. In: Crandall KA, editor. The Evolution of HIV. Baltimore, Maryland: Johns Hopkins University Press; 1999. pp. 233–272.
Rodrigo AG, Goode M, Forsberg R, Ross HA, Drummond A. Inferring evolutionary rates using serially sampled sequences from several populations. Mol Biol Evol. 2003;20(12):2010–2018. doi: 10.1093/molbev/msg215. [DOI] [PubMed] [Google Scholar]
Seo T-K, Thorne JL, Hasegawa M, Kishino H. Estimation of effective population size of HIV-1 within a host: A pseudomaximum-likelihood approach. Genetics. 2002;160(4):1283–1293. doi: 10.1093/genetics/160.4.1283. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tajima F. Evolutionary relationship of DNA-sequences in finite populations. Genetics. 1983;105(2):437–460. doi: 10.1093/genetics/105.2.437. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585–595. doi: 10.1093/genetics/123.3.585. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tavaré S. Line-of-descent and genealogical processes, and their applications in population-genetics models. Theor Popul Biol. 1984;26(2):119–164. doi: 10.1016/0040-5809(84)90027-3. [DOI] [PubMed] [Google Scholar]
Wakeley J, Hey J. Estimating ancestral population parameters. Genetics. 1997;145(3):847–855. doi: 10.1093/genetics/145.3.847. [DOI] [PMC free article] [PubMed] [Google Scholar]
Watterson GA. Number of segregating sites in genetic models without recombination. Theor Popul Biol. 1975;7(2):256–276. doi: 10.1016/0040-5809(75)90020-9. [DOI] [PubMed] [Google Scholar]
Willerslev E, Cooper A. Ancient DNA. Proc Biol Sci. 2005;272:3–16. doi: 10.1098/rspb.2004.2813. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wright S. Evolution in mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97. [DOI] [PMC free article] [PubMed] [Google Scholar]

