Author manuscript; available in PMC: 2011 Mar 18.
Published in final edited form as: Theor Popul Biol. 2008 May 5;74(1):56–67. doi: 10.1016/j.tpb.2008.04.006
Summary Statistics of Neutral Mutations in Longitudinal DNA Samples
The publisher's version of this article is available at Theor Popul Biol
Abstract
Longitudinal samples of DNA sequences are DNA sequences sampled from the same population at different time points. For fast-evolving organisms, e.g. RNA viruses, this kind of sample has increasingly been used to study the evolutionary process in action. Longitudinal samples provide some interesting new summary statistics of genetic variation, such as the frequency of mutations of size i in one sample and size j in another, the average number of mutations accumulated since the common ancestor of two sequences, each from a different sample, and the number of private, shared and fixed mutations within samples. To make the results more widely applicable, we used in this study a general two-sample model, which assumes that two longitudinal samples were taken from the same measurably evolving population. Inspired by HIV studies, we also studied a two-sample-two-stage model, which is a special case of the two-sample model and assumes that a treatment after the first sampling instantaneously changes the population size. We derived formulas for calculating statistical properties, e.g. expectations, variances and covariances, of these new summary statistics under the two models. Potential applications of these results are discussed.
Keywords: longitudinal sample, coalescent theory, summary statistics, measurably evolving population
1 Introduction
One fundamental goal of population genetics is to understand how the genetic variation of a population changes in the process of evolution. Longitudinal samples, or serial samples, are samples taken at a series of time points from the same population. For fast-evolving organisms (e.g. RNA viruses), longitudinal DNA/RNA samples are often used to study changes in the population. Other applications of longitudinal DNA/RNA samples include ancient DNA studies (Willerslev and Cooper, 2005) and artificial evolution experiments on RNA enzymes (Johns and Joyce, 2005). Several new statistical methods have been developed for analyzing longitudinal DNA samples (reviewed by Drummond et al., 2003). Most of these methods are based on likelihood frameworks (e.g. Rodrigo and Felsenstein, 1999; Rambaut, 2000; Drummond et al., 2001; Seo et al., 2002; Drummond et al., 2002; Rodrigo et al., 2003), which try to capture all the information in the data and thus to be more powerful. However, those methods face some difficulties, such as high computational intensity and strong model dependency. In particular, calculating the likelihood of haplotypes with recombination is so computationally intensive that unrealistic assumptions, such as no recombination or free recombination between markers, are often made. Such assumptions may cause large biases in parameter estimates. Summary statistics, on the other hand, may suffer from some loss of information. But with carefully chosen summary statistics, it is possible to capture the essential information in a sample effectively while avoiding the bias and some of the difficulties faced by likelihood-based methods.
In this paper we studied some new summary statistics of the number of mutations provided by longitudinal samples, namely the frequency of mutations of size i in one sample and size j in another, the average number of mutations accumulated since the common ancestor of two sequences, each from a different sample, and the number of private, shared and fixed mutations within samples (see definitions below). Assuming no selection and no recombination, we derived formulas for calculating the expectations, variances and covariances of these summary statistics. The statistical properties of these summary statistics are independent of the mutation model. Under the infinite site model, the number of mutations and the number of segregating sites are equal, so the results of this paper can be applied directly to sequence data. Under a finite site model, the number of mutations and the number of segregating sites are not equal. But as long as the mutation model is given, the number of mutations can be recovered adequately in most cases, so these summary statistics of mutations remain widely applicable. While the expectations of these summary statistics are not affected by recombination, their variances will be smaller when there is recombination. Therefore, when applying estimators or test statistics built on the variances of these summary statistics, caution is needed in interpreting the results if recombination cannot be ignored.
In the remainder of this paper we first introduce the definitions of the different summary statistics, together with our two-sample model and its special case, the two-sample-two-stage model, which is inspired by HIV treatment. Then we present the formulas for calculating the expectations, variances and covariances of these summary statistics. Finally, we discuss their potential applications. Most technical details can be found in the Appendix.
2 Summary statistics for longitudinal samples
Among the best-known summary statistics for DNA polymorphism (segregating sites) are the number of mutations of various sizes (ξi), the number of segregating sites (K) and the average number of differences between two sequences (Π) (see definitions below). The properties of these statistics have been well studied for the case of a single sample under the infinite site model (Watterson, 1975; Tajima, 1983; Tavaré, 1984; Fu, 1995). Since under the infinite site model each new mutation produces a new segregating site, segregating sites and mutations are interchangeable. In this paper, we borrowed the notation of these classic summary statistics for our new statistics, but keep in mind that our new summary statistics are for mutations. They can be applied to sequence analysis directly under the infinite site model. Under a finite site model, however, a transformation to statistics of segregating sites is needed.
We begin with the definition of these summary statistics under the infinite site model. A mutation on the genealogy of a DNA sample divides the sample into two groups: one consists of the sequences that inherit the mutation, and the other of those that do not. If there are i descendants of the mutation in the sample, the mutation is said to be of size i. Let ξi be the total number (or frequency) of mutations of size i in the sample; i ranges from 1 to n − 1, where n is the sample size. The total number of segregating sites, K, is just the total number of mutations on the genealogy of the sample since their most recent common ancestor (MRCA). It can also be regarded as the sum of all ξi, i.e.
$$K = \sum_{i=1}^{n-1} \xi_i \tag{1}$$
The average number of differences between two sequences, Π, is defined as
$$\Pi = \binom{n}{2}^{-1} \sum_{i<j} d_{ij} \tag{2}$$
where dij is the number of differences between the i-th and j-th sequences of the sample of n sequences. Again, Π can be written as a linear combination of the ξi, i.e.
$$\Pi = \binom{n}{2}^{-1} \sum_{i=1}^{n-1} i\,(n-i)\,\xi_i \tag{3}$$
An example of a genealogy of five sequences is presented in Figure 1. The sample sequences and their MRCA sequence are shown at the bottom and top of the tree, respectively. There are four mutations (shown as gray dots) on the genealogy, with mutant ancestor sequences shown on the left-hand side. The ti (i ≥ 2) shown on the right-hand side of the tree is the i-th coalescent time. If we define state i as the state in which there are exactly i lines on the genealogy (or i ancestral sequences of the sample), the i-th coalescent time is the time spent in state i. Given these sequences, it is easy to calculate that ξ1 = 2, ξ2 = ξ3 = 1, ξ4 = 0, K = 4, Π = 2.
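The single-sample definitions can be checked numerically. The following is a minimal Python sketch, assuming the sample is coded as a 0/1 matrix (rows are sequences, columns are segregating sites, 1 marks the mutant state) under the infinite site model. The matrix `sample` is a hypothetical data set chosen only to reproduce the counts quoted for Figure 1; it is not read off the figure, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def site_sizes(sample):
    """Size of each mutation: number of sequences carrying the mutant state."""
    return [sum(column) for column in zip(*sample)]

def spectrum(sample):
    """Return [xi_1, ..., xi_{n-1}], the number of mutations of each size."""
    n = len(sample)
    xi = [0] * (n - 1)
    for size in site_sizes(sample):
        if 0 < size < n:              # only polymorphic sites are counted
            xi[size - 1] += 1
    return xi

def pi_from_pairs(sample):
    """Average number of differences over all pairs of sequences."""
    n = len(sample)
    total = sum(sum(a != b for a, b in zip(s, t))
                for s, t in combinations(sample, 2))
    return Fraction(total, n * (n - 1) // 2)

def pi_from_spectrum(xi):
    """Same quantity as a linear combination of the xi_i (weights i(n - i))."""
    n = len(xi) + 1
    total = sum(i * (n - i) * x for i, x in enumerate(xi, start=1))
    return Fraction(total, n * (n - 1) // 2)

# Hypothetical sample of n = 5 sequences and 4 mutations (sizes 3, 2, 1, 1).
sample = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

xi = spectrum(sample)     # [2, 1, 1, 0]
K = sum(xi)               # 4
print(xi, K, pi_from_pairs(sample), pi_from_spectrum(xi))
```

Both ways of computing Π agree because a mutation of size i separates exactly i(n − i) of the pairs of sequences.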
The above summary statistics describe the mutations in a single sample. Because longitudinal samples consist of more than one correlated sample, there are some interesting summary statistics describing the variation between samples. The purpose of this study is to investigate the statistical properties of these summary statistics, on the basis of which methods for analyzing longitudinal samples can be developed.
The type of summary statistics we are interested in is best illustrated by Figure 2, which shows a genealogy of three longitudinal samples s1, s2 and s3, taken at three different time points, each with three sequences. There are five mutations, numbered 1 to 5, on the genealogy, shown as gray dots. Consider a pair of longitudinal samples first. Define ξij as the number of mutations that are of size i in the earlier sample and size j in the later sample. Taking sample s1 as the earlier sample and sample s2 as the later sample, we can see that mutation 1 is of size 1 in both samples; mutation 2 is of size 1 in sample s1 and size 0 in sample s2; and mutation 3 is of size 0 in sample s1 and size 1 in sample s2. Thus, for samples s1 and s2, ξ11 = ξ10 = ξ01 = 1 and all other ξij equal 0. Now look at samples s2 and s3. We can see that mutation 3 is of size 1 in sample s2 and size 3 in sample s3; mutation 4 is of size 0 in sample s2 and size 3 in sample s3; and mutation 5 is of size 0 in sample s2 and size 2 in sample s3. Taking sample s2 as the earlier sample and sample s3 as the later sample, ξ13 = ξ03 = ξ02 = 1 and all other ξij equal 0.
A genealogy of three longitudinal samples each with three sequences.
We can divide the ξij into different groups to form new summary statistics. Define the number of private mutations, Kp, as the number of mutations that are polymorphic in only one of the two samples. Kp can be further divided into Kp1 and Kp2, the numbers of mutations that are polymorphic only in the earlier sample or only in the later sample, respectively. Define the number of shared mutations, Ks, as the number of mutations that are polymorphic in both samples. Define the number of fixed mutations, Kf, as the number of mutations that are monomorphic in both samples, with the monomorphic type in the earlier sample differing from that in the later sample. Written as functions of the ξij:
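$$K_{p1} = \sum_{i=1}^{n_1-1} \left(\xi_{i0} + \xi_{i n_2}\right) \tag{4}$$
$$K_{p2} = \sum_{j=1}^{n_2-1} \left(\xi_{0j} + \xi_{n_1 j}\right) \tag{5}$$
$$K_s = \sum_{i=1}^{n_1-1} \sum_{j=1}^{n_2-1} \xi_{ij} \tag{6}$$
$$K_f = \xi_{0 n_2} + \xi_{n_1 0} \tag{7}$$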
where n1 and n2 are the sizes of the first and the second sample, respectively. Looking at samples s1 and s2 again, we can see that mutation 1 is a shared mutation of samples s1 and s2, mutation 2 is a private mutation of sample s1 and mutation 3 is a private mutation of sample s2. For samples s2 and s3, mutation 3 is a private mutation of sample s2, mutation 4 is a fixed mutation and mutation 5 is a private mutation of sample s3. Thus, for s1 and s2, Kp1 = Kp2 = Ks = 1 and Kf = 0; for s2 and s3, Kp1 = Kp2 = Kf = 1 and Ks = 0. In fact, if we consider only two longitudinal samples, shared mutations and fixed mutations are mutually exclusive, that is, either Ks or Kf must equal 0 (Wakeley and Hey, 1997). This fact is easy to understand by considering the possible positions of mutations on the genealogy.
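The verbal definitions of ξij, Kp1, Kp2, Ks and Kf translate directly into a counting procedure. The sketch below is an illustration only: it assumes both samples are coded as 0/1 matrices over the same set of sites, and the small matrices are hypothetical data constructed to match the mutation sizes listed in the text for each pair of samples (mutations 1 to 3 for s1 and s2, mutations 3 to 5 for s2 and s3). The function names are not from the paper.

```python
from collections import Counter

def joint_spectrum(sample1, sample2):
    """Counter mapping (i, j) to xi_ij: size i in sample 1, size j in sample 2."""
    xi = Counter()
    for site in range(len(sample1[0])):
        i = sum(seq[site] for seq in sample1)
        j = sum(seq[site] for seq in sample2)
        xi[(i, j)] += 1
    return xi

def classify(xi, n1, n2):
    """Group the xi_ij into private, shared and fixed mutation counts."""
    counts = {"Kp1": 0, "Kp2": 0, "Ks": 0, "Kf": 0}
    for (i, j), number in xi.items():
        poly1 = 0 < i < n1            # polymorphic in the earlier sample
        poly2 = 0 < j < n2            # polymorphic in the later sample
        if poly1 and poly2:
            counts["Ks"] += number
        elif poly1:
            counts["Kp1"] += number
        elif poly2:
            counts["Kp2"] += number
        elif (i, j) in ((0, n2), (n1, 0)):
            counts["Kf"] += number    # monomorphic in both, different types
    return counts

# Pair (s1, s2), columns are mutations 1, 2, 3.
s1 = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
s2 = [[1, 0, 0], [0, 0, 1], [0, 0, 0]]
# Pair (s2, s3), columns are mutations 3, 4, 5.
s2_late = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
s3 = [[1, 1, 1], [1, 1, 1], [1, 1, 0]]

pair_a = classify(joint_spectrum(s1, s2), 3, 3)
pair_b = classify(joint_spectrum(s2_late, s3), 3, 3)
print(pair_a)
print(pair_b)
```

In both pairs at most one of Ks and Kf is nonzero, in line with the mutual exclusivity noted above.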
As for the average number of mutations accumulated since the common ancestor of two sequences (i.e. the average number of differences between two sequences under the infinite site model), in addition to the cases in which both sequences come from the same sample, we can now define Π12 as the average number of mutations accumulated since the common ancestor of two sequences, one from the earlier sample and one from the later sample:
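$$\Pi_{12} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} d_{ij} \tag{8}$$
where dij is the number of mutations accumulated on the i-th sequence of the earlier sample and the j-th sequence of the later sample since their MRCA, and again n1 and n2 are the sizes of the two samples. Written as a function of the ξij:
$$\Pi_{12} = \frac{1}{n_1 n_2} \sum_{i=0}^{n_1} \sum_{j=0}^{n_2} \left[\, i\,(n_2 - j) + (n_1 - i)\, j \,\right] \xi_{ij} \tag{9}$$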
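Under the infinite site model, Π12 can be computed either pair by pair or from the sizes (i, j) of each mutation: a mutation of size i in the earlier sample and size j in the later sample separates i(n2 − j) + (n1 − i)j of the n1n2 between-sample pairs, since a pair differs at that mutation exactly when one sequence carries it and the other does not. The Python sketch below checks that the two routes agree on a small hypothetical data set (0/1 matrices over the same sites, sized to match the s1 and s2 example in the text); the function names are illustrative.

```python
from fractions import Fraction

def pi12_from_pairs(sample1, sample2):
    """Average number of differences over all between-sample pairs."""
    n1, n2 = len(sample1), len(sample2)
    total = sum(sum(a != b for a, b in zip(s, t))
                for s in sample1 for t in sample2)
    return Fraction(total, n1 * n2)

def pi12_from_sizes(sample1, sample2):
    """Same quantity from mutation sizes, weighting each mutation of size
    (i, j) by the number of between-sample pairs it separates."""
    n1, n2 = len(sample1), len(sample2)
    total = 0
    for site in range(len(sample1[0])):
        i = sum(seq[site] for seq in sample1)
        j = sum(seq[site] for seq in sample2)
        total += i * (n2 - j) + (n1 - i) * j
    return Fraction(total, n1 * n2)

# Hypothetical data: mutation sizes (1, 1), (1, 0) and (0, 1).
early = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
late = [[1, 0, 0], [0, 0, 1], [0, 0, 0]]

print(pi12_from_pairs(early, late), pi12_from_sizes(early, late))
```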
3 Two-sample model and two-sample-two-stage model
One of the simplest models for longitudinal samples is the two-sample model, under which only two longitudinal samples are considered, so that ξij, Kp, Ks, Kf and Π12 are all summary statistics for two longitudinal samples. In this model, we assumed that the population follows the commonly used Wright-Fisher model for haploids (Fisher, 1930; Wright, 1931) and that the population size changes according to a deterministic function
(10)
where ω (t) is a function of time t, and N (0) and N (t) are the population sizes at time 0 and time t, respectively. The second sample (sample 2, size n2) is taken at time 0 and the first sample (sample 1, size n1) is taken g generations earlier. (We followed the retrospective view of coalescent theory and defined time 0 as the time point when sample 2 was taken. However, “before” and “after” in this paper still refer to the normal prospective view.) Unless otherwise mentioned, time in this paper is scaled in units of N (0) generations, and we let T = g/N (0). Let N1 and N2 be the population sizes at the times of the first and the second sampling, respectively; that is, N1 = N (T) and N2 = N (0). We further assumed that all mutations are neutral, that the mutation rate is μ per sequence per generation and remains constant over time, and that the number of mutations accumulated in a given period of time τ follows the Poisson distribution with parameter μτ. Some other parameters are defined as follows: λ = N1/N2, θ1 = 2N1μ, θ2 = 2N2μ.
In HIV studies, longitudinal samples can be used to study the effect of a treatment, for example by giving a treatment after the first sampling and taking a second sample some time later. In this case we may assume a two-sample-two-stage model, which is a special case of the general two-sample model. Figure 3 illustrates the model with two longitudinal samples, each with three sequences. In this model, we assume that before the earlier sample (sample 1) was taken, the effective population size, N1, remained constant. Right after sample 1 was taken, a treatment was given and the effective population size changed instantaneously to N2, remaining constant thereafter. The later sample, sample 2, was taken g generations after sample 1. This two-sample-two-stage model is closely related to the isolation model and size-change model of Wakeley and Hey (1997). In fact, some of the expectations of the new summary statistics in this study were derived from their results. However, they provided no formulas for calculating the variances and covariances of the statistics, and they omitted some other important expectations. This largely limited the application of these statistics, and we hope this study can fill these gaps.
The basis of two-sample model is coalescent times. Let tm (T) and tm (0) be the m-th coalescent times of samples taken at time T and time 0, respectively. Let tma and tmb be the part of tm (0) after and before T, respectively. Suppose at T, there are k ancestral sequences of sample 2. Then given k, tka and tkb are independent; if m < k, tma = 0 and tmb = tm (0); if m > k, tma = tm (0) and tmb = 0. If k = 1, as shown in Figure 3, define t1 (0) = t1a as the time interval between T and the MRCA of sample 2, that is,
$$t_1(0) = t_{1a} = \max\left(0,\; T - \sum_{m=2}^{n_2} t_m(0)\right) \tag{11}$$
where max (·) stands for maximum.
With the probability densities of coalescent times under the two-sample model or the two-sample-two-stage model (see Appendix A and B) we can calculate E (tm (T)), E (tma) and E (tmb). All the means of the summary statistics studied in this paper can then be calculated directly (see details in the following sections). Variances and covariances of the summary statistics can also be derived under the two-sample model; these require second moments of the coalescent times, including E (tmatm′a), where m′ ≠ m, and the mathematics becomes more complicated. For a simpler model, such as the two-sample-two-stage model, it is possible to calculate them directly (see Appendix B). For more complicated demographic changes, simulation may be used to obtain those quantities. Since only coalescent times need to be simulated, the simulation can be very fast, and accurate estimates of these quantities can be obtained in practice.
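To illustrate how cheaply coalescent times can be simulated, here is a minimal Python sketch for sample 2 under the two-sample-two-stage model. It is an illustration under stated assumptions, not the authors' procedure: time runs backward from sample 2 in units of N (0) = N2 generations, m lines coalesce at rate m(m − 1)/2 between time 0 and T, and at rate m(m − 1)/(2λ) beyond T, where the population size is N1 = λN2. The function and variable names are hypothetical.

```python
import random

def simulate_sample2_times(n2, T, lam, rng):
    """Draw (t_a, t_b, k): t_a[m] and t_b[m] are the parts of the m-th
    coalescent time of sample 2 after and before T (prospective view),
    and k is the number of ancestral lines of sample 2 at T."""
    t_a = {m: 0.0 for m in range(2, n2 + 1)}
    t_b = {m: 0.0 for m in range(2, n2 + 1)}
    elapsed = 0.0
    m = n2
    # Stage between sample 2 (time 0) and sample 1 (time T): size N2.
    while m >= 2:
        wait = rng.expovariate(m * (m - 1) / 2)
        if elapsed + wait > T:
            break                      # state m is still running at T
        t_a[m] = wait
        elapsed += wait
        m -= 1
    k = m
    if k >= 2:
        t_a[k] = T - elapsed           # part of state k spent after T
    # Stage older than T: size N1 = lam * N2. By memorylessness the
    # remaining time in state k is a fresh exponential draw.
    for m in range(k, 1, -1):
        t_b[m] = rng.expovariate(m * (m - 1) / (2 * lam))
    return t_a, t_b, k

def mean_total_time(m, n2, T, lam, reps=20000, seed=1):
    """Monte Carlo estimate of E(t_m(0)) = E(t_ma + t_mb)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        t_a, t_b, _k = simulate_sample2_times(n2, T, lam, rng)
        total += t_a[m] + t_b[m]
    return total / reps

# With lam = 1 the model reduces to a constant-size population, for which
# the expected time in state m is 2 / (m (m - 1)).
print(mean_total_time(2, n2=5, T=0.5, lam=1.0))
```

Averaging products such as `t_a[m] * t_a[m2]` over the draws gives Monte Carlo estimates of second moments like E (tmatm′a) in the same way.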
4 Number of mutations of various sizes in two samples
4.1 Calculate E (ξij)
As defined in Section 2, ξij is the number of mutations which are size i in sample 1 and size j in sample 2. Further define ξijb and ξija as the parts of ξij that occurred before and after T, respectively. Assume there are n′ (n′ ≤ n2) ancestral sequences of sample 2 at time T, then
(12)
(13)
where E (·|n′) means the expectation of some statistic, specified by ·, given there are n′ ancestral sequences of sample 2 at time T; min (·) stands for the minimum; ξi|n (T) is the number of mutations of size i in a sample of size n and taken at time T; P (k′ → j|n, k) is a function of k′, j, n, k, whose meaning and formula can be found in Appendix C.
The properties of ξi|n (T) have been studied by Fu (1995) and for 1 ≤ j < n
(14)
where P (k, i|n) is a function of k, i and n, whose meaning and formula can be found in Appendix C; ζk (T) is the number of mutations accumulated on a line of state k.
(15)
Under two-sample-two-stage model, it can be simplified (Fu, 1995) as
(16)
where “ETT” indicates that the expectation is applicable only to the two-sample-two-stage model. Substituting (16) into (14) and (12), we have
For the shared mutations, the expectation of ξij can be derived from the results of Wakeley and Hey (1997):
(19)
(20)
(21)
with 0 < i ≤ n1, 0 < j < n2, where Pn,k (T) is a function of n, k and T, whose meaning and formula can be found in Appendix C.
Similarly, we can compute the expectations for the fixed or private mutations:
(22)
(23)
(24)
(25)
(26)
(27)
(28)
(29)
(30)
(31)
(32)
(33)
(34)
(35)
(36)
where 0 < i ≤ n1, 0 < j < n2, except 0 < i < n1 in (25).
4.2 Calculate E (ξij²) and E (ξijξlm)
Again, assume there are n′ (n′ ≤ n2) ancestral sequences of sample 2 at time T. E (ξijbξlmb|n′) can be computed as
(37)
and
(38)
where
(39)
(40)
(41)
(42)
(43)
(44)
δ(r) is an indicator function that takes the value 1 if the relationship r is true and 0 otherwise; an is a function of n; βn (i) and γn (i) are functions of n and i; P (j′ → j, m′ → m|n, n′) is a function of j′, j, m′, m, n′ and n; P (k, i; k, j|n) is a function of i, j, k and n; Pa (k, i; k′, j|n) and Pb (k, i; k′, j|n) are functions of i, j, k, k′ and n; the meanings and formulas of these functions can be found in Appendix C.
Then
(45)
For E (ξijaξlma), consider the following conditions:
where ζka and ζ′ka are the numbers of mutations accumulated on two different lines of state k after T;
(50)
(51)
(52)
Since the numbers of mutations before and after T are independent of each other given the state at T,
(53)
(54)
Finally,
(55)
(56)
5 Number of private, shared and fixed mutations
The expectations E (Kf), E (Ks), E (Kp1) and E (Kp2) can be derived from the results of Wakeley and Hey (1997). That is,
(57)
(58)
(59)
(60)
(61)
(62)
(63)
(64)
(65)
(66)
(67)
(68)
(69)
(70)
In Section 2 we showed that Kf, Ks, Kp1 and Kp2 can also be expressed as sums of ξij using (4–7). Their expectations can therefore also be calculated from the E (ξij), i.e.,
(71)
(72)
(73)
(74)
We can also use the summation of E (ξijξlm) to calculate the expectations of the products of Kf, Ks, Kp1 and Kp2:
(75)
(76)
(77)
(78)
(79)
(80)
(81)
(82)
(83)
(84)
with which variances and covariances of Kf, Ks, Kp1 and Kp2 can be calculated easily.
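For concreteness, the step from expectations of products to variances and covariances uses only the standard moment identities. The LaTeX block below writes them out for Ks and Kp1 as an illustration; the same pattern applies to every pair among Kf, Ks, Kp1 and Kp2. The last line is a consequence of the mutual exclusivity of shared and fixed mutations noted in Section 2 (Ks Kf = 0 in every realization), stated here as a derived remark rather than as a formula taken from the paper.

```latex
\begin{align*}
\mathrm{Var}(K_s) &= E(K_s^2) - \left[E(K_s)\right]^2, \\
\mathrm{Cov}(K_s, K_{p1}) &= E(K_s K_{p1}) - E(K_s)\,E(K_{p1}), \\
\mathrm{Cov}(K_s, K_f) &= E(K_s K_f) - E(K_s)\,E(K_f) = -\,E(K_s)\,E(K_f).
\end{align*}
```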
6 Average number of mutations since the MRCA of two sequences between samples
It is easy to see that
(85)
(86)
and
(87)
(88)
where i ≠ l and j ≠ k. Since can also be expressed and calculated as
(89)
can be obtain by solving equations for . That is
(90)
With and E (Π12),
(91)
7 Discussion
Longitudinal samples of DNA sequences are DNA sequences sampled from the same population at different time points. For fast-evolving organisms, e.g. RNA viruses, this kind of sample has increasingly been used to study the evolutionary process in action. Longitudinal samples provide some interesting new summary statistics of genetic variation, such as the frequency of mutations of size i in one sample and size j in another (ξij), the average number of mutations accumulated since the MRCA of two sequences, each from a different sample (Π12), and the numbers of private, shared and fixed mutations (Kp1, Kp2, Ks and Kf). All these statistics were studied under the two-sample model, which assumes two longitudinal samples. Because, in applications of longitudinal samples of RNA viruses, a sampling is often followed by medical treatment, we also studied a two-sample-two-stage model, which further assumes an instantaneous population size change right after the first sampling and no population size change before or after that event. We derived formulas for calculating statistical properties, e.g. expectations, variances and covariances, of these new summary statistics, which can be used to develop methods for better analysis of longitudinal samples.
The results developed in this paper can be used in several ways. One is to estimate various parameters. For example, θ is a very important parameter in population genetics. Fu and Li (1993) showed that an estimator better than the widely used estimator of Watterson (1975) can be found by partitioning the total number of mutations K into non-overlapping categories and building an estimator on the more detailed information. The best linear unbiased estimators (BLUEs) were developed on that rationale (Fu, 1994a,b). One natural partition for longitudinal samples is ξij, which provides a much more detailed partition than ξ in a single sample (n1n2 + n1 + n2 − 4 possible partitions vs. n − 1 possible partitions). Based on ξij, it is therefore possible to build a better estimator of θ. Constructing a BLUE requires the expectations, variances and covariances of these summary statistics, which can be computed using the formulas derived in this paper. One advantage of a BLUE based on the ξs over estimators based on the likelihood of the genealogy is its robustness when the sequences have undergone recombination. Recombination does not affect the expectations of the ξs, but it reduces their variances, which actually improves the accuracy of the estimates. Thus, for organisms with relatively high recombination rates, such as HIV, the BLUE is preferred. For parameters that have no linear or quadratic relationship with the summary statistics, e.g. growth rate, the BLUE cannot be applied. One alternative is to use the χ2 estimator developed by Fu and Chakraborty (1998). Another alternative is to use moment-based estimators obtained by solving a system of equations numerically, since the expectations of the summary statistics can be expressed as functions of these parameters. For instance, Wakeley and Hey (1997) showed an example of using Kp1, Kp2, Ks and Kf to estimate θ1, θ2 and T under the infinite site model.
Hypothesis tests can also be constructed with the summary statistics studied in this paper. One of the tests of greatest interest in empirical population genetics is the test of the hypothesis of neutral mutations. Significant progress has been made over the last decade in developing such statistical tests (see reviews by Kreitman, 2000; Ford, 2002). One type of test uses the difference between two canonical summary statistics of genetic variation within a single population (Tajima, 1989; Fu and Li, 1993; Fu, 1997; Fay and Wu, 2000). Most tests of this type are of the form
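As a point of reference for the single-sample case, both the estimator of Watterson (1975), K divided by the harmonic sum 1 + 1/2 + … + 1/(n − 1), and the average pairwise difference Π are linear combinations of the ξi. The Python sketch below is an illustration only: it assumes a constant-size neutral population under the infinite site model, where the classical result E (ξi) = θ/i holds (Fu, 1995), and checks that both estimators recover θ when fed the expected spectrum. The function names are hypothetical.

```python
from fractions import Fraction

def harmonic_sum(n):
    """1 + 1/2 + ... + 1/(n - 1), the normalizing constant for K."""
    return sum(Fraction(1, i) for i in range(1, n))

def theta_watterson(xi):
    """Watterson's estimator: K divided by the harmonic sum."""
    n = len(xi) + 1
    return Fraction(sum(xi)) / harmonic_sum(n)

def theta_pi(xi):
    """Average pairwise difference written as a weighted sum of the xi_i."""
    n = len(xi) + 1
    total = sum(i * (n - i) * x for i, x in enumerate(xi, start=1))
    return Fraction(total) / Fraction(n * (n - 1), 2)

def expected_spectrum(theta, n):
    """E(xi_i) = theta / i for a constant-size neutral population."""
    return [Fraction(theta) / i for i in range(1, n)]

observed = [2, 1, 1, 0]                  # spectrum of the five-sequence example
print(theta_watterson(observed), theta_pi(observed))

expected = expected_spectrum(3, 6)       # theta = 3, n = 6
print(theta_watterson(expected), theta_pi(expected))
```

A BLUE built on ξij would replace these fixed weights with weights derived from the expectations, variances and covariances of the ξij.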
$$\frac{L_1 - L_2}{\sqrt{\mathrm{Var}(L_1 - L_2)}} \tag{92}$$
where L1 and L2 are summary statistics such that the expectation of L1 − L2 equals zero under the null hypothesis that all mutations are selectively neutral. Since longitudinal samples provide more summary statistics than a single sample, there is potential to find more powerful tests of form (92). Other hypothesis tests useful for longitudinal samples include testing the significance of the genetic difference between two longitudinal samples and testing for changes in population size or mutation rate. One example was given by Liu and Fu (2007), who proposed several methods for testing genetical isochronism or detecting significant genetical heterochronism between two longitudinal samples. The two best tests, named Tc2 and Tc3, achieved 75% power at the 5% significance level when μg = 1, θ1 = 10 and n1 = n2 = 20. With the same μg = 1 but a much larger θ1 = 40, they still achieved about 65% power at the 5% significance level when n1 = n2 = 40. These methods can be used to determine the necessary sample size and sampling interval in experimental design, or to combine genetically isochronic samples for better data analysis.
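A familiar single-sample instance of a test built from the difference of two summary statistics is Tajima's (1989) D, which contrasts Π with K divided by the harmonic sum and normalizes by an estimate of the standard deviation of that difference. The Python sketch below implements the standard published constants for D; it is included only to make the general form concrete and is not one of the longitudinal tests discussed here. The function name is hypothetical.

```python
import math

def tajimas_d(n, num_segregating, pi):
    """Tajima's D for a single sample of n sequences.

    num_segregating is K (must be at least 1) and pi is the average
    number of pairwise differences.
    """
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    # Numerator: difference of two estimators of theta with equal expectation
    # under neutrality; denominator: estimated standard deviation of it.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Five-sequence example from Section 2: K = 4, Pi = 2.
d_value = tajimas_d(5, 4, 2.0)
print(round(d_value, 3))
```

A longitudinal analogue would choose L1 and L2 among statistics such as Kp1, Kp2, Ks, Kf and Π12 and would take the variance of their difference from the formulas derived in this paper.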
Acknowledgments
This research was supported by National Institutes of Health grants GM60777 and GM50428. We thank Sara Barton for preparing the manuscript. We thank three anonymous reviewers for their comments and suggestions. Part of the work was done during an academic visit to the Institute of Biostatistics, Fudan University, Shanghai, China. We thank Dr. Zewei Luo for his invitation.
Appendix
A Probability density of coalescent time under two-sample model
Assume the population size change follows a deterministic function
where N (0) and N (t) are the population sizes at time 0 and time t respectively. Here time is scaled in N (0) generations.
For sample 1,
(A.1)
where
(A.2)
is the conditional probability density function of the m-th coalescent time given at time s there are m ancestral sequences (Griffiths and Tavaré, 1994).
is the conditional probability density of convolution of given at time s there are n ancestral sequences (Polanski et al., 2003).
For sample 2,
(A.3)
(A.4)
(A.5)
where
(A.6)
is the conditional probability that n sequences coalesce to k sequences in time t, given that there are n sequences at time s (Griffiths and Tavaré, 1994).
B Coalescent time under two-sample-two-stage model
The properties of the coalescent times of sample 1 are the same as those of a single sample, which are well known under the assumption of a large population size and a continuous-time approximation (Kingman, 1982a,b). Further, if N1 = N2, i.e. λ = 1, the properties of tm (0) are the same as those of a single sample. Here we only consider tm (0) when λ ≠ 1. Since this section deals only with the two-sample-two-stage model, we temporarily drop the “TT” in “ETT”.
B.1 Calculate E (tm (0)) and E (tm (0)²)
Given there are k sequences at T, for all m ≤ k, tmb has the same properties as m-th coalescent time of a single sample. Obviously, we have
is the probability density of coalescent time m; the details of functions fn,k (t), fn,k,−m (t), Pn,k (T), Pn,k,−m (T), ϕ (n, k), ψ (n, k, r), ϕ−m (n, k), ψ−m (n, k, r), g1 (n, T), g2 (n, T), h1 (n, i, T) and h2 (n, i, T) can be found in Appendix C.
All together,
(B.31)
(B.32)
B.2 Calculate E (tm (0) tm′ (0))
If λ ≠ 1, the coalescent times may not be independent. Given that there are k sequences at T, tm (0) and tm′ (0) are dependent only if m, m′ ≥ k. Here we show how to compute E (tmatm′a|k), assuming m > m′ ≥ k without loss of generality.
where details of functions fn,k,−m,−m′ (T), Pn,k,−m,−m′ (T), ϕ−m,−m′ (n, k), ψ−m,−m′ (n, k, r), w1 (m, i, T) and w2 (m, m′, i, T) can be found in Appendix C.
In general, given m > m′
(B.42)
All together,
(B.43)
C Functions
(C.1)
is the probability that a randomly chosen line of state k is of size i at state n;
(C.2)
is the probability that k′ lines in state k grow to j lines in state n;
(C.3)
is the probability that n sequences coalesce to k sequences in time T;
(C.4)
(C.5)
(C.6)
(C.7)
(C.8)
(C.9)
is the joint probability of mutations of size j′ at state n′ which grow to size j at state n and mutations of size m′ at state n′ which grow to size m at state n (Johnson and Kotz, 1977);
(C.10)
is the probability that two randomly chosen lines of state k are of size i and j at state n;
is the probability that a line of state k and a line of state k′ are of size i and j at state n respectively (Fu, 1995), where
(C.11)
is the probability for the case that the line of state k′ is a descendant of the line of state k and
(C.12)
is the probability for the case that the line of state k′ is not a descendant of the line of state k;
(C.13)
(C.14)
(C.15)
(C.16)
(C.17)
(C.18)
(C.19)
(C.20)
is the probability density of the convolution of fn,fn−1, …, fk+1;
(C.21)
is the probability density of the convolution of fn,fn−1, …, fm+1, fm−1, …, fk+1;
(C.22)
is the probability density of the convolution of fn,fn−1, …, fm+1, fm−1, …, fm′+1, fm′−1, …, fk+1;
(C.23)
(C.24)
(C.25)
(C.26)
(C.27)
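As a numerical illustration of the quantity described at (C.3), the probability that n sequences coalesce to k sequences in time T can be evaluated for a constant-size population by treating the number of ancestral lines as a pure death process. The Python sketch below assumes time is scaled so that m lines coalesce at rate m(m − 1)/2, and it uses the standard partial-fraction solution for a chain with distinct rates. Whether this matches the exact form printed in (C.3) cannot be checked from the text, so it should be read as an independent sketch; the function name is hypothetical.

```python
import math

def prob_n_to_k(n, k, T):
    """Probability that n lines at time 0 have exactly k ancestors at time T
    (constant population size, coalescence rate m(m - 1)/2 for m lines)."""
    def rate(m):
        return m * (m - 1) / 2.0

    # Product of the rates of the states that must be left: n, n-1, ..., k+1.
    prefactor = 1.0
    for l in range(k + 1, n + 1):
        prefactor *= rate(l)

    # Partial-fraction sum over the states k, ..., n (all rates are distinct).
    total = 0.0
    for j in range(k, n + 1):
        denominator = 1.0
        for l in range(k, n + 1):
            if l != j:
                denominator *= rate(l) - rate(j)
        total += math.exp(-rate(j) * T) / denominator
    return prefactor * total

distribution = [prob_n_to_k(6, k, 0.5) for k in range(1, 7)]
print(distribution)
```

The values over k = 1, …, n form a probability distribution, which gives a quick sanity check on any implementation. For large n the alternating sum can lose precision in floating point, so exact rational arithmetic or simulation may be preferable there.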
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
References
Drummond A, Forsberg R, Rodrigo AG. The inference of stepwise changes in substitution rates using serial sequence samples. Mol Biol Evol. 2001;18(7):1365–1371. doi: 10.1093/oxfordjournals.molbev.a003920.
Drummond AJ, Nicholls GK, Rodrigo AG, Solomon W. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics. 2002;161(3):1307–1320. doi: 10.1093/genetics/161.3.1307.
Fisher RA. The Genetical Theory of Natural Selection. Oxford: Clarendon; 1930.
Ford MJ. Applications of selective neutrality tests to molecular ecology. Mol Ecol. 2002;11(8):1245–1262. doi: 10.1046/j.1365-294X.2002.01536.x.
Fu YX. Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics. 1994a;138(4):1375–1386. doi: 10.1093/genetics/138.4.1375.
Fu YX. A phylogenetic estimator of effective population-size or mutation-rate. Genetics. 1994b;136(2):685–692. doi: 10.1093/genetics/136.2.685.
Fu YX. Statistical properties of segregating sites. Theor Popul Biol. 1995;48(2):172–197. doi: 10.1006/tpbi.1995.1025.
Fu YX. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics. 1997;147(2):915–925. doi: 10.1093/genetics/147.2.915.
Fu YX, Chakraborty R. Simultaneous estimation of all the parameters of a stepwise mutation model. Genetics. 1998;150(1):487–497. doi: 10.1093/genetics/150.1.487.
Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics. 1993;133(3):693–709. doi: 10.1093/genetics/133.3.693.
Griffiths RC, Tavaré S. Sampling theory for neutral alleles in a varying environment. Philos Trans R Soc Lond B Biol Sci. 1994;344(1310):403–410. doi: 10.1098/rstb.1994.0079.
Hey J. The structure of genealogies and the distribution of fixed differences between DNA sequence samples from natural populations. Genetics. 1991;128(4):831–840. doi: 10.1093/genetics/128.4.831.
Johns G, Joyce G. The promise and peril of continuous in vitro evolution. J Mol Evol. 2005;61:253–263. doi: 10.1007/s00239-004-0307-1.
Johnson NL, Kotz S. Urn Models and Their Application. New York: Wiley; 1977.
Kingman JFC. The coalescent. Stoch Process Appl. 1982a;13:235–248.
Kingman JFC. On the genealogy of large populations. J Appl Prob. 1982b;19A:27–43.
Kreitman M. Methods to detect selection in populations with applications to the human. Annu Rev Genomics Hum Genet. 2000;1:539–559. doi: 10.1146/annurev.genom.1.1.539.
Liu X, Fu Y-X. Test of genetical isochronism for longitudinal samples of DNA sequences. Genetics. 2007;176:327–342. doi: 10.1534/genetics.106.065037.
Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63(1):33–40. doi: 10.1016/s0040-5809(02)00010-2.
Rambaut A. Estimating the rate of molecular evolution: Incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics. 2000;16(4):395–399. doi: 10.1093/bioinformatics/16.4.395.
Rodrigo AG, Felsenstein J. Coalescent approaches to HIV-1 population genetics. Ch. 8. In: Crandall KA, editor. The Evolution of HIV. Baltimore, Maryland: Johns Hopkins University Press; 1999. pp. 233–272.
Rodrigo AG, Goode M, Forsberg R, Ross HA, Drummond A. Inferring evolutionary rates using serially sampled sequences from several populations. Mol Biol Evol. 2003;20(12):2010–2018. doi: 10.1093/molbev/msg215.
Seo T-K, Thorne JL, Hasegawa M, Kishino H. Estimation of effective population size of HIV-1 within a host: A pseudomaximum-likelihood approach. Genetics. 2002;160(4):1283–1293. doi: 10.1093/genetics/160.4.1283.
Tajima F. Evolutionary relationship of DNA-sequences in finite populations. Genetics. 1983;105(2):437–460. doi: 10.1093/genetics/105.2.437.
Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585–595. doi: 10.1093/genetics/123.3.585.
Tavaré S. Line-of-descent and genealogical processes, and their applications in population-genetics models. Theor Popul Biol. 1984;26(2):119–164. doi: 10.1016/0040-5809(84)90027-3.
Wakeley J, Hey J. Estimating ancestral population parameters. Genetics. 1997;145(3):847–855. doi: 10.1093/genetics/145.3.847.
Watterson GA. Number of segregating sites in genetic models without recombination. Theor Popul Biol. 1975;7(2):256–276. doi: 10.1016/0040-5809(75)90020-9.