Abstract
Recent advances in sequencing technologies have made available an ever-increasing amount of ancient genomic data. In particular, it is now possible to target specific single nucleotide polymorphisms in several samples at different time points. Such time-series data are also available in the context of experimental or viral evolution. Time-series data should allow for a more precise inference of population genetic parameters and for testing hypotheses about the recent action of natural selection. In this manuscript, we develop a likelihood method to jointly estimate the selection coefficient and the age of an allele from time-serial data. Our method can be used for allele frequencies sampled from a single diallelic locus. The transition probabilities are calculated by approximating the standard diffusion equation of the Wright–Fisher model with a one-step process. We show that our method produces unbiased estimates. The accuracy of the method is tested via simulations. Finally, the utility of the method is illustrated with an application to several loci encoding coat color in horses, a pattern that has previously been linked with domestication. Importantly, given our ability to estimate the age of the allele, it is possible to gain traction on the important problem of distinguishing selection on new mutations from selection on standing variation. In the coat color example, for instance, we estimate the age of the allele, which is found to predate domestication.
Keywords: allele age, ancient DNA, population genetics, selection, time-serial data
Time-series analysis is widespread in several fields, such as meteorology, economics, and physics (Hamilton 1994), the common thread being statistical models designed to deal with a time-ordered sequence of observations. Such observations are also prevalent in several areas of biology. Until recently, however, time-series molecular data were only available for time spans of a few generations in higher organisms. Therefore, in the context of population genetics, time-serial data were mostly limited to viral or experimental evolution (Wichman et al. 2005; Bollback and Huelsenbeck 2007; Nelson and Holmes 2007; Gresham et al. 2008).
However, with recent advances in DNA sequencing and DNA preparation techniques, the study of extinct and long dead organisms is now entering a new era, an era in which time-sampled measurements spanning hundreds or thousands of generations for even mammalian species may be obtained. For example, while previous studies were limited to short segments of mitochondrial DNA, whole nuclear genomes are now available from several ancient samples (Rasmussen et al. 2010; Reich et al. 2010), and it is now additionally possible to target specific DNA regions in ancient organisms (Lalueza-Fox et al. 2007; Ludwig et al. 2009; Rusk 2009). Therefore, time-serial data will become increasingly available for a whole range of organisms allowing one to test evolutionary questions using not only present day samples, but also samples from extinct populations.
The relevant theory to describe such temporal changes in allele frequency has existed since the advent of population genetics (Fisher 1922; Wright 1931). Although not very common, several statistical methods and estimators to deal with time-serial data have been developed and applied to, for example, estimate historical changes in population size (Waples 1989; Williamson and Slatkin 1999; Anderson et al. 2000; Drummond and Rambaut 2007). More recently, in 2008, Bollback et al. developed a method to coestimate the effective population size, Ne, and the selection coefficient, s, from temporal allele frequency data. They model the evolution of the allele frequency of a diallelic locus with a diffusion process that approximates a Wright–Fisher population genetic model (WF), under the assumption that the locus is under constant natural selection acting on diploid individuals.
Our work is a natural extension of Bollback et al.’s (2008) method to also allow for the estimation of the allele age (i.e., the time since the mutational event), t0. Allele age is an omnipresent parameter in population genetics and along with the selection coefficient it plays a crucial role in determining the sojourn time of a beneficial mutation (see Slatkin and Rannala 2000 for a review). Additionally, given the recent focus on the important question of distinguishing between models of selection on new vs. standing mutations—a phenomenon that speaks to the fundamental mode and tempo of the process of adaptation—the ability to estimate the time of a mutational event is of paramount importance (see review of Barrett and Schluter 2008).
Our extension allows these competing models to be addressed, unlike the previous work of Bollback et al. (2008) that assumed that at the first time of sampling the population allele frequency was uniformly distributed. It follows from this latter assumption that even if the allele was not sampled at the oldest sampling time, it had to be present in the population. Here, we present an approach to coestimate s, Ne, and t0 by computing the likelihood of these parameters for a suitable model.
In the Theory section we explain how we approximate the WF model with a one-step process. We then discuss the numerical details of the implementation in both Numerics sections. We show how our method performs on the basis of simulations in both Simulations sections. We analyze a data set of horses for the ASIP locus for samples dating from the Pleistocene up to the present in both Real data sections. We conclude and offer some further perspectives in Conclusion.
Materials and Methods
Theory
We assume that there is a single, panmictic population evolving according to a WF population genetic model. Under this model, the frequency of an allele A is a homogeneous discrete-time Markov chain. We denote the Markov chain describing the frequency of the allele A through time by Xt. We assume that selection is constant from the time the allele arose up to present. The allele under selection arises only once and there is no recurrent mutation. In other words, the only evolutionary forces acting on that allele are genetic drift and selection.
Selection is modeled as acting on diploid individuals. If we denote the two alleles by A and a, we can choose the genotypic fitnesses to be wAA = 1 + s, wAa = 1 + sh, and waa = 1, where s is the selection coefficient and h is the dominance coefficient (s > −1 and h ∈ [0, 1]; see, e.g., Ewens 2004). If Ne is the effective population size, the states of Xt are the allelic frequencies $x_j = j/(2N_e)$, expressed relative to the population size, for 0 ≤ j ≤ 2Ne. Therefore, the state space is $\{0, 1/(2N_e), 2/(2N_e), \ldots, 1\}$. We define the rescaled selection coefficient γ = 2Nes.
We compute the likelihood of the allele age t0, the rescaled selection coefficient γ, and the effective population size Ne. To simplify the notation, define θ ≡ (γ, Ne, t0) the parameters of interest. Assume that we have samples from m distinct sampling time points. We suppose that M = (n1, n2, …, nm) chromosomes were collected, among which I = (i1, i2, …, im) are of the A type, and that the chromosomes were drawn at times T = (t1, t2, …, tm), where time is measured in generations with tk−1 < tk (see Figure 1). The likelihood function of the parameters, for a given M and h, is ℓ(θ) = p(i1, …, im|θ, T).
Figure 1.
Notation used throughout the text. The chromosomes M = (n1, n2, …, nm) are sampled at times T = (t1, t2, …, tm) and there are I = (i1, i2, …, im) A alleles at each sampling time.
To compute the likelihood, we can condition and sum over all the population allelic frequencies, xj1, …, xjm, at each sampling time t1, t2, …, tm. We can then rewrite the likelihood:
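$$\ell(\theta) = \sum_{j_1, \ldots, j_m} p(i_1, \ldots, i_m \mid x_{j_1}, \ldots, x_{j_m}, \theta, T)\; p(x_{j_1}, \ldots, x_{j_m} \mid \theta, T) \tag{1}$$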
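Conditional on the population allelic frequencies, the numbers of A alleles ik at the different sampling times are independent of one another. The first term of the summation of Equation 1 becomes
$$p(i_1, \ldots, i_m \mid x_{j_1}, \ldots, x_{j_m}, \theta, T) = \prod_{k=1}^{m} p(i_k \mid x_{j_k}) \tag{2}$$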
In the WF model the population is large and panmictic; therefore, we can assume that we sample the chromosomes with replacement and write for k ∈ {1, …, m}
$$p(i_k \mid x_{j_k}) = \binom{n_k}{i_k}\, x_{j_k}^{\,i_k}\, (1 - x_{j_k})^{\,n_k - i_k} \tag{3}$$
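The binomial sampling step can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `sampling_probability` is hypothetical and is not taken from the authors' implementation.

```python
from math import comb


def sampling_probability(i_k: int, n_k: int, x: float) -> float:
    """Probability of observing i_k copies of allele A among n_k chromosomes
    drawn with replacement from a population in which A has frequency x."""
    return comb(n_k, i_k) * x**i_k * (1.0 - x) ** (n_k - i_k)


# Illustrative call: 15 A alleles among 20 sampled chromosomes,
# for a hypothetical population frequency of 0.75.
p_example = sampling_probability(15, 20, 0.75)
```

For a fixed sample size the probabilities sum to one over all possible counts, which is a convenient sanity check when plugging this term into the likelihood.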
Since Xt is a Markov chain, the second term of the summation of Equation 1 is given by
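$$p(x_{j_1}, \ldots, x_{j_m} \mid \theta, T) = \prod_{k=1}^{m} p(x_{j_k} \mid x_{j_{k-1}}, \theta, T) \tag{4}$$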
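where $x_{j_0}$ is the frequency of the allele when it first arose in the population, i.e., $x_{j_0} = 1/(2N_e)$. We can rewrite the transition probabilities of Xt , for a given θ and T. By substituting Equations 2 and 4 into Equation 1 we obtain
$$\ell(\theta) = \sum_{j_1, \ldots, j_m} \prod_{k=1}^{m} p(i_k \mid x_{j_k})\; p(x_{j_k} \mid x_{j_{k-1}}, \theta, T) \tag{5}$$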
The solution for the transition probabilities for the nonneutral case of the WF model is elaborate (Ewens 2004 and citations therein, but see also Song and Steinrucken 2012). Rescaling the time by 2Ne, the Markov chain, Xt, can be approximated by a diffusion process (“WF diffusion process”), Yτ (see, e.g., Durrett 2008). Time is now continuous and measured in units of 2Ne generations, and we replace T by the rescaled sampling times (τ1, …, τm), where $\tau_k = t_k/(2N_e)$. The state space is also continuous with states denoted by y ∈ [0, 1]. This holds in the limit of large Ne, where . The transition densities of the diffusion process are denoted . In this article we further approximate the diffusion process itself by a one-step process that we denote Zτ (see, e.g., Van Kampen 1992). A one-step process is a continuous-time Markov chain (i.e., discrete in space and continuous in time) where jumps are only allowed between two states that are adjacent to each other. As before, the states of the process Zτ are certain population allelic frequencies that we denote by {z0, z1, …, zH−1}, where H is an integer. The states are chosen such that z0 and zH−1 are, respectively, the 0 and 1 allelic frequencies. These are absorbing states since there is no recurrent mutation. The other states are chosen such that 0 < zk < 1 and zk−1 < zk for 0 < k < H − 1. The infinitesimal generator Q of such a process is a tridiagonal H × H matrix. By denoting βi (respectively, δi) the rate of jumping to the right (respectively, the left) of state i, we have that
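$$Q = \begin{pmatrix} 0 & 0 & 0 & 0 & \cdots & 0 \\ \delta_1 & \eta_1 & \beta_1 & 0 & \cdots & 0 \\ 0 & \delta_2 & \eta_2 & \beta_2 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & \delta_{H-2} & \eta_{H-2} & \beta_{H-2} \\ 0 & \cdots & 0 & 0 & 0 & 0 \end{pmatrix} \tag{6}$$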
where ηk = −(βk + δk). The transition probability between two states $z_{j_{k-1}}$ and $z_{j_k}$ of the process over the interval from τk−1 to τk is the $(j_{k-1}, j_k)$ entry of the matrix exponential $e^{Q(\tau_k - \tau_{k-1})}$. With the appropriate choice of βi and δi (see Supporting Information, File S1.1), one can show that for large H, Zτ ≃ Yτ. In particular, βi and δi will be functions of zj, zj−1, zj+1, γ and h. Note that Yτ is a continuous variable whereas Zτ is discrete. Therefore, choosing and we have that
| (7) |
where the denominator is necessary since Yτ has a continuous-state space and Zτ has a discrete-state space. We can approximate the likelihood described in Equation 5 by replacing the original process Xt by the one-step process Zτ. We then have
| (8) |
where from Equation 3.
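The sum over all hidden frequency paths in the likelihood does not need to be enumerated explicitly: it can be accumulated one sampling time at a time with a forward recursion, as in a hidden Markov model. The sketch below assumes that the caller supplies one row-stochastic transition matrix per interval (from the allele age to the first sampling time, then between consecutive sampling times); the names `forward_likelihood`, `grid`, and `start_index` are hypothetical and do not come from the authors' code.

```python
from math import comb

import numpy as np


def binom_pmf(i, n, x):
    """Binomial probability of i successes in n draws with success probability x."""
    return comb(n, i) * x**i * (1.0 - x) ** (n - i)


def forward_likelihood(grid, start_index, transition_matrices, counts, sizes):
    """Likelihood of the sampled counts, summed over all hidden frequency paths.

    grid                : the H allele-frequency states of the discretized process
    start_index         : index of the state in which the allele arises
    transition_matrices : one H x H row-stochastic matrix per interval
    counts, sizes       : observed A-allele counts i_k and sample sizes n_k
    """
    state = np.zeros(len(grid))
    state[start_index] = 1.0  # the allele starts in a single known state
    for P, i_k, n_k in zip(transition_matrices, counts, sizes):
        state = state @ P  # propagate the frequency distribution over the interval
        emission = np.array([binom_pmf(i_k, n_k, z) for z in grid])
        state = state * emission  # weight each state by the sampling probability
    return float(state.sum())


# Toy check on a three-state grid with absorbing boundaries (hypothetical numbers).
toy_grid = [0.0, 0.5, 1.0]
toy_P = np.array([[1.0, 0.0, 0.0], [0.25, 0.5, 0.25], [0.0, 0.0, 1.0]])
toy_likelihood = forward_likelihood(toy_grid, 1, [toy_P], counts=[0], sizes=[1])
```

The cost grows linearly with the number of sampling times and quadratically with the grid size H, which is what makes repeated likelihood evaluations inside an optimizer practical.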
In the case of experimental evolution this unconditional process should be realistic since in principle one might want to estimate the selection coefficient for any locus. We now consider one special case of what is presented above, motivated by ancient DNA data. We assume that the allele is segregating at the last sampling time (i.e., the process has not reached states 0 or 1). This case corresponds to what we think is a realistic scenario for how ancient DNA data would be collected, where presumably the locus of interest is polymorphic at present. Indeed, only such loci would be selected for inference.
We can rewrite the resulting likelihood as
| (9) |
where
| (10) |
We consider the subprocess defined on the reduced state space {z1, …, zH−2} ⊂ {z0, z1 … zH−2, zH−1}. The infinitesimal generator qC of such a process is the matrix Q without the first and last rows and columns, i.e.,
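$$q_C = \begin{pmatrix} \eta_1 & \beta_1 & 0 & \cdots & 0 \\ \delta_2 & \eta_2 & \beta_2 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & \delta_{H-3} & \eta_{H-3} & \beta_{H-3} \\ 0 & \cdots & 0 & \delta_{H-2} & \eta_{H-2} \end{pmatrix} \tag{11}$$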
Denoting the transition probabilities of this subprocess we have that for (see File S1.2 for more details).
Finally, to compute the likelihood of Equations 8 and 9, all that remains is to compute the matrix exponentials $e^{Q\tau}$ and $e^{q_C\tau}$, respectively.
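To make the construction concrete, the following sketch builds a tridiagonal generator on a uniform grid and exponentiates it. The jump rates used here are an assumption, not the authors' choice (which is given in File S1.1): they are picked so that the mean and variance of the jumps match the standard WF diffusion in units of 2Ne generations, with drift γz(1 − z)[h + (1 − 2h)z] and variance z(1 − z). The function names and the simple Taylor-based exponential are likewise illustrative only.

```python
import numpy as np


def one_step_generator(grid, gamma, h):
    """Tridiagonal generator with absorbing boundaries on a uniform grid.

    Assumed rates: beta - delta reproduces the diffusion drift and
    beta + delta reproduces the diffusion variance at each interior state.
    """
    H = len(grid)
    dz = grid[1] - grid[0]
    Q = np.zeros((H, H))  # first and last rows stay zero: states 0 and 1 absorb
    for k in range(1, H - 1):
        z = grid[k]
        drift = gamma * z * (1.0 - z) * (h + (1.0 - 2.0 * h) * z)
        variance = z * (1.0 - z)
        beta = variance / (2.0 * dz**2) + drift / (2.0 * dz)   # jump to the right
        delta = variance / (2.0 * dz**2) - drift / (2.0 * dz)  # jump to the left
        if beta < 0.0 or delta < 0.0:
            raise ValueError("grid too coarse for this value of gamma")
        Q[k, k - 1], Q[k, k], Q[k, k + 1] = delta, -(beta + delta), beta
    return Q


def expm_taylor(A, order=18):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(A, ord=np.inf)
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    scaled = A / (2.0**squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, order + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


grid = np.linspace(0.0, 1.0, 51)
Q = one_step_generator(grid, gamma=5.0, h=0.5)
tau = 0.1                                 # elapsed time in units of 2Ne generations
P = expm_taylor(tau * Q)                  # full process, absorbing at 0 and 1
P_c = expm_taylor(tau * Q[1:-1, 1:-1])    # subprocess on the interior states only
```

Because the first and last rows of Q are zero, the interior block of the full transition matrix coincides with the exponential of the reduced matrix, so `P_c` and `P[1:-1, 1:-1]` agree up to rounding. A production implementation would more likely rely on a library routine such as SciPy's `scipy.linalg.expm`.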
Numerics
We evaluate numerically the matrix exponentiation. The advantage of the current approach compared to Bollback et al.’s is that we do not need to do a numerical integration step since the state space is already finite. The description of the matrix exponentiation is given in File S1.2.
Although asymptotically the one-step process is equivalent to the WF model, since the state space of Zτ has a finite number of states, the accuracy of the approximation depends on the choice of the states, or what we call from now on the “grid.” We investigate three grids strongly inspired by Gutenkunst et al. (2009). The first one is a uniform grid with a point added at . The second and third grid are a “quadratic grid” and an “exponential grid.” The last two grids were chosen to be refined around the boundaries in such a way that the distance between adjacent points changes smoothly. The details for the grids are given in File S1.2. All three grids have a point at .
Since the likelihood function is complex, we were not able to compute its maximum analytically. Therefore, to find the maximum, we first computed the likelihood over a large range of parameters. We verified that the likelihood surface is smooth and has a single maximum within each interval for t0 delimited by adjacent sampling times, i.e., (−∞, t1), (t1, t2), …, (tm−1, tm). We used the SciPy (Jones et al. 2001) implementation of the Nelder–Mead simplex algorithm (Nelder and Mead 1965) to find the maximum for each time interval.
Our implementation is written in Python and C++, making use of the NumPy (Oliphant 2006), SciPy, and mpack (Nakata 2010) libraries for computations and of the Matplotlib library (Hunter 2007) for plotting. It is available upon request.
Simulations
To test our model, we simulate several data sets with the WF model forward in time. Simulating the WF model can be time consuming if the population size is large, so we picked a small population size (Ne = 500). In principle, however, the conclusions hold for larger population sizes. We then infer the maximum-likelihood estimates (MLEs) using our one-step method. We use two different sampling schemes. The first one is similar to the real data set we analyze below, i.e., 6 sampling times each with 50 chromosomes. The second one corresponds to having twice as many sampling times with half the number of chromosomes, i.e., 12 sampling times and 25 chromosomes. We searched for the MLEs across a finite domain, i.e., Ne ∈ [100, 1000], t0 ∈ [−3000, 0], and γ ∈ [−200, 200]. We assess the accuracy of our estimator and compare the sampling schemes by looking at the bias of the estimates and the root mean square error (RMSE).
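A forward WF simulation of this kind can be sketched with the standard library alone. The code below is an illustration under stated assumptions rather than the authors' simulator: each generation applies deterministic diploid selection with the fitnesses 1 + s, 1 + sh, and 1, followed by binomial resampling of 2Ne chromosomes, and samples are drawn with replacement at the requested generations. The function name `simulate_wf` and its arguments are hypothetical, and the trajectory is not conditioned on the allele still segregating at the last sampling time.

```python
import random


def simulate_wf(n_e, s, h, t0, sample_times, sample_sizes, seed=None):
    """Simulate one allele trajectory and return the sampled A-allele counts.

    The allele arises as a single copy at generation t0 (a negative integer);
    sample_times are integer generations and sample_sizes chromosome counts.
    """
    rng = random.Random(seed)
    two_n = 2 * n_e
    schedule = dict(zip(sample_times, sample_sizes))
    counts = {t: 0 for t in sample_times}  # samples taken before t0 contain no A allele
    x = 1.0 / two_n                        # frequency of a new mutation
    for t in range(t0, max(sample_times) + 1):
        if t in schedule:
            counts[t] = sum(rng.random() < x for _ in range(schedule[t]))
        # Selection: expected frequency of A after one round of viability selection.
        mean_fitness = (1 + s) * x * x + 2 * (1 + s * h) * x * (1 - x) + (1 - x) ** 2
        x_selected = ((1 + s) * x * x + (1 + s * h) * x * (1 - x)) / mean_fitness
        # Drift: binomial resampling of 2Ne chromosomes.
        x = sum(rng.random() < x_selected for _ in range(two_n)) / two_n
    return [counts[t] for t in sample_times]


# Small illustrative run; gamma = 2 * Ne * s, so s = gamma / (2 * Ne).
example_counts = simulate_wf(
    n_e=50, s=10 / (2 * 50), h=0.5, t0=-30,
    sample_times=[-20, -10, 0], sample_sizes=[10, 10, 10], seed=1,
)
```

Most new mutations are lost quickly, so many replicates return only zeros; the conditional analysis described in the Theory section would keep only replicates in which the allele is still segregating at the last sampling time.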
Real data
In 2009, Ludwig et al. sequenced several loci encoding coat color in horses. Each locus has been shown to be linked with a color phenotype in present day horses. In other words, the phenotype associated with each locus is segregating in present populations. We reanalyze in this article one of the loci encoding for the agouti-signaling protein (ASIP), that controls the distribution of the black pigment (Rieder et al. 2001). We investigate the hypothesis that at the beginning of domestication some coat colors in horse were positively selected for.
The samples sequenced were obtained from Siberia, Middle and Eastern Europe, China, and the Iberian Peninsula. As in Ludwig et al. (2009), we group the samples into six sampling times, t1 ≃ 20000, t2 ≃ 13100, t3 ≃ 3700, t4 ≃ 2800, t5 ≃ 1100, and t6 ≃ 500 years BC. We assume that the generation time of horses is 5 years, following Ludwig et al. (2009). The wild-type horses are presumed to have been of bay color. The mutation of interest for the ASIP locus is recessive, since only horses homozygous for the ASIP mutation will be black. We test for directional selection and set h = 0.
To compute a possible range for the population sizes we use data from Cieslak et al. (2010). They sequenced part of the control region of the mtDNA for 78 samples that are part of Ludwig et al. (2009)’s data set. The control region of the mtDNA is a noncoding region. One way to compute the population size Ne is to compute the diversity π of the samples. Then, assuming the region is neutral and ignoring hitchhiking effects due to nearby selected sites, we use the relationship that relates the diversity of a sample to the population size, $\pi = 2N_e\mu$, where μ is the mutation rate per base pair per generation. To obtain an estimate of the mean and standard error of π of the mtDNA sample, we use the maximum-likelihood method implemented in MEGA (Tamura et al. 2011) with default parameters. We approximate the standard error for the diversity by performing 1000 bootstraps. We use Jazin et al. (1998)’s estimate for the mutation rate (i.e., μ ∈ (3.0 × 10−6, 4.4 × 10−5)). Jazin et al. (1998) used human families to obtain direct estimates of the mutation rate for the mtDNA control region for a single generation. Although the mutation rate is an important parameter, we do not have a direct estimate in horses and we have to rely on results for other species. To obtain conservative upper/lower bounds for Ne we use the 95% confidence interval (CI) bounds of the mutation rate and the diversity. If the CIs for μ and π are denoted (μlow, μup) and (πlow, πup) respectively, we defined $N_e^{\text{low}} = \pi_{\text{low}}/(2\mu_{\text{up}})$ and $N_e^{\text{up}} = \pi_{\text{up}}/(2\mu_{\text{low}})$.
To find the MLEs we use a domain defined by Ne ∈ [200, 5000], t0 ∈ [−10000, 0], and γ ∈ [−200, 200] for the parameters. We fix H = 400 for this computation.
For the CIs, several asymptotic results that apply for maximum likelihood exist, especially for a time-serial Markov chain. Our sample sizes are generally small, however, so we chose to compute the CIs with a parametric bootstrap approach.
Note that several assumptions of our model are violated with this data set, such as constant population size and potentially random mating (since the samples are taken from all around the world); moreover, the MC1R locus, encoding a melanocortin receptor and related to the black pigment production, is known to have an epistatic interaction with ASIP (Rieder et al. 2001). Nevertheless, we analyze these data as described above to have a more direct comparison between our results with those obtained with Bollback et al.’s method on the same data set.
Results and Discussion
Numerics
To validate the method, we compared several known analytical results for the WF model with the one-step process. For the neutral case, it is possible to compute the likelihood since the transition probabilities are known for the diffusion process (see, e.g., Ewens 2004). As can be seen in Figure S3 (see File S1), even for a grid of size 100 the one-step process is a very good approximation of the diffusion process.
We also compare the relative error between the diffusion and the one-step process and demonstrate that when we increase the grid size the one-step process converges toward the diffusion process. The results for a particular choice of parameters are shown in Figure S4 (see File S1). We see that the one-step process does converge as expected with increasing grid size. In general, a quadratic grid and an exponential grid perform better than a uniform grid (see File S1.3 for details). In the applications below we use a quadratic grid of size between 100 and 400.
Simulations
We picked a population size of Ne = 500 and set the allele age to t0 = −1400. We fix the selection coefficient to seven potential values: γ ∈ {−10, −5, 0, 5, 10, 15, 20}.
First, we fix the sampling times to T = (−1000, −800, −600, −400, −200, 0) generations and sample 50 chromosomes at each time point. Then we look at a scheme where the samples are taken every 100 generations from −1100 up to 0 (i.e., 12 samples). At each sampling time we sample 25 chromosomes. The intent is to quantify whether it is better to sample more chromosomes at fewer time points, or the opposite.
The boxplot results for the MLEs for these simulations are shown in Figure 2. They are standard boxplots showing the five-number summary (the minimum, the first quartile, the median, the third quartile, and the maximum). The plots for the bias and the RMSE are shown in Figure S5 (see File S1) for both schemes.
Figure 2.
Boxplots for the MLEs of each simulation replicate, for γ ∈ {−10, −5, …, 20}, Ne = 500, and t0 = −1400. At the top is the scheme with 6 sampling times and 50 chromosomes sampled. At the bottom is the scheme with 12 sampling times and 25 chromosomes sampled. Each plot shows the estimates for the population size, Ne (left), the rescaled selection coefficient, γ (middle), and the allele age, t0 (right). For all subplots the triangle represents the mean of the estimates, and the circle represents the true value. The rectangles of the boxplots span the first and third quartiles and the black line represents the median. The outliers are indicated by crosses.
For the population size, the MLEs span all the potential range of Ne values, but the bulk of the results exclude very low population sizes. This suggests nevertheless that it is hard to estimate Ne with our method, at least with a precision higher than one order of magnitude. Our estimator is biased upward for both schemes but this might be explained by the presence of outliers since the median is largely accurate. Moreover, the second scheme, with fewer chromosomes and more sampling times, leads to a smaller bias and a smaller RMSE for most cases. Intuitively, we think that most of the information to estimate s comes from the general trend of change in allele frequency, while most of the information to estimate Ne comes from the oscillations around that general trend. In other words, to get a precise estimate of Ne, we need a dense sampling over time, which is not the case for our simulations that we chose to match the real data setting.
In contrast, the results for the selection coefficient are essentially unbiased, with a symmetric distribution, and the median matching the mean of the distribution. The variance remains large, and only when γ is quite high can one reject neutrality. In particular, the higher the selection coefficient, the higher the variance. The RMSE this time is worse for the second sampling scheme.
The results for the allele age also exhibit a large variance. The tail of the distribution is large. This can be explained by the use of the conditional process. Indeed for weak selection, if the number of derived alleles is high at the first sampling time the likelihood becomes uninformative for the allele age [i.e., the likelihood is flat for older allele ages; Figure S3 (see File S1)]. This leads to difficulties for the optimization algorithm to converge to the global maximum. The results seem to be systematically biased upward, although the median is accurate. For strong selection the likelihood is more informative and the estimator is unbiased. Also, for strong selection the scheme with more samples through time performs considerably better. In conclusion, sampling fewer chromosomes over more sampling times will lead to better results especially for strong selection in our examples.
Real data
The change in allelic frequency of this locus is shown in Figure 3. Although the frequency is increasing in around 3000 generations from 0 to ∼0.8 between the first and the third sampling time, suggesting positive selection, it then drops down to 0.4 in ∼500 generations. It is interesting to note that the archeological evidence for domestication suggests a date of 3500 years BC (Outram et al. 2009), which would correspond to the third sampling time (i.e., when the sample frequencies start decreasing).
Figure 3.
Change in allelic frequency over time for the ASIP locus. The sample sizes are M = (10, 22, 20, 20, 36, 38) and the number of derived alleles are I = (0, 1, 15, 12, 15, 18). The times have been offset so that the last sampling time is 0. Domestication is thought to have happened around 3500 years BC, which would correspond to around −600 generations on this plot, i.e., the third sampling time.
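The conversion from calendar dates to the generation scale used in the plot is simple arithmetic. The helper below is a hypothetical illustration that assumes the 5-year generation time and the offset that places the last sampling time (about 500 years BC) at generation 0.

```python
def years_bc_to_generations(year_bc, last_sample_bc=500.0, generation_time=5.0):
    """Convert a date in years BC to generations relative to the last sampling time."""
    return (last_sample_bc - year_bc) / generation_time


domestication = years_bc_to_generations(3500)   # -600.0
first_sample = years_bc_to_generations(20000)   # -3900.0 with the rounded date
```

With the rounded dates quoted in the text, the first and second sampling times map to about −3900 and −2520 generations, close to the domain boundaries of −3893 and −2516 used in the analysis, which were presumably obtained from unrounded dates.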
The first step is to choose a potential range for the population size. We found π = 0.024 with a 95% CI of (0.018, 0.030). Together with the 95% CI of the mutation rate, this leads to a range for Ne of (200, 5000). This is a small population size. It might be explained by the fact that the horses are a domesticated species and most samples are taken after the beginning of domestication, resulting in a small Ne. On the other hand it might be that the mutation rate calculated for the human population for the control region is not appropriate for horses.
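The reported range can be reproduced from the two confidence intervals. The sketch below assumes the relationship π = 2Neμ, an assumption that matches the reported bounds, and pairs the lower bound of π with the upper bound of μ and vice versa. The function name `ne_bounds` is hypothetical.

```python
def ne_bounds(pi_low, pi_up, mu_low, mu_up):
    """Conservative bounds on Ne, assuming pi = 2 * Ne * mu."""
    ne_low = pi_low / (2.0 * mu_up)   # smallest diversity, fastest mutation rate
    ne_up = pi_up / (2.0 * mu_low)    # largest diversity, slowest mutation rate
    return ne_low, ne_up


# 95% CI of the diversity and of the mutation rate quoted in the text.
ne_low, ne_up = ne_bounds(0.018, 0.030, 3.0e-6, 4.4e-5)
# ne_low is about 205 and ne_up is about 5000, i.e., roughly the range (200, 5000).
```

Because the mutation rate enters the denominator, an overestimate of μ by an order of magnitude shifts both bounds down by the same factor.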
In Figure S6 (see File S1) we plot the likelihood surface for four values of Ne. This helps us confirm that we have found a global maximum. We note that the higher the population size the higher the selection coefficient and the older the allele age that maximize the likelihood. For example, if the population is fixed at Ne = 200 then γmax = −1.5 and . In contrast, if we fix Ne = 5000, then γmax = 9.1 and . In other words, if the mutation rate is overestimated by, say, an order of magnitude, our potential range for the population size will also be much higher.
Since there is no mutant allele at the first time of sampling, the allele might have arisen after the first sampling time. We denote dom1 the range between (−∞, −3893] generations, and dom2 the range (−3893, −2516]. As discussed before, the likelihood is therefore discontinuous as a function of the allele age with discontinuities at the sampling times. It is important to look for the global maximum in dom1 and dom2 separately. Moreover, we compute the 95% CI in dom1 and in dom2 separately. We build the confidence interval as a union of (potentially) disconnected domains.
The values for the MLEs and 95% CI are shown in Table 1. The first thing to note is that they are compatible with the results of Figure S6 (see File S1). The MLEs were found in dom2: $t_{0,\text{mle}}$ ≅ −2577 with CI (−4760, −3893] ∪ (−3893, −2516], γmle ≅ −1.3 with CI [−27.7, 60.7], and $N_{e,\text{mle}}$ ≅ 652 with CI [200, 5000].
Table 1. Maxima for the ASIP locus sequenced in Ludwig et al. (2009).
| Parameter | dom1 (local optimum) | dom2 (local optimum) |
|---|---|---|
| ℓ | 14.9 | 13.1 |
| t0 | −3893 | −2577 |
| γ | −0.61 | −1.3 |
| Ne | 1617 | 652 |
The MLEs are in the rightmost column.
In Figure 4 we plot the distribution for the bootstrap replicates for each parameter and for the maximum-likelihood values. The confidence interval was constructed as the interval between 2.5th and 97.5th percentile. We ran a total of 1400 replicates. For about 30 of those simulations, the optimizer did not converge. Among successful runs, ∼500 did not have an MLE in dom1 or dom2 and were discarded. From the remaining, about 823 were found in dom2 and 34 in dom1.
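The percentile construction of a bootstrap interval can be sketched as follows. This is a generic illustration with hypothetical names: it takes the estimates obtained from the bootstrap replicates for one parameter and returns the 2.5th and 97.5th percentiles using linear interpolation between order statistics. It does not reproduce the per-domain treatment of t0 described above, where the interval is built as a union over dom1 and dom2.

```python
def percentile_interval(values, lower=2.5, upper=97.5):
    """Percentile interval from a collection of bootstrap estimates."""
    data = sorted(values)
    if not data:
        raise ValueError("at least one bootstrap estimate is required")

    def percentile(q):
        # Linear interpolation between the two closest order statistics.
        position = (len(data) - 1) * q / 100.0
        below = int(position)
        above = min(below + 1, len(data) - 1)
        fraction = position - below
        return data[below] * (1.0 - fraction) + data[above] * fraction

    return percentile(lower), percentile(upper)


# Toy input: 101 evenly spaced pseudo-estimates from 0 to 100.
ci_low, ci_high = percentile_interval(range(101))
```

In a parametric bootstrap, each value passed to this function would be the MLE obtained from one data set simulated under the fitted parameters.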
Figure 4.
Bootstrap estimate of the sampling distribution of ML estimators of the three variables Ne, γ, and t0 for the parametric bootstrap. Out of 1400 simulations, 856 were compatible with the data (i.e., the maximum for t0 was in dom1 or dom2). In each case the local maximum is indicated.
The MLEs and the bootstrap results have several implications. First, we do not find evidence for positive selection as could be anticipated by the archeological evidence for domestication. The discrepancy between this study and Ludwig et al. (2009) stems first from the method used and second from the parameter range assumed. Indeed, the results in Ludwig et al. (2009) were obtained using Bollback et al. (2008)’s method. Since our MLE of t0 is in dom2, and Bollback et al. (2008) assume that the allele was already present at the first time of sampling, it is to be expected that our results will be very different. Moreover, the potential range for the population size in Ludwig et al. (2009) is from 10,000 to 100,000; i.e., it does not overlap with the range for Ne that we assume here. As noted above, if we had assumed a larger population size, the γmle would be larger.
The distribution of each parameter from the bootstrap replicates is almost unbiased relative to the true value (as could be expected from the results in the Simulations sections). The distribution for γ resembles a normal distribution while the distributions for Ne and t0 are not as simple. For Ne, the distribution is bimodal with a second mode at the upper bound. This mode is a reflection of the finite domain we impose on the search for the MLE rather than an actual mode. Similarly, for t0 there is a mode at the lower bound for dom2, an artifact of the bounds from the sampling times.
As could be expected from the simulations above, the 95% CI for Ne suggests that with these data we have little ability to estimate Ne, which we would expect from a sparse sampling over time as discussed earlier. Similarly, we cannot distinguish between negative and positive selection as γ’s CI is between −27.7 and 60.7. On the other hand, the bootstrap replicates suggest that the allele arose in dom2. We can indeed test the hypothesis that the allele age is not in dom2; that is, we can test the null hypothesis H0: t0 ∉ dom2 vs. the alternative hypothesis that the allele age is in dom2, H1: t0 ∈ dom2. We reject the null hypothesis H0 with P-value .
The domain dom2 corresponds to 20,000 to 13,100 years BC. In other words, from the data, one could have already deduced that the allele had to be present before −13,100 years (i.e., before the presumed start of domestication). Indeed, domestication in horses is thought to have started about 3500 years BC (Outram et al. 2009). Our analysis shows that it is likely to have arisen within the last 20,000 years, thus clearly indicating that it was present as a standing variant at the time of domestication.
Conclusion
The allele age, the strength of selection, and the population size are all crucial parameters in population genetics. Although the volume of molecular data has been growing exponentially in recent years, it often remains a challenge to estimate those key parameters.
We develop a maximum-likelihood approach to estimating these parameters that deals with a particular type of data—temporal data. Our method is based on an approximation to the WF diffusion process and has the advantage of being quite flexible and appropriate for hypothesis testing. Moreover, it is fast for small γ, as one evaluation of the likelihood function takes ∼0.1 sec for γ 40 on a laptop with a i5 2.53 GHz CPU, for a data set like the one we analyze here.
We show through simulations that, for a sample of realistic size, although the variance of our estimator is quite large, our MLE is unbiased for estimating selection and is nearly unbiased for the age of the allele and the effective population size. On the other hand, our method is not appropriate for estimating the population size, even for simulations in which the model used to simulate the data matches the method used to infer the parameters. Indeed, for a realistic sampling scenario, the MLEs for Ne, although nearly unbiased, can span an order of magnitude. This is not surprising. The effective population size is a parameter notoriously difficult to estimate, and our method considers only a single locus.
The sampling scheme has of course an impact on the accuracy of the estimator. We investigated two different sampling strategies and concluded that, in the cases considered, it is better to increase the number of sampling times rather than the number of samples per time point. It is indeed intuitive that to be able to estimate the allele age, for the conditional process, it is necessary to have a sample taken at a time close to the allele age. Indeed, in the conditional process, an allele will never get fixed or lost. Thus, after several units of rescaled time, the likelihood is flat.
We reanalyze a locus that was previously found to be under positive selection, ASIP, by evaluating samples ranging from the Pleistocene to the present. In this study, we test for directional selection and we do not have sufficient resolution to distinguish positive from negative selection for this locus. This may be due to an insufficient amount of data, but it could also be due to an underestimate of the effective population size, or a violation of one or more assumptions of our null model, as discussed earlier. Although we are not able to estimate the selection coefficient as precisely as we would like, we find the age of the ASIP mutation to range between 20,000 and 13,100 years BC with an MLE at 13,400 years BC, which well predates domestication.
Although we analyze a mammalian data set, our method can in principle be applied to data sets obtained from experimental evolution or from viral populations. It is important to note, however, that our approximation to the WF model is valid only when the diffusion approximation to the WF model is appropriate, and hence only when 2Nes is not too large.
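One way to see why strong selection strains the diffusion approximation is to compare the expected one-generation change in allele frequency under the discrete WF model with the first-order term retained by the diffusion. The Python sketch below assumes genic selection with relative fitnesses 1 + s and 1, which may differ from the parameterization used elsewhere in this article, and the function names are hypothetical. Under this assumption the relative discrepancy in the mean change equals s·x, so for a fixed population size a larger 2Nes implies a larger error. This illustrates only the mean term and is not a full assessment of the approximation.

```python
def exact_wf_mean_change(x, s):
    """Expected one-generation change in frequency under genic selection
    (relative fitnesses 1 + s and 1; an assumption of this sketch)."""
    return x * (1.0 + s) / (1.0 + s * x) - x


def diffusion_mean_change(x, s):
    """First-order term in s kept by the diffusion approximation (per generation)."""
    return s * x * (1.0 - x)


def relative_error(x, s):
    """Relative discrepancy between the two mean changes; algebraically s * x.
    Requires s != 0 and 0 < x < 1 so that the exact change is nonzero."""
    exact = exact_wf_mean_change(x, s)
    return (diffusion_mean_change(x, s) - exact) / exact


# Hypothetical selection coefficients, evaluated at frequency x = 0.5.
for s in (0.001, 0.01, 0.1, 0.5):
    print(s, relative_error(0.5, s))
```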
Importantly, our framework readily lends itself to being extended to multiple loci, the topic of future investigation. This extension is anticipated to provide greatly improved estimates of Ne and to permit the inference of fluctuations in historical population size—both issues of outstanding importance in gaining refined estimates of selection coefficients and allele age.
Acknowledgments
A.S.M. thanks Fernando Perez for help with the numerical analysis and with NumPy and SciPy, and Philip Johnson and Emilia Huerta-Sanchez for helpful discussion. This work was part of A.S.M.’s Ph.D. thesis in the Department of Integrative Biology, University of California, Berkeley. It was funded in part by an Ernst and Lucie Schmidheiny fellowship to A.S.M., by the French Ministry of Industry (DGCIS) and the Region Ile-de-France in the framework of the LaBS Project, by a National Science Foundation grant (DMS-0907630) to S.N.E., and by a National Institutes of Health grant (R01-GM40282) to M.S.
Footnotes
Communicating editor: L. M. Wahl
Literature Cited
- Anderson E. C., Williamson E. G., Thompson E. A., 2000. Monte Carlo evaluation of the likelihood for Ne from temporally spaced samples. Genetics 156: 2109–2118
- Barrett R., Schluter D., 2008. Adaptation from standing genetic variation. Trends Ecol. Evol. 23(1): 38–44
- Bollback J. P., Huelsenbeck J. P., 2007. Clonal interference is alleviated by high mutation rates in large populations. Mol. Biol. Evol. 24(6): 1397–1406
- Bollback J. P., York T. L., Nielsen R., 2008. Estimation of 2Nes from temporal allele frequency data. Genetics 179: 497–502
- Cieslak M., Pruvost M., Benecke N., Hofreiter M., Morales A., et al., 2010. Origin and history of mitochondrial DNA lineages in domestic horses. PLoS ONE 5(12): e15311.
- Drummond A. J., Rambaut A., 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7: 214.
- Durrett R., 2008. Probability Models for DNA Sequence Evolution, Ed. 2. Springer, New York.
- Ewens W. J., 2004. Mathematical Population Genetics, Ed. 2. Springer, New York.
- Fisher R., 1922. On the dominance ratio. Proc. R. Soc. Edinb. 42: 321–341
- Gresham D., Desai M. M., Tucker C. M., Jenq H. T., Pai D. A., et al., 2008. The repertoire and dynamics of evolutionary adaptations to controlled nutrient-limited environments in yeast. PLoS Genet. 4(12): e1000303.
- Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5(10): e1000695.
- Hamilton J. D., 1994. Time Series Analysis. Princeton University Press, Princeton, NJ
- Hunter J. D., 2007. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(3): 90–95
- Jazin E., Soodyall H., Jalonen P., Lindholm E., Stoneking M., et al., 1998. Mitochondrial mutation rate revisited: hot spots and polymorphism. Nat. Genet. 18(2): 109–110
- Jones E., Oliphant T., Peterson P., et al., 2001. SciPy: open source scientific tools for Python. http://www.scipy.org/
- Lalueza-Fox C., Rompler H., Caramelli D., Staubert C., Catalano G., et al., 2007. A melanocortin 1 receptor allele suggests varying pigmentation among Neanderthals. Science 318(5855): 1453–1455
- Ludwig A., Pruvost M., Reissmann M., Benecke N., Brockmann G. A., et al., 2009. Coat color variation at the beginning of horse domestication. Science 324(5926): 485.
- Nakata M., 2010. The MPACK (MBLAS/MLAPACK): a multiple precision arithmetic version of BLAS and LAPACK. http://mplapack.sourceforge.net/
- Nelder J. A., Mead R., 1965. A simplex method for function minimization. Computer Journal 7: 308–313
- Nelson M. I., Holmes E. C., 2007. The evolution of epidemic influenza. Nat. Rev. Genet. 8(3): 196–205
- Oliphant T., 2006. Guide to NumPy. Trelgol Publishing
- Outram A. K., Stear N. A., Bendrey R., Olsen S., Kasparov A., et al., 2009. The earliest horse harnessing and milking. Science 323(5919): 1332–1335
- Rasmussen M., Li Y., Lindgreen S., Pedersen J. S., Albrechtsen A., et al., 2010. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 463(7282): 757–762
- Reich D., Green R. E., Kircher M., Krause J., Patterson N., et al., 2010. Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468(7327): 1053–1060
- Rieder S., Taourit S., Mariat D., Langlois B., Guerin G., 2001. Mutations in the agouti (ASIP), the extension (MC1R), and the brown (TYRP1) loci and their association to coat color phenotypes in horses (Equus caballus). Mamm. Genome 12(6): 450–455
- Rusk N., 2009. Targeting ancient DNA. Nat. Methods 6: 629
- Slatkin M., Rannala B., 2000. Estimating allele age. Annu. Rev. Genomics Hum. Genet. 1: 225–249
- Song Y., Steinrucken M., 2012. A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics 190: 1117–1129
- Tamura K., Peterson D., Peterson N., Stecher G., Nei M., et al., 2011. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28(10): 2731–2739
- Van Kampen N., 1992. Stochastic Processes in Physics and Chemistry. Elsevier, New York
- Waples R. S., 1989. A generalized approach for estimating effective population size from temporal changes in allele frequency. Genetics 121: 379–391
- Wichman H. A., Millstein J., Bull J. J., 2005. Adaptive molecular evolution for 13,000 phage generations: a possible arms race. Genetics 170: 19–31
- Williamson E. G., Slatkin M., 1999. Using maximum likelihood to estimate population size from temporal changes in allele frequencies. Genetics 152: 755–761
- Wright S., 1931. Evolution in Mendelian populations. Genetics 16: 97–159