Abstract
The large amount and high quality of genomic data available today enable, in principle, accurate inference of evolutionary histories of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytical form. Existing approximations build on the computationally intensive diffusion limit or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (0 and 1). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here we introduce the beta with spikes, an extension of the beta approximation that explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations with comparable performance to an existing state-of-the-art method.
Keywords: Wright-Fisher, beta, pure genetic drift, linear evolutionary pressures, divergence times
ADVANCES in sequencing technologies have revolutionized the collection of genomic data, increasing both the volume and quality of available sequenced individuals from a large variety of populations and species (Romiguier et al. 2014; Gudbjartsson et al. 2015). These data, which may involve up to millions of single-nucleotide polymorphisms (SNPs), contain information about the evolutionary history of the observed populations. There has been a great focus in the recent years on inferring such histories, and to this end, one of the most widely used models is the Wright-Fisher model (Gautier et al. 2010; Sirén et al. 2011; Malaspinas et al. 2012; Pickrell and Pritchard 2012; Gautier and Vitalis 2013; Steinrücken et al. 2014; Terhorst et al. 2015).
The Wright-Fisher model characterizes the evolution of a randomly mating population of finite size in discrete nonoverlapping generations. The model describes the stochastic behavior in time of the number of copies (frequency) of alleles at a locus. The frequency is influenced by a series of factors, such as random genetic drift, mutations, migrations, selection, and changes in population size. When inferring the evolutionary history of a population, the effects of the different factors have to be untangled. Mutation, migration, and selection affect the allele frequency in a deterministic manner (Ewens 2004). We collectively refer to these as evolutionary pressures. The frequency also varies from one generation to the next as a result of random sampling of a finite-sized population (random genetic drift). Mutations and migrations result in linear changes of the sampling probability, while selection is a nonlinear pressure (Kimura 1964; Crow and Kimura 1970) and therefore is more difficult to study analytically.
A crucial step for carrying out statistical inference in the Wright-Fisher model is determination of the distribution of allele frequency (DAF) as a function of time, conditional on an initial frequency. Even though the Wright-Fisher model has a very simple mathematical formulation, no tractable analytical form exists for the DAF (Ewens 2004). Therefore, various approximations have been developed, ranging from purely analytical to purely numerical. They generally either build on the diffusion limit of the Wright-Fisher model or rely on matching moments of the true DAF. Both types of approximations have been used successfully for inference of populations divergence times (Sirén et al. 2011; Gautier and Vitalis 2013), populations admixture (Pickrell and Pritchard 2012), SNPs under selection (Gautier et al. 2010), and selection coefficients from time-serial data (Malaspinas et al. 2012; Steinrücken et al. 2014; Terhorst et al. 2015).
Wright (1945) was the first to use the diffusion approximation to determine the stationary DAF. Kimura (1955) solved the diffusion limit and found the time-dependent distribution for pure genetic drift, and Crow and Kimura (1956) extended the solution to include linear evolutionary pressures. However, these approaches contain infinite sums, making their use cumbersome in practice. After decades dominated by inference based on the dual coalescent process (Rosenberg and Nordborg 2002; Hoban et al. 2012), diffusion has recently received increasing attention, and researchers have started to investigate other ways to solve analytically or approximate the diffusion equation (McKane and Waxman 2007; Waxman 2011; Malaspinas et al. 2012; Song and Steinrücken 2012; Zhao et al. 2013; Steinrücken et al. 2013, 2014).
Moment-based approximations are less ambitious in that they aim at fitting mathematically convenient distributions by equating the first moments of the true DAF. Such approximations typically use either the normal distribution (Nicholson et al. 2002; Coop et al. 2010; Gautier et al. 2010; Pickrell and Pritchard 2012; Terhorst et al. 2015) or the beta distribution (Balding and Nichols 1995, 1997; Sirén et al. 2011; Sirén 2012). The rationale behind the use of these distributions is twofold. First, they are motivated by the diffusion limit: the normal distribution is the resulting DAF when drift is small (Nicholson et al. 2002), while the beta distribution is the stationary DAF under linear evolutionary pressures (Wright 1945; Crow and Kimura 1956). Second, they are entirely determined by their mean and variance. One major difference between the two is their support. Because the normal distribution is defined over the whole real line, it needs to be truncated to [0,1] (Nicholson et al. 2002; Coop et al. 2010; Gautier et al. 2010). The truncated normal distribution has two atoms at 0 and 1 (corresponding to the allele being lost or fixed) containing the densities in the intervals and , respectively. However, the truncation procedure leads to a variance that no longer matches the variance of the true DAF (Gautier and Vitalis 2013). Alternatively, the full distribution can be applied for intermediary frequencies only, when the probabilities of lying outside the 0 and 1 boundaries are small and therefore can be ignored (Pickrell and Pritchard 2012; Terhorst et al. 2015). Unlike the normal distribution, the beta distribution has the interval as support, but because of its continuous nature, the probabilities at the boundaries will always be 0. Under a Wright-Fisher model, the loss and fixation events have a positive probability. The beta distribution provides a good fit for intermediary frequencies but fails at capturing the nonzero boundary probabilities, as illustrated for pure genetic drift in Figure 1, A–C. When time is small, most of the probability mass is found close to the initial value (Figure 1A). As time becomes larger, the allele frequency drifts away from , and more and more probability accumulates at the boundaries (Figure 1, B and C).
Figure 1.
Fit of the beta and beta with spikes approximations for a population of size . (A–F) The true discrete DAF as given by the Wright-Fisher model under pure genetic drift and the corresponding discretized beta (A–C) and beta with spikes (D–F) approximations. The distributions are conditional on an initial frequency and for different time points (A, D), (B, E), and (C, F), where t is the number of discrete generations that the population has evolved. (G, H) The Hellinger distance between the true DAF and the beta (G) and beta with spikes (H) as a function of and . Each row and column corresponds to specific values of the scaled parameters and . The distances corresponding to the distributions from A–F are marked with arrows. The discretization procedure and Hellinger distance are detailed in File S1.
Here we propose an accurate extension of the beta distribution under linear evolutionary pressures called beta with spikes that explicitly models the probabilities at the boundaries. We show that the addition of spikes greatly improves the fit to the true DAF. We use simulation experiments and published chimpanzee exome data to demonstrate that beta with spikes can be used for inference of population divergence times under pure genetic drift with performance comparable to that of a state-of-the-art diffusion-based method (Gautier and Vitalis 2013) and less computational burden. We additionally discuss how beta with spikes can be used in future development to account for variable population size and selection.
Beta with Spikes Approximation
Consider a diploid randomly mating population of size 2N and a biallelic locus with alleles A1 and A2. Under a Wright-Fisher model, the count of one of the alleles, A1, at the discrete generation t is a random variable . Let be the corresponding allele frequency. The evolution of is shaped by random genetic drift and evolutionary pressures that affect the sampling probability (Ewens 2004). We capture the joint effect of the evolutionary pressures in , a polynomial in the allele frequency . Conditional on , follows a binomial distribution (Ewens 2004)
| (1) |
Here we consider only linear evolutionary pressures, such as mutation and migration. Then takes the form
| (2) |
Because represents the sampling probability in equation (1), we must have , for all . From this we find that a and b satisfy . The case where , for which , for all , has no biological meaning, and we therefore assume that .
Under pure genetic drift, . If mutations happen with probabilities u (from to ) and v (from to ), then and . Migration can be modeled, for example, by assuming that individuals can migrate away from the population under study and that there is an influx of individuals from a large population with constant frequency . Then, with probabilities and , individuals migrate from and to the population under study, respectively. We have and . Mutation and migration can be modeled jointly, resulting in and . In the following, we treat the general linear case.
We are interested in the DAF conditional on as a function of the generation t
| (3) |
For simpler notation, we leave out the explicit condition on and implicit condition on population size and evolutionary pressures.
Under the beta approximation, the DAF is (Balding and Nichols 1995)
| (4) |
where is the beta function. The two shape parameters of the beta distribution are entirely determined by its mean and variance
| (5) |
Therefore, in order to fit to f, we need to calculate and . These can be obtained in closed analytical form (see Supporting Information, File S1 for a full derivation). The mean is entirely determined by the initial frequency and the parameters a and b of the linear evolutionary pressures, while the variance also depends on the population size. Under pure genetic drift (), we have
| (6) |
When , we get
| (7) |
In the limit of infinite population size, the preceding formulas are equivalent to the mean and variance obtained by Sirén (2012). We note that this equation corrects some typographic errors in Sirén (2012), as confirmed by correspondence with the author (see also File S1).
To account for loss and fixation probabilities, we surround the beta distribution with two spikes
| (8) |
where is the Dirac delta function, and . The incorporation of the loss and fixation probabilities into the DAF by means of Dirac delta functions has also been used to obtain a complete solution of the diffusion equation (McKane and Waxman 2007).
To fit to f, we need to determine the mean and variance of conditional on polymorphism () and the probabilities and of loss and fixation, respectively. Given , , , and , the conditional mean and variance can easily be calculated (see File S1). The shape parameters and are calculated as in equation (5), where the mean and variance are replaced by the conditional mean and variance, respectively. Therefore, we only require a means of calculating the loss and fixation probabilities to fully specify the beta with spikes approximation. We use a recursive approach in which we calculate the probabilities for by relying on the approximated . We additionally assume that a and b are small to obtain the following approximation for loss and fixation probabilities (see File S1 for a full derivation):
| (9) |
Figure 1, D–F depicts the beta with spikes approximation for the same cases as in Figure 1, A–C. When time is small (Figure 1, A and D), the beta and beta with spikes distributions are equivalent, but as the time becomes larger, the advantage of adding the spikes becomes evident.
To investigate the approximation quality of the beta with spikes, we calculated the Hellinger distance between the true DAF and the beta and beta with spikes for a range of initial frequencies , times t and parameters a and b (see File S1 for details). The Hellinger distance lies between 0 and 1, with 0 indicating a perfect match between the two distributions. Figure 1, G and H, shows that the addition of spikes drastically improves the fit of the beta approximation to the true DAF under a Wright-Fisher model. It is apparent from the figure that the beta distribution approximates well the true DAF when it is not close to the boundaries: either the initial frequency is close to 0.5 and the time is not too large or the parameters a or b are large enough to keep the allele frequency away from 0 and 1. The beta with spikes has a more consistent behavior because the corresponding Hellinger distance does not vary as much across the different parameters as it does for the beta distribution.
Inference of Divergence Times
To further illustrate the advantage of incorporating the spikes, we inferred divergence times between populations using both simulated data and exome sequencing data from three chimpanzee subspecies (Bataillon et al. 2015).
Populations are represented as successive descendants of a single ancestral population. We assume that after each split, the new populations evolve in isolation (no migration) under pure genetic drift. A rooted tree (Figure 2) can be used to describe the joint history of several present populations, located at the leaves, while the common ancestral population is represented as the root. The data consist of I independent SNPs for J populations in the present: the (arbitrarily defined) reference (A1) allele count in a sample of size () for each locus and population .
Figure 2.
History of three populations in the present. Ancestral population 5 splits in populations 3 and 4, which further splits in populations 2 and 1. For each SNP i and present population , the data consist of the sample size and allele count . The branch length between populations k and j is given as and represents the scaled number of generations that population j evolved since the split from the ancestral population k. The unknown allele frequencies of each population are denoted as , with .
Conditional on the topology (i.e., tree without branch lengths), we inferred the scaled branch lengths by numerically maximizing the likelihood of the data.
Likelihood of the data
Assuming Hardy-Weinberg equilibrium, the probability of observing alleles in a sample of size given the population allele frequency is given by the binomial distribution
| (10) |
However, the allele frequencies are unobserved, and the likelihood of the data for SNP i is obtained by integrating over the unknown allele frequencies
| (11) |
where is the joint distribution of the values at the leaves. The likelihood is a function of the scaled branch lengths, denoted here as Θ, and π, the unknown DAF at the root. The joint distribution is, in turn, an integral over the allele frequencies in the ancestral populations, represented as internal nodes in the tree. We approximate the integrals with sums by discretizing the allele frequencies. The discretized joint distribution is then obtained using a peeling algorithm (Felsenstein 1981), where the transition probabilities on each branch are given by the DAF as a function of the branch length (see File S1 for details). Because the SNPs are assumed to be independent, the full likelihood is a product over the SNPs
| (12) |
Because SNP data contain only polymorphic sites, we further condition the preceding likelihood on polymorphic data as follows:
| (13) |
where
| (14) |
Here and are data corresponding to site i where the allele was lost or fixed, respectively, in the samples from all populations
| (15) |
We treat π, the root DAF, as a nuisance distribution, which we assume to be a beta distribution. For a given topology (i.e., tree without branch lengths), the most likely branch lengths and shape parameters of π are recovered by numerically maximizing the likelihood conditional on polymorphism.
Simulated data
Using the topology depicted in Figure 2, we simulated multiple data sets containing independent SNPs under a Wright-Fisher model, given an ancenstral frequency sampled from π, the root DAF, which we set to be a beta distribution. We used two different scenarios, labeled I and II, summarized in Table 1. Scenario I has a uniform π and large sample sizes, while scenario II is built to produce data that resemble the chimpanzee exome data analyzed later. For this, we used the chimpanzee sample sizes and scaled branch lengths and root DAF as inferred by the beta with spikes on the chimpanzee data (Table 2).
Table 1. Simulation study scenarios.
| Scenario I | Scenario II | |
|---|---|---|
| Shape parameters of π | 1, 1 | 0.0188, 0.0195 |
| Number of SNPs | 5000 | 10,000 |
| Sample sizes , , | 100, 100, 100 | 22, 24, 12 |
| Replicates | 50 | 50 |
This table indicates the values used for branch lengths t, population sizes N, and scaled branch lengths , shape parameters of the beta distribution π, the root DAF, and number of SNPs and sample sizes used in the two simulation scenarios.
Table 2. Inferred scaled branch lengths for the chimpanzee exome data.
| Method | log L | ||||
|---|---|---|---|---|---|
| Beta | 0.273 | 0.086 | 0.002 | 0.955 | −209,646 |
| Beta with spikes | 0.132 | 0.044 | 0.028 | 0.595 | −204,045 |
| Kim Tree | 0.160 | 0.019 | 0.018 | 0.729 | −205,838 |
| ABC1 | 0.183 | 0.027 | 0.333 | 1.914 | −233,802 |
For each simulated data set, we estimated the branch lengths using both the beta and beta with spikes, as described previously. We additionally ran Kim Tree (Gautier and Vitalis 2013) using the default settings. Kim Tree is a method designed for inference of divergence times between populations evolving under pure genetic drift. It uses Kimura’s solution to the diffusion limit for the DAF (Kimura 1955) and relies on a Bayesian Markov chain Monte Carlo (MCMC) approach. Here we use the posterior means of the branch lengths reported by this method as point estimates.
All methods estimated the branches leading to populations 1 and 2 well (Figure 3). Beta with spikes estimates the branch lengths more accurately and with lower variance than the beta approximation (Figure S1). Despite the fact that the spikes’ probabilities do not perfectly match the true loss and fixation probabilities (Figure 1, E and F), this seems to have little effect on the accuracy of branch-length estimation for beta with spikes. For both scenarios, the branch leading to population 2 and the inner branch from the root to population 4 have similar lengths, but the beta approximation and Kim Tree provide a worse estimate for the inner branch. This could be due to the fact that there are no data available resulting directly from the evolution on that branch, making the estimation problem harder. A similar result was obtained by Gautier and Vitalis (2013), who used trees with the same topology. Interestingly, beta with spikes recovers the inner branch much more accurately than either beta or Kim Tree. When measuring the accuracy of the inferred lengths as an average over all four branches (Table S1), it is clear that beta with spikes outperforms Kim Tree for both scenarios.
Figure 3.
Inference of divergence times for simulation scenarios I (A) and II (B). The figure shows box plots summarizing the inferred lengths for the four branches of the tree, indicated at the top of each column in black. The inferred lengths are plotted for beta (B), beta with spikes (BS), and Kim Tree (KT) (Gautier and Vitalis 2013). The (true) simulated length of each branch is plotted as a horizontal line. Each plot is scaled relative to the corresponding simulated branch length τ, with the limits of the y-axis set to .
Chimpanzee data
The chimpanzee data analyzed here consisted of allele counts of autosomal synonymous SNPs obtained from exome sequencing of the Eastern, Central, and Western chimpanzee subspecies (Bataillon et al. 2015) for 11, 12, and 6 individuals, respectively. From the original data set containing 59,905 synonymous SNPs, we filtered the SNPs in which there were missing data, obtaining a total of 42,063 SNPs. We inferred the scaled branch lengths (Figure 4 and Table 2) using beta, beta with spikes, and Kim Tree on the full data set and on 50 smaller data sets containing only 10,000 randomly sampled SNPs. Beta with spikes and Kim Tree infer comparable branch lengths, with the exception of the branch leading to the Western chimpanzee subspecies. We additionally report in Table 2 the likelihood of the full data calculated using beta with spikes for the branch lengths inferred using the three methods and the ones reported in the original study (Bataillon et al. 2015). Bataillon et al. (2015) used an approximate Bayesian computation (ABC) approach to fit a demographic model to the synonymous SNPs. Their results are consistent with the results obtained here for the branches leading to the Eastern and Central chimpanzees. However, we obtained very different estimates for the remaining two branches. The likelihoods in Table 2 show that the branch lengths obtained by beta with spikes have the highest likelihood. This does not indicate which of the estimates is most correct because the methods use different modeling approaches. However, it does show that the differences between beta with spikes and beta/Kim Tree/ABC are a direct result of the modeling approach and not merely an artifact of the likelihood numerical optimization being trapped in a local optimum. The discrepancy between the results of beta with spikes and those of ABC is, perhaps, not surprising because the difference in inferred branch lengths seems to correlate with the goodness of fit of the ABC demographic model to the observed data. Bataillon et al. (2015) reported that their inferred demographic model shows a very good fit for the Central chimpanzees (absolute and relative difference in inferred branch length 0.017 and 0.386, respectively), a slightly less good fit for the Eastern chimpanzees (absolute and relative difference in inferred branch length 0.051 and 0.386, respectively), and a poorer fit for the Western chimpanzees (absolute and relative difference in inferred branch length 1.319 and 2.217, respectively).
Figure 4.
Inference of divergence times for the chimpanzee exome data. The figure shows box plots summarizing the inferred lengths using 50 data sets with 10,000 SNPs that were randomly sampled from the full data set. The corresponding tree branches are indicated at the top of each plot in black. The inferred lengths are plotted for beta (B), beta with spikes (BS), and Kim Tree (KT) (Gautier and Vitalis 2013). The nonsolid lines indicate the inferred lengths when running the methods on the full data set of 42,064 SNPs. The populations at the leaves are Eastern (E), Central (C), and Western (W). Each plot is scaled relative to the corresponding branch length τ inferred by beta with spikes on the full data set. The limits of the y-axis are set to .
Data availability
The beta, beta with spikes approximations, inference of divergence times, and simulation under a Wright-Fisher model were implemented in Python 2.7. The code is freely available at https://github.com/paula-tataru/SpikeyTree.
Discussion
We have developed a new approximation to the DAF as a function of time, conditional on an initial frequency, under a Wright-Fisher model with linear evolutionary pressures. Our work provides an accurate extension of the beta approximation (Balding and Nichols 1995, 1997; Sirén et al. 2011; Sirén 2012). As noted by Gautier and Vitalis (2013), the beta distribution ignores the possibility of loss or fixation of alleles. We addressed this issue by explicitly modeling the loss and fixation probabilities as two spikes at the boundaries. We showed that the addition of the spikes improves the quality of the approximation and results in more accurate inference of divergence times between populations that have been evolving under pure genetic drift. The DAF obtained as a solution of the diffusion equation is exactly the DAF expected under the Wright-Fisher model when the population size is large and the evolutionary pressures are not too strong, while the beta distribution is motivated only by the stationary DAF under linear evolutionary pressures. We therefore expected the beta with spikes to provide a less accurate approximation to the true DAF than the diffusion limit. Nevertheless, we showed that our method can infer divergence times more accurately than Kim Tree (Gautier and Vitalis 2013), a software built for inference of divergence times using Kimura’s solution to the diffusion limit (Kimura 1955). We would like to note here that the use of likelihood that is explicitly conditioning on polymorphic data [equation (13)] could potentially be the cause of beta with spikes outperforming Kim Tree.
Computational complexity
The advantage of beta with spikes becomes clearer when one considers its computational complexity. Diffusion methods rely on heavy computations, such as calculations of Gegenbauer polynomials (Gautier and Vitalis 2013), spectral decomposition of large matrices (Steinrücken et al. 2013, 2014), or matrix inverse (Zhao et al. 2013). In contrast, beta with spikes requires operations that are performed in constant time per iteration. Perhaps the most expensive evaluation is the beta function used in the loss and fixation probabilities, but very efficient approximations exist for this (Abramowitz and Stegun 1964). The difference in computational complexity is noticeable when comparing the running times of beta with spikes, implemented in Python 2.7, and Kim Tree, implemented in Fortran. For the chimpanzee data set of 42,063 SNPs, beta with spikes ran in just under 5 min, while Kim Tree took almost an hour, even though Python 2.7 is a programming language that is less efficient than Fortran. We also note that the two inference methods are inherently different because here we used a numerical optimization procedure, while Kim Tree uses a Bayesian MCMC approach. Additionally, unlike Kim Tree, beta with spikes is a recursive method, and its accuracy and speed depend on the number of iterations performed (see File S1 for accuracy results for different numbers of iterations).
Extensions
We end this section by discussing possible extensions of the beta with spikes approximation and how these can be used in inference problems. Throughout this paper, we assumed that the population size is constant. Given its recursive formulation, beta with spikes lends itself naturally to incorporating variable population size without any increase in computational complexity. This can then be used for inference of population size backward in time, similar to methods relying on the coalescent with recombination (Li and Durbin 2011; Sheehan et al. 2013; Schiffels and Durbin 2014). A recently published method (Liu and Fu 2015) illustrates that allele frequency data, summarized as site frequency spectra, can be used efficiently for inference of variable population size backward in time. Even though Liu and Fu (2015) assumed that sites are independent and did not use linkage information, their method can handle larger data sets than the method of Li and Durbin (2011), which leads to more accurate inference of population sizes for the recent past. The results obtained by Liu and Fu (2015) indicate that beta with spikes could be used successfully for such demographic inference.
Another extension of the presented approximation would be to incorporate selection, which is a nonlinear evolutionary pressure. In the recent years, there has been a great focus on inference of selection coefficients from time-series data under a Wright-Fisher model (Malaspinas et al. 2012; Bank et al. 2014; Steinrücken et al. 2014; Foll et al. 2015; Terhorst et al. 2015). A newly developed statistical method aims at modeling the evolution of multilocus alleles under a Wright-Fisher model with selection (Terhorst et al. 2015) by fitting a multivariate normal distribution from the first moments of the DAF. Using the approach of Terhorst et al. (2015) for moment calculation, beta with spikes can be extended to nonlinear evolutionary pressures. Terhorst et al. (2015) did not treat the loss and fixation probabilities. However, because selection is expected to drive allele frequencies toward the boundaries faster than pure genetic drift, including the explicit spikes becomes crucial.
Supplementary Material
Acknowledgments
It is a pleasure to thank Thomas Mailund for helpful discussions. We also thank the two anonymous reviewers and the associate editor for their constructive suggestions and comments that helped to improve the manuscript. This work was supported, in part, by the European Research Council under the European Union’s Seventh Framework Program (FP7/20072013, ERC grant number 311341) and the Danish Research Council (grant number DFF-4002-00382).
Footnotes
Communicating editor: Y. S. Song
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.179606/-/DC1
Literature Cited
- Abramowitz M., Stegun I. A., 1964. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Dover Publications, Mineola, NY. [Google Scholar]
- Balding D. J., Nichols R. A., 1995. A method for quantifying differentiation between populations at multiallelic loci and its implications for investigating identity and paternity. Genetica 96: 3–12. [DOI] [PubMed] [Google Scholar]
- Balding D. J., Nichols R. A., 1997. Significant genetic correlations among Caucasians at forensic DNA loci. Heredity 78: 583–589. [DOI] [PubMed] [Google Scholar]
- Bank C., Ewing G. B., Ferrer-Admettla A., Foll M., Jensen J. D., 2014. Thinking too positive? Revisiting current methods of population genetic selection inference. Trends Genet. 30: 540–546. [DOI] [PubMed] [Google Scholar]
- Bataillon T., Duan J., Hvilsom C., Jin X., Li Y., et al. , 2015. Inference of purifying and positive selection in three subspecies of chimpanzees (Pan troglodytes) from exome sequencing. Genome Biol. Evol. 7: 1122–1132. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Coop G., Witonsky D., Di Rienzo A., Pritchard J. K., 2010. Using environmental correlations to identify loci underlying local adaptation. Genetics 185: 1411–1423. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crow J., Kimura M., 1956. Some genetic problems in natural populations, pp. 1–22 in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol4 University of California Press, Oakland, CA. [Google Scholar]
- Crow J. F., Kimura M., 1970. An Introduction to Population Genetics Theory. Harper & Row, New York. [Google Scholar]
- Ewens W. J., 2004. Mathematical Population Genetics 1: Theoretical Introduction, Vol27 Springer, New York. [Google Scholar]
- Felsenstein J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17: 368–376. [DOI] [PubMed] [Google Scholar]
- Foll M., Shim H., Jensen J. D., 2015. WFABC: a Wright-Fisher ABC-based approach for inferring effective population sizes and selection coefficients from time-sampled data. Mol. Ecol. Resour. 15: 87–98. [DOI] [PubMed] [Google Scholar]
- Gautier M., Hocking T. D., Foulley J.-L., 2010. A Bayesian outlier criterion to detect SNPs under selection in large data sets. PLoS One 5: e11913. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gautier M., Vitalis R., 2013. Inferring population histories using genome-wide allele frequency data. Mol. Biol. Evol. 30: 654–668. [DOI] [PubMed] [Google Scholar]
- Gudbjartsson D. F., Helgason H., Gudjonsson S. A., Zink F., Oddson A., et al. , 2015. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47: 435–444. [DOI] [PubMed] [Google Scholar]
- Hoban S., Bertorelle G., Gaggiotti O. E., 2012. Computer simulations: tools for population and evolutionary genetics. Nat. Rev. Genet. 13: 110–122. [DOI] [PubMed] [Google Scholar]
- Kimura M., 1955. Solution of a process of random genetic drift with a continuous model. Proc. Natl. Acad. Sci. USA 41: 144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kimura M., 1964. Diffusion models in population genetics. J. Appl. Probab. 1: 177–232. [Google Scholar]
- Li H., Durbin R., 2011. Inference of human population history from individual whole-genome sequences. Nature 475: 493–496. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu X., Fu Y.-X., 2015. Exploring population size changes using SNP frequency spectra. Nat. Genet. 47: 555–559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Malaspinas A.-S., Malaspinas O., Evans S. N., Slatkin M., 2012. Estimating allele age and selection coefficient from time-serial data. Genetics 192: 599–607. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKane A., Waxman D., 2007. Singular solutions of the diffusion equation of population genetics. J. Theor. Biol. 247: 849–858. [DOI] [PubMed] [Google Scholar]
- Nicholson G., Smith A. V., Jónsson F., Gústafsson Ó., Stefánsson K., et al. , 2002. Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Stat. Soc. Series B Stat. Methodol. 64: 695–715. [Google Scholar]
- Pickrell J. K., Pritchard J. K., 2012. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8: e1002967. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Romiguier J., Gayral P., Ballenghien M., Bernard A., Cahais V., et al. , 2014. Comparative population genomics in animals uncovers the determinants of genetic diversity. Nature 515: 261–263. [DOI] [PubMed] [Google Scholar]
- Rosenberg N. A., Nordborg M., 2002. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat. Rev. Genet. 3: 380–390. [DOI] [PubMed] [Google Scholar]
- Schiffels S., Durbin R., 2014. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46: 919–925. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sheehan S., Harris K., Song Y. S., 2013. Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics 194: 647–662. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sirén, J., 2012 Statistical models for inferring the structure and history of populations from genetic data. Ph.D. thesis, University of Helsinki. [Google Scholar]
- Sirén J., Marttinen P., Corander J., 2011. Reconstructing population histories from single nucleotide polymorphism data. Mol. Biol. Evol. 28: 673–683. [DOI] [PubMed] [Google Scholar]
- Song Y. S., Steinrücken M., 2012. A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics 190: 1117–1129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Steinrücken M., Bhaskar A., Song Y. S., 2014. A novel spectral method for inferring general diploid selection from time series genetic data. Ann. Appl. Stat. 8: 2203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Steinrücken M., Wang Y. R., Song Y. S., 2013. An explicit transition density expansion for a multiallelic Wright-Fisher diffusion with general diploid selection. Theor. Popul. Biol. 83: 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Terhorst J., Schlötterer C., Song Y. S., 2015. Multilocus analysis of genomic time series data from experimental evolution. PLoS Genet. 11: e1005069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Waxman D., 2011. A compact result for the time-dependent probability of fixation at a neutral locus. J. Theor. Biol. 274: 131–135. [DOI] [PubMed] [Google Scholar]
- Wright S., 1945. The differential equation of the distribution of gene frequencies. Proc. Natl. Acad. Sci. USA 31: 382. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao L., Yue X., Waxman D., 2013. Complete numerical solution of the diffusion equation of random genetic drift. Genetics 194: 973–985. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The beta, beta with spikes approximations, inference of divergence times, and simulation under a Wright-Fisher model were implemented in Python 2.7. The code is freely available at https://github.com/paula-tataru/SpikeyTree.




