Abstract
Demographic inference methods in population genetics typically assume that the ancestry of a sample can be modeled by the Kingman coalescent. A defining feature of this stochastic process is that it generates genealogies that are binary trees: no more than 2 ancestral lineages may coalesce at the same time. However, this assumption breaks down under several scenarios. For example, pervasive natural selection and extreme variation in offspring number can both generate genealogies with “multiple-merger” events in which more than 2 lineages coalesce instantaneously. Therefore, detecting violations of the Kingman assumptions (e.g. due to multiple mergers) is important both for understanding which forces have shaped the diversity of a population and for avoiding fitting misspecified models to data. Current methods to detect deviations from Kingman coalescence in genomic data rely primarily on the site frequency spectrum (SFS). However, the signatures of some non-Kingman processes (e.g. multiple mergers) in the SFS are also consistent with a Kingman coalescent with a time-varying population size. Here, we present a new statistical test for determining whether the Kingman coalescent with any population size history is consistent with population data. Our approach is based on information contained in the 2-site joint frequency spectrum (2-SFS) for pairs of linked sites, which has a different dependence on the topologies of genealogies than the SFS. Our statistical test is global in the sense that it can detect when the genome-wide genetic diversity is inconsistent with the Kingman model, rather than detecting outlier regions, as in selection scan methods. We validate this test using simulations and then apply it to demonstrate that genomic diversity data from Drosophila melanogaster is inconsistent with the Kingman coalescent.
Keywords: Kingman coalescent, demographic inference, multiple mergers, beta coalescent
Introduction
The genetic diversity within a population reflects its demographic and evolutionary history. Learning about this history from contemporary sequence data is the domain of modern population genetics (see Hahn 2018). The fundamental tools of the trade are simplified mathematical models, which connect unobserved quantities such as the population size to observable features of genetic data. However, populations are complicated and, moreover, vary in their complications. No simple model can capture the processes governing every species’ evolution, and a misspecified model will generate misleading inferences. It is, therefore, crucial to understand the limits of population genetics models and to assess when a model is appropriate for a particular data set.
One of the most widely used models is the Kingman coalescent (Kingman 1982a, 1982b; Hudson 1983; Tajima 1983). The Kingman coalescent is a stochastic process that generates gene genealogies: trees representing the patterns of shared ancestry of sampled individuals. Inference methods use these genealogies as latent variables linking demographic parameters to genetic data (Rosenberg and Nordborg 2002). The Kingman coalescent has a number of convenient properties that facilitate both analytical calculations (e.g. Tajima 1989) and efficient stochastic simulations (e.g. Hudson 2002): tree topologies are independent of waiting times; waiting times are generated by a Markov process; and neutral mutations are modeled as a Poisson process conditionally independent of the tree. Moreover, the model can be extended to study a variety of biological phenomena including recombination, population structure, and variation in sex ratios or ploidy (see generally Wakeley 2009).
An important application of the Kingman coalescent is inferring historical population sizes from genetic data (Schraiber and Akey 2015). In its simplest form, the model has a single parameter, the coalescent rate, which determines the branch lengths of genealogies (Kingman 1982b). Under many conditions, the coalescent rate is inversely proportional to the population size (Kingman 1982a). Accordingly, a growing or shrinking population may be modeled by a time-varying coalescence rate (Griffiths and Tavaré 1994, 1998). Patterns of genetic diversity depend on the ratio of the coalescent rate to other evolutionary rate parameters. For example, the site frequency spectrum (SFS)—the number of mutations segregating at different frequencies in a sample—is determined by the ratio of the mutation rate to the (time-varying) coalescent rate. Kingman coalescent-based inference methods solve the inverse problem of determining the population size history that best explains particular features of the data, such as the SFS (e.g. Bhaskar et al. 2015) or variations in heterozygosity along a chromosome (e.g. Li and Durbin 2011).
A serious problem for this class of inference methods is that different models of evolution generate different relationships between historical population sizes and genetic diversity. For example, one of the basic assumptions of the Kingman coalescent is that natural selection is negligible in determining the distribution of genealogies. When this assumption is violated, Kingman-based inference methods are misspecified (Gillespie 2000a, 2000b, 2001). For instance, when a beneficial mutation increases rapidly in frequency, it distorts the genealogies at nearby sites (see e.g. Coop and Ralph 2012). If these “selective sweeps” occur regularly, they may be the dominant factor determining the distribution of genealogies. In this case, the average coalescent rate is proportional to the number of beneficial mutations introduced per generation, which is itself directly, rather than inversely, proportional to the population size. It follows that the relationship between the population size and the expected number of neutral mutations in a sample is inverted: larger populations will be less diverse than smaller populations.
While the example above is extreme, it is well established that violations of the neutrality assumption can distort or mask the signatures of population size changes. For example, Schrider et al. (2016) and Johri et al. (2021) demonstrated that several popular inference methods give misleading results in the presence of selective sweeps and background selection. In a similar vein, Cvijović et al. (2018) showed that reduction of genetic diversity by purifying selection is accompanied by distortions in the SFS, leading to a false signal of population growth. Moreover, genomic evidence from multiple species suggests that such violations of neutrality may be widespread (Sella et al. 2009; Corbett-Detig et al. 2015; Kern and Hahn 2018; Johri et al. 2020).
An important extension of the Kingman coalescent is a family of models known as multiple-merger coalescents (Donnelly and Kurtz 1999; Pitman 1999; Sagitov 1999; Eldon 2016), which arise in a variety of contexts both with and without selection. Whereas in the Kingman coalescent lineages may coalesce only pairwise, multiple-merger coalescents permit more than 2 lineages to coalesce in a single event. The more general class of simultaneous-multiple-merger coalescents (Schweinsberg 2000; Möhle and Sagitov 2001; Sagitov 2003) permits more than one distinct multiple-merger event at the same time. Multiple-merger and simultaneous-multiple-merger models are relevant for species with “sweepstakes” reproductive events (Eldon and Wakeley 2006; Sargsyan and Wakeley 2008), fat-tailed offspring number distributions (Schweinsberg 2003; Hallatschek 2018), recurring selective sweeps at linked sites (Durrett and Schweinsberg 2005; Coop and Ralph 2012), rapid adaptation (Desai et al. 2013; Neher and Hallatschek 2013), and purifying selection at sufficiently many sites (Seger et al. 2010; Nicolaisen and Desai 2012; Good et al. 2014; Cvijović et al. 2018).
In each of these contexts, the coalescent timescale is not necessarily proportional to the population size. For example, with fat-tailed offspring distributions, the rate of coalescence is a power law in the population size (Schweinsberg 2003), while with linked sweeps, it is determined by the rate of linked sweeps, as described above (Durrett and Schweinsberg 2005). In these settings, interpreting the level of genetic diversity in terms of an “effective population size” is misleading, and inferences based on the Kingman coalescent may be qualitatively and quantitatively incorrect.
It is, therefore, important to determine whether the Kingman model is appropriate for a given data set before performing demographic inference. This task is distinct from “selection scan” methods designed to detect particular regions of the genome that are under selection (see Vitti et al. 2013). Selection scan methods typically assume that most of the genome is evolving neutrally and that the genome-wide distribution of summary statistics reflects demographic factors. Genomic regions that are outliers from this distribution are presumed to be under selection. In contrast, we are interested in detecting when the genome-wide background itself is not well-modeled by the Kingman coalescent.
There has been much recent interest in identifying in genomic data departures from the Kingman coalescent caused by multiple mergers. One approach is to use the SFS as a summary statistic. To this end, Birkner et al. (2013), Blath et al. (2016), and Spence et al. (2016) derived methods for computing the expected SFS of (simultaneous) multiple-merger coalescents. Further, Eldon et al. (2015) showed that it is possible to use the SFS to distinguish beta and Dirac (multiple-merger) coalescents from Kingman coalescents with strictly exponential or algebraic growth. Koskela (2018) and Koskela and Wilke Berenguer (2019) extended this work and used the SFS to distinguish multiple mergers caused by selection from those caused by sweepstakes reproduction. In a related approach, Rödelsperger et al. (2014) detected widespread linked selection in the nematode Pristionchus pacificus by demonstrating that the SFS is nonmonotonic, a signature of multiple mergers (Birkner et al. 2013; Neher and Hallatschek 2013). Several more recent papers have used this nonmonotonicity in the SFS to identify departures from the Kingman coalescent in Atlantic cod (Árnason et al. 2023) and a variety of other organisms (Freund et al. 2023). In other recent work, Freund and Siri-Jégousse (2021) have introduced a new statistic, the minimum observable clade size, and used it, along with SFS-derived statistics, to discriminate between several coalescent models (including multiple-merger and Kingman coalescents both with and without population growth) using an approximate Bayesian computation (ABC) framework. Several other recent papers have used related combinations of linkage disequilibrium and SFS-derived statistics in a similar ABC framework to detect evidence for multiple-merger genealogies (Menardo et al. 2021) or to jointly infer the action of demography and selection (Johri et al. 2020, 2021; Lepers et al. 2021).
However, methods that derive their power primarily from the SFS are limited in their ability to distinguish multiple mergers from general models of population-size change. While previous work has demonstrated that the SFS does contain information that can discriminate multiple-mergers from particular forms of the Kingman coalescent, a Kingman coalescent with a more general model of population size change can accurately fit many aspects of the multiple-merger SFS (Myers et al. 2008; Bhaskar and Song 2014). This fundamentally limits the ability to discriminate between population models using SFS-based statistics alone. The nonmonotonic SFS identified by Rödelsperger et al. (2014), Árnason et al. (2023), and Freund et al. (2023) is a more robust signature of multiple mergers, but identifying that the SFS increases at high frequencies requires both knowledge of the ancestral allele at each site and a large enough sample size to accurately sample rare, high-frequency alleles, and either condition may be violated in real-world data.
Here, we propose that statistics based on the 2-site frequency spectrum (2-SFS), the generalization of the SFS to pairs of nearby sites (Hudson 2001; Ferretti et al. 2018), are useful for distinguishing between the Kingman coalescent with population growth and multiple-merger coalescents. This is fundamentally different from approaches based primarily on the single-site SFS (e.g. Birkner et al. 2013; Eldon et al. 2015; Blath et al. 2016; Spence et al. 2016; Freund and Siri-Jégousse 2021; Freund et al. 2023) because 2-SFS-based statistics depend on tree topologies and coalescent rates in a manner distinct from SFS-based statistics. Thus, these 2-SFS statistics introduce new information not contained in the SFS that can be used to discriminate models that produce the same SFS. Furthermore, these statistics may be calculated efficiently from single-nucleotide polymorphism (SNP) data, do not require recombination maps or ancestral allele identification, and are informative even with small sample sizes. Together, these properties make the 2-SFS useful for demographic model-checking in a wide range of species.
In this paper, we show that 2-SFS-based statistics can be used to discriminate Kingman from non-Kingman coalescence. By validating with simulations, we demonstrate high power to reject incorrect Kingman population-size-change models for biologically realistic sample sizes. We present a Snakemake pipeline for analyzing real-world population data and demonstrate our pipeline using genomic data from Drosophila melanogaster (Lack et al. 2015).
Definitions and background
Following the notation of Fu (1995), we define the SFS of a sample of n haploid genomes as ξ, where $\xi_i$ is the number of sites containing a mutation with derived allele count i in the sample ($1 \le i \le n-1$). When the ancestral allele is unknown, mutations at frequency $n-i$ are indistinguishable from mutations at frequency i, and the folded SFS, η, is used instead, where $\eta_i = (\xi_i + \xi_{n-i})/(1 + \delta_{i,n-i})$ is the number of sites with minor allele count i in the sample, $1 \le i \le \lfloor n/2 \rfloor$. Here, $\delta_{i,j}$ is the Kronecker delta ($\delta_{i,j} = 1$ if $i = j$ and 0 otherwise). The SFS and folded SFS can be calculated from a set of SNPs without knowing the physical location of the SNPs.
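As an illustration of these definitions, the following minimal sketch tallies an unfolded SFS from a list of derived allele counts and then folds it. The function names and the example counts are hypothetical and are not taken from the authors' code; the folding rule assumes the standard convention in which the middle frequency class (i = n − i) is counted once.

```python
from collections import Counter

def sfs_from_counts(derived_counts, n):
    """Unfolded SFS: entry i-1 is the number of sites with derived allele count i (1 <= i <= n-1)."""
    tally = Counter(derived_counts)
    return [tally.get(i, 0) for i in range(1, n)]

def fold_sfs(sfs):
    """Folded SFS: entry i-1 is the number of sites with minor allele count i (1 <= i <= n//2)."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        if i == n - i:
            folded.append(sfs[i - 1])  # middle class: derived and ancestral counts coincide
        else:
            folded.append(sfs[i - 1] + sfs[n - i - 1])
    return folded

# Hypothetical derived allele counts at 7 polymorphic sites in a sample of n = 6 genomes.
counts = [1, 1, 2, 5, 3, 1, 4]
sfs = sfs_from_counts(counts, n=6)
folded = fold_sfs(sfs)
```

Folding preserves the total number of polymorphic sites, which is a quick sanity check on any implementation.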
In contrast, the 2-SFS, ϕ, is a statistic of pairs of sites. We define the 2-SFS, $\phi_{ij}(d)$, as the number of pairs of polymorphic sites separated by d bases for which there is a mutation with derived allele count i at one site and a second mutation with derived allele count j at the other site. Note that $\phi_{ij}(d) = \phi_{ji}(d)$ by symmetry. The 2-SFS has been studied for nonrecombining sites by Ferretti et al. (2018) in a neutral model and by Xie (2011) in a model with selection. When the ancestral allele is unknown, we define the folded 2-SFS by analogy to the folded SFS: it represents the number of pairs of sites separated by d bases in which one site has minor allele count i and the other has minor allele count j. For nonrecombining sites, the 2-SFS is independent of the distance, so we will suppress the d in our notation when considering the nonrecombining case.
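The pair-counting definition can be sketched directly. The example below is an assumption-laden illustration rather than the authors' implementation: `two_sfs`, its arguments, and the choice to store each unordered pair under a sorted key (so that the (i, j) and (j, i) entries share one cell) are all hypothetical.

```python
from collections import defaultdict

def two_sfs(positions, allele_counts, max_d):
    """Return {d: {(i, j): number of pairs}} for pairs of sites separated by d <= max_d bases."""
    sites = sorted(zip(positions, allele_counts))
    result = defaultdict(lambda: defaultdict(int))
    for a in range(len(sites)):
        pos_a, i = sites[a]
        for b in range(a + 1, len(sites)):
            pos_b, j = sites[b]
            d = pos_b - pos_a
            if d > max_d:
                break  # sites are sorted by position, so later sites are farther away
            # Sort the key so that (i, j) and (j, i) are tallied together.
            result[d][tuple(sorted((i, j)))] += 1
    return result

# Hypothetical SNP positions and minor allele counts.
pairs = two_sfs(positions=[10, 13, 16, 40], allele_counts=[1, 2, 1, 3], max_d=6)
```

The double loop is quadratic in the worst case but stops early once the distance cutoff is exceeded, so the cost scales with the number of nearby pairs rather than all pairs.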
In the limit of low per-site mutation rate and no recombination, all polymorphic sites are bi-allelic and the expected SFS and 2-SFS are related to moments of the genealogical branch length distribution by
| $\langle \xi_i \rangle \propto \langle T_i \rangle$ | (1) |
| $\langle \phi_{ij} \rangle \propto \langle T_i T_j \rangle$ | (2) |
where $T_i$ is the total length of branches subtending i leaves of a gene genealogy and $\langle \cdot \rangle$ represents the expectation over the distribution of gene genealogies defined by a coalescent model. Thus, the SFS and 2-SFS depend on the distribution of coalescent times as well as the distribution of tree topologies. In the opposite limit of high recombination between sites (i.e. fully unlinked loci), the genealogies of the sites come from independent draws of the generating coalescent model, and the 2-SFS can be determined directly from the SFS: $\langle \phi_{ij} \rangle \propto \langle \xi_i \rangle \langle \xi_j \rangle$. Thus, for a recombining population, the 2-SFS is a function of the genomic distance d between sites and only contains information not found in the SFS for nearby, linked sites.
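The fully unlinked limit can be made concrete with a short sketch. It assumes that both spectra are normalized to probabilities (the proportionality constant is not specified here), in which case the 2-SFS factorizes into the outer product of the SFS with itself. The function name and example values are hypothetical.

```python
def unlinked_two_sfs(sfs):
    """Normalized 2-SFS for fully unlinked sites: the outer product of the normalized SFS."""
    total = sum(sfs)
    p = [x / total for x in sfs]
    return [[p_i * p_j for p_j in p] for p_i in p]

# Hypothetical SFS with three frequency classes.
phi_unlinked = unlinked_two_sfs([6, 3, 1])
```

Any deviation of an observed, normalized 2-SFS from this product form at short distances reflects correlations between the genealogies of linked sites.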
Fu (1995) calculated the first and second moments of the branch-length distribution for a nonrecombining infinite-sites locus under the standard time-homogeneous Kingman coalescent. He found that $\mathrm{Cov}(T_i, T_j) < 0$ for all $j \neq i, n-i$. This result, combined with Eq. (1) and Eq. (2), implies a negative correlation between mutations at different frequencies: trees generating a mutation with derived allele count i are less likely than average to generate a second mutation with derived allele count $j \neq i$. (There are positive correlations between mutations at complementary frequencies induced by genealogies whose root node partitions the tree into subtrees of size i and $n-i$.)
Birkner et al. (2013) extended Fu’s calculation to a family of multiple-merger coalescents called beta coalescents. This one-parameter family interpolates between the Kingman coalescent and the Bolthausen–Sznitman coalescent (Bolthausen and Sznitman 1998) as the parameter, α, ranges from 2 to 1. Beta coalescents arise in models with fat-tailed offspring distributions (Schweinsberg 2003; Steinrücken et al. 2013), and the Bolthausen–Sznitman coalescent is the limiting distribution of genealogies in populations that are rapidly adapting or experiencing extensive purifying selection (Neher and Hallatschek 2013). The calculations of Birkner et al. (2013) show positive correlations between the numbers of mutations at different allele counts (Figures 5 and 6 of Birkner et al. 2013). Thus, unlike the standard Kingman coalescent, the beta coalescent can generate positive associations between mutations with different minor allele counts. Together, these results suggest that the differences in associations between mutations at different frequencies (i.e. differences in the 2-SFS) can be used to distinguish multiple-merger coalescents from the Kingman coalescent.
Fig. 5.
a) Observed and fit SFS for the 4 D. melanogaster chromosome arms investigated in this study. Note that the SFS of each chromosome arm is shifted vertically to improve visibility. The SFS of the fit demographies closely match those from the data. The demographic models that produce the fit site frequency spectra are plotted in b).
Fig. 6.
a–d) Log-ratios between the observed 2-SFS and the 2-SFS expected from a Kingman coalescent fit to the SFS for each of the D. melanogaster chromosome arms investigated in this study. Note the clear visual mismatch between the observed and expected 2-SFS. e) Empirical null KS distributions (shaded regions) and measured KS distances between the data and the Kingman fit (stars) for each of the chromosome arms investigated. All 2-SFS deviate significantly from the null distributions.
A 2-SFS-based test for the Kingman coalescent
Motivated by this reasoning, we developed a method to use information in the 2-SFS to determine whether a Kingman coalescent (with any demographic history) is consistent with real-world genomic data. The basic idea is to first use the observed SFS to determine the best-fit demographic history within the Kingman model. We then simulate the expected 2-SFS predicted by this best-fit Kingman demographic history, and use a goodness-of-fit statistic to determine whether this expected 2-SFS is consistent with the data. We illustrate this pipeline in Fig. 1 and describe each step in more detail below.
Fig. 1.
Schematic of the model-checking pipeline. The pipeline follows the arrows from a to h. Briefly, after data collection and cleaning a), we construct the SFS and 2-SFS b) and fit a Kingman demography (the null model) to the SFS c). We simulate the 2-SFS expected from a Kingman model with this null demography for several values of the recombination rate and choose the recombination rate that maximizes the P-value for rejecting the Kingman model based on the KS distance between the 2-SFS of the data and the null d–f). We then resample the null 2-SFS g) and compute the KS distance between these resampled distributions and the null 2-SFS to generate a null KS distribution. We compare the KS distance between the observed 2-SFS and the null 2-SFS to this null KS distribution to generate a P-value h).
Computing the SFS and 2-SFS from population data
We begin by generating the folded SFS and 2-SFS using sequence data from a sampled population. In practice, we will often restrict our analysis to patterns at 4-fold degenerate sites, since these are regarded as more likely to be selectively neutral. We, therefore, typically only consider values of d that are multiples of 3. If the ancestral allele is known, the unfolded SFS and 2-SFS can be used instead. To increase computational efficiency, we lump alleles with a frequency larger than a chosen cutoff into one high-frequency bin, choosing the cutoff by eye such that the high-frequency tail of the SFS is low noise. We note that the pipeline is robust to the exact choice of cutoff (see Supplementary Fig. S1) and can be implemented without this lumping if preferred.
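The lumping step can be sketched as follows. This is a minimal illustration under assumptions: the cutoff name `k_max` and both function names are hypothetical, and the 2-SFS at a single distance is represented as a plain square matrix of counts.

```python
def lump_sfs(sfs, k_max):
    """Keep entries for allele counts 1..k_max and sum all higher counts into one final bin."""
    if k_max >= len(sfs):
        return list(sfs)
    return list(sfs[:k_max]) + [sum(sfs[k_max:])]

def lump_two_sfs(matrix, k_max):
    """Apply the same lumping to both axes of a 2-SFS matrix at one distance d."""
    rows = [lump_sfs(row, k_max) for row in matrix]        # lump along the j axis
    cols = [lump_sfs(col, k_max) for col in zip(*rows)]    # lump along the i axis (transposed)
    return [list(row) for row in zip(*cols)]               # transpose back

lumped = lump_sfs([10, 5, 3, 2, 1, 1], k_max=3)
lumped_matrix = lump_two_sfs([[1, 2, 3], [4, 5, 6], [7, 8, 9]], k_max=1)
```

Lumping conserves the total count in both cases, so the lumped spectra can still be normalized and compared in the same way as the originals.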
Inferring the best-fit Kingman demography
We next use the observed SFS to infer the best-fit Kingman demographic model. To do so, we fit a 5-epoch piecewise-constant Kingman demography to the lumped SFS of the data using a modification of the fastNeutrino algorithm (Bhaskar et al. 2015). As in fastNeutrino, we find the demography that minimizes the Kullback–Leibler (KL) divergence between the expected and observed SFS using the L-BFGS-B algorithm with automatic differentiation. Unlike fastNeutrino, we apply regularization to the vector of log population sizes. Regularization helps the solver find well-behaved solutions by penalizing very short epochs with very large population sizes, which do not affect the SFS. We note that the specific choice of demographic fitting algorithm is not crucial, and any demographic inference method that accurately predicts the SFS (as this algorithm does, see Fig. 2) could be substituted without altering downstream analyses. A Python implementation of the fitting algorithm, which we refer to as fitsfs, is available in a GitHub repository at https://github.com/desai-lab/twosfs.
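The structure of such an objective can be sketched as below. This is not the fitsfs implementation: the exact form of the regularization is not specified in the text, so the squared-difference penalty on adjacent log sizes is an assumption, as are the function names and the `expected_sfs` callable, which stands in for a routine that maps epoch sizes to an expected SFS. The stand-in used in the example ignores its input and returns the constant-size Kingman shape, which is proportional to 1/i.

```python
import math

def kl_divergence(observed, expected):
    """KL divergence D(observed || expected) after normalizing each SFS to sum to 1."""
    obs_total, exp_total = sum(observed), sum(expected)
    kl = 0.0
    for o, e in zip(observed, expected):
        p, q = o / obs_total, e / exp_total
        if p > 0:
            kl += p * math.log(p / q)
    return kl

def roughness_penalty(log_sizes):
    """Assumed penalty: squared differences between log sizes of adjacent epochs."""
    return sum((b - a) ** 2 for a, b in zip(log_sizes, log_sizes[1:]))

def objective(log_sizes, observed, expected_sfs, strength):
    """Quantity a numerical optimizer would minimize over log_sizes."""
    fit = kl_divergence(observed, expected_sfs(log_sizes))
    return fit + strength * roughness_penalty(log_sizes)

# Hypothetical stand-in model: constant-size Kingman shape, independent of log_sizes.
kingman_shape = lambda log_sizes: [1 / i for i in range(1, 6)]
observed = [60, 30, 20, 15, 12]  # exactly proportional to 1/i for i = 1..5
flat = objective([0.0] * 5, observed, kingman_shape, strength=1.0)
bumpy = objective([0.0, 1.0, 0.0, 0.0, 0.0], observed, kingman_shape, strength=1.0)
```

In a real fit, `objective` would be passed to a bounded quasi-Newton optimizer such as L-BFGS-B, and the penalty strength would trade off smoothness of the inferred history against the closeness of the fit.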
Fig. 2.
a) Simulated site frequency spectra for example constant-size-Kingman, beta coalescent, positive-selection, and exponential population size growth models, compared with the expectations from the corresponding best-fit Kingman demographic models. The examples shown here are for (beta coalescent); , (positive selection); and , (exponential growth). Note that site frequency spectra are shifted vertically relative to each other to aid in visibility (thus, while relative frequencies in each curve are accurate, the overall normalization is not). b) The inferred best-fit Kingman demographic models for each of the 4 examples shown in (a). Population size and time in the past have units of an arbitrary coalescent timescale.
Our choice of this 5-epoch model is designed to be conservative in allowing for highly flexible Kingman population histories, when compared with more restrictive assumptions such as a piecewise constant model with only one or 2 epochs, or models that make assumptions about the shape of past population growth. As we will see below, the inferred 5-epoch Kingman demographic model is typically an excellent fit to the observed SFS, even when the underlying model is very different (this is precisely why the SFS alone has limited power to test the Kingman assumptions).
Null 2-SFS and recombination rate
Once the best-fit demography has been inferred from the SFS, we generate the 2-SFS predicted by the Kingman coalescent with that demography, which we refer to as the null 2-SFS, by simulating genealogies using msprime. We note that this predicted 2-SFS depends on the recombination rate r, which determines how quickly the 2-SFS decays towards the product of the corresponding SFSs as a function of d. However, the correct choice of recombination rate may often be unknown. One possible approach would be to restrict our analysis to pairs of sites that belong to segments that have not recombined in the history of the sample. However, errors in our inferences of the boundaries of these nonrecombined blocks could lead to incorrect rejection of the Kingman model. Therefore, to be conservative in the face of uncertainty in the recombination rate, we instead simulate multiple candidate null 2-SFS with different recombination rates, and choose the recombination rate that minimizes our ability to reject the Kingman model (as described in more detail below).
Statistic for comparing expected and observed 2-SFS
We wish to compare the expected 2-SFS under the best-fit Kingman demographic model (the null 2-SFS) to the 2-SFS observed in the data. To do so, we use a form of the Kolmogorov–Smirnov (KS) distance (Kolmogorov 1933; Smirnov 1948) generalized to 3 variables (i, j, and d; we treat r as a constant here), by implementing the procedure described in Gosset (1987). The multidimensional KS distance is a nonparametric statistic that measures the degree to which an empirical distribution (here the observed 2-SFS) matches a proposed generating distribution (here the null 2-SFS). In summary, it is the maximum absolute distance between the cumulative distribution functions (CDFs) generated by the observed and null 2-SFS, maximized again over all 8 cumulation directions when defining the multidimensional CDF (i.e. cumulating toward smaller or larger values along each of the 3 axes). We direct readers to Gosset (1987) for a more thorough description of this statistic.
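The idea of maximizing over cumulation directions can be illustrated on a discrete grid. The sketch below is an assumption-based simplification, not the authors' code and not a faithful reproduction of every detail in Gosset (1987): it treats both 2-SFS as nested lists indexed by [i][j][d], normalizes them, builds inclusive cumulative sums in each of the 8 directions, and returns the largest absolute difference. All function names are hypothetical.

```python
from itertools import product

def normalise(arr):
    """Scale a 3D nested list so that its entries sum to 1."""
    total = sum(x for plane in arr for row in plane for x in row)
    return [[[x / total for x in row] for row in plane] for plane in arr]

def cumulate(arr, flips):
    """3D cumulative sum; on axes where flips[axis] is True, cumulate from the high end."""
    shape = (len(arr), len(arr[0]), len(arr[0][0]))
    out = [[list(row) for row in plane] for plane in arr]
    for axis in range(3):
        n = shape[axis]
        order = range(n - 2, -1, -1) if flips[axis] else range(1, n)
        step = 1 if flips[axis] else -1
        for k in order:
            ranges = [range(m) for m in shape]
            ranges[axis] = [k]
            for idx in product(*ranges):
                prev = list(idx)
                prev[axis] += step  # neighbor that already holds the running sum
                out[idx[0]][idx[1]][idx[2]] += out[prev[0]][prev[1]][prev[2]]
    return out

def ks_distance_3d(observed, expected):
    """Maximum absolute CDF difference over all cells and all 8 cumulation directions."""
    p, q = normalise(observed), normalise(expected)
    shape = (len(p), len(p[0]), len(p[0][0]))
    best = 0.0
    for flips in product((False, True), repeat=3):
        cp, cq = cumulate(p, flips), cumulate(q, flips)
        for i, j, d in product(*(range(m) for m in shape)):
            best = max(best, abs(cp[i][j][d] - cq[i][j][d]))
    return best

# Two hypothetical 2x2x2 distributions with all mass in opposite corners.
corner_a = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
corner_b = [[[0, 0], [0, 0]], [[0, 0], [0, 1]]]
```

Because the statistic is a maximum of absolute differences between normalized CDFs, it always lies between 0 (identical distributions) and 1 (fully separated distributions).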
Null KS distribution and P-value
We next wish to determine whether the observed multidimensional KS distance is consistent with the data being drawn from the null 2-SFS. In other words, is the observed 2-SFS consistent with the 2-SFS expected based on the best-fit Kingman demographic model? The complex nature of our KS statistic and the noise associated with mutation accumulation and population sampling mean that it is not possible to derive an analytic expression for the range of “typical” KS distances expected if the null model is correct. Therefore, we approximate the null KS distribution through a resampling procedure. Specifically, we generate a low-noise null 2-SFS distribution by averaging over demographic simulations under the null demographic model. At every genomic distance d, we draw multinomial samples from this null distribution, with the number of draws set by the pair density: the number of pairs of sites at a distance d in the sample. This generates a resampled 2-SFS, which has the same number of pairs of sites at a distance d as the sampled data, but with an expectation value at every d equal to the null 2-SFS. Intuitively, this resampled 2-SFS distribution can be thought of as a version of the null 2-SFS “noised” to the level of the sampled data. We repeat the multinomial sampling (but not the simulations) to generate a set of resampled 2-SFS distributions. By calculating the KS distance between each resampled 2-SFS and the null 2-SFS, we generate an approximate null KS distribution to which the KS distance between the observed 2-SFS and the null 2-SFS can be compared. We then use this comparison to generate a P-value for the rejection of the Kingman model.
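The resampling step and the resulting empirical P-value can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the null 2-SFS at one distance is represented as a dictionary from (i, j) cells to probabilities, and the P-value is taken to be the fraction of null KS distances at least as large as the observed one, which is one common convention and may differ from the authors' exact definition.

```python
import random
from collections import Counter

def resample_two_sfs(null_probs, n_pairs, rng):
    """Multinomial draw of n_pairs site pairs from the null distribution at one distance d."""
    cells = list(null_probs)
    weights = [null_probs[cell] for cell in cells]
    draws = rng.choices(cells, weights=weights, k=n_pairs)
    return Counter(draws)

def empirical_p_value(observed_ks, null_ks_values):
    """Fraction of null KS distances that are at least as large as the observed distance."""
    exceed = sum(1 for value in null_ks_values if value >= observed_ks)
    return exceed / len(null_ks_values)

# Hypothetical null distribution over three (i, j) cells and a pair density of 1000.
rng = random.Random(0)
null_probs = {(1, 1): 0.5, (1, 2): 0.3, (2, 2): 0.2}
resampled = resample_two_sfs(null_probs, n_pairs=1000, rng=rng)
p_value = empirical_p_value(0.5, [0.1, 0.2, 0.6, 0.7])
```

Each resampled 2-SFS has exactly as many pairs as the data at that distance, so its sampling noise is comparable to that of the observed 2-SFS.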
As noted above, the null 2-SFS depends on the recombination rate, which is often unknown. We, therefore, compute this multidimensional KS statistic and use it to generate a P-value for the comparison with the observed 2-SFS for a range of different values of r. We then choose the value of the recombination rate that maximizes the P-value (i.e. minimizes our chance of rejecting Kingman coalescence). This ensures that we are conservative in rejecting the Kingman model in the face of uncertainty about the recombination rate.
To efficiently find this recombination rate, we choose candidate recombination rates using a golden-section search (Kiefer 1953). Starting with conservative lower and upper bounds for the recombination rate, the golden-section search algorithm iteratively proposes new candidate recombination rates and narrows the bounds through sequential evaluations of the KS distance between the observed 2-SFS and the corresponding null 2-SFS. The algorithm can be run for a given number of iterations or until some other stopping criterion is met; in this paper, we run the algorithm for 5 iterations.
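A generic golden-section search for a maximum can be sketched as follows. The function name, the toy score function, and the choice to return the midpoint of the final bracket are assumptions for illustration; in the pipeline, the score would be derived from the comparison between the observed and null 2-SFS at each candidate recombination rate, and whether the search runs on a linear or logarithmic scale is not specified here.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # reciprocal of the golden ratio, about 0.618

def golden_section_max(score, lower, upper, iterations=5):
    """Narrow [lower, upper] around the maximum of a unimodal score function."""
    a, b = lower, upper
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    score_c, score_d = score(c), score(d)
    for _ in range(iterations):
        if score_c > score_d:
            # The maximum lies in [a, d]; reuse c as the new upper interior point.
            b, d, score_d = d, c, score_c
            c = b - INV_PHI * (b - a)
            score_c = score(c)
        else:
            # The maximum lies in [c, b]; reuse d as the new lower interior point.
            a, c, score_c = c, d, score_d
            d = a + INV_PHI * (b - a)
            score_d = score(d)
    return (a + b) / 2

# Toy unimodal score with its maximum at 2.0.
toy_score = lambda x: -(x - 2.0) ** 2
coarse = golden_section_max(toy_score, 0.0, 5.0, iterations=5)
fine = golden_section_max(toy_score, 0.0, 5.0, iterations=40)
```

Each iteration needs only one new evaluation of the score and shrinks the bracket by a factor of about 0.618, which matters when every evaluation requires a fresh set of simulations.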
Model-checking pipeline
We have implemented this 2-SFS-based model-checking procedure in a Snakemake pipeline, which can be used to test whether any real-world or simulated population data are consistent with the Kingman coalescent. The pipeline is publicly available in a GitHub repository at https://github.com/desai-lab/twosfs. This repository has code to reproduce all results and figures from this manuscript and is straightforward to edit to test parameter values outside those explored in this paper. Users wishing to test real-world data using the pipeline must supply a JSON file containing the locations of all polymorphic sites and their associated derived or minor allele counts. This file uses a specific custom format, though we supply code for conversion from both VCF and text file formats. The pipeline further requires an upper and lower bound for the recombination rates and contains flags for various data cleanup choices. We direct readers to the README located in our GitHub repository for further details and instructions. Computational requirements for all steps in the model-checking pipeline are available in Supplementary Table S1.
Validation of our 2-SFS-based test with simulations
To test the performance of our model-checking procedure, we simulated coalescent histories using msprime (Baumdicker et al. 2022) and SLiM (Haller et al. 2019) under 4 classes of models: (1) the neutral, constant-size Kingman coalescent; (2) a neutral, exponentially growing Kingman coalescent; (3) a neutral, constant-size beta coalescent; and (4) a constant-size population undergoing positive selection at many sites along the genome. For each type of simulation, we tested a range of relevant parameter values, as shown in Table 1.
Table 1.
Models and parameter ranges for simulated coalescent processes.
| Model | Parameter range |
|---|---|
| Constant-size, neutral Kingman | N/A |
| Exponential growth | Growth rate γ: 0.25–2.0 per |
| | Growth time : 0.5– |
| Beta coalescent | α: 1.05–1.95 |
| Positive selection | Selective coefficient s: 0.005–0.08% |
| | Rate of selective mutations μ: – per site per genome per generation |
| | Diploid population size |
| | Genome length |
Note that the characteristic timescale is an arbitrary scaling factor that does not affect tree topologies, as the total population growth is controlled by the product of γ and the growth time. We do not specify a neutral mutation rate for any of the models because neutral diversity is added after the simulations finish in both msprime and SLiM (see Haller et al. 2019; Baumdicker et al. 2022 for more details).
A flexible Kingman demography reproduces features of a non-Kingman SFS
For every model-parameter combination, we first simulated the expected folded SFS of 100 samples. Because of the stochasticity at higher frequencies, we combined all mutations with frequency above a cutoff into one “lumped” high-frequency bin. As described above, we then used fitsfs to fit a piecewise-constant neutral demographic model to each lumped, folded SFS. We show one example of the resulting SFS from each of the 4 types of models we simulated, along with the corresponding fitsfs fits, in Fig. 2. We see that the observed site frequency spectra deviate strongly from the constant-size Kingman expectation for the 3 examples where this was not the underlying model. However, we find that a Kingman coalescent with a flexible population size can be fit to all 4 spectra nearly perfectly. This implies that any statistics based solely on the SFS, or transformations thereof, will have minimal power to distinguish the non-Kingman scenarios (here beta coalescent and positive selection models) from a sufficiently flexible Kingman demography.
The 2-SFS can distinguish demographic models with matching SFS
By contrast, we expect that the 2-SFS should allow us to distinguish non-Kingman scenarios from a Kingman demographic model that generates an identical SFS. To show this, we used msprime to simulate the Kingman coalescent with the piecewise-constant demographic histories inferred by fitsfs for the simulated models described above. This produced a set of pairs of simulations, each consisting of an original (potentially non-Kingman) model, along with the corresponding Kingman model with the piecewise-constant demographic history that is the best fit to the SFS from the original model.
By construction, these simulated Kingman coalescents produce SFS that are nearly identical to those of the corresponding original models. We then compared the 2-SFS produced by these simulated Kingman coalescents to those produced by our simulations of the original models. To visualize this comparison, in Fig. 3, we plot 4 examples of the log-ratio of the 2-SFS produced by the original models to those produced by the best-fit Kingman demographic model. We see that for the beta and positive selection cases, where the original model is not Kingman, there is a striking visual difference with the 2-SFS of the corresponding Kingman demographic model, despite the near-perfect fit to the SFS. On the other hand, the constant-size and exponentially growing Kingman coalescents show a signal consistent with simulation noise. Taken together, these results imply that the 2-SFS can distinguish between Kingman and non-Kingman coalescent models, even when the SFS fails to do so.
Fig. 3.
Log-ratio of the 2-SFS of the 4 example models shown in Fig. 2 with the 2-SFS expected under the corresponding best-fit piecewise-constant Kingman demographies.
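A log-ratio comparison of this kind can be sketched in a few lines. The sketch below makes several assumptions that are not taken from the paper: each 2-SFS is a square nested list of nonnegative entries, both are normalized to sum to 1 before comparison, the natural logarithm is used, and entries that are zero in either matrix are reported as NaN. The function names are hypothetical.

```python
import math


def normalize(matrix):
    """Scale a nonnegative matrix so that its entries sum to 1."""
    total = sum(sum(row) for row in matrix)
    return [[x / total for x in row] for row in matrix]


def log_ratio_2sfs(observed, reference):
    """Entry-wise log(observed / reference) after normalizing both inputs.

    Entries that are zero in either matrix are returned as NaN because
    the log-ratio is undefined there.
    """
    obs, ref = normalize(observed), normalize(reference)
    return [
        [math.log(o / r) if o > 0 and r > 0 else float("nan")
         for o, r in zip(row_o, row_r)]
        for row_o, row_r in zip(obs, ref)
    ]


if __name__ == "__main__":
    print(log_ratio_2sfs([[3, 1], [1, 3]], [[2, 2], [2, 2]]))
```

Values near zero everywhere indicate that the two models agree up to noise, while structured positive and negative regions indicate frequency pairs that one model over- or under-represents.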
Power analysis of our model-checking procedure
The examples shown above demonstrate visually that there is information in the 2-SFS that can potentially be used to distinguish Kingman from non-Kingman coalescent processes. To determine whether the statistical test we introduced above effectively uses this information, we validated our model-checking pipeline with the simulated models from Table 1. Proper validation requires several replicate simulated SFS and 2-SFS (i.e. multiple simulations of and ), which are computationally expensive to generate from scratch for the large number of models we simulate. Therefore, to save computational resources, we reemployed the resampling method described earlier. For each model-parameter combination, we generated 100 simulated 2-SFS, , by resampling the low-noise 2-SFS 100 times at for genomic distances , 6, …, , approximately matching the pair density of 4-fold degenerate sites in the D. melanogaster dataset we describe below. Again, each of these can be thought of as a 2-SFS whose expectation value matches the simulated coalescent model but is noised to mimic real-world data. We ran our model-checking pipeline independently for each , generating 100 validation runs of the procedure for every model-parameter combination.
We plot the power to reject Kingman coalescence at a p-value threshold of 0.05 in Fig. 4. As seen in Fig. 4a–c, at biologically realistic sample sizes we have high power to reject Kingman coalescence for models that involve highly skewed offspring distributions or strong positive selection, and low false-rejection rates for neutral exponential growth. In other words, we correctly reject Kingman coalescence whenever the underlying model involves sufficiently strong non-Kingman processes, but do not incorrectly reject the model in any of the scenarios involving exponential growth. This trend holds despite the 3 model classes spanning similar levels of distortion of the SFS, as measured by Tajima’s D (Fig. 4d).
Fig. 4.
a–c) Power to reject Kingman coalescence in simulations across a range of parameter values for several different classes of models. For the beta coalescent and positive selection, power increases as simulations move away from neutrality, as expected. In exponentially growing Kingman coalescents, false rejection rates remain low for all parameter values. d) Tajima’s D, a measure of the degree to which the SFS is distorted relative to its expectation under a constant-size Kingman model, versus power to reject Kingman coalescence. Each point is the average Tajima’s D for all simulations of a specific model-parameter combination. Note that our model-checking pipeline demonstrates high power to detect non-Kingman coalescence and low false-rejection rates for Kingman models with nonconstant population size history, despite similar distortions to the SFS as measured by Tajima’s D.
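For readers who want to reproduce the summary statistic on the horizontal axis of panel d, Tajima’s D (Tajima 1989) can be computed directly from an SFS. The sketch below assumes a full-length spectrum with one entry per allele count from 1 to n − 1; because the weights i(n − i) are symmetric, the orientation of alleles does not affect the result. The function name is hypothetical and this is not the authors' code.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an SFS with entries for allele counts 1..n-1."""
    n = len(sfs) + 1
    seg_sites = sum(sfs)
    if seg_sites == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")

    # Average pairwise diversity: a site at count i differs in i*(n-i) pairs.
    n_pairs = n * (n - 1) / 2
    pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


if __name__ == "__main__":
    neutral = [100.0 / i for i in range(1, 10)]  # expected SFS shape, n = 10
    print(tajimas_d(neutral))
```

A spectrum proportional to 1/i, the constant-size Kingman expectation, gives D close to zero, while an excess of rare variants gives negative D.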
Real-world genomic data often has complexities not directly included in simulated data—for example, sequencing noise can have a significant impact on measured diversity. Furthermore, pairs of SNPs, particularly those at close distances, may not come from 2 independent mutations (as we assume in this analysis) but rather a single, complex mutation. Researchers may, therefore, wish to exclude pairs of sites at (e.g. to minimize the effect of complex mutations) or reduce the maximum genomic distance analyzed (e.g. to reduce the impact of larger structural variation). We therefore reran the simulated data through our model-checking pipeline after artificially adding varying levels of sequencing noise (Supplementary Fig. S2), dropping pairs of sites at genomic distance from the 2-SFS (Supplementary Fig. S3), or varying the maximum distance of pairs of sites included in the analysis (Supplementary Fig. S4). Our model-checking pipeline maintains high power and low false-rejection rates in all cases except for the largest level of sequencing noise we tested.
Analysis of D. melanogaster data
We next applied our method to analyze sequence data from the DPGP3 data set, which consists of haploid consensus sequences from flies, obtained via the haploid embryo method of Langley et al. (2011). The SNP calls that characterize these sequences were subjected to a variety of quality filters as described in Lack et al. (2015). We obtained the DPGP3 consensus sequence files version 1.1 for the 2L, 2R, 3L, and 3R chromosome arms from www.johnpool.net/genomes.html. These files contain sequence alignments of all flies in the sample on all chromosome arms. We also downloaded the November 3, 2016 spreadsheet of inversions available at the same link. For each chromosome arm, we excluded any samples with an inversion in that arm and then randomly down-sampled to flies. As a result, the data for each chromosome arm is from a different subset of individuals.
To ensure our analyses focused on putatively neutral variation, we filtered called SNPs to 4-fold degenerate sites. We then calculated the average pairwise diversity, Π, as a function of position for each autosomal chromosome arm. Pairwise diversity is high in the middle of each chromosome arm and lower near the centromeres and telomeres, in agreement with calculations by Corbett-Detig et al. (2015). Our modeling—and coalescent-based demographic inference in general—assumes that the distribution of gene genealogies is homogeneous along the chromosome. Therefore, we selected a 13–16 Mb “central” region of each arm with relatively homogeneous values of Π for further analysis. The boundary positions of these central regions are given in Table 2.
Table 2.
Central regions and fraction of sites above 90% coverage for the 4 D. melanogaster chromosome arms analyzed in this study.
| Chromosome arm | Central region analyzed | Fraction of sites kept |
|---|---|---|
| 2L | 1–17 Mb | 0.916 |
| 2R | 6–19 Mb | 0.928 |
| 3L | 1–17 Mb | 0.928 |
| 3R | 10–26 Mb | 0.912 |
The cutoff positions of central regions are referenced to the DPGP3 reference genome available at http://www.johnpool.net/genomes.html.
To ensure that the segregating mutations reflect true genetic diversity and not variation in calling errors, we excluded sites with fewer than 90 of the 100 genotypes called. This leaves over 90% of all sites and does not substantially alter the fraction of polymorphic sites (Table 2). For remaining sites with missing calls, we probabilistically imputed missing genotypes as either the major or minor allele based on the proportion of called genotypes at that site. Each missing genotype was assigned the minor allele with probability P and the major allele with probability 1 − P, where P is the minor allele fraction among called genotypes at that site.
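The coverage filter and imputation rule for a single site can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: genotypes are coded as 0 or 1 for a biallelic site with `None` marking a missing call, and the function name and the `min_called` parameter are hypothetical.

```python
import random


def filter_and_impute(genotypes, min_called=90, rng=None):
    """Drop a poorly covered site or impute its missing genotypes.

    genotypes: list of 0, 1, or None (missing) for one biallelic site.
    Returns None if fewer than `min_called` genotypes are called; otherwise
    returns a copy with each missing entry drawn as the minor allele with
    probability equal to the minor allele fraction among called genotypes.
    """
    rng = rng or random.Random()
    called = [g for g in genotypes if g is not None]
    if len(called) < min_called:
        return None  # site excluded by the coverage filter

    ones = sum(called)
    zeros = len(called) - ones
    minor = 1 if ones <= zeros else 0
    major = 1 - minor
    p_minor = min(ones, zeros) / len(called)

    return [
        g if g is not None else (minor if rng.random() < p_minor else major)
        for g in genotypes
    ]


if __name__ == "__main__":
    site = [1] * 10 + [0] * 85 + [None] * 5
    print(sum(filter_and_impute(site, rng=random.Random(0))))
```

Drawing from the called allele fraction keeps the expected allele frequency at each site unchanged, and a site that is monomorphic among called genotypes stays monomorphic after imputation.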
We ran each chromosome arm independently through our model testing pipeline, fitting Kingman demographies to the SFS, and comparing the 2-SFS of the data to the fit demographies. We plot these results in Fig. 5. For all chromosome arms, the best-fit Kingman demographies closely match the observed SFS (Fig. 5a) and show a recent, roughly twofold increase in population size (Fig. 5b). However, the 2-SFS of our inferred demographies do not match the 2-SFS observed in the data (Fig. 6), as can be seen visually (Fig. 6a–d) and verified numerically using our KS statistic (Fig. 6e). This implies that this D. melanogaster data is inconsistent with the Kingman coalescent, and that the best-fit Kingman demographies are not an accurate representation of the effective population size history but are instead fitting the effects of other types of non-Kingman processes. Our finding is consistent with recent work by Freund et al. (2023), who argue that the unfolded SFS in this population is inconsistent with a Kingman model (though that study is limited to considering models with exponential growth).
Discussion
We have shown that the 2-SFS is sensitive to multiple mergers, but largely invariant to population growth in the Kingman coalescent, making it well-suited for coalescent model-checking. We developed and validated a model-checking procedure that uses this information to discriminate Kingman from non-Kingman coalescence, and demonstrated the power of our approach in simulated data. We then applied this method to data from D. melanogaster, which is believed to be strongly shaped by natural selection, and found evidence that population growth alone cannot explain the correlation structure in the 2-SFS in this system.
We emphasize that our 2-SFS-based test is fundamentally different from approaches based on the SFS or on statistics derived from the SFS. For example, several recent studies have developed methods to use the SFS to distinguish multiple-merger coalescents from Kingman models with specific forms of population growth (Birkner et al. 2013; Eldon et al. 2015; Blath et al. 2016; Spence et al. 2016; Koskela 2018; Koskela and Wilke Berenguer 2019; Árnason et al. 2023; Freund et al. 2023). Many of these methods rely on signal in the unfolded SFS (because deviations from Kingman models create a “U-shaped” SFS), and are therefore sensitive to orientation errors. In contrast to these methods, our approach uses the SFS to infer the best-fit Kingman demographic model, and then asks whether this best-fit Kingman model is consistent with the different information contained in the joint frequency spectrum of pairs of sites. This takes advantage of information about genealogies that is not present in the SFS, and also avoids the sensitivity to orientation errors that is inherent to unfolded data. The relationship between our method and other more recent work (Johri et al. 2020; Freund and Siri-Jégousse 2021; Lepers et al. 2021; Menardo et al. 2021; Árnason et al. 2023) that uses an ABC framework to distinguish between coalescent models is more complex. These studies make use of several SFS-derived statistics as well as additional statistics related to clade size and linkage disequilibrium. These additional statistics are not directly related to the 2-SFS but may contain some related information.
We can get an intuitive understanding for why 2-SFS-based statistics are useful in distinguishing between coalescent models by considering how the 2-SFS depends on the distribution of branch lengths and tree topologies. Mathematically, the expected 2-SFS, can be directly related to the set Ψ of tree topologies allowed by the coalescent model:
| (3) |
where denotes the frequencies of mutations at some sites 1 and 2 and ψ is a particular tree topology. We note here that the first term in this equation, , depends only on the distribution of branch lengths (which can be manipulated arbitrarily using an appropriate choice of historical population size). On the other hand, the second term, , reflects only the distribution of tree topologies, which depends heavily on the particular coalescent model. This expression for the 2-SFS can be further expanded as:
| (4) |
Using Bayes’ Theorem, we can rewrite this as:
| (5) |
We note again that the first term, , depends only on the distribution of branch lengths, while the last term, , is just the expected SFS, . As argued above, by allowing the population size (and thus the coalescent rate) to be explicitly time-dependent, the SFS can be made arbitrarily similar between the Kingman coalescent and broad classes of multiple-merger coalescents. Therefore, we find a condition for 2 coalescent models to be theoretically distinguishable using the 2-SFS:
| (6) |
In summary, the dependence of the 2-SFS on tree topologies contains a term that depends on the coalescent model but not on the SFS. In other words, the 2-SFS distinguishes between models with identical SFS when the trees used to generate the SFS differ between models.
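The chain of reasoning in this passage can be written compactly as below. This is a hedged reconstruction from the surrounding prose rather than the published typesetting: the symbol $\xi_{ij}$ for the expected 2-SFS at frequencies $i$ and $j$, the use of $P(\cdot)$ for probabilities, and the model labels $A$ and $B$ are assumptions introduced here for illustration.

```latex
\begin{align}
\xi_{ij}
  &= \sum_{\psi \in \Psi} P(i, j \mid \psi)\, P(\psi) \\
  &= \sum_{\psi \in \Psi} P(j \mid i, \psi)\, P(i \mid \psi)\, P(\psi) \\
  &= \sum_{\psi \in \Psi} P(j \mid i, \psi)\, P(\psi \mid i)\, P(i)
\end{align}
% Two models A and B with matching SFS, P_A(i) = P_B(i), can still
% differ in their 2-SFS when, for some frequency i and topology \psi,
\begin{equation}
P_A(\psi \mid i) \neq P_B(\psi \mid i).
\end{equation}
```

The second line factors the joint probability of the two frequencies given a topology, and the third applies Bayes' theorem to $P(i \mid \psi)\, P(\psi)$. In this form the middle factor, the distribution of topologies conditional on observing a mutation at frequency $i$, is the only place where two models with the same SFS and the same branch-length-dependent factor can differ.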
We can further see from the above discussion why the 2-SFS is particularly useful in distinguishing Kingman from multiple-merger coalescents. In any coalescent model, the presence of a site at frequency i implies that there is a branch in the coalescent history that subtends i leaves. However, given this, the probability that the next coalescent event creates a branch that subtends k of these i leaves is uniformly distributed in the Kingman model, while more skewed offspring distributions can be created by multiple-merger events. These types of effects mean that the probability of a given topology conditional on observing the mutation at frequency i can differ substantially between Kingman and multiple-merger models.
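The uniform-split property of the Kingman coalescent can be checked with a short simulation. The sketch below builds only the topology of a binary coalescent tree by merging uniformly chosen pairs of lineages, then records the number of leaves under a randomly chosen child of the root, which plays the role of a clade of i = n leaves. The function names are hypothetical and the simulation is an illustration, not part of the authors' pipeline.

```python
import random
from collections import Counter


def kingman_root_split(n, rng):
    """Leaf count under a random child of the root of a Kingman tree.

    Only clade sizes are tracked: at each step two lineages chosen
    uniformly at random are merged, until two lineages remain.
    """
    sizes = [1] * n
    while len(sizes) > 2:
        i, j = rng.sample(range(len(sizes)), 2)
        merged = sizes[i] + sizes[j]
        sizes = [s for k, s in enumerate(sizes) if k not in (i, j)]
        sizes.append(merged)
    return rng.choice(sizes)  # pick one of the two root subclades


def split_distribution(n, reps, seed=0):
    """Empirical distribution of the root split size over `reps` trees."""
    rng = random.Random(seed)
    counts = Counter(kingman_root_split(n, rng) for _ in range(reps))
    return {k: counts[k] / reps for k in range(1, n)}


if __name__ == "__main__":
    for k, p in split_distribution(6, 20000, seed=1).items():
        print(k, round(p, 3))
```

With n = 6, each subclade size from 1 to 5 should appear with frequency close to 1/5. A multiple-merger model would replace the pairwise merge step with mergers of larger groups, which skews this distribution.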
We note that we have chosen to implement our model-checking procedure by first using the SFS to infer the best-fit Kingman demographic model, because this is the standard pipeline for demographic inference. We then test for consistency of this model with the observed 2-SFS. However, in principle, we could instead attempt to jointly fit both the SFS and 2-SFS with a Kingman demographic model, and then test whether we can reject this model based on the deviations of both of these spectra from the best-fit Kingman prediction. We expect that such an approach would find similar power to reject the Kingman model, since the inconsistency between the SFS and 2-SFS under Kingman assumptions arises from the deviations in tree topologies described above, which do not depend on how demographic inference is conducted. However, a rigorous analysis of this would require the development of a demographic inference method based on joint fitting of both the SFS and 2-SFS, and it is not clear how best to implement such an approach. This is an interesting topic for future work.
Throughout this study, we have focused on developing a statistical test that allows us to reject Kingman coalescents with flexible time-dependent population size histories. We have analyzed the power of this method when the true population history involves either a beta coalescent or recurrent positive selection. However, these are far from the only genealogical models that may describe a population’s history. For example, population structure or cultural transmission of reproductive success could also lead to deviations from Kingman assumptions. Researchers may often be interested in discriminating arbitrarily between these models, rather than simply rejecting a Kingman coalescent. For example, the differences in the 2-SFS produced by the beta coalescent and positive selection (Fig. 3) suggest that it may be possible to use 2-SFS-based statistics to discriminate between these 2 models. More generally, extending our framework to allow for comparison between 2 or more arbitrary coalescent models is an exciting area for future work. However, an important prerequisite is to develop methods to infer the parameters of such models that best fit the SFS. For example, to use our approach to distinguish between multiple-merger models with different values of α, we would first need to implement a method to jointly infer α and demography from the SFS.
We have also focused in this study on a single application of our statistical test to data from D. melanogaster. However, there are a broad range of possible further empirical applications. For example, one interesting direction would be to use 2-SFS-based statistics to assess the evidence for variation in multiple-merger coalescence within genomes and between species, potentially identifying genomic regions and organisms that are more likely to be under strong selection. Alternatively, one could survey multiple species using a data set such as the diversity data compiled by Corbett-Detig et al. (2015) or the data analyzed using a method based on the unfolded SFS by Freund et al. (2023). These are interesting avenues for future work, which hold the potential to reveal new information about the suitability of widely used population genetic models, and could provide further insight into the forces that determine genetic diversity.
Supplementary Material
Acknowledgments
We thank Arjun Biddanda, Maryn Carlson, Ivana Cvijović, Ben Good, Dick Hudson, Evan Koch, Joe Marcus, Richard Neher, Matthias Steinrücken, John Wakeley, and Aleksandra Walczak for helpful discussions and comments on the manuscript.
Contributor Information
Eliot F Fenton, Department of Physics, Harvard University, Cambridge, MA 02138, USA.
Daniel P Rice, Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; SecureBio, Cambridge, MA 02138, USA.
John Novembre, Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA; Department of Ecology & Evolution, University of Chicago, Chicago, IL 60637, USA.
Michael M Desai, Department of Physics, Harvard University, Cambridge, MA 02138, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
Data availability
Data and code used in our study are publicly available in a GitHub repository at https://github.com/desai-lab/twosfs.
Supplemental material available at GENETICS online.
Funding
DPR was supported by the Chicago Fellows Program of the University of Chicago. JN acknowledges support for this work from NIH grants GM108805 and HG007089. MMD acknowledges support from grant PHY-1914916 from the NSF and grant GM104239 from the NIH. This work was completed in part with resources provided by the University of Chicago Research Computing Center and the Harvard Faculty of Arts and Sciences Research Computing Center.
Literature cited
- Árnason E, Koskela J, Halldórsdóttir K, Eldon B. 2023. Sweepstakes reproductive success via pervasive and recurrent selective sweeps. Elife. 12:e80781. doi: 10.7554/eLife.80781.
- Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, et al. 2022. Efficient ancestry and mutation simulation with msprime 1.0. Genetics. 220(3):iyab229. doi: 10.1093/genetics/iyab229.
- Bhaskar A, Song YS. 2014. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat. 42(6):2469–2493. doi: 10.1214/14-AOS1264.
- Bhaskar A, Wang YXR, Song YS. 2015. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 25(2):268–279. doi: 10.1101/gr.178756.114.
- Birkner M, Blath J, Eldon B. 2013. Statistical properties of the site-frequency spectrum associated with lambda-coalescents. Genetics. 195(3):1037–1053. doi: 10.1534/genetics.113.156612.
- Blath J, Cronjäger MC, Eldon B, Hammer M. 2016. The site-frequency spectrum associated with Ξ-coalescents. Theor Popul Biol. 110:36–50. doi: 10.1016/j.tpb.2016.04.002.
- Bolthausen E, Sznitman AS. 1998. On Ruelle’s probability cascades and an abstract cavity method. Commun Math Phys. 197(2):247–276. doi: 10.1007/s002200050450.
- Coop G, Ralph P. 2012. Patterns of neutral diversity under general models of selective sweeps. Genetics. 192(1):205–224. doi: 10.1534/genetics.112.141861.
- Corbett-Detig RB, Hartl DL, Sackton TB. 2015. Natural selection constrains neutral diversity across a wide range of species. PLoS Biol. 13(4):e1002112. doi: 10.1371/journal.pbio.1002112.
- Cvijovic I, Good BH, Desai MM. 2018. The effect of strong purifying selection on genetic diversity. Genetics. 209(4):1235–1278. doi: 10.1534/genetics.118.301058.
- Desai MM, Walczak AM, Fisher DS. 2013. Genetic diversity and the structure of genealogies in rapidly adapting populations. Genetics. 193(2):565–585. doi: 10.1534/genetics.112.147157.
- Donnelly P, Kurtz TG. 1999. Particle representations for measure-valued population models. Ann Probab. 27(1):166–205. doi: 10.1214/aop/1022677258.
- Durrett R, Schweinsberg J. 2005. A coalescent model for the effect of advantageous mutations on the genealogy of a population. Stoch Process Their Appl. 115(10):1628–1657. doi: 10.1016/j.spa.2005.04.009.
- Eldon B. 2016. Inference methods for multiple merger coalescents. In: Pontarotti P, editor. Evolutionary biology: convergent evolution, evolution of complex traits, concepts and methods. Springer International Publishing. p. 347–371.
- Eldon B, Birkner M, Blath J, Freund F. 2015. Can the site-frequency spectrum distinguish exponential population growth from multiple-merger coalescents? Genetics. 199(3):841–856. doi: 10.1534/genetics.114.173807.
- Eldon B, Wakeley J. 2006. Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics. 172(4):2621–2633. doi: 10.1534/genetics.105.052175.
- Ferretti L, Klassmann A, Raineri E, Ramos-Onsins SE, Wiehe T, Achaz G. 2018. The neutral frequency spectrum of linked sites. Theor Popul Biol. 123:70–79. doi: 10.1016/j.tpb.2018.06.001.
- Freund F, Kerdoncuff E, Matuszewski S, Lapierre M, Hildebrandt M, Jensen JD, Ferretti L, Lambert A, Sackton TB, Achaz G. 2023. Interpreting the pervasive observation of U-shaped site frequency spectra. PLoS Genet. 19(3):1–18. doi: 10.1371/journal.pgen.1010677.
- Freund F, Siri-Jégousse A. 2021. The impact of genetic diversity statistics on model selection between coalescents. Comput Stat Data Anal. 156:107055. doi: 10.1016/j.csda.2020.107055.
- Fu YX. 1995. Statistical properties of segregating sites. Theor Popul Biol. 48(2):172–197. doi: 10.1006/tpbi.1995.1025.
- Gillespie JH. 2000a. Genetic drift in an infinite population: The pseudohitchhiking model. Genetics. 155(2):909–919. doi: 10.1093/genetics/155.2.909.
- Gillespie JH. 2000b. The neutral theory in an infinite population. Gene. 261(1):11–18. doi: 10.1016/S0378-1119(00)00485-6.
- Gillespie JH. 2001. Is the population size of a species relevant to its evolution? Evolution. 55:2161–2169. doi: 10.1111/j.0014-3820.2001.tb00732.x.
- Good BH, Walczak AM, Neher RA, Desai MM. 2014. Genetic diversity in the interference selection limit. PLoS Genet. 10(3):e1004222. doi: 10.1371/journal.pgen.1004222.
- Gosset E. 1987. A three-dimensional extended Kolmogorov-Smirnov test as a useful tool in astronomy. Astron Astrophys. 188(1):258–264. ADS Bibcode: 1987A&A…188.258G.
- Griffiths RC, Tavaré S. 1994. Sampling theory for neutral alleles in a varying environment. Philos Trans R Soc Lond B Biol Sci. 344(1310):403–410. doi: 10.1098/rstb.1994.0079.
- Griffiths RC, Tavaré S. 1998. The age of a mutation in a general coalescent tree. Commun Stat Stoch Models. 14(1–2):273–295. doi: 10.1080/15326349808807471.
- Hahn M. 2018. Molecular population genetics. Oxford University Press. (Sinauer Series).
- Hallatschek O. 2018. Selection-like biases emerge in population models with recurrent jackpot events. Genetics. 210(3):1053–1073. doi: 10.1534/genetics.118.301516.
- Haller BC, Galloway J, Kelleher J, Messer PW, Ralph PL. 2019. Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol Ecol Resour. 19(2):552–566. doi: 10.1111/men.2019.19.issue-2.
- Hudson RR. 1983. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol. 23(2):183–201. doi: 10.1016/0040-5809(83)90013-8.
- Hudson RR. 2001. Two-locus sampling distributions and their application. Genetics. 159(4):1805–1817. doi: 10.1093/genetics/159.4.1805.
- Hudson RR. 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics. 18(2):337–338. doi: 10.1093/bioinformatics/18.2.337.
- Johri P, Charlesworth B, Jensen JD. 2020. Toward an evolutionarily appropriate null model: Jointly inferring demography and purifying selection. Genetics. 215(1):173–192. doi: 10.1534/genetics.119.303002.
- Johri P, Riall K, Becher H, Excoffier L, Charlesworth B, Jensen JD. 2021. The impact of purifying and background selection on the inference of population history: Problems and prospects. Mol Biol Evol. 38(7):2986–3003. doi: 10.1093/molbev/msab050.
- Kern AD, Hahn MW. 2018. The neutral theory in light of natural selection. Mol Biol Evol. 35(6):1366–1371. doi: 10.1093/molbev/msy092.
- Kiefer J. 1953. Sequential minimax search for a maximum. Proc Am Math Soc. 4(3):502–506. doi: 10.1090/proc/1953-004-03.
- Kingman JFC. 1982a. On the genealogy of large populations. J Appl Probab. 19(A):27–43. doi: 10.2307/3213548.
- Kingman JFC. 1982b. The coalescent. Stoch Process Their Appl. 13(3):235–248. doi: 10.1016/0304-4149(82)90011-4.
- Kolmogorov A. 1933. Sulla determinazione empirica di una legge di distribuzione. G Ist Ital Degli Attuari. 4:83–91.
- Koskela J. 2018. Multi-locus data distinguishes between population growth and multiple merger coalescents. Stat Appl Genet Mol Biol. 17(3). doi: 10.1515/sagmb-2017-0011.
- Koskela J, Wilke Berenguer M. 2019. Robust model selection between population growth and multiple merger coalescents. Math Biosci. 311:1–12. doi: 10.1016/j.mbs.2019.03.004.
- Lack JB, Cardeno CM, Crepeau MW, Taylor W, Corbett-Detig RB, Stevens KA, Langley CH, Pool JE. 2015. The Drosophila genome nexus: A population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics. 199(4):1229–1241. doi: 10.1534/genetics.115.174664.
- Langley CH, Crepeau M, Cardeno C, Corbett-Detig R, Stevens K. 2011. Circumventing heterozygosity: Sequencing the amplified genome of a single haploid Drosophila melanogaster embryo. Genetics. 188(2):239–246. doi: 10.1534/genetics.111.127530.
- Lepers C, Billiard S, Porte M, Méléard S, Tran VC. 2021. Inference with selection, varying population size, and evolving population structure: Application of ABC to a forward-backward coalescent process with interactions. Heredity. 126:335–350. doi: 10.1038/s41437-020-00381-x.
- Li H, Durbin R. 2011. Inference of human population history from individual whole-genome sequences. Nature. 475(7357):493–496. doi: 10.1038/nature10231.
- Menardo F, Gagneux S, Freund F. 2021. Multiple merger genealogies in outbreaks of Mycobacterium tuberculosis. Mol Biol Evol. 38(1):290–306. doi: 10.1093/molbev/msaa179.
- Möhle M, Sagitov S. 2001. A classification of coalescent processes for haploid exchangeable population models. Ann Probab. 29:1547–1562.
- Myers S, Fefferman C, Patterson N. 2008. Can one learn history from the allelic spectrum? Theor Popul Biol. 73(3):342–348. doi: 10.1016/j.tpb.2008.01.001.
- Neher RA, Hallatschek O. 2013. Genealogies of rapidly adapting populations. Proc Natl Acad Sci U S A. 110(2):437–442. doi: 10.1073/pnas.1213113110.
- Nicolaisen LE, Desai MM. 2012. Distortions in genealogies due to purifying selection. Mol Biol Evol. 29(11):3589–3600. doi: 10.1093/molbev/mss170.
- Pitman J. 1999. Coalescents with multiple collisions. Ann Probab. 27(4):1870–1902. doi: 10.1214/aop/1022874819.
- Rödelsperger C, Neher RA, Weller AM, Eberhardt G, Witte H, Mayer WE, Dieterich C, Sommer RJ. 2014. Characterization of genetic diversity in the nematode Pristionchus pacificus from population-scale resequencing data. Genetics. 196(4):1153–1165. doi: 10.1534/genetics.113.159855.
- Rosenberg NA, Nordborg M. 2002. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet. 3(5):380–390. doi: 10.1038/nrg795.
- Sagitov S. 1999. The general coalescent with asynchronous mergers of ancestral lines. J Appl Probab. 36(4):1116–1125. doi: 10.1239/jap/1032374759.
- Sagitov S. 2003. Convergence to the coalescent with simultaneous multiple mergers. J Appl Probab. 40(4):839–854. doi: 10.1239/jap/1067436085.
- Sargsyan O, Wakeley J. 2008. A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organisms. Theor Popul Biol. 74(1):104–114. doi: 10.1016/j.tpb.2008.04.009.
- Schraiber JG, Akey JM. 2015. Methods and models for unravelling human evolutionary history. Nat Rev Genet. 16(12):727–740. doi: 10.1038/nrg4005.
- Schrider DR, Shanku AG, Kern AD. 2016. Effects of linked selective sweeps on demographic inference and model selection. Genetics. 204(3):1207–1223. doi: 10.1534/genetics.116.190223.
- Schweinsberg J. 2000. Coalescents with simultaneous multiple collisions. Electron J Probab. 5:1–50. doi: 10.1214/EJP.v5-68.
- Schweinsberg J. 2003. Coalescent processes obtained from supercritical Galton–Watson processes. Stoch Process Their Appl. 106(1):107–139. doi: 10.1016/S0304-4149(03)00028-0.
- Seger J, Smith WA, Perry JJ, Hunn J, Kaliszewska ZA, Sala LL, Pozzi L, Rowntree VJ, Adler FR. 2010. Gene genealogies strongly distorted by weakly interfering mutations in constant environments. Genetics. 184(2):529–545. doi: 10.1534/genetics.109.103556.
- Sella G, Petrov DA, Przeworski M, Andolfatto P. 2009. Pervasive natural selection in the Drosophila genome? PLoS Genet. 5(6):e1000495. doi: 10.1371/journal.pgen.1000495.
- Smirnov N. 1948. Table for estimating the goodness of fit of empirical distributions. Ann Math Stat. 19(2):279–281. doi: 10.1214/aoms/1177730256.
- Spence JP, Kamm JA, Song YS. 2016. The site frequency spectrum for general coalescents. Genetics. 202(4):1549–1561. doi: 10.1534/genetics.115.184101.
- Steinrücken M, Birkner M, Blath J. 2013. Analysis of DNA sequence variation within marine species using beta-coalescents. Theor Popul Biol. 87:15–24. doi: 10.1016/j.tpb.2013.01.007.
- Tajima F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics. 105(2):437–460. doi: 10.1093/genetics/105.2.437.
- Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 123(3):585–595. doi: 10.1093/genetics/123.3.585.
- Vitti JJ, Grossman SR, Sabeti PC. 2013. Detecting natural selection in genomic data. Annu Rev Genet. 47(1):97–120. doi: 10.1146/genet.2013.47.issue-1.
- Wakeley J. 2009. Coalescent theory: an introduction. Roberts & Company.
- Xie X. 2011. The site-frequency spectrum of linked sites. Bull Math Biol. 73(3):459–494. doi: 10.1007/s11538-010-9534-3.