Abstract
General genealogical processes such as Λ- and Ξ-coalescents, which respectively model multiple and simultaneous mergers, have important applications in studying marine species, strong positive selection, recurrent selective sweeps, strong bottlenecks, large sample sizes, and so on. Recently, there has been significant progress in developing useful inference tools for such general models. In particular, inference methods based on the site frequency spectrum (SFS) have received noticeable attention. Here, we derive a new formula for the expected SFS for general Λ- and Ξ-coalescents, which leads to an efficient algorithm. For time-homogeneous coalescents, the runtime of our algorithm for computing the expected SFS is where n is the sample size. This is a factor of faster than the state-of-the-art method. Furthermore, in contrast to existing methods, our method generalizes to time-inhomogeneous Λ- and Ξ-coalescents with measures that factorize as and respectively, where ζ denotes a strictly positive function of time. The runtime of our algorithm in this setting is We also obtain general theoretical results for the identifiability of the Λ measure when ζ is a constant function, as well as for the identifiability of the function ζ under a fixed Ξ measure.
Keywords: frequency spectrum, identifiability, multiple merger, simultaneous merger
WHEN summarizing sequence data from n individuals, a natural and often-used statistic is the site frequency spectrum (SFS), where is simply the number of sites at which k of n individuals carry the mutant (or the derived) allele. Despite being only numbers, the SFS still contains a surprising amount of information about the history and structure of the population from which the individuals were sampled. Indeed, for neutrally evolving populations that are well modeled by Kingman’s coalescent (Kingman 1982), the expected value of the SFS was first computed for populations of constant size (Fu 1995), extended to populations of variable size (Griffiths and Tavaré 1998; Polanski et al. 2003; Polanski and Kimmel 2003), and has since been used as a statistic for demographic inference in numerous studies (e.g., Nielsen 2000; Gutenkunst et al. 2009; Coventry et al. 2010; Gravel et al. 2011; Excoffier et al. 2013; Bhaskar et al. 2015; Kamm et al. 2015; Gao and Keinan 2016).
Yet not all populations are well modeled by Kingman’s coalescent. In fact, Kingman’s coalescent can be viewed as a special case of a broader class of coalescent processes called Λ-coalescents (Pitman 1999; Sagitov 1999). While Kingman’s coalescent permits only pairwise mergers of lineages, Λ-coalescents allow two or more lineages to merge simultaneously in a single coalescence event. Such events arise when a single individual has many offspring (Möhle and Sagitov 2001; Eldon and Wakeley 2006), under models of recurrent selective sweeps (Durrett and Schweinsberg 2004, 2005), in populations undergoing continuous strong selection (Neher and Hallatschek 2013; Schweinsberg 2015), and in many other models. Λ-Coalescents can further be seen as special cases of a broader class of coalescents called Ξ-coalescents (Schweinsberg 2000). In Ξ-coalescents, more than one merger event can occur simultaneously, resulting in simultaneous multiple mergers. While Ξ-coalescents have received less attention than Λ-coalescents in the literature, they still arise in certain models of selection (Huillet 2014), models of selective sweeps (Durrett and Schweinsberg 2005), models with repeated strong bottlenecks (Birkner et al. 2009), and for certain diploid mating models (Möhle and Sagitov 2003). Also, since Ξ-coalecents generalize Λ-coalescents, any results presented about Ξ-coalescents immediately pertain to Λ-coalescents.
More formally, time-homogeneous Ξ-coalescents are governed by a measure on the set Furthermore, we consider time-inhomogeneous Ξ-coalescents with measures that decompose into a time-independent part and a strictly positive function of time, where represents (for historical reasons) the inverse intensity. That is, the coalescent is now governed by the measure For example, for Kingman’s coalescent, the point mass at zero, and corresponds to the scaled effective population size at time t. For other models, does not necessarily correspond to the population size, but has an interpretation specific to the model. For example, Neher and Hallatschek (2013) show empirically that the rate of coalescence events in a model of continuous strong selection is a nonlinear function of the population size and the first two moments of the distribution of mutational effects. For a review of the mechanics of Λ-coalescents, see Pitman (1999) and for a review of Ξ-coalescents, see Schweinsberg (2000). For an alternative perspective, see Donnelly and Kurtz (1999) and Birkner et al. (2009) for a lookdown construction of particle systems with general reproduction mechanisms.
As mentioned above, the expected SFS for Kingman’s coalescent is well understood and can, in fact, be computed for an arbitrary ζ in time (Polanski and Kimmel 2003). For Λ- and Ξ-coalescents, however, the expected SFS can be computed only for constant ζ and the method for Λ-coalescents takes time (Birkner et al. 2013a) and the method for Ξ-coalescents takes time exponential in n as a sum over partitions of the first n numbers must be performed (Blath et al. 2015). Here we present a method that can compute the expected SFS for time-inhomogeneous Λ- and Ξ-coalescents with arbitrary ζ in time. In the case where ζ is a constant function, our method can compute the expected SFS in time given the rate matrix Q of the ancestral process, which is defined more precisely below. We also prove some results about the sample size needed to make Λ identifiable for popular classes of Λ measures for constant ζ, as well as results about the sample size needed to make ζ identifiable for a fixed
There has also been some related work on determining the asymptotic behavior of the expected SFS as In this setting, Berestycki et al. (2007, 2014) derive some simple formulas for time-homogeneous Λ-coalescents that come down from infinity. For finite n, however, these asymptotic formulas can be rather inaccurate. Indeed, even for Birkner et al. (2013a) show that for some Λ-coalescents, there is a sizable discrepancy between the asymptotic formulas and the SFS obtained by simulation, highlighting the need for finite-sample calculations. Nevertheless, such asymptotic results highlight some interesting properties of Λ-coalescents and are reviewed in Berestycki (2009).
The remainder of this article is organized as follows. We first present our main results about the computation of the SFS for time-inhomogeneous coalescents and discuss the practical runtime of our implementation. We also investigate the variation in the empirical SFS and study the ability to infer the underlying model, using the empirical SFS. Then, we prove some identifiability results about general coalescents. We conclude with a discussion on the implications of our results.
Main Theoretical Results on the Expected SFS
Here we present our theoretical results on the expected SFS for a general Ξ-coalescent with a measure of the form These results lead to an -time algorithm for computing the expected SFS and can be improved to if ζ is a constant function. Briefly, we use subsampling arguments to show that the expected SFS can be computed from where denotes the expected time to the most recent common ancestor for sample size Then, we show how to compute using a spectral decomposition of the rate matrix Q of the ancestral process (also known as the block-counting process) of the time-homogeneous coalescent corresponding to More specifically, Q is a lower triangular matrix, where is the instantaneous rate at which i unlabeled lineages merge to form j unlabeled lineages when For example, for Kingman’s coalescent,
Using this notation, we are now ready to state our main result. The rest of this section provides lemmas that contain formulas for the matrices in Theorem 1, as well as a proof of those lemmas and Theorem 1.
Theorem 1. Consider an arbitrary time-inhomogeneous Ξ-coalescent governed by a measure such that the expected time to the first coalescence for a sample of size k is finite for Let Then, there exists a universal matrix that does not depend on the measure and a matrix that depends on Ξ but not ζ, such that
where is the population-scaled mutation rate. Furthermore, this allows to be computed in time.
Computing the matrix L in Theorem 1 is costly. For time-homogeneous coalescents, it is possible to compute directly, resulting in the following corollary:
Corollary 1. In the same setting as Theorem 1, if ζ is a constant function, then can be computed in time.
In what follows, Lemmas 1 and 2 provide formulas to compute the universal matrix A, while Lemmas 3 and 4 provide formulas to compute L, which is related to the spectral decomposition of the rate matrix Q. The expected first coalescence times can be computed as (Polanski and Kimmel 2003; Bhaskar et al., 2015)
Note that since A and L do not depend on ζ, the SFS depends on time and the inhomogeneity of the coalescent process only through the first coalescence times
Lemma 1. Let denote the antisingleton entries (i.e., entries where exactly one individual has the ancestral allele and all other individuals have the derived allele) of the SFS for samples of sizes Then,
where the entries of are given by
Proof. We use induction to show that
| (1) |
Using exchangeability and a subsampling argument similar to that of Kamm et al. (2015, lemma 2), we obtain, for
| (2) |
which follows from removing an individual uniformly at random from a sample of size k. Now, define the level of as and note that (1) holds for level 1, i.e., for on the left-hand side. Assume that (1) holds for level Then,
where the first equality holds by the recursion (2) and the second equality holds by the inductive hypothesis, by noting that and are both one level below □
The following lemma relates to
Lemma 2. Let γ, and θ be defined as above. Then,
where is bidiagonal with and for and
Proof. As in the Proof of Lemma 1, we employ a subsampling argument. Consider a sample of size The only way that a subsample of size k can have a different time to the most recent common ancestor is if the removed individual is a singleton after all of the other lineages have coalesced. The probability that we remove that singleton to form our subsample is Then, the expected amount of time during which there is one singleton and all of the other individuals have coalesced scaled by the mutation rate is exactly the antisingleton entry. Thus,
for When there are only two lineages, so the total branch length is the antisingleton entry. Thus, Rewriting this as a matrix equation for completes the Proof. □
By combining Lemmas 1 and 2, we obtain the universal matrix We now show how to compute the Ξ-dependent matrix L. First, we establish the following result on the decomposition of the rate matrix Q; this result was also obtained by Möhle and Pitters (2014, equation 2.3) for the Bolthausen–Sznitman coalescent.
Lemma 3. Fix an arbitrary Ξ-coalescent with for where Let denote the rate matrix of the ancestral process corresponding to (that is, the process counting the number of extant lineages at time t). Then,
where with being the Kronecker delta that equals 1 if and 0 otherwise, and
Proof. By the construction of U,
which implies that Then, since U is triangular and has strictly positive diagonal entries, it is invertible. Therefore, □
The following result relates and
Lemma 4. Let and be defined as above. Fix an arbitrary Ξ measure and a strictly positive function ζ. Now consider a time-inhomogeneous coalescent governed by If for then
where is the diagonal matrix with denoting the first column of and denotes the submatrix of in rows and columns 2 through n.
Proof. Note that Therefore,
where the third equality follows from the fact that Q is lower triangular and hence so is its exponential. Now, since U is lower triangular, its inverse is as well. Therefore, we may ignore the value of Letting but with note that Then we have
Now, note that for all i by Lemma 3 and induction. This implies or Using this identity, we can rewrite the above expression for as
where Collecting these equations over in matrix form leads to the desired result. □
Using Lemma 4, we now see that the matrix L from Theorem 1 is simply Lemma 3 provides a recursion to compute U, and D may be computed by noting that and then since we have
Proof of Theorem 1. Combining Lemmas 1, 2, and 4, we obtain the equations in Theorem 1. For the runtime, note that each of the entries of U requires computations, and so computing U is The matrices composing A are known in closed form, however, and constructing D requires filling only entries, each requiring computations for a total of To then obtain the SFS from simply requires iterated matrix vector products taking time. The overall procedure thus requires
□
Lemma 5. For coalescents of the form where ζ is a constant function, can be computed recursively from and Q as follows:
Proof. The formulas follow immediately from the homogeneity of the process, recursing on the number of individuals, and noting that the probability that the first coalescence event for a sample of size k results in k lineages merging down to l lineages is
□
Proof of Corollary 1. Use Lemma 5 to compute in time. Then, by Theorem 1, which also takes time to compute.
□
Remark 1. Other than computing U, the algorithm presented in Theorem 1 is Thus, for the Bolthausen–Sznitman coalescent (Bolthausen and Sznitman 1998) or Kingman’s coalescent, where U is known in closed form (Möhle and Pitters 2014, theorem 1.1 and appendix), the SFS can be computed in time even for nonconstant ζ.
Remark 2. The above results can easily be extended to a coalescent where both ζ and depend on t, as long as is piecewise constant. For example, in the recent past the population may evolve according to a β-coalescent, whereas for t greater than some the population may evolve according to Kingman’s coalescent. By setting ζ appropriately in Theorem 1, one may obtain a “truncated SFS” (Kamm et al. 2015) for each different Then, using the truncated SFS for each epoch and the same machinery as in Kamm et al. (2015), one may compute the full SFS. The same techniques also allow one to consider multiple populations, with each population perhaps evolving according to its own Ξ measure.
Numerical Results
We implemented Theorem 1 and Corollary 1 in Mathematica, and the notebook is available upon request. We can compute the SFS for an arbitrary coalescent for a sample of size in ∼1 sec and a sample of size in a matter of minutes on a laptop computer, which is orders of magnitude faster than the >1 hr reported for a sample size of using the current state-of-the-art method (Blath et al. 2015). Furthermore, Blath et al. (2015) consider only specific Ξ measures where the number of simultaneous multiple mergers is restricted. Our method has the same runtime for all Ξ measures (after computing the rate matrix and the vector of first coalescence times). See Figure 1 for runtime vs. sample size. Furthermore, as noted above, if the spectral decomposition of the rate matrix Q is known, then the algorithm is We also present runtimes for the Bolthausen–Sznitman coalescent [which has a closed-form solution for the spectral decomposition (Möhle and Pitters 2014)] in Figure 1.
Figure 1.
Runtime result (in seconds). Theorem 1 was used to compute the SFS for the time-homogeneous Bolthausen–Sznitman coalescent. The solid line uses Lemma 3 to compute the spectral decomposition of Q resulting in a cubic runtime. The dashed line uses the closed-form representation of the spectral decomposition of the Bolthausen–Sznitman coalescent (Möhle and Pitters 2014, theorem 1.1) to compute the SFS in quadratic time. The dotted line uses Corollary 1, which is also quadratic.
As long as the rate matrix Q of the ancestral process can be found exactly, our method is numerically stable. This is the case for popular Λ-coalescents such as point-mass coalescents and β-coalescents, as well as point mass Ξ-coalescents. If the rate matrix must be evaluated numerically, however, high-precision computation may be needed to avoid potential numerical problems due to catastrophic cancellation.
Using simulations, we now investigate the variation in the empirical SFS across independent realizations of the coalescent process and study the ability to infer the underlying model, using the empirical SFS. We consider three different ζ’s, illustrated in Figure 2. Due to the association with population sizes in the case of Kingman’s coalescent, we refer to ζ as the history or population size history. However, we caution that depending on the finite population size model, ζ may not represent the population size, but some other biologically relevant parameter. We consider a constant size history, a bottleneck history that undergoes a temporary 10-fold size reduction, and a growth history with repeated population doublings. For each ζ, we consider β-coalescents with Note that corresponds to the Bolthausen–Sznitman coalescent, while corresponds to the Kingman coalescent. For each of the nine distinct values of we simulated independent trees with leaves.
Figure 2.
for three demographic scenarios: a constant size history, a bottleneck history that undergoes a temporary 10-fold size reduction, and a growth history with repeated population doublings. Note that the y-axis is stretched by
In Figure 3, we examine the observed variation in branch lengths across independent realizations of the coalescent process, from which we can deduce the variation in the observed SFS. Specifically, assume that each tree sampled from the coalescent process has the same mutation rate, and, without loss of generality, assume that time has been scaled such that the mutation rate is 1. Let be the sum of branch lengths with k leaves and recall that is the kth entry of the empirical SFS on the n observed individuals. Then, and In Figure 3, we plot for each simulated tree, as well as its expected value Defining for this case of we also plot an estimate of the standard deviation where is the empirical expectation. Now, if we sum the branch lengths and mutations over m independent trees [so then and ], then and describe the limiting behavior of both and as by the central limit theorem, and
Figure 3.
The distribution of the branch length subtending k leaves, for random trees under a β-coalescent and The solid line is the expected value from Theorem 1. We simulated 1000 independent trees per scenario and their branch length results are shown as gray dots and box plots; the dashed line denotes the estimated standard deviation of the distribution. Note the y-axis is stretched by The mean and standard deviation give the limiting behavior of for many independent trees, under the central limit theorem. For most k (say, ), the branch length is usually 0 and has high variance relative to the mean. Thus will tend to have higher relative accuracy for the smaller entries k.
A recent inconsistency result (Koskela et al. 2015, theorem 1) shows that a Λ measure cannot be inferred from a single tree (), even as Indeed, we see in Figure 3 that the branch lengths of a single tree can deviate substantially from For most k (say, ), typically or given a single tree. That is, for a single tree, branches subtending more than a few leaves are either not observed or much larger than the expected branch length. However, smaller k (especially the singletons, ) have smaller relative standard deviation and thus will tend to have lower relative error as m increases.
In the case of Kingman’s coalescent, ζ is inferred by minimizing the Kullback-Leibler (KL) divergence between a normalized version of the empirical SFS and a normalized version of the expected SFS (e.g., Bhaskar et al. 2015, equation 10). Recall that the KL divergence, between two discrete probability distributions, is where is defined to be 0. We investigate how KL divergence behaves as a function of the number m of independent trees simulated in the case of Λ-coalescents. Let be the expected SFS under model and the corresponding branch lengths summed over the first m simulated trees. Define as the true probability distribution of derived alleles under scenario and as the conditional distribution of derived alleles, given the first m trees simulated under In Figure 4, we plot the KL divergence as a function of m, for every considered above (that is, ζ is constant, bottleneck, or growth, and α is 1, 1.5, or 2). In this case, we see that minimizing identifies the true scenario with access to only a moderate number of independent trees (between 10 and 100).
Figure 4.
The KL divergence where is the distribution of derived alleles under scenario and is the conditional distribution of derived alleles, given the first m trees simulated under and a mutation hitting one of those trees. For m large enough, is minimized by the true parameters; i.e., can typically discriminate the true scenario for trees. For trees, is often, but not always, minimized by the true scenario.
Figure 4 is encouraging, as not too many independent trees are needed to distinguish between the different scenarios Unfortunately, in some cases it may be impossible to even sample two independent trees (J. Koskela, personal communication). For example, in the model of Birkner et al. (2013b), a multiple-merger event happens over a single “generation,” which can cause the multiple merger to affect unlinked sites, resulting in correlated coalescence times. However, in other models, multiple-merger events may affect the genome only locally, and thus trees from unlinked sites are independent. For example, in the selective sweep model of Durrett and Schweinsberg (2005), multiple mergers are caused by selective sweeps taking place over “generations,” and a site experiences a multiple merger if where respectively parameterize the population size, selection strength, and recombination distance to the selected site. Thus, the independence of unlinked trees is not necessarily determined by the Λ or Ξ measure itself, but instead by the prelimiting model.
Identifiability Results
Before attempting to infer ζ or Ξ in practice, it is important to know whether such inference is possible using the SFS. For instance, when inferring ζ, if two different functions and produce the same SFS, then it is impossible to distinguish between the two using only the SFS. In such a case, we say that ζ is not identifiable. For Kingman’s coalescent if one allows ζ to be an arbitrary positive function that produces a finite SFS, then ζ is not identifiable (Myers et al. 2008). ζ is identifiable in the case of Kingman’s coalescent, however, if one restricts ζ to be from a set of biologically realistic functions (technically, a set of functions with only a finite number of oscillations) (Bhaskar and Song 2014, theorem 11). We show that a similar result holds for all coalescents of the form where is fixed.
In general it is impossible to infer Ξ from the SFS if Ξ is not restricted. There has been some interest, however, in the case of distinguishing between a subset of Λ-coalescents (Eldon et al. 2015). We prove some results about the identifiability of the measure for various subsets of Λ measures when ζ is a constant function. We also consider the question posed by Eldon et al. (2015) of whether the SFS can distinguish between exponential growth under Kingman’s coalescent and a class of Λ-coalescents with constant ζ, and we show that indeed it is possible to distinguish between these cases with a surprisingly small number of samples. We note that our identifiability results require knowledge of the exact expected SFS, whereas Eldon et al. (2015) focus on the case where the expected SFS is approximated using an empirical SFS, which is what occurs in practice.
Throughout this section we assume that one has the exact expected SFS (i.e., the object computed by Theorem 1).
Identifiability of ζ for fixed Ξ measure
Before proceeding to the results and proofs, we first introduce some notation. Let denote the set of piecewise defined functions with at most K pieces made from some function family ℱ. Furthermore, let denote the sign-change complexity of ℱ. Informally, is the supremum of the number of times crosses 0 over functions which is related to the number of oscillations each is allowed to have [see Bhaskar and Song 2014, definition 4, for a formal definition of ]. We will also write for the number of 0 entries in in the spectral decomposition of Q for a coalescent on n individuals governed by Furthermore, denote by χ the space of Ξ measures such that for all k. That is, χ is the set of Ξ measures where for any sample size there is positive probability of a single pairwise merger. If we are considering only Λ-coalescents, then χ contains all Λ measures except for the star coalescent. We now present our main identifiability results and a conjectured bound on
Our main result on the identifiability of ζ is the following theorem.
Theorem 2. For an arbitrary Ξ-coalescent governed by the measure where is fixed, suppose and Then for each expected SFS there exists a unique consistent with
First, note that in the case of Kingman’s coalescent, for all n, and so in some sense, Kingman’s coalescent is optimal in terms of the number of samples needed to ensure that a certain model space is identifiable. For the Bolthausen–Sznitman coalescent, for all n, which follows from the spectral decomposition (Möhle and Pitters 2014, theorem 1.1). For the point mass Λ-coalescent with mass at for all odd entries of are 0 and so for thus implying that larger samples (relative to Kingman’s coalescent or the Bolthausen–Sznitman coalescent) are needed for this coalescent to ensure that a given model space is identifiable. We suspect that is the worst case among all Ξ-coalescents in χ for identifiability, resulting in the following conjecture:
Conjecture 1. For all and ,
If this conjecture is true, then the bound on the sample size needed to have identifiability in Theorem 2 can be simplified to
Identifiability of the Λ measure for a constant ζ
We also have the following results for Λ-coalescents about the identifiability of the Λ measure.
Theorem 3. Consider the set of point-mass Λ-coalescents: If Λ is restricted to be in this set and then the expected SFS uniquely determines Λ.
Theorem 4. Consider the set of β-coalescents: If Λ is restricted to be in this set and then the expected SFS uniquely determines Λ.
Theorem 5. Consider the set of coalescents that is, Kingman’s coalescent with exponential growth, point-mass coalescents, or β-coalescents. If Λ is restricted to be in this set and then the expected SFS uniquely determines Λ.
Theorem 5 gives a positive theoretical answer to the question of whether the SFS can distinguish between exponential growth and multiple-merger coalescents. Using the techniques presented below, it is straightforward to obtain similar results for other subsets of Λ-coalescents.
Proofs of the identifiability results
The following lemma is used in proving the theorems in this section and may be of independent interest, as it shows that given the SFS for n individuals one can compute the expected time to most recent common ancestor for sample sizes or vice versa.
Lemma 6. For all Λ- and Ξ-coalescents, there is a bijection between the expected SFS and the expected times to the most recent common ancestor.
Proof. Combine Lemmas 1 and 2 to see that with B and C being universal. Then, since B is upper triangular and all of its diagonal entries are nonzero, it is invertible. Furthermore, since C is bidiagonal and the diagonal entries are all nonzero, it is also invertible. Therefore, is invertible and since and are related through an invertible matrix, the transformation is bijective.
□
To prove Theorem 2 we use the following lemma.
Lemma 7. Let For all and all Λ other than (i.e., the star coalescent), the sequence is strictly increasing.
Proof. Consider a sample of size and a subsample of size k. Without loss of generality, assume individual is removed to produce the subsample. The time to the first event is the same for both samples unless the first event involves only individual and one lineage from . That is, the total rate when there are lineages is equal to the total rate when there are k lineages plus k times the rate at which exactly a particular pair of individuals coalesce. Formally,
By assumption, and so the total rates must be strictly increasing.
□
We now prove Theorem 2. Our proof relies heavily on the proof of the corresponding result for Kingman’s coalescent (Bhaskar and Song 2014, theorem 11). We essentially show that this setting satisfies the same hypotheses as the Kingman’s coalescent case and then use that result to complete our proof.
Proof of Theorem 2. By Lemma 6, the SFS is uniquely determined by Then, furthermore, note that from Lemma 3 the matrix U is invertible since it is triangular with all nonzero entries along the diagonal. Then, by the same argument as in Bhaskar and Song (2014, equation 12), we know that if the model space is not identifiable, then for each k not corresponding to a zero in D (contributing to ), must be the root of the Laplace transform of two different functions in the model space. By Lemma 7, these are all distinct, resulting in roots. Then, by taking sufficiently large, we obtain a contradiction via the generalized version of Descarte’s rule of signs (Bhaskar and Song 2014, theorem 4) and the theorem is proved.
□
We now prove Theorems 3, 4, and 5. The idea is to explicitly calculate the for the first few k for each allowed Λ measure and then use Lemma 6 to show that if Λ is uniquely determined by the first few then it is uniquely determined by
Proof of Theorem 3. for all Λ in the set of possible Λ’s. Consider Using Lemma 5 we see that
| (3) |
This is a monotonically decreasing function of and so Λ is uniquely determined by Then, appealing to Lemma 6, we see that Λ is uniquely determined by the SFS for
□
Proof of Theorem 4. A calculation similar to (3) gives for where This is a monotonically increasing function of and the claim follows from the same argument as in the proof of Theorem 3.
□
Proof of Theorem 5. Suppose that two distinct Λ-coalescents within the set of allowed models produce the same expected SFS for Then, by Lemma 6, they would have the same values of and By Theorems 3 and 4, we know that the Λ measures cannot both be point-mass coalescents or β-coalescents. From Bhaskar and Song (2014, corollary 8), we also know that the Λ measures cannot both be Kingman’s coalescent with different exponential growth parameters. There are thus three cases. They are all straightforward, albeit tedious.
Case 1. One Λ measure is a point-mass coalescent and the other is a β-coalescent. Letting (without loss of generality), we can explicitly compute for the point-mass coalescent and the β-coalescent, using the same recursive idea as in the Proof of Theorem 3. Let denote the probability that when there are m lineages, exactly k of them are involved in the next coalescence event. Then, by Lemma 5 In particular, for the point-mass coalescent this implies Now, recalling the expression of in (3) and letting
| (4) |
implies Plugging this into we see that for the point-mass coalescent,
| (5) |
A similar calculation for the β-coalescent shows that
| (6) |
with under the β-coalescent. Equating (5) and (6) and solving for t results in the solution or But, if then we see that the star coalescent, which corresponds to for the β-coalescent, which is not in the set of allowed β-coalescents. If we see that which corresponds to Kingman’s coalescent, and for the β-coalescent, which again is not in the set of allowed β-coalescents. Therefore, a point-mass coalescent and a β-coalescent with cannot have the same and simultaneously.
Case 2. One Λ measure is a point-mass coalescent and the other is Kingman’s coalescent with exponential growth. Without loss of generality, assume that for the point-mass Λ-coalescent. The exponential-growth Kingman’s coalescent model considered here has where is the exponential integral (Bhaskar et al. 2015, supplemental material equation 5). Then, the constraint implies where Furthermore, assuming this constraint and applying Theorem 1 to Kingman’s coalescent, we obtain
| (7) |
| (8) |
Now, in addition to if the two coalescents have the same values of and then the right-hand sides of (4) and (7) must agree, while the right-hand sides of (5) and (8) must agree. This implies
| (9) |
where
However, by Lemma 8 in the Appendix, there is no such that (9) holds.
Case 3. One Λ measure is a β-coalescent and the other is Kingman’s coalescent with exponential growth. If these two coalescents produce the same values of and then we must have in (6), and equating (6) and (8) implies
| (10) |
where
However, by Lemma 9 in the Appendix, there is no such that (10) holds.
Since each of the three cases results in a contradiction, we see that no such Λ measures exist, proving the identifiability claim.
□
Discussion
We have presented an efficient algorithm for computing the SFS for a very general class of coalescents. While Λ- and Ξ-coalescents seem to be primarily used in practice to model the genealogies of marine species (Árnason 2004; Hedgecock and Pudovkin 2011), these coalescents also model a wide range of other phenomena, including continuous strong positive selection (Neher and Hallatschek 2013), recurrent selective sweeps (Durrett and Schweinsberg 2004, 2005), strong bottlenecks (Birkner et al. 2009), and many others. Perhaps one of the reasons these coalescents are less widely used than Kingman’s coalescent is because efficient inference tools have not yet been developed to the same extent.
Multiple-merger coalescents have also attracted some interest recently in the context of extremely large sample sizes (Bhaskar et al. 2014). In such cases the sample size is too large for the assumption of only pairwise mergers of lineages imposed by Kingman’s coalescent to be biologically plausible, and indeed using Kingman’s coalescent to model such populations causes biases in inference (Bhaskar et al. 2014). It should be possible to extend the results presented in this article to discrete-time coalescents, such as the “exact coalescent” (Fu 2006) corresponding to the coalescent arising from the discrete-time Wright–Fisher process, or to any of the discrete-time random-mating models considered by Eldon and Wakeley (2006).
We also presented some encouraging identifiability results. While it is impossible in the general case to infer the inverse intensity function ζ or the measure of a Λ-coalescent from the SFS, for many biologically important cases identifiability does indeed hold. The method we presented for proving that the Λ measure is identifiable for constant ζ is powerful, but straightforward and should make it easy to prove whether the measure is identifiable for other sets of Λ- or Ξ-coalescents. While we considered the identifiability of Λ only for fixed, constant ζ and the identifiability of ζ for fixed Λ or Ξ, it would be interesting to see whether identifiability results can still be obtained for some model spaces while allowing both Λ and ζ to vary. It would also be interesting to extend our identifiability results for the Λ measure to some of the biologically relevant Ξ-coalescents.
Our identifiability results generally assumed access to the expected SFS. In practice, one observes a finite number of sites and so one has only a noisy estimate of the SFS. Our simulation study shows that, given a moderate number of independent trees, the empirical SFS is accurate enough to distinguish for some simple models. However, the effects of noisy data are still largely unknown, especially in cases where convergence to the expected SFS is not guaranteed. The accuracy of inferring ζ with the empirical SFS has been studied for Kingman’s coalescent (Terhorst and Song 2015), and it would be interesting to extend these results to general Λ-coalescents and to the inference of the Λ-measure itself; the results presented here should make such an analysis more tractable.
Acknowledgments
We thank Jere Koskela for helpful discussion on convergence to the expected SFS. This research is supported in part by National Institutes of Health (NIH) grant R01-GM108805, NIH training grant T32-HG000047, and a Packard Fellowship for Science and Engineering.
Appendix
Here we present two lemmas that are used in Theorem 5. Proofs are tedious but straightforward.
Lemma 8. For
where
Lemma 9. For
where
In what follows, let It is clear that for all Additionally,
| (A1) |
which follows from the definition of and a change of variables.
Proof of Lemma 8. First, by noting that it is easy to see that the denominator is strictly negative for We now show that the numerator is strictly positive for First, by rearranging terms we see that
| (A2) |
Then, note
Applying this inequality to the negative term on the right-hand side of (12), we see
which is >0 for any since and for □
Proof of Lemma 9. The denominator of (10) is equal to which is strictly positive for by definition of Furthermore, the numerator is strictly negative for by noting the following:
Therefore, (10) holds. □
Footnotes
Communicating editor: N. H. Barton
Literature Cited
- Árnason E., 2004. Mitochondrial cytochrome B DNA variation in the high-fecundity Atlantic cod: trans-Atlantic clines and shallow gene genealogy. Genetics 166: 1871–1885. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Berestycki J., Berestycki N., Schweinsberg J., 2007. Beta-coalescents and continuous stable random trees. Ann. Probab. 35: 1835–1887. [Google Scholar]
- Berestycki J., Berestycki N., Limic V., 2014. Asymptotic sampling formulae for Λ-coalescents, pp. 715–731 in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, Vol. 50. Institut Henri Poincaré; Available at: http://projecteuclid.org/euclid.aihp/1403276996. [Google Scholar]
- Berestycki N., 2009. Recent progress in coalescent theory. Ensaios Matematicos 16(1): 1–193. [Google Scholar]
- Bhaskar A., Song Y. S., 2014. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann. Stat. 42(6): 2469–2493. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhaskar A., Clark A. G., Song Y. S., 2014. Distortion of genealogical properties when the sample is very large. Proc. Natl. Acad. Sci. USA 111(6): 2385–2390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhaskar A., Wang Y. X. R., Song Y. S., 2015. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 25(2): 268–279. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Birkner M., Blath J., Möhle M., Steinrücken M., Tams J., 2009. A modified lookdown construction for the Xi-Fleming-Viot process with mutation and populations with recurrent bottlenecks. Alea 6: 25–61. [Google Scholar]
- Birkner M., Blath J., Eldon B., 2013a Statistical properties of the site-frequency spectrum associated with λ-coalescents. Genetics 195: 1037–1053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Birkner M., Blath J., Eldon B., 2013b An ancestral recombination graph for diploid populations with skewed offspring distribution. Genetics 193: 255–290. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blath, J., M. C. Cronjäger, B. Eldon, and M. Hammer, 2015 The site-frequency spectrum associated with Ξ-coalescents. bioRxiv Preprint. Available at: http://dx.doi.org/10.1101/025684. [DOI] [PubMed]
- Bolthausen E., Sznitman A. S., 1998. On Ruelle’s probability cascades and an abstract cavity method. Commun. Math. Phys. 197: 247–276. [Google Scholar]
- Coventry A., Bull-Otterson L. M., Liu X., Clark A. G., Maxwell T. J., et al. , 2010. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat. Commun. 1: 131. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Donnelly P., Kurtz T. G., 1999. Particle representations for measure-valued population models. Ann. Probab. 27(1): 166–205. [Google Scholar]
- Durrett R., Schweinsberg J., 2004. Approximating selective sweeps. Theor. Popul. Biol. 66(2): 129–138. [DOI] [PubMed] [Google Scholar]
- Durrett R., Schweinsberg J., 2005. A coalescent model for the effect of advantageous mutations on the genealogy of a population. Stoch. Proc. Appl. 115: 1628–1657. [Google Scholar]
- Eldon B., Wakeley J., 2006. Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics 172: 2621–2633. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eldon B., Birkner M., Blath J., Freund F., 2015. Can the site-frequency spectrum distinguish exponential population growth from multiple-merger coalescents? Genetics 199: 841–856. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Excoffier L., Dupanloup I., Huerta-Sánchez E., Sousa V. C., Foll M., 2013. Robust demographic inference from genomic and SNP data. PLoS Genet. 9(10): e1003905. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu Y.-X., 1995. Statistical properties of segregating sites. Theor. Popul. Biol. 48(2): 172–197. [DOI] [PubMed] [Google Scholar]
- Fu Y.-X., 2006. Exact coalescent for the Wright-Fisher model. Theor. Popul. Biol. 69(4): 385–394. [DOI] [PubMed] [Google Scholar]
- Gao F., Keinan A., 2016. Inference of super-exponential human population growth via efficient computation of the site frequency spectrum for generalized models. Genetics 202: 235–245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gravel S., Henn B. M., Gutenkunst R. N., Indap A. R., Marth G. T., et al. 1000 Genomes Project , 2011. Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA 108(29): 11983–11988. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Griffiths R., Tavaré S., 1998. The age of a mutation in a general coalescent tree. Commun. Stat. Stoch. Models 14(1–2): 273–295. [Google Scholar]
- Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5(10): e1000695. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hedgecock D., Pudovkin A. I., 2011. Sweepstakes reproductive success in highly fecund marine fish and shellfish: a review and commentary. Bull. Mar. Sci. 87(4): 971–1002. [Google Scholar]
- Huillet T. E., 2014. Pareto genealogies arising from a Poisson branching evolution model with selection. J. Math. Biol. 68(3): 727–761. [DOI] [PubMed] [Google Scholar]
- Kamm, J. A., J. Terhorst, and Y. S. Song, 2015 Efficient computation of the joint sample frequency spectra for multiple populations. arXiv Preprint. Available at: http://arxiv.org/abs/1503.01133. [DOI] [PMC free article] [PubMed]
- Kingman J. F. C., 1982. The coalescent. Stoch. Proc. Appl. 13: 235–248. [Google Scholar]
- Koskela, J., P. A. Jenkins, and D. Spanò, 2015 Bayesian non-parametric inference for Λ-coalescents: consistency and a parametric method. arXiv Preprint. Available at: http://arxiv.org/abs/1512.00982.
- Möhle M., Pitters H., 2014. A spectral decomposition for the block counting process of the Bolthausen-Sznitman coalescent. Electron. Commun. Probab. 19(47): 1–11. [Google Scholar]
- Möhle M., Sagitov S., 2001. A classification of coalescent processes for haploid exchangeable population models. Ann. Probab. 29(4): 1547–1562. [Google Scholar]
- Möhle M., Sagitov S., 2003. Coalescent patterns in diploid exchangeable population models. J. Math. Biol. 47(4): 337–352. [DOI] [PubMed] [Google Scholar]
- Myers S., Fefferman C., Patterson N., 2008. Can one learn history from the allelic spectrum? Theor. Popul. Biol. 73(3): 342–348. [DOI] [PubMed] [Google Scholar]
- Neher R. A., Hallatschek O., 2013. Genealogies of rapidly adapting populations. Proc. Natl. Acad. Sci. USA 110(2): 437–442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nielsen R., 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154: 931–942. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pitman J., 1999. Coalescents with multiple collisions. Ann. Probab. 27: 1870–1902. [Google Scholar]
- Polanski A., Kimmel M., 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165: 427–436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Polanski A., Bobrowski A., Kimmel M., 2003. A note on distributions of times to coalescence, under time-dependent population size. Theor. Popul. Biol. 63(1): 33–40. [DOI] [PubMed] [Google Scholar]
- Sagitov S., 1999. The general coalescent with asynchronous mergers of ancestral lines. J. Appl. Probab. 36(4): 1116–1125. [Google Scholar]
- Schweinsberg J., 2000. Coalescents with simultaneous multiple collisions. Electron. J. Probab. 5: 1–50. [Google Scholar]
- Schweinsberg, J., 2015 Rigorous results for a population model with selection II: genealogy of the population. arXiv Preprint. Available at: http://arxiv.org/abs/1507.00394.
- Terhorst J., Song Y. S., 2015. Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proc. Natl. Acad. Sci. USA 112(25): 7677–7682. [DOI] [PMC free article] [PubMed] [Google Scholar]




