Numerous studies in population genetics have been based on analyzing the sample frequency spectrum (SFS) summary statistic. Most SFS-based inference methods can display pathological behavior in optimization: some demographic model parameters can degenerate to 0...
Keywords: population size, expected sample frequency spectrum, coalescent theory, algebraic methods
Abstract
The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to zero or diverge to infinity, and show undesirable sensitivity to perturbations in the data. The goal of this article is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographies and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model, and generalize our intuition to arbitrary sample sizes using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size n under an arbitrary population history can be recapitulated by a piecewise-constant demography with only epochs, where is between and The set of expected SFS for piecewise-constant demographies with fewer than epochs is open and nonconvex, which causes the above phenomena for inference from data.
THE sample frequency spectrum (SFS), also known as the site or allele frequency spectrum, is a fundamental statistic in population genomics for summarizing the genetic variation in a sample of DNA sequences. Given a sample of n sequences from a panmictic (i.e., randomly mating) population, the SFS is a vector of length of which the kth entry corresponds to the number of segregating sites, each with k mutant (or derived) alleles and ancestral alleles. The SFS provides a concise way to summarize n sequences of arbitrary length into just numbers, and is frequently used in empirical population genetic studies to test for deviations from equilibrium models of evolution. For instance, the SFS has been widely used to infer demographic history where the effective population size has changed over time (Nielsen 2000; Gutenkunst et al. 2009; Gravel et al. 2011; Keinan and Clark; 2012; Excoffier et al. 2013; Bhaskar et al. 2015) and to test for selective neutrality (Kaplan et al. 1989; Achaz 2009). In fact, many commonly used population genetic statistics for testing neutrality, such as Watterson’s (Watterson 1975), Tajima’s (Tajima 1983), and Fu and Li’s (Fu and Li 1993) can be expressed as linear functions of the SFS (Durrett 2008).
In the coalescent framework (Kingman 1982a,b,c), the unnormalized expected SFS for a random sample of n genomes drawn from a population is obtained by taking the expectation of the SFS over the distribution of sample genealogical histories under a specified population demography. In this work, we will be concerned with well-mixed, panmictic populations with time-varying historical population sizes, evolving according to the neutral coalescent process with the infinite-sites model of mutation. The coalescent arises as the continuum limit of a large class of discrete models of random mating—such as the Wright–Fisher, Moran, and Cannings exchangeable family of models (Möhle and Sagitov 2001)—by a suitable rescaling of time and taking the population size to infinity. The infinite-sites model postulates that every mutation in the genealogy of a sample occurs at a distinct site and is commonly employed in population genetic studies for organisms with low population-scaled mutation rates, such as humans. The SFS also appears in the context of statistical modeling as a vector of probabilities. In particular, the normalized expected SFS defined by normalizing the entries of so that they sum to one, gives the probability that a mutation chosen at random is present in k out of n sequences in the sample. Unless stated otherwise, we use the term expected SFS to refer to the unnormalized quantity
The expected SFS is strongly influenced by the demographic history of the population, and extensive theoretical and empirical work has been done to characterize this dependence (Fu 1995; Wakeley and Hey 1997; Polanski et al. 2003; Marth et al. 2004; Chen 2012; Jouganous et al. 2017; Kamm et al. 2017). Fu (1995) showed that under the infinite-sites model for a panmictic population with constant size and no selection, the expected SFS is given by where denotes the population-scaled mutation rate. When the population size is variable, however, the formula for the expected SFS depends on the entire population size history. In particular, Polanski and Kimmel (2003) (in equations 13–15) showed that the expected SFS under a time-varying population size is given by with being an -by- invertible matrix that only depends on n (formula presented in Appendix) and where denotes the expected time to the first coalescence event in a random sample of size m drawn from the population at present. For any time-varying population size function the quantity is given by the following expression:
| (1) |
Pathologies of SFS-based inference algorithms
Let us consider a hypothetical scenario. Suppose we would like to learn about the population history of a group of finches on a remote island. Fossil evidence indicates that the island experienced many generations with ample resources leading to a large, roughly constant population size. Then, some catastrophe occurred, rendering the island’s resources scarce, leading to a small constant population size until the present. We are given four haplotypes from the population and we hope to infer the following parameters for a demographic model based on the history described above:
How big was the population during the epoch of plenty?
How big was the population during the epoch of scarcity?
When did the catastrophe occur, marking the break point?
First, we compute the SFS for the four haplotypes we collected. (Our choice of a sample size of four is for simplicity in this example, but the principles apply for larger samples.) We count singleton (appearing in only one of the haplotypes), doubleton, and tripleton mutations. We do not attempt to track nonsegregating sites. Now we have the SFS, a vector of three real numbers.
Next, we ask ourselves: would we expect to obtain this SFS for some particular set of parameters, based on our model? If the answer is yes, then that set of parameters is our best guess. In Figure 1, the green region describes the set of SFS we would expect for various parameters under this model. Blue dots indicate measured SFS. When the blue dots land in the green region, we simply infer the parameters corresponding to that point. The red crosses are the expected SFS computed for those parameters, so they coincide with the blue dots.
Figure 1.
The green region, denoted represents the set of expected SFS for two-epoch, piecewise-constant demographies for sample size Each blue hollow circle is the observed SFS simulated using msprime (Kelleher et al. 2016) under a constant population size coalescent with recombination using realistic mutation and recombination rates of mutations and crossovers per base pair per generation per haploid. Each sequence has 1000 unlinked loci of length 10 kb each, resulting in an average of 7300 segregating sites. The red crosses are the expected SFS inferred for these simulated SFS using fastNeutrino (Bhaskar et al. 2015); the dotted blue lines show the correspondence between the observed SFS and their projections onto For observed SFS lying in the interior of the observed SFS and their projections coincide, while the observed SFS lying outside project onto the boundaries of one of the two convex regions that form
What if the answer is no? That is, what if the SFS we measured would not be expected for any choice of parameters in our population history model? We have two options to interpret this situation: (1) statistical noise is making the SFS appear inconsistent with the model, or (2) our model is mis-specified. Let us suppose that noise is the culprit. Our strategy is then to look for the closest SFS that would be expected in our model, and infer the parameters associated with that one.
This runs into two problems: First off, the parameters inferred in this way are often nonsensical. In Figure 1, the blue dots outside of the green region are connected by dotted lines to the closest SFS vectors in the green region. Naturally, these mainly lie on the boundary of the green region. The problem is that the boundary points (with one exception that we will discuss later) do not actually correspond to achievable expected SFS vectors! Those points correspond to population size histories where one of the epochs is infinity or zero.
The second problem: Even though there is, in general, a unique closest SFS to a given point outside of the green region, the process of finding the closest point is highly sensitive to noise. Specifically, if you change the quantities in the vector by a small amount, the resulting closest point may change by a large amount. The reason for this is that the set is nonconvex, meaning that not all of the straight lines between points in the green region lie inside the green region. As a consequence, some of the blue dots point to the left-hand green region, while others nearby point to the right-hand green region. Sensitivity to noise is a big problem for inference. Any demographic inference method would manifest these pathologies; indeed, the commonly used (Gutenkunst et al. 2009), fastsimcoal2 (Excoffier et al. 2013), and fastNeutrino (Bhaskar et al. 2015) all encounter these issues.
If we hypothesize that the model may be mis-specified, we need to support this assertion. The following question will arise: How far away is our measured SFS from the type of SFS that we would expect under the rejected population model? Furthermore, we may be asked to offer an alternative hypothesis, i.e., is there another model that actually does allow for an SFS equal to or near the one that we measured? Both of these questions require an understanding of the set of all possible SFS.
Minimal demographic complexity for SFS reconstruction
Let us slightly change our finch example. Suppose we have no a priori assumptions regarding the demographic history. Instead, we are only interested in determining whether the SFS is consistent with a null hypothesis of a single panmictic population under neutrality. If the measured SFS is equal to the expected SFS for some demography, we may be asked to produce the simplest demography with the expected SFS we want. Work by Myers et al. (2008) implies that there are infinitely many population size histories with a given expected SFS, as long as we allow the demographies to be arbitrarily complicated. Bhaskar and Song (2014) (two of the authors of this article) demonstrated that when we constrain ourselves to a simpler family of population size histories, we may have a unique function achieving the desired expected SFS.
Now suppose that the SFS does not equal the expected SFS for any demography. Again, we would need to quantify how far away it is from being achieved by some demography. This is an intimidating task. How can we be certain to find the SFS corresponding to every demography without leaving any SFS vectors out? After all, the space of possible population size histories is infinite-dimensional! Our hope is to understand the shape of the set of all possible SFS vectors so we know that we have covered everything when we reject the null hypothesis.
For the small example of a sample size of four, we have demonstrated a sequence of constraints placed on SFS vectors in Figure 2. The vectors of interest have three coordinates corresponding to singleton, doubleton, and tripleton mutations. Note that any vector of probabilities must be nonnegative and must sum to one. This means we are constrained to the triangle with vertices and We can ignore the third coordinate since it will always be one minus the others. This triangle is depicted in yellow in Figure 2. One might naively hope that every one of these probability vectors is achievable as the expected SFS of some demography.
Figure 2.
Eliminating candidate normalized SFS vectors for This image considers candidate vectors and eliminates them for different reasons. A priori, any vector adding up to one is a possible SFS. This is represented by the yellow triangle whose third coordinate (not shown) is simply one minus the sum of the other two. Sargsyan and Wakeley (2008) showed that the SFS is nondecreasing, ruling out any vectors outside the blue triangle. Furthermore, they showed that the SFS is convex, therefore ruling out anything outside the red triangle. Finally, our algebraic analysis of the expected SFS for a piecewise-constant demography with two epochs rules out vectors outside the green region at the right.
A result proved by Sargsyan and Wakeley (2008) is that SFS vectors must be nonincreasing—this means that we are left with the triangle with vertices and This is depicted in blue in Figure 2. They further proved that the SFS is convex. This implies that the second coordinate is less than the average of the other two. This further cuts down our possibilities to the triangle with vertices and depicted in red in Figure 2. If we want SFS vectors for population size histories with two constant pieces, we are further constrained to the green region, which we will describe algebraically later.
We will be able to completely describe the shape of all SFS for a sample size of four using algebraic formulas for the boundary. In fact, we will show that to find all possible SFS for a sample size of four it is sufficient to consider piecewise-constant functions with at most three constant pieces. Furthermore, we will use tools from convex and algebraic geometry to extend our intuition from this small case study to the SFS for all sample sizes.
Summary of main results
Studying the geometry of the set of expected SFS will address both of the areas discussed above:
Explaining the pathologies in SFS-based inference.
Describing the full set of SFS for a fixed sample size.
In this way, we can help researchers understand why fitting parameters to certain demographic models will lead to runaway behavior. We also enable researchers to reject a null hypothesis of a single panmictic population under neutrality.
Our main result is Theorem 8 which focuses on piecewise-constant demographies. It shows that for every sample size n, there is a crucial threshold in demographic complexity, which we denote . If we are fitting to a demographic model with fewer than constant pieces, then the set of all SFS will be nonconvex and we must expect pathological behavior as described above. However, once we allow for constant pieces , we get the full set of SFS for all demographies. Proving that this set is convex is left for later work.
Piecewise-Constant Demographies
In this section, we will define two sets: one of them will be the set of expected SFS for piecewise-constant population size histories. As described in the Introduction, this is an important set for inference. The other set is the set of expected coalescence vectors; this is not as commonly used as the SFS, but it helps us build a strong understanding of the SFS. This is because it is related to the SFS by a simple transformation and yet it is much easier to formulate.
Let be the set of piecewise-constant population size functions with k pieces. Any population size function in is described by positive numbers, representing the k population sizes and the time points when the population size changes. Let which we call the SFS “manifold,” denote the set of all expected SFS vectors for a sample of size n that can be generated by population size functions in (Note that the sets and are not technically manifolds; they would be more accurately described as semialgebraic sets. However, for expository purposes, we use the widely known term manifold.) Similarly, let called the -coalescence manifold, denote the set of all vectors giving the expected first coalescence times of samples of size for population size functions in Let and respectively, be equal to the normalization of all points in and by their norms (i.e., the sums of their coordinates). Note that both manifolds live in and their normalized versions live in the -dimensional simplex this is the set of nonnegative vectors in whose coordinates sum to 1.
Now that we have defined our basic objects of study, we can describe the remainder of the article: First, we provide a complete geometric picture of the SFS manifold describing the expected SFS for samples of size under piecewise-constant population size functions with an arbitrary number k of pieces. We make explicit the map between regions of the demographic model space and the corresponding probability vectors, and this will foreshadow some of the difficulties with population size inference in practice. Next, we develop a characterization of the space of expected SFS for arbitrary population size histories. In particular, we show that for any sample size n, there is a finite integer such that the expected SFS for a sample of n under any population size history can be generated by a piecewise-constant population size function with at most epochs. Stated another way, we show that the SFS manifold contains the expected SFS for all possible population size histories, no matter how complicated their functional forms. We establish bounds on that are linear in n and along the way prove some interesting results regarding the geometry of the general SFS manifold.
Before proceeding further, we state a proposition regarding the structure of the map from to which we will call the vector of transformed break points is denoted by and defined below, while the vector of population sizes in the k epochs is denoted by It turns out that we can formulate the expected coalescence times as polynomial functions of the x and y variables. Two different ways of writing those functions down will give us two perspectives on their shape. All proofs of the results presented in this article are deferred to the Appendix.
Proposition 1. Fix a piecewise-constant population size function in with epochs where and which has constant population size value in the epoch for . Let for where (corresponding to time ), and define (corresponding to time ) for convenience. The vectors where and for all j, (uniquely) identify the population size functions in and they satisfy both of the following equations:
| (2) |
| (3) |
where is the expected first coalescence time for a sample of size m, as defined in (1).
These two formulations provide two different ways of looking at the coalescence manifold
-
In (2), the left-hand matrix called has each column of the same form with two parameters; this indicates they all live in a two-dimensional surface. Imagine, for example, the surface of the earth. There are two degrees of freedom: north–south and east–west. Here, too, specifying the value of each column, regardless of the value of n, is dependent on two numbers. Explicitly, each column is given by for some inputs a and b.
Additionally, the vector has all positive entries. That means that, when we combine columns from our surface, they will not cancel in unexpected ways due to negative coefficients. The set of positive combinations of a set of points is called a cone, and it is very nicely behaved geometrically. This means that the vector is contained in the cone over the surface described by the columns of
-
In (3), the left-hand matrix called has each column of the same form with one parameter; this indicates they all live on a curve. Like a train on a track, this has one degree of freedom, only forward–backward. Explicitly, each column is given by for some input a.
The vector on the left-hand side has entries with possibly negative coordinates. So the vector is contained in the linear span of the curve described by the columns of Unfortunately, a linear span is not quite as nicely behaved as a cone. Still, this formulation gains the simplicity of having one degree of freedom instead of two.
Proposition 1 gives us the algebraic mappings that will serve as our objects of interest. Since the SFS manifold is simply a linear transformation of the coalescence manifold, we will use these maps as our entry into understanding the SFS manifold.
The SFS Manifold
A toy model
The first in-depth study will involve the set of all possible expected SFS for a sample size of four. We choose for a number of reasons: First, the cases of sample sizes of two and three are not interesting. When we only have two haplotypes, there is only one entry in the SFS vector, i.e., singletons. The resulting set of possible expected SFS is just the set of all positive numbers. When we have three haplotypes, it is only slightly better. Because there must be fewer doubletons than singletons, the possible expected SFS is somewhere in the wedge between and from the origin; this turns out to be the only constraint.
Second, when the SFS manifold lives in which can be nicely visualized, and the normalized SFS manifold lives in the triangle with vertices and Finally, as observed in Proposition 1, the most interesting phenomena in SFS manifolds of any dimension are fundamentally phenomena of curves and surfaces. These are already captured in the case.
For the sake of completeness, we begin by formally describing the coalescence manifolds for the trivial cases of and
Proposition 2. We list some basic results on the coalescence manifolds with sample size n and k population epochs, for small values of
The manifold for all n.
The manifold for all
The manifold for all
Note that from (2) and (3) for it follows that for In words, rescaling the population sizes in each epoch by a constant a also rescales the first coalescence times by a. This implies that every point in the coalescence manifold generates a full ray contained in the coalescence manifold. Another consequence is that the normalized coalescence manifold is precisely the intersection of the coalescence manifold with the simplex
With that justification, we begin to consider the normalized coalescence manifold living in the simplex. As stated in Proposition 2, is a ray, which implies that is a single point. We now characterize the set Again, this is the set of possible SFS for two-epoch, piecewise-constant population size histories considered as a subset of all vectors summing to one.
Proposition 3. The manifold describing normalized expected times to first coalescence for sample size 4 and two population epochs, is a two-dimensional subset of the 2-simplex which can be described as the union of the point with the interiors of the convex hulls of two curves and The curves are parametrized as follows:
where denotes
This set has some highly unpleasant geometry. First of all, the set is nonconvex; topologically, it is also neither closed nor open because most of the boundary is excluded with the exception of the point The set is visualized in Figure 3A.
Figure 3.
Coalescence and SFS manifolds for sample of size four and two population epochs. (A) The coalescence manifold is the union of red and green cones. The 2-simplex, shaded in blue, intersects in the normalized coalescence manifold The green region corresponds to recent-small, ancient-large demographies; the red region to recent-large, ancient-small demographies. (B) The SFS manifold is the union of red and green cones. The 2-simplex intersects in the normalized SFS manifold Here, too, the green region corresponds to small-then-large demographies; the red region to large-then-small demographies. As mentioned earlier, is obtained from by a linear transformation.
To precisely illustrate the geometry of we will consider how contours in the domain map to contours in the image. Specifically, we plot the images of lines with fixed values of respectively fixed values of to in the 2-simplex. The resulting contours are pictured in Figure 4.
Figure 4.
Fixed-time and fixed-size contours in . (A) The blue line segments correspond to the image of where is a constant fixing the break point between the two demographies. The other input varies over all positive vectors, though scaled vectors point to the same normalized value. As the image approaches and as the image approaches (B) The blue curves correspond to the image of where is a fixed vector indicating the population values and x takes all values in The end points 0 and 1 correspond to break points at and 0, respectively. For x traces a loop in the green region; for x traces a loop in the red region.
Finally, we consider how the map χ acts on the boundaries of the domain. To aid visualization, we limit the inputs to and since all rescalings of and by the same positive constant while keeping fixed map to the same normalized coalescence vector. The resulting map is illustrated in Figure 5.
Figure 5.
Pairing the boundaries of demography space and (A) The domain of Note that for fixed the normalized coalescence vector is the same. (B) The normalized SFS manifold projected onto its first two coordinates. (C) The normalized coalescence manifold projected onto its first two coordinates. The red square on the left, corresponding to maps to the red regions on the right; the green square on the left, corresponding to maps to the green regions on the right. The black line segments on the left [corresponding to and (equivalently ); and (equivalently )] all map to the central black points on the right, since they each mimic a constant demography. The green line corresponding to maps to the curve in and the curve in the red line corresponding to maps to the curve in and the curve in The orange point maps to in and maps to in The blue point maps to in and in The remaining aqua and violet segments map to the segments of the same color.
We note that the map fails to be one-to-one within the domain only when this is also in the preimage of the point The inverse function theorem implies that on the complement of the map is a homeomorphism (a map that preserves topological features like number of components). This is consistent with our observation that the two rectangles in Figure 5A correspond to the two envelopes in Figure 5C. Now, we consider demographies with more than two epochs. This proposition implies that any expected SFS for a sample size of four coming from a single panmictic population under neutrality, regardless of the true population size history, is equal to the expected SFS for some piecewise-constant history with only three pieces. It also shows that all of these SFS vectors live inside of the convex hull of one curve.
Proposition 4. For all values the manifold and is the interior of the convex hull of the following curve:
As we can see from Proposition 4, is open and convex; however, we lose one useful property of the normalized map Specifically, let be given by noting that for Under this definition is generically one-to-one (i.e., one-to-one away from a set of measure zero). Meanwhile, the analogous construction mapping the three-epoch demography with break points and population sizes to the corresponding normalized coalescence vector has two-dimensional preimages, generically. For this reason, contour images do not lend themselves to easy description.
However, as a heuristic, we can choose a distinguished member of this preimage with nice properties. In the orange region adjacent to depicted in Figure 6, every preimage contains a limit demography with first and third epochs set to zero, and second epoch set to one. This can be thought of as a demography with a population boom in the second epoch. In the blue region adjacent to the line segment from to every preimage contains a limit demography with second epoch set to zero. This corresponds to a demography with a population bottleneck in the second epoch. Because the set of demographies mapping to each point is two dimensional, this does not describe all demographies characterized by a chosen SFS, but it does give us intuition for the types of demographies to expect.
Figure 6.
Regions of and sample demographies. The image depicts partitioned into different colored regions. The purple point in the center is the SFS corresponding to the constant demography. The green region contains SFS corresponding to recent-small, ancient-large demographies. The red region corresponds to recent-large, ancient-small demographies. The orange region contains SFS corresponding to three-epoch demographies with a boom in the second epoch. The blue region contains SFS corresponding to three-epoch demographies with a bottleneck in the second epoch. These are not the unique demographies mapping to each region of but they depict, in some sense, the simplest demographies yielding those SFS.
We can also describe the image of the map on the boundaries of our domain. The easiest way to visualize the map is first to understand how the time variables affect the value of the columns of and to view the y variables as specifying points in the convex hull of those three columns. The boundaries of the square map the columns (after rescaling to the simplex) as follows:
The case of is the most interesting: when we fix and we obtain the boundary curve Note that corresponds to a second epoch of length zero. The intuition is that very short population booms at the second epoch lead to coalescence vectors close to The maps encoded by a general column of correspond to the interior of the orange region in Figure 7A. Adding in convex combinations of points gives the lined region, which is the remainder of this is discussed more rigorously in the Appendix. When the number of epochs k steps higher, all columns of still map to the same region of the simplex, so will still be contained in this convex hull. The region is depicted in Figure 7A.
Figure 7.
Coalescence and SFS manifolds for a sample of size four and three population epochs. (A) The coalescence manifold is the entire yellow and orange region. The 2-simplex, shaded in blue, intersects in the normalized coalescence manifold The orange region of bounded by and is the image of the surface described by the columns of while the yellow region adds in vectors gained by using convex combinations. (B) The SFS manifold is the entire yellow and orange region. The 2-simplex intersects in the normalized SFS manifold The SFS manifold is obtained from by a linear transformation. The orange region of bounded by and is the image of the surface described by the columns of while the yellow region adds in vectors gained by using linear combinations.
As mentioned earlier, the SFS manifold is merely a linear transformation of however, since it is of interest in its own right, we include the formulas for analogous to those derived in this section.
Proposition 5. The following hold for the normalized SFS manifold:
is the union of with the convex hulls of two curves:
Here, also, denotes Finally, for all k, and is the convex hull of where
Visualizations of and may be found in Figure 3B and Figure 7B.
General properties
In this section, we examine the constant defined earlier as the smallest index for which for all k. The tools for the proofs in this section come from algebraic geometry (for the derivation of the lower bound) and convex geometry (for the upper bound).
The gist of the algebraic geometry argument is that, under the formulation, the manifold can be seen to be part of another manifold built by a sequence of well-understood algebraic constructions. Details of this perspective are reserved for the proofs section in the Appendix.
Two concrete consequences follow from this observation:
The ability to compute all equations satisfied by using computer algebra.
A formula for the dimension of the coalescence and SFS manifolds.
While the former is harder to explain without more setup, the latter can be simply stated: the dimension of the normalized coalescence manifold is 0 when we have the constant demography If we allow k constant pieces, the manifold has dimension unless is greater than the dimension of the simplex In that case, it has dimension
Proposition 6. The dimension of is given by:
In particular, is a proper subset of for
While Proposition 6 is useful for analyzing individual coalescence manifolds, it also leads to the observation that since the inclusions are proper until that index. It is worth remarking that a slightly weaker lower bound of follows immediately from the identifiability result of corollary 7 in Bhaskar and Song (2014), which states that for a piecewise-constant population size function with k pieces, the expected SFS of a sample of size suffices to uniquely identify the function.
We will illustrate how these algebraic ideas can be applied in the next case we have not seen, namely sample size
Example 7. Note that by Proposition 2. We will use the new ideas above to describe for higher values of k.
Since the normalized coalescence manifold has dimension we know that has dimension 2 inside of the 3-simplex; therefore, we anticipate that it will satisfy one equation, matching its codimension. The degree of the algebraic variety implies that this polynomial should have degree 8. Indeed, when we compute this equation using computer algebra software Macaulay2 (Grayson and Stillman 2002), we obtain a huge degree-8 polynomial with 105 terms, whose largest integer coefficient is Finally, is full-dimensional in the 3-simplex, so it will satisfy no algebraic equations relative to the simplex. It would be defined instead by the inequalities determining its boundary.
The convex geometry argument is more elementary. As we noted, the formulation is contained in the convex hull over the surface described by a general column of Because the columns are related, our selection of points in the surface is not unrestricted. For this reason, it is not obviously equal to the convex hull. However, once we fix some collection of values to be input in the formula for we can use convex geometry for the resulting polytope. In particular, we use Carathéodory’s theorem [Carathéodory (1907) or Barvinok (2002), theorem 2.3], which states that for X a subset of every can be represented as a positive combination of vectors for some
The argument, roughly, allows us to construct any point in that convex hull, with as few as points. This allows us to place the point in for Since no new SFS are generated by using more than epochs, we learn that is bounded above by
Combining the two bounds obtained in this section, we have the main theorem described in the Introduction.
Theorem 8. For any integer there exists a positive integer such that for all Furthermore, satisfies
Additionally, is nonconvex for all values of
This allows us to express the SFS from any piecewise-constant demography as coming from a demography with relatively few epochs. Because the SFS is an integral over the demography, the SFS from a general measurable demography can be uniformly approximated by a piecewise-constant demography with sufficiently many epochs. Our results imply that it can be precisely obtained by a demography with at most epochs.
Data availability
The authors affirm that all data necessary for confirming the conclusions of the article are present within the article, figures, and tables.
Discussion
In this work, we characterized the manifold of expected SFS generated by piecewise-constant population histories with k epochs, while giving a complete geometric description of this manifold for the sample size and epochs. This special case is already rich enough to shed light on the issues that practitioners can face when inferring population demographies from SFS data using popular software programs. While we demonstrated these issues in Figure 1 using the fastNeutrino program (Bhaskar et al. 2015), the issues we point out are inherent to the geometry of the SFS manifold and not specific to any particular demographic inference software. Our simulations showed that the demographic inference problem from SFS data can be fraught with interpretability issues, due to the sensitivity of the inferred demographies to small changes in the observed SFS data. These results can also be viewed as complementary to recent pessimistic minimax bounds on the number of segregating sites required to reliably infer ancient population size histories (Terhorst and Song 2015; Baharian and Gravel 2018).
Our investigation of piecewise-constant population histories also lets us show a general result that the expected SFS for a sample of size n under any population history can also be generated by a piecewise-constant population history with at most epochs. This result could have potential applications for developing nonparametric statistical tests of neutrality. Most existing tests of neutrality using classical population genetic statistics such as Tajima’s D (Tajima 1989) implicitly test the null hypothesis of selective neutrality and a constant effective population size (Stajich and Hahn 2004). We have characterized the expected SFS of samples of size n under arbitrary population histories in terms of the expected SFS under piecewise-constant population histories with at most epochs. As a result, the Kullback–Leibler (KL) divergence of an observed SFS to the expected SFS under the best-fitting, piecewise-constant population history with at most epochs is also equal (up to a constant shift) to the negative log-likelihood of the observed SFS under the best-fitting population size history without any constraints on its form. (This assumes the commonly used Poisson random field model where sites being analyzed are unlinked.) One can then use the KL divergence inferred by existing parametric demographic inference programs to create rejection regions for the null hypothesis of selective neutrality without having to make any parametric assumption on the underlying demography. Such an approach would also obviate the need for interpreting the inferred demography itself, since the space of piecewise-constant population histories is only being used to compute the best possible log-likelihood under any single population demographic model. This approach could serve as an alternative to recent works which first estimate a parametric demography using genome-wide sites, and then perform a hypothesis test in each genomic region using simulated distributions of SFS statistics like Tajima’s D under the inferred demography (Rafajlović et al. 2014). We leave the exploration of such tests for future work.
Acknowledgments
We thank Simon Gravel, Jeremy Berg, Laura Hayward, Yuval Simons, and the referees for their careful reading of our manuscript and for providing us with helpful comments. We also thank the Simons Institute for the Theory of Computing, where some of this work was carried out while the authors were participating in the “Evolutionary Biology and the Theory of Computing” program. This research is supported in part by a Math+X Research Grant, a National Science Foundation grant DMS-1149312 (CAREER), a National Institutes of Health grant R01 GM-109454, and a Packard Fellowship for Science and Engineering. Y.S.S. is a Chan Zuckerberg Biohub investigator.
Appendix
Formula for An
Recall that the SFS can be related to times to first coalescence by the formula The formula for is given recursively in Polanski and Kimmel (2003) (equations 13–15) by the following formulas (with variable names changed for clarity):
Proof of Proposition 1
First, we reduce the integral expression for to a finite sum; then we make appropriate manipulations until we arrive at the desired expressions.
Coalescence in the Wright–Fisher model is an inhomogeneous Poisson process with parameter Therefore, the probability density of first coalescence at time T is:
Let To compute the expected time to first coalescence, we have the integral:
Substituting variables, note that Therefore, the integral becomes:
where
The population size is a piecewise-constant function, whose value is if As specified in Proposition 1, and is the vector of population sizes. Observe that is also piecewise constant. In particular,
Let for brevity. The resulting formula is:
We turn the integral into a sum of integrals on the constant epochs:
We now make the substitution Note that the old restriction becomes the new constraint Our formula for the is now:
Noting the linear form of this expression, we factor as a matrix multiplication:
Combining the first three matrices yields (2); combining the first two and last two separately yields (3).
Proof of Proposition 2
We justify each equation in turn:
As mentioned in the Introduction, this is a classical result in population genetics, and can be derived directly from (3).
The inclusion is immediate, so we need only show that any satisfies Using (2), a is written as a sum of products of strictly positive numbers; so
- First, we show that is the interior of the open cone spanned by and Fix (for a positive) and consider
When the second vector approaches when the first vector approaches The vectors are in the interior of that cone for all other permissible values of and To show that note that for larger values of k, the same cone of vectors are produced. In particular, yields
Clearly, the second coordinate of all vectors is bounded between zero and one.
Proof of Proposition 3
First we observe that and are normalizations of the curves defined by parameterizations and where t is constrained to the open interval
Now we claim that the definition in terms of the map is equivalent to the definition in terms of these two curves. We can use the first formulation of χ to prove this:
When the image is the point as stated. When we can use the left-hand expression to view the image as a point on the line segment between and the curve When the right-hand expression can be used to write the image as a point on the line segment between X and This means that the image of χ is contained in the regions and point specified.
To show that the reverse inclusion holds, we fix a point P in the interior of the convex hull of . By convexity, the line segment from X to P is contained in the region; continue in the direction until the line intersects the curve. This must occur because all points in the region are further from the bounding line than X. The point of intersection q is specified as for some By convexity, there exists some ρ such that Fixing and shows that P is in the image of χ. The same argument holds with slight variation for
Proof of Proposition 4
The strategy to prove the equality of and the cone over comes in two steps:
Show that the columns of are always contained in the region R whose boundary is
Divide the convex hull of R into two regions and show that each of these regions are included in
First we demonstrate that the regions map precisely into R. We have already shown in the main text that the boundaries of map to the boundaries of R under the mapping defined by where S is the sum of the coordinates. We compute the Jacobian of this map explicitly in Macaulay2 (Grayson and Stillman 2002). The result is:
Plainly, this is nowhere zero in our domain. The inverse function theorem then implies that the interior is contained in the image of the boundaries. This accomplishes Step 1 of our proof.
For Step 2, we divide the image into two regions:
The triangle defined by vertices and including the two edges and
The remainder of the convex hull of R—explicitly, the interior of the region bounded by and the line segment
To show that the triangle is included, let and let vary. The third column then sits arbitrarily close to and the first column traces out Set and toggle and to obtain the full span, including the interior of the triangle, and the line segment Set and the first column sits at while the third column traces out This catches the missing line segment.
For the remainder of the convex hull, fix a point P in this region. This point lies on a line segment between and some point Q in Suppose it is equal to Set We can choose ɛ and so that the second column is arbitrarily close to P. Furthermore, observe that the first column is approximately equal to the point on corresponding to and the third column is approximately the point on corresponding to Choosing and points us to
Proof of Proposition 5
This is a direct application of the linear map computed as in Polanski and Kimmel (2003):
Proof of Proposition 6
To prove the result about dimension, we show that is a relatively open subset of a certain algebraic variety. Because the relevant operations are native to projective geometry, we transport our objects of interest in the obvious way to projective space. The same scaling properties that allow us to focus on the simplex also lead to good behavior in projective space.
Lemma 9. For the Zariski closure of is the affine cone over where:
- The symbol denotes the projective curve defined by mapping to
The symbol is the projective point
The operation denotes the join of algebraic varieties.
The operation denotes the i-th secant variety. Following Harris (2013), the i-th secant variety is the union of i-dimensional planes generated by points in the variety.
Proof of Lemma 9. The variety is the image of the following map:
where and are not simultaneously zero, and λ is unrestricted.
Define the map sending to
We can recast the expression in (3) as the composition Based on this formulation, the set is clearly contained in To demonstrate the equality of the Zariski closures, we only need to show that the dimensions match and that the variety is irreducible. Both joins and secants have the property that irreducible inputs yield irreducible outputs, so the variety of interest is irreducible. The image of ϕ is open in and the map ψ has deficient rank on a set of positive codimension. Therefore, the composition of has full dimension. This proves the Lemma. □
The i-th secant variety of an irreducible nondegenerate curve in has projective dimension given by (Harris 2013, exercise 16.16). The curve is a toric transformation of a coordinate projection of the rational normal curve. The rational normal curve is nondegenerate and both of these operations preserve that property. This means our secant variety has projective dimension The join with a point adds 1 to the dimension of the variety, while the operation of passing to the affine cone adds 1 to the dimension of the variety and the ambient space. However, normalizing to the -simplex subtracts 1 from both variety and ambient space again. This means that assuming that
Proof of Upper Bound in Theorem 8
Suppose a point is in By definition, this implies that there is a point such that (2) yields
Since the point is in the cone over the q columns of the matrix, Carathéodory’s theorem implies that it is also in the cone over some of the columns. Therefore, we can replace the vector with so that all but (or fewer) are zero.
Passing to the expression in (3), this gives us:
Since at most of the are nonzero, at most of the entries of the vector at right are nonzero. We delete the columns of the X matrix corresponding to zero entries except the first column. A new sequence may then be obtained from the ratio between the first entries in adjacent columns. The new sequence is obtained by taking the sequence of partial sums of the vector.
Proof of Nonconvexity in Theorem 8
To prove this final result, we combine two properties already proven:
The manifold is a proper subset of for all (from Proposition 6).
The manifold is contained in the convex hull of (This follows from Equation 2.)
Since contains and is properly contained in the convex hull of it cannot be convex.
Footnotes
Communicating editor: S. Ramachandran
Literature Cited
- Achaz G., 2009. Frequency spectrum neutrality tests: one for all and all for one. Genetics 183: 249–258. 10.1534/genetics.109.104042 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baharian S., Gravel S., 2018. On the decidability of population size histories from finite allele frequency spectra. Theor. Popul. Biol. 120: 42–51. 10.1016/j.tpb.2017.12.008 [DOI] [PubMed] [Google Scholar]
- Barvinok A., 2002. A Course in Convexity, Vol. 54 American Mathematical Society Providence, Providence, Rhode Island. [Google Scholar]
- Bhaskar A., Song Y. S., 2014. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann. Stat. 42: 2469–2493. 10.1214/14-AOS1264 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhaskar A., Wang Y. R., Song Y. S., 2015. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 25: 268–279. 10.1101/gr.178756.114 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carathéodory C., 1907. Über den variabilitätsbereich der koeffizienten von potenzreihen, die gegebene werte nicht annehmen. Math. Ann. 64: 95–115. 10.1007/BF01449883 [DOI] [Google Scholar]
- Chen H., 2012. The joint allele frequency spectrum of multiple populations: a coalescent theory approach. Theor. Popul. Biol. 81: 179–195. 10.1016/j.tpb.2011.11.004 [DOI] [PubMed] [Google Scholar]
- Durrett R., 2008. Probability Models for DNA Sequence Evolution. Springer-Verlag, New York: 10.1007/978-0-387-78168-6 [DOI] [Google Scholar]
- Excoffier L., Dupanloup I., Huerta-Sánchez E., Sousa V. C., Foll M., 2013. Robust demographic inference from genomic and SNP data. PLoS Genet. 9: e1003905 10.1371/journal.pgen.1003905 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu Y.-X., 1995. Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197. 10.1006/tpbi.1995.1025 [DOI] [PubMed] [Google Scholar]
- Fu Y.-X., Li W.-H., 1993. Statistical tests of neutrality of mutations. Genetics 133: 693–709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gravel S., Henn B. M., Gutenkunst R. N., Indap A. R., Marth G. T., et al. , 2011. Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA 108: 11983–11988. 10.1073/pnas.1019276108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grayson, D. R., and M. E. Stillman, 2002 Macaulay 2, a software system for research in algebraic geometry. Available at: https://faculty.math.illinois.edu/Macaulay2/.
- Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., 2009. Inferring the joint demographic history of multiple populations from multidimensional snp frequency data. PLoS Genet. 5: e1000695 10.1371/journal.pgen.1000695 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Harris J., 2013. Algebraic Geometry: A First Course, Vol. 133 Springer Verlag, New York. [Google Scholar]
- Jouganous J., Long W., Ragsdale A. P., Gravel S., 2017. Inferring the joint demographic history of multiple populations: beyond the diffusion approximation. Genetics 206: 1549–1567. 10.1534/genetics.117.200493 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kamm J. A., Terhorst J., Song Y. S., 2017. Efficient computation of the joint sample frequency spectra for multiple populations. J. Comput. Graph. Stat. 26: 182–194. 10.1080/10618600.2016.1159212 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kaplan N. L., Hudson R., Langley C., 1989. The “hitchhiking effect” revisited. Genetics 123: 887–899. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Keinan A., Clark A. G., 2012. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336: 740–743. 10.1126/science.1217283 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kelleher J., Etheridge A. M., McVean G., 2016. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLOS Comput. Biol. 12: e1004842 10.1371/journal.pcbi.1004842 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kingman J. F. C., 1982a The coalescent. Stochastic Process. Appl. 13: 235–248. 10.1016/0304-4149(82)90011-4 [DOI] [Google Scholar]
- Kingman J. F. C., 1982b Exchangeability and the evolution of large populations, pp. 97–112 in Exchangeability in Probability and Statistics, edited by Koch G., Spizzichino F. North-Holland Publishing Company, Amsterdam. [Google Scholar]
- Kingman J. F. C., 1982c On the genealogy of large populations. J. Appl. Probab. 19: 27–43. 10.2307/3213548 [DOI] [Google Scholar]
- Marth G. T., Czabarka E., Murvai J., Sherry S. T., 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351–372. 10.1534/genetics.166.1.351 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Möhle M., Sagitov S., 2001. A classification of coalescent processes for haploid exchangeable population models. Ann. Probab. 29: 1547–1562. [Google Scholar]
- Myers S., Fefferman C., Patterson N., 2008. Can one learn history from the allelic spectrum? Theor. Popul. Biol. 73: 342–348. 10.1016/j.tpb.2008.01.001 [DOI] [PubMed] [Google Scholar]
- Nielsen R., 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154: 931–942. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Polanski A., Kimmel M., 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165: 427–436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Polanski A., Bobrowski A., Kimmel M., 2003. A note on distributions of times to coalescence, under time-dependent population size. Theor. Popul. Biol. 63: 33–40. 10.1016/S0040-5809(02)00010-2 [DOI] [PubMed] [Google Scholar]
- Rafajlović M., Klassmann A., Eriksson A., Wiehe T., Mehlig B., 2014. Demography-adjusted tests of neutrality based on genome-wide SNP data. Theor. Popul. Biol. 95: 1–12. 10.1016/j.tpb.2014.05.002 [DOI] [PubMed] [Google Scholar]
- Sargsyan O., Wakeley J., 2008. A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organisms. Theor. Popul. Biol. 74: 104–114. 10.1016/j.tpb.2008.04.009 [DOI] [PubMed] [Google Scholar]
- Stajich J. E., Hahn M. W., 2004. Disentangling the effects of demography and selection in human history. Mol. Biol. Evol. 22: 63–73. 10.1093/molbev/msh252 [DOI] [PubMed] [Google Scholar]
- Tajima F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tajima F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Terhorst J., Song Y. S., 2015. Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proc. Natl. Acad. Sci. USA 112: 7677–7682. 10.1073/pnas.1503717112 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wakeley J., Hey J., 1997. Estimating ancestral population parameters. Genetics 145: 847–855. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Watterson G., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276. 10.1016/0040-5809(75)90020-9 [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The authors affirm that all data necessary for confirming the conclusions of the article are present within the article, figures, and tables.







