Author manuscript; available in PMC: 2016 Dec 21.
Published in final edited form as: Ann Stat. 2014 Oct 20;42(6):2469–2493. doi: 10.1214/14-AOS1264

DESCARTES’ RULE OF SIGNS AND THE IDENTIFIABILITY OF POPULATION DEMOGRAPHIC MODELS FROM GENOMIC VARIATION DATA

Anand Bhaskar 1, Yun S Song 1
PMCID: PMC5175586  NIHMSID: NIHMS802346  PMID: 28018011

Abstract

The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different population demographies can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this nonidentifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise-defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the “folded” SFS, which is often used when there is ambiguity as to which allelic type is ancestral. Our results are proved using a generalization of Descartes’ rule of signs for polynomials to the Laplace transform of piecewise continuous functions.

MSC2010 subject classifications: Primary 62B10, secondary 92D15

Key words and phrases: Population genetics, identifiability, population size, coalescent theory, frequency spectrum

1. Introduction

Given a sample of homologous genomic sequences from a large population, an important inference problem with a wide variety of important applications is to determine the underlying demography of the population. The population demography can be used to calibrate null models of neutral genome evolution in order to find regions under selection [2, 25, 45]; to stratify samples in genome-wide association studies [3, 28, 33, 37]; to date historical population splits, migrations, admixture and introgression events [10, 18, 24, 26, 40, 43]; and so on. Recently, several large-sample genome- and exome-sequencing datasets have become available [1, 4, 6, 31, 44], shedding new light on patterns of genetic variation that were not previously observable in smaller datasets. Such large-sample studies offer an exciting opportunity to infer demography in unprecedented detail.

One widely-used measure of genetic variation in a set of homologous genome sequences is the sample frequency spectrum (SFS). For a sample of size n, the SFS counts the proportion of dimorphic (i.e., with exactly two distinct observed alleles) sites as a function of the frequency (b/n, where 1 ≤ b ≤ n − 1) of the mutant allele in the sample. The SFS is useful for several reasons. First, the SFS is a succinct summary of a large sample of genomic sequences, where the information in n sequences of arbitrary length can be summarized by just n−1 numbers. This makes the SFS both mathematically and algorithmically tractable. In particular, since the SFS ignores linkage information between sites, one can avoid challenging mathematical and computational issues associated with rigorously modeling genetic recombination. Furthermore, the statistical properties of the SFS and their dependence on the population demographic history are well understood under the coalescent and the diffusion models of neutral evolution [7, 11, 12, 19, 35, 46]. This dependence of the SFS on demography, along with the assumption of free recombination between sites, has been exploited in several efficient methods for inferring historical population demography [5, 13, 27, 29]. Second, the SFS can effectively capture the impact of recent demography on genetic variation. Recent large-sample studies [4, 6, 31, 44] have consistently shown that there is an excess of rare polymorphisms compared to the predictions of previously inferred demographic models, which might be explained by recent rapid population expansion [16]. Because the leading entries of the SFS count the rare variants in the sample, one might be able to use this information to infer demographic events in the recent past at a much finer resolution than possible using smaller samples.
Third, the SFS also provides a simple way of visualizing the goodness of fit of a demographic model to data, since one can easily compare the SFS observed in the data with the SFS predicted by the fitted demographic model.

While the SFS has algorithmic advantages for demographic inference, it is believed to suffer from a statistical shortcoming. Specifically, Myers, Fefferman and Patterson [30] recently showed that even with perfect knowledge of the population frequency spectrum [i.e., the proportion of polymorphic sites with population-wide allele frequency in (x, x + dx) for all x ∈ (0, 1)], the historical population size function η(t) as a function of time is not identifiable. Using Müntz–Szász theory, they showed that for any population size function η(t), one can construct arbitrarily many smooth functions F(t) such that both η(t) and η(t) + αF(t) generate the same population frequency spectrum for suitably chosen values of α. They also constructed explicit examples of such functions η(t) and F(t). While this nonidentifiability could pose serious challenges to demographic inference from frequency spectrum data, the population size functions involved in their example are arguably unrealistic for biological populations. In particular, their explicit example involves a population size function which oscillates at an increasingly higher frequency as the time parameter approaches the present. Real biological population sizes can be expected to vary over time in a mathematically more well-behaved fashion. In particular, populations can be expected to evolve in discrete units of time, which, when approximated by a continuous-time model, restricts the frequency of oscillations in the population size function to be less than the number of generations of reproduction per unit time. Furthermore, since a population size model being inferred must have a finite representation for obvious algorithmic reasons, most previous demographic inference analyses have focused on inferring population size models that are piecewise-defined over a restricted class of functions, such as piecewise-constant and piecewise-exponential models [10, 17, 24, 26, 31, 41, 44]. 
Motivated by the large number of rare variants observed in several large-sample sequencing studies, recent works [38, 39] have also focused on more general population growth models which allow for the population to grow at a faster than exponential rate. Each piece in such piecewise models has two parameters that control the rate and acceleration of population growth. Since these models contain the family of piecewise-constant and piecewise-exponential population size functions, we refer to them as piecewise-generalized-exponential models in the remainder of this paper.

In this paper, we revisit the question of demographic model identifiability under the assumption that the population size is a piecewise-defined function of time where each piece comes from a family of biologically-motivated functions, such as the family of constant or exponential functions. We also re-examine the assumption that one has access to the population-wide patterns of polymorphism. In real applications, we do not expect to know the allele frequency spectrum for an entire population but rather only the SFS for a randomly drawn finite sample of individuals. Here, we investigate whether one can learn piecewise-defined population size functions given perfect knowledge of the expected SFS for a sufficiently large sample of size n. Unlike in the case of arbitrary continuous population size functions considered by Myers, Fefferman and Patterson, the answer to this question is affirmative. More precisely, we obtain bounds on the sample size n that are sufficient to distinguish population size functions among piecewise demographic models with K pieces, where each piece comes from some family of functions (see Theorems 6 and 11). Our bound on the sample size can be expressed as an affine function of the number K of pieces, where the slope of the function is a measure of the complexity of the family to which each piece belongs. In the cases of piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genetic analyses, the slope of this affine function can be calculated explicitly, as shown in Corollaries 7–9. We also obtain analogous results for the “folded” SFS (see Theorem 12), a variant of the SFS which circumvents the ambiguity in the identity of the ancestral allele type by grouping the polymorphic sites in a sample according to the sample minor allele frequency.

There are two main technical elements underlying our proofs of the identifiability results mentioned above. The first step is to show that the expected SFS of a sample of size n is in bijection with the Laplace transform of a time-rescaled version of the population size function evaluated at a particular sequence of n − 1 points. This reduces the problem of identifiability from the SFS to that of identifiability from the values of the Laplace transform at a fixed set of points. The second step relies on a generalization of Descartes’ rule of signs for polynomials to the Laplace transform of general piecewise-continuous functions. This technique yields an upper bound on the number of roots of the Laplace transform of a function by the number of sign changes of the function. We think that this proof technique based on sign changes might be of independent interest for proving statistical identifiability results in other settings. We also provide an alternate proof of identifiability for piecewise-constant population models, where the aforementioned second step is replaced by a linear algebraic argument that has a constructive flavor. We include this alternate proof in the hope that it could be used to develop an algebraic inference algorithm for piecewise-constant models.

The remainder of this paper is organized as follows. In Section 2, we introduce the model and notation, and describe our main results. We also discuss the counterexample of Myers, Fefferman and Patterson in light of our findings. The proofs of our results are provided in Section 3, and we conclude with a discussion in Section 4.

2. Main results

Here, we summarize our identifiability results. All proofs are deferred to Section 3.

2.1. Model and notation

We consider a population evolving according to Kingman’s coalescent [21–23] with the infinite-sites model of mutation [20] and selective neutrality. Under this model, the genome is assumed to be infinite and every mutation occurs at a different site in the genome that has never experienced a mutation before. This model is applicable in the regime where the mutation rate is very low, and hence the probability of multiple mutations at a given site is vanishingly small. Any polymorphic site in a sample of sequences is dimorphic under this model. The population size is assumed to change deterministically with time and is described by a function η : ℝ≥0 → ℝ+, such that the instantaneous coalescence rate between any pair of lineages at time t is 1/η(t).

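Let $T_{n,k}^{(\eta)}$ denote the time (in coalescent units) while there are k ancestral lineages for a sample of size n obtained at time 0. Defining $R_\eta(t)$ as

\[
R_\eta(t) := \int_0^t \frac{1}{\eta(x)}\,dx,
\]

the expected time $\mathbb{E}[T_{m,m}^{(\eta)}]$ to the first coalescence event for a sample of size m is given by

\[
\mathbb{E}\bigl[T_{m,m}^{(\eta)}\bigr] = \int_0^\infty t\,\frac{\binom{m}{2}}{\eta(t)} \exp\left[-\binom{m}{2} R_\eta(t)\right] dt. \tag{1}
\]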

Following the notation of Myers, Fefferman and Patterson, define a time-rescaled version η̃ of the population size function η as

\[
\tilde\eta(\tau) = \eta\bigl(R_\eta^{-1}(\tau)\bigr), \tag{2}
\]

where τ ∈ ℝ≥0. The function η̃(τ) reparameterizes the population size as a function of the cumulative rate of coalescence $\tau = R_\eta(t)$. For a given population size function η̃ parameterized by the total coalescence rate τ, there corresponds a unique population size function η parameterized by time t. Specifically, $\eta(t) = \tilde\eta\bigl(S_{\tilde\eta}^{-1}(t)\bigr)$ for all t ∈ ℝ≥0, where $S_{\tilde\eta}(t)$ is an invertible function given by

\[
S_{\tilde\eta}(t) = \int_0^t \tilde\eta(x)\,dx.
\]
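As a hedged illustration of the rescaling in (2), consider a single exponential piece η(t) = ν exp(βt) with ν > 0 and β ≠ 0. This worked example is added here and is not part of the original derivation.

```latex
\begin{align*}
R_\eta(t) &= \int_0^t \frac{e^{-\beta x}}{\nu}\,dx = \frac{1 - e^{-\beta t}}{\nu\beta}, \\
R_\eta^{-1}(\tau) &= -\frac{1}{\beta}\log(1 - \nu\beta\tau), \\
\tilde\eta(\tau) &= \nu\, e^{\beta R_\eta^{-1}(\tau)} = \frac{\nu}{1 - \nu\beta\tau}.
\end{align*}
```

For β < 0 the rescaled function is defined for all τ ≥ 0, whereas for β > 0 it is defined only for τ < 1/(νβ), because the cumulative coalescence rate then stays bounded.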

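Applying integration by parts to (1) and using the condition that $\mathbb{E}[T_{m,m}^{(\eta)}] < \infty$, we have

\[
\mathbb{E}\bigl[T_{m,m}^{(\eta)}\bigr] = \int_0^\infty \exp\left[-\binom{m}{2} R_\eta(t)\right] dt. \tag{3}
\]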

Furthermore, since Rη is monotonically increasing and continuous from ℝ≥0 to ℝ≥0, it is a bijection over ℝ≥0. For notational convenience, for any interval I ⊆ ℝ≥0, we define Rη(I) to be the interval

\[
R_\eta(I) = \{ R_\eta(x) \mid x \in I \}.
\]

By making the substitution $\tau = R_\eta(t)$ in (3) and using (2), we have the following expression for $\mathbb{E}[T_{m,m}^{(\eta)}]$:

\[
\mathbb{E}\bigl[T_{m,m}^{(\eta)}\bigr] = \int_0^\infty \tilde\eta(\tau) \exp\left[-\binom{m}{2}\tau\right] d\tau. \tag{4}
\]

Equation (4) states that the expected time to the first coalescence event for a sample of size m is given by the Laplace transform of the time-rescaled population size function η̃ evaluated at the point $\binom{m}{2}$. For a sample of size n, let $\xi_{n,b}$ denote the probability that a dimorphic site has b mutant alleles and n − b ancestral alleles. We refer to $(\xi_{n,1}, \ldots, \xi_{n,n-1})$ as the expected sample frequency spectrum (SFS).
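The agreement between (3) and (4) can be checked numerically. The following is a minimal sketch for a piecewise-constant population size; the function names, the three-piece model, and the integration settings are illustrative assumptions, not part of the original paper.

```python
import math

def cumulative_rate(t, sizes, times):
    """R_eta(t): integral of 1/eta over [0, t] for piecewise-constant eta."""
    total, prev = 0.0, 0.0
    for nu, edge in zip(sizes, list(times) + [math.inf]):
        total += (min(t, edge) - prev) / nu
        if t <= edge:
            break
        prev = edge
    return total

def c_from_laplace(m, sizes, times):
    """c_m via (4): Laplace transform of the rescaled sizes at binom(m, 2)."""
    lam = math.comb(m, 2)
    taus = [cumulative_rate(t, sizes, times) for t in times]
    edges = [0.0] + taus + [math.inf]
    return sum(nu / lam * (math.exp(-lam * lo) - math.exp(-lam * hi))
               for nu, lo, hi in zip(sizes, edges, edges[1:]))

def c_from_time_integral(m, sizes, times, t_max=40.0, steps=20_000):
    """c_m via (3): midpoint rule for the integral of exp(-binom(m,2) R_eta(t))."""
    lam = math.comb(m, 2)
    dt = t_max / steps
    return dt * sum(math.exp(-lam * cumulative_rate((j + 0.5) * dt, sizes, times))
                    for j in range(steps))

# hypothetical three-piece model: sizes per piece and change points in time
sizes, times = [1.0, 0.2, 2.0], [0.5, 1.0]
for m in range(2, 6):
    print(m, c_from_laplace(m, sizes, times), c_from_time_integral(m, sizes, times))
```

The two columns should agree up to the discretization error of the midpoint rule, since the closed form only integrates each constant piece of the rescaled function exactly.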

2.2. Determining the expected times to the first coalescence from the SFS

The following lemma shows that the expected SFS for a sample of size n tightly constrains the expected time to the first coalescence event for all sample sizes 2, …, n:

Lemma 1

Under an arbitrary variable population size model {η(t), t ≥ 0}, suppose $\xi_{n,1}, \ldots, \xi_{n,n-1}$ are known and define $c_m := \mathbb{E}[T_{m,m}^{(\eta)}]$ for 2 ≤ m ≤ n. Then, up to a common positive multiplicative constant, the quantities $c_2, \ldots, c_n$ can be determined uniquely from $\xi_{n,1}, \ldots, \xi_{n,n-1}$.

This implies that the problem of identifying the population size function η(t) from $\xi_{n,1}, \ldots, \xi_{n,n-1}$ can be reduced, up to a multiplicative constant, to the problem of identifying η(t) from $c_2, \ldots, c_n$.

2.3. Piecewise population size models and sign change complexities

To state our main result in full generality, we first need a few definitions.

Definition 1

(ℱ, family of continuous population size functions). A family ℱ of continuous population size functions is a set of positive continuous functions f : ℝ≥0 → ℝ+ of a particular type parameterized by a collection of variables.

We use ℱc to denote the family of constant population size functions; that is, functions of the form f(t) = ν for all t, where ν ∈ ℝ+ is the only parameter of the family. Further, we use ℱe to denote the family of exponential population size functions of the form f(t) = ν exp(βt), where ν ∈ ℝ+ and β ∈ ℝ are the parameters of the family. In human genetics, there has been recent interest [38, 39] in modeling superexponential growth in the effective population size via models that generalize exponential growth by incorporating an additional acceleration parameter γ. Such population size functions f satisfy the differential equation $df/dt = \beta f(t)^{\gamma}$ with initial condition f(0) = ν, where β ∈ ℝ, γ ∈ ℝ≥0, and ν ∈ ℝ+. When 0 ≤ γ < 1 (resp., γ > 1), this represents superexponential (resp., subexponential) population growth/decline, while γ = 1 corresponds to exponential population growth/decline. We let ℱg denote the family of such generalized-exponential population size functions.
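The differential equation for the generalized-exponential family can be solved in closed form by separating variables. The sketch below is an illustration added here; the function name and the error handling are assumptions rather than anything specified in the paper.

```python
import math

def generalized_exponential(t, nu, beta, gamma):
    """Closed-form solution of df/dt = beta * f**gamma with f(0) = nu."""
    if gamma == 1.0:
        return nu * math.exp(beta * t)          # plain exponential family
    # separation of variables: f**(1-gamma) = nu**(1-gamma) + beta*(1-gamma)*t
    base = nu ** (1.0 - gamma) + beta * (1.0 - gamma) * t
    if base <= 0.0:
        raise ValueError("solution is not positive and finite at time t")
    return base ** (1.0 / (1.0 - gamma))

# gamma = 0 gives linear change, gamma = 1 gives exponential change
print(generalized_exponential(2.0, 1.0, 3.0, 0.0))
print(generalized_exponential(1.0, 2.0, 0.5, 1.0))
```

The `ValueError` branch reflects that, for some parameter choices, the solution reaches zero or diverges at a finite time, so a piece from this family is only usable on the interval where it remains positive.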

Definition 2

[ℳK(ℱ), piecewise models over ℱ with at most K pieces]. Given a family ℱ of continuous population size functions, a population size function η(t) defined over ℝ≥0 is said to be piecewise over ℱ with at most K pieces if there exists an integer p, where 1 ≤ p ≤ K − 1, and a sequence of p time points $0 < t_1 < \cdots < t_p < \infty$ such that for each 1 ≤ i ≤ p + 1, there exists a positive continuous function $f_i \in \mathcal{F}$ such that $\eta(t) = f_i(t - t_{i-1})$ for all $t \in [t_{i-1}, t_i)$. By convention, we define $t_0 = 0$ and $t_{p+1} = \infty$. Note that η may not be continuous at the change points $t_1, \ldots, t_p$. We use ℳK(ℱ) to denote the space of such piecewise population size models with at most K pieces, each of which belongs to the function family ℱ. Figure 1 illustrates an example of a piecewise-exponential population size function η ∈ ℳK(ℱ), where K ≥ 5 and ℱ = ℱe.

Fig. 1. A piecewise-exponential population size function η ∈ ℳK(ℱe), where K ≥ 5. Note that the y-axis is in a log scale. This piecewise-exponential function depicts the historical population size changes of a European population that was estimated from the SFS of a sample of 1351 (diploid) individuals of European ancestry [44].

Definition 3

[σ(f), number of sign changes of a function]. For a function g (not necessarily continuous) defined over some interval (a, b), we say that t ∈ (a, b) is a sign change point of g if there exist some ε > 0, t′ ≥ t, and an interval (t, t′ + ε) ⊆ (a, b) such that:

  1. (t − ε, t) ⊆ (a, b),

  2. g(z) = 0 for z ∈ (t, t′),

  3. g(x)g(y) < 0 for all x ∈ (t − ε, t) and y ∈ (t′, t′ + ε).

We define the number σ(g) of sign changes of g as the number of such sign change points in its domain (a, b). See Figure 2 for an illustration.

Fig. 2. Illustration of the sign changes of a function. For the domain shown, σ(g) = 3 and the sign change points of g are denoted t1, t2, and t3.

Note that the above definition of the number of sign changes counts the number of times the function g changes value from positive to negative (and vice versa) while ignoring intervals where it is identically zero. While the above definition is not restricted to piecewise continuous functions, we will restrict our attention to such functions for the remainder of this paper.

Definition 4

[𝒮(ℱ) and 𝒮(ℳK(ℱ)), sign change complexities]. For a family ℱ of continuous population size functions, we define the sign change complexity 𝒮(ℱ) as

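\[
\begin{aligned}
\mathcal{S}(\mathcal{F})
&= \sup_{f_1, f_2 \in \mathcal{F},\; a_1, a_2 \ge 0}
\bigl\{ \sigma(g) \bigm| g(\tau) := \tilde f_1(\tau - a_1) - \tilde f_2(\tau - a_2) \text{ with domain } \mathrm{Dom}(g) = \{\tau \ge 0 \mid \tau - a_1 \in \mathrm{Dom}(\tilde f_1),\ \tau - a_2 \in \mathrm{Dom}(\tilde f_2)\} \bigr\} \\
&= \sup_{f_1, f_2 \in \mathcal{F},\; a \ge 0}
\bigl\{ \sigma(g) \bigm| g(\tau) := \tilde f_1(\tau) - \tilde f_2(\tau - a) \text{ with domain } \mathrm{Dom}(g) = \{\tau \ge 0 \mid \tau \in \mathrm{Dom}(\tilde f_1),\ \tau - a \in \mathrm{Dom}(\tilde f_2)\} \bigr\},
\end{aligned}
\tag{5}
\]

where $\tilde f_j$ are the time-rescaled versions of $f_j$ as defined in (2), and $\mathrm{Dom}(\tilde f_j) = R_{f_j}(\mathbb{R}_{\ge 0})$ is the domain of $\tilde f_j$. Similarly, for the space ℳK(ℱ) of piecewise population size models with at most K pieces over some function family ℱ, we define the sign change complexity 𝒮(ℳK(ℱ)) as

\[
\mathcal{S}(\mathcal{M}_K(\mathcal{F})) = \sup_{\eta_1, \eta_2 \in \mathcal{M}_K(\mathcal{F})} \bigl\{ \sigma(\tilde\eta_1 - \tilde\eta_2) \bigr\},
\]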

where, again, η̃j are related to ηj as given in (2).

The following lemma gives a bound on the sign change complexity of a model with at most K pieces in terms of the underlying family of population size functions for each piece.

Lemma 2

The sign change complexity of the space ℳK(ℱ) of piecewise models with at most K pieces in a function family ℱ is bounded by the sign change complexity of ℱ as

\[
\mathcal{S}(\mathcal{M}_K(\mathcal{F})) \le (2K - 2) + (2K - 1)\,\mathcal{S}(\mathcal{F}).
\]

Note that the bound in Lemma 2 is tight for the family ℱc of constant population sizes, for which 𝒮(ℱc) = 0 and 𝒮(ℳK(ℱc)) = 2K −2.
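The tightness claim for constant pieces can be illustrated by constructing two piecewise-constant models whose difference alternates in sign on every merged piece. The following sketch works directly on the rescaled time axis; the model encoding and the specific sizes and change points are hypothetical choices made for this example.

```python
def sign_changes(values):
    """Count sign flips in a sequence, skipping zeros (as in Definition 3)."""
    signs = [v > 0 for v in values if v != 0]
    return sum(a != b for a, b in zip(signs, signs[1:]))

def piecewise_constant_difference(model1, model2):
    """Values of eta1 - eta2 on the merged pieces.

    Each model is (sizes, change_points) on the rescaled time axis, with
    len(sizes) == len(change_points) + 1.
    """
    def value_at(model, tau):
        sizes, points = model
        return sizes[sum(p <= tau for p in points)]
    edges = sorted({0.0, *model1[1], *model2[1]})
    # the left endpoint of each merged piece is representative, since both
    # functions are constant on [edge, next_edge)
    return [value_at(model1, e) - value_at(model2, e) for e in edges]

K = 3
model1 = ([1.0, 3.0, 5.0], [1.0, 3.0])   # hypothetical sizes and change points
model2 = ([2.0, 4.0, 6.0], [2.0, 4.0])   # interleaved so the difference alternates
diff = piecewise_constant_difference(model1, model2)
print(diff, sign_changes(diff), 2 * K - 2)
```

Interleaving both the change points and the sizes is what forces a sign flip at each of the 2K − 2 merged change points.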

2.4. Identifiability results

Our main results on identifiability will be proved using a generalization of Descartes’ rule of signs for polynomials.

Theorem 3 (Descartes’ rule of signs for polynomials)

Consider a degree-n polynomial $p(x) = a_0 + a_1 x + \cdots + a_n x^n$ with real-valued coefficients $a_i$. The number of positive real roots (counted with multiplicity) of p is at most the number of sign changes between consecutive nonzero terms in the sequence $a_0, a_1, \ldots, a_n$.
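As a small illustration of Theorem 3, the sketch below counts coefficient sign changes for one cubic whose positive roots are known by construction. The polynomial and the helper names are chosen here for illustration only.

```python
def coefficient_sign_changes(coeffs):
    """Sign changes between consecutive nonzero coefficients a_0, ..., a_n."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(a != b for a, b in zip(signs, signs[1:]))

def evaluate(coeffs, x):
    """Evaluate a_0 + a_1 x + ... + a_n x^n by Horner's scheme."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# p(x) = (x - 1)(x - 2)(x + 3) = 6 - 7x + x^3 has positive roots 1 and 2
coeffs = [6, -7, 0, 1]
positive_roots = [x for x in (1, 2) if evaluate(coeffs, x) == 0]
print(coefficient_sign_changes(coeffs), len(positive_roots))
```

In this instance the bound is attained: two sign changes and two positive roots. In general the rule only gives an upper bound.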

The following theorem generalizes the above classic result to relate the number of sign changes of a piecewise-continuous function f to the number of roots of its Laplace transform.

Theorem 4 (Generalized Descartes’ rule of signs)

Let f :ℝ≥0 → ℝ be a piecewise-continuous function which is not identically zero and with a finite number σ(f) of sign changes. Then the function G(x) defined by

\[
G(x) = \int_0^\infty f(t)\, e^{-tx}\, dt \tag{6}
\]

has at most σ(f) roots in ℝ (counted with multiplicity).
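As a hedged illustration of Theorem 4 (this example is added here and does not appear in the original text), take the step function f(t) = 1 for 0 ≤ t < 1 and f(t) = −1 for t ≥ 1, which has σ(f) = 1. Its transform can be computed in closed form for x > 0, where the integral converges:

```latex
\[
G(x) = \int_0^1 e^{-tx}\,dt - \int_1^\infty e^{-tx}\,dt
     = \frac{1 - e^{-x}}{x} - \frac{e^{-x}}{x}
     = \frac{1 - 2e^{-x}}{x}, \qquad x > 0 .
\]
```

Here G vanishes only at x = log 2, so the single sign change of f is matched by exactly one root.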

The statement of Theorem 4 and the proof provided in Section 3 are adapted from Jameson [15], Lemma 4.5, for our setting. Using Theorem 4, we prove in Section 3 the following identifiability theorem for population size function families with finite sign change complexity.

Theorem 5

For a sample of size n, let $c = (c_2, \ldots, c_n)$, where $c_m = \mathbb{E}[T_{m,m}^{(\eta)}]$, for 2 ≤ m ≤ n, is defined in (3). If 𝒮(ℱ) < ∞ and n ≥ 𝒮(ℱ) + 2, then no two distinct models η1, η2 ∈ ℱ can produce the same $(c_2, \ldots, c_n)$. In other words, for n ≥ 𝒮(ℱ) + 2, the map $c : \mathcal{F} \to \mathbb{R}_+^{n-1}$ is injective.

Note that the sample size bound in Theorem 5 applies to an arbitrary function family ℱ which need not have any special structure. Using Lemma 2 for bounding the sign change complexity of piecewise-defined function families ℳK(ℱ) in terms of the sign change complexity of the underlying function family ℱ, we immediately obtain the following theorem.

Theorem 6

For a sample of size n, let $c = (c_2, \ldots, c_n)$, where $c_m = \mathbb{E}[T_{m,m}^{(\eta)}]$, for 2 ≤ m ≤ n, is defined in (3). If 𝒮(ℱ) < ∞ and n ≥ 2K + (2K − 1)𝒮(ℱ), then the map $c : \mathcal{M}_K(\mathcal{F}) \to \mathbb{R}_+^{n-1}$ is injective.

Using Theorem 6, it is simple to derive identifiability results for piecewise-defined population size models over several function families ℱ that are of biological interest. In particular, we have the following result for the case of piecewise-constant models.

Corollary 7 [Identifiability of piecewise-constant population size models in ℳK(ℱc)]

The map $c : \mathcal{M}_K(\mathcal{F}_c) \to \mathbb{R}_+^{n-1}$ is injective if the sample size n ≥ 2K.

The bound in Corollary 7 on the sample size sufficient for identifying piecewise-constant population models is actually tight, since ℳK(ℱc) has 2K − 1 parameters in ℝ+ and there is no continuous injective function from $\mathbb{R}_+^{2K-1}$ to $\mathbb{R}^{n-1}$ if n < 2K. (This fact can be proved in multiple ways, such as by the Borsuk–Ulam theorem or the Constant Rank theorem.) An alternate proof of Corollary 7 that does not rely on Theorem 6 is also provided in Section 3. This alternate proof is based on an argument from linear algebra, and it might be possible to adapt this approach to develop an algebraic algorithm for inferring the parameters of a piecewise-constant population function from the set of expected first coalescence times $c_m$.

Another class of models often assumed in population genetic analyses are piecewise-exponential functions, for which we have the following result.

Corollary 8 [Identifiability of piecewise-exponential population size models in ℳK(ℱe)]

The map $c : \mathcal{M}_K(\mathcal{F}_e) \to \mathbb{R}_+^{n-1}$ is injective if the sample size n ≥ 4K − 1.

For the generalized-exponential growth models considered by Reppell, Boehnke and Zöllner [39], we have the following result.

Corollary 9 [Identifiability of piecewise-generalized-exponential population size models in ℳK(ℱg)]

The map $c : \mathcal{M}_K(\mathcal{F}_g) \to \mathbb{R}_+^{n-1}$ is injective if the sample size n ≥ 6K − 2.
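The three corollaries are instances of the affine bound in Theorem 6. The sketch below tabulates that bound; the value 𝒮(ℱc) = 0 is stated in the text, while the values 1 and 2 for the exponential and generalized-exponential families are inferred here by matching Corollaries 8 and 9 against Theorem 6 and should be read as an assumption of this example.

```python
def min_sample_size(num_pieces, sign_change_complexity):
    """Sufficient sample size from Theorem 6: 2K + (2K - 1) * S(F)."""
    K, S = num_pieces, sign_change_complexity
    return 2 * K + (2 * K - 1) * S

# S(F) values: 0 is stated in the text for constant pieces; 1 and 2 are
# inferred by matching Corollaries 8 and 9 to Theorem 6 (an assumption).
COMPLEXITY = {"constant": 0, "exponential": 1, "generalized-exponential": 2}

for family, S in COMPLEXITY.items():
    print(family, [min_sample_size(K, S) for K in (2, 3, 5)])
```

With these complexities the function reproduces 2K, 4K − 1 and 6K − 2, so the slope of the bound in K grows by two for each additional sign change allowed within a piece.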

For the identifiability of piecewise population size models from the SFS data, we first note the following lemma.

Lemma 10

Consider a piecewise population size function η ∈ ℳK(ℱ). Consider a sample of size n ≥ 2K + (2K − 1)𝒮(ℱ) and suppose the function η produces $\mathbb{E}[T_{m,m}^{(\eta)}] = c_m$ for 2 ≤ m ≤ n. Then, for every fixed κ ∈ ℝ+, there exists a unique piecewise population size function ζ ∈ ℳK(ℱ) with $\mathbb{E}[T_{m,m}^{(\zeta)}] = \kappa c_m$ for 2 ≤ m ≤ n. Furthermore, this population size function ζ is given by ζ(t) = κη(t/κ).
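The scaling claim in Lemma 10 can be checked directly from (3). The following is a short verification sketch rather than the authors' proof, with λ used here as shorthand for the binomial coefficient (m choose 2).

```latex
\begin{align*}
R_\zeta(t) &= \int_0^t \frac{dx}{\kappa\,\eta(x/\kappa)}
            = \int_0^{t/\kappa} \frac{du}{\eta(u)} = R_\eta(t/\kappa), \\
\mathbb{E}\bigl[T_{m,m}^{(\zeta)}\bigr]
  &= \int_0^\infty e^{-\lambda R_\eta(t/\kappa)}\,dt
   = \kappa \int_0^\infty e^{-\lambda R_\eta(u)}\,du
   = \kappa\, c_m .
\end{align*}
```

Both steps use the substitution u = t/κ (or u = x/κ), which stretches time and population size by the same factor.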

Given two models η, ζ ∈ ℳK(ℱ), we say that η and ζ are equivalent, and write η ~ ζ, if they are related by a rescaling of change points and population sizes as described in Lemma 10. Let [η] denote the equivalence class of population size functions that contains η, and let ℳK(ℱ)/~ = {[η] | η ∈ ℳK(ℱ)} be the set of equivalence classes for the equivalence relation ~. Then, combining Lemma 1, Theorem 6 and Lemma 10, we obtain the following theorem.

Theorem 11

If 𝒮(ℱ) <∞ and n ≥ 2K + (2K − 1)𝒮(ℱ), then, for each expected SFS (ξn,1, … , ξn,n−1), there exists a unique equivalence class [η] of models in ℳK(ℱ)/~ consistent with (ξn,1, … , ξn,n−1).

2.5. Extension to the folded frequency spectrum

To generate the SFS from genomic sequence data, one needs to know the identities of the ancestral and mutant alleles at each site. To avoid this problem, a commonly employed strategy in population genetic inference involves “folding” the SFS. More precisely, for a sample of size n, the ith entry of the folded SFS $\chi = (\chi_{n,1}, \ldots, \chi_{n,\lfloor n/2 \rfloor})$ is defined by

\[
\chi_{n,i} = \frac{\xi_{n,i} + \xi_{n,n-i}}{1 + \delta_{i,n-i}},
\]

where 1 ≤ i ≤ ⌊n/2⌋. In particular, χn,i is the proportion of polymorphic sites that have i copies of the minor allele. For any sample size n, since χ is a vector of approximately half the dimension as ξ, we might expect to require roughly twice as many samples to recover the demographic model from χ compared to ξ. This is indeed the case. Given the folded SFS χ, the following theorem establishes a sufficiency condition on the sample size for identifying demographic models in ℳK(ℱ).

Theorem 12

If 𝒮(ℱ) < ∞ and n ≥ 2(2K − 1)(1 + 𝒮(ℱ)), then, for each expected folded SFS $\chi = (\chi_{n,1}, \ldots, \chi_{n,\lfloor n/2 \rfloor})$, there exists a unique equivalence class [η] of models in ℳK(ℱ)/~ consistent with χ.
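The folding operation defined in this section, together with the sample size condition of Theorem 12, can be sketched as follows. The function names are illustrative, and the example spectrum (proportional to 1/b for n = 4) is a hypothetical input chosen for easy checking.

```python
from fractions import Fraction

def fold_sfs(xi):
    """Fold an SFS (xi_1, ..., xi_{n-1}) into (chi_1, ..., chi_{floor(n/2)})."""
    n = len(xi) + 1
    chi = []
    for i in range(1, n // 2 + 1):
        total = xi[i - 1] + xi[n - i - 1]
        # when i == n - i the same entry was added twice, so halve it
        chi.append(total / 2 if i == n - i else total)
    return chi

def min_sample_size_folded(num_pieces, sign_change_complexity):
    """Sufficient sample size from Theorem 12: 2(2K - 1)(1 + S(F))."""
    return 2 * (2 * num_pieces - 1) * (1 + sign_change_complexity)

xi = [Fraction(6, 11), Fraction(3, 11), Fraction(2, 11)]   # n = 4
print(fold_sfs(xi))
print(min_sample_size_folded(2, 0))
```

Folding preserves the total mass of the spectrum, and the division by two for the middle entry corresponds to the Kronecker delta term in the definition.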

2.6. The counterexample of Myers, Fefferman and Patterson

Myers, Fefferman and Patterson [30] provided an explicit counterexample to the identifiability of population size models from the allelic frequency spectrum. In our notation, they provided two time-rescaled population size functions η̃1 and η̃2 given by

\[
\tilde\eta_1(\tau) = N, \qquad \tilde\eta_2(\tau) = N\bigl(1 - 9F(\tau)\bigr),
\]

where N is an arbitrary positive constant, and the function F is given by the convolution

\[
F(\tau) = \int_0^\tau f_0(\tau - u)\, f_1(u)\, du,
\]

where f0 and f1 are given by

\[
f_0(\tau) = \exp(-1/\tau^2), \qquad f_1(\tau) = \frac{\cos(\pi^2/\tau)\exp(-\tau/8)}{\sqrt{\tau}}.
\]

Both functions f1 and F have increasingly frequent oscillations as τ ↓ 0 so that $\sigma(\tilde\eta_1 - \tilde\eta_2) = \sigma(F) = \infty$. This is why Theorem 5 does not apply to this example. Indeed, by an argument using the Laplace transforms of f1 and F, Myers, Fefferman and Patterson showed that the function G(x) defined in (6) in terms of F has roots at $x = \binom{m}{2}$ for each m ≥ 2.
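The location of these roots can be sketched as follows, assuming that f1(τ) = cos(π²/τ) exp(−τ/8)/√τ and that the standard transform pair ∫ t^(−1/2) exp(−a/t) exp(−st) dt = √(π/s) exp(−2√(as)) extends to a = iπ². This derivation is an illustration added here, not a reproduction of the argument in [30].

```latex
\begin{align*}
\int_0^\infty \frac{\cos(\pi^2/\tau)}{\sqrt{\tau}}\, e^{-s\tau}\,d\tau
  &= \sqrt{\frac{\pi}{s}}\; e^{-\pi\sqrt{2s}} \cos\bigl(\pi\sqrt{2s}\bigr), \\
s = x + \tfrac18,\quad x = \binom{m}{2}
  &\;\Longrightarrow\; 2s = \bigl(m - \tfrac12\bigr)^2
  \;\Longrightarrow\; \cos\bigl(\pi(m - \tfrac12)\bigr) = 0 .
\end{align*}
```

Because the Laplace transform of a convolution is the product of the individual transforms, a zero of the transform of f1 at x = (m choose 2) is also a zero of the transform of F.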

3. Proofs

We now provide proofs of the results presented earlier.

Proof of Lemma 1

In the coalescent for a sample of size n, let $\gamma_{n,b}$ denote the total expected branch length subtending b leaves, for 1 ≤ b ≤ n − 1. Then $\xi_{n,b} = \gamma_{n,b} / \sum_{k=1}^{n-1} \gamma_{n,k}$, which implies that there exists a positive constant κ such that $\gamma_{n,b} = \kappa\,\xi_{n,b}$ for all 1 ≤ b ≤ n − 1. We now prove that $c_2, \ldots, c_n$ can be determined uniquely from $\gamma_{n,1}, \ldots, \gamma_{n,n-1}$.

Let $\phi_{n,k} = \mathbb{E}[T_{n,k}^{(\eta)}]$. Then, by a result of Griffiths and Tavaré [12],

\[
\gamma_{n,b} = \sum_{k=2}^{n-b+1} k\, \frac{\binom{n-b-1}{k-2}}{\binom{n-1}{k-1}}\, \phi_{n,k}, \tag{7}
\]

for 1 ≤ b ≤ n − 1. The system of equations (7) can be rewritten succinctly as a linear system

γ=Mϕ,

where $\gamma = (\gamma_{n,1}, \ldots, \gamma_{n,n-1})$, $\phi = (\phi_{n,2}, \ldots, \phi_{n,n})$, and $M = (m_{bk})$ with $m_{bk} = k\binom{n-b-1}{k-2} / \binom{n-1}{k-1}$, for 1 ≤ b ≤ n − 1 and 2 ≤ k ≤ n. The matrix M is upper-left triangular since $\binom{n-b-1}{k-2} = 0$ if k > n − b + 1, and the anti-diagonal entries are $k / \binom{n-1}{k-1} > 0$. Hence, det(M) ≠ 0 and M is therefore invertible. Thus, given γ, we can determine ϕ uniquely as $M^{-1}\gamma$.
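The linear system in (7) can be assembled exactly with rational arithmetic. The sketch below uses a constant population of size 1 as a test input, for which the expected intercoalescence times are 1 / C(k, 2); the function names and this choice of input are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def branch_length_matrix(n):
    """M with entries m_{bk} = k * C(n-b-1, k-2) / C(n-1, k-1)."""
    return [[Fraction(k * comb(n - b - 1, k - 2), comb(n - 1, k - 1))
             for k in range(2, n + 1)]
            for b in range(1, n)]

def expected_branch_lengths(n, phi):
    """gamma = M phi, where phi = (phi_{n,2}, ..., phi_{n,n})."""
    return [sum(m * p for m, p in zip(row, phi))
            for row in branch_length_matrix(n)]

n = 6
# constant population of size 1: E[T_{n,k}] = 1 / C(k, 2)
phi = [Fraction(1, comb(k, 2)) for k in range(2, n + 1)]
gamma = expected_branch_lengths(n, phi)
print(gamma)   # entries are proportional to 1/b in this special case
```

The zero pattern of `branch_length_matrix(n)` is the upper-left triangular structure used in the proof, which is what makes the system uniquely solvable for ϕ.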

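Let $\psi_{n,k} = \sum_{j=k}^{n} \mathbb{E}[T_{n,j}^{(\eta)}]$. Then, defining $\psi_{n,n+1} := 0$, observe that $\psi_{n,k} = \phi_{n,k} + \psi_{n,k+1}$ for 2 ≤ k ≤ n. This implies that $\psi_{n,2}, \ldots, \psi_{n,n}$ can be determined uniquely from $\phi_{n,2}, \ldots, \phi_{n,n}$. Polanski, Bobrowski and Kimmel [35] showed that $\psi_{n,k}$ can be written as

\[
\psi_{n,k} = \sum_{m=k}^{n} a_{km}\, c_m, \tag{8}
\]

where $a_{km}$, for k ≤ m ≤ n, are given by

\[
a_{km} = \frac{\prod_{l=k,\, l \ne m}^{n} \binom{l}{2}}{\prod_{l=k,\, l \ne m}^{n} \left[\binom{l}{2} - \binom{m}{2}\right]},
\]

and $c_m = \mathbb{E}[T_{m,m}^{(\eta)}]$, shown in (3). Again, the system of equations (8) can be written as a triangular linear system

\[
\psi = Ac,
\]

where $\psi = (\psi_{n,2}, \ldots, \psi_{n,n})$, $c = (c_2, \ldots, c_n)$, and $A = (a_{km})$, for 2 ≤ k, m ≤ n. Note that A is an upper triangular matrix since $a_{km} := 0$ if m < k. Since A has nonzero entries on its diagonal, $A^{-1}$ exists, and c can be determined uniquely as $A^{-1}\psi$.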
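The final inversion step of the proof can be sketched with exact arithmetic. The code below builds the coefficients a_km from the product formula, forms ψ from a known c (a constant population of size 1, chosen here as a hypothetical test input), and recovers c by back-substitution; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb

def polanski_coefficients(n):
    """Upper triangular A = (a_{km}), 2 <= k, m <= n, from the product formula."""
    def a(k, m):
        if m < k:
            return Fraction(0)
        value = Fraction(1)
        for l in range(k, n + 1):
            if l != m:
                value *= Fraction(comb(l, 2), comb(l, 2) - comb(m, 2))
        return value
    return [[a(k, m) for m in range(2, n + 1)] for k in range(2, n + 1)]

def solve_upper_triangular(A, psi):
    """Back-substitution for A c = psi with A upper triangular."""
    size = len(psi)
    c = [Fraction(0)] * size
    for row in reversed(range(size)):
        tail = sum(A[row][col] * c[col] for col in range(row + 1, size))
        c[row] = (psi[row] - tail) / A[row][row]
    return c

n = 5
c_true = [Fraction(1, comb(m, 2)) for m in range(2, n + 1)]  # constant size 1
A = polanski_coefficients(n)
psi = [sum(a * c for a, c in zip(row, c_true)) for row in A]
print(solve_upper_triangular(A, psi) == c_true)
```

For this constant-size input, each ψ entry should equal the sum of 1 / C(j, 2) over j from k to n, which gives an independent check on the coefficients.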

Proof of Lemma 2

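Given a pair of piecewise population size functions η1, η2 ∈ ℳK(ℱ), let η̃1 and η̃2 be their respective time-rescaled versions, defined by (2). Let $0 < t_1^{(1)} < \cdots < t_{p_1}^{(1)} < \infty$, where $0 \le p_1 \le K - 1$ (resp., $0 < t_1^{(2)} < \cdots < t_{p_2}^{(2)} < \infty$, where $0 \le p_2 \le K - 1$), be the change points of the pieces of η1 (resp., η2). We define $t_0^{(1)} = t_0^{(2)} = 0$ and $t_{p_1+1}^{(1)} = t_{p_2+1}^{(2)} = \infty$. The change points of η̃1 are given by $R_{\eta_1}(t_i^{(1)})$, where $1 \le i \le p_1$, while the change points of η̃2 are given by $R_{\eta_2}(t_i^{(2)})$, where $1 \le i \le p_2$. Let $0 < \tau_1 < \cdots < \tau_p < \infty$ be the union of the change points of η̃1 and η̃2, where $0 \le p \le p_1 + p_2$. By convention, let $\tau_0 = 0$ and $\tau_{p+1} = \infty$.

Consider the piece $(\tau_i, \tau_{i+1})$ for 0 ≤ i ≤ p. Let $I_1 = (t_k^{(1)}, t_{k+1}^{(1)})$, where $0 \le k \le p_1$, and $I_2 = (t_l^{(2)}, t_{l+1}^{(2)})$, where $0 \le l \le p_2$, be the pieces of the original population size functions η1 and η2, respectively, such that $(\tau_i, \tau_{i+1}) \subseteq R_{\eta_1}(I_1)$ and $(\tau_i, \tau_{i+1}) \subseteq R_{\eta_2}(I_2)$. Since η1 ∈ ℳK(ℱ), there exists a function f1 ∈ ℱ such that $\eta_1(t) = f_1(t - t_k^{(1)})$ for all $t \in I_1$. Then, for all $\tau \in R_{\eta_1}(I_1)$,

\[
\tilde\eta_1(\tau) = \eta_1\bigl(R_{\eta_1}^{-1}(\tau)\bigr) = f_1\bigl(R_{\eta_1}^{-1}(\tau) - t_k^{(1)}\bigr) = \tilde f_1\bigl(R_{f_1}(R_{\eta_1}^{-1}(\tau) - t_k^{(1)})\bigr) = \tilde f_1\bigl(\tau - R_{\eta_1}(t_k^{(1)})\bigr). \tag{9}
\]

Similarly, there exists some function f2 ∈ ℱ such that, for all $\tau \in R_{\eta_2}(I_2)$,

\[
\tilde\eta_2(\tau) = \tilde f_2\bigl(\tau - R_{\eta_2}(t_l^{(2)})\bigr). \tag{10}
\]

Using (9) and (10), we see that the number of sign change points of $\tilde\eta_1 - \tilde\eta_2$ in the piece $(\tau_i, \tau_{i+1})$ is at most the number of sign change points of $\tilde f_1(\tau - R_{\eta_1}(t_k^{(1)})) - \tilde f_2(\tau - R_{\eta_2}(t_l^{(2)}))$ for $\tau \in (\tau_i, \tau_{i+1})$. Hence, by (5), it follows that within each piece $(\tau_i, \tau_{i+1})$ for 0 ≤ i ≤ p, $\tilde\eta_1 - \tilde\eta_2$ has at most 𝒮(ℱ) sign change points. Also, the point $\tau_{i+1}$ itself could be a sign change point in the interval between the last sign change point in piece $(\tau_i, \tau_{i+1})$ and the first sign change point in piece $(\tau_{i+1}, \tau_{i+2})$, where 0 ≤ i ≤ p − 1. These are all the possible sign change points of $\tilde\eta_1 - \tilde\eta_2$. Hence,

\[
\sigma(\tilde\eta_1 - \tilde\eta_2) \le p + (p + 1)\,\mathcal{S}(\mathcal{F}) \le (p_1 + p_2) + (p_1 + p_2 + 1)\,\mathcal{S}(\mathcal{F}) \le (2K - 2) + (2K - 1)\,\mathcal{S}(\mathcal{F}). \tag{11}
\]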

Since (11) holds for all η1, η2 ∈ ℳK (ℱ), the lemma follows.

Proof of Theorem 4

The proof is by induction on the number of sign changes of f. If f has zero sign changes, then without loss of generality, f(t) ≥ 0 for t ∈ (0, ∞) and f(t) > 0 for some interval (a, b) ⊆ (0, ∞). Hence, G(x) > 0 for all x, and the base case holds. Suppose f has m + 1 sign change points $t_0, \ldots, t_m$, where m ≥ 0. Note that G(x) and $F(x) = e^{t_0 x} G(x)$ have the same real-valued roots (with multiplicity) since $e^{t_0 x} > 0$ for all x ∈ ℝ. F′(x) is given by

\[
F'(x) = \frac{d}{dx}\left(\int_0^\infty f(t)\, e^{-(t - t_0)x}\, dt\right) = \int_0^\infty (t_0 - t)\, f(t)\, e^{-(t - t_0)x}\, dt,
\]

where the interchange of the differential and integral operators in the second equality is justified by the Leibniz integral rule because f is piecewise continuous over ℝ≥0, and both $f(t)e^{-(t - t_0)x}$ and $\frac{d}{dx}\bigl(f(t)e^{-(t - t_0)x}\bigr)$ are jointly continuous over $(p_i, p_{i+1}) \times (-\infty, \infty)$ for each piece $(p_i, p_{i+1})$ over which f is continuous. Note that the set of sign change points of $(t_0 - t)f(t)$ is $\{t_1, \ldots, t_m\}$. Hence, $(t_0 - t)f(t)$ has only m sign changes. By the induction hypothesis, F′ has at most m real-valued roots. By Rolle’s theorem, the number of real-valued roots of F is at most one more than the number of real-valued roots of F′. Hence, F has at most m + 1 real-valued roots, implying that G has at most m + 1 real-valued roots.

Proof of Theorem 5

Suppose there exist two distinct population size functions η1, η2 ∈ ℱ that produce exactly the same $c_m$ for all 2 ≤ m ≤ n. From (4), we have that

\[
\int_0^\infty \bigl(\tilde\eta_1(\tau) - \tilde\eta_2(\tau)\bigr)\, e^{-\binom{m}{2}\tau}\, d\tau = 0 \tag{12}
\]

for 2 ≤ m ≤ n. If we define the function G(x) as

\[
G(x) = \int_0^\infty \bigl(\tilde\eta_1(\tau) - \tilde\eta_2(\tau)\bigr)\, e^{-x\tau}\, d\tau,
\]

then from (12), we see that $\binom{m}{2}$ is a root of G(x) for 2 ≤ m ≤ n, and hence, G has at least n − 1 roots. Applying Theorem 4 to the piecewise continuous function $\tilde\eta_1 - \tilde\eta_2$, we see that G can have at most $\sigma(\tilde\eta_1 - \tilde\eta_2)$ roots. Taking the supremum over all population size functions η1 and η2 in ℱ, we see that G can have at most 𝒮(ℱ) roots. Hence, if n − 1 > 𝒮(ℱ), we get a contradiction. This implies that if n ≥ 𝒮(ℱ) + 2, no two distinct population size functions in ℱ can produce the same $(c_2, \ldots, c_n)$.

Proof of Corollary 7

As remarked after Lemma 2, for the constant population size function family $\mathcal{F}_c$, $\mathcal{S}(\mathcal{F}_c) = 0$. Hence, by Theorem 6, if $n \ge 2K$, the map $c: \mathcal{M}_K(\mathcal{F}_c) \to \mathbb{R}_+^{n-1}$ is injective.

An alternate proof of Corollary 7 based on linear algebra

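Let $n \ge 2K$, and suppose there exist two distinct models $\eta^{(1)}, \eta^{(2)} \in \mathcal{M}_K(\mathcal{F}_c)$ that produce exactly the same $c_m$ for all $2 \le m \le n$. Let $\tilde\eta^{(1)}$ and $\tilde\eta^{(2)}$ denote the time-rescaled versions of $\eta^{(1)}$ and $\eta^{(2)}$, respectively, as in (2). Since $\eta^{(j)}$ is piecewise constant with at most $K$ pieces, $\tilde\eta^{(j)}$ is also piecewise constant with the same number of pieces as $\eta^{(j)}$, and $\eta^{(1)} \ne \eta^{(2)}$ implies $\tilde\eta^{(1)} \ne \tilde\eta^{(2)}$. Therefore, $\tilde\Delta := \tilde\eta^{(1)} - \tilde\eta^{(2)}$ is a piecewise-constant function over $[0, \infty)$ with $p$ pieces, where $1 \le p \le 2K - 1$, and $\tilde\Delta$ is not identically zero. Let $\tau_1 < \cdots < \tau_{p-1}$ denote the change points of $\tilde\Delta$, and define $\tau_0 = 0$ and $\tau_p = \infty$. Suppose $\tilde\Delta(\tau) = \delta_i \in \mathbb{R}$ for all $\tau \in [\tau_{i-1}, \tau_i)$, where $1 \le i \le p$. Since $\tilde\eta^{(1)}$ and $\tilde\eta^{(2)}$ produce the same $c_m$ for all $2 \le m \le n$, we know that $\tilde\Delta$ satisfies

$$\int_0^\infty \tilde\Delta(\tau)\, e^{-\binom{m}{2}\tau}\, d\tau = 0, \tag{13}$$

for all $2 \le m \le n$. Substituting the definition of $\tilde\Delta$ into (13) and multiplying by $\binom{m}{2}$, we obtain

$$\sum_{i=1}^{p} \delta_i \left[ e^{-\binom{m}{2}\tau_{i-1}} - e^{-\binom{m}{2}\tau_i} \right] = 0, \tag{14}$$

for $2 \le m \le n$. This defines a linear system $A\delta = 0$, where $\delta = (\delta_1, \ldots, \delta_p)$ and $A = (a_{mi})$ is an $(n - 1) \times p$ matrix with $a_{mi} := e^{-\binom{m}{2}\tau_{i-1}} - e^{-\binom{m}{2}\tau_i}$ for $2 \le m \le n$ and $1 \le i \le p$.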

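Let $B = (b_{mi})$ be the $(n - 1) \times p$ matrix formed from $A$ such that the $i$th column of $B$ is the sum of columns $i, i + 1, \ldots, p$ of $A$. Defining $\alpha_i = e^{-\tau_{i-1}}$, note that $b_{mi} = \alpha_i^{\binom{m}{2}}$ for $2 \le m \le n$ and $1 \le i \le p$, because the column sum telescopes and $\tau_p = \infty$. Now, consider the $p \times p$ submatrix $C$ of $B$ consisting of the first $p$ rows of $B$, which exists because $p \le 2K - 1 \le n - 1$. Since $\alpha_1 > \alpha_2 > \cdots > \alpha_p > 0$, note that $C$ is a generalized Vandermonde matrix, which implies $\det(C) \ne 0$ ([8], Chapter XIII, Section 8). Hence, $\operatorname{rank}(B) = p$. The rank of $A$ is invariant under elementary column operations and, therefore, $\operatorname{rank}(A) = \operatorname{rank}(B) = p$. Therefore, the kernel of $A$ is trivial, and the only solution to (14) is $\delta_1 = \delta_2 = \cdots = \delta_p = 0$, which contradicts our assumption that $\tilde\Delta = \tilde\eta^{(1)} - \tilde\eta^{(2)} \not\equiv 0$.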
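The linear-algebra argument can be checked numerically on a small instance. The following Python sketch builds $A$ for $p = 3$ pieces and $n = 4$, forms $B$ from suffix sums of the columns of $A$, and evaluates $\det(C)$. The change points $\tau_1 = \ln 2$ and $\tau_2 = \ln 4$, the helper names, and the plain Gaussian-elimination determinant are assumptions made for illustration.

```python
import math

def binom2(m):
    return m * (m - 1) // 2

def build_A(change_points, n):
    """Matrix A of (14): rows m = 2..n, columns i = 1..p, with tau_0 = 0, tau_p = inf."""
    taus = [0.0, *change_points, math.inf]
    p = len(taus) - 1
    return [[math.exp(-binom2(m) * taus[i - 1]) - math.exp(-binom2(m) * taus[i])
             for i in range(1, p + 1)]
            for m in range(2, n + 1)]

def suffix_column_sums(A):
    """Column i of the result is the sum of columns i, i+1, ..., p of A."""
    B = []
    for row in A:
        running, out = 0.0, []
        for value in reversed(row):
            running += value
            out.append(running)
        B.append(out[::-1])
    return B

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    size, d = len(M), 1.0
    for k in range(size):
        pivot = max(range(k, size), key=lambda r: abs(M[r][k]))
        if M[pivot][k] == 0.0:
            return 0.0
        if pivot != k:
            M[k], M[pivot] = M[pivot], M[k]
            d = -d
        d *= M[k][k]
        for r in range(k + 1, size):
            factor = M[r][k] / M[k][k]
            for c in range(k, size):
                M[r][c] -= factor * M[k][c]
    return d

change_points = [math.log(2), math.log(4)]   # hypothetical tau_1, tau_2
n, p = 4, 3
A = build_A(change_points, n)
B = suffix_column_sums(A)
C = B[:p]                                    # first p rows of B
det_C = det(C)
print(det_C)                                 # nonzero, so A has trivial kernel
```

Here $\alpha = (1, 1/2, 1/4)$, and the entries of $B$ agree with $\alpha_i^{\binom{m}{2}}$ up to rounding.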

Proof of Corollary 8

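Let $f_1, f_2 \in \mathcal{F}_e$ be given by

$$f_1(t) = \nu_1 \exp(\beta_1 t), \qquad f_2(t) = \nu_2 \exp(\beta_2 t),$$

where $t \in \mathbb{R}_{\ge 0}$, $\nu_1, \nu_2 \in \mathbb{R}_+$ and $\beta_1, \beta_2 \in \mathbb{R}$. Then, for $i = 1, 2$, the time-rescaled function $\tilde f_i$ is given by

$$\tilde f_i(\tau) = \frac{\nu_i}{1 - \nu_i \beta_i \tau}, \tag{15}$$

for $\tau \in \operatorname{Dom}(\tilde f_i) = R_{f_i}(\mathbb{R}_{\ge 0}) = \left[0, \frac{1}{\nu_i \beta_i}\right)$. From (15), it can be seen that $\tilde f_1$ and $\tilde f_2$ are continuous in their domains. Furthermore, for any given $a \in \mathbb{R}_{\ge 0}$, there is at most one $\tau$, where $\tau \in \operatorname{Dom}(\tilde f_1)$ and $\tau - a \in \operatorname{Dom}(\tilde f_2)$, such that $g(\tau) := \tilde f_1(\tau) - \tilde f_2(\tau - a) = 0$, implying $\sigma(g) \le 1$. By the definition of sign change complexity in (5), it then follows that $\mathcal{S}(\mathcal{F}_e) \le 1$ for the exponential population family $\mathcal{F}_e$. Hence, applying Theorem 6, we conclude that $n \ge 4K - 1$ suffices for the map $c: \mathcal{M}_K(\mathcal{F}_e) \to \mathbb{R}_+^{n-1}$ to be injective.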
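Equation (15) can be sanity-checked numerically. The sketch below assumes that the time rescaling referred to in (2) is $\tilde f(\tau) = f(R_f^{-1}(\tau))$ with $R_f(t) = \int_0^t ds / f(s)$, so that $\tilde f(R_f(t))$ should reproduce $f(t)$. The parameter values and function names are illustrative.

```python
import math

def rescaled_time(t, nu, beta):
    """R_f(t) = integral_0^t ds / f(s) for f(s) = nu * exp(beta * s), beta != 0."""
    return (1.0 - math.exp(-beta * t)) / (nu * beta)

def f_tilde(tau, nu, beta):
    """Time-rescaled exponential piece, as in equation (15)."""
    return nu / (1.0 - nu * beta * tau)

nu, beta = 2.0, 0.7                      # hypothetical parameters with beta > 0
times = [0.0, 0.5, 1.0, 2.0, 5.0]

# f_tilde evaluated at the rescaled time should match f at the original time.
max_error = max(
    abs(f_tilde(rescaled_time(t, nu, beta), nu, beta) - nu * math.exp(beta * t))
    for t in times
)

# For beta > 0 the rescaled time stays below 1 / (nu * beta).
within_domain = all(rescaled_time(t, nu, beta) < 1.0 / (nu * beta) for t in times)
print(max_error, within_domain)
```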

Proof of Corollary 9

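Let $f_1, f_2 \in \mathcal{F}_g$ be generalized-exponential functions which satisfy the following differential equations and initial conditions:

$$\frac{df_i}{dt} = \beta_i f_i(t)^{\gamma_i}, \qquad f_i(0) = \nu_i, \qquad i \in \{1, 2\},$$

where $\nu_i \in \mathbb{R}_+$, $\beta_i \in \mathbb{R}$ and $\gamma_i \in \mathbb{R}_{\ge 0}$. The solutions for $f_i$ are given by

$$f_i(t) = \begin{cases} \nu_i \exp(\beta_i t), & \gamma_i = 1, \\ \left[\nu_i^{1 - \gamma_i} + \beta_i t (1 - \gamma_i)\right]^{1/(1 - \gamma_i)}, & \gamma_i \ne 1. \end{cases}$$

It can be shown that the time-rescaled population size functions $\tilde f_i$ are given by

$$\tilde f_i(\tau) = \begin{cases} \nu_i \exp(\beta_i \tau), & \beta_i = 0 \text{ or } \gamma_i = 0, \\ \left(\nu_i^{-\gamma_i} - \beta_i \gamma_i \tau\right)^{-1/\gamma_i}, & \beta_i \ne 0 \text{ and } \gamma_i > 0. \end{cases} \tag{16}$$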
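One way to see where (16) comes from is sketched below. It assumes that the time rescaling in (2) is $\tau = R_{f_i}(t) = \int_0^t ds / f_i(s)$ with $\tilde f_i(\tau) = f_i(t)$. This is a reconstruction under that assumption, not the authors' own derivation.

```latex
\begin{align*}
\tau = R_{f_i}(t) = \int_0^t \frac{ds}{f_i(s)}
  &\;\Longrightarrow\; \frac{d\tau}{dt} = \frac{1}{f_i(t)}, \\
\frac{d\tilde f_i}{d\tau} = \frac{df_i}{dt}\,\frac{dt}{d\tau}
  &= \beta_i f_i^{\gamma_i} \cdot f_i = \beta_i\, \tilde f_i^{\,\gamma_i + 1}, \\
\gamma_i > 0:\quad \frac{d}{d\tau}\, \tilde f_i^{-\gamma_i}
  = -\gamma_i\, \tilde f_i^{-\gamma_i - 1}\, \frac{d\tilde f_i}{d\tau} = -\beta_i \gamma_i
  &\;\Longrightarrow\; \tilde f_i(\tau)^{-\gamma_i} = \nu_i^{-\gamma_i} - \beta_i \gamma_i \tau, \\
\gamma_i = 0:\quad \frac{d\tilde f_i}{d\tau} = \beta_i\, \tilde f_i
  &\;\Longrightarrow\; \tilde f_i(\tau) = \nu_i \exp(\beta_i \tau).
\end{align*}
```

When $\beta_i = 0$ the function is constant, $\tilde f_i(\tau) = \nu_i = \nu_i \exp(0 \cdot \tau)$, which is why that case is grouped with $\gamma_i = 0$ in (16).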

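In order to obtain an upper bound on $\mathcal{S}(\mathcal{F}_g)$, we consider the following three cases depending on the functional form of $\tilde f_1$ and $\tilde f_2$ in (16):

  • Case 1: $\tilde f_1(\tau) = \nu_1 \exp(\beta_1 \tau)$ and $\tilde f_2(\tau) = \nu_2 \exp(\beta_2 \tau)$. Since $\tilde f_1$ and $\tilde f_2$ are continuous functions of $\tau$, the number of sign changes of $g(\tau) := \tilde f_1(\tau) - \tilde f_2(\tau - a)$ is at most the number of roots of $g(\tau)$. Taking the logarithm of $\tilde f_1(\tau)$ and $\tilde f_2(\tau - a)$, it is easy to see that $g(\tau)$ has at most one root for any $a \in \mathbb{R}_{\ge 0}$. Hence, $\sigma(g) \le 1$.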

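  • Case 2: $\tilde f_1$ and $\tilde f_2$ have different functional forms. Suppose $\tilde f_1(\tau) = \nu_1 \exp(\beta_1 \tau)$ and $\tilde f_2(\tau) = \left(\nu_2^{-\gamma_2} - \beta_2 \gamma_2 \tau\right)^{-1/\gamma_2}$. For any $a_1, a_2 \in \mathbb{R}_{\ge 0}$ such that $\tau - a_i \in \operatorname{Dom}(\tilde f_i)$, the number of sign changes of $g(\tau) := \tilde f_1(\tau - a_1) - \tilde f_2(\tau - a_2)$ is at most the number of roots of $g(\tau)$. By raising $\tilde f_1(\tau - a_1)$ and $\tilde f_2(\tau - a_2)$ to the power of $-\gamma_2$, we see that the number of roots of $g(\tau)$ is the number of solutions to
    $$\mu_1 \exp(-\gamma_2 \beta_1 \tau) = \mu_2^{-\gamma_2} - \beta_2 \gamma_2 \tau, \tag{17}$$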

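    where $\mu_1 = \nu_1^{-\gamma_2} \exp(\gamma_2 \beta_1 a_1)$ and $\mu_2 = \left(\nu_2^{-\gamma_2} + \beta_2 \gamma_2 a_2\right)^{-1/\gamma_2}$. Equation (17) represents the intersection of an exponential function with a line and has at most 2 solutions for $\tau$. Hence, $\sigma(g) \le 2$.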

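  • Case 3: $\tilde f_i(\tau) = \left(\nu_i^{-\gamma_i} - \beta_i \gamma_i \tau\right)^{-1/\gamma_i}$ for $i = 1, 2$. Let $g(\tau) := \tilde f_1(\tau) - \tilde f_2(\tau - a)$, where $a \in \mathbb{R}_{\ge 0}$ is such that $\tau - a \in \operatorname{Dom}(\tilde f_2)$. Since $g$ is a continuous function, the number of sign changes of $g(\tau)$ in $\mathbb{R}_{\ge 0}$ is bounded by the number of distinct positive roots of $g(\tau)$. The number of distinct positive roots of $g$ is the number of distinct positive solutions $\tau$ to
    $$\left(\nu_1^{-\gamma_1} - \beta_1 \gamma_1 \tau\right)^{-1/\gamma_1} = \left(\nu_2^{-\gamma_2} - \beta_2 \gamma_2 (\tau - a)\right)^{-1/\gamma_2},$$
    which is also the number of distinct positive solutions to
    $$\left(\nu_1^{-\gamma_1} - \beta_1 \gamma_1 \tau\right)^{\gamma_2/\gamma_1} = \nu_2^{-\gamma_2} - \beta_2 \gamma_2 (\tau - a). \tag{18}$$
    Let $x := \tilde f_1(\tau)^{-\gamma_1} = \nu_1^{-\gamma_1} - \beta_1 \gamma_1 \tau$. Since $\tilde f_1$ is a time-rescaled population size function, $x > 0$ when $\tau \in \mathbb{R}_{\ge 0}$. Since $\beta_i \ne 0$ and $\gamma_i > 0$, (18) can be rewritten as
    $$x^{\gamma_2/\gamma_1} + A x + B = 0,$$

    where $A = -\frac{\beta_2 \gamma_2}{\beta_1 \gamma_1}$ and $B = \frac{\beta_2 \gamma_2}{\beta_1 \gamma_1} \nu_1^{-\gamma_1} - \nu_2^{-\gamma_2} - \beta_2 \gamma_2 a$. Letting $h(x) := x^{\gamma_2/\gamma_1} + A x + B$, the number of distinct positive solutions for $\tau$ in (18) is at most the number of distinct positive roots of the generalized polynomial $h$. For any real-valued function $g(x)$ possessing infinitely many derivatives and any interval $I \subseteq \mathbb{R}$, let $Z(g, I)$ denote the number of zeroes of $g$ contained in $I$, counted with multiplicity. By a consequence of Rolle's theorem ([15], Proposition 2.1), $Z(g, I) \le Z(g', I) + 1$. Observing that $h'(x) = \frac{\gamma_2}{\gamma_1} x^{\gamma_2/\gamma_1 - 1} + A$ has at most one root in $\mathbb{R}_+$, we get $Z(h, \mathbb{R}_+) \le Z(h', \mathbb{R}_+) + 1 \le 2$. Hence, the number of distinct positive solutions $\tau$ to (18) is at most 2, and $\sigma(g) \le 2$.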

From the definition of sign change complexity in (5) and the bound on $\sigma(g)$ in the three cases above, it follows that $\mathcal{S}(\mathcal{F}_g) \le 2$ for the generalized-exponential population family $\mathcal{F}_g$. Hence, applying Theorem 6, we conclude that $n \ge 6K - 2$ suffices for the map $c: \mathcal{M}_K(\mathcal{F}_g) \to \mathbb{R}_+^{n-1}$ to be injective.
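The bound $\sigma(g) \le 2$ in Case 3 can be attained. The Python sketch below counts sign changes of $g(\tau) = \tilde f_1(\tau) - \tilde f_2(\tau)$ (shift $a = 0$) on a grid, for two generalized-exponential pieces of the second form in (16). The parameter values are hypothetical and were chosen so that $\tilde f_1(\tau) = (1 + \tau)^{-1}$ and $\tilde f_2(\tau) = (1.1 + 0.1\tau)^{-2}$, and the helper names are assumptions.

```python
def f_tilde(tau, nu, beta, gamma):
    """Time-rescaled generalized-exponential piece from (16), beta != 0, gamma > 0."""
    return (nu ** (-gamma) - beta * gamma * tau) ** (-1.0 / gamma)

def count_sign_changes(seq):
    """Number of strict sign flips in a sequence, ignoring exact zeros."""
    signs = [1 if s > 0 else -1 for s in seq if s != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical parameters; beta < 0 keeps both domains equal to [0, inf).
piece1 = dict(nu=1.0, beta=-1.0, gamma=1.0)        # 1 / (1 + tau)
piece2 = dict(nu=1.1 ** -2, beta=-0.2, gamma=0.5)  # (1.1 + 0.1 tau)^(-2)

grid = [0.01 * k for k in range(10001)]            # tau in [0, 100]
g = [f_tilde(t, **piece1) - f_tilde(t, **piece2) for t in grid]
changes = count_sign_changes(g)
print(changes)
```

For these values $g(\tau) = 0$ reduces to $\tau^2 - 78\tau + 21 = 0$, which has two positive roots, so $g$ changes sign exactly twice on the grid.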

Proof of Lemma 10

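For the population size function $\zeta(t)$ defined by $\zeta(t) = \kappa \eta(t/\kappa)$, note that $R_\zeta(t)$ is given by

$$R_\zeta(t) = \int_0^t \frac{1}{\zeta(x)}\, dx = \int_0^t \frac{1}{\kappa\, \eta(x/\kappa)}\, dx = \int_0^{t/\kappa} \frac{1}{\eta(x)}\, dx = R_\eta(t/\kappa).$$

$E[T_{m,m}^{(\zeta)}]$ is then given by

$$E[T_{m,m}^{(\zeta)}] = \int_0^\infty \exp\left[-\binom{m}{2} R_\eta\left(\frac{t}{\kappa}\right)\right] dt = \kappa \int_0^\infty \exp\left[-\binom{m}{2} R_\eta(t)\right] dt = \kappa\, E[T_{m,m}^{(\eta)}].$$

Since $n \ge 2K + (2K - 1)\mathcal{S}(\mathcal{F})$, by Theorem 6, $\zeta$ is the unique population size function in $\mathcal{M}_K(\mathcal{F})$ with $E[T_{m,m}^{(\zeta)}] = \kappa c_m$ for $2 \le m \le n$.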
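The scaling identity $E[T_{m,m}^{(\zeta)}] = \kappa\, E[T_{m,m}^{(\eta)}]$ can be checked by direct numerical integration. The sketch below uses the hypothetical choice $\eta(t) = e^{-t}$, for which $R_\eta(t) = e^t - 1$, a fixed-step trapezoidal rule, and a finite truncation of the integral. All of these are illustrative assumptions rather than anything prescribed by the lemma.

```python
import math

def expected_T(m, R, upper, step=1e-3):
    """Trapezoidal approximation of integral_0^upper exp(-C(m,2) * R(t)) dt,
    which approximates E[T_{m,m}] when the integrand is negligible beyond `upper`."""
    rate = m * (m - 1) // 2
    n_steps = int(round(upper / step))
    total = 0.0
    for k in range(n_steps + 1):
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * math.exp(-rate * R(k * step))
    return total * step

kappa = 3.0

def R_eta(t):
    """Rescaled time for eta(t) = exp(-t): integral_0^t exp(x) dx."""
    return math.exp(t) - 1.0

def R_zeta(t):
    """Rescaled time for zeta(t) = kappa * eta(t / kappa)."""
    return R_eta(t / kappa)

ratios = [
    expected_T(m, R_zeta, upper=10.0 * kappa) / expected_T(m, R_eta, upper=10.0)
    for m in (2, 3, 5)
]
print(ratios)   # each ratio should be close to kappa
```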

To prove Theorem 12, we first need a lemma that characterizes a certain symmetry property of the invertible matrix that relates the genealogical quantities γ and c introduced in the proof of Lemma 1.

Lemma 13

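For a sample of size $n$, let $W$ be the $(n - 1) \times (n - 1)$ invertible matrix such that $\gamma_{n,b} = \sum_{m=2}^{n} W_{b,m} c_m$, where $\gamma_{n,b}$ is the total expected branch length subtending $b$ leaves and $c_m = E[T_{m,m}^{(\eta)}]$. Then, for every $b$ and $m$, where $1 \le b \le n - 1$ and $2 \le m \le n$, we have the following identities:

$$W_{b,m} + W_{n-b,m} = 0 \quad \text{if } m \text{ is odd}, \qquad W_{b,m} - W_{n-b,m} = 0 \quad \text{if } m \text{ is even}.$$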
Proof

From the proof of Lemma 1, it can be seen that the matrix $W$ is the product of 3 matrices whose entries are explicitly given combinatorial expressions. However, using Zeilberger's algorithm [34], Polanski and Kimmel ([36], equations (13)–(15)) also derived the following recurrence relation for the entries of $W$:

$$W_{b,2} = \frac{6}{n + 1}, \qquad W_{b,3} = \frac{30 (n - 2b)}{(n + 1)(n + 2)}, \qquad W_{b,m+2} = f(n, m)\, W_{b,m} + g(n, m)\, (n - 2b)\, W_{b,m+1}, \tag{19}$$

where $f(n, m)$ and $g(n, m)$ are rational functions of $n$ and $m$ given by

$$f(n, m) = -\frac{(1 + m)(3 + 2m)(n - m)}{m (2m - 1)(n + m + 1)}, \qquad g(n, m) = \frac{3 + 2m}{m (n + m + 1)}.$$

The lemma now follows by induction on $m$ using (19). The base cases are easy to check:

$$W_{b,2} - W_{n-b,2} = 0, \qquad W_{b,3} + W_{n-b,3} = 30\, \frac{(n - 2b) + (n - 2(n - b))}{(n + 1)(n + 2)} = 0.$$

Using (19), we see that if $m$ is odd,

$$\begin{aligned}
W_{b,m+2} + W_{n-b,m+2} &= f(n, m)\left(W_{b,m} + W_{n-b,m}\right) + g(n, m)\left\{(n - 2b) W_{b,m+1} + [n - 2(n - b)] W_{n-b,m+1}\right\} \\
&= f(n, m)\left(W_{b,m} + W_{n-b,m}\right) + g(n, m)(n - 2b)\left(W_{b,m+1} - W_{n-b,m+1}\right) \\
&= 0,
\end{aligned}$$

where the last equality follows from the induction hypothesis, which implies $W_{b,m} + W_{n-b,m} = 0$ and $W_{b,m+1} - W_{n-b,m+1} = 0$. Similarly, if $m$ is even,

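$$\begin{aligned}
W_{b,m+2} - W_{n-b,m+2} &= f(n, m)\left(W_{b,m} - W_{n-b,m}\right) + g(n, m)\left\{(n - 2b) W_{b,m+1} - [n - 2(n - b)] W_{n-b,m+1}\right\} \\
&= f(n, m)\left(W_{b,m} - W_{n-b,m}\right) + g(n, m)(n - 2b)\left(W_{b,m+1} + W_{n-b,m+1}\right) \\
&= 0,
\end{aligned}$$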

where again the last equality follows from the induction hypothesis.
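The symmetry in Lemma 13 is easy to confirm in exact arithmetic. The following Python sketch builds $W$ from the recurrence (19) with `fractions.Fraction` and tests both identities for several sample sizes. The function names and the range of $n$ are assumptions made for illustration.

```python
from fractions import Fraction

def W_matrix(n):
    """Entries W[b][m] for 1 <= b <= n-1 and 2 <= m <= n, built from recurrence (19)."""
    W = {b: {} for b in range(1, n)}
    for b in range(1, n):
        W[b][2] = Fraction(6, n + 1)
        if n >= 3:
            W[b][3] = Fraction(30 * (n - 2 * b), (n + 1) * (n + 2))
        for m in range(2, n - 1):
            f = Fraction(-(1 + m) * (3 + 2 * m) * (n - m),
                         m * (2 * m - 1) * (n + m + 1))
            g = Fraction(3 + 2 * m, m * (n + m + 1))
            W[b][m + 2] = f * W[b][m] + g * (n - 2 * b) * W[b][m + 1]
    return W

def symmetry_holds(n):
    """True if W[b][m] + W[n-b][m] = 0 for odd m and W[b][m] - W[n-b][m] = 0 for even m."""
    W = W_matrix(n)
    for b in range(1, n):
        for m in range(2, n + 1):
            sign = 1 if m % 2 == 1 else -1
            if W[b][m] + sign * W[n - b][m] != 0:
                return False
    return True

results = {n: symmetry_holds(n) for n in range(2, 13)}
print(all(results.values()))
```

Exact rational arithmetic avoids any ambiguity from floating-point cancellation when testing for zero.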

Proof of Theorem 12

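For a sample of size $n$ in the coalescent, let $\gamma_{n,b}$ be the total expected branch length subtending $b$ leaves, for $1 \le b \le n - 1$. Then there exists a positive constant $\kappa$ such that

$$\frac{\gamma_{n,d} + \gamma_{n,n-d}}{1 + \delta_{d,n-d}} = \kappa\, \chi_{n,d}, \tag{20}$$

for all $1 \le d \le \lfloor n/2 \rfloor$. Let $f_{n,d} = \frac{\gamma_{n,d} + \gamma_{n,n-d}}{1 + \delta_{d,n-d}}$. The relationship between $f = (f_{n,1}, \ldots, f_{n,\lfloor n/2 \rfloor})$ and $\gamma = (\gamma_{n,1}, \ldots, \gamma_{n,n-1})$ can be described by the linear equation

$$f = Z\gamma,$$

where $Z$ is an $\lfloor n/2 \rfloor \times (n - 1)$ matrix with entries given by

$$Z_{dj} = \begin{cases} 1, & \text{if } j = d \text{ or } j = n - d, \\ 0, & \text{otherwise}, \end{cases}$$

for $1 \le d \le \lfloor n/2 \rfloor$ and $1 \le j \le n - 1$. Hence, $\dim(\ker(Z)) = \lfloor (n - 1)/2 \rfloor$.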

From Lemma 1, we know that $\gamma$ and $c = (c_2, \ldots, c_n)$ are related as $\gamma = Wc$, where $W = (W_{b,m})$ is an $(n - 1) \times (n - 1)$ invertible matrix, where $1 \le b \le n - 1$ and $2 \le m \le n$. Hence,

$$f = Yc, \tag{21}$$

where $Y := ZW$. Since $Y_{b,m} = W_{b,m} + W_{n-b,m}$, we know from Lemma 13 that $Y_{b,m} = 0$ for all odd values of $m$. Therefore, every other column of the matrix $Y$ is zero. This implies that $\operatorname{span}(\{e_3, e_5, \ldots, e_{n - \mathbf{1}\{n \text{ even}\}}\}) \subseteq \ker(Y)$, where $e_i$ is an $(n - 1)$-dimensional unit vector defined as $e_i = (e_{i,2}, \ldots, e_{i,n})$, with $e_{i,i} = 1$ and $e_{i,j} = 0$ for $i \ne j$. Note that $n - \mathbf{1}\{n \text{ even}\} = 2\lfloor (n - 1)/2 \rfloor + 1$ and $\dim(\operatorname{span}(\{e_3, e_5, \ldots, e_{2\lfloor (n-1)/2 \rfloor + 1}\})) = \lfloor (n - 1)/2 \rfloor$. Now, since $W$ is invertible, $\dim(\ker(Y)) = \dim(\ker(ZW)) = \dim(\ker(Z)) = \lfloor (n - 1)/2 \rfloor$. Therefore,

$$\ker(Y) = \operatorname{span}\left(\left\{e_3, e_5, \ldots, e_{2\lfloor (n-1)/2 \rfloor + 1}\right\}\right). \tag{22}$$
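The structure of $Y = ZW$ described in (22) can be examined directly for a small sample size. The sketch below rebuilds $W$ from the recurrence (19) in exact arithmetic, folds it with $Z$, and checks that the odd-$m$ columns vanish and that the kernel has dimension $\lfloor (n - 1)/2 \rfloor$. The helper names and the choice $n = 7$ are illustrative assumptions.

```python
from fractions import Fraction

def W_rows(n):
    """W[b - 1][m - 2] for 1 <= b <= n-1 and 2 <= m <= n, from recurrence (19); needs n >= 3."""
    rows = []
    for b in range(1, n):
        w = [Fraction(6, n + 1), Fraction(30 * (n - 2 * b), (n + 1) * (n + 2))]
        for m in range(2, n - 1):
            f = Fraction(-(1 + m) * (3 + 2 * m) * (n - m),
                         m * (2 * m - 1) * (n + m + 1))
            g = Fraction(3 + 2 * m, m * (n + m + 1))
            w.append(f * w[m - 2] + g * (n - 2 * b) * w[m - 1])
        rows.append(w)
    return rows

def folded_map(n):
    """Y = Z W, with rows d = 1..floor(n/2) and columns m = 2..n."""
    W = W_rows(n)
    Y = []
    for d in range(1, n // 2 + 1):
        partners = {d, n - d}   # a set, so the case d = n - d is counted once
        Y.append([sum(W[j - 1][col] for j in partners) for col in range(n - 1)])
    return Y

def rank(M):
    """Rank by exact Gaussian elimination."""
    M = [row[:] for row in M]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

n = 7
Y = folded_map(n)
odd_columns_zero = all(row[m - 2] == 0 for row in Y for m in range(3, n + 1, 2))
kernel_dim = (n - 1) - rank(Y)
print(odd_columns_zero, kernel_dim)
```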

Suppose there exist two distinct models $\eta_1, \eta_2 \in \mathcal{M}_K(\mathcal{F})$ that produce the same folded SFS $f$. Let $c^{(1)}$ and $c^{(2)}$ be the vectors of genealogical quantities for models $\eta_1$ and $\eta_2$, respectively, where $c_m^{(1)} = E[T_{m,m}^{(\eta_1)}]$ and $c_m^{(2)} = E[T_{m,m}^{(\eta_2)}]$, $2 \le m \le n$. From (21), we know that $c^{(1)} - c^{(2)} \in \ker(Y)$. Using (22), $c_m^{(1)} - c_m^{(2)}$ can be written as

$$c_m^{(1)} - c_m^{(2)} = \sum_{l=1}^{\lfloor (n-1)/2 \rfloor} \alpha_l\, e_{2l+1,m}, \tag{23}$$

for some $\alpha_l \in \mathbb{R}$. Since $e_{i,j} = 0$ for $i \ne j$, (23) implies that $c_m^{(1)} - c_m^{(2)} = 0$ for all even values of $m$, where $2 \le m \le n$. Now applying a similar argument as in the proof of Theorem 6 to $c_m^{(1)} - c_m^{(2)}$ for even values of $m$, we conclude that if $\lceil (n - 1)/2 \rceil > (2K - 2) + (2K - 1)\mathcal{S}(\mathcal{F})$, then no two distinct models $\eta_1, \eta_2 \in \mathcal{M}_K(\mathcal{F})$ can produce the same $f$. This implies that a sample size $n \ge 2(2K - 1)(1 + \mathcal{S}(\mathcal{F}))$ suffices for identifying the population size function in $\mathcal{M}_K(\mathcal{F})$ from the folded SFS $f$, and the conclusion of the theorem follows from (20) and Lemma 10.
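For reference, the two sufficient sample sizes that appear in these proofs are simple to tabulate. The Python sketch below evaluates the bound $2K + (2K - 1)\mathcal{S}(\mathcal{F})$ for the SFS and $2(2K - 1)(1 + \mathcal{S}(\mathcal{F}))$ for the folded SFS. The function and dictionary names are assumptions. For the exponential and generalized-exponential families the values 1 and 2 are the upper bounds on $\mathcal{S}$ proved above, so the resulting sample sizes are sufficient but not claimed to be minimal.

```python
def unfolded_bound(K, S):
    """Sample size sufficient for identifiability from the SFS: 2K + (2K - 1) S."""
    return 2 * K + (2 * K - 1) * S

def folded_bound(K, S):
    """Sample size sufficient for identifiability from the folded SFS: 2 (2K - 1)(1 + S)."""
    return 2 * (2 * K - 1) * (1 + S)

# Sign change complexities (or proven upper bounds) for the three families.
SIGN_CHANGE_COMPLEXITY = {
    "constant": 0,
    "exponential": 1,
    "generalized-exponential": 2,
}

K = 5   # number of pieces, chosen for illustration
table = {
    family: (unfolded_bound(K, S), folded_bound(K, S))
    for family, S in SIGN_CHANGE_COMPLEXITY.items()
}
for family, (n_sfs, n_folded) in table.items():
    print(f"{family}: SFS n >= {n_sfs}, folded SFS n >= {n_folded}")
```

With $K$ pieces these reduce to $2K$, $4K - 1$ and $6K - 2$ for the SFS, matching Corollaries 7, 8 and 9.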

4. Discussion

In human genetics, several large-sample datasets have recently become available, with sample sizes on the order of several thousands to tens of thousands of individuals [1, 4, 6, 31, 44]. The patterns of polymorphism observed in these datasets deviate significantly from those expected under a constant population size, and there has been much interest in inferring recent and ancient human demographic changes that might explain these deviations [10, 24, 26]. Clearly, model identifiability is an important prerequisite for such statistical inference problems. In this paper, we have obtained mathematically rigorous identifiability results for demographic inference by showing that piecewise-defined population size functions over a wide class of function families are completely determined by the SFS, provided that the sample is sufficiently large. Furthermore, we have provided explicit bounds on the sample sizes that are sufficient for identifying such piecewise population size functions. These bounds depend on the number of pieces and the functional type of each piece. For piecewise-constant population size models, which have been extensively applied in demographic inference studies, our bounds are tight. We have also given analogous results for identifiability from the folded SFS, a variant of the SFS that is oblivious to the identities of the ancestral and mutant alleles.

Recent large-sample sequencing studies have consistently found a substantially higher fraction of rare variants compared to the predictions of the coalescent with a constant population size, even in regions of the genome that are believed to have evolved neutrally [9]. Keinan and Clark [16] suggested that recent rapid expansion of the population has given rise to variants which are private to single individuals in the population, and that this signature of population expansion is particularly apparent now due to the larger sample sizes involved in sequencing studies. We illustrate this point with a specific example. The blue plot in Figure 3 shows the expected SFS for a sample of size n = 19 under the piecewise-exponential population size history with 5 epochs recently inferred by Tennessen et al. [44] and illustrated in Figure 1. (Note that n = 19 is the sample size bound given by Corollary 8 for identifying piecewise-exponential models with up to 5 pieces.) The red plot in Figure 3 shows the expected SFS for the same sample size under a constant population size model. For this small sample size, the two expected frequency spectra are very similar despite the large difference in demographic models, indicating the difficulty of accurately recovering the details of recent exponential population growth using small-sample data. In contrast, for a much larger sample of size n = 2702, which corresponds to the actual sample size for Tennessen et al.’s data, the expected frequency spectra under the two demographic models mentioned above differ considerably more; see the green and purple plots in Figure 3.

Fig. 3.

The leading entries of the expected SFS ξn for a piecewise-exponential population size model inferred by Tennessen et al. [44]. This demographic model, shown (up to scaling) in Figure 1, was fitted using the observed SFS from a sample of 1351 (diploid) individuals of European ancestry [44]. The blue plot is the expected SFS for n = 19, which matches the sample size bound in Corollary 8 for identifying piecewise-exponential models with up to 5 pieces, while the green plot is the first 18 entries of the expected SFS for n = 2702 (1351 diploids). The red and purple plots are the expected SFS for n = 19 and n = 2702, respectively, for a constant population size function.

On the other hand, our identifiability results show that perfect data (i.e., the exact expected SFS) from even a small sample size of n = 4K − 1 are sufficient to uniquely identify a piecewise-exponential model with K pieces. This gap between theoretical identifiability and practical inference needs to be better addressed through robustness results that can account for the finite genome length, which limits the resolution to which the expected SFS of a random sample can be estimated. Our identifiability results apply in the limit that the genome length is infinite, which allows one to estimate the entries of the expected SFS exactly. By contrast, a finite-length genome does not permit exact estimation of the expected SFS, which can make it difficult in practice to resolve the details of ancient demographic events even if the sample size is large. This is because population size changes sufficiently far back in the past are likely to have only a marginal effect on the SFS, since the individuals in the sample are highly likely to have found a common ancestor by such ancient times.

Our work suggests several interesting avenues for future research. An important problem is to understand the sensitivity of the SFS to perturbations in the demographic parameters. A related problem is quantifying the extent to which errors in estimating the expected SFS from a finite amount of data affect the parameter estimates in inferred demographic models.

It would also be interesting to consider the possibility of developing an algebraic algorithm for demographic inference that closely mimics the linear algebraic proof of Corollary 7 provided in Section 3. For example, using a sample of size K + 1, one could consider inferring a piecewise-constant model with K pieces, with one piece for each of the most recent K − 1 generations and another piece for the population size further back in time. (Here, we are considering a restricted class of piecewise-constant population size functions with fixed change points, so the minimum sample size needed for distinguishing such models using the SFS is K + 1 rather than 2K.) Such an algebraic algorithm could provide a more principled way of inferring demographic parameters, compared to existing inference methods that rely on optimization procedures which lack theoretical guarantees for functions with multiple local optima.

In our work, we focused on the identifiability of demography from the expected SFS data. However, if one were to use the complete sequence data or other summary statistics such as the length distribution of shared haplotype tracts, it might be possible to uniquely identify the demography using even smaller sample sizes than those needed when using only the SFS. Indeed, several demographic inference methods have been developed to infer historical population size changes from such data using anywhere from a pair of genomic sequences [14, 24, 32] to tens of such sequences [42], and it is important to theoretically characterize the power and limitations of both the data and the inference methods.

Acknowledgments

We thank Graham Coop, Noah Mattoon, Aylwyn Scally and anonymous reviewers for their comments on our work. We also gratefully acknowledge the support of the Simons Institute for the Theory of Computing. The final version of this work was completed while the authors were participating in the 2014 program on “Evolutionary Biology and the Theory of Computing.”

Footnotes

1

Supported in part by an NIH Grant R01-GM094402, and a Packard Fellowship for Science and Engineering.

References

  • 1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
  • 2. Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 2008;4:e1000083. doi: 10.1371/journal.pgen.1000083.
  • 3. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN. Demonstrating stratification in a European American population. Nat Genet. 2005;37:868–872. doi: 10.1038/ng1607.
  • 4. Coventry A, Bull-Otterson LM, Liu X, Clark AG, Maxwell TJ, Crosby J, Hixson JE, Rea TJ, Muzny DM, Lewis LR, et al. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nature Communications. 2010;1:131. doi: 10.1038/ncomms1130.
  • 5. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9:e1003905. doi: 10.1371/journal.pgen.1003905.
  • 6. Fu W, O’Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Altshuler D, Shendure J, Nickerson DA, et al. Analysis of 6515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2012;493:216–220. doi: 10.1038/nature11690.
  • 7. Fu YX. Statistical properties of segregating sites. Theor Popul Biol. 1995;48:172–197. doi: 10.1006/tpbi.1995.1025.
  • 8. Gantmacher FR. The Theory of Matrices. Vol. 2. Chelsea; New York: 2000.
  • 9. Gazave E, Chang D, Clark AG, Keinan A. Population growth inflates the per-individual number of deleterious mutations and reduces their mean effect. Genetics. 2013;195:969–978. doi: 10.1534/genetics.113.153973.
  • 10. Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, Yu F, Gibbs RA, Bustamante CD, Altshuler DL, et al. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci USA. 2011;108:11983–11988. doi: 10.1073/pnas.1019276108.
  • 11. Griffiths RC. The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor Popul Biol. 2003;64:241–251. doi: 10.1016/s0040-5809(03)00075-3.
  • 12. Griffiths RC, Tavaré S. The age of a mutation in a general coalescent tree. Comm Statist Stochastic Models. 1998;14:273–295.
  • 13. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:e1000695. doi: 10.1371/journal.pgen.1000695.
  • 14. Harris K, Nielsen R. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 2013;9:e1003521. doi: 10.1371/journal.pgen.1003521.
  • 15. Jameson GJO. Counting zeros of generalised polynomials: Descartes’ rule of signs and Laguerre’s extensions. The Mathematical Gazette. 2006;90:223–234.
  • 16. Keinan A, Clark AG. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science. 2012;336:740–743. doi: 10.1126/science.1217283.
  • 17. Keinan A, Mullikin JC, Patterson N, Reich D. Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat Genet. 2007;39:1251–1255. doi: 10.1038/ng2116.
  • 18. Kidd JM, Gravel S, Byrnes J, Moreno-Estrada A, Musharoff S, Bryc K, Degenhardt JD, Brisbin A, Sheth V, Chen R, et al. Population genetic inference from personal genome data: Impact of ancestry and admixture on human genomic variation. Am J Hum Genet. 2012;91:660–671. doi: 10.1016/j.ajhg.2012.08.025.
  • 19. Kimura M. Solution of a process of random genetic drift with a continuous model. Proc Natl Acad Sci USA. 1955;41:144–150. doi: 10.1073/pnas.41.3.144.
  • 20. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61:893. doi: 10.1093/genetics/61.4.893.
  • 21. Kingman JFC. The coalescent. Stochastic Process Appl. 1982;13:235–248.
  • 22. Kingman JFC. On the genealogy of large populations. J Appl Probab. 1982;19A:27–43.
  • 23. Kingman JFC. Exchangeability and the evolution of large populations. In: Koch G, Spizzichino F, editors. Exchangeability in Probability and Statistics (Rome, 1981). North-Holland; Amsterdam: 1982. pp. 97–112.
  • 24. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231.
  • 25. Lohmueller KE, Indap AR, Schmidt S, Boyko AR, Hernandez RD, Hubisz MJ, Sninsky JJ, White TJ, Sunyaev SR, Nielsen R, et al. Proportionally more deleterious genetic variation in European than in African populations. Nature. 2008;451:994–997. doi: 10.1038/nature06611.
  • 26. Lukic S, Hey J. Demographic inference using spectral methods on SNP data, with an analysis of the human out-of-Africa expansion. Genetics. 2012;192:619–639. doi: 10.1534/genetics.112.141846.
  • 27. Lukić S, Hey J, Chen K. Non-equilibrium allele frequency spectra via spectral methods. Theor Popul Biol. 2011;79:203–219. doi: 10.1016/j.tpb.2011.02.003.
  • 28. Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004;36:512–517. doi: 10.1038/ng1337.
  • 29. Marth GT, Czabarka E, Murvai J, Sherry ST. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics. 2004;166:351–372. doi: 10.1534/genetics.166.1.351.
  • 30. Myers S, Fefferman C, Patterson N. Can one learn history from the allelic spectrum? Theor Popul Biol. 2008;73:342–348. doi: 10.1016/j.tpb.2008.01.001.
  • 31. Nelson MR, Wegmann D, Ehm MG, Kessner D, Jean PS, Verzilli C, Shen J, Tang Z, Bacanu S-A, Fraser D, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337:100–104. doi: 10.1126/science.1217876.
  • 32. Palamara PF, Lencz T, Darvasi A, Pe’er I. Length distributions of identity by descent reveal fine-scale demographic history. Am J Hum Genet. 2012;91:809–822. doi: 10.1016/j.ajhg.2012.08.030.
  • 33. Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WL, Ruczinski I, Fornage M, Siscovick DS, Zhu X, et al. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet. 2011;7:e1001371. doi: 10.1371/journal.pgen.1001371.
  • 34. Petkovšek M, Wilf HS, Zeilberger D. A = B. A K Peters; Wellesley, MA: 1996.
  • 35. Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63:33–40. doi: 10.1016/s0040-5809(02)00010-2.
  • 36. Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165:427–436. doi: 10.1093/genetics/165.1.427.
  • 37. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
  • 38. Reppell M, Boehnke M, Zöllner S. FTEC: A coalescent simulator for modeling faster than exponential growth. Bioinformatics. 2012;28:1282–1283. doi: 10.1093/bioinformatics/bts135.
  • 39. Reppell M, Boehnke M, Zöllner S. The impact of accelerating faster than exponential population growth on genetic variation. Genetics. 2014;196:819–828. doi: 10.1534/genetics.113.158675.
  • 40. Sankararaman S, Patterson N, Li H, Pääbo S, Reich D. The date of interbreeding between Neandertals and modern humans. PLoS Genet. 2012;8:e1002947. doi: 10.1371/journal.pgen.1002947.
  • 41. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583. doi: 10.1101/gr.3709305.
  • 42. Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics. 2013;194:647–662. doi: 10.1534/genetics.112.149096.
  • 43. Skoglund P, Jakobsson M. Archaic human ancestry in East Asia. Proc Natl Acad Sci USA. 2011;108:18301–18306. doi: 10.1073/pnas.1108181108.
  • 44. Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240.
  • 45. Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci USA. 2005;102:7882–7887. doi: 10.1073/pnas.0502300102.
  • 46. Živković D, Stephan W. Analytical results on the neutral non-equilibrium allele frequency spectrum based on diffusion theory. Theor Popul Biol. 2011;79:184–191. doi: 10.1016/j.tpb.2011.03.003.
