Proceedings of the National Academy of Sciences of the United States of America. 2015 Jun 8;112(25):7677–7682. doi: 10.1073/pnas.1503717112

Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum

Jonathan Terhorst, Yun S Song
PMCID: PMC4485089  PMID: 26056264

Significance

Numerous empirical studies in population genetics have used a summary statistic called the sample frequency spectrum (SFS), which summarizes the information in a sample of DNA sequences. Despite their popularity, the accuracy of inference methods based on the SFS is difficult to characterize theoretically, and it is currently unknown how the estimation accuracy improves as more sites in the genome are used. Here, we establish information theoretic limits on the accuracy of all estimators that use the SFS to infer population size histories. We study the rate of convergence to the true answer as the amount of data increases, and obtain the surprising result that it is exponentially worse than known convergence rates for many classical estimation problems in statistics.

Keywords: minimax rate, population genetics, demographic inference

Abstract

The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic that is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, little is currently known about the information theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least O(1/log s), where s is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number s of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.


The past decade has seen a revolution in our ability to interrogate the genome at the molecular level. Fueled by technological advances in DNA sequencing, studies now routinely query thousands or tens of thousands of individuals [refs. 1–4 and UK10K Project (www.uk10k.org) and Exome Aggregation Consortium (exac.broadinstitute.org)] to better understand disease susceptibility, heritability, population history, and other phenomena. In most cases, the conclusions of these studies come in the form of statistical estimates obtained from models that relate the effect of interest to mutation patterns arising in sampled DNA sequences. As genetic sample sizes explode, it is natural to wonder how additional data improve the quality of these estimates. While this general question has received intense focus in theoretical statistics, certain aspects of the genetics setting (for example, non-Gaussianity and lack of independence among samples) complicate efforts to study such models using classical techniques. New methods are needed to theoretically characterize some common models in statistical genetics.

Here, we address this need for a specific estimation problem in population genetics known as demographic inference. As we explain in further detail below, the aim of this problem is to reconstruct the sequence of historical events—including population size changes, migration, and admixture—that gave rise to present-day populations, using DNA samples obtained from those populations. We focus on the simplest problem of estimating the size history of a single population backward in time.

A summary statistic known as the sample frequency spectrum (SFS; defined below) is often used in empirical studies (2, 5–11), but there have been fewer attempts to understand SFS-based estimation from a theoretical perspective. The main result of this paper is to show that, for a common class of estimators that analyze the SFS, there is a fundamental limit on their accuracy as a function of the sample size. More precisely, we show that, under a standard statistical error metric known as minimax error, the rate at which these estimators converge to the truth for certain populations is at best inversely logarithmic in the number of independent segregating sites analyzed, and does not depend at all on the number of individuals sampled. Compared with other types of statistical estimation problems (for example, linear regression), this is an extremely slow rate of convergence. Our proof is information theoretic in nature and applies to any estimator that operates solely on the SFS. This is the first result we are aware of that characterizes the convergence rate of demographic history estimates as a function of sample size.

The remainder of this paper is organized as follows. In Preliminaries, we formally define our notation and model. In Main Results, we state our main theoretical results, followed by a discussion of their practical implications in Discussion. To streamline our exposition, all mathematical proofs are deferred until Proofs.

Preliminaries

The stochastic process underlying the inference procedure we consider is Kingman's coalescent (12–14), which evolves backward in time and describes the genealogy of a collection of chromosomes randomly sampled from a population. The population size is assumed to change deterministically over time and is described by a function $\eta : [0,\infty) \to (0,\infty)$, with $\eta(t)$ being the population size at time $t$ in the past. The instantaneous rate of coalescence between any pair of lineages at time $t$ is $1/\eta(t)$.

As in the standard infinite sites model of mutation (15), we assume that every dimorphic site (i.e., a site with exactly two observed allelic types) has experienced mutation exactly once in the evolutionary history of the sample. Further, for each such site, we assume that it is known which allele is the ancestral type versus the mutant type. In what follows, we use the terms “dimorphic” and “segregating” interchangeably.

A population size function $\eta(t)$ induces a probability distribution on the number of derived alleles found at a particular segregating site. Specifically, for a sample of $n \ge 2$ randomly sampled individuals, let $\xi_{n,b}(\eta)$, for $1 \le b \le n-1$, denote the probability that a segregating site contains $b$ mutant alleles in a sample of $n$ individuals under model $\eta$. The vector $\xi_n(\eta) := (\xi_{n,1}(\eta), \ldots, \xi_{n,n-1}(\eta))$ is called the expected SFS. In the coalescent setting, a general expression for $\xi_{n,b}(\eta)$ is given by (16)

$$\xi_{n,b}(\eta) \propto \sum_{k=2}^{n-b+1} \frac{\binom{n-b-1}{k-2}}{\binom{n-1}{k-1}}\, k\, \mathbb{E}T_{n,k}(\eta),$$

where $\mathbb{E}T_{n,k}(\eta)$ denotes the expected amount of time (in coalescent units) during which the genealogy of the sample contained $k$ lineages under model $\eta$. The expected waiting time $\mathbb{E}T_{m,m}(\eta)$ to the first coalescence in a sample of $m$ individuals is given by

$$c_m(\eta) := \mathbb{E}T_{m,m}(\eta) = \int_0^\infty t\, \frac{a_m}{\eta(t)} \exp\{-a_m R_\eta(t)\}\, dt,$$ [1]

where $a_m := \binom{m}{2}$ and $R_\eta(t) := \int_0^t \frac{1}{\eta(s)}\, ds$ is the cumulative rate of coalescence up to time $t$. It turns out (17) that there is an invertible linear transformation that relates $(\mathbb{E}T_{n,2}(\eta), \mathbb{E}T_{n,3}(\eta), \ldots, \mathbb{E}T_{n,n}(\eta))$ to $c(\eta) := (c_2(\eta), c_3(\eta), \ldots, c_n(\eta))$. Using this relation, the quantity $\xi_{n,b}(\eta)$ can be written as (18)

$$\xi_{n,b}(\eta) = \frac{\langle c(\eta), W_{n,b}\rangle}{\langle c(\eta), V_n\rangle},$$ [2]

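where $W_{n,b} = (W_{n,b,2}, \ldots, W_{n,b,n})$ and $V_n = (V_{n,2}, \ldots, V_{n,n})$ are vectors of universal constants that do not depend on the population size function $\eta$, and $\langle \cdot, \cdot \rangle$ denotes the $\ell_2$ inner product. Under model $\eta$, the quantity $\langle c(\eta), W_{n,b}\rangle$ is the total expected length of edges subtending $b$ out of $n$ individuals sampled at time 0, while the quantity $\langle c(\eta), V_n\rangle$ is the total expected tree length for a sample of size $n$. Both quantities are positive for all population size functions $\eta$. For an arbitrary population size function $\eta$, we have $\sum_{b=1}^{n-1} W_{n,b,m} = V_{n,m}$ for all $2 \le m \le n$, which implies

$$\sum_{b=1}^{n-1} \langle c(\eta), W_{n,b}\rangle = \langle c(\eta), V_n\rangle.$$ [3]

For a constant function $\eta(t) \equiv N$,

$$c_m(\eta) = \frac{N}{a_m},$$
$$\langle c(\eta), W_{n,b}\rangle = \frac{2N}{b},$$ [4]
$$\langle c(\eta), V_n\rangle = 2N H_{n-1},$$ [5]

where $H_{n-1} := \sum_{b=1}^{n-1} \frac{1}{b}$.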
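As a sanity check on the constant-size formulas, the following minimal sketch evaluates the general sum for $\xi_{n,b}$ with exact rational arithmetic and compares it with the closed form $1/(b\,H_{n-1})$ implied by Eqs. 4 and 5. It assumes that, for constant size $N$, the expected time with $k$ lineages is $N/\binom{k}{2}$; the function names are illustrative and not from the paper.

```python
from fractions import Fraction
from math import comb


def constant_size_sfs(n: int) -> list[Fraction]:
    """Expected SFS for constant size, from the general sum over k.

    Assumes E[T_{n,k}] = N / C(k, 2); N cancels on normalisation, so N = 1.
    """
    weights = []
    for b in range(1, n):
        total = Fraction(0)
        for k in range(2, n - b + 2):  # k = 2, ..., n - b + 1
            expected_time = Fraction(1, comb(k, 2))
            ratio = Fraction(comb(n - b - 1, k - 2), comb(n - 1, k - 1))
            total += ratio * k * expected_time
        weights.append(total)
    norm = sum(weights)
    return [w / norm for w in weights]


def harmonic(n: int) -> Fraction:
    """H_n = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum((Fraction(1, b) for b in range(1, n + 1)), Fraction(0))


n = 10
sfs = constant_size_sfs(n)
closed_form = [1 / (b * harmonic(n - 1)) for b in range(1, n)]
print(sfs == closed_form)
```

Exact fractions are used so that the comparison is an identity check rather than a floating-point tolerance test.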

To formulate the problem, we use the following notation. We suppose that a sample of $n \ge 2$ randomly sampled individuals has been typed at $s$ independent segregating sites. These data are used to form the empirical sample frequency spectrum, which is an $(n-1)$-tuple $(\hat\xi_{n,1}, \ldots, \hat\xi_{n,n-1})$, where $\hat\xi_{n,b}$ denotes the proportion of segregating sites with $b$ copies of the mutant allele and $n-b$ copies of the ancestral allele. A frequency-based estimator is any statistic $\hat\eta$ that maps an empirical SFS to a population size history.

Main Results

Here, we establish a minimax lower bound on the ability of any estimator $\hat\eta$ to accurately reconstruct population size functions.

A General Bound on the Kullback−Leibler Divergence Between Two SFS Distributions.

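Abusing notation, we use $D(\eta \,\|\, \eta')$ to denote the Kullback-Leibler (KL) divergence between the probability distributions $\xi_n(\eta)$ and $\xi_n(\eta')$. In Proofs, we prove the following general upper bound on the KL divergence between two SFS distributions:

Theorem 1.

Let $\mathcal{M}$ denote a general space of population size functions and suppose $\eta, \eta' \in \mathcal{M}$ satisfy $\eta(t) = \eta'(t)$ for all $0 \le t \le t_c$ and $\max_{t > t_c} \eta(t) \le \min_{t > t_c} \eta'(t)$. Then,

$$D(\eta \,\|\, \eta') \le \frac{\langle c(\eta') - c(\eta), V_n\rangle}{\langle c(\eta), V_n\rangle}.$$ [6]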

Bounds for a Family of Piecewise Constant Models.

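We now focus on a particular class of population size functions that are easier to analyze and are popular in the literature (11, 19, 20). For a fixed positive integer $K > 1$, let $\mathcal{M}_K$ denote the space of piecewise constant size functions with exactly $K$ pieces. A population size function $\eta$ is a member of $\mathcal{M}_K$ if and only if there exist positive real numbers $t_1 < \cdots < t_{K-1}$ and $N_1, N_2, \ldots, N_K$ such that

$$\eta(t) = \sum_{k=1}^{K} N_k\, \mathbf{1}\{t_{k-1} \le t < t_k\},$$ [7]

where, by convention, we define $t_0 = 0$ and $t_K = \infty$. For such an $\eta$, define

$$S_k(\eta) := \sum_{j=1}^{k} \frac{t_j - t_{j-1}}{N_j}.$$ [8]

For $\eta \in \mathcal{M}_K$, the expected waiting time $c_m(\eta)$ defined in Eq. 1 is given by

$$c_m(\eta) = \frac{1}{a_m} \sum_{k=1}^{K} N_k \left(e^{-a_m S_{k-1}(\eta)} - e^{-a_m S_k(\eta)}\right).$$ [9]

Note that since $t_K = \infty$,

$$e^{-a_m S_K(\eta)} \equiv 0, \quad \text{for all } \eta \in \mathcal{M}_K.$$ [10]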
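Eq. 9 is straightforward to evaluate numerically. The sketch below is one possible implementation, with an illustrative function name and argument layout (change points `t_1 < ... < t_{K-1}` and sizes `N_1, ..., N_K`) that are assumptions of this example. The last epoch is treated as infinite, so the final exponential vanishes as in Eq. 10.

```python
import math


def waiting_time(m, change_points, sizes):
    """Expected time to first coalescence, c_m, for a piecewise-constant history (Eq. 9).

    change_points: t_1 < ... < t_{K-1}; sizes: N_1, ..., N_K (last epoch is infinite).
    """
    a_m = math.comb(m, 2)
    edges = [0.0, *change_points, math.inf]
    total, s_prev = 0.0, 0.0  # s_prev tracks S_{k-1}, starting from S_0 = 0
    for k, size in enumerate(sizes):
        s_next = s_prev + (edges[k + 1] - edges[k]) / size  # S_k; infinite in the last epoch
        total += size * (math.exp(-a_m * s_prev) - math.exp(-a_m * s_next))
        s_prev = s_next
    return total / a_m


# Constant size N = 2 should give N / a_m, however the epochs are split.
print(waiting_time(3, [], [2.0]), waiting_time(3, [0.5, 1.7], [2.0, 2.0, 2.0]))
# A short, severe bottleneck shortens the expected waiting time.
print(waiting_time(4, [0.5, 0.6], [1.0, 0.1, 1.0]), waiting_time(4, [], [1.0]))
```

Splitting a constant history into several epochs of equal size is a convenient check, because the sum in Eq. 9 then telescopes to $N/a_m$.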

To formulate our result, we let $I, J$ denote positive integers that satisfy $I + J = K$, and introduce a subfamily $\mathcal{M}_{I,J} \subset \mathcal{M}_K$ of piecewise constant functions defined as follows. See Fig. 1 for illustration. We assume that all change points $t_1 < \cdots < t_{I+J-1}$ are fixed and that the sizes $N_1, \ldots, N_I$ of the first $I$ epochs are also fixed, with $N_I$ being the smallest size. So, all functions in $\mathcal{M}_{I,J}$ are identical to each other for the first $I$ epochs, and there is a population bottleneck in the last of these epochs (epoch $I$). Then, for $t \ge t_I$, every function $\eta \in \mathcal{M}_{I,J}$ undergoes jumps according to the following rules:

  • 1. For the interval $t_I \le t < t_{I+1}$, $\eta(t)$ takes a constant value of either $h$ or $h + \delta$, where $h > N_I$ and $\delta > 0$.

  • 2. At later change points $\{t_{I+1}, \ldots, t_{I+J-1}\}$, $\eta$ either stays the same or jumps upward by $\delta$.

Fig. 1. A family $\mathcal{M}_{I,J}$ of piecewise-constant population size models with $K = I + J$ epochs.

Hence, $\mathcal{M}_{I,J}$ consists of $2^J$ distinct piecewise constant functions that are nondecreasing functions of $t$ for $t \ge t_I$. Note that $\min_t \eta(t) = N_I$ for all $\eta \in \mathcal{M}_{I,J}$. For ease of notation, we use $\varepsilon := N_I$ to denote the bottleneck size and $\tau_B := t_I - t_{I-1}$ to denote the bottleneck duration. To facilitate analysis later, we fix $t_{I+j} - t_{I+j-1}$ to some positive constant $\tau_A$ for all $j = 1, \ldots, J-1$.

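For any two models in $\mathcal{M}_{I,J}$, we obtain the following bound on the difference of their waiting times to the first coalescence:

Lemma 2.

For all $\eta, \eta' \in \mathcal{M}_{I,J}$,

$$|c_m(\eta) - c_m(\eta')| \le \frac{J\delta}{a_m}\, e^{-a_m \tau_B/\varepsilon}.$$ [11]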

Together with Theorem 1, this lemma can be used to show

Theorem 3.

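Let $\eta, \eta' \in \mathcal{M}_{I,J}$ satisfy $\max_{t \ge t_I} \eta(t) \le \min_{t \ge t_I} \eta'(t)$. Then,

$$D(\eta \,\|\, \eta') \le \frac{J\delta}{\varepsilon}\, e^{-\tau_B/\varepsilon}.$$ [12]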

Proofs of these results are deferred to Proofs. It is interesting that the above bound does not depend on the number n of sampled individuals.
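The bound in Lemma 2 can be checked numerically on the two extreme members of the family: the history that never jumps and the history that jumps by $\delta$ at every change point. The sketch below is only an illustration; it uses the closed form for the expected waiting time of a piecewise-constant history, and all parameter values (two fixed epochs, ten post-bottleneck epochs, a mild bottleneck) are hypothetical choices, not values from the paper.

```python
import math


def waiting_time(m, change_points, sizes):
    """c_m for a piecewise-constant history; the last epoch extends to infinity."""
    a_m = math.comb(m, 2)
    edges = [0.0, *change_points, math.inf]
    total, s_prev = 0.0, 0.0
    for k, size in enumerate(sizes):
        s_next = s_prev + (edges[k + 1] - edges[k]) / size
        total += size * (math.exp(-a_m * s_prev) - math.exp(-a_m * s_next))
        s_prev = s_next
    return total / a_m


# Hypothetical family with I = 2 fixed epochs (sizes 1.0 and eps) and J = 10 later epochs.
J = 10
t1, tau_B, tau_A = 0.3, 0.5, 0.1
eps, h, delta = 0.5, 1.0, 0.01  # bottleneck size, base size after it, jump size
t_I = t1 + tau_B
change_points = [t1, t_I] + [t_I + j * tau_A for j in range(1, J)]
lower = [1.0, eps] + [h] * J                                   # never jumps
upper = [1.0, eps] + [h + j * delta for j in range(1, J + 1)]  # jumps at every change point

rows = []
for m in range(2, 7):
    a_m = math.comb(m, 2)
    gap = waiting_time(m, change_points, upper) - waiting_time(m, change_points, lower)
    bound = J * delta / a_m * math.exp(-a_m * tau_B / eps)
    rows.append((m, gap, bound))
    print(m, gap, bound, gap <= bound)
```

A mild bottleneck is used deliberately: for very severe bottlenecks the gap falls below double-precision resolution, and the subtraction of two nearly equal waiting times is no longer informative.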

Minimax Lower Bounds.

Before using the above results to obtain a minimax lower bound, we first note a subtle fact. Given any population size function $\eta$, consider a function $\zeta$ that satisfies $\zeta(t) = \kappa\,\eta(t/\kappa)$ for all $t \in [0,\infty)$, where $\kappa$ is some positive constant. Such functions are equivalent, as it turns out that $\xi_{n,b}(\zeta) = \xi_{n,b}(\eta)$ for all $n \ge 2$ and $1 \le b \le n-1$. To mod out by this equivalence, we assume that every $\eta$ satisfies $\eta(0) = N_{\mathrm{fix}}$, where $N_{\mathrm{fix}}$ is some fixed positive constant.
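One way to see this equivalence is the short calculation below. It is a sketch rather than the authors' argument, and it assumes the survival-function form of the expected waiting time, which follows from Eq. 1 by integration by parts when the expectation is finite.

```latex
% Assumption: c_m(\eta) = \int_0^\infty e^{-a_m R_\eta(t)}\,dt (Eq. 1 after integration by parts).
\begin{align*}
R_\zeta(t) &= \int_0^t \frac{ds}{\kappa\,\eta(s/\kappa)}
            = \int_0^{t/\kappa} \frac{du}{\eta(u)} = R_\eta(t/\kappa), \\
c_m(\zeta) &= \int_0^\infty e^{-a_m R_\eta(t/\kappa)}\,dt
            = \kappa \int_0^\infty e^{-a_m R_\eta(u)}\,du = \kappa\, c_m(\eta), \\
\xi_{n,b}(\zeta) &= \frac{\langle \kappa\, c(\eta), W_{n,b}\rangle}{\langle \kappa\, c(\eta), V_n\rangle}
            = \xi_{n,b}(\eta).
\end{align*}
```

The factor $\kappa$ cancels because Eq. 2 is a ratio of two quantities that are both linear in $c(\eta)$.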

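Let $\|\cdot\|_*$ denote a generic norm (specific examples will be given later) and let $\mathbb{E}_\eta(\cdot)$ denote expectation with respect to the SFS distribution $\xi_n(\eta) = (\xi_{n,1}(\eta), \ldots, \xi_{n,n-1}(\eta))$ induced by population size function $\eta$. Then, note that

$$\inf_{\hat\eta} \sup_{\eta \in \mathcal{M}} \mathbb{E}_\eta \|\hat\eta - \eta\|_* \;\ge\; \inf_{\hat\eta} \sup_{\eta \in \mathcal{M}_K} \mathbb{E}_\eta \|\hat\eta - \eta\|_* \;\ge\; \inf_{\hat\eta} \sup_{\eta \in \mathcal{M}_{I,J}} \mathbb{E}_\eta \|\hat\eta - \eta\|_*.$$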

In what follows, we will put a lower bound on the last quantity. We first fix a sensible distance metric on $\mathcal{M}$. An intuitive way to measure distance between two population size functions is their $L_1$ distance, $\|\eta_a - \eta_b\|_1 = \int_0^\infty |\eta_a(t) - \eta_b(t)|\, dt$, but this is unreasonably stringent in that $\|\eta_a - \eta_b\|_1 = \infty$ if $\eta_a$ and $\eta_b$ do not agree infinitely far back into the past. Instead we will focus on the following truncated $L_1$ distance: $\|\eta_a - \eta_b\|_{1,T} := \int_0^T |\eta_a(t) - \eta_b(t)|\, dt$, which measures the discrepancy between $\eta_a$ and $\eta_b$ back to some fixed time $T$ in the past.

Henceforth, let $\hat\eta$ be any estimator of the population size function that operates on a sample of $s$ independent segregating sites obtained from a sample of $n$ randomly sampled individuals. In Proofs, we prove the following main results of our paper:

Theorem 4.

Consider the subfamily $\mathcal{M}_{I,J}$ of models described above, and suppose $J > 8$ and $T \ge t_{I+J-1} + \tau_A$. Then,

$$\inf_{\hat\eta} \sup_{\eta \in \mathcal{M}_{I,J}} \mathbb{E}_\eta \|\hat\eta - \eta\|_{1,T} \ge C\, \tau_A\, \frac{(J-8)^2}{J}\, \frac{\varepsilon}{s}\, e^{\tau_B/\varepsilon},$$ [13]

where $C$ is a positive constant.

The above theorem applies to all models in $\mathcal{M}_{I,J}$. We now consider the subset $\mathcal{M}_{I,J}^{M} = \{\eta \in \mathcal{M}_{I,J} : \|\eta\|_\infty < M\}$, which is the set of all models in $\mathcal{M}_{I,J}$ that are bounded by some constant $M$. For this family of bounded population size functions, a sharper asymptotic lower bound can be obtained as follows.

Theorem 5.

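Suppose $J > 8$ and $T \ge t_{I+J-1} + \tau_A$. Then,

$$\inf_{\hat\eta} \sup_{\eta \in \mathcal{M}_{I,J}^{M}} \mathbb{E}_\eta \|\hat\eta - \eta\|_{1,T} \ge C'\, \frac{(J-8)^2}{J}\, \frac{\tau_B \tau_A}{\log s},$$ [14]

where $C'$ is a positive constant.

By specializing $\mathcal{M}_{I,J}^{M}$, a simplified version of Theorem 5 can be obtained: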

Corollary 6.

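Suppose $T \ge t_{I+J-1} + \tau_A$ and let $\mathcal{M}_{I,\cdot}^{M} = \bigcup_{J \ge 1} \mathcal{M}_{I,J}^{M}$. Then,

$$\inf_{\hat\eta} \sup_{\eta \in \mathcal{M}_{I,\cdot}^{M}} \mathbb{E}_\eta \|\hat\eta - \eta\|_{1,T} \ge C''\, \frac{(T - t_I)\, \tau_B}{\log s},$$ [15]

where $C''$ is a positive constant.

Note that the above lower bounds do not depend on the dimension of the SFS (which is equal to $n-1$). Hence, for a fixed number $s$ of segregating sites considered, using more individuals does not diminish the error bounds.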

Bottleneck Followed by Exponential Growth.

In the results presented above, we dropped smaller terms to obtain the dominant contribution to our lower bound. Here, we provide a more detailed analysis to study how the model in the recent past (i.e., the period $0 \le t \le t_{I-1}$) affects the lower bound. A slight modification of the above results permits us to analyze the following model class, which is of interest in, for example, human genetics (2, 3, 7): Let $\mathcal{G}_J$ be the family of models illustrated in Fig. 2 with exponential growth in the recent past. Specifically, $\eta(t) = \eta_0 e^{-\beta(\eta_0) t}$ for the period $0 \le t \le t_1$. The rate of growth $\beta(\eta_0) = \log(\eta_0/\gamma\varepsilon)/t_1$ is defined so that $\eta(t_1) = \gamma\varepsilon$ for all $\eta \in \mathcal{G}_J$, where $\gamma \ge 1$. The part for $t > t_1$ is the same as that for $t > t_{I-1}$ in $\mathcal{M}_{I,J}$ (Fig. 1). We obtain the following result for the subfamily $\mathcal{G}_J$:

Fig. 2. A family $\mathcal{G}_J$ of population size models with exponential growth in the recent past. This family consists of size histories that are piecewise constant before the bottleneck, and then jump to some level $\gamma\varepsilon$ and undergo (identical) exponential growth from time $t_1$ to present.

Theorem 7.

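Consider the subfamily $\mathcal{G}_J$ of models described above, and suppose $J > 8$ and $T \ge t_J + \tau_A$. Then,

$$\inf_{\hat\eta} \sup_{\eta \in \mathcal{G}_J} \mathbb{E}_\eta \|\hat\eta - \eta\|_{1,T} \ge C\, \tau_A\, \frac{(J-8)^2}{J}\, \frac{\varepsilon}{s}\, \exp\left[\frac{\tau_B}{\varepsilon} + t_1\, \frac{\frac{1}{\gamma\varepsilon} - \frac{1}{\eta_0}}{\log(\eta_0) - \log(\gamma\varepsilon)}\right].$$ [16]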

Theorem 7 is a measure of how (a lower bound on) estimation error depends on growth following a bottleneck. The two extremes $\eta_0 \to \infty$ and $\eta_0 \to \gamma\varepsilon$ have intuitive interpretations. For large $\eta_0$, the bound in Eq. 16 tends to the corresponding bound given by Theorem 4, as expected since coalescences become increasingly less likely in the first time period. Small $\eta_0$ has the effect of "prolonging" the bottleneck, thus increasing the minimax lower bound. In particular, if $\gamma = 1$ then $t_1[(1/\gamma\varepsilon) - (1/\eta_0)]/[\log(\eta_0) - \log(\gamma\varepsilon)] \to t_1/\varepsilon$ as $\eta_0 \to \gamma\varepsilon$, so that the effect of low population growth on the minimax lower bound is to simply prolong the bottleneck effect by an additional $t_1$ time periods.
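The two limits of the growth term in Eq. 16 can be illustrated numerically. The sketch below evaluates the extra term in the exponent for a range of present-day sizes, taking the growth factor equal to 1; the parameter values and the function name are hypothetical and chosen only for illustration.

```python
import math


def growth_term(eta0, gamma_eps, t1):
    """Extra exponent in Eq. 16: cumulative coalescence rate over the growth period."""
    return t1 * (1.0 / gamma_eps - 1.0 / eta0) / (math.log(eta0) - math.log(gamma_eps))


t1, eps = 0.2, 0.05  # hypothetical growth duration and bottleneck size (gamma = 1)
for eta0 in (0.0501, 0.5, 5.0, 5e3, 5e6):
    print(f"eta0 = {eta0:g}: term = {growth_term(eta0, eps, t1):.4f}, t1/eps = {t1 / eps:.4f}")
```

As the present-day size approaches the bottleneck size the term approaches `t1 / eps`, and as the present-day size grows the term shrinks toward zero, although only at a logarithmic pace.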

Discussion

In this paper, we have theoretically characterized fundamental limits on the accuracy of demographic inference from data. We have shown that the minimax error rate for estimating the piecewise-constant demography of a single population is at least $O(1/\log s)$, where $s$ is the number of independent segregating sites analyzed. In contrast, the minimax error for many classical estimation problems in statistics (for example, nonparametric regression or density estimation) decays inverse polynomially in the sample size (21). Compared with these problems, exponentially more samples would be required to estimate a population size history function to within a similar magnitude of error. The paper that most closely relates to the present work is by Kim et al. (22), who obtain lower bounds on the amount of exact coalescence time data necessary to distinguish between size histories in a hypothesis testing framework. Since coalescence times are never observed and must be estimated from data, these bounds place a limit on the accuracy with which a population size function can be inferred. The authors also describe an estimator that uses coalescence times (again observed without noise) to accurately recover the underlying population size function with high probability, at a rate that roughly matches the lower bound.

Another line of work centers around the identifiability of the parameter $\eta(t)$ using the SFS. Roughly speaking, a family of statistical models $\{P_\theta\}_{\theta \in \Theta}$ defined over a parameter space $\Theta$ is identifiable if, for any $\theta_1, \theta_2 \in \Theta$ with $\theta_1 \ne \theta_2$, the sampling distributions induced by $P_{\theta_1}$ and $P_{\theta_2}$ are different. In our context, this simply says that, for all $n$, $\xi_n(\eta_1) \ne \xi_n(\eta_2)$ unless $\eta_1 = \eta_2$ almost everywhere. Standard desiderata for statistical estimators (e.g., consistency or unbiasedness) are impossible without identifiability, so it is the weakest possible regularity condition one can impose on a useful family of models.

Perhaps surprisingly, it turns out that, in general, a population size function is not identifiable from the SFS (23). Indeed, for any given η(t), it has been shown that an infinite number of smooth functions F(t) exist such that ξn(η)=ξn(η+F). Moreover, explicit examples can be constructed that demonstrate this phenomenon (23). On the other hand, these counterexamples consist of functions that exhibit an unbounded frequency of oscillatory behavior near the present time, which is perhaps unrealistic when modeling naturally occurring populations. More recently, it has been shown (19) that identifiability holds for many classes of population size functions used by practitioners (including piecewise constant, piecewise exponential, and piecewise generalized exponential). Furthermore, the number n of sampled individuals sufficient for identifiability can be explicitly given and is a function of the complexity of the underlying class of models being studied (19).

Identifiability asserts that, given an infinite amount of data (specifically, taking the number of segregating sites $s \to \infty$), the model parameter $\eta(t)$ can be uniquely recovered. In practice, $s$ is finite, and only a perturbed version of the expected frequency spectrum, say $\hat\xi_n(\eta)$, is observed. From a practical standpoint, it is important to understand how these perturbations ultimately affect the parameter estimate $\hat\eta(t)$. It is this question that forms the starting point for the present work.

A single population evolving under a piecewise-constant demography is a special case of many richer classes of demographic models. For example, it is a (limiting) member of the family of exponential growth models, seen by taking each exponential growth parameter to zero. In the multispecies coalescent setting (10, 24), multiple population size histories must be estimated, and the error of that estimate must necessarily be lower bounded by that of estimating a single such history. Thus, our result can be expected to apply to a broader class of models than the one we have studied here.

As detailed in Proofs, the result in Theorem 5 follows from setting $\varepsilon = \tau_B/\log s$ and $\delta \propto (\varepsilon/s)\exp(\tau_B/\varepsilon)$ in the subfamily $\mathcal{M}_{I,J}^{M}$. The size $\tau_B/\log s$ is in coalescent units. In terms of the number of individuals, it is proportional to $g_B/\log s$, where $g_B$ is the number of generations corresponding to duration $\tau_B$ in the coalescent limit. Intuitively, as the severity of the bottleneck increases, the population is increasingly likely to find its most recent common ancestor (MRCA) during that time; farther back in time than the MRCA, no information is conveyed concerning the demographic events experienced by the population.

One might object to considering models with a bottleneck size that scales inversely with the number $s$ of segregating sites in the data, and it is indeed possible that a better convergence rate may be achievable for populations that are known not to contain a bottleneck. On the other hand, we note that $1/\log s$ decreases sufficiently slowly with $s$ that our result can be expected to apply to many real-world examples. For example, for $s \approx 10^8$, which is a conservative upper bound for most organisms, $g_B/\log s \approx 0.054\, g_B$. This implies that for populations that have experienced roughly an order-of-magnitude increase in effective population size during their history, accurate estimation of demographic events that occurred before this expansion is difficult using SFS-based methods. Additionally, an interesting aspect of our work is that our minimax lower bounds do not depend on the number $n$ of sampled individuals; increasing $n$ is not enough to overcome the information barrier imposed by the presence of a bottleneck. This is intuitively plausible since, as $n$ increases, the $(n+1)$th sampled lineage becomes more likely to coalesce early on.
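The arithmetic behind the 0.054 figure, and the contrast with a classical parametric rate, can be reproduced in a few lines. This is only an illustrative calculation using the natural logarithm; the inverse square-root column is included for comparison and is not a rate derived in this paper.

```python
import math


def inverse_log(s):
    """The factor 1 / log(s) that multiplies g_B (natural logarithm)."""
    return 1.0 / math.log(s)


for s in (1e4, 1e6, 1e8, 1e10):
    print(f"s = {s:.0e}: 1/log(s) = {inverse_log(s):.4f}, 1/sqrt(s) = {1 / math.sqrt(s):.1e}")
```

Increasing the number of sites a millionfold, from $10^4$ to $10^{10}$, reduces $1/\log s$ by only a factor of 2.5.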

An interesting question that we have not attempted to analyze is whether the $O(1/\log s)$ rate is optimal, i.e., whether there exists some estimator $\hat\eta(t)$ that achieves the minimax lower bound established here. In practice, from Eqs. 2, 8, and 9, it can be seen that naively maximizing the likelihood of the observed SFS with respect to $\eta(t)$ requires solving a nonconvex optimization problem, so that convergence to the global maximum is not even guaranteed. Computational issues aside, finding such an estimator remains an open theoretical challenge.

In closing, we stress that our result is specific to SFS-based estimators, which analyze only independent sites. The main allure of these estimators is their mathematical tractability, rather than their realism. In fact, a rich source of additional information exists in the correlation structure found among linked sites in the genome. Methods that seek to exploit this structure by modeling the action of recombination pose greater mathematical and computational difficulties, but there has been recent progress in this area (20, 25–29). Our result serves to underscore the importance of pursuing more realistic models of genomic evolution, challenging though they may be.

Proofs

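Proof of Theorem 1. To simplify the notation, we write $c = c(\eta)$ and $c' = c(\eta')$. Then, using Eq. 2, we can write

$$D(\eta \,\|\, \eta') = \sum_{b=1}^{n-1} \xi_{n,b}(\eta) \log \frac{\xi_{n,b}(\eta)}{\xi_{n,b}(\eta')} = \sum_{b=1}^{n-1} \xi_{n,b}(\eta) \left[\log\left(\frac{\langle c, W_{n,b}\rangle}{\langle c', W_{n,b}\rangle}\right) + \log\left(\frac{\langle c', V_n\rangle}{\langle c, V_n\rangle}\right)\right].$$

The assumption $\min_{t > t_c} \eta'(t) \ge \max_{t > t_c} \eta(t)$ implies that, for all times $t, t' > t_c$, the instantaneous rate of coalescence at time $t$ in model $\eta$ is greater than or equal to the instantaneous rate of coalescence at time $t'$ in model $\eta'$. Hence, this assumption together with $\eta(t) = \eta'(t)$ for all $0 \le t \le t_c$ implies $\langle c - c', W_{n,b}\rangle \le 0$ for all $1 \le b \le n-1$; equivalently, $\log(\langle c, W_{n,b}\rangle / \langle c', W_{n,b}\rangle) \le 0$. Additionally, $\langle c' - c, V_n\rangle / \langle c, V_n\rangle > -1$ and $\log(1 + x) \le x$ for all $x > -1$. Combining these facts, we obtain

$$D(\eta \,\|\, \eta') \le \sum_{b=1}^{n-1} \xi_{n,b}(\eta) \log\left(\frac{\langle c', V_n\rangle}{\langle c, V_n\rangle}\right) \le \sum_{b=1}^{n-1} \xi_{n,b}(\eta)\, \frac{\langle c' - c, V_n\rangle}{\langle c, V_n\rangle} = \frac{\langle c' - c, V_n\rangle}{\langle c, V_n\rangle},$$

where we have used $\sum_{b=1}^{n-1} \xi_{n,b}(\eta) = 1$ in the final equality.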

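Proof of Lemma 2. We distinguish two particular models, $\eta_\ell, \eta_u \in \mathcal{M}_{I,J}$, which are the lower and the upper envelopes of $\mathcal{M}_{I,J}$. The function $\eta_\ell$ stays constant at $h$ for all $t \ge t_I$, while $\eta_u$ jumps upward by $\delta$ at every change point $t_I, \ldots, t_{I+J-1}$. Hence, $\eta_\ell \le \eta \le \eta_u$ pointwise for all $\eta \in \mathcal{M}_{I,J}$. The two enveloping functions will form the basis of subsequent analysis.

Fix $\eta, \eta' \in \mathcal{M}_{I,J}$ and note that, by the definition of $\mathcal{M}_{I,J}$, one of these functions must pointwise dominate the other. Therefore, assume without loss of generality that $\eta(t) \le \eta'(t)$ for all $t$. Then, for all $t$,

$$\eta_\ell(t) \le \eta(t) \le \eta'(t) \le \eta_u(t),$$

which implies

$$c_m(\eta_\ell) \le c_m(\eta) \le c_m(\eta') \le c_m(\eta_u),$$

for all $m = 2, \ldots, n$. Using these inequalities, we conclude

$$c_m(\eta') - c_m(\eta) \le c_m(\eta_u) - c_m(\eta_\ell),$$

so it suffices to demonstrate Eq. 11 for $c_m(\eta_u) - c_m(\eta_\ell)$. Now, by Eq. 9 and the definition of $\eta_\ell$,

$$\begin{aligned} a_m c_m(\eta_\ell) &= \sum_{i=1}^{I} N_i \left[e^{-a_m S_{i-1}(\eta_\ell)} - e^{-a_m S_i(\eta_\ell)}\right] + \sum_{j=1}^{J} h \left[e^{-a_m S_{I+j-1}(\eta_\ell)} - e^{-a_m S_{I+j}(\eta_\ell)}\right] \\ &= \sum_{i=1}^{I} N_i \left[e^{-a_m S_{i-1}(\eta_\ell)} - e^{-a_m S_i(\eta_\ell)}\right] + h\, e^{-a_m S_I(\eta_\ell)}, \end{aligned}$$

where we have used Eq. 10. Similarly,

$$\begin{aligned} a_m c_m(\eta_u) &= \sum_{i=1}^{I} N_i \left[e^{-a_m S_{i-1}(\eta_u)} - e^{-a_m S_i(\eta_u)}\right] + \sum_{j=1}^{J} (h + j\delta) \left[e^{-a_m S_{I+j-1}(\eta_u)} - e^{-a_m S_{I+j}(\eta_u)}\right] \\ &= \sum_{i=1}^{I} N_i \left[e^{-a_m S_{i-1}(\eta_u)} - e^{-a_m S_i(\eta_u)}\right] + h\, e^{-a_m S_I(\eta_u)} + \sum_{j=1}^{J} j\delta \left[e^{-a_m S_{I+j-1}(\eta_u)} - e^{-a_m S_{I+j}(\eta_u)}\right]. \end{aligned}$$

Now, using the fact that $\eta_\ell$ and $\eta_u$ agree on the first $I$ epochs, we obtain

$$\begin{aligned} a_m \left[c_m(\eta_u) - c_m(\eta_\ell)\right] &= \sum_{j=1}^{J} j\delta \left[e^{-a_m S_{I+j-1}(\eta_u)} - e^{-a_m S_{I+j}(\eta_u)}\right] \\ &= \delta \sum_{j=1}^{J} e^{-a_m S_{I+j-1}(\eta_u)} \\ &\le J\delta\, e^{-a_m \tau_B/\varepsilon}, \end{aligned}$$ [17]

where the second line follows from telescoping and the fact that $S_{I+J}(\eta_u) = \infty$, while the last line follows from the fact that $\tau_B/\varepsilon \le S_{I+j-1}(\eta_u)$ for all $j = 1, \ldots, J$.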

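Proof of Theorem 3. For ease of notation, define $c = c(\eta)$ and $c' = c(\eta')$. By Lemma 2,

$$\langle c' - c, V_n\rangle = \sum_{m=2}^{n} (c'_m - c_m) V_{n,m} \le J\delta \sum_{m=2}^{n} \frac{V_{n,m}}{a_m}\, e^{-a_m \tau_B/\varepsilon} \le J\delta\, e^{-\tau_B/\varepsilon} \sum_{m=2}^{n} \frac{V_{n,m}}{a_m},$$

where the second inequality follows from $e^{-a_m \tau_B/\varepsilon} \le e^{-\tau_B/\varepsilon}$ for all $m = 2, \ldots, n$. Now, noting that $\sum_{m=2}^{n} (V_{n,m}/a_m)$ corresponds to the total tree length for the constant population size function $\eta \equiv 1$ and using Eq. 5, we obtain

$$\langle c' - c, V_n\rangle \le J\delta\, e^{-\tau_B/\varepsilon}\, 2H_{n-1}.$$ [18]

To finish the proof, recall that $\langle c, V_n\rangle$ is the total expected branch length of the coalescent tree under model $\eta$. Since $\min_t \eta(t) = \varepsilon$, we have that $\langle c, V_n\rangle$ is at least as large as the corresponding quantity under a model with constant population size $\varepsilon$. By Eq. 5, the total expected tree length under the latter model equals $2\varepsilon H_{n-1}$. Thus, $\langle c, V_n\rangle \ge 2\varepsilon H_{n-1}$, and combining this result with Eq. 18 gives

$$\frac{\langle c' - c, V_n\rangle}{\langle c, V_n\rangle} \le \frac{J\delta}{\varepsilon}\, e^{-\tau_B/\varepsilon}.$$

Finally, Eq. 12 follows from this inequality and Theorem 1.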

Proof of Theorem 4. Our proof uses a generalized form of Fano’s inequality (30). Adapted to our setting and notation, the method reads as follows.

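Theorem 8 (Fano's method). Consider a space $\mathcal{M}$ of population size models. Let $r \ge 2$ be an integer, and let $S_n^r = \{\eta_1, \eta_2, \ldots, \eta_r\}$ contain $r$ population size functions such that for all $a \ne b$, $\|\eta_a - \eta_b\|_* \ge \alpha_r$ and $D(\xi_n(\eta_a) \,\|\, \xi_n(\eta_b)) \le \beta_r$. Let $\hat\eta^{(n,s)} = \hat\eta^{(n,s)}(X_1, \ldots, X_s)$ be an estimator of $\eta$ based on the SFS data $X_1, \ldots, X_s$ sampled independently from $\xi_n(\eta)$; i.e., $X_1, \ldots, X_s$ are SFS data for $n$ individuals at $s$ independent segregating sites. Then,

$$\inf_{\hat\eta} \sup_{\eta \in \mathcal{M}} \mathbb{E}_\eta \|\hat\eta^{(n,s)} - \eta\|_* \ge \frac{\alpha_r}{2} \left(1 - \frac{s\beta_r + \log 2}{\log r}\right).$$ [19]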

This theorem places a lower bound on the minimax rate of convergence of a population size history estimator based on the SFS.

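For $\eta \in \mathcal{M}_{I,J}$, let $w_j \in \{0, 1\}$ denote the variable indicating whether $\eta$ jumps by $\delta$ at change point $t_{I+j}$. Let $Y = \{w = (w_0, \ldots, w_{J-1}) \mid w_i \in \{0, 1\}\}$, where $J \ge 8$. By the Varshamov-Gilbert lemma (see ref. 31, Lemma 4.7), there exist $X = \{w^{(0)}, \ldots, w^{(M)}\} \subset Y$ such that (i) $w^{(0)} = (0, \ldots, 0)$, (ii) $M \ge 2^{J/8}$, and (iii) $H(w^{(i)}, w^{(j)}) \ge J/8$ for $i \ne j$, where $H(\cdot, \cdot)$ denotes the Hamming distance.

Let $\mathcal{M}_{I,J}^{X}$ denote the subset of $2^{J/8} + 1$ functions in $\mathcal{M}_{I,J}$ with the indicator variable for $\delta$ jumps at $t_I, \ldots, t_{I+J-1}$ given by $w \in X$. Then, for any two $\eta_a \ne \eta_b \in \mathcal{M}_{I,J}^{X}$, we have

$$\|\eta_a - \eta_b\|_{1,T} \ge \frac{J}{8}\, \tau_A \delta.$$ [20]

Using Theorem 8 via Eq. 20 and Theorem 3, we obtain

$$\inf_{\hat\eta} \sup_{\eta \in \mathcal{M}_{I,J}} \mathbb{E}_\eta \|\hat\eta^{(n,s)} - \eta\|_{1,T} \ge \frac{J \tau_A \delta}{16} \left[1 - \frac{sJ \frac{\delta}{\varepsilon} e^{-\tau_B/\varepsilon} + \log 2}{\log(2^{J/8} + 1)}\right] \ge \frac{J \tau_A \delta}{16} \left[1 - \frac{sJ \frac{\delta}{\varepsilon} e^{-\tau_B/\varepsilon} + \log 2}{\frac{J}{8} \log 2}\right].$$ [21]

We now optimize the bound with respect to $\delta$. A straightforward calculation shows that the maximum is attained at

$$\delta^* = \frac{(J-8) \log 2}{16 J} \left(\frac{\varepsilon}{s}\right) e^{\tau_B/\varepsilon},$$ [22]

and setting $\delta = \delta^*$ in Eq. 21 yields the result.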
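The optimization over $\delta$ can be verified numerically. The sketch below treats the final expression in Eq. 21 as a function of $\delta$, evaluates it at the maximizer from Eq. 22, and compares the result with the closed form in Eq. 13. The constant $\log(2)/512$ is what this particular calculation produces and is not stated in the text; the parameter values are hypothetical.

```python
import math


def fano_bound(delta, J, tau_A, tau_B, eps, s):
    """Final expression in Eq. 21 as a function of the jump size delta."""
    beta = J * delta / eps * math.exp(-tau_B / eps)  # KL bound from Theorem 3
    return J * tau_A * delta / 16 * (1 - (s * beta + math.log(2)) / (J / 8 * math.log(2)))


def delta_star(J, tau_B, eps, s):
    """Maximizer given in Eq. 22."""
    return (J - 8) * math.log(2) / (16 * J) * (eps / s) * math.exp(tau_B / eps)


J, tau_A, tau_B, eps, s = 40, 0.1, 0.5, 0.1, 1e6  # hypothetical values
d = delta_star(J, tau_B, eps, s)
best = fano_bound(d, J, tau_A, tau_B, eps, s)
# Closed form of Eq. 13 with C = log(2) / 512 (derived here, not quoted from the paper).
closed = math.log(2) / 512 * tau_A * (J - 8) ** 2 / J * (eps / s) * math.exp(tau_B / eps)
print(best, closed)
print(fano_bound(0.9 * d, J, tau_A, tau_B, eps, s) < best,
      fano_bound(1.1 * d, J, tau_A, tau_B, eps, s) < best)
```

The bound is a concave quadratic in $\delta$, so perturbing the maximizer in either direction lowers it, which the last line checks.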

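Proof of Theorem 5. The result is obtained by scaling $\varepsilon$ with the number of segregating sites $s$. Denote this scaling by $\varepsilon(s)$; we will determine $\varepsilon(s)$ that produces the largest possible lower bound. Starting from Eq. 22 in the proof of Theorem 4, note that $\delta^*$ scales as $(\varepsilon/s)\, e^{\tau_B/\varepsilon} =: f(\varepsilon)$. To satisfy the constraint that $\|\eta\|_\infty < M$ for all $\eta \in \mathcal{M}_{I,J}^{M}$ and $s$, the condition

$$\limsup_{s \to \infty} \max\left\{\frac{\varepsilon(s)}{s}\, e^{\tau_B/\varepsilon(s)},\; \varepsilon(s)\right\} < \infty$$ [23]

must therefore hold. This implies that $\varepsilon(s)\, s^p \to \infty$ as $s \to \infty$ for all $p > 0$. Suppose that $q := \liminf_{s \to \infty} [(\varepsilon(s) \log s)/\tau_B] < 1$; note that $\varepsilon(s) > 0$ implies $q \ge 0$. Then there exists a diverging sequence $s_1, s_2, \ldots$ with $\log(s_i) < [(1+q)/2][\tau_B/\varepsilon(s_i)]$ for all $i$, whence

$$\limsup_{s \to \infty} \frac{\varepsilon(s)}{s}\, e^{\tau_B/\varepsilon(s)} \ge \limsup_{i \to \infty} \frac{\varepsilon(s_i)}{s_i}\, e^{\frac{2}{1+q} \log(s_i)} = \limsup_{i \to \infty} \varepsilon(s_i)\, s_i^{\frac{1-q}{1+q}} = \infty.$$

From this, it follows that $\varepsilon(s) \ge \tau_B/\log s$ for sufficiently large $s$. Now, on the interval $(0, \infty)$, the function $f(\varepsilon)$ is convex with a unique minimum at $\varepsilon = \tau_B$. Let $\varepsilon$ be a point where $f(\varepsilon) > f(\tau_B/\log s) = \tau_B/\log s$. Then $\varepsilon \notin [\tau_B/\log s, \tau_B]$. If $\varepsilon > \tau_B$, then $f(\varepsilon) < (\varepsilon/s)\, e^{1}$. Since $\tau_B/\log s < f(\varepsilon)$, we then conclude $\varepsilon > s\tau_B/(e^{1} \log s)$, which is not bounded as $s \to \infty$.

In summary, we see that the largest possible lower bound that obeys Eq. 23 must have $f(\varepsilon)$ asymptotically $\tau_B/\log s$, and that this bound is achieved by setting $\varepsilon(s) = \tau_B/\log s$. Plugging this in to Eq. 19 yields the claim.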
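The key identity used in this proof, that the scaling function evaluated at the bottleneck size $\tau_B/\log s$ equals $\tau_B/\log s$, is easy to confirm numerically, along with the location of the minimum. The values of `s` and `tau_B` below are hypothetical.

```python
import math


def f(eps, s, tau_B):
    """Scaling of the optimal jump size: (eps / s) * exp(tau_B / eps)."""
    return eps / s * math.exp(tau_B / eps)


s, tau_B = 1e8, 1.0
eps_star = tau_B / math.log(s)
print(f(eps_star, s, tau_B), tau_B / math.log(s))  # the two values agree
print(f(tau_B, s, tau_B), f(0.9 * tau_B, s, tau_B), f(1.1 * tau_B, s, tau_B))  # minimum at eps = tau_B
print(f(0.5 * eps_star, s, tau_B))  # shrinking eps further makes f blow up
```

Halving the bottleneck size below `eps_star` multiplies the exponential factor by `s`, which is why the boundedness constraint in Eq. 23 rules out faster-shrinking bottlenecks.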

Proof of Corollary 6. For $c \in (0, 1)$, choose $J$ large enough so that $(J-8)/J > c$, and fix $\tau_A$ so that $T = t_I + J\tau_A$. Then $(J-8)\tau_A \ge cJ\tau_A = c(T - t_I)$. Substituting the above inequalities into Eq. 14 and letting $C'' = C' c^2$ yields the desired result.

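Proof of Theorem 7. The theorem is obtained by suitably modifying the preceding results to account for the effect of exponential growth in the first period. Let $\eta_u, \eta_\ell$ be the analogously defined upper and lower envelope functions for $\mathcal{G}_J$. Then

$$\int_0^{t_{j+1}} \frac{ds}{\eta_u(s)} = \frac{e^{\beta(\eta_0) t_1} - 1}{\eta_0\, \beta(\eta_0)} + \frac{\tau_B}{\varepsilon} + \sum_{i=3}^{j+1} \frac{t_i - t_{i-1}}{N_i} = t_1\, \frac{\frac{1}{\gamma\varepsilon} - \frac{1}{\eta_0}}{\log(\eta_0) - \log(\gamma\varepsilon)} + \frac{\tau_B}{\varepsilon} + \sum_{i=3}^{j+1} \frac{t_i - t_{i-1}}{N_i},$$

where we have used the definition of $\beta(\eta_0)$ in the second equality. Since all size histories in $\mathcal{G}_J$ are equal up to period $t_2$, the steps of Lemma 2 all go through unchanged. Starting from Eq. 17, we obtain the modified bound

$$a_m \left[c_m(\eta_u) - c_m(\eta_\ell)\right] \le J\delta\, \exp\left\{-a_m t_1\, \frac{\frac{1}{\gamma\varepsilon} - \frac{1}{\eta_0}}{\log(\eta_0) - \log(\gamma\varepsilon)}\right\} e^{-a_m \tau_B/\varepsilon}.$$ [24]

Propagating the modified bound (Eq. 24) through Theorems 3 and 4 ultimately yields the claim.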

Acknowledgments

We thank Anand Bhaskar for helpful comments on a draft of this paper and for suggesting Corollary 6 to simplify the presentation of the main result. We also thank Jack Kamm and Jeff Spence for useful feedback. This research is supported in part by a Citadel Fellowship (to J.T.), National Institutes of Health Grant R01-GM109454 (to Y.S.S.), a Packard Fellowship for Science and Engineering (to Y.S.S.), and a Miller Research Professorship (to Y.S.S.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

References

  • 1.Abecasis GR, et al. 1000 Genomes Project Consortium A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Nelson MR, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337(6090):100–104. doi: 10.1126/science.1217876. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Tennessen JA, et al. Broad GO Seattle GO NHLBI Exome Sequencing Project Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–69. doi: 10.1126/science.1219240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Fu W, et al. NHLBI Exome Sequencing Project Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493(7431):216–220. doi: 10.1038/nature11690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Nielsen R. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics. 2000;154(2):931–942. doi: 10.1093/genetics/154.2.931.
  • 6. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5(10):e1000695. doi: 10.1371/journal.pgen.1000695.
  • 7. Coventry A, et al. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat Commun. 2010;1:131. doi: 10.1038/ncomms1130.
  • 8. Gazave E, et al. Neutral genomic regions refine models of recent rapid human population growth. Proc Natl Acad Sci USA. 2014;111(2):757–762. doi: 10.1073/pnas.1310398110.
  • 9. Gravel S, et al.; 1000 Genomes Project. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci USA. 2011;108(29):11983–11988. doi: 10.1073/pnas.1019276108.
  • 10. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9(10):e1003905. doi: 10.1371/journal.pgen.1003905.
  • 11. Bhaskar A, Wang YXR, Song YS. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 2015;25(2):268–279. doi: 10.1101/gr.178756.114.
  • 12. Kingman JFC. The coalescent. Stochastic Process Appl. 1982;13(3):235–248.
  • 13. Kingman JFC. On the genealogy of large populations. J Appl Probab. 1982;19A:27–43.
  • 14. Kingman JFC. In: Exchangeability in Probability and Statistics. Koch G, Spizzichino F, editors. North-Holland; Amsterdam: 1982. pp. 97–112.
  • 15. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893–903. doi: 10.1093/genetics/61.4.893.
  • 16. Griffiths R, Tavaré S. The age of a mutation in a general coalescent tree. Commun Stat Stochastic Models. 1998;14(1–2):273–295.
  • 17. Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63(1):33–40. doi: 10.1016/s0040-5809(02)00010-2.
  • 18. Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165(1):427–436. doi: 10.1093/genetics/165.1.427.
  • 19. Bhaskar A, Song YS. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat. 2014;42(6):2469–2493. doi: 10.1214/14-AOS1264.
  • 20. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–496. doi: 10.1038/nature10231.
  • 21. Tsybakov AB. Introduction to Nonparametric Estimation. Springer; New York: 2009.
  • 22. Kim J, Mossel E, Rácz MZ, Ross N. Can one hear the shape of a population history? Theor Popul Biol. 2014;100:26–38. doi: 10.1016/j.tpb.2014.12.002.
  • 23. Myers S, Fefferman C, Patterson N. Can one learn history from the allelic spectrum? Theor Popul Biol. 2008;73(3):342–348. doi: 10.1016/j.tpb.2008.01.001.
  • 24. Chen H. The joint allele frequency spectrum of multiple populations: A coalescent theory approach. Theor Popul Biol. 2012;81(2):179–195. doi: 10.1016/j.tpb.2011.11.004.
  • 25. Paul JS, Steinrücken M, Song YS. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187(4):1115–1128. doi: 10.1534/genetics.110.125534.
  • 26. Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics. 2013;194(3):647–662. doi: 10.1534/genetics.112.149096.
  • 27. Steinrücken M, Paul JS, Song YS. A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor Popul Biol. 2013;87:51–61. doi: 10.1016/j.tpb.2012.08.004.
  • 28. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 2014;10(5):e1004342. doi: 10.1371/journal.pgen.1004342.
  • 29. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–925. doi: 10.1038/ng.3015.
  • 30. Yu B. Assouad, Fano, and Le Cam. In: Festschrift for Lucien Le Cam. Pollard D, Torgersen E, Yang GL, editors. Springer; New York: 1997. pp. 423–435.
  • 31. Massart P. Concentration Inequalities and Model Selection. Springer; Berlin: 2007.

