Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum

Jonathan Terhorst; Yun S Song

doi:10.1073/pnas.1503717112

2015 Jun 8;112(25):7677–7682. doi: 10.1073/pnas.1503717112

PMCID: PMC4485089 PMID: 26056264

Significance

Numerous empirical studies in population genetics have used a summary statistic called the sample frequency spectrum (SFS), which summarizes the information in a sample of DNA sequences. Despite their popularity, the accuracy of inference methods based on the SFS is difficult to characterize theoretically, and it is currently unknown how the estimation accuracy improves as more sites in the genome are used. Here, we establish information theoretic limits on the accuracy of all estimators that use the SFS to infer population size histories. We study the rate of convergence to the true answer as the amount of data increases, and obtain the surprising result that it is exponentially worse than known convergence rates for many classical estimation problems in statistics.

Keywords: minimax rate, population genetics, demographic inference

Abstract

The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic that is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, little is currently known about the information theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least O(1/log s), where s is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number s of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.

The past decade has seen a revolution in our ability to interrogate the genome at the molecular level. Fueled by technological advances in DNA sequencing, studies now routinely query thousands or tens of thousands of individuals [refs. 1–4 and UK10K Project (www.uk10k.org) and Exome Aggregation Consortium (exac.broadinstitute.org)] to better understand disease susceptibility, heritability, population history, and other phenomena. In most cases, the conclusions of these studies come in the form of statistical estimates obtained from models that relate the effect of interest to mutation patterns arising in sampled DNA sequences. As genetic sample sizes explode, it is natural to wonder how additional data improve the quality of these estimates. While this general question has received intense focus in theoretical statistics, certain aspects of the genetics setting (for example, non-Gaussianity and lack of independence among samples) complicate efforts to study such models using classical techniques. New methods are needed to theoretically characterize some common models in statistical genetics.

Here, we address this need for a specific estimation problem in population genetics known as demographic inference. As we explain in further detail below, the aim of this problem is to reconstruct the sequence of historical events—including population size changes, migration, and admixture—that gave rise to present-day populations, using DNA samples obtained from those populations. We focus on the simplest problem of estimating the size history of a single population backward in time.

A summary statistic known as the sample frequency spectrum (SFS; defined below) is often used in empirical studies (2, 5–11), but there have been fewer attempts to understand SFS-based estimation from a theoretical perspective. The main result of this paper is to show that, for a common class of estimators that analyze the SFS, there is a fundamental limit on their accuracy as a function of the sample size. More precisely, we show that, under a standard statistical error metric known as minimax error, the rate at which these estimators converge to the truth for certain populations is at best inversely logarithmic in the number of independent segregating sites analyzed, and does not depend at all on the number of individuals sampled. Compared with other types of statistical estimation problems (for example, linear regression), this is an extremely slow rate of convergence. Our proof is information theoretic in nature and applies to any estimator that operates solely on the SFS. This is the first result we are aware of that characterizes the convergence rate of demographic history estimates as a function of sample size.

The remainder of this paper is organized as follows. In Preliminaries, we formally define our notation and model. In Main Results, we state our main theoretical results, followed by a discussion of their practical implications in Discussion. To streamline our exposition, all mathematical proofs are deferred until Proofs.

Preliminaries

The stochastic process underlying the inference procedure we consider is Kingman’s coalescent (12–14), which evolves backward in time and describes the genealogy of a collection of chromosomes randomly sampled from a population. The population size is assumed to change deterministically over time and is described by a function $η : [0, \infty) \to (0, \infty),$ with $η (t)$ being the population size at time t in the past. The instantaneous rate of coalescence between any pair of lineages at time t is $1 / η (t)$ .

As in the standard infinite sites model of mutation (15), we assume that every dimorphic site (i.e., a site with exactly two observed allelic types) has experienced mutation exactly once in the evolutionary history of the sample. Further, for each such site, we assume that it is known which allele is the ancestral type versus the mutant type. In what follows, we use the terms “dimorphic” and “segregating” interchangeably.

A population size function $η (t)$ induces a probability distribution on the number of derived alleles found at a particular segregating site. Specifically, for a sample of $n \geq 2$ randomly sampled individuals, let $ξ_{n, b}^{(η)}$ , for $1 \leq b \leq n - 1$ , denote the probability that a segregating site contains b mutant alleles in a sample of n individuals under model η. The vector $ξ_{n}^{(η)} \overset{def}{=} (ξ_{n, 1}^{(η)}, \dots, ξ_{n, n - 1}^{(η)})$ is called the expected SFS. In the coalescent setting, a general expression for $ξ_{n, b}^{(η)}$ is given by (16)

ξ_{n, b}^{(η)} \propto \sum_{k = 2}^{n - b + 1} \frac{\binom{n - b - 1}{k - 2}}{\binom{n - 1}{k - 1}} \cdot k \cdot E T_{n, k}^{(η)},

where $E T_{n, k}^{(η)}$ denotes the expected amount of time (in coalescent units) during which the genealogy of the sample contains k lineages under model η. The expected waiting time $E T_{m, m}^{(η)}$ to the first coalescence in a sample of m individuals is given by

c_{m}^{(η)} \overset{def}{=} E T_{m, m}^{(η)} = \int_{0}^{\infty} t \frac{a_{m}}{η (t)} \exp \{ - a_{m} R_{η} (t) \} d t,

[1]

where $a_{m} \overset{def}{=} \binom{m}{2}$ and $R_{η} (t) \overset{def}{=} \int_{0}^{t} \frac{1}{η (s)} d s$ is the cumulative rate of coalescence up to time t. It turns out (17) that there is an invertible linear transformation that relates $(E T_{n, 2}^{(η)}, E T_{n, 3}^{(η)}, \dots, E T_{n, n}^{(η)})$ to $c^{(η)} \overset{def}{=} (c_{2}^{(η)}, c_{3}^{(η)}, \dots, c_{n}^{(η)})$ . Using this relation, the quantity $ξ_{n, b}^{(η)}$ can be written as (18)

ξ_{n, b}^{(η)} = \frac{〈 c^{(η)}, W_{n, b} 〉}{〈 c^{(η)}, V_{n} 〉},

[2]

where $W_{n, b} = (W_{n, b, 2}, \dots, W_{n, b, n})$ and $V_{n} = (V_{n, 2}, \dots, V_{n, n})$ are vectors of universal constants that do not depend on the population size function η, and $〈 \cdot, \cdot 〉$ denotes the $l_{2}$ inner product. Under model η, the quantity $〈 c^{(η)}, W_{n, b} 〉$ is the total expected length of edges subtending b out of n individuals sampled at time 0, while the quantity $〈 c^{(η)}, V_{n} 〉$ is the total expected tree length for a sample of size n. Both quantities are positive for all population size functions η. For an arbitrary population size function η, we have $\sum_{b = 1}^{n - 1} W_{n, b, m} = V_{n, m}$ for all $2 \leq m \leq n$ , which implies

\sum_{b = 1}^{n - 1} 〈 c^{(η)}, W_{n, b} 〉 = 〈 c^{(η)}, V_{n} 〉 .

[3]

For a constant function $η (t) \equiv N$ ,

c_{m}^{(η)} = \frac{N}{a_{m}},

〈 c^{(η)}, W_{n, b} 〉 = \frac{2}{b} N,

[4]

〈 c^{(η)}, V_{n} 〉 = 2 N H_{n - 1},

[5]

where $H_{n - 1} \overset{def}{=} \sum_{b = 1}^{n - 1} \frac{1}{b}$ .
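As a sanity check on the constant-size case, the general sum for $ξ_{n, b}^{(η)}$ can be evaluated numerically and compared with the closed form $(1 / b) / H_{n - 1}$ implied by Eqs. 4 and 5. The sketch below assumes the standard fact that, for constant size N, the expected time with k lineages is $N / a_{k}$; the function names are illustrative and not part of the original analysis.

```python
from math import comb

def expected_sfs_constant(n):
    """Expected SFS for constant population size, via the general sum over k.

    Assumes E T_{n,k} = N / a_k with a_k = k(k-1)/2. N cancels after
    normalization, so it is set to 1 here.
    """
    weights = []
    for b in range(1, n):
        total = 0.0
        for k in range(2, n - b + 2):
            et_k = 2.0 / (k * (k - 1))  # expected time with k lineages, N = 1
            total += comb(n - b - 1, k - 2) / comb(n - 1, k - 1) * k * et_k
        weights.append(total)
    norm = sum(weights)
    return [w / norm for w in weights]

def closed_form_sfs(n):
    """Closed form implied by Eqs. 4 and 5: xi_{n,b} = (1/b) / H_{n-1}."""
    h = sum(1.0 / b for b in range(1, n))
    return [(1.0 / b) / h for b in range(1, n)]
```

The two functions should agree to floating-point precision for any $n \geq 2$, which is one way to confirm that the proportionality constant in the general sum is indeed the total expected tree length.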

To formulate the problem, we use the following notation. We suppose that a sample of $n \geq 2$ randomly sampled individuals has been typed at s independent segregating sites. These data are used to form the empirical sample frequency spectrum, which is an $(n - 1)$ -tuple $({\hat{ξ}}_{n, 1}, \dots, {\hat{ξ}}_{n, n - 1})$ , where ${\hat{ξ}}_{n, b}$ denotes the proportion of segregating sites with b copies of the mutant allele and $n - b$ copies of the ancestral allele. A frequency-based estimator is any statistic $\hat{η}$ that maps an empirical SFS to a population size history.
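To make the definition concrete, the following minimal sketch builds an empirical SFS from per-site derived allele counts. The input representation (a list holding one count per segregating site) and the function name are assumptions made for illustration.

```python
from collections import Counter

def empirical_sfs(derived_counts, n):
    """Return the (n-1)-tuple of proportions of sites with b derived alleles.

    derived_counts: one integer per segregating site, each in 1..n-1.
    n: number of sampled individuals.
    """
    s = len(derived_counts)
    if s == 0:
        raise ValueError("at least one segregating site is required")
    if any(not 1 <= b <= n - 1 for b in derived_counts):
        raise ValueError("every site must be segregating: 1 <= b <= n - 1")
    tally = Counter(derived_counts)
    return [tally[b] / s for b in range(1, n)]
```

Any frequency-based estimator in the sense used here would take the returned list, and nothing else, as its input.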

Main Results

Here, we establish a minimax lower bound on the ability of any estimator $\hat{η}$ to accurately reconstruct population size functions.

A General Bound on the Kullback−Leibler Divergence Between Two SFS Distributions.

Abusing notation, we use $D (η ‖ η^{'})$ to denote the Kullback−Leibler (KL) divergence between the probability distributions $ξ_{n}^{(η)}$ and $ξ_{n}^{(η^{'})}$ . In Proofs, we prove the following general upper bound on the KL divergence between two SFS distributions:

Theorem 1.

Let $ℳ$ denote a general space of population size functions and suppose $η, η^{'} \in ℳ$ satisfy $η (t) = η^{'} (t)$ for all $0 \leq t \leq t_{c}$ and ${max}_{t > t_{c}} η (t) \leq {min}_{t > t_{c}} η^{'} (t)$ . Then,

D (η ‖ η^{'}) \leq \frac{〈 c^{(η^{'})} - c^{(η)}, V_{n} 〉}{〈 c^{(η)}, V_{n} 〉} .

[6]

Bounds for a Family of Piecewise Constant Models.

We now focus on a particular class of population size functions that are easier to analyze and are popular in the literature (11, 19, 20). For a fixed positive integer $K > 1$ , let $ℳ_{K} \subset ℳ$ denote the space of piecewise constant size functions with exactly K pieces. A population size function η is a member of $ℳ_{K}$ if and only if there exist positive real numbers $t_{1} < \dots < t_{K - 1}$ and $N_{1}, N_{2}, \dots, N_{K}$ such that

η (t) = \sum_{k = 1}^{K} N_{k} \mathbf{1} \{ t_{k - 1} \leq t < t_{k} \},

[7]

where, by convention, we define $t_{0} = 0$ and $t_{K} = \infty$ . For such an η, define

S_{k}^{(η)} \overset{def}{=} \sum_{j = 1}^{k} \frac{t_{j} - t_{j - 1}}{N_{j}} .

[8]

For $η \in ℳ_{K}$ , the expected waiting time $c_{m}^{(η)}$ defined in Eq. 1 is given by

c_{m}^{(η)} = \frac{1}{a_{m}} \sum_{k = 1}^{K} N_{k} (e^{- a_{m} S_{k - 1}^{(η)}} - e^{- a_{m} S_{k}^{(η)}}) .

[9]

Note that since $t_{K} = \infty$ ,

e^{- a_{m} S_{K}^{(η)}} \equiv 0 \quad \text{for all } η \in ℳ_{K} .

[10]
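Eqs. 8 to 10 translate directly into a short routine. The sketch below is one possible implementation, with argument names chosen here for illustration; it uses an infinite final change point so that the last exponential vanishes, as in Eq. 10.

```python
import math

def waiting_time(m, change_points, sizes):
    """c_m for a piecewise constant eta, following Eqs. 8-10.

    change_points: t_1 < ... < t_{K-1} (may be empty when K = 1).
    sizes: N_1, ..., N_K.
    """
    a_m = m * (m - 1) / 2
    times = [0.0] + list(change_points) + [math.inf]
    total, s_prev = 0.0, 0.0  # s_prev holds S_{k-1}, with S_0 = 0
    for k, n_k in enumerate(sizes, start=1):
        s_k = s_prev + (times[k] - times[k - 1]) / n_k  # Eq. 8; S_K is infinite
        total += n_k * (math.exp(-a_m * s_prev) - math.exp(-a_m * s_k))
        s_prev = s_k
    return total / a_m
```

For a single epoch of size N the routine reduces to $N / a_{m}$, matching the constant-size expression given earlier.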

To formulate our result, we let $I, J$ denote positive integers that satisfy $I + J = K$ , and introduce a subfamily $ℱ_{I, J} \subset ℳ_{K}$ of piecewise constant functions defined as follows. See Fig. 1 for an illustration. We assume that all change points $t_{1} < \dots < t_{I + J - 1}$ are fixed and that the sizes $N_{1}, \dots, N_{I}$ of the first I epochs are also fixed, with $N_{I}$ being the smallest size. So, all functions in $ℱ_{I, J}$ are identical to each other for the first I epochs, and there is a population bottleneck in the last of these epochs (epoch I). Then, for $t \geq t_{I}$ , every function $η \in ℱ_{I, J}$ undergoes jumps according to the following rules:

1. For the interval $t_{I} \leq t < t_{I + 1}$ , $η (t)$ takes a constant value of either h or $h + δ$ , where $h > N_{I}$ and $δ > 0$ .
2. At later change points $t_{I + 1}, \dots, t_{I + J - 1}$ , η either stays the same or jumps upward by δ.

Fig. 1. — A family $ℱ_{I, J}$ of piecewise-constant population size models with $K = I + J$ epochs.

Hence, $ℱ_{I, J}$ consists of $2^{J}$ distinct piecewise constant functions that are nondecreasing functions of t for $t \geq t_{I}$ . Note that ${min}_{t} η (t) = N_{I}$ for all $η \in ℱ_{I, J}$ . For ease of notation, we use $ε \overset{def}{=} N_{I}$ to denote the bottleneck size and $τ_{B} \overset{def}{=} t_{I} - t_{I - 1}$ to denote the bottleneck duration. To facilitate analysis later, we fix $t_{I + j} - t_{I + j - 1}$ to some positive constant $τ_{A}$ for all $j = 1, \dots, J - 1$ .
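The count of $2^{J}$ members can be seen by enumerating the jump patterns. In the sketch below, each model is represented only by its sizes in the J epochs after $t_{I}$ (the first I epochs are shared by all members); the binary encoding w and the function name are illustrative assumptions.

```python
from itertools import product

def family_sizes(h, delta, J):
    """Sizes in epochs I+1, ..., I+J for every member of F_{I,J}.

    w[j] = 1 means eta jumps upward by delta at change point t_{I+j},
    for j = 0, ..., J-1; w[0] selects between h and h + delta.
    """
    members = []
    for w in product((0, 1), repeat=J):
        sizes, level = [], h
        for jump in w:
            level += jump * delta
            sizes.append(level)
        members.append(tuple(sizes))
    return members
```

The all-zeros pattern stays at h throughout, and the all-ones pattern jumps at every change point; every other member lies between these two pointwise.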

For any two models in $ℱ_{I, J}$ , we obtain the following bound on the difference of their waiting times to the first coalescence:

Lemma 2.

For all $η, η^{'} \in ℱ_{I, J}$ ,

| c_{m}^{(η)} - c_{m}^{(η^{'})} | \leq J \frac{δ}{a_{m}} e^{- a_{m} τ_{B} / ε} .

[11]

Together with Theorem 1, this lemma can be used to show

Theorem 3.

Suppose $η, η^{'} \in ℱ_{I, J}$ satisfy $\max_{t \geq t_{I}} η (t) \leq \min_{t \geq t_{I}} η^{'} (t) .$ Then,

D (η ‖ η^{'}) \leq J \frac{δ}{ε} e^{- τ_{B} / ε} .

[12]

Proofs of these results are deferred to Proofs. It is interesting that the above bound does not depend on the number n of sampled individuals.
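A quick numerical check of Lemma 2 is possible for the two extreme members of $ℱ_{I, J}$: the one that never jumps and the one that jumps at every change point. The sketch below is illustrative only; the helper names and the particular parameter values in the usage note are assumptions, and the waiting-time routine simply evaluates Eq. 9.

```python
import math

def waiting_time(m, change_points, sizes):
    """c_m for a piecewise constant eta (Eq. 9)."""
    a_m = m * (m - 1) / 2
    times = [0.0] + list(change_points) + [math.inf]
    total, s_prev = 0.0, 0.0
    for k, n_k in enumerate(sizes, start=1):
        s_k = s_prev + (times[k] - times[k - 1]) / n_k
        total += n_k * (math.exp(-a_m * s_prev) - math.exp(-a_m * s_k))
        s_prev = s_k
    return total / a_m

def lemma2_gap_and_bound(m, change_points, fixed_sizes, h, delta, J):
    """Return (c_m gap between the extreme models, right-hand side of Eq. 11).

    change_points: t_1, ..., t_{I+J-1}; fixed_sizes: N_1, ..., N_I.
    """
    I = len(fixed_sizes)
    eps = fixed_sizes[-1]  # bottleneck size N_I
    t_prev = change_points[I - 2] if I >= 2 else 0.0
    tau_b = change_points[I - 1] - t_prev  # bottleneck duration t_I - t_{I-1}
    lower = list(fixed_sizes) + [h] * J
    upper = list(fixed_sizes) + [h + j * delta for j in range(1, J + 1)]
    gap = waiting_time(m, change_points, upper) - waiting_time(m, change_points, lower)
    a_m = m * (m - 1) / 2
    bound = J * delta / a_m * math.exp(-a_m * tau_b / eps)
    return gap, bound
```

For example, with change points `[1.0, 1.5, 2.5, 3.5]`, fixed sizes `[5.0, 0.5]`, `h = 2.0`, `delta = 0.3`, and `J = 3`, the gap stays below the bound for small m, and both shrink rapidly as m grows because of the factor $e^{- a_{m} τ_{B} / ε}$.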

Minimax Lower Bounds.

Before using the above results to obtain a minimax lower bound, we first note a subtle fact. Given any population size function η, consider a function ζ that satisfies $ζ (t) = κ \cdot η (t / κ)$ for all $t \in [0, \infty)$ , where κ is some positive constant. Such functions are equivalent, as it turns out that $ξ_{n, b}^{(ζ)} = ξ_{n, b}^{(η)}$ for all $n \geq 2$ and $1 \leq b \leq n - 1$ . To mod out by this equivalence, we assume that every $η \in ℳ$ satisfies $η (0) = N_{fix}$ , where $N_{fix}$ is some fixed positive constant.
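The equivalence can be verified from Eqs. 1 and 2 by a change of variables. The derivation below is a brief sketch; the substitution variables $u = s / κ$ in the first line and $u = t / κ$ in the second are introduced here for illustration.

```latex
R_{ζ}(t) = \int_{0}^{t} \frac{ds}{κ\, η(s/κ)} = \int_{0}^{t/κ} \frac{du}{η(u)} = R_{η}(t/κ),
\qquad
c_{m}^{(ζ)} = \int_{0}^{\infty} t\, \frac{a_{m}}{κ\, η(t/κ)} \exp\{-a_{m} R_{η}(t/κ)\}\, dt
            = κ \int_{0}^{\infty} u\, \frac{a_{m}}{η(u)} \exp\{-a_{m} R_{η}(u)\}\, du
            = κ\, c_{m}^{(η)} .
```

Since every entry of $c^{(ζ)}$ equals κ times the corresponding entry of $c^{(η)}$, the factor κ cancels between the numerator and denominator of Eq. 2.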

Let ${‖ \cdot ‖}_{*}$ denote a generic norm (specific examples will be given later) and let $E_{η} (\cdot)$ denote expectation with respect to the SFS distribution $ξ_{n}^{(η)} = (ξ_{n, 1}^{(η)}, \dots, ξ_{n, n - 1}^{(η)})$ induced by population size function η. Then, note that

\inf_{\hat{η}} \sup_{η \in ℳ} E_{η} \| \hat{η} - η \|_{*} \geq \inf_{\hat{η}} \sup_{η \in ℳ_{K}} E_{η} \| \hat{η} - η \|_{*} \geq \inf_{\hat{η}} \sup_{η \in ℱ_{I, J}} E_{η} \| \hat{η} - η \|_{*} .

In what follows, we will put a lower bound on the last quantity. We first fix a sensible distance metric on $ℳ$ . An intuitive way to measure distance between two population size functions is their $L_{1}$ distance, $\| η_{a} - η_{b} \|_{1} = \int_{0}^{\infty} | η_{a} (t) - η_{b} (t) | d t$ , but this is unreasonably stringent in that $\| η_{a} - η_{b} \|_{1} = \infty$ if $η_{a}$ and $η_{b}$ do not agree infinitely far back into the past. Instead we will focus on the following truncated $L_{1}$ distance: $\| η_{a} - η_{b} \|_{1, T} \overset{def}{=} \int_{0}^{T} | η_{a} (t) - η_{b} (t) | d t$ , which measures the discrepancy between $η_{a}$ and $η_{b}$ back to some fixed time T in the past.
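For piecewise constant functions the truncated $L_{1}$ distance can be computed exactly by merging the two sets of change points. The sketch below assumes each function is given as a list of change points plus a list of epoch sizes; this representation and the function name are illustrative choices.

```python
def truncated_l1(cp_a, sizes_a, cp_b, sizes_b, T):
    """Integral of |eta_a - eta_b| over [0, T] for piecewise constant inputs."""
    def value(change_points, sizes, t):
        # The epoch containing t is indexed by the number of change points <= t.
        return sizes[sum(1 for c in change_points if c <= t)]

    # Both functions are constant between consecutive merged breakpoints.
    inner = [c for c in list(cp_a) + list(cp_b) if 0 < c < T]
    grid = sorted({0.0, float(T), *inner})
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        mid = 0.5 * (left + right)
        diff = abs(value(cp_a, sizes_a, mid) - value(cp_b, sizes_b, mid))
        total += diff * (right - left)
    return total
```

Two histories that differ only before time T contribute nothing to this distance, which is the sense in which the truncation is more forgiving than the untruncated norm.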

Henceforth, let $\hat{η}$ be any estimator of the population size function that operates on a sample of s independent segregating sites obtained from a sample of n randomly sampled individuals. In Proofs, we prove the following main results of our paper:

Theorem 4.

Consider the subfamily $ℱ_{I, J}$ of models described above, and suppose $J > 8$ and $T \geq t_{I + J - 1} + τ_{A}$ . Then,

\inf_{\hat{η}} \sup_{η \in ℱ_{I, J}} E_{η} \| \hat{η} - η \|_{1, T} \geq C τ_{A} \frac{{(J - 8)}^{2}}{J} \frac{ε}{s} e^{τ_{B} / ε},

[13]

where C is a positive constant.

The above theorem applies to all models in $ℱ_{I, J}$ . We now consider the subset $ℱ_{I, J}^{M} = \{ η \in ℱ_{I, J} : {‖ η ‖}_{\infty} < M \}$ , which is the set of all models in $ℱ_{I, J}$ that are bounded by some constant M. For this family of bounded population size functions, a sharper asymptotic lower bound can be obtained as follows.

Theorem 5.

Suppose $J > 8$ and $T \geq t_{I + J - 1} + τ_{A}$ . Then,

\inf_{\hat{η}} \sup_{η \in ℱ_{I, J}^{M}} E_{η} \| \hat{η} - η \|_{1, T} \geq C^{'} \frac{{(J - 8)}^{2}}{J} \frac{τ_{B} τ_{A}}{\log s},

[14]

where $C^{'}$ is a positive constant.

By specializing $ℱ_{I, J}^{M}$ , a simplified version of Theorem 5 can be obtained:

Corollary 6.

Suppose $T \geq t_{I + J - 1} + τ_{A}$ and let $ℱ_{I, ⋆}^{M} = \cup_{J \geq 1} ℱ_{I, J}^{M}$ . Then,

\inf_{\hat{η}} \sup_{η \in ℱ_{I, ⋆}^{M}} E_{η} \| \hat{η} - η \|_{1, T} \geq C^{″} (T - t_{I}) \frac{τ_{B}}{\log s},

[15]

where $C^{″}$ is a positive constant.

Note that the above lower bounds do not depend on the dimension of the SFS (which is equal to $n - 1$ ). Hence, for a fixed number s of segregating sites considered, using more individuals does not diminish the error bounds.

Bottleneck Followed by Exponential Growth.

In the results presented above, we dropped smaller terms to obtain the dominant contribution to our lower bound. Here, we provide a more detailed analysis to study how the model in the recent past (i.e., the period $0 \leq t \leq t_{I - 1}$ ) affects the lower bound. A slight modification of the above results permits us to analyze the following model class, which is of interest in, for example, human genetics (2, 3, 7): Let $G_{J}$ be the family of models illustrated in Fig. 2 with exponential growth in the recent past. Specifically, $η (t) = η_{0} e^{- β (η_{0}) t}$ for the period $0 \leq t \leq t_{1}$ . The rate of growth $β (η_{0}) = \log (η_{0} / (γ ε)) / t_{1}$ is defined so that $η (t_{1}) = γ ε$ for all $η \in G_{J}$ , where $γ \geq 1$ . The part for $t > t_{1}$ is the same as that for $t > t_{I - 1}$ in $ℱ_{I, J}$ (Fig. 1). We obtain the following result for the subfamily $G_{J}$ :

Fig. 2. — A family $G_{J}$ of population size models with exponential growth in the recent past. This family consists of size histories that are piecewise constant before the bottleneck, and then jump to some level $γ ε$ and undergo (identical) exponential growth from time $t_{1}$ to present.

Theorem 7.

Consider the subfamily $G_{J}$ of models described above, and suppose $J > 8$ and $T \geq t_{J} + τ_{A}$ . Then,

\inf_{\hat{η}} \sup_{η \in G_{J}} E_{η} \| \hat{η} - η \|_{1, T} \geq C τ_{A} \frac{{(J - 8)}^{2}}{J} \frac{ε}{s} \exp [\frac{τ_{B}}{ε} + t_{1} \frac{\frac{1}{γ ε} - \frac{1}{η_{0}}}{\log (η_{0}) - \log (γ ε)}] .

[16]

Theorem 7 is a measure of how (a lower bound on) estimation error depends on growth following a bottleneck. The two extremes $η_{0} \to \infty$ and $η_{0} \to γ ε$ have intuitive interpretations. For large $η_{0}$ , the bound in Eq. 16 tends to the corresponding bound given by Theorem 4, as expected since coalescences become increasingly less likely in the first time period. Small $η_{0}$ has the effect of “prolonging” the bottleneck, thus increasing the minimax lower bound. In particular, if $γ = 1$ then $t_{1} [(1 / γ ε) - (1 / η_{0})] / [\log (η_{0}) - \log (γ ε)] \to t_{1} / ε$ as $η_{0} \to γ ε$ , so that the effect of low population growth on the minimax lower bound is to simply prolong the bottleneck by an additional $t_{1}$ units of time.
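The two limits can be checked numerically by evaluating the growth-dependent term in the exponent of Eq. 16. The function below is a small sketch with illustrative names; it evaluates only that extra term, not the full bound.

```python
import math

def growth_exponent(eta0, gamma, eps, t1):
    """Extra exponent in Eq. 16 contributed by the growth period [0, t1].

    Equals t1 * (1/(gamma*eps) - 1/eta0) / (log(eta0) - log(gamma*eps)).
    """
    floor = gamma * eps  # population size at time t1
    if eta0 <= floor:
        raise ValueError("eta0 must exceed gamma * eps")
    return t1 * (1.0 / floor - 1.0 / eta0) / (math.log(eta0) - math.log(floor))
```

With `gamma = 1`, `eps = 1`, and `t1 = 2`, the term approaches `t1 / eps = 2` as `eta0` decreases toward 1 and decays toward 0 as `eta0` grows, although the decay is only logarithmic in `eta0`.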

Discussion

In this paper, we have theoretically characterized fundamental limits on the accuracy of demographic inference from data. We have shown that the minimax error rate for estimating the piecewise-constant demography of a single population is at least $O (1 / \log s)$ , where s is the number of independent segregating sites analyzed. In contrast, the minimax error for many classical estimation problems in statistics (for example, nonparametric regression or density estimation) decays inverse polynomially in the sample size (21). Compared with these problems, exponentially more samples would be required to estimate a population size history function to within a similar magnitude of error. The paper that most closely relates to the present work is by Kim et al. (22), who obtain lower bounds on the amount of exact coalescence time data necessary to distinguish between size histories in a hypothesis testing framework. Since coalescence times are never observed and must be estimated from data, these bounds place a limit on the accuracy with which a population size function can be inferred. The authors also describe an estimator that uses coalescence times (again observed without noise) to accurately recover the underlying population size function with high probability, at a rate that roughly matches the lower bound.
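To give a feel for the gap between the two regimes, the sketch below compares the number of sites at which two stylized error curves reach a target error: $1 / \log s$ versus $1 / \sqrt{s}$. Setting all constants to 1 and using $1 / \sqrt{s}$ as a stand-in for an inverse polynomial rate are illustrative assumptions, not quantities derived in this paper.

```python
import math

def sites_needed(target_error, rate):
    """Sites s at which a stylized error curve reaches target_error.

    rate="log" uses error = 1 / log(s); rate="sqrt" uses error = 1 / sqrt(s).
    Constants are set to 1 purely for illustration.
    """
    if target_error <= 0:
        raise ValueError("target_error must be positive")
    if rate == "log":
        return math.exp(1.0 / target_error)
    if rate == "sqrt":
        return 1.0 / target_error ** 2
    raise ValueError("rate must be 'log' or 'sqrt'")
```

Under the logarithmic curve, halving the target error squares the required number of sites, whereas under the square-root curve it multiplies that number by four.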

Another line of work centers around the identifiability of the parameter $η (t)$ using the SFS. Roughly speaking, a family of statistical models $\{ P_{θ} \}_{θ \in Θ}$ defined over a parameter space $Θ$ is identifiable if, for any $θ_{1}, θ_{2} \in Θ$ with $θ_{1} \neq θ_{2}$ , the sampling distributions induced by $P_{θ_{1}}$ and $P_{θ_{2}}$ are different. In our context, this simply says that, for all n, $ξ_{n}^{(η_{1})} \neq ξ_{n}^{(η_{2})}$ unless $η_{1} = η_{2}$ almost everywhere. Standard desiderata for statistical estimators (e.g., consistency or unbiasedness) are impossible without identifiability, so it is the weakest possible regularity condition one can impose on a useful family of models.

Perhaps surprisingly, it turns out that, in general, a population size function is not identifiable from the SFS (23). Indeed, for any given $η (t)$ , it has been shown that an infinite number of smooth functions $F (t)$ exist such that $ξ_{n}^{(η)} = ξ_{n}^{(η + F)}$ . Moreover, explicit examples can be constructed that demonstrate this phenomenon (23). On the other hand, these counterexamples consist of functions that exhibit an unbounded frequency of oscillatory behavior near the present time, which is perhaps unrealistic when modeling naturally occurring populations. More recently, it has been shown (19) that identifiability holds for many classes of population size functions used by practitioners (including piecewise constant, piecewise exponential, and piecewise generalized exponential). Furthermore, the number n of sampled individuals sufficient for identifiability can be explicitly given and is a function of the complexity of the underlying class of models being studied (19).

Identifiability asserts that, given an infinite amount of data (specifically, taking the number of segregating sites $s \to \infty$ ), the model parameter $η (t)$ can be uniquely recovered. In practice, s is finite, and only a perturbed version of the expected frequency spectrum, say ${\hat{ξ}}_{n}^{(η)}$ , is observed. From a practical standpoint, it is important to understand how these perturbations ultimately affect the parameter estimate $\hat{η} (t)$ . It is this question that forms the starting point for the present work.

A single population evolving under a piecewise-constant demography is a special case of many richer classes of demographic models. For example, it is a (limiting) member of the family of exponential growth models, seen by taking each exponential growth parameter to zero. In the multispecies coalescent setting (10, 24), multiple population size histories must be estimated, and the error of that estimate must necessarily be lower bounded by that of estimating a single such history. Thus, our result can be expected to apply to a broader class of models than the one we have studied here.

As detailed in Proofs, the result in Theorem 5 follows from setting $ε = τ_{B} / \log s$ and $δ \propto \frac{ε}{s} \exp (τ_{B} / ε)$ in the subfamily $ℱ_{I, J}^{M}$ . The size $τ_{B} / \log s$ is in coalescent units. In terms of the number of individuals, it is proportional to $g_{B} / \log s$ , where $g_{B}$ is the number of generations corresponding to duration $τ_{B}$ in the coalescent limit. Intuitively, as the severity of the bottleneck increases, the population is increasingly likely to find its most recent common ancestor (MRCA) during that time; farther back in time than the MRCA, no information is conveyed concerning the demographic events experienced by the population.
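The effect of this choice of ε on the rate can be seen by substituting it into the right-hand side of Eq. 13. The lines below sketch only this algebraic step; the remaining conditions (such as the constraint on δ and membership in the bounded family) are handled in the full proof.

```latex
ε = \frac{τ_{B}}{\log s}
\;\Longrightarrow\;
e^{τ_{B}/ε} = e^{\log s} = s
\;\Longrightarrow\;
\frac{ε}{s}\, e^{τ_{B}/ε} = ε = \frac{τ_{B}}{\log s},
\qquad
C τ_{A} \frac{(J-8)^{2}}{J} \cdot \frac{ε}{s}\, e^{τ_{B}/ε}
= C \frac{(J-8)^{2}}{J} \cdot \frac{τ_{B} τ_{A}}{\log s} .
```

With this choice, δ is proportional to ε itself, so the jump size shrinks along with the bottleneck size as s grows.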

One might object to considering models with a bottleneck size that scales inversely with the logarithm of the number s of segregating sites in the data, and it is indeed possible that a better convergence rate may be achievable for populations that are known not to contain a bottleneck. On the other hand, we note that $1 / \log s$ decreases sufficiently slowly with s that our result can be expected to apply to many real-world examples. For example, for $s \approx 10^{8}$ , which is a conservative upper bound for most organisms, $g_{B} / \log s \approx 0.054 g_{B}$ . This implies that for populations that have experienced roughly an order-of-magnitude increase in effective population size during their history, accurate estimation of demographic events that occurred before this expansion is difficult using SFS-based methods. Additionally, an interesting aspect of our work is that our minimax lower bounds do not depend on the number n of sampled individuals; increasing n is not enough to overcome the information barrier imposed by the presence of a bottleneck. This is intuitively plausible since, as n increases, the $(n + 1)$ th sampled lineage becomes more likely to coalesce early on.
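The figure of 0.054 corresponds to the natural logarithm, as the following one-line check illustrates. The function name is an illustrative choice.

```python
import math

def bottleneck_fraction(s):
    """Factor 1 / log(s) multiplying g_B, with log the natural logarithm."""
    if s <= 1:
        raise ValueError("s must exceed 1")
    return 1.0 / math.log(s)
```

Evaluating the function at `s = 1e8` gives roughly 0.054, and at `s = 1e4` roughly 0.109, so a ten-thousand-fold increase in the number of sites only halves the factor.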

An interesting question that we have not attempted to analyze is whether the $O (1 / \log s)$ rate is optimal, i.e., whether there exists some estimator $\hat{η} (t)$ that achieves the minimax lower bound established here. In practice, from Eqs. 2, 8, and 9, it can be seen that naively maximizing the likelihood of the observed SFS with respect to $η (t)$ requires solving a nonconvex optimization problem, so that convergence to the global maximum is not even guaranteed. Computational issues aside, finding such an estimator remains an open theoretical challenge.

In closing, we stress that our result is specific to SFS-based estimators, which analyze only independent sites. The main allure of these estimators is their mathematical tractability, rather than their realism. In fact, a rich source of additional information exists in the correlation structure found among linked sites in the genome. Methods that seek to exploit this structure by modeling the action of recombination pose greater mathematical and computational difficulties, but there has been recent progress in this area (20, 25–29). Our result serves to underscore the importance of pursuing more realistic models of genomic evolution, challenging though they may be.

Proofs

Proof of Theorem 1. To simplify the notation, we write $c = c^{(η)}$ and $c^{'} = c^{(η^{'})}$ . Then, using Eq. 2, we can write

D (η ‖ η^{'}) = \sum_{b = 1}^{n - 1} ξ_{n, b}^{(η)} \log \frac{ξ_{n, b}^{(η)}}{ξ_{n, b}^{(η^{'})}} = \sum_{b = 1}^{n - 1} ξ_{n, b}^{(η)} [\log (\frac{〈 c, W_{n, b} 〉}{〈 c^{'}, W_{n, b} 〉}) + \log (\frac{〈 c^{'}, V_{n} 〉}{〈 c, V_{n} 〉})] .

The assumption $\min_{t > t_{c}} η^{'} (t) \geq \max_{t > t_{c}} η (t)$ implies that, for all times $t, t^{'} > t_{c}$ , the instantaneous rate of coalescence at time t in model η is greater than or equal to the instantaneous rate of coalescence at time $t^{'}$ in model $η^{'}$ . Hence, this assumption together with $η (t) = η^{'} (t)$ for all $0 \leq t \leq t_{c}$ implies $〈 c - c^{'}, W_{n, b} 〉 \leq 0$ for all $1 \leq b \leq n - 1$ ; equivalently, $\log (〈 c, W_{n, b} 〉 / 〈 c^{'}, W_{n, b} 〉) \leq 0$ . Additionally, $(〈 c^{'} - c, V_{n} 〉 / 〈 c, V_{n} 〉) > - 1$ and $\log (1 + x) \leq x$ for all $x > - 1$ . Combining these facts, we obtain

D (η ‖ η^{'}) \leq \sum_{b = 1}^{n - 1} ξ_{n, b}^{(η)} \log (\frac{〈 c^{'}, V_{n} 〉}{〈 c, V_{n} 〉}) \leq \sum_{b = 1}^{n - 1} ξ_{n, b}^{(η)} \frac{〈 c^{'} - c, V_{n} 〉}{〈 c, V_{n} 〉} = \frac{〈 c^{'} - c, V_{n} 〉}{〈 c, V_{n} 〉},

where we have used $\sum_{b = 1}^{n - 1} ξ_{n, b}^{(η)} = 1$ in the final equality.

Proof of Lemma 2. We distinguish two particular models, $η_{ℓ}, η_{u} \in ℱ_{I, J}$ , which are the lower and the upper envelopes of $ℱ_{I, J}$ . The function $η_{ℓ}$ stays constant at h for all $t \geq t_{I}$ , while $η_{u}$ jumps upward by δ at every change point $t_{I}, \dots, t_{I + J - 1}$ . Hence, $η_{ℓ} \leq η \leq η_{u}$ pointwise for all $η \in ℱ_{I, J}$ . The two enveloping functions will form the basis of subsequent analysis.

Fix $η, η^{'} \in ℱ_{I, J}$ and note that, by the definition of $ℱ_{I, J}$ , one of these functions must pointwise dominate the other. Therefore, assume without loss of generality that $η (t) \leq η^{'} (t)$ for all t. Then, for all t,

η_{ℓ} (t) \leq η (t) \leq η^{'} (t) \leq η_{u} (t),

which implies

c_{m}^{(η_{ℓ})} \leq c_{m}^{(η)} \leq c_{m}^{(η^{'})} \leq c_{m}^{(η_{u})},

for all $m = 2, \dots, n$ . Using these inequalities, we conclude

c_{m}^{(η^{'})} - c_{m}^{(η)} \leq c_{m}^{(η_{u})} - c_{m}^{(η_{ℓ})},

so it suffices to demonstrate Eq. 11 for $c_{m}^{(η_{u})} - c_{m}^{(η_{ℓ})}$ . Now, by Eq. 9 and the definition of $η_{ℓ}$ ,

a_{m} c_{m}^{(η_{ℓ})} = \sum_{i = 1}^{I} N_{i} [e^{- a_{m} S_{i - 1}^{(η_{ℓ})}} - e^{- a_{m} S_{i}^{(η_{ℓ})}}] + \sum_{j = 1}^{J} h [e^{- a_{m} S_{I + j - 1}^{(η_{ℓ})}} - e^{- a_{m} S_{I + j}^{(η_{ℓ})}}] = \sum_{i = 1}^{I} N_{i} [e^{- a_{m} S_{i - 1}^{(η_{ℓ})}} - e^{- a_{m} S_{i}^{(η_{ℓ})}}] + h e^{- a_{m} S_{I}^{(η_{ℓ})}},

where we have used Eq. 10. Similarly,

a_{m} c_{m}^{(η_{u})} = \sum_{i = 1}^{I} N_{i} [e^{- a_{m} S_{i - 1}^{(η_{u})}} - e^{- a_{m} S_{i}^{(η_{u})}}] + \sum_{j = 1}^{J} (h + j δ) [e^{- a_{m} S_{I + j - 1}^{(η_{u})}} - e^{- a_{m} S_{I + j}^{(η_{u})}}] = \sum_{i = 1}^{I} N_{i} [e^{- a_{m} S_{i - 1}^{(η_{u})}} - e^{- a_{m} S_{i}^{(η_{u})}}] + h e^{- a_{m} S_{I}^{(η_{u})}} + \sum_{j = 1}^{J} j δ [e^{- a_{m} S_{I + j - 1}^{(η_{u})}} - e^{- a_{m} S_{I + j}^{(η_{u})}}] .

Now, using the fact that $η_{ℓ}$ and $η_{u}$ agree on the first I epochs, we obtain

a_{m} [c_{m}^{(η_{u})} - c_{m}^{(η_{ℓ})}] = \sum_{j = 1}^{J} j δ [e^{- a_{m} S_{I + j - 1}^{(η_{u})}} - e^{- a_{m} S_{I + j}^{(η_{u})}}] = δ \sum_{j = 1}^{J} e^{- a_{m} S_{I + j - 1}^{(η_{u})}} \leq J δ e^{- a_{m} τ_{B} / ε},

[17]

where the second equality follows from telescoping and the fact that $S_{I + J}^{(η_{u})} = \infty$ , while the final inequality follows from the fact that $\frac{τ_{B}}{ε} \leq S_{I + j - 1}^{(η_{u})}$ for all $j = 1, \dots, J$ .
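The telescoping step in Eq. 17 can be written out explicitly. In the sketch below, $x_{j}$ is shorthand introduced here for $e^{- a_{m} S_{I + j}^{(η_{u})}}$ with $j = 0, \dots, J$, so that $x_{J} = 0$ by Eq. 10.

```latex
\sum_{j=1}^{J} j\,(x_{j-1} - x_{j})
= \sum_{j=0}^{J-1} (j+1)\, x_{j} - \sum_{j=1}^{J} j\, x_{j}
= \sum_{j=0}^{J-1} x_{j} - J x_{J}
= \sum_{j=1}^{J} x_{j-1} .
```

Each of the J remaining terms is at most $e^{- a_{m} τ_{B} / ε}$, because $S_{I}^{(η_{u})}$ already includes the bottleneck contribution $τ_{B} / ε$ and the sequence $S_{k}^{(η_{u})}$ is increasing in k.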

Proof of Theorem 3. For ease of notation, define $c = c^{(η)}$ and $c^{'} = c^{(η^{'})}$ . By Lemma 2,

〈 c^{'} - c, V_{n} 〉 = \sum_{m = 2}^{n} (c_{m}' - c_{m}) V_{n, m} \leq J δ \sum_{m = 2}^{n} \frac{V_{n, m}}{a_{m}} e^{- a_{m} τ_{B} / ε} \leq J δ e^{- τ_{B} / ε} \sum_{m = 2}^{n} \frac{V_{n, m}}{a_{m}},

where the second inequality follows from $e^{- a_{m} τ_{B} / ε} \leq e^{- τ_{B} / ε}$ for all $m = 2, \dots, n$ . Now, noting that $\sum_{m = 2}^{n} (V_{n, m} / a_{m})$ corresponds to the total tree length for the constant population size function $η \equiv 1$ and using Eq. 5, we obtain

〈 c^{'} - c, V_{n} 〉 \leq J δ e^{- τ_{B} / ε} 2 H_{n - 1} .

[18]

To finish the proof, recall that $〈 c, V_{n} 〉$ is the total expected branch length of the coalescent tree under model η. Since $\min_{t} η (t) = ε,$ we have that $〈 c, V_{n} 〉$ is at least as large as the corresponding quantity under a model with constant population size ε. By Eq. 5, the total expected tree length under the latter model equals $2 ε H_{n - 1}$ . Thus, $〈 c, V_{n} 〉 \geq 2 ε H_{n - 1}$ , and combining this result with Eq. 18 gives

\frac{〈 c^{'} - c, V_{n} 〉}{〈 c, V_{n} 〉} \leq J \frac{δ}{ε} e^{- τ_{B} / ε} .

Finally, Eq. 12 follows from this inequality and Theorem 1.

Proof of Theorem 4. Our proof uses a generalized form of Fano’s inequality (30). Adapted to our setting and notation, the method reads as follows.

Theorem 8 (Fano’s method). Consider a space $ℳ$ of population size models. Let $r \geq 2$ be an integer, and let $S_{n}^{r} = \{η_{1}, η_{2}, \dots, η_{r}\} \subset ℳ$ contain r population size functions such that for all $a \neq b$, $\|η_{a} - η_{b}\|_{*} \geq α_{r}$ and $D (ξ_{n}^{(η_{a})} \| ξ_{n}^{(η_{b})}) \leq β_{r}$. Let ${\hat{η}}^{(n, s)} = {\hat{η}}^{(n, s)} (X_{1}, \dots, X_{s})$ be an estimator of η based on the SFS data $X_{1}, \dots, X_{s}$ sampled independently from $ξ_{n}^{(η)}$; i.e., $X_{1}, \dots, X_{s}$ are SFS data for n individuals at s independent segregating sites. Then,

\inf_{\hat{η}} \sup_{η \in ℳ} E_{η} \|{\hat{η}}^{(n, s)} - η\|_{*} \geq \frac{α_{r}}{2} \left(1 - \frac{s \cdot β_{r} + \log 2}{\log r}\right) .

[19]

This theorem places a lower bound on the minimax rate of convergence of a population size history estimator based on the SFS.
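To see how the right-hand side of Eq. 19 behaves, the following minimal Python sketch evaluates it directly. The function name `fano_lower_bound` and the numerical inputs are assumptions introduced for illustration.

```python
import math

def fano_lower_bound(alpha_r, beta_r, r, s):
    """Evaluate (alpha_r / 2) * (1 - (s * beta_r + log 2) / log r) from Eq. 19."""
    if r < 2:
        raise ValueError("r must be an integer >= 2")
    return 0.5 * alpha_r * (1.0 - (s * beta_r + math.log(2)) / math.log(r))

# Hypothetical separation alpha_r, KL bound beta_r, and packing size r.
for s in (10**3, 10**5, 10**7):
    print(s, fano_lower_bound(alpha_r=1.0, beta_r=1e-6, r=1024, s=s))
```

The bound is informative only while $s β_{r} + \log 2 < \log r$. For a fixed packing it becomes negative, and therefore vacuous, once s is large, which is why the proofs below let the separation and the divergence bound depend on s.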

For $η \in ℱ_{I, J}$, let $w_{j} \in \{0,1\}$ denote the variable indicating whether η jumps by δ at change point $t_{I + j}$. Let $Y = \{w = (w_{0}, \dots, w_{J - 1}) \mid w_{i} \in \{0,1\}\}$, where $J \geq 8$. By the Varshamov−Gilbert lemma (see ref. 31, Lemma 4.7), there exists a subset $X = \{w^{0}, \dots, w^{M}\} \subset Y$ such that (i) $w^{0} = (0, \dots, 0)$, (ii) $M \geq 2^{J / 8}$, and (iii) $H (w^{i}, w^{j}) \geq J / 8$ for all $i \neq j$, where $H (\cdot, \cdot)$ denotes the Hamming distance.
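The Varshamov−Gilbert lemma is an existence statement, but for small J a set with properties (i) to (iii) can be found by brute force. The Python sketch below uses a simple greedy search over binary words encoded as integers. This is an illustrative construction under our own assumptions, not the argument used to prove the lemma, and the function names are hypothetical.

```python
import math
from itertools import combinations

def hamming(u, v):
    """Hamming distance between two binary words encoded as integers."""
    return bin(u ^ v).count("1")

def greedy_packing(J):
    """Greedily collect words of length J with pairwise distance >= J / 8.

    The search stops once ceil(2^(J/8)) + 1 words (including the zero word)
    have been found; it returns the words and the integer distance threshold.
    """
    d_min = math.ceil(J / 8)              # distances are integers
    target = math.ceil(2 ** (J / 8)) + 1
    code = [0]                            # property (i): the all-zero word
    for w in range(1, 2 ** J):
        if len(code) >= target:
            break
        if all(hamming(w, c) >= d_min for c in code):
            code.append(w)
    return code, d_min

code, d_min = greedy_packing(16)
print([format(w, "016b") for w in code], d_min)
```

For $J = 16$ the search returns five words with pairwise Hamming distance at least 2, matching $M \geq 2^{J / 8} = 4$.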

Let $ℱ_{I, J}^{X}$ denote the subset of $2^{J / 8} + 1$ functions in $ℱ_{I, J}$ with the indicator variable for δ jumps at $t_{I}, \dots, t_{I + J - 1}$ given by $w \in X$ . Then, for any two $η_{a} \neq η_{b} \in ℱ_{I, J}^{X}$ , we have

\|η_{a} - η_{b}\|_{1, T} \geq \frac{J}{8} \cdot τ_{A} \cdot δ .

[20]

Using Theorem 8 via Eq. 20 and Theorem 3, we obtain

\inf_{\hat{η}} \sup_{η \in ℱ_{I, J}} E_{η} \|{\hat{η}}^{(n, s)} - η\|_{1, T} \geq \frac{J \cdot τ_{A} \cdot δ}{16} \left[1 - \frac{s J \frac{δ}{ε} e^{- τ_{B} / ε} + \log 2}{\log (2^{J / 8} + 1)}\right] \geq \frac{J \cdot τ_{A} \cdot δ}{16} \left[1 - \frac{s J \frac{δ}{ε} e^{- τ_{B} / ε} + \log 2}{\frac{J}{8} \log 2}\right] .

[21]

We now optimize the bound with respect to δ. A straightforward calculation shows that the maximum is attained at

δ^{*} = \frac{(J - 8) \log 2}{16 J} (\frac{ε}{s}) e^{τ_{B} / ε},

[22]

and setting $δ = δ^{*}$ in Eq. 21 yields the result.
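The optimization over δ can be checked numerically, since the last expression in Eq. 21 is a concave quadratic in δ. The Python sketch below evaluates that expression at $δ^{*}$ from Eq. 22 and at nearby values. The function names and the parameter values are hypothetical, and the closed form in the final comment is our own direct calculation from Eq. 21 and Eq. 22.

```python
import math

def eq21_bound(delta, J, tau_A, tau_B, eps, s):
    """Last expression in Eq. 21 as a function of delta."""
    kl_term = s * J * (delta / eps) * math.exp(-tau_B / eps)
    return (J * tau_A * delta / 16) * (
        1 - (kl_term + math.log(2)) / ((J / 8) * math.log(2))
    )

def delta_star(J, tau_B, eps, s):
    """Maximizer from Eq. 22."""
    return (J - 8) * math.log(2) / (16 * J) * (eps / s) * math.exp(tau_B / eps)

# Illustrative parameter values only.
params = dict(J=32, tau_A=0.1, tau_B=0.5, eps=0.1, s=10_000)
d = delta_star(params["J"], params["tau_B"], params["eps"], params["s"])
best = eq21_bound(d, **params)
print(d, best, eq21_bound(0.9 * d, **params), eq21_bound(1.1 * d, **params))
# Direct substitution gives
# best = (J - 8)^2 * log(2) / (512 * J) * tau_A * (eps / s) * exp(tau_B / eps)
```

Perturbing δ by 10% in either direction lowers the bound by 1%, as expected for a quadratic evaluated around its vertex.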

Proof of Theorem 5. The result is obtained by scaling ε with the number of segregating sites s. Denote this scaling by $ε (s)$; we will determine the $ε (s)$ that produces the largest possible lower bound. Starting from Eq. 22 in the proof of Theorem 4, note that $δ^{*}$ scales as $(ε / s) e^{τ_{B} / ε} =: f (ε)$. To satisfy the constraint that $\|η\|_{\infty} < M$ for all $η \in ℱ_{I, J}^{M}$ and s, the condition

\limsup_{s \to \infty} \max \left\{\frac{ε (s)}{s} e^{τ_{B} / ε (s)}, ε (s)\right\} < \infty

[23]

must therefore hold. This implies that $ε (s) s^{p} \to \infty$ as $s \to \infty$ for all $p > 0$. Suppose that $q \overset{\text{def}}{=} \liminf_{s \to \infty} [(ε (s) \log s) / τ_{B}] < 1$; note that $ε (s) > 0$ implies $q > 0$. Then there exists a diverging sequence $s_{1}, s_{2}, \dots \to \infty$ with $\log (s_{i}) < [(1 + q) / 2] [τ_{B} / (ε (s_{i}))]$ for all i, whence

\limsup_{s \to \infty} \frac{ε (s)}{s} e^{τ_{B} / ε (s)} \geq \limsup_{i \to \infty} \frac{ε (s_{i})}{s_{i}} e^{\frac{2}{1 + q} \log (s_{i})} = \limsup_{i \to \infty} ε (s_{i}) s_{i}^{\frac{1 - q}{1 + q}} = \infty .

From this, it follows that $ε (s) \geq τ_{B} / \log s$ for sufficiently large s. Now, on the interval $(0, \infty)$ , the function $f (ε)$ is convex with a unique minimum at $ε = τ_{B}$ . Let $ε^{'}$ be a point where $f (ε^{'}) > f (τ_{B} / \log s) = τ_{B} / \log s$ . Then $ε^{'} \notin [τ_{B} / \log s, τ_{B}]$ . If $ε^{'} > τ_{B}$ , then $f (ε^{'}) < (ε^{'} / s) e^{1}$ . Since $\frac{τ_{B}}{\log s} < f (ε^{'})$ , we then conclude $ε^{'} > s τ_{B} / (e^{1} \log s)$ , which is not bounded as $s \to \infty$ .

In summary, we see that the largest possible lower bound that obeys Eq. 23 must have $f (ε)$ asymptotically $\leq τ_{B} / \log s$, and that this bound is achieved by setting $ε (s) = τ_{B} / \log s$. Substituting this into Eq. 19 yields the claim.
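The role of the scaling $ε (s) = τ_{B} / \log s$ can be illustrated numerically. The Python sketch below evaluates $f (ε) = (ε / s) e^{τ_{B} / ε}$ at this choice and at a bottleneck size that shrinks twice as fast. The value of $τ_{B}$ and the grid of s are arbitrary assumptions for the example.

```python
import math

def f(eps, s, tau_B):
    """f(eps) = (eps / s) * exp(tau_B / eps), the scaling of delta* in Eq. 22."""
    return (eps / s) * math.exp(tau_B / eps)

tau_B = 0.5  # hypothetical bottleneck duration
for s in (10**3, 10**6, 10**9):
    eps_s = tau_B / math.log(s)
    # At eps = tau_B / log s, exp(tau_B / eps) = s, so f equals tau_B / log s.
    print(s, f(eps_s, s, tau_B), tau_B / math.log(s), f(eps_s / 2, s, tau_B))
```

At $ε (s) = τ_{B} / \log s$ the value of f decays only logarithmically in s, whereas halving ε(s) makes f grow roughly linearly in s, which is incompatible with Eq. 23.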

Proof of Corollary 6. For $c \in (0, 1)$ , choose J large enough so that $(J - 8) / J > c$ , and fix $τ_{A}$ so that $T = t_{I} + J τ_{A}$ . Then $(J - 8) τ_{A} \geq c J τ_{A} = c (T - t_{I})$ . Substituting the above inequalities into Eq. 14 and letting $C^{″} = C^{'} c^{2}$ yields the desired result.

Proof of Theorem 7. The theorem is obtained by suitably modifying the preceding results to account for the effect of exponential growth in the first period. Let $η_{u}, η_{ℓ}$ be the analogously defined upper and lower envelope functions for $G_{J}$ . Then

\int_{0}^{t_{j + 1}} \frac{d s}{η_{u} (s)} = \frac{e^{β (η_{0}) t_{1}} - 1}{η_{0} β (η_{0})} + \frac{τ_{B}}{ε} + \sum_{i = 3}^{j + 1} \frac{t_{i} - t_{i - 1}}{N_{i}} = t_{1} \frac{\frac{1}{γ ε} - \frac{1}{η_{0}}}{\log (η_{0}) - \log (γ ε)} + \frac{τ_{B}}{ε} + \sum_{i = 3}^{j + 1} \frac{t_{i} - t_{i - 1}}{N_{i}},

where we have used the definition of $β (η_{0})$ in the second equality. Since all size histories in $G_{J}$ are equal up to period $t_{2}$ , the steps of Lemma 2 all go through unchanged. Starting from Eq. 17, we obtain the modified bound

a_{m} [c_{m}^{(η_{u})} - c_{m}^{(η_{ℓ})}] \leq J δ \exp \left\{- a_{m} t_{1} \frac{\frac{1}{γ ε} - \frac{1}{η_{0}}}{\log (η_{0}) - \log (γ ε)}\right\} e^{- a_{m} τ_{B} / ε} .

[24]

Propagating the modified bound (Eq. 24) through Theorems 3 and 4 ultimately yields the claim.
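The first term of the integral above can be checked numerically. The Python sketch below assumes, consistently with the two displayed forms but as our own reading rather than a quotation of the definitions, that $η (t) = η_{0} e^{- β t}$ on $[0, t_{1})$ with $β (η_{0}) = [\log (η_{0}) - \log (γ ε)] / t_{1}$, so that $η (t_{1}) = γ ε$. Function names and parameter values are hypothetical.

```python
import math

def beta(eta0, gamma, eps, t1):
    """Assumed growth rate: eta(t) = eta0 * exp(-beta * t) reaches gamma * eps at t1."""
    return (math.log(eta0) - math.log(gamma * eps)) / t1

def closed_forms(eta0, gamma, eps, t1):
    """The two closed-form expressions for the integral of 1/eta over [0, t1]."""
    b = beta(eta0, gamma, eps, t1)
    form1 = (math.exp(b * t1) - 1) / (eta0 * b)
    form2 = t1 * (1 / (gamma * eps) - 1 / eta0) / (math.log(eta0) - math.log(gamma * eps))
    return form1, form2

def midpoint_integral(eta0, gamma, eps, t1, steps=20_000):
    """Midpoint-rule approximation of the same integral."""
    b = beta(eta0, gamma, eps, t1)
    h = t1 / steps
    return h * sum(math.exp(b * (k + 0.5) * h) / eta0 for k in range(steps))

args = dict(eta0=10.0, gamma=2.0, eps=0.1, t1=0.01)  # illustrative values
form1, form2 = closed_forms(**args)
numeric = midpoint_integral(**args)
print(form1, form2, numeric)
```

All three values agree to within the quadrature error, which supports the second equality in the display under the stated assumption on η.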

Acknowledgments

We thank Anand Bhaskar for helpful comments on a draft of this paper and for suggesting Corollary 6 to simplify the presentation of the main result. We also thank Jack Kamm and Jeff Spence for useful feedback. This research is supported in part by a Citadel Fellowship (to J.T.), National Institutes of Health Grant R01-GM109454 (to Y.S.S.), a Packard Fellowship for Science and Engineering (to Y.S.S.), and a Miller Research Professorship (to Y.S.S.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

References

1. Abecasis GR, et al.; 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
2. Nelson MR, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337(6090):100–104. doi: 10.1126/science.1217876.
3. Tennessen JA, et al.; Broad GO; Seattle GO; NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–69. doi: 10.1126/science.1219240.
4. Fu W, et al.; NHLBI Exome Sequencing Project. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493(7431):216–220. doi: 10.1038/nature11690.
5. Nielsen R. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics. 2000;154(2):931–942. doi: 10.1093/genetics/154.2.931.
6. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5(10):e1000695. doi: 10.1371/journal.pgen.1000695.
7. Coventry A, et al. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat Commun. 2010;1:131. doi: 10.1038/ncomms1130.
8. Gazave E, et al. Neutral genomic regions refine models of recent rapid human population growth. Proc Natl Acad Sci USA. 2014;111(2):757–762. doi: 10.1073/pnas.1310398110.
9. Gravel S, et al.; 1000 Genomes Project. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci USA. 2011;108(29):11983–11988. doi: 10.1073/pnas.1019276108.
10. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9(10):e1003905. doi: 10.1371/journal.pgen.1003905.
11. Bhaskar A, Wang YXR, Song YS. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 2015;25(2):268–279. doi: 10.1101/gr.178756.114.
12. Kingman JFC. The coalescent. Stochastic Process Appl. 1982;13(3):235–248.
13. Kingman JFC. On the genealogy of large populations. J Appl Probab. 1982;19A:27–43.
14. Kingman JFC. In: Exchangeability in Probability and Statistics. Koch G, Spizzichino F, editors. North-Holland; Amsterdam: 1982. pp. 97–112.
15. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893–903. doi: 10.1093/genetics/61.4.893.
16. Griffiths R, Tavaré S. The age of a mutation in a general coalescent tree. Commun Stat Stochastic Models. 1998;14(1-2):273–295.
17. Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63(1):33–40. doi: 10.1016/s0040-5809(02)00010-2.
18. Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165(1):427–436. doi: 10.1093/genetics/165.1.427.
19. Bhaskar A, Song YS. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat. 2014;42(6):2469–2493. doi: 10.1214/14-AOS1264.
20. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–496. doi: 10.1038/nature10231.
21. Tsybakov AB. Introduction to Nonparametric Estimation. Springer; New York: 2009.
22. Kim J, Mossel E, Rácz MZ, Ross N. Can one hear the shape of a population history? Theor Popul Biol. 2014;100:26–38. doi: 10.1016/j.tpb.2014.12.002.
23. Myers S, Fefferman C, Patterson N. Can one learn history from the allelic spectrum? Theor Popul Biol. 2008;73(3):342–348. doi: 10.1016/j.tpb.2008.01.001.
24. Chen H. The joint allele frequency spectrum of multiple populations: A coalescent theory approach. Theor Popul Biol. 2012;81(2):179–195. doi: 10.1016/j.tpb.2011.11.004.
25. Paul JS, Steinrücken M, Song YS. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187(4):1115–1128. doi: 10.1534/genetics.110.125534.
26. Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics. 2013;194(3):647–662. doi: 10.1534/genetics.112.149096.
27. Steinrücken M, Paul JS, Song YS. A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor Popul Biol. 2013;87:51–61. doi: 10.1016/j.tpb.2012.08.004.
28. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 2014;10(5):e1004342. doi: 10.1371/journal.pgen.1004342.
29. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–925. doi: 10.1038/ng.3015.
30. Yu B. Assouad, Fano, and Le Cam. In: Festschrift for Lucien Le Cam. Pollard D, Torgersen E, Yang GL, editors. Springer; New York: 1997. pp. 423–435.
31. Massart P. Concentration Inequalities and Model Selection. Springer; Berlin: 2007.

