Significance
Numerous empirical studies in population genetics have used a summary statistic called the sample frequency spectrum (SFS), which summarizes the information in a sample of DNA sequences. Despite the popularity of SFS-based inference methods, their accuracy is difficult to characterize theoretically, and it is currently unknown how estimation accuracy improves as more sites in the genome are used. Here, we establish information theoretic limits on the accuracy of all estimators that use the SFS to infer population size histories. We study the rate of convergence to the true answer as the amount of data increases, and obtain the surprising result that it is exponentially worse than known convergence rates for many classical estimation problems in statistics.
Keywords: minimax rate, population genetics, demographic inference
Abstract
The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic that is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, little is currently known about the information theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least O(1/log s), where s is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number s of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.
The past decade has seen a revolution in our ability to interrogate the genome at the molecular level. Fueled by technological advances in DNA sequencing, studies now routinely query thousands or tens of thousands of individuals [refs. 1–4 and UK10K Project (www.uk10k.org) and Exome Aggregation Consortium (exac.broadinstitute.org)] to better understand disease susceptibility, heritability, population history, and other phenomena. In most cases, the conclusions of these studies come in the form of statistical estimates obtained from models that relate the effect of interest to mutation patterns arising in sampled DNA sequences. As genetic sample sizes explode, it is natural to wonder how additional data improve the quality of these estimates. While this general question has received intense focus in theoretical statistics, certain aspects of the genetics setting (for example, non-Gaussianity and lack of independence among samples) complicate efforts to study such models using classical techniques. New methods are needed to theoretically characterize some common models in statistical genetics.
Here, we address this need for a specific estimation problem in population genetics known as demographic inference. As we explain in further detail below, the aim of this problem is to reconstruct the sequence of historical events—including population size changes, migration, and admixture—that gave rise to present-day populations, using DNA samples obtained from those populations. We focus on the simplest problem of estimating the size history of a single population backward in time.
A summary statistic known as the sample frequency spectrum (SFS; defined below) is often used in empirical studies (2, 5–11), but there have been fewer attempts to understand SFS-based estimation from a theoretical perspective. The main result of this paper is to show that, for a common class of estimators that analyze the SFS, there is a fundamental limit on their accuracy as a function of the sample size. More precisely, we show that, under a standard statistical error metric known as minimax error, the rate at which these estimators converge to the truth for certain populations is at best inversely logarithmic in the number of independent segregating sites analyzed, and does not depend at all on the number of individuals sampled. Compared with other types of statistical estimation problems (for example, linear regression), this is an extremely slow rate of convergence. Our proof is information theoretic in nature and applies to any estimator that operates solely on the SFS. This is the first result we are aware of that characterizes the convergence rate of demographic history estimates as a function of sample size.
The remainder of this paper is organized as follows. In Preliminaries, we formally define our notation and model. In Main Results, we state our main theoretical results, followed by a discussion of their practical implications in Discussion. To streamline our exposition, all mathematical proofs are deferred until Proofs.
Preliminaries
The stochastic process underlying the inference procedure we consider is Kingman’s coalescent (12–14), which evolves backward in time and describes the genealogy of a collection of chromosomes randomly sampled from a population. The population size is assumed to change deterministically over time and is described by a function η, with η(t) being the population size at time t in the past. The instantaneous rate of coalescence between any pair of lineages at time t is 1/η(t).
As in the standard infinite sites model of mutation (15), we assume that every dimorphic site (i.e., a site with exactly two observed allelic types) has experienced mutation exactly once in the evolutionary history of the sample. Further, for each such site, we assume that it is known which allele is the ancestral type versus the mutant type. In what follows, we use the terms “dimorphic” and “segregating” interchangeably.
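A population size function η induces a probability distribution on the number of derived alleles found at a particular segregating site. Specifically, for a sample of n randomly sampled individuals, let $\xi_{n,b}(\eta)$, for $1 \le b \le n-1$, denote the probability that a segregating site contains b mutant alleles in a sample of n individuals under model η. The vector $\xi_n(\eta) = (\xi_{n,1}(\eta), \ldots, \xi_{n,n-1}(\eta))$ is called the expected SFS. In the coalescent setting, a general expression for $\xi_{n,b}(\eta)$ is given by (16)
$$\xi_{n,b}(\eta) = \frac{\sum_{k=2}^{n-b+1} \frac{\binom{n-b-1}{k-2}}{\binom{n-1}{k-1}} \, k \, \mathbb{E}\big[T_{n,k}^{(\eta)}\big]}{\sum_{k=2}^{n} k \, \mathbb{E}\big[T_{n,k}^{(\eta)}\big]},$$
where $T_{n,k}^{(\eta)}$ denotes the amount of time (in coalescent units) during which the genealogy of the sample contained k lineages under model η. The expected waiting time to the first coalescence in a sample of m individuals is given by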
[1] |
where and is the cumulative rate of coalescence up to time t. It turns out (17) that there is an invertible linear transformation that relates to . Using this relation, the quantity can be written as (18)
[2] |
where and are vectors of universal constants that do not depend on the population size function η, and denotes the inner product. Under model η, the quantity is the total expected length of edges subtending b out of n individuals sampled at time 0, while the quantity is the total expected tree length for a sample of size n. Both quantities are positive for all population size functions η. For an arbitrary population size function η, we have for all , which implies
[3] |
For a constant function ,
[4] |
[5] |
where .
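As a hedged illustration of the constant-size case, the sketch below evaluates the expression of ref. 16 for the expected SFS numerically. It assumes pairwise coalescence at rate 1/η(t), so that for a constant size ε the expected time with k lineages is ε / C(k, 2); the function names `expected_sfs` and `constant_size_times` are illustrative and do not come from the paper. Under these assumptions the bth entry is proportional to 1/b, whatever the value of ε.

```python
from fractions import Fraction
from math import comb


def expected_sfs(n, expected_times):
    """Expected SFS for n samples, given expected_times[k] = E[T_{n,k}] for k = 2..n."""
    # Denominator: total expected tree length.
    total = sum(k * expected_times[k] for k in range(2, n + 1))
    sfs = []
    for b in range(1, n):
        # Numerator: expected length of branches subtending exactly b samples.
        branch = sum(
            Fraction(comb(n - b - 1, k - 2), comb(n - 1, k - 1)) * k * expected_times[k]
            for k in range(2, n - b + 2)
        )
        sfs.append(branch / total)
    return sfs


def constant_size_times(n, eps):
    """E[T_{n,k}] = eps / C(k, 2) when the size is constant (assumes rate 1/eta)."""
    return {k: Fraction(eps) / comb(k, 2) for k in range(2, n + 1)}


n = 10
sfs = expected_sfs(n, constant_size_times(n, Fraction(3, 2)))
harmonic = sum(Fraction(1, k) for k in range(1, n))
# The constant-size expected SFS is (1/b) divided by the harmonic number H_{n-1}.
assert sfs == [Fraction(1, b) / harmonic for b in range(1, n)]
```

Exact rational arithmetic is used so that the comparison with 1/b holds without floating-point tolerance.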
To formulate the problem, we use the following notation. We suppose that a sample of n randomly sampled individuals has been typed at s independent segregating sites. These data are used to form the empirical sample frequency spectrum, which is an (n − 1)-tuple whose bth entry denotes the proportion of segregating sites with b copies of the mutant allele and n − b copies of the ancestral allele. A frequency-based estimator is any statistic that maps an empirical SFS to a population size history.
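The construction of the empirical SFS can be sketched as follows. The input format (one derived-allele count per segregating site) and the name `empirical_sfs` are assumptions made for illustration only.

```python
from collections import Counter


def empirical_sfs(derived_counts, n):
    """Proportion of segregating sites carrying b derived alleles, for b = 1..n-1.

    derived_counts: one integer per segregating site (hypothetical input format).
    n: number of sampled individuals.
    """
    if not derived_counts:
        raise ValueError("at least one segregating site is required")
    if any(not 1 <= b <= n - 1 for b in derived_counts):
        raise ValueError("a segregating site must have between 1 and n-1 derived alleles")
    s = len(derived_counts)  # number of segregating sites
    tally = Counter(derived_counts)
    return tuple(tally[b] / s for b in range(1, n))


# Eight segregating sites typed in a sample of five individuals.
print(empirical_sfs([1, 1, 2, 3, 1, 2, 1, 4], n=5))  # (0.5, 0.25, 0.125, 0.125)
```

Any frequency-based estimator in the sense above takes only this tuple as its input.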
Main Results
Here, we establish a minimax lower bound on the ability of any estimator to accurately reconstruct population size functions.
A General Bound on the Kullback−Leibler Divergence Between Two SFS Distributions.
Abusing notation, we use to denote the Kullback−Leibler (KL) divergence between the probability distributions and . In Proofs, we prove the following general upper bound on the KL divergence between two SFS distributions:
Theorem 1.
Let denote a general space of population size functions and suppose satisfy for all and . Then,
[6] |
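For readers who want to experiment numerically, the following minimal sketch computes the KL divergence between two discrete distributions such as two expected SFS vectors. It does not implement the bound of Theorem 1; the function name `kl_divergence` and the example vectors are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL divergence sum_b p_b * log(p_b / q_b) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_b, q_b in zip(p, q):
        if p_b > 0:
            if q_b == 0:
                return math.inf  # p puts mass where q has none
            total += p_b * math.log(p_b / q_b)
    return total


# Hypothetical SFS vectors for a sample of n = 3 individuals.
p = [2 / 3, 1 / 3]  # expected SFS under a constant size
q = [0.6, 0.4]      # an arbitrary alternative used only for illustration
print(kl_divergence(p, q))
```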
Bounds for a Family of Piecewise Constant Models.
We now focus on a particular class of population size functions that are easier to analyze and are popular in the literature (11, 19, 20). For a fixed positive integer , let denote the space of piecewise constant size functions with exactly K pieces. A population size function η is a member of if and only if there exist positive real numbers and such that
[7] |
where, by convention, we define and . For such an η, define
[8] |
For , the expected waiting time defined in Eq. 1 is given by
[9] |
Note that since ,
[10] |
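The expected waiting time to the first coalescence under a piecewise constant size function can be evaluated in closed form. The sketch below is not a transcription of Eq. 9; it assumes pairwise coalescence at rate 1/η(t), so that the waiting time among m lineages has survival function exp(−C(m, 2) R(t)) with R the cumulative coalescence rate, and it integrates that survival function epoch by epoch. The name `first_coalescence_time` and the argument layout are hypothetical.

```python
import math


def first_coalescence_time(m, change_points, sizes):
    """Expected time to the first coalescence among m lineages.

    change_points: increasing positive times t_1 < ... < t_{K-1}.
    sizes: K positive population sizes; sizes[k] applies on [t_k, t_{k+1}).
    Assumes each pair of lineages coalesces at rate 1 / eta(t).
    """
    if len(sizes) != len(change_points) + 1:
        raise ValueError("need exactly one more size than change points")
    rate = math.comb(m, 2)  # number of lineage pairs
    total, cum_rate, start = 0.0, 0.0, 0.0
    for end, size in zip(list(change_points) + [math.inf], sizes):
        duration = end - start
        # Probability of no coalescence within this epoch, given none before it.
        decay = math.exp(-rate * duration / size) if math.isfinite(duration) else 0.0
        total += math.exp(-rate * cum_rate) * (size / rate) * (1.0 - decay)
        if math.isfinite(duration):
            cum_rate += duration / size
        start = end
    return total


# Constant size 2 with m = 4 lineages gives 2 / C(4, 2) = 1/3.
print(first_coalescence_time(4, [], [2.0]))
```

Raising the size in any epoch can only increase the returned value, which matches the monotonicity used in the proofs.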
To formulate our result, we let denote positive integers that satisfy , and introduce a subfamily of piecewise constant functions defined as follows. See Fig. 1 for illustration. We assume that all change points are fixed and that the sizes of the first I epochs are also fixed, with being the smallest size. So, all functions in are identical to each other for the first I epochs, and there is a population bottleneck in the last epoch. Then, for , every function undergoes jumps according to the following rules:
1. For the interval , takes a constant value of either h or , where and .
2. At later change points , η either stays the same or jumps upward by δ.
Hence, consists of distinct piecewise constant functions that are nondecreasing functions of t for . Note that for all . For ease of notation, we use to denote the bottleneck size and to denote the bottleneck duration. To facilitate analysis later, we fix to some positive constant for all .
For any two models in , we obtain the following bound on the difference of their waiting times to the first coalescence:
Lemma 2.
For all ,
[11] |
Together with Theorem 1, this lemma can be used to show
Theorem 3.
Let that satisfy Then,
[12] |
Proofs of these results are deferred to Proofs. It is interesting that the above bound does not depend on the number n of sampled individuals.
Minimax Lower Bounds.
Before using the above results to obtain a minimax lower bound, we first note a subtle fact. Given any population size function η, consider a function ζ that satisfies for all , where κ is some positive constant. Such functions are equivalent, as it turns out that for all and . To mod out by this equivalence, we assume that every satisfies , where is some fixed positive constant.
Let denote a generic norm (specific examples will be given later) and let denote expectation with respect to the SFS distribution induced by population size function η. Then, note that
In what follows, we will put a lower bound on the last quantity. We first fix a sensible distance metric on . An intuitive way to measure distance between two population size functions is their distance, , but this is unreasonably stringent in that if and do not agree infinitely far back into the past. Instead we will focus on the following truncated distance: , which measures the discrepancy between and back to some fixed time T in the past.
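A truncated distance of this kind is easy to compute for piecewise constant functions. The sketch below assumes that the distance is the integral of the absolute difference between the two functions over [0, T], which is an L1-type reading of the text; the representation of a size function as a pair `(change_points, sizes)` and the function names are illustrative assumptions.

```python
import bisect


def evaluate(change_points, sizes, t):
    """Value at time t of a piecewise constant function (right-continuous)."""
    return sizes[bisect.bisect_right(change_points, t)]


def truncated_l1_distance(eta_a, eta_b, T):
    """Integral of |eta_a(t) - eta_b(t)| over [0, T] for piecewise constant inputs.

    Each function is a pair (change_points, sizes) with positive, increasing
    change points and one more size than change points.
    """
    (cp_a, sz_a), (cp_b, sz_b) = eta_a, eta_b
    # Both functions are constant between consecutive grid points.
    grid = sorted({0.0, float(T), *(t for t in cp_a if t < T), *(t for t in cp_b if t < T)})
    return sum(
        abs(evaluate(cp_a, sz_a, lo) - evaluate(cp_b, sz_b, lo)) * (hi - lo)
        for lo, hi in zip(grid, grid[1:])
    )


# Two histories that differ only in the most ancient epoch (t >= 2).
eta_a = ([1.0, 2.0], [1.0, 0.5, 3.0])
eta_b = ([1.0, 2.0], [1.0, 0.5, 4.0])
print(truncated_l1_distance(eta_a, eta_b, T=5.0))  # 3.0
```

In this example the untruncated integral diverges because the two histories disagree on the final, infinitely long epoch, whereas the truncated version remains finite.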
Henceforth, let be any estimator of the population size function that operates on a sample of s independent segregating sites obtained from a sample of n randomly sampled individuals. In Proofs, we prove the following main results of our paper:
Theorem 4.
Consider the subfamily of models described above, and suppose and . Then,
[13] |
where C is a positive constant.
The above theorem applies to all models in . We now consider the subset , which is the set of all models in that are bounded by some constant M. For this family of bounded population size functions, a sharper asymptotic lower bound can be obtained as follows.
Theorem 5.
Suppose and . Then,
[14] |
where is a positive constant.
By specializing , a simplified version of Theorem 5 can be obtained:
Corollary 6.
Suppose and let . Then,
[15] |
where is a positive constant.
Note that the above lower bounds do not depend on the dimension of the SFS (which is equal to n − 1). Hence, for a fixed number s of segregating sites considered, using more individuals does not diminish the error bounds.
Bottleneck Followed by Exponential Growth.
In the results presented above, we dropped smaller terms to obtain the dominant contribution to our lower bound. Here, we provide a more detailed analysis to study how the model in the recent past (i.e., the period ) affects the lower bound. A slight modification of the above results permits us to analyze the following model class, which is of interest in, for example, human genetics (2, 3, 7): Let be the family of models illustrated in Fig. 2 with exponential growth in the recent past. Specifically, for the period . The rate of growth is defined so that for all , where . The part for is the same as that for in (Fig. 1). We obtain the following result for the subfamily :
Theorem 7.
Consider the subfamily of models described above, and suppose and . Then,
[16] |
Theorem 7 is a measure of how (a lower bound on) estimation error depends on growth following a bottleneck. The two extremes and have intuitive interpretations. For large , the bound in Eq. 16 tends to the corresponding bound given by Theorem 4, as expected since coalescences become increasingly less likely in the first time period. Small has the effect of “prolonging” the bottleneck, thus increasing the minimax lower bound. In particular, if then as , so that the effect of low population growth on the minimax lower bound is to simply prolong the bottleneck effect by an additional time periods.
Discussion
In this paper, we have theoretically characterized fundamental limits on the accuracy of demographic inference from data. We have shown that the minimax error rate for estimating the piecewise-constant demography of a single population is at least O(1/log s), where s is the number of independent segregating sites analyzed. In contrast, the minimax error for many classical estimation problems in statistics (for example, nonparametric regression or density estimation) decays inverse polynomially in the sample size (21). Compared with these problems, exponentially more samples would be required to estimate a population size history function to within a similar magnitude of error. The paper that most closely relates to the present work is by Kim et al. (22), who obtain lower bounds on the amount of exact coalescence time data necessary to distinguish between size histories in a hypothesis testing framework. Since coalescence times are never observed and must be estimated from data, these bounds place a limit on the accuracy with which a population size function can be inferred. The authors also describe an estimator that uses coalescence times (again observed without noise) to accurately recover the underlying population size function with high probability, at a rate that roughly matches the lower bound.
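To make the contrast between a logarithmic and a polynomial rate concrete, the sketch below solves for the number of sites s at which an error of the form C / log s, or C · s^(−α), reaches a target value. The constants C = 1 and α = 1/2 are arbitrary choices made for illustration and are not taken from the theorems; the function names are hypothetical.

```python
import math


def log10_sites_log_rate(target_error, constant=1.0):
    """log10 of the s solving constant / ln(s) == target_error."""
    return (constant / target_error) / math.log(10)


def log10_sites_poly_rate(target_error, constant=1.0, exponent=0.5):
    """log10 of the s solving constant * s**(-exponent) == target_error."""
    return math.log10(constant / target_error) / exponent


for err in (0.1, 0.05):
    print(err, log10_sites_log_rate(err), log10_sites_poly_rate(err))
```

Under the logarithmic rate, halving the target error doubles log10(s), that is, it squares the required number of sites; under the rate s^(−1/2), it multiplies the number of sites by four.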
Another line of work centers around the identifiability of the parameter using the SFS. Roughly speaking, a family of statistical models defined over a parameter space is identifiable if, for any with , the sampling distributions induced by and are different. In our context, this simply says that, for all n, unless almost everywhere. Standard desiderata for statistical estimators (e.g., consistency or unbiasedness) are impossible without identifiability, so it is the weakest possible regularity condition one can impose on a useful family of models.
Perhaps surprisingly, it turns out that, in general, a population size function is not identifiable from the SFS (23). Indeed, for any given , it has been shown that an infinite number of smooth functions exist such that . Moreover, explicit examples can be constructed that demonstrate this phenomenon (23). On the other hand, these counterexamples consist of functions that exhibit an unbounded frequency of oscillatory behavior near the present time, which is perhaps unrealistic when modeling naturally occurring populations. More recently, it has been shown (19) that identifiability holds for many classes of population size functions used by practitioners (including piecewise constant, piecewise exponential, and piecewise generalized exponential). Furthermore, the number n of sampled individuals sufficient for identifiability can be explicitly given and is a function of the complexity of the underlying class of models being studied (19).
Identifiability asserts that, given an infinite amount of data (specifically, taking the number of segregating sites s → ∞), the model parameter can be uniquely recovered. In practice, s is finite, and only a perturbed version of the expected frequency spectrum, say , is observed. From a practical standpoint, it is important to understand how these perturbations ultimately affect the parameter estimate . It is this question that forms the starting point for the present work.
A single population evolving under a piecewise-constant demography is a special case of many richer classes of demographic models. For example, it is a (limiting) member of the family of exponential growth models, seen by taking each exponential growth parameter to zero. In the multispecies coalescent setting (10, 24), multiple population size histories must be estimated, and the error of that estimate must necessarily be lower bounded by that of estimating a single such history. Thus, our result can be expected to apply to a broader class of models than the one we have studied here.
As detailed in Proofs, the result in Theorem 5 follows from setting and in the subfamily . The size is in coalescent units. In terms of the number of individuals, it is proportional to , where is the number of generations corresponding to duration in the coalescent limit. Intuitively, as the severity of the bottleneck increases, the population is increasingly likely to find its most recent common ancestor (MRCA) during that time; farther back in time than the MRCA, no information is conveyed concerning the demographic events experienced by the population.
One might object to considering models with a bottleneck size that scales inversely with the number s of segregating sites in the data, and it is indeed possible that a better convergence rate may be achievable for populations that are known not to contain a bottleneck. On the other hand, we note that decreases sufficiently slowly with s that our result can be expected to apply to many real-world examples. For example, for , which is a conservative upper bound for most organisms, . This implies that for populations that have experienced roughly an order-of-magnitude increase in effective population size during their history, accurate estimation of demographic events that occurred before this expansion is difficult using SFS-based methods. Additionally, an interesting aspect of our work is that our minimax lower bounds do not depend on the number n of sampled individuals; increasing n is not enough to overcome the information barrier imposed by the presence of a bottleneck. This is intuitively plausible since, as n increases, the th sampled lineage becomes more likely to coalesce early on.
An interesting question that we have not attempted to analyze is whether the rate is optimal, i.e., whether there exists some estimator that achieves the minimax lower bound established here. In practice, from Eqs. 2, 8, and 9, it can be seen that naively maximizing the likelihood of the observed SFS with respect to requires solving a nonconvex optimization problem, so that convergence to the global maximum is not even guaranteed. Computational issues aside, finding such an estimator remains an open theoretical challenge.
In closing, we stress that our result is specific to SFS-based estimators, which analyze only independent sites. The main allure of these estimators is their mathematical tractability, rather than their realism. In fact, a rich source of additional information exists in the correlation structure found among linked sites in the genome. Methods that seek to exploit this structure by modeling the action of recombination pose greater mathematical and computational difficulties, but there has been recent progress in this area (20, 25–29). Our result serves to underscore the importance of pursuing more realistic models of genomic evolution, challenging though they may be.
Proofs
Proof of Theorem 1. To simplify the notation, we write and . Then, using Eq. 2, we can write
The assumption implies that, for all times , the instantaneous rate of coalescence at time t in model η is greater than or equal to the instantaneous rate of coalescence at time in model . Hence, this assumption together with for all implies for all ; equivalently, . Additionally, and for all . Combining these facts, we obtain
where we have used in the final equality.
Proof of Lemma 2. We distinguish two particular models, , which are the lower and the upper envelopes of . The function stays constant at h for all , while jumps upward by δ at every change point . Hence, pointwise for all . The two enveloping functions will form the basis of subsequent analysis.
Fix and note that, by the definition of , one of these functions must pointwise dominate the other. Therefore, assume without loss of generality that for all t. Then, for all t,
which implies
for all . Using these inequalities, we conclude
so it suffices to demonstrate Eq. 11 for . Now, by Eq. 9 and the definition of ,
where we have used Eq. 10. Similarly,
Now, using the fact that and agree on the first I epochs, we obtain
[17] |
where the second line follows from telescoping and the fact that , while the last line follows from the fact that for all .
Proof of Theorem 3. For ease of notation, define and . By Lemma 2,
where the second inequality follows from for all . Now, noting that corresponds to the total tree length for the constant population size function and using Eq. 5, we obtain
[18] |
To finish the proof, recall that is the total expected branch length of the coalescent tree under model η. Since we have that is at least as large as the corresponding quantity under a model with constant population size ε. By Eq. 5, the total expected tree length under the latter model equals . Thus, , and combining this result with Eq. 18 gives
Finally, Eq. 12 follows from this inequality and Theorem 1.
Proof of Theorem 4. Our proof uses a generalized form of Fano’s inequality (30). Adapted to our setting and notation, the method reads as follows.
Theorem 8 (Fano’s method). Consider a space of population size models. Let be an integer, and let contain r population size functions such that for all , and . Let be an estimator of η based on the SFS data sampled independently from ; i.e., are SFS data for n individuals at s independent segregating sites. Then,
[19] |
This theorem places a lower bound on the minimax rate of convergence of a population size history estimator based on the SFS.
For , let denote the variable indicating whether η jumps by δ at change point . Let , where . By the Varshamov−Gilbert lemma (see ref. 31, Lemma 4.7), there exist such that (i) , (ii) , and (iii) , where denotes the Hamming distance.
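The role of the Varshamov−Gilbert lemma is to guarantee a large set of binary vectors that are pairwise far apart in Hamming distance. The sketch below illustrates the idea with a brute-force greedy construction for small dimensions; it does not reproduce the constants of the lemma, and the names `hamming` and `greedy_packing` are illustrative.

```python
from itertools import product


def hamming(u, v):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def greedy_packing(d, min_dist):
    """Greedily collect binary vectors of length d with pairwise distance >= min_dist."""
    chosen = []
    for w in product((0, 1), repeat=d):
        if all(hamming(w, c) >= min_dist for c in chosen):
            chosen.append(w)
    return chosen


code = greedy_packing(d=8, min_dist=3)
# Every pair of selected jump patterns differs in at least three positions.
assert all(hamming(u, v) >= 3 for i, u in enumerate(code) for v in code[:i])
print(len(code))
```

Because the greedy set is maximal, Hamming balls of radius min_dist − 1 around its elements cover the whole cube, which gives the usual counting lower bound on its size.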
Let denote the subset of functions in with the indicator variable for δ jumps at given by . Then, for any two , we have
[20] |
Using Theorem 8 via Eq. 20 and Theorem 3, we obtain
[21] |
We now optimize the bound with respect to δ. A straightforward calculation shows that the maximum is attained at
[22] |
and setting in Eq. 21 yields the result.
Proof of Theorem 5. The result is obtained by scaling ε with the number of segregating sites s. Denote this scaling by ; we will determine that produces the largest possible lower bound. Starting from Eq. 22 in the proof of Theorem 4, note that scales as . To satisfy the constraint that for all and s, the condition
[23] |
must therefore hold. This implies that as for all . Suppose that ; note that implies . Then there exists a diverging sequence with for all i, whence
From this, it follows that for sufficiently large s. Now, on the interval , the function is convex with a unique minimum at . Let be a point where . Then . If , then . Since , we then conclude , which is not bounded as .
In summary, we see that the largest possible lower bound that obeys Eq. 23 must have asymptotically , and that this bound is achieved by setting . Plugging this in to Eq. 19 yields the claim.
Proof of Corollary 6. For , choose J large enough so that , and fix so that . Then . Substituting the above inequalities into Eq. 14 and letting yields the desired result.
Proof of Theorem 7. The theorem is obtained by suitably modifying the preceding results to account for the effect of exponential growth in the first period. Let be the analogously defined upper and lower envelope functions for . Then
where we have used the definition of in the second equality. Since all size histories in are equal up to period , the steps of Lemma 2 all go through unchanged. Starting from Eq. 17, we obtain the modified bound
[24] |
Propagating the modified bound (Eq. 24) through Theorems 3 and 4 ultimately yields the claim.
Acknowledgments
We thank Anand Bhaskar for helpful comments on a draft of this paper and for suggesting Corollary 6 to simplify the presentation of the main result. We also thank Jack Kamm and Jeff Spence for useful feedback. This research is supported in part by a Citadel Fellowship (to J.T.), National Institutes of Health Grant R01-GM109454 (to Y.S.S.), a Packard Fellowship for Science and Engineering (to Y.S.S.), and a Miller Research Professorship (to Y.S.S.).
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
References
- 1. Abecasis GR, et al.; 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
- 2. Nelson MR, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337(6090):100–104. doi: 10.1126/science.1217876.
- 3. Tennessen JA, et al.; Broad GO; Seattle GO; NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–69. doi: 10.1126/science.1219240.
- 4. Fu W, et al.; NHLBI Exome Sequencing Project. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493(7431):216–220. doi: 10.1038/nature11690.
- 5. Nielsen R. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics. 2000;154(2):931–942. doi: 10.1093/genetics/154.2.931.
- 6. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5(10):e1000695. doi: 10.1371/journal.pgen.1000695.
- 7. Coventry A, et al. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat Commun. 2010;1:131. doi: 10.1038/ncomms1130.
- 8. Gazave E, et al. Neutral genomic regions refine models of recent rapid human population growth. Proc Natl Acad Sci USA. 2014;111(2):757–762. doi: 10.1073/pnas.1310398110.
- 9. Gravel S, et al.; 1000 Genomes Project. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci USA. 2011;108(29):11983–11988. doi: 10.1073/pnas.1019276108.
- 10. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9(10):e1003905. doi: 10.1371/journal.pgen.1003905.
- 11. Bhaskar A, Wang YXR, Song YS. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 2015;25(2):268–279. doi: 10.1101/gr.178756.114.
- 12. Kingman JFC. The coalescent. Stochastic Process Appl. 1982;13(3):235–248.
- 13. Kingman JFC. On the genealogy of large populations. J Appl Probab. 1982;19A:27–43.
- 14. Kingman JFC. In: Exchangeability in Probability and Statistics. Koch G, Spizzichino F, editors. North-Holland; Amsterdam: 1982. pp. 97–112.
- 15. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893–903. doi: 10.1093/genetics/61.4.893.
- 16. Griffiths R, Tavaré S. The age of a mutation in a general coalescent tree. Commun Stat Stochastic Models. 1998;14(1-2):273–295.
- 17. Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63(1):33–40. doi: 10.1016/s0040-5809(02)00010-2.
- 18. Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165(1):427–436. doi: 10.1093/genetics/165.1.427.
- 19. Bhaskar A, Song YS. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat. 2014;42(6):2469–2493. doi: 10.1214/14-AOS1264.
- 20. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–496. doi: 10.1038/nature10231.
- 21. Tsybakov AB. Introduction to Nonparametric Estimation. Springer; New York: 2009.
- 22. Kim J, Mossel E, Rácz MZ, Ross N. Can one hear the shape of a population history? Theor Popul Biol. 2014;100:26–38. doi: 10.1016/j.tpb.2014.12.002.
- 23. Myers S, Fefferman C, Patterson N. Can one learn history from the allelic spectrum? Theor Popul Biol. 2008;73(3):342–348. doi: 10.1016/j.tpb.2008.01.001.
- 24. Chen H. The joint allele frequency spectrum of multiple populations: A coalescent theory approach. Theor Popul Biol. 2012;81(2):179–195. doi: 10.1016/j.tpb.2011.11.004.
- 25. Paul JS, Steinrücken M, Song YS. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187(4):1115–1128. doi: 10.1534/genetics.110.125534.
- 26. Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics. 2013;194(3):647–662. doi: 10.1534/genetics.112.149096.
- 27. Steinrücken M, Paul JS, Song YS. A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor Popul Biol. 2013;87:51–61. doi: 10.1016/j.tpb.2012.08.004.
- 28. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 2014;10(5):e1004342. doi: 10.1371/journal.pgen.1004342.
- 29. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–925. doi: 10.1038/ng.3015.
- 30. Yu B. Assouad, Fano, and Le Cam. In: Festschrift for Lucien Le Cam. Pollard D, Torgersen E, Yang GL, editors. Springer; New York: 1997. pp. 423–435.
- 31. Massart P. Concentration Inequalities and Model Selection. Springer; Berlin: 2007.