Proceedings of the National Academy of Sciences of the United States of America. 2019 Aug 6;116(34):17115–17120. doi: 10.1073/pnas.1905060116

Inference of complex population histories using whole-genome sequences from multiple populations

Matthias Steinrücken a,b, Jack Kamm c,d, Jeffrey P Spence e, Yun S Song c,d,f,1
PMCID: PMC6708337  PMID: 31387977

Significance

An increasing number of population genomic studies now try to infer complex models of population history using a number of whole-genome sequences sampled from multiple populations. A key technical challenge to this effort is to compute model likelihoods, which involves integrating out latent variables (genealogical histories) that live in extremely high dimensions. This is a notoriously difficult computational problem, especially when the sample size is greater than a handful and the underlying population genetic model is complex. Here, we present an efficient, flexible statistical method that can scale to larger sample sizes and more populations than previously possible. Aside from demographic inference, our method can be used in other statistical inference problems in evolutionary biology and human genetics.

Keywords: coalescent, population genetics, demography, statistical inference

Abstract

There has been much interest in analyzing genome-scale DNA sequence data to infer population histories, but inference methods developed hitherto are limited in model complexity and computational scalability. Here we present an efficient, flexible statistical method, diCal2, that can use whole-genome sequence data from multiple populations to infer complex demographic models involving population size changes, population splits, admixture, and migration. Applying our method to data from Australian, East Asian, European, and Papuan populations, we find that the population ancestral to Australians and Papuans started separating from East Asians and Europeans about 100,000 y ago, and that the separation of East Asians and Europeans started about 50,000 y ago, with pervasive gene flow between all pairs of populations.


Whole-genome sequences are now routinely available for population genetic analyses, and inference methods that can take better advantage of genome-scale data have received considerable attention in recent years. In particular, there has been much interest in methods that can use the genomic data of individuals from multiple populations to infer complex models of population history. In addition to being of historical interest, population demography is important to study because it influences patterns of genetic variation, and understanding the intricate interplay between demography and other evolutionary forces such as natural selection is a major aim in population genetics.

Inferring these demographic histories is computationally and statistically challenging. One class of methods (1–8) based on the sample frequency spectrum (SFS) is computationally efficient but ignores linkage information, and the minimax rate of convergence for such estimators is poor (9). Also, their utility is limited by the fact that the number of model parameters that can theoretically be identified using the SFS alone is bounded by the sample size (10). Although no identifiability results currently exist for the general case, methods (13–21) that take linkage structure into account are empirically more statistically efficient and can be used to infer models with many parameters even when the sample size is small.* This is of practical importance, since an increasing number of studies now seek to infer complex demographic models involving multiple populations using only a small number of individuals sampled from each population (e.g., refs. 22–25). A popular demographic inference method of this kind is PSMC (pairwise sequentially Markovian coalescent) (13), which uses a pair of sequences to infer piecewise-constant population size histories. Its extension, MSMC (multiple sequentially Markovian coalescent) (18, 19), can use sequences sampled from a pair of populations to infer a genetic separation history in addition to population size changes. A more recent method called SMC++ (20) can scale to hundreds of individuals, but it is able to analyze individuals from only a pair of populations that have diverged without subsequent gene flow.

Parallel to these developments, an inference method called diCal (Demographic Inference using Composite Approximate Likelihood) (16) was introduced to infer piecewise-constant effective population size histories using multiple sequences, thereby providing improved inference about the recent past. The key mathematical component of diCal is the conditional sampling distribution (CSD) π_Θ, which describes the conditional probability of observing a new sequence or haplotype given a collection of already observed haplotypes, under a given population genetic model with parameters Θ. The corresponding genealogical process can be formulated as a hidden Markov model (HMM), enabling efficient inference.

Here, we present our method diCal2, a scalable inference tool for population genomic analysis under general demographic models, which extends diCal in several ways. diCal2 has been successfully applied in recent empirical studies of human demographic history using whole-genome sequencing data (22, 24, 25). In the present paper, we provide a detailed description of the method, carry out a simulation study to benchmark its performance, and discuss its strengths and weaknesses.

To handle gene flow between populations, diCal2 builds on previous theoretical work (26) which introduced a CSD for subdivided populations with unchanging continuous migration; that earlier work did not address parameter estimation, which is the focus of this article. In contrast to MSMC, which does not explicitly model population structure, we consider fully parametric demographic models, including subdivided population structure with migration, that are easier to interpret. Our method also enables inference under demographic models more general than the 2-population clean-split model currently implemented in SMC++. Specifically, our method is flexible enough to model 1) an arbitrary number of populations specified by the user, 2) an arbitrary pattern of population splits and mergers, 3) more general population size changes (e.g., piecewise-exponential), 4) arbitrary migration patterns with time-varying continuous migration rates or pulse admixture events, and 5) an arbitrary poly-allelic mutation model at each site (including diallelic or tetraallelic). These are significant improvements on the previous version of diCal (16), which could only be used to infer piecewise-constant population size changes in a single population.

In addition to these features, we introduce major computational improvements which enable the use of whole-genome data. The mathematical details of our method and the computational extensions are provided in Materials and Methods and SI Appendix, SI Text. Below, we briefly highlight the key technical advances.

In PSMC and the earlier version of diCal, the demographic epochs and HMM discretization intervals are both fixed, and the latter forms a strict refinement of the former. In contrast, discretization intervals and demographic epochs are decoupled in our improved version of diCal. For example, population size change points or population split times can vary freely and do not need to coincide with discretization interval boundaries. This flexibility allows for more accurate parameter estimation, especially for population split times.

Moreover, the CSDs for different haplotypes can be combined in various ways to devise a composite likelihood that can be used in a maximum likelihood framework for parameter estimation. Our implementation of the expectation–maximization (EM) algorithm allows any composite likelihood that is composed of sums and products of CSDs, which includes the product of approximate conditionals (PAC) used by Li and Stephens (27) to detect recombination hotspots.

For substantial computational speedup, we implement a previously described “locus-skipping” algorithm (28), which analytically and exactly integrates over contiguous stretches of nonsegregating loci. However, locus skipping is less computationally efficient with missing data, and thus, to efficiently incorporate missing alleles, we also implement an alternative speedup by grouping loci together into larger blocks. A similar speedup was used in PSMC, but it treats the whole block as a single diallelic site. In contrast, the blocks in our method are viewed as full nonrecombining haplotypes.

For complex demographic models, the likelihood function may have local optima. To address this issue, we implement a flexible genetic algorithm and combine it with the EM procedure to enable more efficient navigation of high-dimensional parameter space.

Lastly, we also present an application of our method to data from the Simons Genome Diversity Project (SGDP) (23) to investigate the population history of Australians, East Asians, Europeans, and Papuans. There has been some debate whether the population ancestral to Australians and Papuans (which we call Australo-Papuans, following ref. 29) split off prior to the divergence of East Asians and Europeans (e.g., ref. 29), or whether East Asians and Australo-Papuans first split from Europeans (e.g., ref. 30). We find substantial evidence in favor of the former hypothesis, but find that there has been pervasive gene flow between all of these populations since their divergence.

Results

To demonstrate the flexibility, accuracy, and efficiency of our method, we performed an extensive simulation study under a variety of biologically relevant demographic scenarios. DNA sequence data were simulated using the software scrm (31). We simulated 100 datasets for each demographic scenario and set the haplotype length to 250 Mbp for each dataset. We used 1.25×10⁻⁸ per site per generation for both the mutation and recombination rates, and set the -l parameter in scrm, which determines the length of the recombination history to be used during the simulation, to 100 kbp. Generation time was assumed to be 30 y. In each scenario, we used our method to estimate all demographic parameters of the underlying model.

Recent Exponential Growth.

The first model we investigated involves recent exponential population growth. To investigate the performance of our method, we simulated data consisting of 10 haplotypes under the demographic model depicted in Fig. 1A. We fixed TB=65 ka, TG=15 ka, NA=15,000, and NB=1,800, and used 3 different values for the growth rate, r: 0.25%, 0.5%, and 1.0% per generation.

Fig. 1.

Demographic models used in our simulation study. (A) Recent exponential population growth. An ancestral population of size NA undergoes a bottleneck at time TB, where its size is reduced to NB. Growth starts at time TG at an exponential rate r. (B) Demographic model of a population split. An ancestral population of size NA undergoes a strong bottleneck that starts at time TB in the past, and reduces the population size to NB. At time TDIV, this population then splits into 2 populations of size N1 and N2. Following the population split, migrants are exchanged at a rate m. (C) Demographic model of 3 populations with pure splits. An ancestral population of size NA splits into 2 populations of size NB and N3 at time T3. The former then again splits into 2 populations of size N1 and N2 at time T1,2.

We used the leave-one-out composite likelihood (LCL) in our EM procedure combined with a genetic algorithm to estimate all 5 parameters of the demographic model. For the genetic algorithm, we chose 50 random starting points that were each optimized for 5 EM iterations. Then we chose the 5 best parameter values (“parents”) and replaced each of them with an average of 3 “offspring” parameter sets to obtain the next “generation.” These were then optimized for 5 EM iterations. We repeated this procedure for 4 more “generations,” and reported the parameters that achieved the overall maximal likelihood value. We found that the results are robust to the choice of composite likelihood scheme.
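
The search strategy described above can be illustrated with a small, self-contained sketch. It is not the diCal2 implementation: the objective `toy_log_likelihood`, the hill-climbing routine `local_optimize` (a stand-in for a fixed number of EM iterations), and the Gaussian perturbation used to create offspring are all hypothetical choices made for illustration. Only the overall schedule (50 starting points, 5 parents, about 3 offspring per parent, 5 optimization iterations per candidate, 5 generations after the initial one) follows the text.

```python
import math
import random


def toy_log_likelihood(theta):
    """Hypothetical multimodal stand-in for the composite log-likelihood."""
    x, y = theta
    return -((x - 2.0) ** 2 + (y - 0.5) ** 2) + 0.5 * math.cos(5.0 * x)


def local_optimize(theta, loglik, n_iter, rng, step=0.1):
    """Stand-in for n_iter EM iterations: accept a random move only if it improves loglik."""
    best, best_ll = theta, loglik(theta)
    for _ in range(n_iter):
        cand = tuple(v + rng.gauss(0.0, step) for v in best)
        cand_ll = loglik(cand)
        if cand_ll > best_ll:
            best, best_ll = cand, cand_ll
    return best, best_ll


def genetic_search(loglik, n_start=50, n_parents=5, n_offspring=3,
                   n_generations=5, n_iter=5, seed=0):
    """Return the (parameters, log-likelihood) pair with the highest value seen."""
    rng = random.Random(seed)
    # Initial generation: random starting points, each locally optimized.
    starts = [tuple(rng.uniform(0.0, 4.0) for _ in range(2)) for _ in range(n_start)]
    scored = [local_optimize(s, loglik, n_iter, rng) for s in starts]
    overall_best = max(scored, key=lambda s: s[1])
    for _ in range(n_generations):
        # Keep the best candidates as parents.
        parents = sorted(scored, key=lambda s: s[1], reverse=True)[:n_parents]
        # Replace each parent by perturbed offspring (perturbation rule is an assumption).
        offspring = [tuple(v + rng.gauss(0.0, 0.3) for v in parent)
                     for parent, _ in parents for _ in range(n_offspring)]
        scored = [local_optimize(o, loglik, n_iter, rng) for o in offspring]
        overall_best = max(scored + [overall_best], key=lambda s: s[1])
    return overall_best


best_theta, best_ll = genetic_search(toy_log_likelihood)
print(best_theta, best_ll)
```

Tracking the best value across all generations, rather than only the last one, mirrors the statement that the parameters with the overall maximal likelihood are reported.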

Violin plots representing the accuracy of the inferences are shown in Fig. 2A and SI Appendix, Fig. S1. Analysis of the simulated data shows that, in these scenarios, all parameters are estimated with little variability. However, the results indicate that the estimate of the exponential growth rate is biased upward. This bias is somewhat counterbalanced by a slight downward bias of the time when growth starts, and the population size before growth starts. In fact, the estimates lead to very accurate contemporary population sizes. We note that it is possible to empirically correct for biases in applications via simulation. Furthermore, using more sequence data for each individual reduces the variability of the estimates. We stress that our method accurately estimates recent exponential growth rates using only 10 haplotypes. This is far less than the sample size (thousands to tens of thousands) required by SFS-based methods to get reasonable estimates; see Bhaskar et al. (5) and references therein.
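
To see why an upward bias in r can be offset by a downward bias in the growth onset time, consider a short numerical sketch. It assumes continuous exponential growth, so that the contemporary size is NB·exp(r·g) with g the number of generations since growth began; this functional form and the helper name `contemporary_size` are assumptions for illustration. The biased values below are hypothetical and are not estimates from the simulation study.

```python
import math

GENERATION_TIME_YEARS = 30  # generation time assumed in the simulation study


def contemporary_size(n_before_growth, rate_per_generation, growth_start_years):
    """Present-day size under continuous exponential growth (assumed functional form)."""
    generations = growth_start_years / GENERATION_TIME_YEARS
    return n_before_growth * math.exp(rate_per_generation * generations)


# True parameters of the r = 0.5% scenario: NB = 1,800 and TG = 15 ka.
true_size = contemporary_size(1800, 0.005, 15000)

# Hypothetical biased estimates: rate 10% too high, onset time shortened by the same factor.
offset_size = contemporary_size(1800, 0.005 * 1.1, 15000 / 1.1)

print(true_size, offset_size)
```

Because only the product r·g enters the exponent, the two parameter sets imply the same contemporary size even though each parameter on its own is off.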

Fig. 2.

Accuracy results of our method, diCal2. Each violin plot shows the base-2 logarithm of the relative error (estimate/truth) for the analysis of 100 simulated datasets. Thus, a value of 0 corresponds to an exact estimate, whereas +1 is a 2-fold overestimate and −1 is a 2-fold underestimate. True parameter values are shown on the x axis. (A) The recent exponential growth model shown in Fig. 1A with expansion rate r=0.5% per generation. Parameter estimates were obtained using only 10 haplotypes, which is much less than the sample size (thousands to tens of thousands) required by SFS-based methods to get good estimates. (B) Accuracy results for the clean-split model (no gene flow, m=0) shown in Fig. 1B with divergence time TDIV=20 ka. Using only 2 haplotypes in each extant population, the parameters of this clean-split model could be estimated very accurately. (C) Accuracy results for the isolation with migration (IM) model shown in Fig. 1B with divergence time TDIV=40 ka, and migration probability m=0.00025. As in the clean-split case, only 2 haplotypes in each extant population were used. Most parameter estimates show little bias or variability. See the text for further discussion. (D) Accuracy results for the 3-population model shown in Fig. 1C, using 2 haplotypes in each extant population.
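
The error measure plotted in Fig. 2 can be written as a one-line function; the name `log2_relative_error` is ours and is used only for illustration.

```python
import math


def log2_relative_error(estimate, truth):
    """Base-2 logarithm of estimate/truth: 0 is exact, +1 is 2-fold over, -1 is 2-fold under."""
    return math.log2(estimate / truth)


print(log2_relative_error(40_000, 20_000))
print(log2_relative_error(10_000, 20_000))
```

A symmetric log scale of this kind treats a 2-fold overestimate and a 2-fold underestimate as errors of equal magnitude.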

Note that, in these and the following simulations, the ancestral population size NA is estimated with less variability than the other parameters. This is likely due to the fact that, in these scenarios, a sizable fraction of the informative genealogical events happen during the last epoch, and thus the power to estimate the single parameter for the population size during this last epoch is high. This power is not affected strongly by the different sample sizes used in the different scenarios.

Population Split.

We also investigated a model of a past population split, depicted in Fig. 1B. This model allows for a bottleneck before the populations split, and subsequent gene flow following the split. We first focused on the case with no gene flow, i.e., with migration probability m=0.

We simulated datasets with 2 haplotypes in each of the extant populations. We simulated 100 datasets each for TDIV=10ka and 20ka, with the remaining parameters set to TB=70ka, NA=20,000, NB=1,800, and N1=N2=5,000. This scenario has recently been used in a study of the demographic history of Native Americans (22). In addition, we simulated 100 datasets with TDIV=70ka, setting NB=NA=20,000, thereby also removing the need for TB. For the genetic algorithm, we used 60 random starting points, and 6 “parents” for each of the following 4 “generations” for the cases TDIV=10ka and 20ka, and 40 starting points and 5 “parents” for the case TDIV=70ka. We used the LCL.

Fig. 2B and SI Appendix, Fig. S2 show the accuracy of the estimator. These empirical results demonstrate that our method is able to estimate the parameters in this clean-split model with high accuracy. Most parameters show little bias, and the empirical distributions are very narrow. Only the estimates of the extant population sizes N1 and N2 for TDIV=10ka and TDIV=20ka show a somewhat higher variability. Since this time frame is very recent on an evolutionary timescale, either more sampled haplotypes or more sequence data are required to better estimate these parameters.

Isolation with Migration.

We also investigated the demographic model shown in Fig. 1B allowing for positive gene flow after the ancestral population splits into 2. We set the migration probability to m=0.00025; i.e., an individual from population 1 can have a parent from population 2, and vice versa, with a probability of 0.00025 per generation. Using this migration probability, we simulated 100 datasets each consisting of 2 haplotypes in each extant population, using TDIV=40ka, TB=70ka, NA=20,000, NB=1,800, and N1=N2=5,000. We also simulated 100 datasets using TDIV=70ka, NB=NA=20,000, and N1=N2=5,000. In the former case, we used 70 starting points and 6 “parents” for each “generation” in the genetic algorithm, whereas, for the latter, we used 50 and 5, respectively.

Fig. 2C and SI Appendix, Fig. S3 show the accuracy of the estimator. In both scenarios, we used the pairwise composite likelihood. Again, most parameter estimates show little bias or variability, the exceptions being NB and m in the first scenario. However, we note that the evolutionary timescales involved are, again, rather short, and thus the number of events informative about these parameters is small. In practice, the variability could be reduced by using additional chromosomes.

Three-Population Model.

Our method can handle models with more than 2 extant populations each with several haplotypes. To test the accuracy of our method in this setting, we simulated data under the model depicted in Fig. 1C relating 3 extant populations. Under this model, an ancestral population of size NA splits into 2 populations of size NB and N3 at time T3. The one of size NB then splits into 2 populations of size N1 and N2 at time T1,2. We simulated 100 datasets with 2 haplotypes in each of the extant populations. We set the parameters to T1,2=30ka, T3=60ka, NA=20,000, NB=3,000, and N1=N2=N3=5,000. For the genetic algorithm, we chose 70 starting points, and 6 “parents,” and used the LCL. Fig. 2D shows the accuracy of our method. Again, the empirical distribution of the estimates shows little bias or variability.

Application to SGDP Data.

We used our method to investigate the pattern of population splits between Australians, East Asians, Europeans, and Papuans. There has been some debate about the relative ordering of population splits; specifically, there has been competing evidence about whether East Asians and Europeans split most recently (e.g., ref. 29) or whether Australo-Papuans and East Asians split most recently (e.g., ref. 30). To date these splits, we used Australian, French, Han, and Papuan individuals from the SGDP (23) and fit models for each of the 6 possible pairs of these populations, allowing for recent population size changes and pulse admixture. The model is depicted in Fig. 3, and additional details are given in Materials and Methods. The estimates of the divergence time TDIV and admixture fraction p together with confidence intervals obtained using a parametric bootstrapping approach are presented in Table 1. We found compelling evidence that Australo-Papuans and Eurasians diverged first, about 100 ka, with subsequent French–Han divergence at 53.6 ka, and the Australian–Papuan divergence at 33.9 ka. Note that, while these estimates of divergence times are largely consistent with a tree, some estimates appear to imply slightly different split times (for example, the Australian–Han divergence time is about 15 ka earlier than the Australian–French divergence time); this is likely due to model misspecification resulting from an overly simplistic model.

Fig. 3.

The demographic model used for the analysis of the French, Han, Papuan, and Australian populations from the SGDP dataset. The ancestral population has 2 periods of constant size, then splits into 2, and each of the extant populations again has 2 periods of constant size. Additionally, there is a symmetric pulse admixture event at TADM, replacing p% of the ancestors in each population.

Table 1.

Estimates of the divergence time TDIV (in kiloyears before present) and the admixture percentage p (in percent) for the respective pair of populations from the SGDP dataset

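| | Han | Papuan | Australian |
|---|---|---|---|
| French, TDIV | 54 ka [52,55] | 106 ka [104,108] | 106 ka [105,108] |
| French, p | 14.8% [14.4,15.2] | 23.5% [23.0,24.0] | 25.0% [24.3,25.7] |
| Han, TDIV | | 113 ka [110,115] | 91 ka [91,91] |
| Han, p | | 26.3% [25.0,27.7] | 24.5% [23.7,25.4] |
| Papuan, TDIV | | | 34 ka [33,35] |
| Papuan, p | | | 15.4% [14.6,16.1] |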

The 95% confidence intervals obtained from the parametric bootstrap procedure are shown in square brackets.

We also found evidence of pervasive recent gene flow. In particular, we found pulse admixture proportions of 15 to 26% between each pair of populations, all occurring 5 to 20 ka. We note that our model cannot capture all of the intricacies of human demographic history: There has likely been continuous gene flow between all populations punctuated by a few mass migrations. Our gene flow estimates likely attempt to capture both of these modes simultaneously along with indirect gene flow through intervening populations. While it is unlikely that about a quarter of any population was replaced by a geographically distant population, our results suggest that, since their divergences tens of thousands of years ago, these populations have exchanged a considerable number of migrants.

Additional details about the data, data processing, and parameter settings for our method are presented in Materials and Methods. All parameter estimates, bootstrap results, and measures of goodness-of-fit evaluated using cross-coalescence rate curves (CCRs) may be found in SI Appendix, Fig. S4 and Table S1.

There is also a separate debate about the number of out-of-Africa events (23, 29, 32), with some studies suggesting that a second, earlier wave left traces of ancestry specifically in Australo-Papuans. We caution against interpreting our results in this context, since directly testing this hypothesis would require explicitly including African populations in the analysis. Moreover, our estimates of TDIV are not directly comparable to divergence time estimates given in refs. 23, 29, and 32; as the CCRs in SI Appendix, Fig. S4 and similar curves in refs. 23 and 29 show, the populations involved already exhibited a substantial degree of structure 100 ka. Representing this complex population history by a simple model with a single pulse admixture event after splitting as in our analysis, or by a single estimate for the divergence time as in refs. 23 and 29, is certainly an oversimplification that omits relevant details. Lastly, we do not explicitly model the excess traces of Denisovan ancestry that are found in Papuans (33), which may cause differences in the estimated divergence times.

Discussion

The results described above demonstrate that our method can efficiently and accurately estimate demographic parameters in biologically relevant scenarios. Owing to its flexible framework, which allows a wide range of population histories to be considered, our method has recently been used to study the history of Native American peoples (22, 24, 25).

A limitation of our method is that it relies on the haplotype structure of the sample, and thus requires phased data. Phasing errors can lead to biased parameter estimation; see supplementary information, section S7, specifically table S13, of ref. 22 for a simulation study that explores the bias when estimating divergence times. Moreover, note that the simulations presented above were performed using homogeneous recombination and mutation rates along the genome. The EM procedure underlying our method could be modified to accommodate heterogeneous rates without severely impacting the runtime. If the correct rates are specified, we do not expect the accuracy of inference to be adversely affected. If the correct rates are not known, it would also be possible to adjust the method for joint inference, or to use rates obtained from alternative approaches (34).

Aside from demographic inference, we note that our method can be used in other population genetic problems of interest, such as model selection (see, for example, supplementary information 18.4 of ref. 24). Furthermore, the posterior decoding of latent variables in our CSD can be used in detecting admixture tracts (35), estimating fine-scale recombination rates in admixed individuals, distinguishing ancestral and introgressed polymorphism, and detecting incomplete lineage sorting. Also, applying our CSD in methods for phasing genotypes, imputing missing sequence data, and detecting identity-by-descent tracts (36) would make it possible to properly account for demography or infer it simultaneously, thus potentially improving accuracy. Lastly, it is straightforward to incorporate temporal samples (ancient DNA sequences) into our method (24), which leads to further interesting applications.

Materials and Methods

Here, we briefly describe our method, a composite likelihood framework to estimate demographic parameters using EM. Further details are provided in SI Appendix, SI Text. We also describe our analysis of the SGDP data.

Demographic Inference Using diCal2.

A central building block of our method diCal2 for demographic inference is the CSD π_Θ(h | α, n). It denotes the probability of observing the sequence or haplotype h in subpopulation α, given that the haplotypes n have already been observed in their respective subpopulations and the underlying demography is described by the parameters Θ. The CSD can be described using a sequentially Markovian genealogical process (37, 38) that approximates the true conditional genealogical process. Subsequent approximations to this sequential process lead to an HMM with finite hidden state space that can be used to efficiently compute approximate CSDs. We provide the details of the HMM approximations in SI Appendix, section 1. The CSDs presented in Steinrücken et al. (26) and Sheehan et al. (16) can be obtained as special cases of the model presented here.

The CSD can then be used to define composite likelihood functions, which, in turn, enable us to perform maximum composite likelihood inference of the demographic parameters, Θ. We can use any such composite likelihood function that is composed of sums and products of CSDs, for example, the PAC framework which has been used successfully by Li and Stephens (27) to infer recombination hotspots.
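
The way composite likelihoods are assembled from CSDs can be sketched in a few lines. The function `toy_csd` below is a hypothetical copying model used only as a placeholder; the actual CSD is computed with the HMM described in SI Appendix. The three combinations shown are plausible readings of the names used in the text (PAC, leave-one-out, and pairwise) and should be taken as assumptions rather than as the exact definitions implemented in diCal2.

```python
import math


def toy_csd(h, observed, theta=0.1):
    """Hypothetical CSD: h copies one observed haplotype chosen uniformly,
    and each site differs from the copied one with probability theta."""
    if not observed:
        return 0.5 ** len(h)  # no haplotypes to copy from: uniform over diallelic sites
    total = 0.0
    for g in observed:
        p = 1.0
        for a, b in zip(h, g):
            p *= (1.0 - theta) if a == b else theta
        total += p
    return total / len(observed)


def log_pac(haps, csd=toy_csd):
    """Product of approximate conditionals: each haplotype given those before it."""
    return sum(math.log(csd(h, haps[:i])) for i, h in enumerate(haps))


def log_lcl(haps, csd=toy_csd):
    """Leave-one-out: each haplotype given all of the others."""
    return sum(math.log(csd(h, haps[:i] + haps[i + 1:])) for i, h in enumerate(haps))


def log_pcl(haps, csd=toy_csd):
    """Pairwise: each haplotype given each single other haplotype."""
    return sum(math.log(csd(hi, [hj]))
               for i, hi in enumerate(haps)
               for j, hj in enumerate(haps) if i != j)


haplotypes = ["0010", "0110", "0011"]
print(log_pac(haplotypes), log_lcl(haplotypes), log_pcl(haplotypes))
```

Note that the PAC value in this sketch depends on the order of the haplotypes, whereas the leave-one-out and pairwise forms do not.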

To find the parameter values that maximize this composite likelihood, we employ the composite likelihood in the standard EM framework (39). While, in principle, all parameters of the model can be inferred, we focus on the demographic parameters, Θ. Since it is not possible to derive a closed form solution for the maximum in the maximization step in general, we employ numerical optimization schemes, like the Nelder–Mead simplex algorithm (40), to efficiently determine the requisite maximum. We provide mathematical details for the implementation of the EM algorithm in SI Appendix, section 3.

In SI Appendix, section 4, we provide details on the implementation of the “locus-skipping” algorithm, and the alternative speedup that groups loci into larger blocks. Furthermore, in SI Appendix, section 5, we provide mathematical details of the modifications to the trunk genealogy to increase accuracy. Finally, we describe, in SI Appendix, section 6, how to employ a discretization for the HMM computations that differs from the partition induced by the demography and remains fixed throughout the optimization procedure.

Runtime.

The runtime of our implementation of the EM algorithm is linear in the number of haplotypes times the number of CSDs in the composite likelihood and quadratic in the number of populations involved. The E step depends linearly on the length of the haplotypes, whereas the M step is independent of this quantity. The exact complexity and runtime of parameter estimation depends on the composite likelihood used, the details of the genetic algorithm, and the number of parameters to estimate. The analyses of the simulated data presented in this section were performed on a cluster of AMD Opteron processors. The raw sequential runtime of analyzing a single dataset averaged 100 to 120 CPU hours, but, by taking advantage of the independence structure of the composite likelihood and the genetic algorithm, we were able to decrease the runtime to an average of 15 to 20 wall clock hours, using up to 16 cores in parallel. The one exception was the 3-population model, where the parallelized version took, on average, 70 wall clock hours, due to the more complex demographic model and the increased number of haplotypes.

SGDP Analysis.

For the analysis of the SGDP data, we used the following individuals: B_Australian-3, B_Australian-4, S_French-1, S_French-2, B_French-3, S_Han-1, S_Han-2, B_Han-3, S_Papuan-1, S_Papuan-3, and B_Papuan-15. The data were phased using Shapeit (41) with read-based phasing (phased data provided by I. Mathieson, Department of Genetics, University of Pennsylvania, Philadelphia, PA), and all sites in Heng Li’s 75-bp universal mask (23) were treated as missing. As in our simulations, we used a mutation rate and recombination rate of 1.25×10⁻⁸ per base per generation, and assumed a generation time of 30 y. For each pair of populations, we used all of the individuals from those populations and used all of the autosomes to fit a model where each population has a constant size from present to 5 ka, and another constant size from 5 ka until the divergence time of the populations. The ancestral population is assumed to be a constant size from the divergence time until 100 ka, beyond which we infer a separate constant size. We allow a symmetric pulse migration between the 2 populations. To fit this model, we performed 4 iterations of the genetic algorithm, starting from 15 arbitrary points, keeping the 3 best particles at each iteration, and then spawning a total of 10 particles. Each genetic algorithm iteration consisted of 6 EM iterations for each particle, using the LCL. For computational efficiency, we grouped loci into 2.5-kbp bins (SI Appendix, section 4), and discretized time with 8 log-uniformly spaced break points between 1.5 ka and 5 Ma. Each genome-wide analysis of a pair of populations with combined sample size 10 to 12 took ∼90 to 145 wall clock hours on AMD Opteron processors, using up to 10 cores in parallel and less than 30 GB of memory.
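
As a small illustration of the time discretization mentioned above, the following sketch computes break points that are evenly spaced on a logarithmic scale. It assumes that the 8 break points include both end points (1.5 ka and 5 Ma); the text does not state this explicitly, and the function name `log_uniform_breakpoints` is ours.

```python
import math

GENERATION_TIME_YEARS = 30  # generation time assumed in the analysis


def log_uniform_breakpoints(first, last, count):
    """Return `count` points from `first` to `last` (inclusive), evenly spaced in log space."""
    log_first, log_last = math.log(first), math.log(last)
    step = (log_last - log_first) / (count - 1)
    return [math.exp(log_first + i * step) for i in range(count)]


breaks_years = log_uniform_breakpoints(1_500, 5_000_000, 8)
breaks_generations = [t / GENERATION_TIME_YEARS for t in breaks_years]
for years, gens in zip(breaks_years, breaks_generations):
    print(f"{years:12.0f} y = {gens:10.0f} generations")
```

Log-uniform spacing places more break points in the recent past, where successive intervals cover shorter spans of time.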

As seen in the simulations, the raw inferred parameters may be biased. To address this issue and infer confidence intervals, we performed a parametric bootstrap using msprime (42). For each pair of populations, we simulated 10 full genome datasets, and reran our method on each of these datasets. Our reported estimates are “debiased” estimates, obtained by subtracting the estimated bias from our raw estimates. We then used the bootstraps to estimate a SD for each parameter, and reported confidence intervals based on a normal approximation (i.e., the debiased estimate ±1.96 SD). To avoid having these debiased estimates or confidence intervals fall outside of the domain of the parameters (e.g., negative population sizes, times, or pulse proportions or pulse proportions >100%), all debiasing and confidence intervals were computed in log space for population sizes and times and in logit space for pulse proportions. The resulting estimates and confidence intervals were then transformed back to their natural space using the exponential map and logistic map, respectively. We note that this procedure means that our estimates are unbiased in log space or logit space, and may be slightly biased in their natural scale. All parameter estimates and bootstrap results are presented in SI Appendix, Table S1. We also note that the bootstrap results were obtained using the same grouping of loci (2.5-kbp bins), and thus show that this procedure does not severely impact accuracy.
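
The debiasing and interval construction described above can be sketched as follows. The sketch assumes that the bootstrap datasets are simulated under the raw estimate, so that the bias is estimated as the mean of the bootstrap re-estimates minus the raw estimate in the transformed space, and that the SD is the sample SD of the transformed bootstrap estimates. These details, the function names, and the numbers in the usage lines are illustrative assumptions, not values from the analysis.

```python
import math
import statistics


def logit(p):
    """Map a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit (logistic map)."""
    return 1.0 / (1.0 + math.exp(-x))


def debias_with_ci(raw, boot, forward=math.log, inverse=math.exp, z=1.96):
    """Debias `raw` using bootstrap re-estimates `boot` in a transformed space,
    then map the estimate and a normal-approximation interval back."""
    t_raw = forward(raw)
    t_boot = [forward(b) for b in boot]
    bias = statistics.mean(t_boot) - t_raw  # assumed: simulations used the raw estimate
    center = t_raw - bias                   # debiased estimate in transformed space
    sd = statistics.stdev(t_boot)           # sample SD across bootstrap replicates
    return inverse(center), (inverse(center - z * sd), inverse(center + z * sd))


# Hypothetical divergence time (years) and bootstrap re-estimates, handled in log space.
print(debias_with_ci(100_000, [104_000, 107_000, 103_000, 106_000]))

# Hypothetical pulse proportion and bootstrap re-estimates, handled in logit space.
print(debias_with_ci(0.25, [0.27, 0.26, 0.28, 0.25], forward=logit, inverse=expit))
```

Working in log or logit space guarantees that the back-transformed estimates and interval end points stay positive, or between 0 and 1, respectively.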

To assess goodness of fit, we used MSMC to infer CCRs on the real data, and then on data simulated under our debiased estimates, presented in SI Appendix, Fig. S4. For each pair of populations, we used a single diploid from each population (B_Australian-3, S_French-1, S_Han-1, and S_Papuan-1), using all of the autosomes and again treating all sites in Heng Li’s 75-bp universal mask as missing. We simulated 5 replicates of each pair of populations to assess the variability in the MSMC CCRs. The CCRs are qualitatively similar between the real and simulated data, and the fit is quite good for a model with only 9 parameters. As discussed above, introducing additional size changes and migration rates would likely improve the fit.

Software Availability.

The algorithms described here are implemented in a new version of the software package diCal2, which is available for download at https://sourceforge.net/projects/dical2.

Supplementary Material

Supplementary File
pnas.1905060116.sapp.pdf (831.9KB, pdf)

Acknowledgments

We thank Sara Mathieson and Geno Guerra for helpful discussions and for testing our software. Furthermore, we thank Iain Mathieson for helpful discussions and providing the phased SGDP data. This research is supported, in part, by National Institutes of Health Grant R01-GM094402 and a Packard Fellowship for Science and Engineering. Y.S.S. is a Chan Zuckerberg Biohub Investigator.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

*The question of whether the distribution of pairwise coalescence times uniquely determines the demographic model has been answered recently (11, 12).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1905060116/-/DCSupplemental.

References

  • 1. Nielsen R., Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154, 931–942 (2000).
  • 2. Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5, e1000695 (2009).
  • 3. Lukić S., Hey J., Demographic inference using spectral methods on SNP data, with an analysis of the human out-of-Africa expansion. Genetics 192, 619–639 (2012).
  • 4. Excoffier L., Dupanloup I., Huerta-Sanchez E., Sousa V., Foll M., Robust demographic inference from genomic and SNP data. PLoS Genet. 9, e1003905 (2013).
  • 5. Bhaskar A., Wang Y. R., Song Y. S., Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 25, 268–279 (2015).
  • 6. Kamm J. A., Terhorst J., Song Y. S., Efficient computation of the joint sample frequency spectra for multiple populations. J. Comput. Graph. Stat. 26, 182–194 (2017).
  • 7. Jouganous J., Long W., Ragsdale A. P., Gravel S., Inferring the joint demographic history of multiple populations: Beyond the diffusion approximation. Genetics 206, 1549–1567 (2017).
  • 8. Kamm J., Terhorst J., Durbin R., Song Y. S., Efficiently inferring the demographic history of many populations with allele count data. J. Am. Stat. Assoc., 10.1080/01621459.2019.1635482 (2019).
  • 9. Terhorst J., Song Y. S., Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proc. Natl. Acad. Sci. U.S.A. 112, 7677–7682 (2015).
  • 10. Bhaskar A., Song Y. S., Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann. Stat. 42, 2469–2493 (2014).
  • 11. Kim J., Mossel E., Rácz M. Z., Ross N., Can one hear the shape of a population history? Theor. Popul. Biol. 100, 26–38 (2015).
  • 12. Kim Y., Koehler F., Moitra A., Mossel E., Ramnarayan G., “How many subpopulations is too many? Exponential lower bounds for inferring population histories” in Research in Computational Molecular Biology. RECOMB 2019, Cowen L., Ed. (Lecture Notes in Computer Science, Springer, 2019), vol. 11467, pp. 136–157.
  • 13. Li H., Durbin R., Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).
  • 14. Mailund T., Dutheil J. Y., Hobolth A., Lunter G., Schierup M. H., Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genet. 7, e1001319 (2011).
  • 15. Palamara P. F., Lencz T., Darvasi A., Pe’er I., Length distributions of identity by descent reveal fine-scale demographic history. Am. J. Hum. Genet. 91, 809–822 (2012).
  • 16. Sheehan S., Harris K., Song Y. S., Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics 194, 647–662 (2013).
  • 17. Harris K., Nielsen R., Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 9, e1003521 (2013).
  • 18. Schiffels S., Durbin R., Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).
  • 19. Wang K., Mathieson I., O’Connell J., Schiffels S., Tracking human population structure through time from whole genome sequences. bioRxiv:10.1101/585265 (21 March 2019).
  • 20. Terhorst J., Kamm J. A., Song Y. S., Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49, 303–309 (2017).
  • 21. Spence J. P., Steinrücken M., Terhorst J., Song Y. S., Inference of population history using coalescent HMMs: Review and outlook. Curr. Opin. Genet. Dev. 53, 70–76 (2018).
  • 22. Raghavan M., et al., Genomic evidence for the Pleistocene and recent population history of Native Americans. Science 349, aab3884 (2015).
  • 23. Mallick S., et al., The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
  • 24. Moreno-Mayar J. V., et al., Terminal Pleistocene Alaskan genome reveals first founding population of native Americans. Nature 553, 203–207 (2018).
  • 25. Moreno-Mayar J. V., et al., Early human dispersals within the Americas. Science 362, aav2621 (2018).
  • 26. Steinrücken M., Paul J. S., Song Y. S., A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor. Popul. Biol. 87, 51–61 (2013).
  • 27. Li N., Stephens M., Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics 165, 2213–2233 (2003).
  • 28. Paul J. S., Song Y. S., Blockwise HMM computation for large-scale population genomic inference. Bioinformatics 28, 2008–2015 (2012).
  • 29. Malaspinas A. S., et al., A genomic history of Aboriginal Australia. Nature 538, 207–214 (2016).
  • 30. Wall J. D., Inferring human demographic histories of non-African populations from patterns of allele sharing. Am. J. Hum. Genet. 100, 766–772 (2017).
  • 31. Staab P. R., Zhu S., Metzler D., Lunter G., scrm: Efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics 31, 1680–1682 (2015).
  • 32. Pagani L., et al., Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538, 238–242 (2016).
  • 33. Sankararaman S., et al., The combined landscape of Denisovan and Neanderthal ancestry in present-day humans. Curr. Biol. 26, 1241–1247 (2016).
  • 34. Spence J. P., Song Y. S., Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. bioRxiv:10.1101/532168 (28 January 2019).
  • 35. Steinrücken M., Spence J. P., Kamm J. A., Wieczorek E., Song Y. S., Model-based detection and analysis of introgressed Neanderthal ancestry in modern humans. Mol. Ecol. 27, 3873–3888 (2018).
  • 36. Tataru P., Nirody J., Song Y. S., diCal-IBD: Demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics 30, 3430–3431 (2014).
  • 37. Wiuf C., Hein J., Recombination as a point process along sequences. Theor. Pop. Biol. 55, 248–259 (1999).
  • 38. McVean G. A., Cardin N. J., Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 1387–1393 (2005).
  • 39. Dempster A. P., Laird N. M., Rubin D. B., Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B Met. 39, 1–38 (1977).
  • 40. Nelder J. A., Mead R., A simplex method for function minimization. Comput. J. 7, 308–313 (1965).
  • 41. Delaneau O., Howie B., Cox A., Zagury J. F., Marchini J., Haplotype estimation using sequence reads. Am. J. Hum. Genet. 93, 687–696 (2013).
  • 42. Kelleher J., Etheridge A. M., McVean G., Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol. 12, 1–22 (2016).

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
