Genetics. 2016 Nov 8;205(1):385–395. doi: 10.1534/genetics.116.192963

Modeling Human Population Separation History Using Physically Phased Genomes

Shiya Song, Elzbieta Sliwerska, Sarah Emery, Jeffrey M Kidd
PMCID: PMC5223516  PMID: 28049708

Abstract

Phased haplotype sequences are a key component in many population genetic analyses since variation in haplotypes reflects the action of recombination, selection, and changes in population size. In humans, haplotypes are typically estimated from unphased sequence or genotyping data using statistical models applied to large reference panels. To assess the importance of correct haplotype phase on population history inference, we performed fosmid pool sequencing and resolved phased haplotypes of five individuals from diverse African populations (including Yoruba, Esan, Gambia, Maasai, and Mende). We physically phased 98% of heterozygous SNPs into haplotype-resolved blocks, obtaining a block N50 of 1 Mbp. We combined these data with additional phased genomes from San, Mbuti, Gujarati, and Centre d’Étude du Polymorphisme Humain European populations and analyzed population size and separation history using the pairwise sequentially Markovian coalescent and multiple sequentially Markovian coalescent models. We find that statistically phased haplotypes yield a more recent split-time estimation compared with experimentally phased haplotypes. To better interpret patterns of cross-population coalescence, we implemented an approximate Bayesian computation approach to estimate population split times and migration rates by fitting the distribution of coalescent times inferred between two haplotypes, one from each population, to a standard isolation-with-migration model. We inferred that the separation between hunter-gatherer populations and other populations happened ∼120–140 KYA, with gene flow continuing until 30–40 KYA; separation between west-African and out-of-Africa populations happened ∼70–80 KYA; while the separation between Maasai and out-of-Africa populations happened ∼50 KYA.

Keywords: fosmid pool sequencing, haplotype, population split time, PSMC, MSMC, approximate Bayesian computation


HAPLOTYPES contain rich information about population history and are shaped by population size, natural selection, and recombination (Veeramah and Hammer 2014; Schraiber and Akey 2015). Due to historic recombination events, there are hundreds of thousands of pairs of loci along a chromosome that have distinct histories. Recent methodological advances permit the estimation of a detailed population demographic history from a single or several whole-genome sequences based on the distribution of coalescent times across the genome. For example, Li and Durbin (2011) developed the pairwise sequentially Markovian coalescent (PSMC) model to reconstruct the distribution of the time since the most recent common ancestor (TMRCA) between the two alleles of an individual, and infer population size changes over time. Typically, these TMRCA values are calculated using the two haploid genomes that compose the diploid genome of a single sample (Li and Durbin 2011). When PSMC is applied to two haplotypes obtained from different populations, the inferred TMRCA distribution is informative about the timing of population splits, since the time after which nearly no coalescence events occur is a good estimate for the population split time. One key question regarding human population history is the timing of population splits and the dynamics of separation between Africans and non-Africans, which has a great influence on modern genetic diversity. Li and Durbin (2011) paired X chromosomes from African and non-African males and suggested that the two groups remained as one population until 60–80 KYA with substantial genetic exchange up until 20–40 KYA [assuming a mutation rate of 2.5 × 10⁻⁸ per bp per generation and 25 years as generation time; these estimates approximately double when assuming a mutation rate of 1.25 × 10⁻⁸ per bp per generation and 30 years as generation time (Schiffels and Durbin 2014)]. Subsequently, PSMC applied to pseudodiploid sequences was used to date the divergence time between nonhuman primate subspecies (Prado-Martinez et al. 2013). However, PSMC curves themselves provide only a qualitative measure of population separation, and estimating split times is complicated by the presence of migration (Pritchard 2011).

The multiple sequentially Markovian coalescent (MSMC) model (Schiffels and Durbin 2014) extends PSMC to multiple individuals, focusing on the first coalescence event for any pair of haplotypes. With multiple haplotypes from different populations, MSMC calculates the ratio between cross-population and within-population coalescence rates, termed the “relative cross-coalescence rate,” a value reflecting population separation history. Schiffels and Durbin (2014) applied MSMC on statistically phased genomes (two or four haplotypes per population) and suggested that African and non-African populations exhibited a slow, gradual separation beginning earlier than 200 KYA and lasting until ∼40 KYA, while the median point of such divergence was ∼60–80 KYA. The midpoint of the relative cross-coalescence decay curve has been used as an estimate of population separation time (Pagani et al. 2015; Schiffels and Durbin 2014). Although useful, this approach does not generate parametric estimates for population history under standard models. As none of these methods to infer population separation history were applied on physically phased genomes, it is unclear how phasing errors and missing data affect this type of analysis.

In this article, we construct physically phased genomes of five individuals from diverse African populations [including Yoruba (YRI), Esan (ESN), Gambia (GWD), Maasai (MKK), and Mende (MSL)]. We reanalyze fosmid sequencing data for individuals from the Gujarati (GIH), San, and Mbuti populations, assess the ability to correctly assemble SNP haplotypes using fosmid pool sequencing, and compare the resulting data with statistically phased haplotypes. We have previously compared several reconstructed haplotypes from a subset of these samples with those released by phase three of the 1000 Genomes Project (1000 Genomes Project Consortium 2015). In this article, we focus on how well the existing statistical phasing software SHAPEIT (Delaneau et al. 2012) performs, given the available 1000 Genomes reference panel, and how different reference panels perform, especially for samples from populations not represented in the panel. We further assess the impact of phasing error on MSMC’s estimates of population split times using physically phased genomes vs. statistically phased genomes. Finally, we extend the current PSMC method to model population splits. We apply an approximate Bayesian computation (ABC) method to obtain posterior estimates of split time and migration rate by fitting the inferred TMRCA distribution obtained from PSMC on pseudodiploid genomes to a standard isolation-with-migration (IM) model. Additionally, we assess the sensitivity of existing methods to missing data and phasing errors from statistically phased haplotypes.

Materials and Methods

Reconstructing haplotypes using fosmid pool sequencing

We performed fosmid pool sequencing and standard Illumina sequencing on individuals NA19240, HG03428, HG02799, HG03108, and NA21302; the detailed methods are described in the phase three 1000 Genomes article (1000 Genomes Project Consortium 2015) (Supplemental Material, Table S1, Table S2, and Table S3). Paired-end reads were aligned to the reference genome assembly (Genome Reference Consortium human genome build 37, with the pseudoautosomal regions of the Y chromosome masked to “N”) using Burrows–Wheeler Aligner version 0.5.9-r16 (Li and Durbin 2009). PCR duplicates were removed by Picard version 1.62. Reads in regions with known indels were locally realigned and base quality scores were recalibrated using Genome Analysis Toolkit (GATK) (McKenna et al. 2010). We generated genomic VCF (GVCF) files with a record for every position in the genome using GATK HaplotypeCaller version 3.2-2. Variants were called using GenotypeGVCFs and filtered by applying Variant Quality Score Recalibration (VQSR), implemented in GATK, to select a SNP set that included 99% of sites that intersect with the HapMap and 1000 Genomes training set. We define callable regions as sites that are within half and two times the average coverage, and with genotype and mapping quality scores >20. We kept variants that either passed VQSR filtering or were present in the phase one 1000 Genomes reference panel, which served as the starting point for subsequent haplotype phasing. We followed the procedure described in the phase three 1000 Genomes article (1000 Genomes Project Consortium 2015) to process fosmid sequencing data. We generated phased haplotypes from five individuals plus samples NA20847 (Kitzman et al. 2011), HGDP01029, and HGDP00456 (Meyer et al. 2012), which were published previously, and obtained haplotype phased blocks using ReFHap (Duitama et al. 2012).

For NA19240, HG02799, HG03108, and NA21302, we used phase-determined SNPs from trio genotyping available from HapMap and Affymetrix to guide paternal and maternal allele assignment within blocks. We determined paternal and maternal allele identity based on the majority of phased SNP assignments, then identified and corrected switch errors only if the increase in the minimum error correction (MEC) value was <50 after correction. For NA20847, HG03428, HGDP01029, and HGDP00456, phase-determined SNPs from trio data are unavailable. For these samples, we applied Prism (Kuleshov et al. 2014), a statistical phasing algorithm designed to assemble local blocks into long global haplotype contigs. This method is an extension of the Li and Stephens model (Li and Stephens 2003), which uses a reference panel of phased haplotypes and a genetic map of the genome, with an additional parameter representing the phase of each block in the hidden Markov model to enforce the locally phased structure at the global phasing level. We grouped local blocks into windows with size <1 Mbp and with at least two local blocks. Each window overlapped by one local block, which was used to link adjacent windows together. For sample NA12878, we directly used the phased SNP haplotypes constructed by fosmid pool sequencing from a previous study (Duitama et al. 2012), downloaded from http://www.molgen.mpg.de/~genetic-variation/SIH/data, and we obtained callable regions and high-confidence SNP call sets from the sequencing results of the pilot 1000 Genomes Project (1000 Genomes Project Consortium 2010) to construct full haplotypes.

MSMC analysis

We applied the MSMC (Schiffels and Durbin 2014) model on four haplotypes, two haplotypes per individual for each population. We used “fixedRecombination” and “skipAmbiguous” for inference of population separation. MSMC analysis yields inferred cross-population and within-population coalescence rates. We calculated the relative cross-coalescence rate (RCCR) by dividing the cross-population coalescence rate by the average of the two within-population coalescence rates and plotted it as a function of time. We also applied MSMC on individual diploid genomes, which is very similar to PSMC but with subtle differences due to the underlying model SMC′ (Marjoram and Wall 2006) vs. SMC (McVean and Cardin 2005). To differentiate it from PSMC, we refer to such analysis as PSMC′.
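The RCCR calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function name, argument names, and the example rates are all hypothetical, and the three input lists are assumed to hold the two within-population rates and the cross-population rate for the same time intervals.

```python
def relative_cross_coalescence(within_pop1, cross_pop, within_pop2):
    """Return the RCCR per time interval.

    Each RCCR value is the cross-population coalescence rate divided by
    the average of the two within-population coalescence rates.
    """
    rccr = []
    for l00, l01, l11 in zip(within_pop1, cross_pop, within_pop2):
        mean_within = (l00 + l11) / 2.0
        rccr.append(l01 / mean_within)
    return rccr


# Made-up rates for three time intervals, ordered from recent to ancient.
within_pop1 = [4.0, 2.0, 1.0]
cross_pop = [0.3, 1.0, 1.0]
within_pop2 = [2.0, 2.0, 1.0]
rccr = relative_cross_coalescence(within_pop1, cross_pop, within_pop2)
```

In this toy input the ratio rises from about 0.1 in the most recent interval to 1.0 in the most ancient one, which is the qualitative shape expected for two populations that were once a single population.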

PSMC on a pseudodiploid genome

PSMC (Li and Durbin 2011) inference was performed as previously described (Li and Durbin 2011). PSMC builds a hidden Markov model to infer the local TMRCA based on the local density of heterozygotes. In the model, hidden states are discretized TMRCA values, and transitions represent ancestral recombination events. On autosomal data, we use the default setting with Tmax = 15, n = 64, and pattern “1×4+25×2+1×4+1×6.” When applying PSMC on a pseudodiploid genome, there are four possible configurations of the two haplotypes, namely, hap1-hap1, hap1-hap2, hap2-hap1, and hap2-hap2. We applied PSMC to each possible configuration and took the average of the estimates. We obtained the inferred TMRCA distribution directly from the PSMC output, with the fifth column representing the fraction of the genome that coalesced in an indicated TMRCA bin.
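To make the parameterization above concrete, the sketch below expands the interval pattern and computes the boundaries of the discretized TMRCA states. The boundary formula, t_i = 0.1 × (exp((i/n) × log(1 + 10 × Tmax)) − 1), is the one given by Li and Durbin (2011) as we recall it and is not stated in this article, so it should be treated as an assumption; the function names are hypothetical.

```python
import math


def parse_pattern(pattern):
    """Expand a pattern such as '1×4+25×2+1×4+1×6' into interval-group sizes.

    'R×W' (or 'R*W') means R free parameters that each span W atomic intervals;
    a bare number 'W' means one parameter spanning W intervals.
    """
    groups = []
    for term in pattern.replace("×", "*").split("+"):
        parts = [int(p) for p in term.split("*")]
        if len(parts) == 1:
            groups.append(parts[0])
        else:
            repeat, width = parts
            groups.extend([width] * repeat)
    return groups


def time_boundaries(n, t_max):
    """Return n + 1 interval boundaries in coalescent units (assumed formula)."""
    return [0.1 * (math.exp(i / n * math.log(1 + 10 * t_max)) - 1)
            for i in range(n + 1)]


groups = parse_pattern("1×4+25×2+1×4+1×6")
boundaries = time_boundaries(n=sum(groups), t_max=15.0)
```

With the default settings quoted above, the pattern yields 28 free population-size parameters over 64 atomic intervals, and the last boundary equals Tmax = 15.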

ABC analysis

We implemented an ABC framework to estimate split time and migration rate given the inferred TMRCA distribution from PSMC output. We computed the coalescence time density of two chromosomes based on the IM model (Wang and Hey 2010; Hobolth et al. 2011), and integrated coalescence time density on the 64 time intervals in which PSMC is parameterized. We used chi-square statistics calculated between the observed TMRCA distribution obtained from PSMC output and the computed one as the distance between estimates in the ABC framework.
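The distance used in the ABC framework can be illustrated as follows. This is a minimal sketch that assumes the Pearson form of the chi-square statistic, sum((observed − expected)² / expected), which the text does not spell out; the function name, the `eps` guard, and the three-bin example are hypothetical (the real comparison uses the 64 PSMC time intervals).

```python
def chi_square_distance(observed, expected, eps=1e-12):
    """Chi-square distance between two binned TMRCA distributions.

    `observed` is the fraction of the genome per time bin from PSMC output,
    `expected` is the fraction predicted by the IM model for the same bins.
    `eps` avoids division by zero for bins with no expected mass.
    """
    return sum((o - e) ** 2 / (e + eps) for o, e in zip(observed, expected))


# Toy three-bin example.
observed = [0.20, 0.50, 0.30]
expected = [0.25, 0.50, 0.25]
distance = chi_square_distance(observed, expected)
```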

We formulate the IM model as a continuous time Markov chain (Wang and Hey 2010; Hobolth et al. 2011). The rate matrix Q is given by:

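$$Q = \begin{pmatrix} \cdot & 2m_1 & 0 & 2/\theta_1 & 0 \\ m_2 & \cdot & m_1 & 0 & 0 \\ 0 & 2m_2 & \cdot & 0 & 2/\theta_2 \\ 0 & 0 & 0 & \cdot & m_1 \\ 0 & 0 & 0 & m_2 & \cdot \end{pmatrix},$$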

where the states (in row and column order) are S₁₁ (both genes are in population 1), S₁₂ (one gene is in population 1 and the other is in population 2), S₂₂ (both genes are in population 2), S₁ (the genes have coalesced and the single gene is in population 1), and S₂ (the genes have coalesced and the single gene is in population 2); θ₁ and θ₂ are the scaled population sizes; and m₁ and m₂ are the migration rates. Each diagonal entry, shown as a dot, is the negative sum of the other entries in its row. The density of coalescence time can be calculated as follows:

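$$f(t) = \left(e^{Qt}\right)_{S_{12}S_{11}} \frac{2}{\theta_1} + \left(e^{Qt}\right)_{S_{12}S_{22}} \frac{2}{\theta_2} \quad \text{for } t < T,$$
$$f(t) = \left[\left(e^{QT}\right)_{S_{12}S_{11}} + \left(e^{QT}\right)_{S_{12}S_{12}} + \left(e^{QT}\right)_{S_{12}S_{22}}\right] \times \frac{2}{\theta_a(t)} \times \exp\left(-\int_T^t \frac{2}{\theta_a(s)}\,ds\right) \quad \text{for } t > T,$$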

where T is the split time and θₐ(t) the ancestral population size. We use the ancestral population size inferred from the PSMC of the pseudodiploid genome as the ancestral population size, and use the inferred population size of each diploid genome (from PSMC) as the population size for each population after the split. For African populations, we assume constant population size after the split. For non-African populations, we assume that the population experienced a bottleneck event after the split and experienced population growth beginning 40 KYA. For our ABC framework, the parameters of interest are T (split time) and m (migration rate after the split). We assumed a uniform prior for the split time and the time when migration ends, a uniform prior on migration rate in log10 scale, and applied an ABC method based on sequential Monte Carlo (Toni et al. 2009) to the parameter estimation, since it can be easily run in parallel and is more efficient than an ABC rejection sampler. We drew a pool of 5000 candidate parameter values (called particles) from the prior distribution. Instead of setting the final stringent cut-off ε directly (a particle is accepted if the distance between summary statistics is lower than ε), we gradually lowered the tolerance, ε1 > ε2 > ε3 >> 0, so that the distributions gradually evolve toward the target posterior. The first pool was generated by sampling from the prior distribution. The particles that were accepted using the first threshold ε1 were sampled by their weights and perturbed to get new particles. As the tolerance threshold was lowered to the final cut-off, we obtained the target posterior distribution. In each iteration, we chose the threshold ε such that 20% of particles were accepted, achieving N = 1000 accepted particles. The perturbation kernels for all parameters are uniform, K = σU(−1,1), with σ equal to 20% of the difference between maximum and minimum values. We performed three iterations and summarized the mean, median, and 95% highest posterior density (HPD) C.I. for each parameter. For simulations, we used MaCS (Chen et al. 2009) to generate 100 sequences of 30 Mb each for two individuals representing African and European samples, with split times ranging from 60 to 150 KYA and subsequent migration until 30 KYA.
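The coalescence-time density defined above can be evaluated numerically by exponentiating the rate matrix. The sketch below is an illustration under simplifying assumptions rather than the authors' implementation: it uses a constant ancestral size `theta_a` instead of the time-varying θₐ(t) taken from PSMC, a hand-written matrix exponential (scaling and squaring with a Taylor series) to stay dependency-free, and hypothetical function names and parameter values.

```python
import math

S11, S12, S22, S1, S2 = range(5)  # state order used in the rate matrix


def rate_matrix(m1, m2, theta1, theta2):
    """Build Q for the IM model; diagonal entries make each row sum to zero."""
    q = [
        [0.0, 2 * m1, 0.0, 2 / theta1, 0.0],
        [m2, 0.0, m1, 0.0, 0.0],
        [0.0, 2 * m2, 0.0, 0.0, 2 / theta2],
        [0.0, 0.0, 0.0, 0.0, m1],
        [0.0, 0.0, 0.0, m2, 0.0],
    ]
    for i, row in enumerate(q):
        row[i] = -sum(row)
    return q


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(q, t, terms=20):
    """Return exp(Q t) using scaling and squaring with a Taylor series."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * abs(t)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, scaled)]
        result = [[r + s for r, s in zip(rr, tt)] for rr, tt in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def coalescence_density(t, split_time, m1, m2, theta1, theta2, theta_a):
    """Density f(t) for two genes sampled one from each population (state S12).

    Assumes a constant ancestral size theta_a, so the integral in the
    exponent reduces to (2 / theta_a) * (t - split_time).
    """
    q = rate_matrix(m1, m2, theta1, theta2)
    if t < split_time:
        p = expm(q, t)
        return p[S12][S11] * 2 / theta1 + p[S12][S22] * 2 / theta2
    p = expm(q, split_time)
    uncoalesced = p[S12][S11] + p[S12][S12] + p[S12][S22]
    rate = 2 / theta_a
    return uncoalesced * rate * math.exp(-rate * (t - split_time))


# Illustrative parameter values in scaled units (not taken from the article).
density_recent = coalescence_density(0.5, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0)
density_ancient = coalescence_density(1.5, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0)
```

To compare with PSMC output, this density would then be integrated over each of the 64 PSMC time intervals, as described above.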

Data availability

The reconstructed haplotypes are available for download from DataDryad under accession 10.5061/dryad.r7fs8. The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Results

Haplotype reconstruction

We used ReFHap to reconstruct haplotypes using fosmid pool sequencing of eight individuals from diverse populations [NA19240 (YRI), HG02799 (GWD), HG03108 (ESN), HG03428 (MSL), NA21302 (MKK), HGDP01029 (San), HGDP00456 (Mbuti), and NA20847 (GIH)] and obtained phased haplotypes for NA12878 (CEU) from a previous study (Duitama et al. 2012; Prüfer et al. 2014). In total across all pools, each genome was covered by an average of 6–17 clones, with a median sequence coverage ranging from 16.9× to 24.8× (Table S2). The effect of increased clone counts on phased block size is dramatic: when doubling the number of fosmid clones, the N50 of phased blocks tripled, achieving >1 Mbp for four of the African samples (Figure 1 and Table 1).

Figure 1.

Haplotype assembly results. The relationship of block size and the cumulative length of constructed haplotypes are plotted. Dashed lines indicate the N50 of phased blocks, the block length such that 50% of the total is represented by blocks of that size or greater.
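The N50 definition in the caption above can be computed as follows. This is a small sketch; the function name and the toy block lengths are hypothetical.

```python
def n50(block_lengths):
    """Return the block length L such that blocks of length >= L
    together cover at least 50% of the total phased length."""
    total = sum(block_lengths)
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy example (lengths in kbp): the total is 32, and the two largest
# blocks (8 + 8 = 16) already cover half of it.
example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```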

Table 1. Phasing statistics from fosmid pool sequencing.

| Population | Sample | Number of clones after filter | Number of blocks | N50 (kbp) | MEC value | Number of SNPs to be phased | Phased SNPs within blocks (%) | Number of blocks assigned parental allele | SNPs assigned parental allele (%) | Switch error count | Switch error corrected | Switch error rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YRI | NA19240 | 521,783 | 16,334 | 347 | 37,143 | 2,588,454 | 92.94 | 15,171 | 92.74 | 586 | 421 | 0.091 |
| GWD | HG02799 | 1,141,020 | 5,236 | 1416 | 82,387 | 2,780,269 | 99.00 | 3,041 | 98.38 | 327 | 146 | 0.163 |
| ESN | HG03108 | 1,058,027 | 5,416 | 1294 | 77,499 | 2,756,725 | 99.07 | 3,323 | 98.45 | 258 | 115 | 0.127 |
| MKK | NA21302 | 892,863 | 6,097 | 1416 | 175,935 | 2,736,727 | 98.60 | 3,751 | 97.96 | 336 | 265 | 0.098 |
| MSL | HG03428ᵃ | 1,424,234 | 4,390 | 1849 | 167,294 | 2,775,099 | 99.30 | 3,549 | 97.90 | | | |
| GIH | NA20847ᵃ | 571,419 | 16,838 | 385 | 44,870 | 1,680,704 | 93.37 | 13,319 | 90.98 | | | |
| San | HGDP01029ᵃ | 358,759 | 17,695 | 228 | 27,712 | 2,623,001 | 87.10 | 16,516 | 89.15 | | | |
| Mbuti | HGDP00456ᵃ | 381,075 | 18,465 | 242 | 28,629 | 2,517,569 | 78.70 | 17,385 | 81.72 | | | |

We resolved haplotypes using fosmid pool sequencing. MEC is the number of entries to correct when resolving haplotypes. Switch errors are counted as the number of switches required to obtain the same haplotype phase when comparing inferred haplotype phase with true haplotype phase. Switch error rate is switch error normalized by number of variants for comparison.

ᵃ Indicates samples that have had Prism applied to link adjacent blocks together.

Although SNPs within each block are phased, the relationships between blocks cannot be established directly due to the absence of linking fosmid clones. We used two approaches to overcome this limitation. For samples that are members of genotyped trios, we used SNP transmission patterns to link adjacent blocks together producing near-to-complete haplotypes, encompassing >97% of total heterozygous SNPs for HG02799, HG03108, NA21302, and 92.7% for NA19240. Comparison with SNPs phased using transmission patterns in genotyped trios identified potential switch errors due to insufficient clone support within our inferred haplotypes, which we corrected prior to subsequent analysis (Figure S1 and Table 1). We find 99.66% concordance between the fosmid-phased SNPs for NA19240 and heterozygous SNPs phased based on transmission from this sequenced trio (1000 Genomes Project Consortium 2010). We further compared our phased haplotypes for NA19240 to the sequence of 33 fosmid clones from the same individual (Kidd et al. 2008), observing differences at 5 of the 1013 heterozygous sites (0.5%) encompassed by the 33 clones (Table S4). In total, the aligned clones encompass 1,102,213 bp excluding alignment gaps, and have 51 single nucleotide differences in comparison with our data. If we assume that all of these differences are errors in our inferred sequences, this suggests that our haplotypes have an overall sequence error rate of <0.005% or a Phred (Ewing and Green 1998) quality score >Q40.
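As a worked check of the arithmetic above, using only the counts reported in this paragraph (51 differences over 1,102,213 aligned bp) and the standard Phred definition Q = −10 log₁₀ p, where p is the per-base error probability:

$$p = \frac{51}{1{,}102{,}213} \approx 4.63 \times 10^{-5} \approx 0.0046\%, \qquad Q = -10 \log_{10} p \approx 43.3,$$

which is consistent with the stated bounds of <0.005% and >Q40.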

For individuals HG03428 (MSL), NA20847 (GIH), HGDP01029 (San), and HGDP00456 (Mbuti), trio data are unavailable. For these samples, we assigned 80–98% of SNPs to a parental allele using Prism (Kuleshov et al. 2014), a statistical phasing algorithm designed to assemble short local blocks into longer global haplotype contigs. To evaluate how well Prism performs in this context of large haplotype-block assignment, we applied Prism to NA19240 and HG02799 and compared the assignment of local blocks with our assignment based on trio phase-determined SNPs. For NA19240, 6575 out of 13,591 blocks (47.6%) were assigned differently, affecting 45.88% of total heterozygous SNPs. For HG02799, 1214 out of 2810 blocks (43.2%) were assigned differently, affecting 41.82% of total heterozygous SNPs. This results in a mean interswitch distance of 2335 kbp and mean incorrectly phased haplotype length of 1967 kbp, with a 0.03% switch error rate.

Comparison with statistical phasing

We applied SHAPEIT (Delaneau et al. 2012) using either the phase one 1000 Genomes reference panel (1092 individuals, 14 populations) (Abecasis et al. 2012) or phase three reference panel (2504 individuals, 27 populations) (1000 Genomes Project Consortium 2015) separately to statistically phase each individual (Figure S2 and Table 2). For haplotypes phased using the phase one 1000 Genomes reference panel, the average switch error rate is 2.52%, half of which are flip errors, namely, single alleles appearing on the opposite haplotype. Haplotypes phased using the phase three reference panel have a higher concordance rate (72.79%), longer mean length of incorrectly phased haplotype (108 kbp) and mean interswitch distance (184.2 kbp), but similar levels of switch error rate (2.04%) and flip error rate (1.12%). This reflects the high accuracy of the phase three 1000 Genomes release haplotypes, a result of a multi-stage phasing process that used a haplotype scaffold of trio-genotyped SNPs. For NA21302, HGDP01029, and HGDP00456, whose populations are not included in the 1000 Genomes reference panel, the switch error rates and lengths of incorrectly phased haplotypes were similar using either the phase one or phase three reference panel. We observed slightly increased phasing accuracy for HGDP01029 when using only African populations as the reference panel, but phasing accuracy remained the same for NA12878 when using only European populations as the reference panel. For HG03428, NA20847, HGDP01029, and HGDP00456, the comparisons of haplotypes are restricted to within blocks since blocks were statistically linked into long global haplotypes.
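The switch and flip error measures discussed above can be illustrated with a short sketch. The counting convention here is an assumption, since the text does not spell it out: a switch is any change in phase agreement between consecutive heterozygous sites, and a flip is a single site whose phase disagrees with both neighbors (it therefore appears as two adjacent switches). The function name and the toy haplotypes are hypothetical.

```python
def switch_and_flip_counts(truth, inferred):
    """Compare one inferred haplotype with the true one at heterozygous sites.

    Both inputs are sequences of alleles (e.g. '0'/'1') carried by the same
    haplotype at each heterozygous site. Returns (switches, flips).
    """
    agree = [t == i for t, i in zip(truth, inferred)]
    # A switch occurs wherever agreement with the truth changes between sites.
    switch_positions = [k for k in range(1, len(agree)) if agree[k] != agree[k - 1]]
    # Two switches at adjacent positions bracket a single mis-phased site (a flip).
    flips = sum(1 for a, b in zip(switch_positions, switch_positions[1:]) if b - a == 1)
    return len(switch_positions), flips


truth = "0000000000"
inferred = "0001000111"   # one isolated flip at site 3, one long switch from site 7
switches, flips = switch_and_flip_counts(truth, inferred)
switch_rate = switches / (len(truth) - 1)   # switches per adjacent site pair
```

Whether a flip is reported as one flip error or as two switch errors varies between tools, so rates computed this way are only comparable under the same convention.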

Table 2. Comparison of haplotype phasing approaches.

| Individual | Haplotype concordance (%) | Switch error rate (%) | Flip error rate (%) | Mean interswitch distance (kbp) | Mean length of incorrectly phased haplotype (kbp) |
|---|---|---|---|---|---|
| Fosmid-phased haplotypes vs. SHAPEIT phased haplotypes using the phase one 1000 Genomes reference panel | | | | | |
| NA19240 | 54.60 | 1.33 | 0.60 | 84.6 | 69.6 |
| HG02799 | 52.46 | 1.84 | 0.79 | 52.2 | 43.3 |
| HG03108 | 53.62 | 1.05 | 0.47 | 94.1 | 78.8 |
| NA12878 | 53.18 | 0.87 | 0.32 | 170.0 | 144.0 |
| NA12878ᵇ | 52.82 | 0.85 | 0.31 | 172.0 | 145.0 |
| NA21302 | 52.00 | 2.32 | 1.02 | 43.6 | 37.6 |
| HG03428ᵃ | 70.01 | 1.88 | 0.95 | 42.6 | 28.5 |
| NA20847ᵃ | 79.30 | 1.83 | 0.97 | 46.5 | 29.5 |
| HGDP01029ᵃ | 69.83 | 6.87 | 3.50 | 12.5 | 7.3 |
| HGDP01029ᵃ,ᵇ | 72.49 | 5.39 | 2.81 | 15.3 | 9.0 |
| HGDP00456ᵃ | 78.09 | 4.68 | 2.70 | 16.1 | 8.8 |
| Average | 62.57 | 2.52 | 1.26 | 62.5 | 49.7 |
| Fosmid-phased haplotypes vs. SHAPEIT phased haplotypes using phase three 1000 Genomes reference panel | | | | | |
| NA19240 | 68.00 | 0.33 | 0.21 | 480.5 | 293.6 |
| HG02799 | 77.10 | 0.63 | 0.27 | 296.5 | 124.4 |
| HG03108 | 69.40 | 0.42 | 0.27 | 346.5 | 208.5 |
| NA12878 | 58.90 | 0.67 | 0.32 | 264.4 | 204.4 |
| NA12878ᵇ | 60.86 | 0.66 | 0.33 | 282.9 | 209.4 |
| NA21302 | 53.10 | 2.44 | 1.08 | 41.2 | 32.9 |
| HG03428ᵃ | 89.70 | 0.66 | 0.50 | 132.2 | 56.1 |
| NA20847ᵃ | 91.50 | 1.00 | 0.73 | 70.0 | 36.9 |
| HGDP01029ᵃ | 69.97 | 7.17 | 3.77 | 12.0 | 6.9 |
| HGDP01029ᵃ,ᵇ | 72.38 | 5.84 | 3.11 | 14.1 | 8.1 |
| HGDP00456ᵃ | 77.47 | 5.08 | 2.97 | 14.9 | 8.1 |
| Average | 72.79 | 2.04 | 1.12 | 184.2 | 108.0 |
| Fosmid-phased haplotypes: assign parental alleles using trio data vs. using Prism | | | | | |
| NA19240 | 54.12 | 0.05 | 0.00 | 1242.6 | 1115.0 |
| HG02799 | 58.18 | 0.02 | 0.00 | 3427.3 | 2821.9 |
| Average | 56.15 | 0.03 | 0.00 | 2335.0 | 1968.5 |

We calculated haplotype concordance, switch error rate, flip error rate, mean interswitch distance, and mean length of incorrectly phased haplotype between haplotypes resolved by fosmid pool sequencing and haplotypes statistically phased using either the phase one or phase three 1000 Genomes reference panels.

ᵃ Indicates that trio data were unavailable to link blocks together and phasing comparison analysis was limited to comparisons within ReFHap blocks.

ᵇ Indicates that the sample was phased using a population-specific reference panel (NA12878 used only European populations, and HGDP01029 used only African populations).

The impact of phasing error on inference using MSMC

We applied PSMC′, similar to PSMC but using the SMC′ framework, to perform demographic inference on nine individuals from nine populations. We assumed a human mutation rate of 1.25 × 10⁻⁸ per bp per generation and 30 years as generation time, although results can be rescaled easily for comparison with other estimates (Kong et al. 2012; Schiffels and Durbin 2014). Consistent with previous findings, the PSMC′ curves of the nine individuals revealed that all populations shared the same twofold increase of ancestral population size prior to 300 KYA, after which the inferred population size of the African populations began to differentiate from non-African populations, with all gradually experiencing an effective population size reduction (Figure 2); we note, however, that simulations indicate that such shifts in PSMC curves may overestimate the timing of population size changes (Li and Durbin 2011; Prüfer et al. 2014; Henn et al. 2015). Non-African populations experienced a 10-fold reduction of effective population size, followed by rapid population growth after 30 KYA. These observations are consistent with previous PSMC analyses of diploid genomes after adjusting for differences in assumed mutation rate (Li and Durbin 2011; Meyer et al. 2012; Schiffels and Durbin 2014).
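The rescaling mentioned above follows from the fact that these methods infer time in units proportional to the mutation rate, so a date in years scales with generation time divided by mutation rate. The sketch below is an illustration only; the function name is hypothetical, and the example converts the 60 KYA figure quoted in the Introduction between the two sets of assumptions discussed there.

```python
def rescale_time(t_years, mu_old, gen_old, mu_new, gen_new):
    """Convert a date in years between mutation-rate/generation-time assumptions.

    Inferred time is proportional to (generation time / mutation rate), so the
    conversion multiplies by (mu_old / mu_new) * (gen_new / gen_old).
    """
    return t_years * (mu_old / mu_new) * (gen_new / gen_old)


# 60 KYA under 2.5e-8 per bp per generation and 25-year generations,
# re-expressed under 1.25e-8 per bp per generation and 30-year generations.
split_kya = rescale_time(60.0, 2.5e-8, 25.0, 1.25e-8, 30.0)
```

The conversion factor for these two parameter sets is 2 × 1.2 = 2.4, so 60 KYA becomes 144 KYA.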

Figure 2.

PSMC′ inferred population history. Population sizes inferred from the autosomes of nine individuals from nine populations are shown.

The MSMC (Schiffels and Durbin 2014) model extends PSMC to multiple individuals. MSMC estimates the relative cross-population coalescence rate, which drops from one to zero as populations separate. We applied the MSMC model on four physically phased haplotypes, two haplotypes per individual from each population, and plotted the relative cross-coalescence rate as a function of time (Figure S3), using the time when RCCR drops to 50% as an estimate of the split time (Figure 3). We noticed that for some comparisons, the RCCR does not reach 1.0 even in the very ancient past, a pattern also found in the original MSMC article (Schiffels and Durbin 2014), and we caution against overinterpreting the RCCR at very ancient times. We also noticed that the more ancient the split event, the wider the inferred time interval. A similar pattern was also observed using simulated data (Figure S4). We also performed MSMC analysis on haplotypes inferred using SHAPEIT (▴, Figure 3). Statistically phased haplotypes show a more recent separation time and a narrower time span, particularly for comparisons involving San or Mbuti samples.
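The split-time summary described above, the time at which the RCCR reaches 50% (with 25% and 75% as the bounds shown in Figure 3), can be sketched as a simple interpolation. Linear interpolation between time points is an assumption made here for illustration; the function name and the toy curve are hypothetical.

```python
def crossing_time(times, rccr, level=0.5):
    """Return the most recent time at which the RCCR curve reaches `level`.

    `times` must be ordered from recent to ancient, with `rccr` giving the
    relative cross-coalescence rate at each time. Returns None if the curve
    never crosses `level` from below.
    """
    for i in range(1, len(times)):
        lo, hi = rccr[i - 1], rccr[i]
        if lo < level <= hi:
            frac = (level - lo) / (hi - lo)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


# Toy curve: times in KYA, RCCR rising from 0 (separated) to 1 (one population).
times = [20.0, 40.0, 80.0, 160.0]
rccr = [0.0, 0.2, 0.8, 1.0]
t25 = crossing_time(times, rccr, 0.25)
t50 = crossing_time(times, rccr, 0.50)
t75 = crossing_time(times, rccr, 0.75)
```

For this toy curve the 50% crossing falls at about 60 KYA, bracketed by the 25% and 75% crossings at roughly 43 and 77 KYA.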

Figure 3.

MSMC inferred split times. ● or ▴ represents the time when the cross-coalescence rate dropped to 50%, with lines representing the time when the cross-coalescence rate reached 25 and 75%. Split times were inferred using haplotypes phased by the fosmid pools approach (●) or SHAPEIT (▴).

To further explore the impact of phasing accuracy on MSMC inference we performed three additional analyses. First, we performed MSMC on San and CEU populations with haplotypes inferred using SHAPEIT with population-specific reference panels. Although we found a slight increase in phasing accuracy for the San individual using an African-only reference panel (Table 2), we did not observe substantial changes in the inferred split times (Figure S5). Second, we simulated sequences with split time varying from 60 to 120 KYA and having switch errors every 50, 200, and 1000 kb. As shown in Figure S6, switch errors result in more recent estimated split times, consistent with what we observe when comparing fosmid and statistically phased haplotypes. Third, we simulated 1 and 2% flip error rates on either singleton variants (such as ABAA) or on doubleton variants (such as ABAB). As shown in Figure S7, flip errors at singleton variants result in more ancient estimated split times; while flip errors on doubleton variants result in more recent estimated split times.
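One simple way to introduce regularly spaced switch errors into a pair of simulated haplotypes, as in the second analysis above, is to exchange the two haplotypes in alternating windows. This is a hypothetical sketch of the idea, not the procedure actually used for Figure S6; the function name, the window rule, and the toy data are assumptions.

```python
def inject_switch_errors(positions, hap1, hap2, spacing):
    """Swap the two haplotypes of one individual in alternating windows.

    `positions` are site coordinates in bp and `spacing` is the distance
    between consecutive switch errors (e.g. 50_000 for one every 50 kb).
    Sites in odd-numbered windows have their alleles exchanged.
    """
    out1, out2 = [], []
    for pos, a, b in zip(positions, hap1, hap2):
        swapped = (pos // spacing) % 2 == 1
        out1.append(b if swapped else a)
        out2.append(a if swapped else b)
    return out1, out2


# Toy example: four sites, one per 50-bp window, so the phase alternates.
positions = [10, 60, 110, 160]
new_hap1, new_hap2 = inject_switch_errors(positions, "AAAA", "BBBB", spacing=50)
```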

An ABC method to infer population split time using PSMC on pseudodiploid genomes

PSMC applied to pseudodiploid samples also provides information on population separation history. If population splits are total and sudden, no coalescent events between populations will occur after their separation. Thus, when applying PSMC on a pseudodiploid individual, where one chromosome comes from one population and the second chromosome comes from another population, the time when the PSMC estimate of Ne goes to infinity provides an estimate for the population split time (Li and Durbin 2011). However, the inferred PSMC curve usually increases in a stepwise manner, making it difficult to determine the exact time of the split event. Subsequent migration after the split is a further confounding factor (Pritchard 2011).

To better interpret pseudodiploid PSMC curves (Figure 4 and Figure S8), we implemented an ABC framework to estimate the population split time and migration rate given the TMRCA distribution inferred from the PSMC output. We compared the observed TMRCA distribution with the analytical distribution determined by an IM model (Wang and Hey 2010; Hobolth et al. 2011) with the indicated values for split time and postseparation migration, and applied an ABC method based on sequential Monte Carlo (Toni et al. 2009) to estimate the target posterior distribution of each parameter. We tested this approach using simulated data with a split time ranging from 60 to 150 KYA, with subsequent migration continuing until 30 KYA (Figure S9). For each split time, we considered three levels of symmetrical migration: 2 × 10⁻⁵, 10 × 10⁻⁵, and 40 × 10⁻⁵. For small levels of migration, the inferred split is quite accurate, with the mean value of the posterior distribution centered on the true value. However, for larger migration rates the inferred split time tends to be smaller than the true value. This bias is exacerbated with subsequent iterations of ABC sampling. The magnitude of the inferred migration rate is reasonably accurate, as observed in the log10 scale.
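The sequential Monte Carlo scheme described in Materials and Methods can be sketched for a single parameter as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the distance function is a stand-in (a noisy comparison to a target value of 100) rather than the chi-square distance between PSMC and IM-model TMRCA distributions, only one parameter is estimated, and all names are hypothetical. It does follow the settings given in the text: a pool of 5000 particles, a threshold chosen so that 20% are accepted, a uniform kernel with σ equal to 20% of the prior range, and three iterations.

```python
import itertools
import random


def abc_smc(distance, prior_lo, prior_hi, pool_size=5000, accept_frac=0.2,
            iterations=3, seed=1):
    """Return (particles, weights) approximating the posterior of one parameter."""
    rng = random.Random(seed)
    sigma = 0.2 * (prior_hi - prior_lo)      # kernel half-width: 20% of prior range
    n_keep = int(pool_size * accept_frac)    # 20% of 5000 = 1000 accepted particles
    particles, weights = [], []
    for _ in range(iterations):
        cum = list(itertools.accumulate(weights))
        pool = []
        while len(pool) < pool_size:
            if not particles:                # first iteration: sample the prior
                theta = rng.uniform(prior_lo, prior_hi)
            else:                            # later: resample by weight, then perturb
                parent = rng.choices(particles, cum_weights=cum)[0]
                theta = parent + sigma * rng.uniform(-1.0, 1.0)
                if not prior_lo <= theta <= prior_hi:
                    continue                 # zero prior density, draw again
            pool.append((distance(theta, rng), theta))
        pool.sort()                          # the n_keep-th distance acts as epsilon
        accepted = [theta for _, theta in pool[:n_keep]]
        if not particles:
            new_weights = [1.0] * n_keep
        else:
            # Weight = prior / sum_j w_j K(theta | theta_j); the uniform prior and
            # the constant kernel height cancel when the weights are normalized.
            new_weights = []
            for theta in accepted:
                reach = sum(w for p, w in zip(particles, weights)
                            if abs(theta - p) <= sigma * (1 + 1e-9))
                new_weights.append(1.0 / reach)
        total = sum(new_weights)
        particles, weights = accepted, [w / total for w in new_weights]
    return particles, weights


def toy_distance(theta, rng):
    """Stand-in distance: noisy simulation compared with an observed value of 100."""
    return abs(theta + rng.gauss(0.0, 5.0) - 100.0)


particles, weights = abc_smc(toy_distance, prior_lo=20.0, prior_hi=200.0)
posterior_mean = sum(p * w for p, w in zip(particles, weights))
```

Choosing the threshold as a quantile of the current pool, rather than fixing ε in advance, keeps the number of accepted particles constant across iterations, at the cost of the final tolerance depending on the data.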

Figure 4.

PSMC on pseudodiploid genomes. Population sizes inferred from combined autosomes are shown using the indicated values for generation time (g) and mutation rate (u), with one haplotype chosen from each population. Plotted curves are the average results obtained from the four possible global haplotype configurations, namely, hap1-hap1, hap1-hap2, hap2-hap1, and hap2-hap2. Haplotypes were constructed using the fosmid pool approach.

An additional complication in the application of this method to real data is the treatment of unphased sites, which generally impact <10% of SNPs in each comparison (Table S5). Using our simulations, we evaluated three methods for processing unphased SNPs: (1) randomly assigning the phase, (2) marking unphased sites along with all homozygous segments ending in an unphased heterozygous site as missing data (as recommended for MSMC) (Schiffels and Durbin 2014), and (3) marking only unphased SNPs as missing data. Even with 10% of unphased sites, the third method results in a PSMC curve similar to the original, while the first two methods give PSMC curves shifted to an earlier, increased effective population size, which may result in an earlier inferred split time (Figure S10). We therefore applied the third method to unphased SNPs in our analysis. We note that MSMC is less sensitive to the treatment of unphased data (Figure S11).

Inferred split times using physically phased genomes

We applied our ABC method to date the split times among African and European populations (Figure 5 and Figure S12). We find that the San population separated from the other samples the earliest, ∼120–140 KYA, with a subsequent migration rate of ∼10–15 × 10⁻⁵ until 30–40 KYA, an estimate that is more recent than that obtained from MSMC analysis (the median point of separation using MSMC of San from other African populations was ∼130 KYA, and 160 KYA from the CEU population). The separation between West African and CEU populations occurred 70–80 KYA, with migration at a rate of 8–40 × 10⁻⁵ until 30 KYA, while MKK, a population with a previously reported history of admixture, separated from the CEU population ∼50 KYA with a greater amount of gene flow continuing until the present, at a migration rate on the order of 10⁻³. The separation between the West African and MKK populations occurred ∼36–40 KYA, also with substantial gene flow until the present, at a migration rate on the order of 10⁻³. The separation between CEU and GIH occurred ∼36–38 KYA, with ongoing migration on the order of 10⁻³ until the present. We focused on these population pairs due to the challenges in distinguishing more recent events using only two haplotypes from each population. Comparisons with statistically phased data suggest that the impact of phasing error on our PSMC-ABC method is less dramatic than for MSMC analysis; however, when using haplotypes phased by SHAPEIT, a larger proportion of the genome coalesced ∼50 KYA than when fosmid-phased haplotypes are used (Figure S13 and Figure S14). This may result in larger amounts of inferred gene flow when using statistically phased data.

Figure 5.


Split times and migration rates inferred using PSMC and ABC. We implemented an ABC sequential Monte Carlo framework to estimate (top) split time and (bottom) migration rate given the inferred TMRCA distribution obtained from PSMC output. The posterior distribution from the final iteration (N = 1000 particles) and its mean value are shown.

Discussion

The utility of phase-resolved genome sequence data in the interpretation of variants affecting gene expression, transcription factor binding, human disease, and genome assembly has motivated the development of multiple approaches for determining phase. Here, we focus on samples phased using fosmid-based dilution haplotyping, and analyze a diverse set of eight phase-resolved human genomes. As expected, we find that phase results improve with increasing number of sequenced clones. We also demonstrate that statistical phasing performs well using existing reference panels, particularly when the panel captures population variation from the studied individuals. Nonetheless, the resulting phase errors are sufficient to affect inference of population history using the MSMC model. We find that the statistically phased haplotypes show a more recent inferred population split time. This effect is particularly noticeable for comparisons involving more deeply diverged population samples that are not well-phased using existing reference panels. Simulations show that flip errors at doubleton sites yield more recent estimated population split times, while flip errors at singleton sites yield more ancient estimates (Figure S7). Both types of flip errors break up long, true haplotypes into shorter segments. Flip errors at singleton sites (i.e., ABBB sites) may sometimes disrupt a true cross-population coalescence event, leading to the inference of within-population coalescence as the first event and a decreased RCCR at all time periods during the separation. In contrast, doubleton sites are often embedded within haplotypes shared across populations. Flip errors at doubleton sites (i.e., changing ABAB to BAAB) would break these shared haplotypes into shorter shared segments. This increases the RCCR at ancient times leading to an apparently more recent separation.
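The effect of a flip error on cross-population allele sharing can be made concrete with a short Python sketch. The encoding is an assumption made for illustration: a site pattern is a four-character string in which the first two characters are the haplotypes of an individual from one population and the last two are the haplotypes of an individual from the other population.

```python
def flip(pattern, individual):
    """Swap the two haplotypes of one individual (0 or 1) at a single site."""
    h = list(pattern)
    i = 2 * individual
    h[i], h[i + 1] = h[i + 1], h[i]
    return "".join(h)

def cross_shared(pattern):
    """Cross-population haplotype pairs (i, j) that carry the same allele."""
    return [(i, j) for i in (0, 1) for j in (2, 3) if pattern[i] == pattern[j]]

for site in ("ABAB", "ABBB"):  # a doubleton site and a singleton site
    flipped = flip(site, 0)
    print(site, cross_shared(site), "->", flipped, cross_shared(flipped))
```

At the doubleton site, the flip moves the sharing from haplotype pairs (0, 2) and (1, 3) to pairs (0, 3) and (1, 2), so a haplotype shared across populations at neighboring sites appears to end at the flipped position.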

Existing PSMC and MSMC approaches represent important methodological advances and have had a clear impact on the inference of population history using individual genome sequences. However, these approaches provide only a qualitative sense of population separation history. Here, we describe the fitting of a standard IM model to cross-population TMRCA distributions inferred from PSMC. This allows the acquisition of parameter estimates under standard models widely used in population genetic inference. However, as expected, multiple combinations of split time and migration rate are sometimes indistinguishable, highlighting the difficulty of inferring split times in the presence of migration (Pritchard 2011). This is partly due to the limitations of discretizing time and the poor resolution for recent history when only two haplotypes are available. Additionally, we find very high levels of migration for recent population splits (MKK and CEU, GIH and CEU, YRI and MKK), values which might be overestimated because of the high uncertainty for estimates of recent population history.
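For two sampled lineages, one from each population, the TMRCA distribution under a simple IM model has a closed form, sketched below in Python. This is a simplified illustration and not necessarily the parameterization used in this study: it assumes two populations of equal constant size `n_e`, symmetric migration at rate `m` per lineage per generation between `mig_end` and `split` (both in generations before the present), no migration more recently than `mig_end`, and a single ancestral population of size `n_anc`. The two lineages move between the states "different populations" and "same population" at rate 2m and coalesce at rate 1/(2 n_e) only when in the same population.

```python
import math

def im_survival(t, split, m, n_e, n_anc, mig_end=0.0):
    """P(TMRCA > t), with t in generations before the present."""
    if t <= mig_end:
        return 1.0  # isolated lineages in different populations cannot coalesce
    c = 1.0 / (2.0 * n_e)
    # Eigenvalues of the 2x2 generator restricted to the uncoalesced states
    trace = -(4.0 * m + c)
    root = math.sqrt(16.0 * m * m + c * c)
    lam1, lam2 = (trace + root) / 2.0, (trace - root) / 2.0

    def uncoalesced(x):
        """Probability of no coalescence after x generations with migration."""
        return (lam1 * math.exp(lam2 * x) - lam2 * math.exp(lam1 * x)) / (lam1 - lam2)

    if t < split:
        return uncoalesced(t - mig_end)
    # Older than the split: both lineages are in the ancestral population
    return uncoalesced(split - mig_end) * math.exp(-(t - split) / (2.0 * n_anc))

# Example with assumed values: 30-year generations, split at 75 KYA
# (2500 generations), migration ending 30 KYA (1000 generations).
for t_gen in (500, 1500, 2500, 5000):
    s = im_survival(t_gen, 2500, 2e-4, 10_000, 10_000, mig_end=1000)
    print(t_gen, round(s, 4))
```

Differencing this survival function across the time intervals used by PSMC gives a binned distribution that can be compared with the observed one.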

The split times inferred using our ABC method are generally concordant with the time when RCCR dropped to 50% as inferred using MSMC; however, our method provides a narrower range while quantifying the level of subsequent migration (Table 3). Using this approach with fully phased haplotypes from nine populations, we provide additional estimates of key population separations in human population history. We note, however, that our model is limited by several assumptions. In addition to the mutation rate, which is used to scale results into calendar years, the assumed recombination rate impacts our inference. Simulations indicate that a larger value for the recombination rate r yields more recent estimated split times, although the 25–75% RCCR intervals largely overlap (Figure S15). Additionally, our ABC model assumes migration occurs at a constant rate for an extended time period. Other conceptualizations that involve variable migration rates or pulses of admixture may be more realistic. However, overall, our estimates are broadly consistent with other contemporary methods (Table S6), and our estimates are consistent with the timing of the TMRCA of African and non-African mitochondrial DNA, ∼78.3 KYA, and the timing of the mitochondrial MRCA for all modern humans at 157 KYA (Fu et al. 2013).
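The 50% point and the 25–75% interval of an RCCR curve can be read off by interpolation, as in the Python sketch below. The curve values are hypothetical, and linear interpolation between adjacent time points is an assumption made for simplicity rather than a description of how MSMC output was summarized in this study.

```python
def crossing_time(times, rccr, level):
    """Most recent time at which the curve rises through `level`.

    `times` increases from the present into the past; linear interpolation
    between the two bracketing points. Returns None if never crossed.
    """
    for i in range(len(times) - 1):
        lo, hi = rccr[i], rccr[i + 1]
        if lo < level <= hi:
            frac = (level - lo) / (hi - lo)
            return times[i] + frac * (times[i + 1] - times[i])
    return None

# Hypothetical RCCR curve: near 0 at recent times, approaching 1 in the past
times_kya = [20, 40, 60, 80, 100, 120]
rccr = [0.05, 0.15, 0.35, 0.65, 0.85, 0.95]

midpoint = crossing_time(times_kya, rccr, 0.50)
interval = (crossing_time(times_kya, rccr, 0.25), crossing_time(times_kya, rccr, 0.75))
print("50% point (KYA):", midpoint)
print("25-75% interval (KYA):", interval)
```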

Table 3. Posterior estimates of split time and migration rate using the IM model.

Pair        Split time (KYA)                  Migration rate (log10)            Migration end (KYA)
            Mean    Median  Lower   Upper     Mean    Median  Lower   Upper     Mean    Median  Lower   Upper
YRI-CEU     73.3    72.5    70.2    81.6      −3.71   −3.70   −4.12   −3.41     27.2    30.1    6.2     38.7
MKK-CEU     53.9    53.9    52.9    55.1      −2.34   −2.22   −2.66   −2.04
GIH-CEU     37.2    37.2    36.2    38.2      −2.88   −2.87   −3.17   −2.60
San-CEU     129.5   128.8   121.3   140.9     −3.95   −3.96   −4.07   −3.83     37.2    37.1    33.4    41.5
Mbuti-CEU   117.6   116.9   103.1   139.1     −3.73   −3.73   −3.82   −3.63     34.6    34.4    30.6    39.1
YRI-MKK     38.2    38.1    36.2    40.6      −2.15   −2.16   −2.30   −2.00

We report the mean, median, and 95% credible interval of each posterior distribution; Lower and Upper are the lower and upper bounds of the 95% HPD interval. Migration rates are on a log10 scale. For recent separations, migration was set to continue to the present, so no migration end time is reported.
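Because the migration rates in Table 3 are reported on a log10 scale, it can help to convert them back to a linear scale when comparing them with rates quoted in the text. The short Python sketch below does this for the posterior mean and 95% HPD bounds copied from the table; the dictionary name and layout are illustrative only.

```python
# (mean, lower 95% HPD, upper 95% HPD) of log10 migration rate, from Table 3
table3_log10_m = {
    "YRI-CEU": (-3.71, -4.12, -3.41),
    "MKK-CEU": (-2.34, -2.66, -2.04),
    "GIH-CEU": (-2.88, -3.17, -2.60),
    "San-CEU": (-3.95, -4.07, -3.83),
    "Mbuti-CEU": (-3.73, -3.82, -3.63),
    "YRI-MKK": (-2.15, -2.30, -2.00),
}

# Back-transform every entry: rate = 10 ** log10(rate)
linear_m = {
    pair: tuple(10.0 ** x for x in values)
    for pair, values in table3_log10_m.items()
}

for pair, (mean, lower, upper) in linear_m.items():
    print(f"{pair:10s} mean {mean:.2e}  95% HPD {lower:.2e} to {upper:.2e}")
```

Note that the back-transformed posterior mean of the log10 values is not the same quantity as the posterior mean of the linear rate; it is shown here only to aid reading of the table.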

Similar to previous results (Schiffels and Durbin 2014), the separation history between CEU and MKK populations was different from that observed between CEU and LWK (Luhya, another east African population). Two pulses of admixture have been estimated in the ancestors of the MKK, occurring 8 and 88 generations ago (Pagani et al. 2012; Pickrell et al. 2014). Since the impact of long segments of shared ancestry due to recent admixture is unclear, we masked out regions of recent European ancestry in our MKK sample using RFMix (Maples et al. 2013) (Figure S16) and found that the MSMC curves are not altered when recent segments of European ancestry are masked (Figure S17). Although such ancestral masking becomes increasingly imperfect for older admixture events, this suggests that long segments of shared ancestry due to recent admixture do not explain the later divergence of the MKK population compared to other African populations and supports a more complex ancient history for the MKK.

When constructing global haplotypes for individuals without trio phasing data available, we applied Prism to statistically link blocks together. Prism was designed to link much shorter phased segments into longer blocks. When applied to our phased haplotype blocks, we found that ∼40% of blocks were assigned incorrectly, resulting in switch errors every 2 Mbp. However, we found that the MSMC curves using global haplotypes constructed by Prism were very similar to those constructed with trio phasing data (Figure S18), indicating that long-range switch errors have little effect on such inference. This is reassuring since we are using Prism to construct global haplotypes for four individuals; however, the inferred split times involving the San and Mbuti populations are still likely underestimated.

Our results indicate that the separation of the studied human populations was a gradual event, with substantial genetic exchange continuing after an initial split, a finding consistent with hypotheses of long-standing ancient population structure in Africa (reviewed in Harding and McVean 2004; Henn et al. 2012). We provide a comparison of PSMC- and MSMC-based methods with other contemporary methods for inferring population separation history, and our results emphasize the importance of accurately phased haplotypes for MSMC analyses, especially for more ancient splits.

Acknowledgments

We thank Jacob Kitzman, Peedikayil Thomas, Jeffrey W. Innis, and the University of Michigan DNA Sequencing Core Facility for guidance on fosmid pool construction and sequencing.

Footnotes

Available freely online through the author-supported open access option.

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.192963/-/DC1.

Communicating editor: J. D. Wall

Literature Cited

  1. 1000 Genomes Project Consortium, 2010. A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
  2. 1000 Genomes Project Consortium, 2015. A global reference for human genetic variation. Nature 526: 68–74.
  3. Abecasis G. R., Auton A., Brooks L. D., DePristo M. A., Durbin R. M., et al., 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
  4. Chen G. K., Marjoram P., Wall J. D., 2009. Fast and flexible simulation of DNA sequence data. Genome Res. 19: 136–142.
  5. Delaneau O., Marchini J., Zagury J.-F., 2012. A linear complexity phasing method for thousands of genomes. Nat. Methods 9: 179–181.
  6. Duitama J., McEwen G. K., Huebsch T., Palczewski S., Schulz S., et al., 2012. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Res. 40: 2041–2053.
  7. Ewing B., Green P., 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186–194.
  8. Fu Q., Mittnik A., Johnson P. L., Bos K., Lari M., et al., 2013. A revised timescale for human evolution based on ancient mitochondrial genomes. Curr. Biol. 23: 553–559.
  9. Harding R. M., McVean G., 2004. A structured ancestral population for the evolution of modern humans. Curr. Opin. Genet. Dev. 14: 667–674.
  10. Henn B. M., Cavalli-Sforza L. L., Feldman M. W., 2012. The great human expansion. Proc. Natl. Acad. Sci. USA 109: 17758–17764.
  11. Henn B. M., Botigue L. R., Peischl S., Dupanloup I., Lipatov M., et al., 2016. Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc. Natl. Acad. Sci. USA 113: E440–E449.
  12. Hobolth A., Andersen L. N., Mailund T., 2011. On computing the coalescence time density in an isolation-with-migration model with few samples. Genetics 187: 1241–1243.
  13. Kidd J. M., Cheng Z., Graves T., Fulton B., Wilson R. K., et al., 2008. Haplotype sorting using human fosmid clone end-sequence pairs. Genome Res. 18: 2016–2023.
  14. Kitzman J. O., Mackenzie A. P., Adey A., Hiatt J. B., Patwardhan R. P., et al., 2011. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat. Biotechnol. 29: 59–63.
  15. Kong A., Frigge M. L., Masson G., Besenbacher S., Sulem P., et al., 2012. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488: 471–475.
  16. Kuleshov V., Xie D., Chen R., Pushkarev D., Ma Z., et al., 2014. Whole-genome haplotyping using long reads and statistical methods. Nat. Biotechnol. 32: 261–266.
  17. Li H., Durbin R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760.
  18. Li H., Durbin R., 2011. Inference of human population history from individual whole-genome sequences. Nature 475: 493–496.
  19. Li N., Stephens M., 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165: 2213–2233.
  20. Maples B. K., Gravel S., Kenny E. E., Bustamante C. D., 2013. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93: 278–288.
  21. Marjoram P., Wall J. D., 2006. Fast “coalescent” simulation. BMC Genet. 7: 16.
  22. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., et al., 2010. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20: 1297–1303.
  23. McVean G. A., Cardin N. J., 2005. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360: 1387–1393.
  24. Meyer M., Kircher M., Gansauge M.-T., Li H., Racimo F., et al., 2012. A high-coverage genome sequence from an archaic Denisovan individual. Science 338: 222–226.
  25. Pagani L., Kivisild T., Tarekegn A., Ekong R., Plaster C., et al., 2012. Ethiopian genetic diversity reveals linguistic stratification and complex influences on the Ethiopian gene pool. Am. J. Hum. Genet. 91: 83–96.
  26. Pagani L., Schiffels S., Gurdasani D., Danecek P., Scally A., et al., 2015. Tracing the route of modern humans out of Africa by using 225 human genome sequences from Ethiopians and Egyptians. Am. J. Hum. Genet. 96: 986–991.
  27. Pickrell J. K., Patterson N., Loh P.-R., Lipson M., Berger B., et al., 2014. Ancient west Eurasian ancestry in southern and eastern Africa. Proc. Natl. Acad. Sci. USA 111: 2632–2637.
  28. Prado-Martinez J., Sudmant P. H., Kidd J. M., Li H., Kelley J. L., et al., 2013. Great ape genetic diversity and population history. Nature 499: 471–475.
  29. Pritchard J. K., 2011. Whole-genome sequencing data offer insights into human demography. Nat. Genet. 43: 923–925.
  30. Prüfer K., Racimo F., Patterson N., Jay F., Sankararaman S., et al., 2014. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505: 43–49.
  31. Schiffels S., Durbin R., 2014. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46: 919–925.
  32. Schraiber J. G., Akey J. M., 2015. Methods and models for unravelling human evolutionary history. Nat. Rev. Genet. 16: 727–740.
  33. Toni T., Welch D., Strelkowa N., Ipsen A., Stumpf M. P., 2009. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface 6: 187–202.
  34. Veeramah K. R., Hammer M. F., 2014. The impact of whole-genome sequencing on the reconstruction of human population history. Nat. Rev. Genet. 15: 149–162.
  35. Wang Y., Hey J., 2010. Estimating divergence parameters with small samples from a large number of loci. Genetics 184: 363–379.

Data Availability Statement

The reconstructed haplotypes are available for download from Dryad under accession 10.5061/dryad.r7fs8. The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.


Articles from Genetics are provided here courtesy of Oxford University Press
