Abstract
Knowledge of genome-wide genealogies for thousands of individuals would simplify most evolutionary analyses for humans and other species, but has remained computationally infeasible. We developed a method, Relate, scaling to > 10,000 sequences while simultaneously estimating branch lengths, mutational ages, and variable historical population sizes, as well as allowing for data errors. Application to 1000 Genomes Project haplotypes produces joint genealogical histories for 26 human populations. Highly diverged lineages are present in all groups, but most frequent in Africa. Outside Africa, these mainly reflect ancient introgression from groups related to Neanderthals and Denisovans, while African signals instead reflect unknown events, unique to that continent. Our approach allows more powerful inferences of natural selection than previously possible. We identify multiple novel regions under strong positive selection, and multi-allelic traits including hair color, body mass index (BMI), and blood pressure, showing strong evidence of directional selection, varying among human groups.
Large-scale genetic variation datasets are now available for many species, including tens of thousands of humans. In principle, all information about a sample’s genetic history is captured by their underlying genealogical history, which records the historical coalescence, recombination, and mutation events producing the observed variation patterns. In practice, several key existing approaches (e.g., Refs. [1,2]) leverage an underlying coalescent model, which provides a flexible modelling framework and is the limiting behavior of a variety of finite-population models3,4. However, coalescent-based inference is complicated by the structure of the model, and the extremely large space of probabilistically plausible sample histories conditional on observed data5. Other approaches6–11 use more heuristic coalescent approximations, sometimes reducing accuracy: regardless, published existing methods scale to tens or a few hundred samples at most.
These issues have restricted the use of direct genealogy-based approaches to infer recombination, mutational ages, and natural selection to smaller datasets1,2, while for larger datasets diverse approaches based on data summaries12–14 or downsampling15,16 have predominated. In humans, such tools have detected genetic structure and admixture in good agreement with independent evidence17,18, changes in population size15,19–21 and introgression with archaic groups, including Neanderthals22.
We developed a scalable method, Relate, to estimate genome-wide genealogies (Figure 1; Methods; https://myersgroup.github.io/relate/ for implementation). Relate proceeds in two steps. First, it identifies a genealogical framework at each site in the genome, describing ancestral relationships among sequences but not their coalescence times. Second, these times are estimated after mutations are mapped to branches of these trees, allowing for variable population sizes simultaneously inferred from the data, to produce complete genealogies. These are then used directly for downstream inferences. Our approach approximates the coalescent model, but performs as well as or better than existing approaches in our simulations, whilst being thousands of times faster.
We demonstrate the utility of a genealogy-based analysis by applying Relate to 4,956 haplotypes of the 1000 Genomes Project (1000GP) dataset23,24. We estimate population sizes of all 26 populations in the dataset and their split times using cross-coalescence rates. In agreement with previous work, we identify an increase in the mutation rate of TCC to TTC around 10,000 to 20,000 years ago25. The estimated genealogies contain signals of introgression between Neanderthals and modern humans in Eurasia, and between modern East and South Asians and Denisovans, alongside other signals specific to African groups. Finally, we suggest a test statistic that identifies loci under positive selection by tracking mutation frequencies through time. We demonstrate that for plausible scenarios of selection on complex traits, involving selection dispersed over many loci, this test improves power over the integrated Haplotype Score (iHS)26, and identify previously unreported genomic regions under strong positive selection. We find a remarkable enrichment of SNPs identified in genome-wide association studies (GWAS) among targets of selection, and evidence of widespread directional polygenic adaptation.
Results
Overview of the Relate approach
At each genomic position, Relate first identifies a non-symmetric distance matrix whose rows estimate the relative order of coalescence events between a particular sequence and the remaining observed sequences. To do this, Relate uses the posterior probabilities output by a hidden Markov model (HMM) similar to that proposed by Li and Stephens27, but leveraging knowledge of ancestral and derived status at each single nucleotide polymorphism (SNP) to improve speed and accuracy. This distance matrix is used to construct a rooted binary tree using a bespoke algorithm. Mathematical arguments demonstrate, encouragingly, that if the “infinite-sites” model is satisfied so that each observed mutation occurs exactly once, our approach is guaranteed to generate genealogies exactly producing the observed data, in the limiting cases where either there is no recombination, where the recombination rate is very large, or where all recombination occurs in intense widely spaced hotspots (Supplementary Note). Because the distance matrix is position specific, these binary trees adapt to changes in local genetic ancestry due to recombination. In practice, we save computational time by only rebuilding trees at a subset of sites along the genome (Methods, Supplementary Figures 1 and 2).
To estimate branch lengths while allowing for changing population sizes, we first map mutations onto each genealogical tree and then apply an iterative Markov Chain Monte Carlo (MCMC) algorithm to estimate times under a coalescent prior. We simultaneously estimate a stepwise varying effective population size through time, using the genome-wide collection of estimated genealogies (Methods). Our final time estimates account for changes in population size, assuming an unstructured population. We can also explore population stratification within a sample, by leveraging estimated coalescence rates of any pair of sampled sequences. By averaging pairwise coalescence rates within and across groups, we obtain effective population size estimates for sub-populations and cross-coalescence rates between populations. As we show in the next section, this can provide accurate estimates despite the fact that our tree-builder does not account for such population stratification.
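The averaging of pairwise coalescence rates within and across groups can be illustrated with a minimal sketch. The function name `average_rate`, the input format (a dictionary keyed by unordered haplotype pairs, for a single time interval), and the toy numbers are assumptions made for illustration only; they are not taken from the Relate implementation.

```python
from itertools import combinations, product
from statistics import mean

def average_rate(pair_rate, group_a, group_b=None):
    """Average pairwise coalescence rate within one group or across two groups.

    pair_rate maps frozenset({i, j}) to the estimated coalescence rate of
    haplotypes i and j in a single time interval (hypothetical input format).
    """
    if group_b is None:
        pairs = combinations(group_a, 2)   # all pairs inside one group
    else:
        pairs = product(group_a, group_b)  # one haplotype from each group
    return mean(pair_rate[frozenset(p)] for p in pairs)

# Toy input: haplotypes 0 and 1 from population A, 2 and 3 from population B.
pair_rate = {frozenset(p): 0.001 for p in product([0, 1], [2, 3])}
pair_rate[frozenset({0, 1})] = 0.02
pair_rate[frozenset({2, 3})] = 0.01

within_a = average_rate(pair_rate, [0, 1])
within_b = average_rate(pair_rate, [2, 3])
cross_ab = average_rate(pair_rate, [0, 1], [2, 3])
ne_a = 1.0 / within_a  # effective size taken as the inverse coalescence rate
```

In this toy input the cross-population rate is far lower than either within-population rate, which is the pattern expected in a time interval after two populations have separated.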
Simulations
We evaluated Relate for its speed, accuracy of inferred trees, robustness, and ability to infer evolutionary parameters, by simulating data under the coalescent with recombination using msprime28. We compared performance to ARGweaver2, which samples from a time-discretized approximation to the coalescent with recombination, and which we therefore expect to perform well on these simulations. Relate was >4 orders of magnitude faster than ARGweaver, for cases we were able to apply the latter, and also much faster than RENT+11 (Figure 2a,b). Our approach scales linearly in sequence length and quadratically in sample size Ni, enabling genealogical inference for e.g. 10,000 human samples genome-wide using a compute cluster.
To evaluate accuracy, we compare, at each locus and for each of the pairs of haplotypes, the estimated time to their most-recent common ancestor (TMRCA) to the truth (Figure 2c,d), observing improved performance relative to both ARGweaver and RENT+. Relate also showed improved robustness to errors in the data, identified misclassified ancestral alleles, and estimated times well in the context of varying population size (Supplementary Figure 3). Other accuracy measures yielded similar improvements (Supplementary Note). Relate identified repeat mutations and variable mutation rates, and is robust to computational rephasing of haplotypes (Supplementary Figure 4). We next compare Relate’s inferred population sizes to those from applying two leading specialist approaches, MSMC20 and SMC++21. For multiple previously tested20,21 scenarios including oscillating population sizes and bottlenecks similar to those observed in out-of-Africa events of modern humans, Relate obtains more accurate estimates, particularly in the recent past (Figure 2e, Supplementary Figure 3). While Relate assumes a single population when estimating branch lengths, when applied to a combined sample from two diverged populations, it still performs well in recovering their distinct population histories and estimating their split time(s) (Figure 2f,g).
Genome-wide human genealogies
We applied Relate to 2,478 1000GP individuals with diverse genetic ancestry and approximately 81 million SNPs (see Methods for data pre-processing). Computation time, using up to 300 processors, was ~4 days (Supplementary Table 1). 86% of all SNPs (>96% of SNPs at >0.2% derived-allele frequency (DAF)) map uniquely to trees, falling to 76% for CpG dinucleotides, known to possess strongly elevated mutation rates (Supplementary Figure 5). The number of different trees in a genomic subregion strongly correlates with recombination rate (r^2 = 0.63) and the average tree has 3,883 SNPs mapped to it, reflecting block-like structures of human haplotypes between recombination hotspots (Supplementary Figure 5).
We estimated within and across-group coalescence rates for pairs of groups, by first extracting the genealogy for members of a particular subsample of interest embedded within the full genealogy, and then re-estimating coalescence rates for this genealogy. We observe a clear out-of-Africa bottleneck for Eurasian populations (CHB: Chinese in Beijing and GBR: British in England and Scotland shown), and gradual separation from African populations (YRI: Yoruba in Ibadan, Nigeria shown) already visible 200,000 years before present (YBP) and lasting to around 60,000 YBP (Figure 3a,b). This is consistent with recent studies15,29 and might reflect several out-of-Africa dispersal events. Asian (CHB shown) and European (GBR shown) populations separate more recently, with a clear and visibly more sudden separation around 30,000 YBP (Figure 3c). We also detect, and date, very recent separations <10,000 YBP, such as between CHB-JPT (JPT: Japanese in Tokyo) or FIN-GBR (FIN: Finnish in Finland) (Figure 3d,e). Finnish samples exhibit a second bottleneck, around 3,000-9,000 YBP following separation from GBR30,31, with other population-specific events in e.g. Peruvians and Gujarati individuals (Supplementary Figure 6). The Finnish bottleneck is thought to have caused enrichment of certain disease-causing gene variants, commonly classified as Finnish heritage diseases30,31. A strong bottleneck post-dating separation from Eurasian groups is absent in African populations (LWK: Luhya in Webuye, Kenya and YRI shown, Figure 3f). All populations show a remarkable increase in population size, often to >1,000,000, in the recent past (Supplementary Figure 6); however, we note possible inaccuracies due to incomplete power to detect rare variants24 leading to underestimation, and computational phasing leading to overestimation (Supplementary Figure 4).
Exploring the relative mutation rate of particular mutation classes through time confirms, as reported previously25, a strong elevation in the rate of trinucleotide changes including TCC->TTC in West Eurasian groups, which we date to 5,000-30,000 YBP, but infer to be weak or absent in the present day (Figure 4a). Other mutation types show more subtle temporal biases and signatures consistent with GC-biased gene conversion32 (Supplementary Figure 6). Overall, these results support accuracy of our inferred historical relationships, including the timing of a range of different historical events, identified within a single analysis framework.
Neanderthal/Denisovan and unexplained introgression events
Introgression from distantly related groups in the past is expected to introduce lineages which forward in time can randomly spread in the tree, and backward in time remain distinct from other lineages, resulting in an excess of long branches associated with particular times. We identified such deep branches (>1 million years (MY) in age and with varying lower end), across human groups (Figure 4b,c). It is established that all non-African human groups possess similar levels of Neanderthal introgression, and specific Asian and Australasian groups possess admixture from a group related to Denisovans22,33. We therefore label deep branches possessing at least two derived mutations by whether at least one mutation is shared with the sequenced Denisova33 or Neanderthal22,34 genomes (Figure 4b shows one example of likely introgression from Neanderthals into European GBR, but not African YRI individuals). After classifying deep branches based on their lower-end times, for branches originating within the last 10,000 YBP, 85-90% are shared with Neanderthal or Denisovan for most Eurasian groups (Figure 4c, Methods). Any lineages from recent introgression events will show a lower-end age younger than the time of introgression, and upper-end older than the split time of the introgressing group, so we expect branches with a younger lower-end to be most enriched with lineages that came from distantly diverged introgressing groups. This suggests that aside from groups closely related to Neanderthals and Denisovans, no strongly diverged hominid has left a major, recent impact in non-African populations studied here. An exception is IBS, which has more long branches shared with African populations (Supplementary Figure 6). In East and South Asian groups, the data suggest a very recent arrival of Denisovan DNA (mainly <15,000 YBP). In non-Africans, Neanderthal sharing remains high for branches with lower-end age younger than ~30,000 YBP.
These dates are only lower bounds on the introgression time, and an accurate arrival date of Neanderthal DNA would require estimating a joint genealogy, which requires further work. Nevertheless, they are consistent with previous estimates based on linkage disequilibrium (LD)35, and with direct evidence of hybrids35,36 around 40,000 YBP. Moreover, the sharing of quite deep haplotypes with Neanderthals steadily increases for branches with lower-end age of ~100,000 YBP towards the present, which is suggestive of introgression beginning from this time in non-African individuals, although it is important to note that our date estimates for individual events might be over- or under-estimates in some cases.
In contrast to non-African groups, sharing with Neanderthal/Denisovans is lower (<20%, Figure 4c) in African populations, and declines towards the present, suggesting minimal recent interactions22,33. This is despite the fact that African populations possess far more long branches (on average, on deep branches with lower coalescence age <30,000 YBP; 42,434 vs. 7,012 mutations occur in African vs. non-African populations). Of mutations on long branches found in Africa, 98% are Africa-specific, indicating separate events occurring in non-African and African populations (Supplementary Figure 6). Comparing YRI, GBR, BEB (Bengali in Bangladesh), and CHB to expectations under panmixia, we observe a strong excess of mutations on deep branches with lower coalescence age <40,000 YBP in all cases, which is almost entirely explained by Neanderthals/Denisovans in the non-African populations, but not in YRI (Figure 4d, Methods). In panmictic simulations with matched population size histories, we observe no such excess (Supplementary Figure 6). This evidences ancient but uncharacterised population structure within Africa, as suggested elsewhere37,38. Figure 4b shows one example consistent with an introgression event in YRI, not involving a close relative of Neanderthals.
Powerful tree-based approaches to study natural selection
By directly modelling how mutations arise and spread, genealogical trees offer the potential to powerfully investigate different modes of natural selection. For example, a recent method, SDS, indirectly tests for differences in tree tip branch lengths between carriers and non-carriers using the density of singletons around a focal SNP39 and a tree-based analogue (trSDS) tests this directly40. Here, we propose a class of approaches (Relate Selection Tests) based on estimating the speed of spread of a particular lineage (carrying a particular mutation), relative to other “competing” lineages, over some chosen time range. To test for selection over the entire lifetime of a mutation, we condition on the number of lineages present when it first arises, and use as a test statistic the number of present-day carriers. Assuming no population stratification, the null distribution of this statistic can be calculated analytically and is robust in principle to population size changes (Methods).
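To make the idea of an analytically tractable null distribution concrete, the sketch below uses a standard neutral result: if a mutation arises on one of k exchangeable lineages, and these lineages grow to N present-day lineages, then the number of present-day descendants f of the mutated lineage has probability C(N-f-1, k-2) / C(N-1, k-1). That this is the appropriate null here is an assumption of the sketch (the exact formulation used by Relate is given in the Methods), and the function names are hypothetical.

```python
from math import comb

def carriers_pmf(f, n, k):
    """P(f carriers among n present-day lineages), given that the mutation
    arose on one of k lineages and evolves neutrally (requires k >= 2)."""
    if k < 2 or not 1 <= f <= n - k + 1:
        return 0.0
    return comb(n - f - 1, k - 2) / comb(n - 1, k - 1)

def selection_p_value(f_obs, n, k):
    """One-sided p-value: probability of at least f_obs carriers under the null."""
    return sum(carriers_pmf(f, n, k) for f in range(f_obs, n - k + 2))

# With k = 2 ancestral lineages the carrier count is uniform on 1..n-1.
pmf_k2 = [carriers_pmf(f, 10, 2) for f in range(1, 10)]
p_extreme = selection_p_value(9, 10, 2)
```

The distribution depends only on lineage counts, not on the times between coalescence events, which gives some intuition for why such a statistic can be robust to population size changes.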
Simulated data (Figure 5a) show a close match in null no-selection scenarios of our p-values p_R to the expected uniform distribution. Across a range of selective advantages and SNP frequencies (Figure 5b, Supplementary Figure 6), our approach increases power relative to (tr)SDS, as well as iHS for weaker selection in particular. trSDS is more powerful than SDS, while applying the Relate Selection Test to true genealogical trees yields a test that is uniformly more powerful than other approaches (Figure 5b), indicating the strength of tree-based approaches. In practice, there is some decrease in power from the need to infer trees via Relate. The power increases for weak selection might be particularly beneficial for testing complex, polygenic traits, where small effect sizes at individual loci are expected to yield small selection coefficients41.
Calculating p_R for SNPs across twenty 1000GP populations (Methods) identified 35 distinct (24 novel) stringent signals genome-wide (p_R < 5 × 10^-8 in each of three or more groups) (Supplementary Table 3). These include the LCT region associated with lactose tolerance in Europeans, and a mutation in the EDAR gene in East Asian populations42,43, with both likely causal variants strongly associated with our most significant mutation (r^2 ≥ 0.8). We also observe a previously-detected strong signal of positive selection in the MHC region in GBR44 (Figure 5c). Among new regions, we identify selection evidence at the EDARADD gene - which interacts with the EDAR gene45 in the formation of hair follicles, sweat glands, and teeth43 - in all South Asian populations and Finns, with p_R < 10^-6 in all other European populations. In 16 of 35 regions, we identify GWAS catalogue hits (OR=6.44; p=0.01), non-synonymous mutations (OR=2.49; p=0.16), or expression quantitative trait loci (eQTLs; OR=1.74; p=0.1), in LD with the mutation with strongest selection evidence (r^2 ≥ 0.8, Methods), suggesting functional effects, reaching statistical significance for the case of GWAS hits despite the small number of cases tested. Notably, 18 of the 35 regions are found only for African populations.
SNPs in functional parts of the genome are significantly enriched among targets of positive selection (Figure 5d, Methods), with strongest enrichment for GWAS hits, across all considered populations. This encouragingly supports a link between evidence of selection and SNPs with detectable influences on phenotypes at the organism level. Multiple previous studies46–49 have attempted to test polygenic traits for evidence of directional selection, but confounding due to population stratification50,51 is potentially problematic in practice. To leverage potential power gains, we tested whether derived mutations increasing (or decreasing) a trait show increased selection evidence relative to randomly sampled control mutations of the same frequency (one-sided Wilcoxon test; Methods). For each trait, we thinned GWAS hits to account for LD and examined only SNPs showing “genome-wide significant” associations (p < 5 × 10^-8), because confounding due to population stratification is thought to occur through relatively small - but systematic - biases in effect size estimates50,51, but is not expected, in general, to produce genome-wide significant false-positives. At each SNP, we use only the association direction, rather than its strength, to offer additional robustness to potential confounding.
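The comparison of trait-associated SNPs against frequency-matched controls can be sketched as a one-sided rank-sum (Wilcoxon/Mann-Whitney) test. The version below is a simplified illustration using the normal approximation without tie or continuity corrections; the function name and the toy scores (which stand in for some per-SNP measure of selection evidence, such as a negative log p-value) are assumptions, and a library routine such as `scipy.stats.mannwhitneyu` would normally be preferable in practice.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_rank_sum(trait, control):
    """P-value for the alternative that trait scores tend to exceed controls.

    Uses the Mann-Whitney U statistic with a normal approximation
    (no tie or continuity correction; a simplification for illustration).
    """
    n, m = len(trait), len(control)
    # U counts (trait, control) pairs where the trait score is larger;
    # ties contribute one half.
    u = sum(1.0 if t > c else 0.5 if t == c else 0.0
            for t in trait for c in control)
    mean_u = n * m / 2
    sd_u = sqrt(n * m * (n + m + 1) / 12)
    return 1.0 - NormalDist().cdf((u - mean_u) / sd_u)

# Hypothetical selection-evidence scores for trait-increasing SNPs and controls.
p_shifted = one_sided_rank_sum([5, 6, 7, 8], [1, 2, 3, 4])
p_reverse = one_sided_rank_sum([1, 2, 3, 4], [5, 6, 7, 8])
p_same = one_sided_rank_sum([1, 2, 3], [1, 2, 3])
```

Because only ranks enter the statistic, the test does not depend on the scale of the selection score, in the same spirit as using only the direction of each GWAS association rather than its strength.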
If positive selection influences a trait in a certain direction, e.g. increasing, we would expect positive selection on trait-increasing and negative selection on trait-decreasing mutations. We expect our test to be sensitive mainly to the former, because selection will increase frequencies of such SNPs, and the Relate Selection Test has reduced power to identify selection at rarer markers (Figure 5b). However, for traits with a large number of hits and strong selection, it is theoretically possible to observe some selection evidence in both directions52,53, because to avoid ascertainment effects we condition on SNP allele frequencies at trait-influencing sites. Therefore, we additionally test for differences in present-day DAFs between trait-increasing and trait-decreasing mutations, which can provide orthogonal evidence of polygenic adaptation, aiding interpretation of results (Methods).
As a positive control, we applied our test to GWAS for hair colour within the UK Biobank54 (Figure 6a). As in previous studies49,55,56, we find a signal for SNPs associated with blonder hair color among European populations. We further observe strong selection towards light brown hair color and against black hair color, including more weakly in South Asians, but not in other groups. Testing based on iHS scores decreases significance by up to 4 orders of magnitude (Figure 6a), and some signals become non-significant. We applied the same test to 84 traits: 6 from the UK Biobank, and 78 with at least 10 genome-wide significant GWAS catalogue association signals in each effect direction, in all populations except recently admixed groups. 61 of these (73%) showed nominal selection evidence (p<0.05) in at least one population (Figure 6b), with strong geographic clustering. The most significant signal (p = 6 × 10^-14) is for SNPs associated with decreased Body Mass Index (BMI) in CEU. The largest number of selection signals are observed for Europeans, likely because many GWAS were conducted in these groups. Interestingly, East Asians have the fewest signals and no enrichment of low p-values (Supplementary Figure 8), possibly explained by their stronger population bottleneck, which would theoretically be expected to weaken selection signals.
Height, BMI, and Schizophrenia have been studied previously and show a large number of association signals57. While several studies have reported genetic differentiation between populations58–60, evidence for selection remains controversial40,47–51,58,59,61 and some studies reporting recent selection on increased height in Europeans appear confounded by subtle population stratification40,50,51. Our test finds selection evidence for both effect directions in each population for height, except in East Asians, using the UK Biobank GWAS. DAFs tend to be larger towards the height-decreasing direction. This complex picture may be a consequence of both negative and positive selection acting on height, as well as pleiotropy. SNPs impacting other traits might also impact height (Supplementary Note). We identify strong evidence of selection favouring BMI-decreasing SNPs across almost all populations, with agreement of DAF shifts, indicative of directional selection. For both traits, we detect little evidence of selection in the smaller GWAS catalogue collection. Decreased risk of Schizophrenia has evidence of selection in Europeans, and some South and East Asian populations, while African populations show selection evidence towards a risk increase.
Among other phenotypes, we see selection evidence for a variety of blood-related phenotypes, with congruent DAF signals. In Europeans and some South Asians, we detect a strong signal favoring SNPs associated with blood pressure increases, contrary to previous studies suggesting the opposite direction55,62. We moreover find evidence in many groups for selection favouring SNPs associated with decreased hemoglobin concentration and related traits, and with increases in platelet-related traits.
Discussion
We introduced Relate, a scalable method for estimating genealogies genome-wide and demonstrated its accuracy and utility on a diverse set of applications. In many settings, Relate improved on existing state-of-the-art methods, which have previously required separate analyses: by instead obtaining inferences from the same genealogies, comparisons across different applications become straightforward. This approach is highly modular; methods developed for genealogy-based inference should be applicable regardless of the specific algorithm used for estimating marginal trees. Although we have focused on human genomes, Relate should work equally well in other recombining species.
In our 1000GP data analysis, we provide several examples whereby Relate-based trees are able to capture evolutionary processes that are themselves evolving through time: “evolution of evolution”. Temporal changes in mutation rates, population size, migration, and archaic admixture are simultaneously inferred, as are population-specific signals of natural selection. Genealogies provide a powerful, natural way to study these complex, interacting phenomena, and we believe studies of other evolutionarily and temporally dynamic processes - for example of evolution of recombination rates63,64 - will yield new insights.
Interpretation of our findings regarding natural selection requires some care. A strength of our selection test is that it provides p-values, which are naturally calibrated, even if population sizes vary through time. In common with previous studies, we find a relatively small (<40) number of clear signals of strong, ongoing selection across multiple human populations. In contrast, we find a much larger collection of phenotypes where - based on published GWAS - there is evidence of an influence of directional selection. These traits include BMI, blood pressure, and white and red blood cell counts, and more generally, we see an enrichment of selection evidence at loci shown to associate with human phenotypes. These findings appear highly consistent with the polygenic nature of most human phenotypes - which are expected to impose very weak selection, but on a large collection of loci41. However, temporal changes in selection, overlapping genetic influences across traits, and the possibility of compensatory evolution in response to other genetic changes or the environment, are among reasons complicating the assignment of selection signals to specific phenotypes (Supplementary Note).
Relate provides age estimates for mutations and other events, and these enable us to construct statistics to understand evolutionary history, including natural selection either on individual mutations or collections of mutations. We regard the selection statistics introduced here as initial approaches along a path towards a richer inference framework, including e.g. background selection, full selective sweeps, or balancing selection. Development of methods to better understand directional migration and ancient admixture is another direction for future work. As one example, our results suggest a large impact of ancient substructure specific to African populations, as has been previously hypothesized37,38. More generally, we hope that methods will be developed to perform statistical analyses on a set of trees generated either by Relate or other approaches. Other analyses might use estimated mutational ages obtained here directly (https://zenodo.org/record/3234689).
There are several natural extensions to Relate itself, e.g. allowing for increasing sample sizes. A recently developed method, tsinfer65, has impressive scaling with sample size and might readily extend to even millions of samples, while Relate currently only handles at most a few tens of thousands of samples genome-wide. While tsinfer currently only infers tree topologies (as part of a full ancestral recombination graph structure), and so cannot infer tree times or model demographic histories, it would be possible to use tsinfer-based tree topologies in our framework, allowing full tree-based inference for huge sample sizes. Incorporation of ancient DNA sequences is another important direction. Such samples may have substantially higher error rates or more missing data than modern-day individuals, potentially requiring an approach that “threads” (ancient) sequences through genealogies that are initially built using modern individuals2. This approach might also be useful for efficient statistical phasing and/or imputation of individuals only typed at a subset of markers.
Online Methods
Relate overview
We estimate genealogies as a sequence of rooted binary trees, where each tree captures the genealogy for a subregion of the genome. This representation serves as an approximation of an Ancestral Recombination Graph (ARG)4. We estimate local ancestry without global constraints on tree topology, thereby transforming genealogy reconstruction into a feasible and highly parallelizable problem.
Our approach can be divided roughly into three steps, which we detail below (also see Figure 1, Supplementary Figure 1, and Supplementary Note).
Calculating position specific distance matrices
While trees vary along the genome, our method heavily utilizes ancestry information from nearby SNPs to reconstruct the tree at a specific position. We achieve this by using an HMM similar to that first proposed by Li and Stephens27 (see Supplementary Figure 2 for parameter choices). Intuitively, this HMM reconstructs a haplotype as a mosaic of other sample haplotypes along the genome (Supplementary Figure 1), allowing for mismatching in the copying process, and viewing changes in haplotype as recombination events. After applying the HMM, at a focal SNP ℓ each of the other haplotypes j therefore has some probability p_ijℓ of being copied from, to generate haplotype i. After rescaling log p_ijℓ appropriately (Supplementary Note), we obtain a position-specific distance matrix d whose entry (i,j) converges to the number of mutations derived in i and ancestral in j in the limit of no recombinations. In the presence of recombination, this d can be interpreted to store a local number of derived mutations, where more closely related haplotypes tend to have fewer mismatches over longer stretches, therefore receiving a smaller distance in this matrix.
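The no-recombination limit of the distance matrix can be computed directly from the data, which makes its asymmetry easy to see. The sketch below counts, for each ordered pair (i, j), the sites where haplotype i is derived and haplotype j is ancestral. It does not implement the HMM or the rescaling of copying probabilities; the function name and the toy haplotypes are illustrative assumptions.

```python
def derived_ancestral_distance(haps):
    """Asymmetric distance: d[i][j] counts sites derived in i and ancestral in j.

    haps[i][s] is 1 if haplotype i carries the derived allele at SNP s, else 0.
    """
    n = len(haps)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d[i][j] = sum(1 for a, b in zip(haps[i], haps[j])
                          if a == 1 and b == 0)
    return d

# Four toy haplotypes consistent with the tree ((0,1),(2,3)).
haps = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
]
d = derived_ancestral_distance(haps)
```

Here d[0][1] = 1 but d[1][0] = 0, because haplotype 0 carries one derived mutation that haplotype 1 lacks and not the reverse. Within each row, the smallest off-diagonal entry points to the haplotype that coalesces first with the row's haplotype.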
We modified the Li-and-Stephens HMM to account for ancestral and derived states, a modification that guarantees our approach will construct the correct tree topology under the infinite-sites assumption with no recombination, while simultaneously speeding up the calculation of posterior copying probabilities.
Tree builder
The distance matrix is turned into a binary tree using a hierarchical clustering algorithm. This hierarchical clustering algorithm is motivated by the observation that each row of the distance matrix should indicate the order in which this haplotype coalesced with other haplotypes of the dataset. This can be shown mathematically in some limit conditions, such as the case with no recombination (Supplementary Note).
Our algorithm iteratively merges clades of haplotypes, corresponding to past coalescences. After merging clades, we update the distance matrix by combining the corresponding rows and columns using a weighted sum, with weights determined by the size of clades. In each step of the algorithm, we merge the pair of clades that coalesce with each other before coalescing with any other clade, as determined using rows of the distance matrix. If multiple pairs of clades satisfy this condition, we choose the pair with minimum symmetrized score in the distance matrix. If the data are consistent with a binary tree under the infinite-sites model, such a pair always exists. In practice, errors in the data, complex recombination histories, or violations of assumptions made by our model, may result in a distance matrix that is inconsistent with a binary tree. To be robust to such cases, we relax the conditions for identifying pairs of clades to coalesce.
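The merging procedure described above can be sketched in simplified form. In this illustration, a pair of clades qualifies when each is a row-minimum of the other; ties are broken by the symmetrized score d(i,j) + d(j,i); and rows and columns are combined as a clade-size-weighted average. These specific choices, the fallback to the best symmetrized pair when no pair qualifies, and the name `build_tree` are assumptions of this sketch rather than details of Relate's tree builder.

```python
from itertools import combinations

def build_tree(d):
    """Return the clades (as frozensets of haplotype indices) in merge order."""
    n = len(d)
    dist = {i: {j: float(d[i][j]) for j in range(n) if j != i} for i in range(n)}
    members = {i: frozenset([i]) for i in range(n)}
    clades, next_id = [], n
    while len(members) > 1:
        row_min = {i: min(dist[i].values()) for i in members}
        pairs = list(combinations(sorted(members), 2))
        # Pairs that each see the other as (one of) their closest clades.
        mutual = [(i, j) for i, j in pairs
                  if dist[i][j] <= row_min[i] + 1e-9
                  and dist[j][i] <= row_min[j] + 1e-9]
        candidates = mutual or pairs  # relaxed fallback (assumption)
        i, j = min(candidates, key=lambda p: dist[p[0]][p[1]] + dist[p[1]][p[0]])
        ni, nj = len(members[i]), len(members[j])
        new, next_id = next_id, next_id + 1
        dist[new] = {}
        for k in members:
            if k in (i, j):
                continue
            # Size-weighted combination of the merged rows and columns.
            dist[new][k] = (ni * dist[i][k] + nj * dist[j][k]) / (ni + nj)
            dist[k][new] = (ni * dist[k][i] + nj * dist[k][j]) / (ni + nj)
            del dist[k][i], dist[k][j]
        del dist[i], dist[j]
        members[new] = members.pop(i) | members.pop(j)
        clades.append(members[new])
    return clades

# Distance matrix of four haplotypes consistent with the tree ((0,1),(2,3)).
example_d = [
    [0, 1, 2, 2],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [2, 2, 1, 0],
]
clades = build_tree(example_d)
```

On this input the sketch first merges haplotypes 0 and 1, then 2 and 3, and finally joins the two clades at the root.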
Mapping mutations to branches and estimating branch lengths
Once tree topology is estimated as above, where possible we map mutations to the (unique) branch whose descendants are exactly the carriers of the derived allele in the data. To be robust to errors, where necessary we use an approximate rule for such mapping; however, some mutations, e.g. repeat mutations or error-prone loci, may still not map to a unique branch. For these loci, we determine the smallest collection of branches such that the data can be fully recovered. If a mutation maps to the tree only after reinterpreting the derived allele as the ancestral allele (and vice versa), we “flip” ancestral and derived alleles at this locus. For computational efficiency, to avoid having to construct a new tree at every locus, we construct trees starting at the 5’ end of a region or chromosome, and move along the region constructing a new tree only when a SNP is flipped or cannot be mapped to a unique branch. Finally, after identifying equivalent branches in adjacent trees along the genome, we apply a Metropolis-Hastings type Markov Chain Monte Carlo (MCMC) algorithm to estimate branch lengths. The MCMC algorithm has a coalescent prior assuming a single panmictic population3.
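The exact-match mapping rule and the allele “flip” can be illustrated with a short sketch. It covers only the exact case (no approximate rule, no multi-branch mapping), represents trees as nested tuples, and uses hypothetical function names.

```python
def clades(tree):
    """Return the set of leaf sets (one per branch) of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, tuple):
            s = frozenset().union(*(leaves(child) for child in node))
        else:
            s = frozenset([node])
        found.add(s)
        return s

    leaves(tree)
    return found


def map_mutation(tree, carriers, n_leaves):
    """Map a mutation to the branch whose descendants equal its carriers.

    Returns (branch_leaf_set, flipped). branch_leaf_set is None if neither the
    carriers nor their complement (flipped alleles) match a branch exactly.
    """
    carriers = frozenset(carriers)
    branches = clades(tree)
    if carriers in branches:
        return carriers, False
    complement = frozenset(range(n_leaves)) - carriers
    if complement in branches:
        return complement, True
    return None, False


tree = ((0, 1), (2, 3))
direct = map_mutation(tree, {0, 1}, 4)       # maps without flipping
flipped = map_mutation(tree, {1, 2, 3}, 4)   # maps only after flipping alleles
unmapped = map_mutation(tree, {0, 2}, 4)     # no exact match in this tree
```

In the procedure described above, a flipped or unmapped SNP is what triggers construction of a new tree.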
Estimating coalescence rates through time
We estimate the effective population size, defined as the inverse of the coalescence rate, by applying the following iterative algorithm. We initially estimate branch lengths using a constant effective population size. We then calculate a maximum-likelihood estimate of the coalescence rates between pairs of haplotypes given the branch lengths (Supplementary Note). By averaging coalescence rates over all pairs of haplotypes and taking the inverse, we obtain a population-wide estimate of the effective population size. We then use this population size estimate to re-estimate branch lengths, which requires only the final MCMC step of the branch-length estimation. By repeating these two steps until convergence (in practice, we use only 5 iterations as this provides good performance), we obtain a self-contained algorithm for jointly estimating branch lengths and the effective population size. We can average pairwise coalescence rates in different ways to obtain rates for sub-populations and cross-coalescence rates between populations.
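As an illustration of the rate-estimation step only, the sketch below uses the standard maximum-likelihood estimator for a piecewise-constant rate: the number of coalescences in an epoch divided by the total time that pairs spend uncoalesced in that epoch. This estimator and the function names are assumptions for illustration; the estimator actually used is given in the Supplementary Note, and the re-estimation of branch lengths is not shown.

```python
def pairwise_coalescence_rates(coal_times, epochs):
    """Piecewise-constant coalescence rate per epoch.

    coal_times: coalescence times of haplotype pairs (any consistent time unit).
    epochs: increasing epoch start times beginning at 0; the last epoch is open-ended.
    """
    rates = []
    for e, start in enumerate(epochs):
        end = epochs[e + 1] if e + 1 < len(epochs) else float("inf")
        # Pairs that coalesce inside this epoch.
        events = sum(1 for t in coal_times if start <= t < end)
        # Time each still-uncoalesced pair spends inside this epoch.
        exposure = sum(min(t, end) - start for t in coal_times if t > start)
        rates.append(events / exposure if exposure > 0 else float("nan"))
    return rates


def effective_population_size(rates):
    """Inverse of the coalescence rate, following the definition in the text."""
    return [1.0 / r if r > 0 else float("nan") for r in rates]


# Toy input (hypothetical): four pairwise coalescence times, three epochs.
rates = pairwise_coalescence_rates([1.0, 3.0, 3.0, 5.0], [0.0, 2.0, 4.0])
sizes = effective_population_size(rates)
```

Restricting `coal_times` to pairs within one group, or to pairs spanning two groups, would give within-population and cross-coalescence rates respectively.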
Estimating relative mutation rates through time
We estimate the mutation rate through time for all 96 triplet mutations (Figure 4a, Supplementary Figure 6). To estimate mutation rates for a mutation category of interest, we calculate, for each epoch, the quotient of the number of mutations in that category by the total branch length over bases at which such a mutation may have occurred. In our model, we fix the average mutation rate to a constant value through time, such that any change in average mutation rate should in theory be absorbed in our population size estimate. We therefore first eliminate any remaining temporal trends in the average mutation rate by dividing by the average mutation rate in each epoch. For each population, we then normalise the mutation rates such that the average rate over time equals 1. In simulations (Supplementary Figure 4), we show that variable mutation rates among categories can be detected by this approach, and approximately dated.
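The three normalisation steps can be sketched as below. Two details are assumptions made for the sketch: the per-epoch average is taken as an unweighted mean over categories, and the final normalisation is applied separately to each category. The function name is hypothetical.

```python
def relative_mutation_rates(counts, opportunity):
    """counts[c][e]: number of mutations of category c arising in epoch e.
    opportunity[c][e]: total branch length in epoch e, summed over the bases at
    which a category-c mutation could have occurred.

    Returns rates[c][e], detrended per epoch and scaled to mean 1 over time.
    """
    n_cat, n_epoch = len(counts), len(counts[0])
    # Step 1: raw rate = mutations per unit of branch length and opportunity.
    raw = [[counts[c][e] / opportunity[c][e] for e in range(n_epoch)]
           for c in range(n_cat)]
    # Step 2: divide by the average rate in each epoch (assumed unweighted mean).
    avg = [sum(raw[c][e] for c in range(n_cat)) / n_cat for e in range(n_epoch)]
    detrended = [[raw[c][e] / avg[e] for e in range(n_epoch)] for c in range(n_cat)]
    # Step 3: scale each category so that its average over epochs equals 1.
    return [[x / (sum(row) / n_epoch) for x in row] for row in detrended]


# Toy input (hypothetical): two categories, two epochs.
rel = relative_mutation_rates([[10, 20], [10, 10]], [[100, 100], [100, 100]])
```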
Pre-processing of the 1000 Genomes Project dataset
The 1000 Genomes Project dataset comprises 2,504 individuals from 26 populations. We obtained a phased version of the dataset (see Data availability). We exclude multi-allelic SNPs, set aside one individual (two haplotypes) from each population for future applications, and analyze the remaining 2,478 individuals (Supplementary Table 2). We use a genomic mask provided with the 1000 Genomes Project dataset (see Data availability) to exclude regions marked as “not passing” in the pilot mask, in order to remove loci with low genotype certainty. We also exclude any base for which the fraction of “not passing” bases within 1,000 bases to either side exceeds 0.9. To account for this filtering, we readjust the number of bases between SNPs at which we could potentially have observed a SNP. We use an estimate of the human ancestral genome (see Data availability) to identify the most likely ancestral allele for each SNP.
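The windowed mask filter can be sketched with a prefix sum. The sketch assumes the window includes the focal base and is truncated at the ends of the sequence; those boundary choices and the function name are illustrative assumptions.

```python
def passing_positions(mask, window=1000, max_fail_fraction=0.9):
    """mask[i] is True if base i passes the genomic mask.

    A base is kept if it passes itself and the fraction of not-passing bases
    within `window` bases to either side does not exceed `max_fail_fraction`.
    """
    n = len(mask)
    # prefix[i] = number of not-passing bases in mask[:i]
    prefix = [0]
    for ok in mask:
        prefix.append(prefix[-1] + (0 if ok else 1))
    keep = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        fail_fraction = (prefix[hi] - prefix[lo]) / (hi - lo)
        keep.append(bool(mask[i]) and fail_fraction <= max_fail_fraction)
    return keep


# Toy mask (hypothetical): one passing base surrounded by not-passing bases.
isolated = [False] * 10 + [True] + [False] * 10
keep_isolated = passing_positions(isolated, window=5)
keep_clean = passing_positions([True] * 21, window=5)
```

In the toy example the single passing base is dropped because 10 of the 11 bases in its window fail, which exceeds 0.9.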
Identifying branches indicative of Neanderthal and Denisovan introgression
We use genome sequences of the Vindija22 and Altai34 Neanderthals (NEA), and a Denisovan (DEN)33 to identify branches indicative of Neanderthal and Denisovan introgression into non-African populations. To identify branches that remain segregated from other human lineages for a long time, we use the world-wide genealogy of 2,478 samples. To identify whether a branch is shared with NEA or DEN, at least one mutation needs to be mapped to that branch. We therefore exclude any mutation that has not been genotyped (or does not pass the genomic masks) in these ancient genomes. We further restrict our analysis to branches with at least two mutations mapped to them, as well as having an upper end that is older than 1M YBP. Among such branches, we calculate the fraction of branches with at least one NEA or DEN mutation. In Figure 4c, we plot these fractions as functions of the lower-end age of the branch. Because the same branch may persist over multiple trees, we identify equivalent branches (Supplementary Note) and average ages of lower and upper ends across these equivalent branches. We assign a branch to a population if at least one descendant of that branch is in the population.
In Figure 4d, we observe an enrichment of branches indicative of introgression. This enrichment is identified by comparing the observed number of mutations in bins divided by upper and lower coalescence age to that expected in a panmictic history. To calculate the expected number of mutations in each bin, we fix the ages of coalescence events in each tree but randomise the topology assuming a panmictic population. The probability of upper and lower coalescence ages falling into bins s and r, conditional on the mutation arising while k lineages remain, is given by where I denotes the indicator function. Assuming neutrality, a mutation is equally likely to have arisen anywhere on the branch it maps to. We therefore calculate the weighted average with weights wk defined as the proportions of a branch while k lineages remain. Summing this over all SNPs yields the expected number of mutations with upper and lower coalescence age falling, respectively, into bins s and r. In Figure 4d, log10 age bins are defined by [− ∞, 4.25),[4.25,4.75),[4.75,5.25),[5.25,5.75),[5.75, ∞).
Tree-based statistic for detecting positive selection
Positive selection is expected to result in favourable mutations spreading rapidly in a population. One approach to capture this is via the number of lineages ultimately descending from the potentially favourable mutation(s): although we note that this is not the maximum likelihood approach, it has the benefit of making calculations straightforward. Under a null model of the standard coalescent model without selection, it is known that while k lineages remain, the joint distribution of the number of descendants of these k lineages is uniform in the partitions of N haplotypes to k lineages (see e.g., Ref. [66]). Using this property, we analytically calculate the marginal distribution that two of k lineages have more than fZ descendants, where fZ is the present-day DAF of the mutation. Here, we choose k to be the number of lineages remaining when the mutation of interest increased from frequency 1 to 2 (see Supplementary Note for the mathematical details).
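The uniform-partition property makes the calculation a counting exercise. There are C(N−1, k−1) ways to split N haplotypes among k lineages with every lineage non-empty; two given lineages jointly have f descendants in (f−1)·C(N−f−1, k−3) of them. The sketch below sums this probability over the upper tail. It treats f as a carrier count and includes the observed value in the tail; both choices, and the function names, are assumptions, and the exact statistic is defined in the Supplementary Note.

```python
from math import comb


def prob_joint_descendants(f, k, n):
    """P(two given lineages out of k jointly have exactly f of n descendants),
    under the uniform distribution over partitions of n into k non-empty parts."""
    if k == 2:
        return 1.0 if f == n else 0.0
    if k < 2 or f < 2 or f > n - k + 2:
        return 0.0
    return (f - 1) * comb(n - f - 1, k - 3) / comb(n - 1, k - 1)


def selection_p_value(f_obs, k, n):
    """Upper-tail probability of observing at least f_obs carriers."""
    return sum(prob_joint_descendants(f, k, n) for f in range(f_obs, n - k + 3))


# Toy values (hypothetical): 20 haplotypes, mutation at 2 copies when k = 5.
total = sum(prob_joint_descendants(f, 5, 20) for f in range(2, 18))
p_common = selection_p_value(10, 5, 20)
p_very_common = selection_p_value(16, 5, 20)
```

A mutation that is carried by many haplotypes today, despite having arisen when many lineages remained, receives a small p-value.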
To remove false-positive selection hits due to poorly inferred genealogies, our analysis for the 1000 Genomes Project dataset is based on a subset of all SNPs mapping to trees, and present in 3 or more copies in the dataset. Specifically, we remove SNPs failing any of the following filters: (i) the number of mutations mapping to that SNP’s tree is in the bottom 5th percentile, or (ii) the fraction of tree branches having at least one SNP is in the bottom 5th percentile. This excludes approximately 7% of SNPs.
Simulation of positive selection
To simulate positive natural selection, we adopt the pipeline outlined in Ref. [49]. We first simulate the trajectory of the DAF using simuPOP67. We vary the selection coefficient between s = 0.001 and s = 0.05 and assume that the selected allele is beneficial throughout its history. We fix the present-day DAF to 0.7 (see Supplementary Figure 7 for other present-day DAFs). We then use mbs268 (mutation rate μ = 1.25 × 10⁻⁸, constant recombination rate ρ = 5 × 10⁻⁹) to simulate a region of 20 Mb, given the DAF trajectory for the central selected SNP. For each non-zero selection coefficient, we perform 200 simulations, and we perform 500 simulations for the neutral case. We assume a population size history as for our estimates for YRI and GBR, in separate simulations.
We compare to iHS, SDS, and a tree-based variant of SDS (trSDS) proposed in Ref. [40]. For iHS, SDS, and trSDS, we standardise scores using the mean and standard deviation in the neutral case, which is an idealised setting that should favour the power estimates of these methods. We then determine a critical standardized score that corresponds to a given type I error rate in the neutral case to estimate the statistical power. For Relate, we use frequency-conditioned p-values, by calculating a critical p-value that yields the desired false-positive rate in the neutral case (for the statistical power using raw p-values, see Supplementary Figure 7).
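Calibrating a critical value on neutral simulations and then measuring power can be sketched as follows. The sketch assumes larger scores indicate stronger evidence (for p-values one would negate or otherwise reverse the ordering), and the function names are hypothetical.

```python
def critical_value(neutral_scores, alpha):
    """Threshold such that at most a fraction alpha of neutral scores exceed it."""
    ordered = sorted(neutral_scores, reverse=True)
    n_allowed = int(alpha * len(ordered))  # neutral scores allowed above the threshold
    return ordered[n_allowed]


def power(selected_scores, neutral_scores, alpha):
    """Fraction of simulations under selection whose score exceeds the threshold."""
    c = critical_value(neutral_scores, alpha)
    return sum(s > c for s in selected_scores) / len(selected_scores)


# Toy scores (hypothetical): 100 neutral replicates, 4 replicates under selection.
neutral = list(range(100))
threshold = critical_value(neutral, 0.05)
estimated_power = power([90, 95, 96, 100], neutral, 0.05)
```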
Enrichment of SNPs with functional annotation among targets of positive selection
We merge selection evidence for SNPs by region (AFR: Africans, EAS: East Asians, EUR: Europeans, SAS: South Asians) by first calculating Z-scores of the logarithm of selection p-values within populations, and then averaging these Z-scores across populations. We exclude groups expected to be highly admixed69 (ACB, ASW, CLM, MXL, PEL, PUR (Supplementary Table 2)), because recent admixture may confound selection signals. We further exclude SNPs with a DAF <5% in the region of interest.
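Merging evidence across populations within a region can be sketched as below. The sketch standardises log10 p-values within each population and averages across populations, so more negative scores indicate stronger evidence; the use of the population standard deviation and the function names are assumptions.

```python
from math import log10
from statistics import mean, pstdev


def log_p_zscores(p_values):
    """Z-scores of log10 p-values within a single population."""
    logs = [log10(p) for p in p_values]
    m, s = mean(logs), pstdev(logs)
    return [(x - m) / s for x in logs]


def regional_scores(p_by_population):
    """p_by_population: {population: p-values for the same SNPs, in the same order}.

    Returns one averaged Z-score per SNP.
    """
    z = [log_p_zscores(p) for p in p_by_population.values()]
    return [mean(col) for col in zip(*z)]


# Toy input (hypothetical): three SNPs in two populations of one region.
scores = regional_scores({
    "POP1": [0.1, 0.01, 0.001],
    "POP2": [0.2, 0.02, 0.002],
})
```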
To assess statistical significance for the observed enrichment of GWAS hits and functional mutations in groups of SNPs showing evidence of selection, we used a block bootstrap with a block size of 1 Mb. This will account for LD at scales below this threshold. In each bootstrap iteration, we resample blocks containing SNPs with a selection Z-score within the range of interest, with replacement, and calculate the ratio of the number of SNPs with functional annotation obtained using the HaploReg database70 (see Data availability) and the GWAS catalogue to the expected number of such SNPs, conditional on DAF. We condition on frequency, to account for the possibility that skewed frequency spectra in functional SNPs could be driving the signal.
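The block bootstrap can be sketched as follows, assuming that the observed number of annotated SNPs and the DAF-conditional expected number have already been tabulated for each 1 Mb block. That input layout and the function name are illustrative assumptions.

```python
import random


def block_bootstrap_ratios(blocks, n_iter=1000, seed=0):
    """blocks: list of (observed, expected) counts of annotated SNPs per block.

    Resamples whole blocks with replacement and returns the ratio
    observed / expected for each bootstrap replicate.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_iter):
        sample = rng.choices(blocks, k=len(blocks))
        observed = sum(o for o, _ in sample)
        expected = sum(e for _, e in sample)
        ratios.append(observed / expected)
    return ratios


# Toy blocks (hypothetical).
blocks = [(3, 1.5), (0, 0.5), (2, 1.0), (4, 2.5), (1, 0.5)]
ratios = block_bootstrap_ratios(blocks, n_iter=200)
```

Resampling whole blocks rather than individual SNPs keeps linked SNPs together, which is what allows the procedure to account for LD within a block.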
Pre-processing of GWAS
We use SNP-trait associations documented in the GWAS catalogue71 (see Data availability) to study polygenic adaptation. We use only association signals whose GWAS p-value is smaller than 5 × 10⁻⁸. For each trait, we also remove any duplicate SNPs.
For every combination of population, trait, and effect direction, we compile a set of approximately independent GWAS signals as follows.
For each pair of population and trait, we remove associations that are in close physical proximity and may therefore be in LD. For this, we first group SNPs into approximately independent blocks, such that any two GWAS hits in separate blocks are separated by at least 100 kb and there are no intervals larger than 100 kb with no GWAS hit inside a block. We then choose one GWAS hit from each block uniformly at random. We remove any SNP with a DAF <5%. To determine the effect direction of a SNP, we use the annotation in column “95% CI (TEXT)” combined with the indicated risk allele. We then realign the effect direction to the derived allele. We only consider SNPs for which an effect direction can be determined with this procedure. As described in the main text, we only analyze traits with at least 10 independent hits in both effect directions in all populations. This results in 76 traits and a total of 7,302 GWAS hits (before filtering for SNPs in close proximity in each population).
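The blocking step can be sketched for a single chromosome as below. Hits closer than 100 kb to the previous hit join its block, and one hit per block is then drawn uniformly at random. The treatment of a gap of exactly 100 kb and the function name are assumptions.

```python
import random


def independent_hits(positions, gap=100_000, seed=0):
    """Group GWAS hit positions (one chromosome) into blocks and pick one per block.

    Consecutive hits closer than `gap` share a block, so hits in different
    blocks are at least `gap` apart. Returns (chosen_hits, blocks).
    """
    rng = random.Random(seed)
    blocks = []
    for pos in sorted(positions):
        if blocks and pos - blocks[-1][-1] < gap:
            blocks[-1].append(pos)
        else:
            blocks.append([pos])
    chosen = [rng.choice(block) for block in blocks]
    return chosen, blocks


# Toy positions (hypothetical), in base pairs.
chosen, blocks = independent_hits([10_000, 50_000, 120_000, 400_000, 450_000, 900_000])
```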
For schizophrenia, we are unable to obtain an effect direction using the procedure described above. Instead, we downloaded results for a large-scale GWAS conducted by the Psychiatric Genomics Consortium72. We considered SNPs reaching the GWAS p-value threshold of 5 × 10⁻⁸, of which there were 9,138. We intersected this set of SNPs with SNPs segregating in each of the considered populations. As for the GWAS catalogue, we identified approximately independent blocks. We then chose the SNP with the lowest GWAS p-value in each block, resulting in 81 to 89 hits per population.
In addition, we use GWAS conducted as part of the UK Biobank54, focussing on highly polygenic physical traits. Our pre-processing protocol is analogous to that for schizophrenia detailed above. The number of approximately independent hits per population ranges from 272 hits for waist circumference to 989 hits for standing height.
Trait selection test
For every combination of population, trait, and effect direction, we test whether p-values are smaller than expected. For this test, we first sample SNPs that we use for comparison. For each SNP associated with the population, trait, and effect direction tuple of interest, we sample 20 SNPs uniformly at random, with replacement, from SNPs with the same present-day DAF in the population of interest. We then use a one-sided Wilcoxon rank-sum test to test whether the p-values of SNPs associated with the tuple of interest tend to be smaller than those for the frequency-matched set of SNPs. We repeat this test 20 times and report the mean p-value of the Wilcoxon rank-sum test.
Our primary test identifies selection evidence conditional on DAF. However, shifts in DAF can themselves serve as orthogonal evidence of polygenic adaptation, complementing our inferences. Therefore, we conducted a one-sided Wilcoxon rank-sum test to test whether DAFs of SNPs associated with the effect direction with selection evidence tend to exceed those associated with the opposing effect direction, and compared the outcome to our results conditional on SNP frequency. We note that we expect to lack power to reliably detect selection with this test, given that there are typically only tens of SNPs independently associating with each trait. In addition, the relationship between selection and SNP frequencies can be complex if selection strength varies through time and/or geographic locations.
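The frequency-matched comparison can be sketched as below. The rank-sum p-value uses a plain normal approximation with midranks and no tie or continuity correction, which is a simplification chosen to keep the sketch dependency-free; the input layout and function names are assumptions.

```python
import random
from math import erfc, sqrt


def rank_sum_p_less(x, y):
    """One-sided Wilcoxon rank-sum p-value (normal approximation) for the
    alternative that values in x tend to be smaller than values in y."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2  # midrank of tied positions i+1..j
        i = j
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return 0.5 * erfc(-z / sqrt(2))  # P(Z <= z)


def trait_selection_test(trait_snps, background, n_matched=20, n_repeats=20, seed=0):
    """trait_snps: list of (daf, selection p-value) for trait-associated SNPs.
    background: {daf: [selection p-values of SNPs with that DAF]}.

    Returns the mean rank-sum p-value over repeated frequency-matched samples.
    """
    rng = random.Random(seed)
    trait_p = [p for _, p in trait_snps]
    results = []
    for _ in range(n_repeats):
        matched = [p for daf, _ in trait_snps
                   for p in rng.choices(background[daf], k=n_matched)]
        results.append(rank_sum_p_less(trait_p, matched))
    return sum(results) / n_repeats


# Toy input (hypothetical): trait SNPs with very small selection p-values.
background = {0.3: [i / 100 for i in range(1, 101)]}
mean_p = trait_selection_test([(0.3, 1e-6)] * 15, background)
```

In practice a library routine such as `scipy.stats.mannwhitneyu` with `alternative="less"` would usually be preferred over the hand-written approximation.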
Acknowledgements
We thank Nick Barton, Daniel Falush, Molly Przeworski, Guy Sella, Jonathan Terhorst, Pier Palamara, Gerton Lunter, Jonathan Marchini, Sile Hu, Christopher B. Cole, Thaddeus Aid, Clare E. West for helpful comments, ideas, and suggestions. L.S. acknowledges the support provided through the Engineering and Physical Sciences Research Council (EPSRC) [grant number EP/G03706X/1]. M.F. acknowledges the support provided through the Natural Sciences and Engineering Research Council of Canada (NSERC, PGS D) and the Clarendon Scholarship. S.R.M. acknowledges the support provided by the Wellcome Trust Investigator Award [grant number 098387/Z/12/Z and 212284/Z/18/Z]. Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. Financial support was provided by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.
Footnotes
Author contributions
S.R.M. designed the study. L.S. and S.R.M. developed Relate with contributions by M.F. in the development of the algorithm for estimating coalescence rates. L.S. and S.R.M. performed the analysis, S.S. provided supplementary data and L.S. and S.R.M. wrote the manuscript.
Competing Interests
S.R.M. is a director of GENSCI limited. The remaining authors declare no competing financial interests.
Data availability
Relate-estimated coalescence rates, allele ages, and selection p-values for the 1000 Genomes Project can be downloaded from https://zenodo.org/record/3234689.
Datasets used in the current study were obtained from the following URLs:
1000 Genomes Project phased dataset, https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html (13 Jan 2017); Genomic mask, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/ (20 Jul 2017); Human ancestral genome, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/ (20 Jul 2017); GWAS catalogue, https://www.ebi.ac.uk/gwas/api/search/downloads/full (9 Nov 2017); PGC GWAS study, https://www.med.unc.edu/pgc/results-and-downloads (23 Nov 2018); HaploReg, http://archive.broadinstitute.org/mammals/haploreg/data/haploreg_v4.0_20151021.vcf.gz (21 Oct 2017); GTEx eQTL, https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz (13 Jan 2019); UK Biobank GWAS summary statistics, http://www.nealelab.is/uk-biobank (4 Oct 2018); PopHumanScan, https://pophumanscan.uab.cat (13 Jan 2019)
Code availability
The software Relate can be downloaded from https://myersgroup.github.io/relate under an Academic Use Licence.
External software used in the current study was downloaded from the following URLs:
ARGweaver, https://github.com/mdrasmus/argweaver (24 Jan 2017); RENT+, https://github.com/SajadMirzaei/RentPlus (2 Oct 2017); msprime, https://github.com/tskit-dev/msprime (22 Jul 2017); msmc, https://github.com/stschiff/msmc2 (14 Oct 2017); SMC++, https://github.com/popgenmethods/smcpp (14 Oct 2017); simuPOP, http://simupop.sourceforge.net/ (27 Jun 2018); mbs, http://www.sendou.soken.ac.jp/esb/innan/InnanLab/ (27 Jun 2018); SDS, https://github.com/yairf/SDS (27 Jun 2018); selscan, https://github.com/szpiech/selscan (31 Jul 2018); hapbin, https://github.com/evotools/hapbin (11 Dec 2018)
References
- 1. Griffiths RC, Marjoram P. Ancestral inference from samples of DNA sequences with recombination. J Comput Biol. 1996;3:479–502. doi: 10.1089/cmb.1996.3.479.
- 2. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-Wide Inference of Ancestral Recombination Graphs. PLoS Genet. 2014;10. doi: 10.1371/journal.pgen.1004342.
- 3. Kingman JFC. On the genealogy of large populations. J Appl Probab. 1982;19:27–43.
- 4. Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol. 1983;23:183–201. doi: 10.1016/0040-5809(83)90013-8.
- 5. McVean GAT, Cardin NJ. Approximating the coalescent with recombination. Philos Trans R Soc London B Biol Sci. 2005;360:1387–1393. doi: 10.1098/rstb.2005.1673.
- 6. Hein J. Reconstructing evolution of sequences subject to recombination using parsimony. Math Biosci. 1990;98:185–200. doi: 10.1016/0025-5564(90)90123-g.
- 7. Song YS, Hein J. Constructing minimal ancestral recombination graphs. J Comput Biol. 2005;12:147–169. doi: 10.1089/cmb.2005.12.147.
- 8. Kececioglu J, Gusfield D. Reconstructing a history of recombinations from a set of sequences. Discret Appl Math. 1998;88:239–260.
- 9. Wang L, Zhang K, Zhang L. Perfect phylogenetic networks with recombination. J Comput Biol. 2001;8:69–78. doi: 10.1089/106652701300099119.
- 10. Wu Y. New methods for inference of local tree topologies with recombinant SNP sequences in populations. IEEE/ACM Trans Comput Biol Bioinforma. 2011;8:182–193. doi: 10.1109/TCBB.2009.27.
- 11. Mirzaei S, Wu Y. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics. 2017;33:1021–1030. doi: 10.1093/bioinformatics/btw735.
- 12. Menozzi P, Piazza A, Cavalli-Sforza L. Synthetic maps of human gene frequencies in Europeans. Science. 1978;201:786–792. doi: 10.1126/science.356262.
- 13. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945.
- 14. Novembre J, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101. doi: 10.1038/nature07331.
- 15. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231.
- 16. Henderson D, (Joe) Zhu S, Lunter G. Demographic inference using particle filters for continuous Markov jump processes. bioRxiv: 382218. 2018. doi: 10.1371/journal.pone.0247647.
- 17. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945.
- 18. Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. doi: 10.1093/genetics/164.4.1567.
- 19. Reich DE, et al. Linkage disequilibrium in the human genome. Nature. 2001;411:199–204. doi: 10.1038/35075590.
- 20. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46:919–925. doi: 10.1038/ng.3015.
- 21. Terhorst J, Kamm JA, Song YS. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet. 2017;49:303–309. doi: 10.1038/ng.3748.
- 22. Green RE, et al. A draft sequence of the Neandertal genome. Science. 2010;328:710–722. doi: 10.1126/science.1188021.
- 23. Sudmant PH, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81. doi: 10.1038/nature15394.
- 24. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 25. Harris K. Evidence for recent, population-specific evolution of the human mutation rate. Proc Natl Acad Sci U S A. 2015;112:3439–3444. doi: 10.1073/pnas.1418652112.
- 26. Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006;4:e72. doi: 10.1371/journal.pbio.0040072.
- 27. Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165:2213–2233. doi: 10.1093/genetics/165.4.2213.
- 28. Kelleher J, Etheridge AM, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput Biol. 2016;12:e1004842. doi: 10.1371/journal.pcbi.1004842.
- 29. Bae CJ, Douka K, Petraglia MD. On the origin of modern humans: Asian perspectives. Science. 2017;358:eaai9067. doi: 10.1126/science.aai9067.
- 30. Liu X, Fu Y-X. Exploring population size changes using SNP frequency spectra. Nat Genet. 2015;47:555–559. doi: 10.1038/ng.3254.
- 31. Chheda H, et al. Whole genome view of the consequences of a population bottleneck using 2926 genome sequences from Finland and United Kingdom. Eur J Hum Genet. 2017;25:477–484. doi: 10.1038/ejhg.2016.205.
- 32. Duret L, Galtier N. Biased Gene Conversion and the Evolution of Mammalian Genomic Landscapes. Annu Rev Genomics Hum Genet. 2009;10:285–311. doi: 10.1146/annurev-genom-082908-150001.
- 33. Meyer M, et al. A high-coverage genome sequence from an archaic Denisovan individual. Science. 2012;338:222–226. doi: 10.1126/science.1224344.
- 34. Prüfer K, et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature. 2014;505:43–49. doi: 10.1038/nature12886.
- 35. Sankararaman S, Patterson N, Li H, Pääbo S, Reich D. The date of interbreeding between Neandertals and modern humans. PLoS Genet. 2012;8:e1002947. doi: 10.1371/journal.pgen.1002947.
- 36. Fu Q, et al. An early modern human from Romania with a recent Neanderthal ancestor. Nature. 2015;524:216–219. doi: 10.1038/nature14558.
- 37. Hammer MF, Woerner AE, Mendez FL, Watkins JC, Wall JD. Genetic evidence for archaic admixture in Africa. Proc Natl Acad Sci U S A. 2011;108:15123–15128. doi: 10.1073/pnas.1109300108.
- 38. Ragsdale AP, Gravel S. Models of archaic admixture and recent history from two-locus statistics. bioRxiv: 489401. 2018. doi: 10.1371/journal.pgen.1008204.
- 39. Mathieson I, et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015;528:499–503. doi: 10.1038/nature16152.
- 40. Edge M, Coop G. Reconstructing the history of polygenic scores using coalescent trees. bioRxiv: 389221. 2018. doi: 10.1534/genetics.118.301687.
- 41. Simons YB, Bullaughey K, Hudson RR, Sella G. A population genetic interpretation of GWAS findings for human quantitative traits. PLoS Biol. 2018;16:e2002985. doi: 10.1371/journal.pbio.2002985.
- 42. Enattah NS, et al. Identification of a variant associated with adult-type hypolactasia. Nat Genet. 2002;30:233–237. doi: 10.1038/ng826.
- 43. Hardouin E, et al. Positive Selection in East Asians for an EDAR Allele that Enhances NF-κB Activation. PLoS One. 2008;3:e2209. doi: 10.1371/journal.pone.0002209.
- 44. Miretti MM, et al. A high-resolution linkage-disequilibrium map of the human major histocompatibility complex and first generation of tag single-nucleotide polymorphisms. Am J Hum Genet. 2005;76:634–646. doi: 10.1086/429393.
- 45. Sadier A, Viriot L, Pantalacci S, Laudet V. The ectodysplasin pathway: from diseases to adaptations. Trends Genet. 2014;30:24–31. doi: 10.1016/j.tig.2013.08.006.
- 46. Pritchard JK, Pickrell JK, Coop G. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr Biol. 2010;20:R208–R215. doi: 10.1016/j.cub.2009.11.055.
- 47. Zhang G, Muglia LJ, Chakraborty R, Akey JM, Williams SM. Signatures of natural selection on genetic variants affecting complex human traits. Appl Transl Genomics. 2013;2:78–94. doi: 10.1016/j.atg.2013.10.002.
- 48. Berg JJ, Coop G. A population genetic signal of polygenic adaptation. PLoS Genet. 2014;10:e1004412. doi: 10.1371/journal.pgen.1004412.
- 49. Field Y, et al. Detection of human adaptation during the past 2000 years. Science. 2016;354:760–764. doi: 10.1126/science.aag0776.
- 50. Sohail M, et al. Signals of polygenic adaptation on height have been overestimated due to uncorrected population structure in genome-wide association studies. bioRxiv: 355057. 2018.
- 51. Berg JJ, et al. Reduced signal for polygenic adaptation of height in UK Biobank. bioRxiv: 354951. 2018. doi: 10.7554/eLife.39725.
- 52. Maruyama T. The age of an allele in a finite population. Genet Res. 1974;23:137. doi: 10.1017/s0016672300014750.
- 53. Kiezun A, et al. Deleterious Alleles in the Human Genome Are on Average Younger Than Neutral Alleles of the Same Frequency. PLoS Genet. 2013;9:e1003301. doi: 10.1371/journal.pgen.1003301.
- 54. Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 55. Casto AM, Feldman MW. Genome-Wide Association Study SNPs in the Human Genome Diversity Project Populations: Does selection affect unlinked SNPs with shared trait associations? PLoS Genet. 2011;7:e1001266. doi: 10.1371/journal.pgen.1001266.
- 56. Wilde S, et al. Direct evidence for positive selection of skin, hair, and eye pigmentation in Europeans during the last 5,000 y. Proc Natl Acad Sci U S A. 2014;111:4832–4837. doi: 10.1073/pnas.1316513111.
- 57. Bulik-Sullivan BK, et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
- 58. Turchin MC, et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat Genet. 2012;44:1015–1019. doi: 10.1038/ng.2368.
- 59. Robinson MR, et al. Population genetic differentiation of height and body mass index across Europe. Nat Genet. 2015;47:1357–1362. doi: 10.1038/ng.3401.
- 60. Novick D, Montgomery W, Treuer T, Moneta MV, Haro JM. Sex differences in the course of schizophrenia across diverse regions of the world. Neuropsychiatr Dis Treat. 2016;12:2927–2939. doi: 10.2147/NDT.S101151.
- 61. Crespi B, Summers K, Dorus S. Adaptive evolution of genes underlying schizophrenia. Proc R Soc B Biol Sci. 2007;274:2801–2810. doi: 10.1098/rspb.2007.0876.
- 62. Young JH, et al. Differential susceptibility to hypertension is due to selection during the out-of-Africa expansion. PLoS Genet. 2005;1:e82. doi: 10.1371/journal.pgen.0010082.
- 63. Hinch AG, et al. The landscape of recombination in African Americans. Nature. 2011;476:170–5. doi: 10.1038/nature10336.
- 64. Fledel-Alon A, et al. Variation in human recombination rates and its genetic determinants. PLoS One. 2011;6:e20321. doi: 10.1371/journal.pone.0020321.
- 65. Kelleher J, Wong Y, Albers P, Wohns AW, McVean G. Inferring the ancestry of everyone. bioRxiv: 458067. 2018.
- 66. Griffiths RC, Tavaré S. The age of a mutation in a general coalescent tree. Stoch Model. 1998;14:273–295.
- 67. Peng B, Kimmel M. simuPOP: a forward-time population genetics simulation environment. Bioinformatics. 2005;21:3686–3687. doi: 10.1093/bioinformatics/bti584.
- 68. Teshima KM, Innan H. mbs: modifying Hudson’s ms software to generate samples of DNA sequences with a biallelic site under selection. BMC Bioinformatics. 2009;10:166. doi: 10.1186/1471-2105-10-166.
- 69. Ruiz-Linares A, et al. Admixture in Latin America: Geographic Structure, Phenotypic Diversity and Self-Perception of Ancestry Based on 7,342 Individuals. PLoS Genet. 2014;10:e1004572. doi: 10.1371/journal.pgen.1004572.
- 70. Ward LD, Kellis M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res. 2011;40:D930–D934. doi: 10.1093/nar/gkr917.
- 71. MacArthur J, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2016;45:D896–D901. doi: 10.1093/nar/gkw1133.
- 72. Ruderfer DM, et al. Genomic dissection of bipolar disorder and schizophrenia, including 28 subphenotypes. Cell. 2018;173:1705–1715.e16. doi: 10.1016/j.cell.2018.05.046.
Data Availability Statement
Relate-estimated coalescence rates, allele ages, and selection p-values for the 1000 Genomes Project can be downloaded from https://zenodo.org/record/3234689.
Datasets used in the current study were obtained from the following URLs:
1000 Genomes Project phased dataset, https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html (13 Jan 2017); Genomic mask, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/ (20 Jul 2017); Human ancestral genome, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/ (20 Jul 2017); GWAS catalogue, https://www.ebi.ac.uk/gwas/api/search/downloads/full (9 Nov 2017); PGC GWAS study, https://www.med.unc.edu/pgc/results-and-downloads (23 Nov 2018); HaploReg, http://archive.broadinstitute.org/mammals/haploreg/data/haploreg_v4.0_20151021.vcf.gz (21 Oct 2017); GTEx eQTL, https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz (13 Jan 2019); UK Biobank GWAS summary statistics, http://www.nealelab.is/uk-biobank (4 Oct 2018); PopHumanScan, https://pophumanscan.uab.cat (13 Jan 2019).
The software Relate can be downloaded from https://myersgroup.github.io/relate under an Academic Use Licence.
External software used in the current study was downloaded from the following URLs:
ARGweaver, https://github.com/mdrasmus/argweaver (24 Jan 2017); RENT+, https://github.com/SajadMirzaei/RentPlus (2 Oct 2017); msprime, https://github.com/tskit-dev/msprime (22 Jul 2017); msmc, https://github.com/stschiff/msmc2 (14 Oct 2017); SMC++, https://github.com/popgenmethods/smcpp (14 Oct 2017); simuPOP, http://simupop.sourceforge.net/ (27 Jun 2018); mbs, http://www.sendou.soken.ac.jp/esb/innan/InnanLab/ (27 Jun 2018); SDS, https://github.com/yairf/SDS (27 Jun 2018); selscan, https://github.com/szpiech/selscan (31 Jul 2018); hapbin, https://github.com/evotools/hapbin (11 Dec 2018).