Abstract
The role of natural selection in shaping biological diversity is an area of intense interest in modern biology. To date, studies of positive selection have primarily relied on genomic datasets from contemporary populations, which are susceptible to confounding factors associated with complex and often unknown aspects of population history. In particular, admixture between diverged populations can distort or hide prior selection events in modern genomes, though this process is not explicitly accounted for in most selection studies despite its apparent ubiquity in humans and other species. Through analyses of ancient and modern human genomes, we show that previously reported Holocene-era admixture has masked more than 50 historic hard sweeps in modern European genomes. Our results imply that this canonical mode of selection has probably been underappreciated in the evolutionary history of humans and suggest that our current understanding of the tempo and mode of selection in natural populations may be inaccurate.
Subject terms: Molecular evolution, Population genetics
Through analyses of ancient and modern human genomes, the authors show that previously reported Holocene-era admixture has masked more than 50 historic hard sweeps in modern European genomes.
Main
The rapidly growing availability of genomic datasets provides a powerful resource that can be used to address a fundamental question in evolutionary biology, namely, the role of natural selection in shaping biological diversity1. The proliferation of genomic datasets has been accompanied by parallel developments in statistical methods for uncovering genetic signals of positive selection; however, although the power and precision of these statistical methods has continued to improve, they can be confounded by complex aspects of population history that are not modelled or remain unknown2. In particular, previous genomic studies of positive selection typically do not account for past phases of interpopulation mixing (that is, admixture), which can alter genomic signatures of positive selection and either mask these signals or lead to erroneous inferences about the underlying modes of selection3–6.
In the case of humans, a consistent finding has been the apparent paucity of classical ‘hard sweep’ signals in modern genomic datasets (that is, where a new beneficial mutation increases to 100% frequency in a population). This has resulted in suggestions that humans adapted to new environmental and sociocultural pressures through alternate modes of selection—for example, ‘soft sweeps’ (where the beneficial allele sits on multiple genetic backgrounds)7,8 or polygenic selection (subtle frequency shifts across numerous loci with small fitness effects)9,10—though the evidence for these processes has also been challenged5,11–13. The apparent ubiquity of admixture in human population history14–16 suggests instead that the absence of hard sweep signals in modern human genomes may be a consequence of the masking effects of historical admixture events. Moreover, growing evidence suggests that admixture events pervade the history of most natural populations17, which may have led to a potentially biased view of the modes of selection operating in nature3–5,18.
Ancient human genomes uncover historical hard sweeps
The recent emergence of population-scale ancient human genomic datasets from Eurasia provides a new means of searching for any hard sweeps that have occurred in the ancestors of modern Europeans, by applying selection scans to genomic datasets that predate known admixture events. Modern Europeans are largely composed of three distinct ancestries—that is, Western Hunter-Gatherers (WHG), Anatolian Early Farmers (Anatolian EF) and Steppe Pastoralists (Supplementary Text 1)—as a result of extensive admixture between these ancestral populations and their descendants that occurred from the Early Holocene to Bronze Age (~12–5 ka19–21). Accordingly, we reprocessed 1,162 ancient western Eurasian genomes (mostly high-density single-nucleotide polymorphism (SNP) scans; Methods) through a uniform bioinformatic pipeline, thereby mitigating potential processing artefacts, and grouped these samples into 18 distinct ancient populations that existed before and after documented Holocene admixture events based on their archaeological and genetic relationships (Fig. 1, Supplementary Figs. 1 and 2 and Supplementary Table 1). To ascertain the impact of Holocene admixture on modern European populations, we also analysed three modern European populations from the 1000 Genomes project22 (specifically, Utah residents with northern and western European ancestry (CEU), Finnish in Finland (FIN) and Toscani in Italy (TSI)) and included two other non-European 1000 Genomes populations for further comparison (one East Asian: Han Chinese in Beijing (CHB); and one African: Yoruba in Ibadan (YRI)).
We used SweepFinder2 (SF2)23,24 to scan the genomes of each ancient and modern population for regions exhibiting distorted allele frequency patterns characteristic of a fixed hard sweep (that is, an excess of low-frequency and high-frequency alleles25; see also Supplementary Text 2). By contrasting each window against site frequency spectrum (SFS) patterns from the overall genomic background, SF2 tests explicitly for the signature of fixed hard selective sweeps while controlling for demographic history, which often causes false positives in such tests2,23. A set of candidate sweep regions was determined by first assigning SF2 scores to a set of ~19,000 annotated human genes and subsequently aggregating neighbouring outlier genes into single sweep regions (Supplementary Fig. 3 and Methods). Importantly, testing of our sweep detection pipeline on simulated genomic datasets modelled on previously estimated Eurasian demographic history confirmed that it is robust to the effects of substantial demographic bottlenecks, including strong bottlenecks associated with the founding Eurasian and subsequent WHG populations (successive 12-fold and sixfold population size reductions, respectively; Fig. 2a), and also to missing data, ascertainment bias and alignment error (Fig. 2b, Supplementary Figs. 4–7 and Supplementary Methods).
Hard sweeps were common in ancient West Eurasian populations
In direct contrast to previous studies of modern human genomes26, we were able to identify large numbers of hard sweeps in the ancient West Eurasian populations (Supplementary Figs. 8 and 9 and Supplementary Table 2), identifying 57 with high confidence (estimated study-wide false discovery rate (FDR) of <11% under a realistic simulated Eurasian demography; Fig. 2; Methods and Supplementary Methods). None of these sweep signals were evident in the YRI population, and ~90% of the sweeps showed significantly inflated levels of genetic divergence relative to African populations using Fst-based tests (outFLANK method27; Methods and Supplementary Fig. 10), consistent with the underlying selection pressures postdating the separation of the founding Eurasian group from African populations.
Although the estimated strength of selection was sufficient for these 57 putatively selected loci to have swept to high frequencies in West Eurasian populations by the Early Holocene to Bronze Age era (that is, ~12–5 ka; 41 sweeps having s > 1%, with the largest value of s nearing 10%; Supplementary Table 2; Methods and Supplementary Text 6), only two sweeps (1:35.4–36.5 and 6:29.5–32.8; Supplementary Figs. 8 and 9 and Supplementary Table 2) were still identified as SF2 outliers in selection scans of modern European populations. This dramatic reduction in hard sweep signals was not an artefact of differences in either data quality or quantity between modern and ancient populations (Supplementary Figs. 11–14 and Supplementary Methods). Nor could their absence be explained by the degradation of the sweep signals through random allele frequency changes (that is, genetic drift) and new mutations, as hard sweep signals in Eurasian human genomes are expected to remain visible to our analytical pipeline for around 70,000 years in the absence of other effects (Supplementary Fig. 15, Supplementary Methods and refs. 23,28). Rather, the loss of the hard sweep signals is consistent with the reduction in sweep haplotype frequency below detection limits caused by the introduction of other orthologous haplotypes during the Holocene admixture phases. For the analytical pipeline used in this study, power drops below 50% if the sweep haplotype does not occur in at least 85% of the population at the time of sampling, and the sweep signal becomes effectively undetectable if less than half of the population has the sweep haplotype (Supplementary Fig. 15 and Supplementary Methods; see also ref. 29).
A historical hard sweep in the MHC-III region
Perhaps the most striking example of a previously unreported hard sweep in this study was provided by a ~1.5 Mb region overlapping the major histocompatibility complex class III (MHC-III) region on chromosome 6 that showed depleted genetic variation in the Anatolian EF population (Fig. 3). The MHC-III region includes numerous genes that encode antigen proteins involved in the immune response and has previously been identified as a recurrent target of balancing selection30–32 and also as a potential source of selection artefacts owing to high levels of local genetic variation hindering read alignment across this region. However, nearly all (98.6%) SNPs in the MHC-III sweep region were found to lie in regions of high mappability, consistent with levels observed in the other 56 sweeps (minimum, 96.7%; Supplementary Table 3), indicating that poor read alignment was not a problem for our sweeps in general.
The authenticity of the MHC-III hard sweep is further supported by the distinctive trough of genetic diversity observed across the affected region in the Anatolian EF samples, which is located between two recombination hotspots (that is, genomic regions characterized by locally inflated recombination rates) that also demarcate the sharp transition to background diversity levels in either side of the sweep region (Fig. 3). This correlation between genetic diversity and local recombination rates is unlikely to arise from read misalignment artefacts but can be closely approximated by a hard sweep model acting on a centrally located beneficial variant that incorporates the local recombination rates (Fig. 3 and Supplementary Text 3), pointing to historical positive selection as the more likely cause.
In contrast to the Anatolian EF population, the two other major contributors to modern European genetic ancestry (that is, WHG and Steppe) had SF2 scores and patterns of local genomic variation that were indistinguishable from neutral background levels in the same ~1.5-Mb MHC-III region (Supplementary Fig. 16). Holocene-era populations show patterns that are consistent with a progressive dilution of the sweep signal across this period, with a weak SF2 signal still evident in EF populations (which draw ancestry from both Anatolian EF and WHG) that returns to neutral background levels in subsequent Bronze Age and modern European populations following the introduction of additional Steppe ancestry (Fig. 3 and Supplementary Fig. 16). Taken together, the evidence strongly implies that the MHC-III region was a target of strong positive selection in the ancestors of Anatolian EF and that the underlying hard sweep signal became masked in descendant populations by Holocene-era admixture involving genetically diverged populations.
Admixture can obscure historical hard sweeps
The Anatolian EF-specific MHC-III hard sweep suggests that if Holocene admixture is responsible more generally for masking hard sweep signals in modern European genomes, then sweeps that were specific to only one of the admixing ancient Eurasian populations (for example, where selection is driven by a population-specific or local pressure) should be particularly prone to SF2 signal dilution from Holocene admixture events. Conversely, sweeps occurring closer in time to the Out-of-Africa migration are more likely to have signals that survive Holocene admixture events by virtue of being present at high frequencies in all admixing populations. To test whether sweep signal presence was dependent on the sweep antiquity, we inferred the approximate onset of the underlying selection pressure by manually scanning five moderate-to high-coverage Upper Palaeolithic individuals for evidence of the sweep haplotype—namely, Ust’-Ishim ~45 ka33, Kostenki14 ~37 ka34, GoyetQ-116 ~35 ka34, Věstonice16 ~30 ka34 and El Mirón ~18 ka34—assigning each sweep to one of five age classes based on the oldest sample in which the sweep haplotype was observed (Fig. 4, Methods and Supplementary Text 4). Concordant sweep age classifications were obtained using an alternate quantitative method (based on a set of diagnostic marker alleles; Supplementary Fig. 17 and Supplementary Methods), suggesting that our classifications should provide robust inferences for aggregated sets of sweeps from the same age class, even though individual sweep dates are likely to be less reliable.
Sweep presence across the 12 ancient Eurasian populations was broadly consistent with the inferred antiquity of the selection pressure (Fig. 4b). Sweeps inferred to have started by 35 ka (that is, observed in GoyetQ-116 or an older sample) were detected at consistent levels across ancient populations both before and after the Holocene admixture events (no significant differences in sweep detection rates; logistic regression P > 0.17 for all three sweep age categories; Methods and Supplementary Text 4). By contrast, sweep detection differed significantly across the ancient populations for more recently selected loci (logistic regression P < 0.002 for both age categories) and exhibited patterns consistent with historical selection acting on populations ancestral to a subset of the tested populations (Fig. 4b; noting that the more pronounced signal loss evident in modern Europeans may reflect further dilution from admixture events following the Bronze Age; Discussion). Similarly, sweeps starting by 35 ka were more likely to appear as partial sweeps in two previous studies of positive selection in modern European populations35,36 (66% versus 46% for sweeps dated less than 35 ka; Fig. 4c) and were also more frequently observed as outliers for a statistic (XP-EHH) that is sensitive to loci nearing fixation (25% versus 8%; Fig. 4c). These results demonstrate that admixture can sufficiently distort the genetic signals resulting from ancient fixed sweeps, often leading to haplotype patterns in admixed populations that are misinterpreted as resulting from recent and potentially ongoing selection28,35,37–39.
Evolutionary scenarios underlying sweep signal dilution
To further assess the plausibility of Holocene admixture events masking historical hard sweep signals in modern human genomes, we used population genetic simulations (SLiM3)40 to model selection (Supplementary Figs. 18–20; Methods and Supplementary Methods) within a plausible West Eurasian demographic model informed by ancient DNA studies41–43 (Fig. 2a and Supplementary Fig. 19). Although our simulations might not capture the full complexity of West Eurasian demographic history, the model has sufficient detail to provide general insights into the effect of admixture on signatures of hard sweeps in these lineages. Beneficial mutations were introduced on the Main Eurasian branch before (55 ka) and after (44 ka and 36 ka) the diversification of the founding Eurasian population (Fig. 5 and Supplementary Figs. 19 and 20), with selection tests performed on the three ancient source populations (that is, Anatolian EF and WHG, both sampled at 8.5 ka, and Steppe sampled at 5 ka) and three admixed European populations descending from two separate Holocene admixture phases44 that occurred at 8 ka (European EF, sampled at 7 ka) and 4.5 ka (Bronze Age Europeans, sampled at 4 ka, and Modern Europeans, sampled at 0 ka; Supplementary Text 1). All simulated datasets matched the sample sizes, SNP numbers, missing data and ascertainment bias observed in empirical data from the relevant populations (Methods).
We first investigated a model in which selection is active along all population branches that inherit the beneficial mutation until the 8 ka admixture event, at which point the selection pressure is relaxed (for example, owing to an environmental change or the emergence or introduction of a technological innovation proximal to the admixture event). Indeed, recent evidence suggests that spatiotemporal changes in selection pressures may have been common in recent human evolutionary history45–49. Our results clearly demonstrate that Holocene-era admixture can effectively mask historical sweep signals in the absence of any ongoing selection pressures: only the sweeps that precede the division of Eurasian lineages (that is, selection starting at 55 ka) could be detected with reasonable power both before and after the admixture events. By contrast, sweeps starting after the diversification of Eurasian lineages (that is, 44 ka and 36 ka) had slightly increased power in unadmixed populations that experienced the selection pressure (consistent with having more recent fixation times and less signal loss from drift), whereas power was negligible in all admixed European populations (Fig. 5 and Supplementary Fig. 20). As expected, the detection rates were strongly positively correlated with selection strength, whereas sample size had a more moderate positive association.
Sweep detection power remained high in admixed populations whenever sweeps predated the split between the source population lineages, despite one of the source populations (WHG) having power close to zero under all modelled scenarios (Fig. 5 and Supplementary Fig. 20). The low power observed for WHG is likely to have been a consequence of the relatively small effective population size41 exacerbating drift and causing rapid degradation of any fixed sweep signals (expected signal loss in 0.2 × Ne generations23 equates to ~11,000 years to loss in WHG assuming a generation time of 29 years50, compared with ~70,000 years on the main Eurasian branch). This result indicates that hard sweeps are reasonably robust to admixture-induced signal loss when the sweep signal has been eroded in one of the source populations owing to drift, provided that the other source populations have retained the sweep signal.
After amending the model to allow the selection pressure to persist following the Holocene admixture events, we observed a dramatic increase in the detection rate of sweeps postdating the subdivision of Eurasian populations (that is, selection onset at 44 ka and 36 ka), with power being particularly high for strongly selected loci (s = 10%) across all admixed European populations (power between 65 and 85%; Fig. 5 and Supplementary Fig. 20). Notably, Modern Europeans achieved detection power exceeding that observed for any of the three ancestral source populations (Anatolian EF, Steppe and WHG) in nearly all cases. This result implies that many of the historical sweep signals should have been present in modern European populations had the underlying selection pressure persisted after the Holocene admixture phase, suggesting an attenuation of the underlying selection pressure(s), or the influence of other factors that were not included in the simulations, following the European Bronze Age (Discussion).
The mutational basis of the Eurasian hard sweeps
Under canonical models of hard sweeps, beneficial mutations only emerge following the environmental shift marking the onset of the positive selection pressure, though recent theoretical and simulation-based work has shown that hard sweep signals are possible when selection acts instead on mutations segregating at the time of the environmental shift (that is, standing genetic variants; SGVs)51–56. Our results suggest that the underlying selection pressure(s) probably arose during the early stages of the Anatomically Modern Human occupation of Eurasia—for example, 16 (28%) and 44 (77%) of the 57 sweep haplotypes being observed in archaic samples dated at ~45 ka and ~35 ka, respectively (Fig. 4)—constraining the time in which de novo mutations could have arisen and suggesting that SGVs may have provided the mutational basis for many of our sweeps.
To examine this question in more detail, we used equations from ref. 57 to calculate the probability of a sweep arising from SGVs relative to de novo mutations, conditional on the fixation of the beneficial allele within a particular time interval under standard parameterizations (Supplementary Table 4 and Supplementary Text 5). Assuming that the environmental change triggering positive selection also coincided with a tenfold reduction in population size (that is, consistent with the Out-of-Africa bottleneck), we observed that fixation from SGVs was highly probable for positive selection strengths that matched the majority of our empirical estimates (s ~1%; Supplementary Table 2) when the temporal window in which fixation must occur was reasonably constrained (≤ 20 kyrs). By contrast, de novo variants dominated when purifying selection on SGVs was strong (s ≥ 1%) in the period preceding the environmental shift (Fig. 6 and Supplementary Fig. 21). Extending our calculations to include the probability that selection from SGVs produces a hard sweep signal57 (that is, only one of the beneficial haplotypes present at the time of the environmental shift ultimately fixes) revealed that hard sweeps are highly probable (>50%) regardless of the mutational input source, provided that SGVs have been deleterious in the period preceding the environmental change (Supplementary Figs. 6 and 21 and Supplementary Text 5). Although further distinguishing between the two modes of selection was hampered by the absence of prior information on several key parameters (for example, the strength of purifying selection on SGVs, the type and strength of epistasis and the distance to the new fitness optimum), our results imply that mutations underlying the hard sweeps were probably initially deleterious and reinforce previous findings showing that selection from both de novo mutations and SGVs have occurred in Eurasian evolutionary history53.
Discussion
Our analyses of >1,000 ancient West Eurasian genomes has uncovered strong evidence for 57 hard sweeps in Early Holocene to Middle Holocene populations that have been almost entirely erased from descendent populations in modern Eurasia. Most of these selected loci had probably swept to high frequencies well before the Holocene era; this is supported by the moderate-to-strong selection strengths that were inferred for the sweeps. These selection coefficients are comparable with the strongest currently known for human populations (that is, the LCT locus having s between ~2% and 6% in the recent history of European populations58,59) and suggest that such strong positive selection events have been much more common in recent human history than previously recognized.
Our empirical and simulation results implicate Holocene-era admixture as the primary factor attenuating these historical sweep signals, which has led to them being missed in previous studies or detected instead as empirical genome-wide outliers of haplotype-based selection statistics35,36. Such haplotype-based outliers are typically interpreted as ongoing partial sweeps resulting from recent selection28,35,37–39, as the underlying haplotype patterns decay quickly over time and become largely undetectable for selection starting more than 30,000 years ago in humans28,39. However, our analyses suggest that most (85%) of the 41 sweeps that overlap with a haplotype-based outlier were already under selection by 30 ka, implying that these signals are more likely to result from admixture-driven dilution of hard sweeps that mostly began before 30,000 years ago.
An intriguing implication arising from the simulations is that the selection pressure(s) underlying the sweeps may have eased during the Holocene period in some cases. This period marked the introduction of new technologies and diets, a stable warm climate and living conditions that introduced new selection pressures (for example, selection on the LCT gene to reduce the costs associated with milk consumption in adulthood59) and may have also reduced the intensity of historical selection pressures underlying the 57 hard sweeps. Alternatively, these patterns could have resulted from the introduction of new ancestry sources into Europe following the Bronze Age period60–62, which would have further diluted sweep signals and left insufficient time for their reappearance even if the selection pressure was still present. Indeed, this may explain why modern Tuscans and Finns had the fewest sweeps of any Eurasian population examined in this study, with both populations descending from post-Bronze Age mixing events that introduced distinct ancestry components (from Northern African60,61 and Siberian62 groups, respectively) that were not apparent in the other surveyed Eurasian populations (Fig. 4b).
Although our analyses point to well-known admixture events during the Holocene as the prime driver of the diluted sweep signals observed in modern European genomes, it is possible that the three populations directly ancestral to present-day Europeans (that is, Mesolithic hunter-gatherers, Anatolian Neolithic farmers and pastoralists from the Pontic–Caspian steppe) were also admixed to some degree63. However, the much stronger genetic differentiation observed between the three ancestral populations relative to the later Holocene European groups21 suggests that potential admixture events involving the three ancestral lineages were probably less influential or frequent than subsequent admixture phases in the Holocene21. This implies less perturbation of any underlying sweep signals, although we note that the occurrence of such mixing events would mean that historical hard sweeps were even more frequent than identified in our study.
In addition to masking historical sweeps in human populations, the obscuring effect of admixture might explain why species-wide selective sweep signals are rare in many species9,64 while being abundant in others65–67. Crucially, recently admixed populations often lack detectable signs of structure, in which case admixture will not be correctly accounted for in any subsequent selection tests. For example, modern European populations have been considered to be sufficiently genetically homogeneous for the purpose of selection scans8, despite ancient DNA studies revealing that Europeans have multiple diverged ancestry components21. Indeed, modern genomic data is often insufficient to establish past admixture events, with widely used principal component analysis (PCA) and ancestry decomposition methods being unable to detect historical admixture signals when suitable proxy populations for the admixing groups are lacking. Similarly, although admixture creates temporal variation in genomic coalescence rates, these patterns can be equally well explained by historical population size changes in a single continuous population68. Accordingly, species with little apparent population structure may be more susceptible to the confounding effects of admixture on selection scans than those in which structure is evident and can be directly accounted for, and this may partly explain why species with distinctive population structuring often show strong local selection signals (for example, Swedish Arabidopsis thaliana3, African Drosophila melanogaster69 and Microbotryum lychnidis-dioicae70), whereas genetically homogeneous taxa or populations tend to lack fixed hard sweeps but harbour abundant partial or soft sweep signals (for example, European humans8,35, North American D. melanogaster6,71 and M. silenes-dioicae70).
In accordance with previous work3,4,18, our results emphasize the importance of incorporating historical population structure and admixture events into the null models of selection tests. If this information is not available—for example, because DNA from ancestral source populations is lacking, the default scenario for most species—then the interpretation of historical selection signatures could be highly misleading and heavily weighted against the detection of historical hard sweep events. Although these factors imply that the extent of past hard sweep events has probably been underestimated in natural populations in general, we caution that they do not directly challenge results from previous studies proposing a substantial role for soft sweeps in the adaptive history of humans8 and other species71. Rather, we reiterate the conclusions of other recent work6,72 that advocate for improved modelling of complex but widespread evolutionary and demographic processes to achieve an unbiased understanding of the mode and tempo of adaptation in natural populations.
Methods
Population designation
All ancient individuals used in this study were assigned to historical populations based on published analyses of their genetic relationships in combination with details regarding their archaeological context, with the temporal and spatial variability between individuals being minimized where possible (Supplementary Figs. 1 and 2 and Supplementary Table 1). Our sample grouping resulted in 18 ancient populations occurring before and after the major Holocene admixture events that created the genetic landscape of modern Europe19,21 (Supplementary Text 1). Additional testing using alternate sets of samples for two ancient populations showed that sweep signal detection was reasonably robust to changes in population sample configuration (Supplementary Methods and Supplementary Figs. 13 and 14).
Data collection and processing
To produce a robust dataset and avoid potential bioinformatic batch effects, the raw sequence read data for 1,162 ancient genomic datasets (Supplementary Table 1) was retrieved from publicly available repositories (Short Read Archive: SRP029640, SRP057056; European Nucleotide Archive: ERP003900, PRJEB6622, PRJEB6272, PRJNA230689, PRJEB7618, PRJEB609, PRJEB9021, PRJEB8987, PRJEB9783, PRJEB11364, PRJEB11450, PRJEB1418, PRJEB13123, PRJEB11848, PRJEB14455, PRJEB22629, PRJEB12155, PRJEB22652, PRJEB23635, PRJEB24794, PRJEB29603) and processed through the following standardized pipeline. To minimize the risk of modern contamination, the forward and reverse reads of the paired-end reads were merged (collapsed) using fastp73, and only merged reads were retained (modern data are more likely to comprise large DNA fragments that do not collapse). All collapsed reads were filtered for potential residual adaptor sequences and chimaeras using Poly-X with fastp73. The retained filtered set of sequence reads were aligned to the human reference genome (h37d) using the Burrows-Wheeler Aligner v.0.7.15 (ref. 74). All mapped reads were sorted using SAMtools v.1.3 (ref. 75) and then realigned around insertions and deletions, and potential PCR duplicate reads were marked and removed using the Genome Analysis ToolKit v.3.5 (ref. 76).
Before variant calling, all remaining aligned reads were screened and base-calls were recalibrated for ancient DNA (aDNA) postmortem damage using mapDamage2 (ref. 77). To further limit the impact of postmortem damage on variant calling78, bamUtil79 was used to trim three base pairs from each of the 5′ and 3′ ends of each mapped read. From the resulting set of reads, pseudohaploid variants were called at the set 1240k capture SNPs80 found on the 22 autosomes, using a combination of SAMtools mpileup81 and sequenceTools (https://github.com/stschiff/sequenceTools). Pseudohaploidization of read data is a standard strategy in aDNA analyses, whereby a single read is randomly sampled at each prespecified SNP position80 to mitigate potential biases introduced by differences in coverage or postmortem damage between samples19. The 1240k capture was developed to minimize ascertainment in non-African populations and was used to generate data for most samples used in the study, whereby concentrating on the 1240k variants ensured a common and robust set of variants for the subsequent analyses. The pseudohaploid variant calls were converted from EIGENSOFT format82,83 to binary PLINK format using EIGENSOFT. PLINK v.1.9 (ref. 84,85) was used to assign samples to the predefined populations (Supplementary Table 1) and convert the variants to reference-polarized VCF files, with correct polarization being checked using BCFtools81. Finally, a custom Python script was used to generate the SFS input files for SF2 analysis (https://gist.github.com/yassineS/fe2712ad52d76460b927e3f391ea51f6).
Sweep scans
The SweepFinder2 composite likelihood ratio (CLR) statistic23,24 was computed across successive 1-kb intervals across all autosomes for each ancient and modern human population. The CLR statistic evaluates the evidence for hard selective sweeps in dynamically sized windows by calculating the expected SFS under a hard selective sweep conditional on the neutral SFS, assuming a certain selection coefficient and local recombination rate. The neutral SFS is based on the ‘background’ SFS calculated from the whole genome assuming that the influence of positive selection on the genome-wide SFS is negligible.
SF2 controls for genome-wide effects such as ascertainment bias and demographic history by allowing these processes to affect the expected SFS under neutrality (that is, the background SFS)86. Further, unlike many other selection methods, the assumptions on the input data for SF2 are suitable for the low coverage and ascertained nature of ancient DNA datasets. As it is only based on the spatial (genomic) pattern of allele frequencies but not on haplotype homozygosity or population differentiation, it is possible to detect selection without reference to a second population, calling genotypes or phasing haplotypes. Leveraging an empirical null model (that is, the background SFS) and a model-based alternative hypothesis, SF2 is both more powerful and more robust than alternate test statistics also based on deviations of the SFS from expectations under the standard neutral model (for example, Tajima’s D, Fay and Wu’s H)23,24.
Note that SF2 also has an option to detect sweeps based on local genomic reductions in diversity. However, we did not calculate this diversity-based metric as the accurate and unbiased estimation of diversity requires fully sequenced genomes, whereas our dataset consisted of an ascertained set of SNPs.
Outlier gene detection
Human gene annotations were obtained from the ENSEMBL database87 (genome reference version GRCh37), which was accessed using the R biomaRt package88,89 (v.2.36.1). Of the 24,554 annotated ‘genes’ on the biomaRt database, we removed any that were not annotated in the NCBI database (ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz) and also excluded those that lacked specific protein-based and RNA-based annotations (in the biomaRt transcript_biotype field). This resulted in a list of 19,603 genes, from which we removed 26 that did not contain any polymorphic sites in our datasets (all being situated in the most terminal areas of chromosomes), leaving 19,577 genes that were used in the subsequent analyses (Supplementary Table 2).
To obtain P values for each gene, we transformed the raw SweepFinder2 CLR scores to z scores using the following series of steps (Supplementary Fig. 3). For each population, all SweepFinder2 CLR scores were logarithmically transformed and assigned to each of the 19,577 genes by binning the transformed scores within the genomic boundaries of each gene. The gene boundaries were extended by 50 kb on either side to also capture cis-regulatory regions. As this typically resulted in several scores per gene, we took the maximum score to represent the evidence for a sweep involving that gene. Each gene score was corrected for gene length using a nonparametric standardization algorithm90,91, resulting in the gene scores having an approximately standard normal distribution (Supplementary Fig. 4). Finally, P values were calculated for all genes, and a q value correction88 was applied for each population. The q value is a Bayesian posterior estimate of the P value that accounts for the expected inflation of false positives due to multiple testing88, whereby a q value of 0.01 implies an FDR of 1% per population in this study.
Candidate sweep classification
Sweeps were identified by determining a set of outlier genes across all populations, which were classified into sweep regions according to (1) the distance between the midpoint of neighbouring pairs of outlier genes (the intergene distance) and (2) overlapping sweep regions between populations. Step 1 was performed independently for each population, whereby all outlier genes with midpoints that were less than a specific distance apart from the midpoint of a neighbouring outlier gene were collapsed into a single category. After generating the collapsed categories for each population, step 2 was applied to ensure that the sweep categories sharing at least one gene across different populations were considered as a single historical sweep.
We ran our sweep quantification pipeline at three q value thresholds (q < 0.01, 0.05 or 0.10; which imply FDR values of 1%, 5% and 10% per population, respectively) and three different intergene distances (midpoint distances less than 250 kb, 500 kb or 1 Mb). As expected, changing the q value had a large impact on the number of sweeps (ranging from ~50 for q < 0.01 to ~500 for q < 0.1), whereas changing the intergene midpoint distance had comparatively little impact overall, particularly at more stringent q value cutoffs (Supplementary Fig. 8). Based on these results, we decided to use the most stringent q value cutoff and the most liberal intergene distance to define a conservative set of candidate sweeps for all further analysis. However, because this stringent cutoff might lead to the removal of potentially causal genes in a sweep (which could have values slightly lower than 0.01), we first defined our sweeps based on the more permissive q < 0.1 threshold and then removed all sweeps that did not have at least one gene with q < 0.01. To further improve sweep determination, we removed populations with small sample sizes (Impact of sample size and missing data on sweep detection) from the sweep determination process, as previous analyses of modern genomes suggest that SF2 has little power to detect sweeps when the number of haploid genome copies being analysed is 10 or less29. Finally, to ensure that the sweeps being defined were all relevant to western Eurasian history, the modern populations from East Asia (CHB) and Africa (YRI) were also excluded from the sweep classification process. This strategy resulted in a total of 57 candidate sweeps that were used in subsequent analyses.
Impact of sample size and missing data on sweep detection
Previous results suggest that SF2 maintains high power to detect sweeps when the number of haplotypes being analysed is 10 or more for modern genomes29. To examine the impact of sample size on SF2 estimation on our ancient populations, we derived a measure of sample size that incorporates both pseudohaploidy and different levels of data missingness in our ancient samples, called the ‘effective’ sample size, neff. We calculated neff for each population as kn(1 − M), where k is the ploidy of each sample, n is the number of samples and M is the average proportion of missing sites at informative SNPs in that population. Consistent with results from modern data29, we found that the number of detected sweeps tended to be systematically lower for populations with neff values <10, regardless of the q value threshold used (Supplementary Figs. 8 and 12). When inspecting the distribution of P values for gene scores in each population, we observed large deviations from the idealized distribution as neff values decreased below 10 (Supplementary Figs. 4 and 5). These results indicate that populations with neff < 10 lack sufficient power to detect sweep signals, whereby only the 12 ancient populations with neff ≥ 10 were used to determine the candidate sweeps.
Selection strength inference
We inferred the selection strength, s, for each of our 57 candidate sweeps as s = r ln(2Ne)/α (from ref. 92), where Ne is the effective population size, r is the recombination rate and α is a composite selection parameter that is estimated for each sweep region by SF2 (ref. 23) (Supplementary Table 2). For each candidate sweep, we took α at the position of the largest SF2 CLR value from all ancient Eurasian populations. Recombination rates were estimated for each sweep using information from ref. 93, and Ne was taken as 10,000 (noting that the estimation of s is robust to changes in Ne owing to the logarithmic transformation). Notably, estimating s based on SF2 α estimates tended to result in slightly lower values than expected (Supplementary Methods and Supplementary Fig. 18), suggesting that we have probably systematically underestimated the strength of selection for the 57 observed sweeps.
Estimating the earliest evidence for selection
To estimate when the selection pressure(s) underlying each of the 57 sweeps may have first arisen, we manually inspected five Upper Palaeolithic Eurasian human samples with moderate-to-high-coverage genomes (Ust’-Ishim94, Kostenki14 (ref. 95), GoyetQ-116 (ref. 34), Věstonice34 and El Mirón34) for evidence of the sweep haplotype. The oldest sample with evidence for the sweep was taken as the origin of the selection pressure and therefore provided a coarse lower bound on the onset of selection. The sweep was called as present only if the full set of alleles observed in the sweep region was observed in the sample, noting that genotype calls were only possible for Ust’-Ishim (coverage >40×), with pseudohaploid calls being used for the other four samples. As some sweeps contained complex signals with multiple peaks—which may represent the combination of two or more neighbouring selection events into a single sweep region (see Supplementary Table 2 for a list of sweeps and Supplementary Data 1–57 (ref. 96) for the distribution of SF2 CLR scores for each sweep region across all tested populations)—we concentrated on alleles underlying the strongest signal in each sweep. The plots used to discriminate the sweep haplotypes for all 57 candidate sweeps are provided in Supplementary Data 58–114 (ref. 97). To test the robustness of these qualitative assessments, we compared our qualitative classifications with those obtained from a quantitative method. Details of this method and the comparison with qualitative results are provided in the Supplementary Methods.
Sweep detection rate relative to inferred onset of selection
We predicted that hard sweep signals would be more prone to admixture-mediated loss if the onset of the selection pressure postdated the initial diversification of the Eurasian founding populations, as these sweeps are more likely to be ‘local’ to a subset of the source populations contributing to modern European ancestry. Accordingly, we reasoned that sweeps of deep antiquity should exhibit consistent detection frequencies across all ancient populations used in this study, whereas more recent sweeps should show interpopulation variation that reflects their restriction to a subset of historical source populations. To examine this hypothesis, we quantified the proportion of sweeps present in each ancient population conditional on the inferred age of the sweep (Methods: Estimating the earliest evidence for selection) and then used the glm function in the R statistical language to test whether proportions differed significantly across ancient populations grouped into five broad categories (HG, EF, Steppe, LF and LNBA; Fig. 4). Specifically, for each sweep age class, we fitted a logistic regression model where the presence of the sweep was a binary dependent variable and population group was the sole independent variable and compared the fit of this model with the null model (dependent population variable removed; that is, all groups had the same proportion of sweeps) to compute P values. Note that modern European populations were not included in these tests, as our analyses indicated that these populations experienced further signal dilution following the Bronze Age (Supplementary Text 1). Our results were unchanged after repeating all logistic regression analyses with the Southern Caucasus LNBA population removed from the LNBA group (as this population showed different ancestry composition to other European LNBA populations; Supplementary Methods).
Fst-based selection tests
To further investigate the validity of the 57 candidate sweeps, we tested whether our sweeps were enriched with highly divergent SNPs amongst the 12 ancient Eurasian populations with sufficient power to reliably detect sweeps (Methods: Impact of sample size and missing data on sweep detection) with a modern African population (YRI). We calculated Fst for each of the ~1.1M ascertained SNPs using the standard Weir–Cockerham estimator98 and used OutFLANK27 to estimate the probability that each SNP was more divergent than expected under neutrality (based on estimating fitting a χ2 distribution to Fst values from putatively neutral SNPs). Importantly, OutFLANK is robust to nonequilibrium demographic models27, including rapid range expansions that are thought to have resulted in highly divergent alleles observed in modern human populations (that is, through allele surfing99). Applying the q value correction to the P values to control for multiple testing resulted in 29 of the 57 candidate sweeps having one or more SNPs with q < 0.05 and 55 candidate sweeps having at least one SNP with q < 0.20 (Supplementary Fig. 10). We then tested whether the SNPs found in each of the 57 candidate sweeps had significantly higher Fst values than background genome levels; 49 of the 57 sweeps had significantly elevated Fst values relative to the remaining background genome (P < 0.05; one-sided Wilcoxon rank-sum test; Supplementary Fig. 10), confirming that our sweeps were most likely to have been selected following the divergence of ancestral Eurasians from African populations and were unlikely to be artefacts caused by nonequilibrium demographic processes such as allele surfing.
Forward simulations
To assess the impact of Holocene admixture on historical hard sweep signals in modern human genomes, we used forward simulations (SLiM3)40 to model moderate-to-strong selection (selection coefficient s between 1% and 10%) within a plausible West Eurasian demographic model (Fig. 2a and Supplementary Fig. 19). Model parameters were largely taken from a recent study of Eurasian demographic history41, with further parameters for the Steppe population coming from two additional studies incorporating ancient Steppe samples42,43. Two Holocene-era admixture events were included, namely, a 50% contribution of WHG ancestry into the Main Eurasian branch at 8 ka and a further 33% contribution of Steppe (Yamnaya42) ancestry onto the same branch at 4.5 ka. These admixture events generated the typical modern European individual that derives approximately one-third of their ancestry from each of the Anatolian EF (modelled as descending from the Main Eurasian branch), WHG and Steppe populations19.
Beneficial mutations were introduced on the Main Eurasian branch at three different times: 55 ka, 44 ka and 36 ka (Supplementary Fig. 19). We investigated a model where the strength of selection (s either 1%, 2% or 10%) was constant until the present and another model where selection ceased following the 8 ka admixture event (s = 0 after 8 ka). Genomic regions of 5 Mb were simulated assuming mutation rates and recombination rates of 1 × 10−8 per generation per base pair, with the beneficial variant being introduced as a single mutation in the middle of the simulated region.
Diploid samples were generated for one modern population, sampled at the present, and five ancient populations: three populations that are ancestral to modern Europeans (Anatolian EF and WHG, sampled at 8.5 ka, and Steppe, sampled at 5 ka) and two admixed populations (Central Europe EF and Central Europe LNBA, sampled at 7 ka and 4 ka, respectively). Sample sizes matched those in empirical data, with 100 diploid samples taken for the modern Europeans. For each ancient population, we simulated missing data by randomly removing loci based on the site-specific missing data distribution of the relevant empirical samples, thereby replicating the population-specific data missingness patterns. We also reproduced the pseudohaploidy of the ancient samples by randomly selecting one allele at each site. In addition, because our analyses were limited to a set of ~1.1 million variants from the 1240k capture probes (Methods: Data collection and processing), which were selected based on heterozygosity in a set of sequenced reference individuals from both African and non-African populations100, we also sought to reproduce the ascertainment bias associated with these variants. To do this, we sampled two modern African samples and only retained sites that were heterozygous in either of the two African samples or at least one of two randomly chosen modern European samples, with all other sites being discarded.
Finally, in all simulations, the number of variant sites was downsampled to 2,000 positions in accordance with the average number of variants from 1240k capture probes found in a 5-Mb region. After removal of all nonpolymorphic sites, the SF2 statistic was computed at 5,000 positions evenly distributed over the 5-Mb region. To compute the neutral background SFS necessary for the SF2 analyses, we simulated 1,000 neutral replicates of the demographic model and extracted the SFS for each population after replicating all steps in our simulation pipeline. Two hundred replicates were generated for each selection scenario and used for power analyses and to estimate the false positive rate (FPR). The final set of selection simulations were conditioned on sweeps that had escaped the initial stochastic phase (where rare beneficial alleles are lost through drift) by omitting all simulations where the beneficial mutation never reached a frequency of at least 10%. More details on the statistical modelling procedure and full specification of the power analyses and FPR estimation are provided in the Supplementary Methods.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Supplementary information
Acknowledgements
We thank F. Racimo, I. Mathieson, W. Haak, J. Pritchard, N. Bean, B. Llamas, L. Weyrich, J. Breen, J. Soubrier, C. Franks, C. Miller, M. Brooks and R. Ward (deceased). We gratefully acknowledge the museums, archaeological collections, collectors and curators who were responsible for the original skeletal material and made samples available for genomic analysis. Funding was received from Australian Research Council Grants to A.C. (FL140100260), C.D.H. (DE180100883), and R.T. (DE190101069); the Australian National Database Service; Nectar Cloud; and the University of Adelaide Environment Institute, Bioplatforms Australia.
Author contributions
A.C., Y.S., R.T. and C.D.H. conceived the study; Y.S., R.T. and M.W. assembled the dataset; Y.S., R.T., A.J., C.D.H., G.G., M.C., A.R. and O.J. performed analyses; A.C. and C.D.H. supervised analyses; Y.S., R.T., A.J., C.D.H., S.G., C.T., A.R., O.J., J.S., J.T. and A.C. interpreted the results; and A.C., Y.S., R.T., A.J. and C.D.H. wrote the paper with input from all coauthors.
Peer review
Peer review information
Nature Ecology & Evolution thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
Allele frequency data and SweepFinder2 results for individual ancient populations101 can be accessed under 10.25909/6324956ee6ba6. Supplemental Data 1–57 (ref. 96) and 58–114 (ref. 97) are accessible under 10.25909/631595cfd2733 and 10.25909/631595df4f1b2, respectively.
Code availability
The custom script used to compute SFS from VCF files is available at https://gist.github.com/yassineS/fe2712ad52d76460b927e3f391ea51f6. All other custom R scripts used in the analysis of empirical and simulated data are available from the corresponding authors upon request.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Yassine Souilmi, Raymond Tobler, Angad Johar, Alan Cooper, Christian D. Huber.
Contributor Information
Yassine Souilmi, Email: yassine.souilmi@adelaide.edu.au.
Raymond Tobler, Email: raymond.tobler@adelaide.edu.au.
Angad Johar, Email: johar.angad@mayo.edu.
Alan Cooper, Email: alanjcooper42@gmail.com.
Christian D. Huber, Email: cdh5313@psu.edu
Supplementary information
The online version contains supplementary material available at 10.1038/s41559-022-01914-9.
References
- 1.Jensen JD, et al. The importance of the neutral theory in 1968 and 50 years on: a response to Kern & Hahn 2018. Evolution. 2018;112:2109. doi: 10.1111/evo.13650. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG. Recent and ongoing selection in the human genome. Nat. Rev. Genet. 2007;8:857–868. doi: 10.1038/nrg2187. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Huber CD, Nordborg M, Hermisson J, Hellmann I. Keeping it local: evidence for positive selection in Swedish Arabidopsis thaliana. Mol. Biol. Evol. 2014;31:3026–3039. doi: 10.1093/molbev/msu247. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Zheng Y, Wiehe T. Adaptation in structured populations and fuzzy boundaries between hard and soft sweeps. PLoS Comput. Biol. 2019;15:e1007426. doi: 10.1371/journal.pcbi.1007426. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Harris RB, Sackman A, Jensen JD. On the unfounded enthusiasm for soft selective sweeps II: examining recent evidence from humans, flies, and viruses. PLoS Genet. 2018;14:e1007859. doi: 10.1371/journal.pgen.1007859. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Garud NR, Messer PW, Petrov DA. Detection of hard and soft selective sweeps from Drosophila melanogaster population genomic data. PLoS Genet. 2021;17:e1009373. doi: 10.1371/journal.pgen.1009373. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Harris AM, Garud NR, DeGiorgio M. Detection and classification of hard and soft sweeps from unphased genotypes by multilocus genotype identity. Genetics. 2018;210:1429–1452. doi: 10.1534/genetics.118.301502. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Schrider DR, Kern AD. Soft sweeps are the dominant mode of adaptation in the human genome. Mol. Biol. Evol. 2017;34:1863–1877. doi: 10.1093/molbev/msx154. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Hernandez RD, et al. Classic selective sweeps were rare in recent human evolution. Science. 2011;331:920–924. doi: 10.1126/science.1198878. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Pritchard JK, Pickrell JK, Coop G. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr. Biol. 2010;20:R208–R215. doi: 10.1016/j.cub.2009.11.055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Sohail M, et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife. 2019;8:e39702. doi: 10.7554/eLife.39702. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Jensen JD. On the unfounded enthusiasm for soft selective sweeps. Nat. Commun. 2014;5:5281. doi: 10.1038/ncomms6281. [DOI] [PubMed] [Google Scholar]
- 13.Berg JJ, et al. Reduced signal for polygenic adaptation of height in UK Biobank. eLife. 2019;8:e39725. doi: 10.7554/eLife.39725. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Harris AM, DeGiorgio M. Admixture and ancestry inference from ancient and modern samples through measures of population genetic drift. Hum. Biol. 2017;89:21–46. doi: 10.13110/humanbiology.89.1.02. [DOI] [PubMed] [Google Scholar]
- 15.Stoneking M, Krause J. Learning about human population history from ancient and modern genomes. Nat. Rev. Genet. 2011;12:603–614. doi: 10.1038/nrg3029. [DOI] [PubMed] [Google Scholar]
- 16.Hellenthal G, et al. A genetic atlas of human admixture history. Science. 2014;343:747–751. doi: 10.1126/science.1243518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Moran BM, et al. The genomic consequences of hybridization. eLife. 2021;10:e69016. doi: 10.7554/eLife.69016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Kelley JL, Madeoy J, Calhoun JC, Swanson W, Akey JM. Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Res. 2006;16:980–989. doi: 10.1101/gr.5157306. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Haak W, et al. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature. 2015;522:207–211. doi: 10.1038/nature14317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Mathieson I, et al. The genomic history of southeastern Europe. Nature. 2018;555:197–203. doi: 10.1038/nature25778. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Lazaridis I, et al. Genomic insights into the origin of farming in the ancient Near East. Nature. 2016;536:419–424. doi: 10.1038/nature19310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Huber CD, DeGiorgio M, Hellmann I, Nielsen R. Detecting recent selective sweeps while controlling for mutation rate and background selection. Mol. Ecol. 2016;25:142–156. doi: 10.1111/mec.13351. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.DeGiorgio M, Huber CD, Hubisz MJ, Hellmann I, Nielsen R. SweepFinder2: increased sensitivity, robustness and flexibility. Bioinformatics. 2016;32:1895–1897. doi: 10.1093/bioinformatics/btw051. [DOI] [PubMed] [Google Scholar]
- 25.Stephan W. Selective sweeps. Genetics. 2019;211:5–13. doi: 10.1534/genetics.118.301319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Szpak, M., Xue, Y., Ayub, Q. & Tyler-Smith, C. How well do we understand the basis of classic selective sweeps in humans? FEBS Lett., 10.1002/1873-3468.13447 (2019). [DOI] [PubMed]
- 27.Whitlock MC, Lotterhos KE. Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of F(ST) Am. Nat. 2015;186:S24–S36. doi: 10.1086/682949. [DOI] [PubMed] [Google Scholar]
- 28.Sabeti PC, et al. Positive natural selection in the human lineage. Science. 2006;312:1614–1620. doi: 10.1126/science.1124309. [DOI] [PubMed] [Google Scholar]
- 29.Ferrer-Admetlla A, Liang M, Korneliussen T, Nielsen R. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol. Biol. Evol. 2014;31:1275–1291. doi: 10.1093/molbev/msu077. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Black FL, Hedrick PW. Strong balancing selection at HLA loci: evidence from segregation in South Amerindian families. Proc. Natl Acad. Sci. U. S. A. 1997;94:12452–12456. doi: 10.1073/pnas.94.23.12452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Brandt DYC, César J, Goudet J, Meyer D. The effect of balancing selection on population differentiation: a study with HLA genes. G3. 2018;8:2805–2815. doi: 10.1534/g3.118.200367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Bitarello BD, et al. Signatures of long-term balancing selection in human genomes. Genome Biol. Evol. 2018;10:939–955. doi: 10.1093/gbe/evy054. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Fu Q, et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature. 2014;514:445–449. doi: 10.1038/nature13810. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Fu Q, et al. The genetic history of Ice Age Europe. Nature. 2016;534:200–205. doi: 10.1038/nature17993. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Johnson KE, Voight BF. Patterns of shared signatures of recent positive selection across human populations. Nat. Ecol. Evol. 2018;2:713–720. doi: 10.1038/s41559-018-0478-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Pickrell JK, et al. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 2009;19:826–837. doi: 10.1101/gr.087577.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Günther T, Schmid KJ. Improved haplotype-based detection of ongoing selective sweeps towards an application in Arabidopsis thaliana. BMC Res. Notes. 2011;4:232. doi: 10.1186/1756-0500-4-232. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Ronen R, et al. Predicting carriers of ongoing selective sweeps without knowledge of the favored allele. PLoS Genet. 2015;11:e1005527. doi: 10.1371/journal.pgen.1005527. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006;4:e72. doi: 10.1371/journal.pbio.0040072. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Haller BC, Messer PW. SLiM 3: forward genetic simulations beyond the Wright-Fisher model. Mol. Biol. Evol. 2019;36:632–637. doi: 10.1093/molbev/msy228. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Kamm J, Terhorst J, Durbin R, Song YS. Efficiently inferring the demographic history of many populations with allele count data. J. Am. Stat. Assoc. 2019;115:1–16. doi: 10.1080/01621459.2019.1635482. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Damgaard P, et al. 137 ancient human genomes from across the Eurasian steppes. Nature. 2018;557:369–374. doi: 10.1038/s41586-018-0094-2. [DOI] [PubMed] [Google Scholar]
- 43.Jones ER, et al. Upper Palaeolithic genomes reveal deep roots of modern Eurasians. Nat. Commun. 2015;6:8912. doi: 10.1038/ncomms9912. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Turney CSM, Brown H. Catastrophic early Holocene sea level rise, human migration and the Neolithic transition in Europe. Quat. Sci. Rev. 2007;26:2036–2041. doi: 10.1016/j.quascirev.2007.07.003. [DOI] [Google Scholar]
- 45.Mathieson S, Mathieson I. FADS1 and the timing of human adaptation to agriculture. Mol. Biol. Evol. 2018;35:2957–2970. doi: 10.1093/molbev/msy180. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Zhang X, et al. Proc. Natl Acad. Sci. USA. 2021;118:e2020803118. doi: 10.1073/pnas.2020803118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Jagoda E, et al. Disentangling immediate adaptive introgression from selection on standing introgressed variation in humans. Mol. Biol. Evol. 2018;35:623–630. doi: 10.1093/molbev/msx314. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Yair S, Lee KM, Coop G. The timing of human adaptation from Neanderthal introgression. Genetics. 2021;218:iyab052. doi: 10.1093/genetics/iyab052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Souilmi Y, et al. An ancient viral epidemic involving host coronavirus interacting genes more than 20,000 years ago in East Asia. Curr. Biol. 2021;31:3504–3514.e9. doi: 10.1016/j.cub.2021.05.067. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Moorjani P, et al. A genetic method for dating ancient genomes provides a direct estimate of human generation interval in the last 45,000 years. Proc. Natl Acad. Sci. USA. 2016;113:5652–5657. doi: 10.1073/pnas.1514696113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Hermisson J, Pennings PS. Soft sweeps and beyond: understanding the patterns and probabilities of selection footprints under rapid adaptation. Methods Ecol. Evol. 2017;8:700–716. doi: 10.1111/2041-210X.12808. [DOI] [Google Scholar]
- 52.Höllinger I, Pennings PS, Hermisson J. Polygenic adaptation: from sweeps to subtle frequency shifts. PLoS Genet. 2019;15:e1008035. doi: 10.1371/journal.pgen.1008035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Peter BM, Huerta-Sanchez E, Nielsen R. Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genet. 2012;8:e1003011. doi: 10.1371/journal.pgen.1003011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Nakagome S, Hudson RR, Di Rienzo A. Inferring the model and onset of natural selection under varying population size from the site frequency spectrum and haplotype structure. Proc. Biol. Sci. 2019;286:20182541. doi: 10.1098/rspb.2018.2541. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Orr HA, Allen Orr H, Betancourt AJ. Haldane’s sieve and adaptation from the standing genetic variation. Genetics. 2001;157:875–884. doi: 10.1093/genetics/157.2.875. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Teshima KM, Coop G, Przeworski M. How reliable are empirical genomic scans for selective sweeps? Genome Res. 2006;16:702–712. doi: 10.1101/gr.5105206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Hermisson J, Pennings PS. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics. 2005;169:2335–2352. doi: 10.1534/genetics.104.036947. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Mathieson, I. Estimating time-varying selection coefficients from time series data of allele frequencies. Preprint at https://www.biorxiv.org/content/10.1101/2020.11.17.387761 (2020). [DOI] [PMC free article] [PubMed]
- 59.Burger J, et al. Low prevalence of lactase persistence in bronze age Europe indicates ongoing strong selection over the last 3,000 years. Curr. Biol. 2020;30:4307–4315.e13. doi: 10.1016/j.cub.2020.08.033. [DOI] [PubMed] [Google Scholar]
- 60.Antonio ML, et al. Ancient Rome: a genetic crossroads of Europe and the Mediterranean. Science. 2019;366:708–714. doi: 10.1126/science.aay6826. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Arauna LR, Hellenthal G, Comas D. Dissecting human North African gene-flow into its western coastal surroundings. Proc. Biol. Sci. 2019;286:20190471. doi: 10.1098/rspb.2019.0471. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Lamnidis TC, et al. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nat. Commun. 2018;9:5018. doi: 10.1038/s41467-018-07483-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Marchi N, et al. The genomic origins of the world’s first farmers. Cell. 2022;185:1842–1859.e18. doi: 10.1016/j.cell.2022.04.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Gossmann TI, et al. Genome wide analyses reveal little evidence for adaptive evolution in many plant species. Mol. Biol. Evol. 2010;27:1822–1832. doi: 10.1093/molbev/msq079. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Sattath S, Elyashiv E, Kolodny O, Rinott Y, Sella G. Pervasive adaptive protein evolution apparent in diversity patterns around amino acid substitutions in Drosophila simulans. PLoS Genet. 2011;7:e1001302. doi: 10.1371/journal.pgen.1001302. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Long Q, et al. Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nat. Genet. 2013;45:884–890. doi: 10.1038/ng.2678. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Bonhomme M, et al. Genomic signature of selective sweeps illuminates adaptation of Medicago truncatula to root-associated microorganisms. Mol. Biol. Evol. 2015;32:2097–2110. doi: 10.1093/molbev/msv092. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Mazet O, Rodríguez W, Grusea S, Boitard S, Chikhi L. On the importance of being structured: instantaneous coalescence rates and human evolution—lessons for ancestral population size inference? Heredity. 2016;116:362–371. doi: 10.1038/hdy.2015.104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Mousset S, et al. Evidence of a high rate of selective sweeps in African Drosophila melanogaster. Genetics. 2003;163:599–609. doi: 10.1093/genetics/163.2.599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Badouin H, et al. Widespread selective sweeps throughout the genome of model plant pathogenic fungi and identification of effector candidates. Mol. Ecol. 2017;26:2041–2062. doi: 10.1111/mec.13976. [DOI] [PubMed] [Google Scholar]
- 71.Garud NR, Messer PW, Buzbas EO, Petrov DA. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 2015;11:e1005004. doi: 10.1371/journal.pgen.1005004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Johri P, Stephan W, Jensen JD. Soft selective sweeps: Addressing new definitions, evaluating competing models, and interpreting empirical outliers. PLoS Genet. 2022;18:e1010022. doi: 10.1371/journal.pgen.1010022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–i890. doi: 10.1093/bioinformatics/bty560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
- 75.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Jónsson H, Ginolhac A, Schubert M, Johnson PLF, Orlando L. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics. 2013;29:1682–1684. doi: 10.1093/bioinformatics/btt193. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Kircher, M. in Ancient DNA: Methods and Protocols (eds Shapiro, B. & Hofreiter, M.) 197–228 (Humana, 2012).
- 79.Jun G, Wing MK, Abecasis GR, Kang HM. An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Res. 2015;25:918–925. doi: 10.1101/gr.176552.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Mathieson I, et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015;528:499–503. doi: 10.1038/nature16152. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. doi: 10.1093/bioinformatics/btr509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
- 83.Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Chang CC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Nielsen R, et al. Genomic scans for selective sweeps using SNP data. Genome Res. 2005;15:1566–1575. doi: 10.1101/gr.4252305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Kinsella RJ, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database. 2011;2011:bar030. doi: 10.1093/database/bar030. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat. 2003;31:2013–2035. doi: 10.1214/aos/1074290335. [DOI] [Google Scholar]
- 89.Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 2009;4:1184–1191. doi: 10.1038/nprot.2009.97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Iglewicz B, Hoaglin DC. How to Detect and Handle Outliers. Asq,; 1993. [Google Scholar]
- 91.Daub JT, et al. Evidence for polygenic adaptation to pathogens in the human genome. Mol. Biol. Evolution. 2013;30:1544–1558. doi: 10.1093/molbev/mst080. [DOI] [PubMed] [Google Scholar]
- 92.Durrett R, Schweinsberg J. Approximating selective sweeps. Theor. Popul. Biol. 2004;66:129–138. doi: 10.1016/j.tpb.2004.04.002. [DOI] [PubMed] [Google Scholar]
- 93.Bhérer C, Campbell CL, Auton A. Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nat. Commun. 2017;8:14994. doi: 10.1038/ncomms14994. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.Fu Q, et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature. 2014;514:445–449. doi: 10.1038/nature13810. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95.Seguin-Orlando A, et al. Genomic structure in Europeans dating back at least 36,200 years. Science. 2014;346:1113–1118. doi: 10.1126/science.aaa0114. [DOI] [PubMed] [Google Scholar]
- 96.Souilmi, Y. et al. Admixture has obscured signals of historical hard sweeps in humans. Supplementary data. figshare https://figshare.com/articles/figure/Supplementary_data_1-57/20365503 (2022). [DOI] [PMC free article] [PubMed]
- 97.Souilmi, Y. et al. Admixture has obscured signals of historical hard sweeps in humans. Supplementary data. figshare https://figshare.com/articles/figure/Supplementary_data_58-114/20365500 (2022). [DOI] [PMC free article] [PubMed]
- 98.Weir BS, Cockerham CC. Estimating f-statistics for the analysis of population structure. Evolution. 1984;38:1358–1370. doi: 10.1111/j.1558-5646.1984.tb05657.x. [DOI] [PubMed] [Google Scholar]
- 99.Hofer T, Ray N, Wegmann D, Excoffier L. Large allele frequency differences between human continental groups are more likely to have occurred by drift during range expansions than by selection. Ann. Hum. Genet. 2009;73:95–108. doi: 10.1111/j.1469-1809.2008.00489.x. [DOI] [PubMed] [Google Scholar]
- 100.Lu, Y., Patterson, N., Zhan, Y., Mallick, S. & Reich, D. Technical design document for a SNP array that is optimized for population genetics. https://reichdata.hms.harvard.edu/pub/datasets/humanOrigins/8_12_2011_Technical_Array_Design_Document.pdf (2011).
- 101.Souilmi, Y. et al. Admixture has obscured signals of historical hard sweeps in humans. Datasets. figshare https://adelaide.figshare.com/articles/dataset/Frequency_data_and_SweepFinder2_output/20837695 (2022). [DOI] [PMC free article] [PubMed]
- 102.Narasimhan VM, et al. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat. Commun. 2017;8:303. doi: 10.1038/s41467-017-00323-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Allele frequency data and SweepFinder2 results for individual ancient populations101 can be accessed under 10.25909/6324956ee6ba6. Supplemental Data 1–57 (ref. 96) and 58–114 (ref. 97) are accessible under 10.25909/631595cfd2733 and 10.25909/631595df4f1b2, respectively.
The custom script used to compute SFS from VCF files is available at https://gist.github.com/yassineS/fe2712ad52d76460b927e3f391ea51f6. All other custom R scripts used in the analysis of empirical and simulated data are available from the corresponding authors upon request.