Abstract
It is increasingly evident that natural selection plays a prominent role in shaping patterns of diversity across the genome. The most commonly studied modes of natural selection are positive selection and negative selection, which refer to directional selection for and against derived mutations, respectively. Positive selection can result in hitchhiking events, in which a beneficial allele rapidly replaces all others in the population, creating a valley of diversity around the selected site along with characteristic skews in allele frequencies and linkage disequilibrium among linked neutral polymorphisms. Similarly, negative selection reduces variation not only at selected sites but also at linked sites, a phenomenon called background selection (BGS). Thus, discriminating between these two forces may be difficult, and one might expect efforts to detect hitchhiking to produce an excess of false positives in regions affected by BGS. Here, we examine the similarity between BGS and hitchhiking models via simulation. First, we show that BGS may somewhat resemble hitchhiking in simplistic scenarios in which a region constrained by negative selection is flanked by large stretches of unconstrained sites, echoing previous results. However, this scenario does not mirror the actual spatial arrangement of selected sites across the genome. By performing forward simulations under more realistic scenarios of BGS, modeling the locations of protein-coding and conserved noncoding DNA in real genomes, we show that the spatial patterns of variation produced by BGS rarely mimic those of hitchhiking events. Indeed, BGS is not substantially more likely than neutrality to produce false signatures of hitchhiking. This holds for simulations modeled after both humans and Drosophila, and for several different demographic histories. These results demonstrate that appropriately designed scans for hitchhiking need not consider BGS’s impact on false-positive rates. 
However, we do find evidence that BGS increases the false-negative rate for hitchhiking, an observation that demands further investigation.
Keywords: background selection, forward simulation, hitchhiking, selective sweeps
The impact of natural selection on genetic diversity within and between species has been debated for decades (Kimura 1968, 1983; Gillespie 1984, 1991; Kern and Hahn 2018; Jensen et al. 2019). Perhaps the strongest evidence that selection influences the amount of diversity at linked neutral alleles comes from the correlation between diversity levels and recombination rates across the genome (Begun and Aquadro 1992; Smukowski and Noor 2011; McGaugh et al. 2012; Corbett-Detig et al. 2015). This observation is consistent with genetic hitchhiking, in which a beneficial mutation rapidly increases in population frequency and carries its genetic background along with it. These events, which are also referred to as selective sweeps, result in the complement of genetic diversity in the vicinity of the selected site being largely replaced by descendants of the chromosome(s) that acquired the adaptive mutation. The width of the resulting valley of diversity will depend in part on the recombination rate, as crossover events will allow linked variation to “escape” by shuffling alleles onto and off the set of sweeping chromosomes (Maynard Smith and Haigh 1974; Kaplan et al. 1989). The correlation between recombination rate and diversity can also be explained by background selection (BGS), wherein neutral alleles linked to a deleterious mutation are purged via negative (or purifying) selection unless they can escape via recombination (Charlesworth et al. 1993). Whether primarily due to hitchhiking, BGS, or—more likely—a combination of the two (Elyashiv et al. 2016; Booker and Keightley 2018), there is growing evidence that natural selection has a profound impact on the amount and patterns of diversity across the genome in a variety of species (Begun et al. 2007; Lohmueller et al. 2011; Langley et al. 2012; Corbett-Detig et al. 2015; Booker and Keightley 2018). 
Indeed, it appears that natural selection is in part responsible for the limited range of levels of genetic diversity observed across species (Lewontin 1974; Leffler et al. 2013; Corbett-Detig et al. 2015), although other forces are likely at play as well (Coop 2016).
There is ample reason to suspect that BGS may substantially reduce the amount of polymorphism genome-wide. For example, approximations have been derived for the expected percent reduction in diversity at a given site, termed B, due to negative selection acting on linked sites (Hudson and Kaplan 1994, 1995; Nordborg et al. 1996). The extent to which BGS affects diversity can thus be predicted from a genome annotation and estimated distribution of fitness effects (DFE) (e.g., McVicker et al. 2009). While there may be some uncertainty over the true DFE, such B-maps predict that BGS has a sizeable impact, removing an estimated ∼20% and ∼45% of diversity on the autosomes in humans and Drosophila, respectively (McVicker et al. 2009; Comeron 2014).
More controversial is the potential role of positive selection in shaping the landscape of diversity across the genome (Stephan 2010). A number of approaches exist for detecting positive selection in population genomic data. For example, variants of the McDonald–Kreitman test, which searches for an excess of nonsynonymous divergence between species, have found that in many organisms a large fraction of amino acid substitutions are beneficial (Smith and Eyre-Walker 2002; Charlesworth and Eyre-Walker 2006; Enard et al. 2016; Galtier 2016). Efforts have also been made to fit genome-wide parameters of recurrent hitchhiking models to population genetic data in Drosophila, in some cases suggesting appreciable rates of hitchhiking events (Andolfatto 2007; Jensen et al. 2007; Li and Stephan 2006). An alternative approach to assess the frequency of adaptive substitutions is to directly search for recent selective sweeps. For this reason, and because identifying hitchhiking may provide clues about recent adaptations and selective pressures, a large number of methods for locating selective sweeps in the genome have been devised (Hudson et al. 1994; Fay and Wu 2000; Kim and Stephan 2002; Sabeti et al. 2002; Kim and Nielsen 2004; Nielsen et al. 2005; Voight et al. 2006; Lin et al. 2011; Ronen et al. 2013; Ferrer-Admetlla et al. 2014; Pybus et al. 2015; Schrider and Kern 2016; Mughal and DeGiorgio 2018). Nevertheless, detecting signatures of hitchhiking remains a major challenge. For example, it is well known that demographic events such as population bottlenecks can mirror selective sweeps (Simonsen et al. 1995; Jensen et al. 2005; Nielsen et al. 2005) and the feasibility of detecting hitchhiking in the presence of nonequilibrium demography remains hotly debated (Harris et al. 2018; Schrider and Kern 2018). It has also been suggested that BGS could be mistaken for selective sweeps (e.g., Comeron et al. 2012; DeGiorgio et al. 2016), and it is this possibility that we investigate here.
Intuitively, one may expect BGS to resemble hitchhiking because both forces can create localized reductions of polymorphism. Moreover, BGS is a very flexible model in part because of the astronomical number of possible arrangements of selected sites across a chromosome; it is straightforward to construct a BGS scenario that somewhat mirrors the valley of diversity caused by hitchhiking by placing a large cluster of selected sites in a region flanked by vast stretches of unselected sites (e.g., Mughal and DeGiorgio 2018). At first blush this may suggest that distinguishing between hitchhiking and BGS should be extremely challenging [reviewed in Stephan (2010)]. However, in practice we often know (with some degree of uncertainty) where selected sites reside in genomes with high-quality annotations of genes and conserved noncoding elements (CNEs); such information is used to create the B-maps alluded to above. Thus, rather than focusing on the most pessimistic scenario in which BGS could mirror selective sweeps, it is possible to ask how often BGS would be expected to actually mirror hitchhiking in real genomes. In addition, hitchhiking events affect diversity in ways other than just removing polymorphisms: they can also dramatically skew the site frequency spectrum toward low- and high-frequency-derived alleles (Braverman et al. 1995; Fay and Wu 2000) and increase linkage disequilibrium (LD) on either flank of the selected site while reducing LD between polymorphisms on opposite sides of the sweep site (Kim and Nielsen 2004). While BGS may also influence these aspects of polymorphism, such effects may be subtle (Charlesworth et al. 1995; Tachida 2000; Zeng 2013). Thus, it is unclear whether BGS will resemble hitchhiking when additional features of genetic variation are examined.
With this in mind, here we examine the separability of BGS and hitchhiking models via simulations of large genomic regions summarized by a number of statistics capturing the amount of nucleotide and haplotypic diversity, the shape of the site frequency spectrum, and patterns of LD. Our approach is to simulate BGS scenarios designed to match annotated genomes with respect to their locations of selected sites and estimated DFEs, and to compare the resulting patterns of diversity to those expected under selective sweeps. In addition to qualitative comparisons of these models using various summaries of diversity, we also use a classification approach to ask how often BGS simulations are mistaken for hitchhiking events, as this is informative about how frequently realizations of these two models will resemble one another. We first examine simulations modeled after the human genome, including both equilibrium and nonequilibrium demographic histories. We then consider the Drosophila melanogaster genome, which has both a much larger density of selected sites and a different estimated DFE. We conclude with a discussion of the implications of our results for efforts to detect positive selection in the face of potentially widespread BGS.
Materials and Methods
Annotation data
We downloaded annotation data from the University of California Santa Cruz (UCSC) Table Browser (Karolchik et al. 2004) for both the human (Lander et al. 2001) and D. melanogaster (Adams et al. 2000) genome assemblies (GRCh37/hg19 and release 5/dm3 coordinate spaces, respectively; all data accessed on December 21, 2018). These data included refSeq protein-coding gene coordinates for both genomes and phastCons elements (Siepel et al. 2005) conserved across vertebrates and insects, respectively, as well as the locations of gaps within both genome assemblies. The human genetic map from Kong et al. (2010) was also obtained from this resource. In addition to data from the UCSC Table Browser, we used the D. melanogaster genetic map from Comeron et al. (2012).
Overview of simulation strategies
We used four different simulation strategies to assess the similarities between BGS and hitchhiking models. These included: (1) forward simulations of various scenarios of BGS in moderately sized chromosomal windows (e.g., 1 cM in humans); (2) forward simulations of BGS in larger chromosomal regions; (3) coalescent simulations of selective sweeps used both to qualitatively compare with BGS and to train a classifier to more formally quantify the similarity of BGS and hitchhiking (by asking how often simulations of BGS are misclassified as selective sweeps); and (4) forward simulations containing both BGS and selective sweeps, used for assessing the extent to which the signatures of hitchhiking events are weakened in the presence of BGS.
Forward simulations of BGS
We used fwdpy11 version 0.1.4 (Thornton 2014) to perform forward simulations of 1.1-Mb regions modeled after human populations and 110-kb regions modeled after D. melanogaster. In all simulations with BGS, 75% of mutations within selected regions (either exons or CNEs) were deleterious (i.e., the selection coefficient, s, was drawn from the appropriate DFE), while the remaining 25% were selectively neutral. The selection coefficients for deleterious mutations were γ-distributed: the DFE for deleterious mutations in humans had a mean of −0.030 and shape parameter of 0.206 [as estimated by Boyko et al. (2008)], and for Drosophila the mean and shape were −0.000133 and 0.35, respectively (Huber et al. 2017). We simulated populations under four different scenarios (Figure 1): (1) no selection; (2) a scenario we refer to as central BGS, wherein the central ∼5% of the simulated region is a coding sequence; (3) a scenario where each simulated replicate is modeled after a randomly selected genomic region as described below (we refer to this scenario as real BGS because it should more accurately model BGS in real genomes than scenario 2); and (4) a scenario identical to 3 but where the selection coefficients in CNEs are 10-fold lower than in coding regions, although the DFEs have the same shape (real BGS–weak CNE). Each of our simulated scenarios contained a single, fixed dominance coefficient, h, for all deleterious mutations. We simulated replicates of each real BGS scenario (scenarios 3 and 4 above) with dominance values of 0, 0.25, 0.5, and 1.0. The fitness values of homozygous wild-type, heterozygous, and homozygous mutant individuals were 1, 1 − hs, and 1 − s, respectively, where s here denotes the magnitude of the selection coefficient. Unless otherwise noted, we report results from simulations with dominance of 0.25 because deleterious mutations may often be partially recessive (García-Dorado and Caballero 2000; Peters et al. 2003; Agrawal and Whitlock 2011), although our results do not appear to change qualitatively with different dominance values as discussed in the Results. All forward simulations began with a burn-in period of 10N generations, where N is the ancestral population size. Although this may not have been a sufficient burn-in period to ensure that all lineages coalesced normally in the ancestral population, the strong concordance between the mean values of summary statistics calculated from our neutral forward and coalescent simulations (e.g., Figure 3) implies that the burn-in duration did not substantially impact our results. At the end of each simulation we randomly sampled 100 chromosomes from the population.
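The mutation model described above can be illustrated with a minimal sketch. This is not the fwdpy11 configuration used for the simulations; it uses only the Python standard library, the function names are hypothetical, and it assumes that the γ distribution is parameterized by its mean and shape, that s in the fitness scheme is a magnitude, and that fitness is floored at zero for the rare draws with |s| > 1.

```python
import random

def draw_selection_coefficient(mean_s, shape, p_deleterious=0.75, rng=random):
    """Return s for a new mutation at a selected site (0.0 means neutral).

    mean_s is the (negative) mean of the deleterious DFE and shape is the
    gamma shape parameter. Assumption: mean = shape * scale.
    """
    if rng.random() >= p_deleterious:
        return 0.0  # the remaining 25% of mutations are neutral
    scale = abs(mean_s) / shape
    return -rng.gammavariate(shape, scale)

def genotype_fitness(s_magnitude, h, n_copies):
    """Fitness of a genotype carrying 0, 1, or 2 copies: 1, 1 - hs, 1 - s."""
    fitnesses = (1.0, 1.0 - h * s_magnitude, 1.0 - s_magnitude)
    # Assumption: floor at zero, since a gamma draw can occasionally exceed 1.
    return max(0.0, fitnesses[n_copies])

# Example with the human DFE parameters quoted above (Boyko et al. 2008).
random.seed(1)
human_draws = [draw_selection_coefficient(-0.030, 0.206) for _ in range(20000)]
```

With the default dominance of 0.25, a heterozygote for a mutation with |s| = 0.03 has fitness 1 − 0.25 × 0.03 = 0.9925 under this scheme.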
For each replicate of our real BGS simulations, we modeled a random genomic window of the appropriate size for the two species. This was done by first selecting the endpoint of the window, which we constrained to be a multiple of 100 kb for humans and 10 kb for Drosophila. Window locations were drawn with replacement, and all possible locations across the genome had equal probability of being selected, although windows with ≥75% of positions in assembly gaps were disallowed. The simulation replicate was then modeled after the selected region by taking the locations of annotated exons and phastCons elements within the region and allowing deleterious mutations to occur at these sites only, i.e., all mutations outside of these elements were neutral. Our neutral simulations followed the same procedure as the real BGS simulations for randomly selecting a region to model but used only the region’s recombination landscape.
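The window-selection step can be sketched as follows. This is an illustrative reconstruction rather than the code used in the study: the function names, the half-open coordinate convention, and the assumption that gap intervals do not overlap are all ours. Filtering out disallowed windows first and then choosing uniformly gives every allowed location equal probability, and repeated calls draw with replacement.

```python
import random

def gap_fraction(start, end, gaps):
    """Fraction of the half-open window [start, end) covered by gaps.

    Assumption: gaps is a list of non-overlapping (gap_start, gap_end) pairs.
    """
    covered = sum(max(0, min(end, g_end) - max(start, g_start))
                  for g_start, g_end in gaps)
    return covered / (end - start)

def allowed_windows(chrom_lengths, gaps_by_chrom, window_size, step,
                    max_gap_fraction=0.75):
    """List every window whose endpoint is a multiple of step and whose
    gap content is below max_gap_fraction."""
    windows = []
    first_end = -(-window_size // step) * step  # smallest multiple >= size
    for chrom, length in chrom_lengths.items():
        gaps = gaps_by_chrom.get(chrom, [])
        for end in range(first_end, length + 1, step):
            start = end - window_size
            if gap_fraction(start, end, gaps) < max_gap_fraction:
                windows.append((chrom, start, end))
    return windows

def draw_window(windows, rng=random):
    """Draw one window uniformly; repeated calls sample with replacement."""
    return rng.choice(windows)

# Toy genome (hypothetical coordinates): one 1000-bp chromosome with a gap.
toy_windows = allowed_windows({"chrA": 1000}, {"chrA": [(0, 400)]},
                              window_size=200, step=100)
```

For the human simulations the corresponding call would use a window size of 1,100,000 and a step of 100,000, and for Drosophila 110,000 and 10,000.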
For our human simulations, we used two different demographic histories: a constant-sized population with Ne = 10,000, and the European population size history estimated by Tennessen et al. (2012); the latter model contains two successive population contractions followed by a period of exponential growth and then a phase of more rapid growth continuing until the present. For Drosophila, we followed the three-epoch demographic model estimated by Sheehan and Song (2016), wherein the population experiences a protracted but moderate bottleneck followed by a nearly full recovery. Note that all of these demographic histories are single-population models with no gene flow. Our average mutation and recombination rates (μ and r, respectively) in humans were 1.2 × 10⁻⁸ (Kong et al. 2012) and 1.0 × 10⁻⁸, while in Drosophila these rates were set to 5 × 10⁻⁹ (Schrider et al. 2013; Assaf et al. 2017) and 2.3 × 10⁻⁸, respectively [based on Comeron (2014)]. To allow the mutation rate to vary across simulated replicates, for each simulation we drew the rate uniformly from a range spanning a full order of magnitude and centered around the specified mean; thus, mutation rates varied considerably among replicates but were constant across the simulated region within each replicate. In the interest of computational tractability we reduced our population sizes and simulation durations (in generations) 10-fold and 100-fold for the human and Drosophila simulations, respectively. Concordantly, mutation rates, recombination rates, and selection coefficients were increased by the same factor so that θ = 4Nμ, ρ = 4Nr, and α = 2Ns were unaffected by this rescaling.
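The rescaling argument can be checked numerically with a short sketch. The function names are hypothetical, and the interpretation of the mutation-rate range is an assumption: we take "a full order of magnitude centered around the mean" to be a uniform distribution on [a, 10a] whose midpoint equals the mean, so that a = 2μ/11.

```python
import random

def rescale(N, mu, r, s, factor):
    """Shrink N by `factor` and inflate per-generation rates by the same
    factor, leaving theta = 4*N*mu, rho = 4*N*r, alpha = 2*N*s unchanged."""
    return N / factor, mu * factor, r * factor, s * factor

def draw_mutation_rate(mean_mu, rng=random):
    """Uniform draw on [a, 10a] with midpoint mean_mu (assumed reading)."""
    low = 2.0 * mean_mu / 11.0
    return rng.uniform(low, 10.0 * low)

# Human parameters from the text, rescaled 10-fold.
N, mu, r, s = 10000, 1.2e-8, 1.0e-8, 0.03
N2, mu2, r2, s2 = rescale(N, mu, r, s, factor=10)

theta_before, theta_after = 4 * N * mu, 4 * N2 * mu2
rho_before, rho_after = 4 * N * r, 4 * N2 * r2
alpha_before, alpha_after = 2 * N * s, 2 * N2 * s2
```

Because each compound parameter is a product of N and a per-generation rate, dividing one factor and multiplying the other by the same constant cancels exactly; only the number of generations that must be simulated changes.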
Simulations of BGS in larger chromosomal regions
The simulation strategy above generated thousands of replicates of 1.1 Mb and 110 kb in length for humans and Drosophila, respectively. Because the impact of BGS can extend beyond these distances, we also simulated larger chromosomes of length 12.1 and 1.21 Mb in humans and Drosophila, respectively. For these simulations, we used the approach of the real BGS and real BGS–weak CNE scenarios described above, in which the locations of selected sites are based on randomly selected 12.1- and 1.21-Mb regions of the human and Drosophila genomes, respectively. This approach allowed us to examine the impact of BGS on sweep detection when the region being examined is affected by both proximal and distal negatively selected sites spread across a chromosome. These simulations were carried out for each combination of demographic model and DFE described above.
Coalescent simulations of hitchhiking
Our goal was to compare the results of the BGS simulations described above to recent hard and soft selective sweeps. To rapidly simulate positive selection while conditioning on fixation of the adaptive allele, we used the coalescent simulator discoal (Kern and Schrider 2016). These simulations used the same demographic histories and average values of locus-wide θ and ρ as the forward simulations for BGS above. Again, θ varied uniformly across an order of magnitude from replicate to replicate. Rather than following a particular genetic map, ρ was drawn from a truncated exponential with a maximum value fixed to three times the mean; larger values of ρ require more memory and therefore sometimes cause the simulation to crash. This strategy allowed for variation in recombination rate while skewing toward lower rates. For each demographic history, we simulated 4000 examples of neutral evolution, hard sweeps occurring at the center of each of 11 adjacent equally sized windows partitioning the simulated chromosome, and soft sweeps at the center of each window. Our soft sweeps consisted of selection on a previously neutral allele that is segregating at a specified frequency at a time at which it becomes beneficial and sweeps to fixation (Hermisson and Pennings 2005), rather than the alternative model of soft sweeps from recurrent adaptive mutations (Pennings and Hermisson 2006). Note that this model does not ensure that multiple independent lineages that harbor the adaptive allele at the onset of selection will participate in the sweep (i.e., it is possible that all lineages but one may go to extinction, making the outcome somewhat more similar to selection on a de novo mutation). 
We also note that an alternative definition of hard and soft sweeps is commonly used in the literature, where a sweep is defined as hard if all sweeping lineages trace their ancestry to a single individual at the onset of selection and defined as soft otherwise (e.g., Hermisson and Pennings 2017). However, here we simply equate hard sweeps with selection on a de novo mutation and soft sweeps with selection on standing variation. For each hard- and soft-sweep replicate, the selection coefficient of the beneficial mutation was drawn uniformly between 0.0001 and 0.05, while the initial selected frequency for soft sweeps was drawn between 0 and 0.05. The fixation time for each sweep was randomly chosen between 0 and 200 generations ago, thereby modeling relatively recent selective sweeps (e.g., completing within the last ∼5000 years in humans). As with our forward simulations, our sample size was set to 100 haploid genomes.
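The parameter draws for the sweep simulations can be sketched as below. This is not discoal's implementation: the rejection sampler is one simple way to obtain a truncated exponential, the function names are hypothetical, and we assume that the initial frequency and fixation time are drawn uniformly like the selection coefficient. The value ρ = 440 is only an illustrative locus-wide mean (4Nr × length for N = 10,000, r = 1.0 × 10⁻⁸, and 1.1 Mb).

```python
import random

def draw_truncated_exponential(mean, max_multiple=3.0, rng=random):
    """Exponential draw with the given (untruncated) mean, rejecting values
    above max_multiple * mean. Truncation skews draws toward lower rates."""
    upper = max_multiple * mean
    while True:
        value = rng.expovariate(1.0 / mean)
        if value <= upper:
            return value

def draw_sweep_parameters(soft, rng=random):
    """Draw per-replicate sweep parameters from the ranges given in the text."""
    params = {
        "s": rng.uniform(0.0001, 0.05),           # selection coefficient
        "fixation_time": rng.uniform(0.0, 200.0), # generations ago
    }
    if soft:
        # Frequency of the standing variant at the onset of selection.
        params["initial_frequency"] = rng.uniform(0.0, 0.05)
    return params

random.seed(3)
rho_draws = [draw_truncated_exponential(440.0) for _ in range(5000)]
hard_example = draw_sweep_parameters(soft=False)
soft_example = draw_sweep_parameters(soft=True)
```

Note that rejecting the upper tail lowers the realized mean below the nominal one, which is consistent with the stated intent of skewing toward lower recombination rates.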
Forward simulations with hitchhiking and BGS
We also sought to compare the influence of hitchhiking on diversity in regions with and without BGS. Therefore, we used forward simulations to model the real BGS scenario described above while also conditioning on the recent fixation of an adaptive mutation near the center of the region. The results of these simulations were then compared with simulated hitchhiking events on otherwise neutrally evolving chromosomes. The procedure for this simulation was as follows: first the simulation runs up until a randomly selected time drawn from ∼U(201, 5000) generations before the present, and the simulation state is saved. Next, the selected phase begins either by introducing a de novo beneficial mutation at the center of a randomly selected chromosome in the case of hard sweeps, or by changing the selection coefficient of the polymorphism nearest to the center of the chromosome having a frequency within a specified range in the case of a soft sweep. If the selected mutation is lost or does not reach fixation by the end of the simulation, the simulation is restarted from the point at which the state was previously saved, and this process repeats until fixation is achieved.
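The save-and-restart logic for conditioning on fixation can be illustrated with a toy haploid Wright–Fisher model. This is a deliberately simplified stand-in for the forward simulations described above (no recombination, no linked sites, no BGS), and all names are hypothetical. The "saved state" here is just the number of copies of the focal allele at the onset of selection: 1 for a de novo mutation (hard sweep) or more for a standing variant (soft sweep).

```python
import random

def next_count(count, N, s, rng):
    """One haploid Wright-Fisher generation for an allele with advantage s."""
    weighted = count * (1.0 + s)
    p = weighted / (weighted + (N - count))
    return sum(rng.random() < p for _ in range(N))

def sweep_conditioned_on_fixation(start_count, N, s, generations_left,
                                  max_attempts=10000, rng=random):
    """Restart from the saved state until the selected allele fixes.

    Returns (attempts_used, generations_to_fixation). The random number
    generator is NOT reset between attempts, so each restart explores a
    different trajectory from the same starting state.
    """
    for attempt in range(1, max_attempts + 1):
        count = start_count  # restore the saved state
        for generation in range(1, generations_left + 1):
            count = next_count(count, N, s, rng)
            if count in (0, N):
                break  # allele lost or fixed
        if count == N:
            return attempt, generation
    raise RuntimeError("no fixation within max_attempts")

random.seed(11)
attempts, generations = sweep_conditioned_on_fixation(
    start_count=1, N=50, s=0.2, generations_left=500)
```

Because most new beneficial mutations are lost by drift, the number of restarts depends strongly on s and on the starting frequency, which is why some parameter combinations are over-represented among accepted replicates.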
We sought to use the same uniform distributions as the coalescent simulations described above for the selection coefficient, fixation time, and initial selected frequency (in the case of soft sweeps). Because some combinations of these parameters are more likely than others to yield simulation replicates matching our acceptance criteria, we downsampled our set of completed replicates after splitting them into bins that were equally sized with respect to the fraction of each parameter range encompassed. For hard sweeps, we split our parameter ranges for the selection coefficient and fixation time into thirds, and drew an equal number of replicates from each bin in the resulting two-dimensional grid of nine parameter ranges. For soft sweeps we split our three parameter ranges (selection coefficient, fixation time, and initial selected frequency) into halves, and drew an equal number of replicates from each bin in the three-dimensional grid of eight parameter ranges. The resulting distribution of accepted replicates was then somewhat similar to that produced under our coalescent simulations, although our binning procedure was fairly coarse and also reduced our total number of replicates considerably. We performed this procedure for both hard and soft sweeps with and without BGS under the Tennessen et al. (2012) model of European demography, using the same rescaling factor as above (0.1). The resulting numbers of replicates were as follows: 342 for hard sweeps on a neutrally evolving background, 104 for soft sweeps on a neutrally evolving background, 378 for hard sweeps on a background experiencing BGS, and 128 for soft sweeps with BGS.
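The binned downsampling described above can be sketched as follows. The function names and the dictionary-based replicate format are illustrative assumptions, not the study's code. Each parameter range is split into equal-width bins, replicates are assigned to cells of the resulting grid, and the same number of replicates (the size of the smallest cell) is kept from every cell.

```python
import random
from collections import defaultdict

def bin_index(value, low, high, n_bins):
    """Index of the equal-width bin containing value within [low, high]."""
    fraction = (value - low) / (high - low)
    return min(int(fraction * n_bins), n_bins - 1)

def balanced_downsample(replicates, ranges, n_bins, rng=random):
    """Keep an equal number of replicates from every cell of the grid.

    replicates: list of dicts mapping parameter name -> value.
    ranges: dict mapping parameter name -> (low, high).
    """
    cells = defaultdict(list)
    for rep in replicates:
        key = tuple(bin_index(rep[name], low, high, n_bins)
                    for name, (low, high) in ranges.items())
        cells[key].append(rep)
    if len(cells) < n_bins ** len(ranges):
        return []  # some cell is empty, so no balanced sample exists
    per_cell = min(len(members) for members in cells.values())
    kept = []
    for members in cells.values():
        kept.extend(rng.sample(members, per_cell))
    return kept

# Toy hard-sweep example: accepted replicates skewed toward larger s.
random.seed(5)
hard_ranges = {"s": (0.0001, 0.05), "fixation_time": (0.0, 200.0)}
accepted = [{"s": 0.0001 + (0.05 - 0.0001) * random.random() ** 0.5,
             "fixation_time": random.uniform(0.0, 200.0)}
            for _ in range(900)]
balanced = balanced_downsample(accepted, hard_ranges, n_bins=3)
```

Under such a scheme the final replicate count is a multiple of the number of cells, which is consistent with the totals reported above (342 = 9 × 38 and 378 = 9 × 42 for hard sweeps; 104 = 8 × 13 and 128 = 8 × 16 for soft sweeps).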
Summary statistics and visualization
For each coalescent and forward simulation we calculated the following statistics: π (Nei and Li 1979; Tajima 1983), θW (Watterson 1975), θH and Fay and Wu’s H (Fay and Wu 2000), Tajima’s D (Tajima 1989), the maximum derived allele frequency (DAF) (Li 2011), the number of distinct haplotypes, H12 and H2/H1 (Garud et al. 2015), Kelly’s ZnS (i.e., average r²; Kelly 1997), Kim and Nielsen’s ω (Kim and Nielsen 2004), and the variance, skewness, and kurtosis of the distribution of densities of pairwise differences between chromosomes. All of these can be calculated using diploS/HIC (Kern and Schrider 2018) in haploid mode. For our human simulations, these statistics were calculated both within 100- and 1-kb windows to visualize variation across coarse and fine scales, respectively. In Drosophila, these window sizes were 10 kb and 100 bp. The smaller window sizes produced plots that were quite noisy, so we further smoothed values by plotting running averages across 10 windows.
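To make a few of these statistics concrete, the sketch below computes π, Watterson's θ, H12, and H2/H1 for a single window from a small matrix of haplotypes coded 0 (ancestral) and 1 (derived). It follows the standard definitions but is not the diploS/HIC implementation; function names are hypothetical, and values are per-window totals rather than per-site rates.

```python
from collections import Counter

def nucleotide_diversity(haplotypes):
    """pi: mean number of pairwise differences between chromosomes."""
    n = len(haplotypes)
    total = 0
    for site in range(len(haplotypes[0])):
        derived = sum(hap[site] for hap in haplotypes)
        total += derived * (n - derived)  # pairs that differ at this site
    return total / (n * (n - 1) / 2)

def watterson_theta(haplotypes):
    """theta_W: segregating sites divided by the harmonic number a_1."""
    n = len(haplotypes)
    segregating = 0
    for site in range(len(haplotypes[0])):
        derived = sum(hap[site] for hap in haplotypes)
        if 0 < derived < n:
            segregating += 1
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating / a1

def haplotype_homozygosity(haplotypes):
    """Return (H12, H2/H1) from haplotype frequencies (Garud et al. 2015)."""
    n = len(haplotypes)
    counts = Counter(tuple(hap) for hap in haplotypes)
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    h1 = sum(p * p for p in freqs)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    h12 = h1 + 2.0 * p1 * p2  # equals (p1 + p2)^2 + sum of remaining p_i^2
    h2 = h1 - p1 * p1
    return h12, h2 / h1

# Four chromosomes, three segregating sites.
example = [[0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
pi_value = nucleotide_diversity(example)
theta_w = watterson_theta(example)
h12, h2_over_h1 = haplotype_homozygosity(example)
```

In this example π = 10/6 exceeds θW = 3/(11/6), so the window carries a slight excess of intermediate-frequency variants; a hard sweep would be expected to push π below θW instead.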
Classifying sweeps
We adopted a classification approach to ask how often our forward simulations resembled selective sweeps on the basis of the set of summary statistics described above. In particular, we used the diploS/HIC software package in haploid mode to classify each simulated region as a hard sweep, a soft sweep, linked to a hard sweep, linked to a soft sweep, or neutrally evolving. A classifier was trained to discriminate among these five classes as follows (note that none of these classes contain BGS, but this classifier can still be used to classify simulations with BGS, thereby revealing which of the five classes a BGS replicate most closely resembles).
Training was performed by first dividing the coalescent simulations described above into 11 windows. Hard- and soft-sweep simulations with a selected mutation located in the central window were then labeled as hard and soft, respectively, while those in other windows were labeled as “hard-linked” and “soft-linked,” respectively. Next, a balanced training set with 2000 examples of each of the five classes was constructed from these simulations, and a separate set of the same size was set aside for testing. We then trained our classifier using diploS/HIC’s train command with default parameters, before using the predict command to obtain classifications, which told us which of the five classes above most closely resembled a given simulation outcome according to diploS/HIC. Three classifiers were trained in total: one for the human equilibrium model, one for the human model of European demography from Tennessen et al. (2012), and one for the D. melanogaster model of African demography from Sheehan and Song (2016). Each classifier was then applied to additional simulations generated under the corresponding demographic model as described in the Results.
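The labeling rule for the training simulations can be written out as a small sketch. It assumes windows are indexed from 0 to 10, so that the central window has index 5; the function name and label strings are illustrative and are not part of the diploS/HIC interface.

```python
def label_simulation(sweep_type, sweep_window=None, n_windows=11):
    """Assign one of the five training classes to a simulated region.

    sweep_type: "hard", "soft", or "neutral".
    sweep_window: 0-based index of the window containing the selected
    mutation (ignored for neutral simulations).
    """
    if sweep_type == "neutral":
        return "neutral"
    central = n_windows // 2
    if sweep_window == central:
        return sweep_type             # sweep inside the focal window
    return sweep_type + "-linked"     # sweep in one of the flanking windows

labels = [label_simulation("hard", w) for w in range(11)]
```

With 11 windows, only one of the 11 possible sweep locations yields a "hard" or "soft" label; the other 10 produce the corresponding linked class, which is why the training set has to be explicitly balanced across the five classes.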
Data availability
All code for generating forward and coalescent simulations, calculating and visualizing summary statistics, and training and applying diploS/HIC is available at https://github.com/SchriderLab/posSelVsBgs. In addition, all simulated data, along with statistics from each replicate in both text and graph form, are available at https://figshare.com/projects/posSelVsBgs/72209. Supplemental material available at figshare: https://doi.org/10.25386/genetics.12863981.
Results
Simplistic models of BGS in humans can resemble selective sweeps
We begin by simulating constant-sized populations with mutation and recombination rates matching estimates from the human genome, and comparing average patterns of diversity after a selective sweep to those from two different models of BGS. The first model of BGS that we examined (dubbed central BGS) includes a 50-kb coding region in the center of the simulated locus, flanked by nonfunctional DNA on either side occupying the remainder of the 1.1-Mb locus (note that similar models have been used to describe the expected patterns of polymorphism under BGS; e.g., Mughal and DeGiorgio 2018). In the second model (real BGS), for each replicate simulation a 1.1-Mb window was randomly selected from the human genome, and our simulation was designed to match several features of this genomic window including the locations of selected sites (exons and CNEs) and the recombination landscape (Materials and Methods). Although parameter values varied across replicates, the overall locus-wide mean population-scaled mutation and recombination rates, θ and ρ, were set to roughly match values expected in a human population of effective size 10,000 and a total locus size of 1.1 Mb. Note that for our BGS simulations, polymorphism at selected sites is included in our observations; our results therefore reflect the action of direct purifying selection as well as BGS. Selective sweeps were generated via coalescent simulation, while the BGS scenarios were modeled via forward simulation (Materials and Methods).
In Figure 2 we show levels of diversity as measured by three estimators of θ within simulated regions experiencing different modes of linked selection (averaged across 1000 replicates). We see that hard selective sweeps produce a large valley of diversity at the center of the simulated region with a gradual recovery toward equilibrium moving away from the selected site (Figure 2A), as expected (Maynard Smith and Haigh 1974). Note that this valley is more pronounced for π than for θW, due to the expected deficit of intermediate-frequency alleles produced by a hitchhiking event (Braverman et al. 1995). On the other hand, θH is elevated in the regions flanking the selective sweep due to the excess of high-frequency derived alleles that escaped the sweep via recombination (Fay and Wu 2000). For soft sweeps (Figure 2B), we see a qualitatively similar pattern but with a less pronounced valley of diversity. Again, π is lower than θW in the simulated region, indicating a deficit of intermediate-frequency alleles. It has been observed that soft sweeps with a fairly high initial selected frequency (e.g., 5% or more) can sometimes yield an excess of intermediate-frequency alleles (Teshima et al. 2006; Schrider et al. 2015), but here our initial selected frequency is constrained to values ≤5%, so closer concordance between hard and soft sweeps is expected. Additional statistics summarizing information about the site frequency spectrum, haplotype diversity, and LD are shown in Figure 3. These statistics show patterns concordant with expectations under a sweep: we observe peaks in the number of high-frequency derived alleles (Fay and Wu 2000; Hahn 2018) and in LD (Kim and Nielsen 2004) in regions flanking the sweep, decaying haplotype homozygosity with increasing distance from the selected site (Garud et al. 2015), and characteristic spatial patterns of the variance, skewness, and kurtosis of pairwise diversity around the sweep (Kern and Schrider 2018).
In the central BGS scenario (Figure 2C), we observe a strong localized reduction in diversity caused by direct selection against deleterious mutations, and then a rapid increase in diversity as we move away from the selected sites. Still, these diversity levels are somewhat reduced relative to the neutral expectation due to their linkage with the selected region, and recover gradually as we move further away, as expected under BGS (Charlesworth et al. 1993). There are also apparent changes to the site frequency spectrum in the selected region (e.g., reduced Tajima’s D and maximum DAF), although these quickly recover toward neutral expectations with increasing distance from the selected region. Thus, while the average realization of this particular scenario does not perfectly match the predictions of a selective sweep, it is consistent with the possibility that regions experiencing BGS may commonly be mistaken for sweeps, especially if negatively selected sites are also examined, as may typically be the case when scanning for positive selection in practice.
Finally, in Figure 2D we show the mean values of the three estimators of θ across the real BGS simulations, wherein each replicate draws its recombination map and locations of selected sites from a random region of the human genome. We see that diversity is somewhat reduced in these simulations relative to the neutral expectation due to the combination of direct and linked negative selection. Because each replicate models a different genomic region, there is no consistent spatial pattern shown in Figure 2D, but this does not mean that individual replicates in this set do not resemble selective sweeps. We examine this possibility below.
Expected patterns of diversity produced by BGS in particular regions of the human genome
In the previous section, we examined the average values of different summary statistics under four different evolutionary models (Figure 2 and Figure 3), including one model of BGS where the chromosomal locations of exons and CNEs as well as the recombination map were chosen to match randomly selected regions in the human genome. Because these regions may differ dramatically in the number and locations of selected sites, and thus the expected impact of selection on diversity, rather than looking at the average across regions, a more useful question to ask is whether any particular region’s measures of diversity are expected to resemble selective sweeps. We examine this in Supplemental Material, Figures S1–S10, where we show the values of 15 summary statistics calculated from sets of simulations, each modeled after a particular randomly chosen region of the human genome, with 1000 replicates for each region. These plots also show the density of exonic and conserved noncoding sites in 100-kb windows, revealing the concordance between the peaks and valleys in the density of selected sites and the average values of the summary statistics. Among these 10 examples, we see considerable variation in the number and arrangement of selected sites along the chromosome. There are corresponding differences in the mean values of summary statistics from region to region, and generally across windows within a region we see subtle shifts in the amount of nucleotide diversity that coincide with changes in the density of selected sites (i.e., peaks in conserved elements correspond to slight dips in π). However, in none of these regions do the expected patterns of summary statistics resemble a selective sweep, or even the central BGS scenario. An examination of the arrangements of selected sites along these chromosomal regions reveals why this is so: in none of these 10 regions do we see a high density of selected sites in the center flanked by largely unconstrained sequence. 
Thus, these results suggest that scenarios of BGS that are most likely to resemble selective sweeps may not be appropriate models for the typical manner in which BGS shapes diversity across the human genome. We investigate this possibility more systematically in the following section.
BGS rarely produces patterns of diversity resembling selective sweeps in equilibrium populations
In the previous section we examined simulated data based on 10 randomly selected 1.1-Mb regions of the human genome, finding that none are expected to produce patterns of variation mimicking a recent hitchhiking event. However, the human genome is large, consisting of ∼3000 such regions. Thus, even if a small minority of regions have an arrangement of selected sites and recombination maps that lend themselves to producing large valleys of diversity at their center, then BGS could still result in a large number of regions somewhat resembling selective sweeps. We sought to examine this directly by asking how many of our simulated examples from the real BGS set are mistaken for sweeps on the basis of their spatial patterns of population genetic summary statistics. To do this, we used the S/HIC framework (Schrider and Kern 2016), which represents a genomic region as a large vector of population genetic summary statistics calculated in and normalized across each of a number of windows within this region (Materials and Methods); we refer to this set of statistics as our feature vector. S/HIC then classifies the central window of this region into one of five distinct evolutionary models: a hard sweep (i.e., a hard selective sweep recently occurred at the region’s center), a soft sweep, linked to a hard sweep (i.e., a hard sweep recently occurred within or near the region, but not within the central subwindow), linked to a soft sweep, or evolving neutrally (i.e., no recent sweep in the vicinity of the region). This inference is made via supervised machine learning: we first train a classifier on the basis of feature vectors calculated from genomic regions whose true class is known prior to applying the trained classifier to data whose true class may be unknown. In our case, the training data are obtained via coalescent simulation of regions with a sweep in the center, surrounded by unselected sequence (Materials and Methods).
Because of the design of its feature vector, S/HIC is well suited for determining whether a given genomic window resembles a selective sweep or not on the basis of its spatial patterns of genetic variation. We note that there are similar approaches that may be equally suitable for this task (e.g., Lin et al. 2011; Mughal and DeGiorgio 2018).
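To make the idea of a feature vector "normalized across" windows concrete, the sketch below rescales each statistic so that its per-window values sum to 1 and then concatenates the statistics. This is an illustration under stated assumptions rather than the S/HIC implementation: the function names are hypothetical, and both the shift applied to negative values and the uniform fallback for a flat profile are simplifications chosen here.

```python
def normalize_across_windows(values):
    """Rescale one statistic's per-window values so that they sum to 1."""
    lowest = min(values)
    # Assumption: shift so that no value is negative (e.g., for Tajima's D).
    shifted = [v - lowest for v in values] if lowest < 0 else list(values)
    total = sum(shifted)
    if total == 0:
        # Flat or all-zero profile: fall back to a uniform profile.
        return [1.0 / len(values)] * len(values)
    return [v / total for v in shifted]

def build_feature_vector(stats_by_window):
    """Concatenate normalized per-window profiles, one statistic at a time.

    stats_by_window maps a statistic name to its list of per-window values.
    """
    features = []
    for name in sorted(stats_by_window):
        features.extend(normalize_across_windows(stats_by_window[name]))
    return features
```

Because each statistic is expressed relative to its own total across the region, the resulting vector encodes the spatial shape of the signal (for example, a dip in π in the central window) rather than its absolute magnitude.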
After training our S/HIC classifier, we applied it to the 1000 replicates from our central BGS and real BGS sets of forward simulations (i.e., the same data examined in Figure 2D). First, we assessed S/HIC’s ability to perform the task for which it was trained: discriminating between selective sweeps, regions linked to selective sweeps, and neutrally evolving regions (top 5 rows of Figure 4). Overall the classifier performed quite well, although discrimination between hard and soft sweeps is difficult for both the sweep and sweep-linked classes; this is not unexpected given that our soft sweeps had a fairly low initial selected frequency (≤5%), making them more similar to hard sweeps where the initial frequency is 1/2N. Moreover, here we equate soft sweeps with selection on standing variation regardless of the number of independent copies that participate in the sweep (Materials and Methods), raising the possibility that, in some cases, only a single ancestral copy will reach fixation. However, our primary concern is the extent to which sweeps of any type can be distinguished from alternative evolutionary models. Importantly, we see that 4% of neutrally evolving regions are misclassified as selective sweeps (all as soft; Figure 4); these results are based on forward simulations but similar numbers are obtained when we use a test set of coalescent simulations generated in the same manner as those used to train S/HIC. Thus, due to the stochasticity of the evolutionary process, despite the vast difference in the expectations between sweep models and neutrality, we can expect to occasionally see neutrally evolving regions whose spatial patterns of genetic diversity resemble selective sweeps closely enough for S/HIC to misclassify them.
Next, we assessed S/HIC’s behavior on BGS models not included in training (bottom 3 rows of Figure 4), first asking how often examples in our central BGS set were mistaken for selective sweeps by S/HIC. Perhaps unsurprisingly given the sharp valley of diversity observed in Figure 2C, we find that 32.5% of these simulated regions are classified as sweeps by S/HIC, with the majority classified as soft sweeps (29.5% classified as soft vs. 3% as hard). However, when examining the real BGS simulations, which are designed to more accurately model BGS in the human genome, we find that 4.7% of examples are classified as sweeps (4.3% as soft and 0.4% as hard), similar to the corresponding fraction of neutrally evolving examples (P = 0.51; Fisher’s exact test). When simulating weaker selection on CNEs than on protein-coding exons, we again find no significant elevation in the rate of false-sweep calls (4.9% of examples classified as sweeps; P = 0.39); however, we note that the real BGS models do result in regions being classified as affected by linked soft selective sweeps (i.e., the soft-linked class) at a substantially higher rate than are neutral regions (∼11% for real BGS models vs. 6.2% under neutrality). In addition to running our classifier on each simulated replicate, we have created plots similar to Figures S1–S10, but rather than showing the mean values of each statistic we plot each simulation replicate separately. Readers curious about the extent of variability in patterns across individual realizations of each of our simulated scenarios—which can be considerable—may wish to explore these plots of each individual simulation (available at https://figshare.com/projects/posSelVsBgs/72209).
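The comparisons of false-positive rates above rely on Fisher's exact test applied to a 2 × 2 table of sweep calls vs. non-sweep calls. A minimal, dependency-free sketch of the two-sided test is given below; it is not the code used in this study, and the example counts in the usage note assume 1000 neutral replicates, which is an assumption rather than a figure stated in this section.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x column-1 observations in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

For example, `fisher_exact_two_sided(47, 953, 40, 960)` compares 4.7% sweep calls among 1000 real BGS replicates with 4% among an assumed 1000 neutral replicates, and `fisher_exact_two_sided(325, 675, 40, 960)` does the same for the central BGS set.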
Thus far our simulations do not support the claim that BGS frequently alters diversity in a manner consistent with selective sweeps. This may imply that, at least in the case of parameterizations relevant for the human genome, BGS should be readily separable from models of selective sweeps by examining summaries of variation taken across a large chromosomal region. However, up to this point we have only considered a constant-sized population. Given that nonequilibrium population dynamics can have a profound impact on genetic diversity and the effect of BGS (Torres et al. 2018, 2019), we examine two such models in the following sections.
Selective sweeps and BGS are readily distinguishable in the presence of drastic population size change
To this point, we have shown that BGS and sweep models are readily distinguishable in simulated constant-sized populations with genomic regions modeled after those randomly selected from the human genome. It is known that human populations have experienced a number of demographic changes that have reshaped patterns of diversity genome-wide (Gravel et al. 2011). These include recent explosive population growth (Tennessen et al. 2012) and, in non-African populations, a severe bottleneck associated with the migration out of Africa (Marth et al. 2004). Thus, if our goal is to model BGS in humans we should consider the effects of dramatic population size changes. Therefore, we repeated all of the analyses described above under a model of European population size history (Tennessen et al. 2012).
Panels A and B of Figure 5 show that selective sweeps under the European model produce a fairly similar pattern to sweeps in a constant population (Figure 2), with one noticeable difference in that θH is depressed rather than elevated in regions flanking the sweep (although it remains considerably higher than π). Again, we see a sharp dip in diversity around the selected region in the central BGS model (Figure 5C) and on average a global reduction in diversity in the real BGS model (Figure 5D). When comparing the values of a larger set of population genetic statistics across the full 1.1-Mb region, we see that on average the central BGS model bears a passing resemblance to sweeps for some statistics but not others, echoing our results from the constant population size case (Figure S11).
We also reexamined the same 10 randomly selected genomic regions shown in Figures S1–S10, this time simulating BGS under the European model of Tennessen et al. 2012 (Figures S12–S21). Again, none of these 10 regions show the appearance of a sweep. To more formally ask how often examples of each of our scenarios resemble hitchhiking events, we trained an S/HIC classifier under the model of Tennessen et al. 2012 (Materials and Methods) and recorded the number of simulations that were misclassified as a selective sweep (Figure 6). Population bottlenecks are expected to produce sweep-like signatures (Simonsen et al. 1995; Jensen et al. 2005; Nielsen et al. 2005), and we do find that under this demographic model we observe a slightly higher false-positive rate for neutrally evolving regions than under constant population size (6.2% in total vs. 4% under constant population size; P = 0.033). In addition, we do have greater difficulty distinguishing between hard and soft sweeps, perhaps because population bottlenecks during a sweep can reduce diversity among chromosomes harboring the beneficial allele, thereby “hardening” the sweep (Wilson et al. 2014). As observed under equilibrium demographic history, the false-positive rate is considerably higher under the central BGS model than under neutrality (0.9% and 22.7% of simulations classified as hard and soft, respectively; Figure 6). Under the real BGS model, the false-positive rate is similar to that under neutrality (5.8% in total, with 5.7% and 0.1% of simulations classified as hard and soft, respectively; P = 0.78 for the comparison with neutrality); we find similar results when simulating regions with weaker selection on CNEs (5.6% false positives, with 4.9% classified as hard and 0.7% as soft; P = 0.64 when compared with neutrality). In sum, under neither real BGS model do we see an excess of regions classified as being linked to a sweep.
Thus, under realistic arrangements of selected sites, models of BGS in humans do not appear to produce signatures of selective sweeps even in the context of severe population size change.
BGS rarely mimics sweeps in Drosophila
Thus far our results are based on simulations modeled after the human genome, which has a low density of genes and conserved DNA (∼5%; Siepel et al. 2005) compared to more compact genomes. To examine the impact of BGS in a genome with a higher density of conserved elements, we simulated BGS in regions modeled after 110-kb windows in the D. melanogaster genome (see Materials and Methods), in which >50% of sites show evidence of purifying selection (Andolfatto 2005; Halligan and Keightley 2006). For these simulations we used Sheehan and Song’s (2016) three-epoch demographic model of a Zambian population sample (Lack et al. 2015). In Figure 7 we again show values of three estimators of θ under our four different scenarios. For this demographic history, there is a more pronounced difference between hard and soft sweeps (Figure 7, A and B). This may be a consequence of the larger effective population size (Ne) for Drosophila, which yields a much stronger effect observed for hard sweeps than under previous scenarios, while soft sweeps contain a drift phase whose duration is constant (in coalescent units) across values of Ne. Under the central BGS scenario (Figure 7C), we again see a strong reduction in diversity in the immediate vicinity of the selected sites flanked by a rapid recovery. In the real BGS scenario (Figure 7D), we again do not see any spatial pattern on average, as expected given our random sampling across loci; however, we do see a larger mean reduction in variation relative to expectations in the absence of selection than seen in the human scenarios, due to the denser placement of selected sites in Drosophila. The average patterns of additional summary statistics are shown in Figure S22; these statistics show a strong hitchhiking effect for hard sweeps (a valley of nucleotide diversity, a plateau of LD, an SFS skewed toward low- and high-frequency derived alleles, etc.) and a somewhat different pattern for soft sweeps (e.g., elevated Tajima’s D and a fairly high ratio of H2/H1). Although for central BGS the depth of the valley of diversity closely resembles that of soft sweeps, as expected there is no spatial pattern on average for any statistics under the real BGS scenario.
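The H2/H1 ratio mentioned above is one of the haplotype homozygosity statistics of Garud et al. (2015). As a rough illustration, the sketch below computes H1, H12, and H2/H1 from a sample of haplotypes using their usual definitions in terms of ranked haplotype frequencies; the function name and the representation of haplotypes as hashable strings are assumptions made for this example.

```python
from collections import Counter

def haplotype_homozygosity(haplotypes):
    """Return (H1, H12, H2/H1) for a list of haplotypes (hashable sequences)."""
    n = len(haplotypes)
    # Haplotype frequencies, sorted from most to least common.
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    h1 = sum(p * p for p in freqs)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    # H12 pools the two most common haplotypes into a single class.
    h12 = (p1 + p2) ** 2 + sum(p * p for p in freqs[2:])
    # H2 is haplotype homozygosity excluding the most common haplotype.
    h2 = h1 - p1 * p1
    return h1, h12, h2 / h1
```

When a single haplotype dominates the sample, as expected after a hard sweep, H2/H1 is close to zero; when several haplotypes carry the beneficial allele to high frequency, as in a soft sweep, the ratio is larger.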
We examine expected values of summary statistics in 10 randomly selected individual regions in Figures S23–S32. Here, we see a much more conspicuous relationship between summaries of diversity and the density of conserved elements, in part because this density is an order of magnitude higher than in our human simulations. Again, none of our 10 randomly selected regions resemble a selective sweep at all. Using an S/HIC classifier trained on coalescent simulations under the Sheehan and Song model (Materials and Methods), we see that no neutrally evolving regions are misclassified as sweeps (Figure 8); this is perhaps unsurprising given the much stronger signatures shown in Figure 7 than in either human model we examined. A modest fraction of the central BGS cases are misinferred to contain sweeps (3.7% in total, with 2.3% hard and 1.4% soft), but for real BGS these false positives occur more rarely (1%, with 0.1% hard and 0.9% soft; these fractions are 0.5% hard and 0.2% soft when selection on CNEs is weaker). Unlike in the human scenarios, the real BGS simulations do result in an excess of sweep calls (P = 0.0019 and P = 0.015 in the standard and weak-CNE real BGS scenarios, respectively), although this effect is quite modest (false-positive rate ≤1% in either case). As in the human scenarios, we also note a sizeable fraction of simulations of the real BGS scenario assigned to S/HIC’s soft-linked class (16.7% and 18.9% for the standard and weak-CNE real BGS examples, vs. 2.2% under neutrality).
Dominance of deleterious mutations
We also examined the impact of the dominance coefficient of deleterious mutations on the propensity of regions experiencing BGS to be mistaken for selective sweeps. As shown in Figure S33, there does not appear to be a strong relationship between dominance and the fraction of real BGS simulations classified as a sweep by diploS/HIC. In our simulated constant-sized human populations, the fraction of spurious sweep calls (either hard or soft) hovers ∼5%, with no significant difference across dominance values and no discernible trend with increasing dominance. Similar results are observed under the European model, where the fraction of sweep calls ranges between 4% and 6%, and the African D. melanogaster model, where this fraction is ∼1% or less in all cases, again with no significant difference among dominance classes. These results imply that, at least for models of BGS in which every deleterious mutation has the same dominance coefficient, patterns of variation are unlikely to resemble those expected under recent hitchhiking events regardless of that coefficient’s value.
Properties of regions misclassified as selective sweeps
Although we find that the fraction of real BGS simulations misclassified as sweeps is not dramatically different from that of neutral simulations, it may be useful to ask whether the propensity of a genomic region under BGS to produce a signature of hitchhiking can be predicted a priori from the arrangement of selected sites and the recombination rate. In Table 1, we show Spearman’s correlation coefficients between the probability that a given simulation contains a sweep according to S/HIC (represented by the sum of diploS/HIC’s posterior probability estimates for the hard and soft classes) and the number of selected sites in the central window, the number of selected sites in all other windows, and the total recombination rate of the simulated region. We calculated these correlations for each combination of species, demographic history, and strength of selection on CNEs in our simulated data set. For each data set we observe a significant negative correlation between the recombination rate and S/HIC’s predicted posterior probability that the region is a sweep. In Drosophila, we also observe a significant correlation between the number of selected sites in the central window and the posterior probability of a sweep in the weak-CNE scenario. Overall, our results suggest that there may be some power to predict which regions are most likely to produce spurious sweep-like signatures; as one might expect, such regions have lower crossover rates and a greater density of selected sites in the central window (although the latter was significant in only one of our simulated scenarios).
Table 1. Spearman’s correlation coefficients between various properties of the simulated region and the sum of the posterior probabilities for diploS/HIC’s hard and soft classes.
Genome and demographic scenario | Strength of selection on CNEs | Number of selected sites in central window | Number of flanking selected sites | Total recombination rate |
---|---|---|---|---|
Human (equilibrium) | Same as exons | ρ = −0.038 (P = 0.90) | ρ = −0.048 (P = 0.13) | ρ = −0.21 (P = 2.3 × 10⁻¹¹)ᵃ |
Human (equilibrium) | 10-fold weaker | ρ = 0.044 (P = 0.16) | ρ = −0.027 (P = 0.38) | ρ = −0.18 (P = 7.8 × 10⁻⁹)ᵃ |
Human (Tennessen) | Same as exons | ρ = 0.021 (P = 0.50) | ρ = −0.028 (P = 0.37) | ρ = −0.10 (P = 0.0011)ᵃ |
Human (Tennessen) | 10-fold weaker | ρ = 0.020 (P = 0.53) | ρ = 0.028 (P = 0.37) | ρ = −0.11 (P = 2.7 × 10⁻⁴)ᵃ |
Drosophila (Sheehan and Song) | Same as exons | ρ = 0.05 (P = 0.085) | ρ = −0.054 (P = 0.087) | ρ = −0.15 (P = 3.9 × 10⁻⁶)ᵃ |
Drosophila (Sheehan and Song) | 10-fold weaker | ρ = 0.13 (P = 2.7 × 10⁻¹⁵)ᵃ | ρ = −0.032 (P = 0.30) | ρ = −0.22 (P = 7.7 × 10⁻¹³)ᵃ |
All simulations were under either the real BGS model or the real BGS model with weaker selection on CNEs. BGS, background selection; CNE, conserved noncoding element; recomb, recombination.
Uncorrected P-values are shown; ᵃ marks correlations that are significant at α = 0.05 after Bonferroni correction.
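The quantities in Table 1 can be reproduced in outline with a rank correlation followed by a Bonferroni threshold. The sketch below is illustrative only: the function names are hypothetical, and treating all 18 entries in the table as the family of tests being corrected is an assumption, since the number of tests is not stated here.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def bonferroni_significant(p_values, alpha=0.05):
    """Flag P-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)  # e.g., 0.05 / 18 for 18 tests
    return [p < threshold for p in p_values]
```

Under the 18-test assumption the per-test threshold is roughly 0.0028, which is consistent with P = 0.0011 being marked as significant in the table while P = 0.085 is not.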
To more closely examine the impact of recombination rate on our results, for each demographic model we binned all of the real BGS simulations according to mean recombination rate across the simulated chromosome. Five bins were used: one bin reserved for the fairly small number of replicates with a recombination rate of zero, and the four quartiles among simulations with a nonzero recombination rate. In Figure 9, we show the fraction of examples misclassified as selective sweeps of either type by S/HIC for each recombination bin and in each case compare to the misclassification rate for neutral simulations (rightmost bar). The results of this same analysis for the real BGS–weak-CNE model are shown in Figure S34. We see that for most of the human recombination rate bins there is no significant elevation of the false-positive rate relative to neutral simulations, the one exception being the no-recombination bin for the weak-CNE simulations (P = 0.04). This lack of significance may in part be due to inadequate statistical power, and a trend toward higher false-positive rates in replicates with less recombination is seen in the human equilibrium simulations. However, in Drosophila, the effect of recombination is clear: regions with no recombination or in the lowest nonzero recombination rate bin account for the majority of false-sweep calls by S/HIC, and outside of these two bins the false-positive rate is ∼0. Thus, the small but significant elevation in false-positive rate produced by BGS in Drosophila seems to be driven entirely by low- or nonrecombining regions.
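The five-bin scheme described above (one bin for zero recombination plus four quartiles of the nonzero rates) can be sketched as follows. This is a minimal illustration, not the analysis code; the function names are hypothetical, and the quartile cut points use the default method of Python's `statistics.quantiles`, which may differ slightly from the convention used in the study.

```python
from statistics import quantiles

def assign_recombination_bins(rates):
    """Return one bin index per replicate: 0 for zero recombination,
    1 to 4 for the quartiles of the nonzero recombination rates."""
    nonzero = [r for r in rates if r > 0]
    cuts = quantiles(nonzero, n=4)  # three quartile cut points
    bins = []
    for r in rates:
        if r == 0:
            bins.append(0)
        else:
            bins.append(1 + sum(r > q for q in cuts))
    return bins

def false_positive_rate_by_bin(bins, called_sweep):
    """Fraction of replicates in each bin that were classified as a sweep."""
    totals, hits = {}, {}
    for b, called in zip(bins, called_sweep):
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + int(called)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Each bin's rate can then be compared with the neutral misclassification rate using the same kind of 2 × 2 test applied elsewhere in this section.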
BGS does not mimic sweeps in larger simulated chromosomes
Until this point, we have limited our analysis to simulated chromosomes 1.1 Mb and 110 kb in length for humans and Drosophila, respectively; these lengths were chosen to match the region size that we trained S/HIC to examine. Although this allowed us to simulate thousands of replicates, the impact of BGS may be smaller in such simulations because they do not include the effect of selection in more distant linked regions, which also influences diversity (Comeron 2014). Therefore, we simulated much larger chromosomes (>10-fold larger than those described above; see Materials and Methods), with a smaller number of replicates (100 per demographic model–DFE combination) due to the greater computational demands of these simulations. These chromosomes were 12.1 and 1.21 Mb in humans and Drosophila, respectively, 11 times the length of our original simulations. Average nucleotide diversity in these simulations was qualitatively similar to that of the smaller simulations: π per site in the central 1.1-Mb window was 4.1 × 10⁻⁴ and 4.4 × 10⁻⁴ averaged across all small- and large-scale human equilibrium simulations, respectively, and 2.7 × 10⁻⁴ and 2.6 × 10⁻⁴ for the small- and large-scale simulations under the Tennessen et al. (2012) European model; in the central 110-kb window of the Drosophila simulations, average π was 0.0045 and 0.0044 in the small- and large-scale simulations, respectively. This suggests that the simulations used in the preceding sections may be adequate for addressing the similarity of BGS and hitchhiking models, despite the relatively small chromosomes being modeled, perhaps because the impact of selection on additional linked sites is relatively small compared with the combined effect of direct and linked selection within the focal window. Moreover, we do not observe a significantly elevated false-positive rate in the large-scale simulations when examining the central window within the chromosome (Figure 10).
In our human equilibrium model, we observe a nominal increase in the false-positive rate when switching from smaller to larger chromosomes (4.7% of BGS windows misclassified as sweeps in small-scale simulations vs. 5% in large-scale simulations), although this is not significant (P = 0.81). In the European model of Tennessen et al. (2012), we actually see a smaller false-positive rate in our large-scale simulations (5.8% vs. 2%), but again this difference is not significant (P = 0.16). Similarly, in our Drosophila simulations, we see no significant difference in false-positive rates between our small- and large-scale simulations (1% vs. 0%; P = 0.61).
Our large-scale simulations allow us to examine another potential source of bias in our results: because we are identifying false-sweep signatures using S/HIC, which looks for the pattern of diversity consistent with a sweep at the center of its focal window, one concern may be that BGS produces sweep-like signatures fairly often, but rarely with the epicenter at the center of the region examined by S/HIC. Were this the case, S/HIC would be underpowered to detect spurious signatures resulting from BGS. To determine whether our above analyses may have underestimated the rate at which false-sweep signatures appear, we adopted a sliding window approach using the large-scale simulations described above. Specifically, we moved S/HIC across these larger simulated chromosomes with small step sizes, asking whether S/HIC mistakes the focal window for a sweep at each step. Using 10-kb step sizes for our human simulations and 1 kb for Drosophila, we classified 1100 windows for each replicate with S/HIC. Importantly, by using these small step sizes, we allow S/HIC to examine a number of possible sweep locations within each 100-kb window (or 10-kb window in Drosophila). We did not observe a significant increase in the false-positive rate relative to our examination of the central window alone (Figure 10): in our human equilibrium scenario the false-positive rate was 6.5%, compared with 5% when examining the central window alone (P = 0.69); in the European model of Tennessen et al. (2012) the false-positive rate was 3.6%, compared with 2% when examining the central window (P = 0.59); and in Drosophila it was 0.93% vs. 0% (P = 1.0). Note that these P-values may be anticonservative due to the autocorrelation of tests of nearby windows within our larger simulated chromosomes (Hahn 2006).
These results imply that our primary approach of simulating larger numbers of small chromosomes experiencing BGS should randomize the location of any spurious sweep-like signatures, such that S/HIC should yield an unbiased estimate of their frequency of occurrence. Taken together, our findings suggest that BGS does not appear to systematically mimic hitchhiking even in larger simulated chromosomes.
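The sliding-window scan described above can be sketched by enumerating the start coordinate of each region that the classifier examines. The code below is a hypothetical illustration: the half-open stepping convention (which omits the final possible start position) is an assumption chosen because it yields the 1100 windows per replicate reported above, and the class labels `"hard"` and `"soft"` are placeholder names.

```python
def sliding_window_starts(chrom_len, region_len, step):
    """Start coordinates of each examined region (half-open stepping)."""
    return list(range(0, chrom_len - region_len, step))

def scan_false_positive_rate(calls):
    """Fraction of scanned windows whose predicted class is a sweep."""
    sweep_classes = {"hard", "soft"}  # placeholder label names
    return sum(c in sweep_classes for c in calls) / len(calls)

# Human: 12.1-Mb chromosome, 1.1-Mb classifier region, 10-kb steps.
human_starts = sliding_window_starts(12_100_000, 1_100_000, 10_000)
# Drosophila: 1.21-Mb chromosome, 110-kb classifier region, 1-kb steps.
fly_starts = sliding_window_starts(1_210_000, 110_000, 1_000)
```

Because consecutive regions overlap almost entirely at these step sizes, calls from neighboring windows are strongly correlated, which is why tests based on per-window counts should be interpreted with caution.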
BGS increases the false-negative rate for selective sweeps
We have shown that realistic models of BGS do not frequently produce sweep-like signatures. However, BGS could also potentially confound scans for hitchhiking events by eroding the signature of positive selection, thereby increasing the false-negative rate. To address this possibility, we generated forward simulations with recently completed hitchhiking events (see Materials and Methods), and asked whether sweeps occurring in concert with BGS were more difficult to detect than those occurring on an otherwise neutrally evolving background. Because this approach was fairly computationally intensive, we limited our analysis to a single demographic history: the model of human European demography of Tennessen et al. (2012).
We wished to match the selective parameters of our coalescent simulations, namely the selection coefficient, time since fixation, and the initial selected frequency in the case of soft sweeps, all of which were drawn from uniform distributions in our coalescent simulations. Therefore, we subsampled our simulations by dividing them into discrete bins based on these parameter values, uniformly drawing replicates for our final data set from these bins (Materials and Methods). However, because this binning approach was fairly coarse, it may not perfectly match the uniform distributions. Therefore, we assessed the impact of BGS on S/HIC’s false-negative rate by comparing classification results between two sets of forward simulations: those including BGS and those without BGS (Figure 11). Nonetheless, we found that our forward simulations of sweeps without BGS were qualitatively similar to our coalescent simulations in terms of the number of sweeps detected (89.8% and 80.4% of forward- and coalescent-simulated hard sweeps detected, respectively, and 71.1% and 79.2% of soft sweeps detected), although the fraction classified as hard or soft differed more substantially between the two simulated data sets.
We observed a substantial deficit of sweeps detected under BGS vs. an otherwise neutrally evolving chromosome. For example, 71.7% of hard sweeps simulated under BGS were classified as a sweep of either type by S/HIC, significantly lower than the 89.8% of sweeps without BGS that were recovered (P = 6.72 × 10⁻¹⁰; Figure 11). Moreover, hard sweeps under BGS were more likely to be misclassified as soft (46.0% vs. 35.1%; P = 0.0031). Similarly, soft sweeps occurring in the presence of BGS were less likely to be detected than those without BGS (53.1% vs. 71.1%; P = 0.0066). Together, these results suggest that BGS may dull the signatures of completed selective sweeps.
Discussion
Natural selection can shape patterns of genomic diversity in many ways. BGS is a prime example of this, as its expected patterns depend on the DFE, the locations of selected sites, and the recombination landscape. Thus, this model is sufficiently flexible that one should not expect a single common signature of BGS. This is the motivation for B-maps, which take the DFE, recombination map, and functional DNA element coordinates into account to predict the reduction in diversity produced by BGS in a genome for which all of this information has been annotated/estimated. Such maps have been touted as an important “baseline” expectation for patterns of diversity across the genome (Comeron 2014). Unfortunately, such maps are only predictive of the impact of BGS on levels of expected heterozygosity (i.e., the degree of reduction in π produced by BGS). To model the full distribution of genealogies yielded by BGS on the basis of a genome annotation, one can use forward population genetic simulations, which are becoming increasingly computationally efficient (Thornton 2014; Kelleher et al. 2018; Haller et al. 2019). The present study attempts to do this by simulating large regions designed to mimic the spatial arrangement of functionally important sites across the genomes of humans and Drosophila, thereby modeling the effects of both direct and linked negative selection in these genomes. More extensive simulation of chromosome-sized segments under this approach could be used to produce analogs of the B-map for any set of summary statistics under an arbitrary demographic history. Such an approach may prove useful as a multidimensional baseline expectation of different summaries of diversity under BGS alone.
The goal of this study was to use forward simulation to investigate the expected patterns of diversity created by BGS in humans and Drosophila, and compare them to expectations under recent hard and soft selective sweeps. We find that some parameterizations of BGS do indeed yield a valley in diversity similar to that expected under a sweep. These results are consistent with a previous finding that simply taking genomic windows that are outliers with respect to π will result in a large number of false positives (Comeron 2014), although this approach is not commonly taken in practice. However, other statistics do not appear to be affected in the same manner as π (compare the central BGS scenario to the hard- and soft-sweep scenarios in Figure 3 and Figures S11 and S22).
Perhaps more importantly, in real genomes few regions have an arrangement of functional elements that coincidentally mirrors those designed with the intention of confounding scans for selection. Thus, if we examine the complete landscape of genetic variation across a chromosome, the impact of BGS on the false-positive rate for selective sweeps should be minimal. Our results suggest that the primary impact of BGS on scans for hitchhiking events may instead be an elevated false-negative rate. This effect is probably due to a combination of Hill–Robertson interference (Hill and Robertson 1966) and the shrinking of genealogies across the chromosome, including in regions flanking selective sweeps, caused by BGS. Both of these phenomena will cause the spatial skews in patterns of polymorphism produced by hitchhiking events to be less pronounced. The effect of negative selection on selective sweep detection requires further study, and it is possible that other approaches may be able to detect sweeps in the presence of BGS with greater sensitivity than S/HIC. An alternative strategy for detecting sweeps may be to attempt to discriminate between selective sweeps and BGS, rather than solely considering neutrality as a baseline (Comeron 2014).
In recent years, several methods have been devised to use spatial patterns of multiple summaries of genetic variation around a focal region to detect selective sweeps (Lin et al. 2011; Schrider and Kern 2016; Mughal and DeGiorgio 2018), and our results suggest that these methods should be robust to realistic scenarios of BGS [consistent with results from Schrider and Kern (2017) and Mughal and DeGiorgio (2018)]. Indeed, a recent method for detecting hitchhiking using trend-filtered regression appears to be fairly robust even to a BGS scenario concocted to resemble selective sweeps (Mughal and DeGiorgio 2018). Thus, our conclusions about the separability between models of BGS and hitchhiking events could be viewed as conservative. In contrast to BGS, demographic history is likely to be an important confounding factor for detecting natural selection in practice (Simonsen et al. 1995; Jensen et al. 2005; Nielsen et al. 2005). Researchers should thus continue to focus on the development of methods that are robust to nonequilibrium demographic histories, especially in cases where the true history is unknown (Schrider and Kern 2016; Mughal and DeGiorgio 2018); this is likely to often be the case in practice given that demographic estimates will themselves be biased by the impact of natural selection on polymorphism (Ewing and Jensen 2016; Schrider et al. 2016).
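The idea of summarizing the spatial pattern of a statistic around a focal region can be sketched as follows. This toy example computes π (mean pairwise differences) in each of several adjacent subwindows and then divides each value by the sum across subwindows, so that the resulting vector describes the relative shape of diversity across the region rather than its overall level. To our understanding this mirrors the kind of normalization used by S/HIC, but the function names, the 0/1 haplotype encoding, and the default of 11 subwindows are assumptions made for illustration, not a reproduction of any published implementation.

```python
def pi_per_subwindow(haplotypes, positions, region_len, n_sub=11):
    """Sum of per-SNP mean pairwise differences in each subwindow.

    haplotypes -- list of sequences of 0/1 alleles, one per sampled chromosome
    positions  -- SNP positions in [0, region_len), one per column
    region_len -- total length of the region (same units as positions)
    n_sub      -- number of equal-sized subwindows (ASSUMED default of 11)
    """
    n = len(haplotypes)
    n_pairs = n * (n - 1) / 2
    pis = [0.0] * n_sub
    for j, pos in enumerate(positions):
        k = sum(h[j] for h in haplotypes)  # derived-allele count at SNP j
        idx = min(n_sub - 1, int(pos * n_sub / region_len))
        # k * (n - k) pairs of chromosomes differ at this SNP
        pis[idx] += k * (n - k) / n_pairs
    return pis

def normalize(values):
    """Rescale values to sum to 1; a flat vector is returned if all are zero."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

# Toy data: 4 chromosomes, 3 SNPs in a 1100-bp region.
haps = [[1, 0, 1], [1, 0, 0], [0, 0, 0], [0, 1, 0]]
snp_pos = [50, 550, 1050]
features = normalize(pi_per_subwindow(haps, snp_pos, region_len=1100))
print(features)
```

Because the vector is normalized, a uniform reduction in diversity such as that expected from BGS acting evenly across a region leaves it unchanged, whereas a localized trough at the center alters it.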
We modeled our BGS scenarios after two very different genome architectures: the human genome, in which only 5% of sites are found within either coding sequence or CNEs, and the D. melanogaster genome, in which the majority of sites are under direct purifying selection. Thus, we can ask whether genome structure appears to affect the degree to which BGS mimics selective sweeps. In both our human and Drosophila simulations, we see that in the presence of purifying selection and BGS the majority of genomic regions would not be expected to produce patterns of diversity consistent with a selective sweep. This is evidenced by the fact that we observe no elevation in our human simulations in the rate at which regions with BGS are misclassified as sweeps by S/HIC relative to neutrally evolving regions, and that although there is an elevation in this rate in Drosophila, it is quite subtle (≤1%) and limited to low-recombining regions as discussed below. This suggests that in both gene-dense and gene-poor genomes, the “gene oasis” scenario modeled in our central BGS simulations is relatively rare. However, we note that in both our human- and Drosophila-based simulations we found that S/HIC classifies regions as soft-linked at an increased rate; this may imply that unmodeled sources of heterogeneity in patterns of diversity can make it more difficult to discriminate between neutral evolution and linkage to nearby sweeps. Indeed, we previously observed a similar bias for S/HIC in the case of demographic misspecification (Schrider and Kern 2016).
Our study has some important limitations in that we only examined two different DFEs (Boyko et al. 2008; Huber et al. 2017), four fixed dominance coefficients, and three different demographic models that contain population size changes but no migration. There are infinite possible DFEs, distributions of dominance values, and demographic histories, so we cannot rule out the possibility that our results could change qualitatively under particular models and parameterizations. However, under each of the three different combinations of demographic history, DFE, dominance, and genome annotation examined here there is no evidence that BGS systematically resembles selective sweeps substantially more often than purely neutral models do. Thus, we expect that our conclusions will hold in most genomes where the layout of selected sites does not frequently resemble that of our central BGS scenario. Our results do suggest that low-recombining regions with a high density of selected sites flanked by primarily nonfunctional DNA may be somewhat more likely to be misclassified as sweeps and thus should be treated with greater caution (Table 1), although our classification results imply that such confounding examples are uncommon. Indeed, it is worth stressing that the elevated false-positive rate under BGS in Drosophila seems to be confined to regions with little to no recombination, implying that such regions should perhaps be omitted from sweep scans based on spatial patterns of variation; this is a logical step given that such scans search for signatures produced by the interplay between selection and recombination, and if the latter is absent there is no reason to expect such a signature. Another type of region that may be problematic is that where the recombination rate is low but only in the central portion of the window to be classified. Although we did not examine this possibility here, Mughal and DeGiorgio (2018) previously showed that dramatically decreasing the recombination rate only in the central portion of a region while keeping the locations of negatively selected sites constant produces a modest increase in the false-positive rate.
Our analysis also focused primarily on relatively small simulated chromosomal regions, although we did simulate a number of larger chromosomes and found no evidence that they produce spurious sweep signatures at a higher rate. This result is intuitive because, although larger chromosomes result in more linked selection influencing a given focal window, there is no reason to expect that this additional linked selection would produce a valley of diversity near the center of the window or create other spatial signatures of a sweep; indeed, including more distant flanking selected sites should reduce diversity more on the edge of the focal window than in its center.
It is also important to note that because our strategy was to base our simulations entirely on empirical genome annotations and DFEs, we are limited to considering the effect of single-nucleotide mutations. The DFE in noncoding regions appears to be skewed toward weaker selection coefficients (Racimo and Schraiber 2014), and this is a feature of our real BGS–weak-CNE model. However, our simulations ignore additional mutation types such as insertions/deletions (indels), transposable element insertions, and other structural variants (SVs) that are probably skewed toward stronger selection coefficients. In the standard real BGS model we have the same DFE for both coding and conserved noncoding DNA, and thus this model produces more strongly deleterious mutations than the weak-CNE model. The fact that both models produced very similar results for both species could suggest that increases to the rates and fitness effects of deleterious mutations may not cause BGS to resemble sweeps more closely. However, the impact of indels and SVs warrants further investigation, and future efforts should incorporate both the rates and DFEs of additional mutation types once they are known more precisely.
We have also only considered scans for recent completed selective sweeps. Thus, we have not examined other selection scenarios such as balancing selection or partial selective sweeps, although we have no reason to believe that BGS will systematically mirror either of these selective scenarios, with the exception of very low-frequency partial sweeps that may be indistinguishable from drifting deleterious mutations (Maruyama 1974). We also note that a recent study examining local adaptation in populations with gene flow concluded that BGS is unlikely to increase the fraction of false positives produced by scans for FST outliers (Matthey-Doret and Whitlock 2019). However, scans for other selective scenarios, including much older selective sweeps where the signature may have degraded considerably (Schrider et al. 2015), or polygenic selection that in some scenarios is expected to produce more subtle shifts in allele frequencies (Jain and Stephan 2017; Höllinger et al. 2019; Thornton 2019), may have different propensities to be mistaken for BGS than the hitchhiking models examined here.
Although we show that BGS affects the mean values of several population genetic summary statistics, for most of the statistics we examined it does not create spatial patterns qualitatively similar to those expected under hitchhiking. Thus, our results demonstrate that efforts to detect recent positive selection should utilize the broader genomic spatial context of high-dimensional summaries of variation. Importantly, sweep-detection methods that use this information (Lin et al. 2011; Schrider and Kern 2016; Mughal and DeGiorgio 2018) rather than relying on univariate summaries or examining a narrowly defined genomic region can readily detect sweeps in the presence of purifying selection and BGS. Moreover, our findings imply that attempts to disentangle the relative effects of hitchhiking and BGS on levels of diversity genome-wide (Elyashiv et al. 2016; Booker and Keightley 2018) could be made even more effective by incorporating additional summaries of variation, although this may necessitate a reliance on simulated data rather than the use of likelihood estimation. Such efforts could also help to answer the question of to what extent hitchhiking and BGS are responsible for the limited range of neutral diversity observed across species (Lewontin 1974; Leffler et al. 2013; Corbett-Detig et al. 2015; Coop 2016).
Acknowledgments
The author thanks Matt Hahn and Andy Kern for feedback on the manuscript, and Kevin Thornton for help with fwdpy11. This work was funded by the National Institutes of Health National Human Genome Research Institute under award number R00HG008696.
Footnotes
Supplemental material available at figshare: https://doi.org/10.25386/genetics.12863981.
Communicating editor: R. Nielsen
Literature Cited
- Adams M. D., Celniker S. E., Holt R. A., Evans C. A., Gocayne J. D. et al., 2000. The genome sequence of Drosophila melanogaster. Science 287: 2185–2195. 10.1126/science.287.5461.2185
- Agrawal A. F., and Whitlock M. C., 2011. Inferences about the distribution of dominance drawn from yeast gene knockout data. Genetics 187: 553–566. 10.1534/genetics.110.124560
- Andolfatto P., 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437: 1149–1152. 10.1038/nature04107
- Andolfatto P., 2007. Hitchhiking effects of recurrent beneficial amino acid substitutions in the Drosophila melanogaster genome. Genome Res. 17: 1755–1762. 10.1101/gr.6691007
- Assaf Z. J., Tilk S., Park J., Siegal M. L., and Petrov D. A., 2017. Deep sequencing of natural and experimental populations of Drosophila melanogaster reveals biases in the spectrum of new mutations. Genome Res. 27: 1988–2000. 10.1101/gr.219956.116
- Begun D. J., and Aquadro C. F., 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356: 519–520. 10.1038/356519a0
- Begun D. J., Holloway A. K., Stevens K., Hillier L. W., Poh Y.-P. et al., 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5: e310. 10.1371/journal.pbio.0050310
- Booker T. R., and Keightley P. D., 2018. Understanding the factors that shape patterns of nucleotide diversity in the house mouse genome. Mol. Biol. Evol. 35: 2971–2988.
- Boyko A. R., Williamson S. H., Indap A. R., Degenhardt J. D., Hernandez R. D. et al., 2008. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 4: e1000083. 10.1371/journal.pgen.1000083
- Braverman J. M., Hudson R. R., Kaplan N. L., Langley C. H., and Stephan W., 1995. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140: 783–796.
- Charlesworth B., Morgan M., and Charlesworth D., 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134: 1289–1303.
- Charlesworth D., Charlesworth B., and Morgan M., 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141: 1619–1632.
- Charlesworth J., and Eyre-Walker A., 2006. The rate of adaptive evolution in enteric bacteria. Mol. Biol. Evol. 23: 1348–1356. 10.1093/molbev/msk025
- Comeron J. M., 2014. Background selection as baseline for nucleotide variation across the Drosophila genome. PLoS Genet. 10: e1004434. 10.1371/journal.pgen.1004434
- Comeron J. M., Ratnappan R., and Bailin S., 2012. The many landscapes of recombination in Drosophila melanogaster. PLoS Genet. 8: e1002905. 10.1371/journal.pgen.1002905
- Coop G., 2016. Does linked selection explain the narrow range of genetic diversity across species? bioRxiv (Preprint posted March 7, 2016). 10.1101/042598
- Corbett-Detig R. B., Hartl D. L., and Sackton T. B., 2015. Natural selection constrains neutral diversity across a wide range of species. PLoS Biol. 13: e1002112. 10.1371/journal.pbio.1002112
- DeGiorgio M., Huber C. D., Hubisz M. J., Hellmann I., and Nielsen R., 2016. SweepFinder2: increased sensitivity, robustness and flexibility. Bioinformatics 32: 1895–1897. 10.1093/bioinformatics/btw051
- Elyashiv E., Sattath S., Hu T. T., Strutsovsky A., McVicker G. et al., 2016. A genomic map of the effects of linked selection in Drosophila. PLoS Genet. 12: e1006130. 10.1371/journal.pgen.1006130
- Enard D., Cai L., Gwennap C., and Petrov D. A., 2016. Viruses are a dominant driver of protein adaptation in mammals. Elife 5: e12469. 10.7554/eLife.12469
- Ewing G. B., and Jensen J. D., 2016. The consequences of not accounting for background selection in demographic inference. Mol. Ecol. 25: 135–141. 10.1111/mec.13390
- Fay J. C., and Wu C.-I., 2000. Hitchhiking under positive Darwinian selection. Genetics 155: 1405–1413.
- Ferrer-Admetlla A., Liang M., Korneliussen T., and Nielsen R., 2014. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol. Biol. Evol. 31: 1275–1291. 10.1093/molbev/msu077
- Galtier N., 2016. Adaptive protein evolution in animals and the effective population size hypothesis. PLoS Genet. 12: e1005774. 10.1371/journal.pgen.1005774
- García-Dorado A., and Caballero A., 2000. On the average coefficient of dominance of deleterious spontaneous mutations. Genetics 155: 1991–2001.
- Garud N. R., Messer P. W., Buzbas E. O., and Petrov D. A., 2015. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 11: e1005004. 10.1371/journal.pgen.1005004
- Gillespie J. H., 1984. The status of the neutral theory: the neutral theory of molecular evolution. Science 224: 732–733. 10.1126/science.224.4650.732
- Gillespie J. H., 1991. The Causes of Molecular Evolution. Oxford University Press, Oxford.
- Gravel S., Henn B. M., Gutenkunst R. N., Indap A. R., Marth G. T. et al., 2011. Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA 108: 11983–11988. 10.1073/pnas.1019276108
- Hahn M. W., 2006. Accurate inference and estimation in population genomics. Mol. Biol. Evol. 23: 911–918. 10.1093/molbev/msj094
- Hahn M. W., 2018. Molecular Population Genetics. Oxford University Press, Oxford.
- Haller B. C., Galloway J., Kelleher J., Messer P. W., and Ralph P. L., 2019. Tree‐sequence recording in SLiM opens new horizons for forward‐time simulation of whole genomes. Mol. Ecol. Resour. 19: 552–566. 10.1111/1755-0998.12968
- Halligan D. L., and Keightley P. D., 2006. Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res. 16: 875–884. 10.1101/gr.5022906
- Harris R., Sackman A., and Jensen J. D., 2018. On the unfounded enthusiasm for soft selective sweeps II: examining recent evidence from humans, flies, and viruses. PLoS Genet. 14: e1007859. 10.1371/journal.pgen.1007859
- Hermisson J., and Pennings P. S., 2005. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics 169: 2335–2352. 10.1534/genetics.104.036947
- Hermisson J., and Pennings P. S., 2017. Soft sweeps and beyond: understanding the patterns and probabilities of selection footprints under rapid adaptation. Methods Ecol. Evol. 8: 700–716. 10.1111/2041-210X.12808
- Hill W. G., and Robertson A., 1966. The effect of linkage on limits to artificial selection. Genet. Res. 8: 269–294. 10.1017/S0016672300010156
- Höllinger I., Pennings P. S., and Hermisson J., 2019. Polygenic adaptation: from sweeps to subtle frequency shifts. PLoS Genet. 15: e1008035. 10.1371/journal.pgen.1008035
- Huber C. D., Kim B. Y., Marsden C. D., and Lohmueller K. E., 2017. Determining the factors driving selective effects of new nonsynonymous mutations. Proc. Natl. Acad. Sci. USA 114: 4465–4470. 10.1073/pnas.1619508114
- Hudson R. R., and Kaplan N. L., 1994. Gene Trees with Background Selection. Springer-Verlag, NY.
- Hudson R. R., and Kaplan N. L., 1995. Deleterious background selection with recombination. Genetics 141: 1605–1617.
- Hudson R. R., Bailey K., Skarecky D., Kwiatowski J., and Ayala F. J., 1994. Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster. Genetics 136: 1329–1340.
- Jain K., and Stephan W., 2017. Rapid adaptation of a polygenic trait after a sudden environmental shift. Genetics 206: 389–406. 10.1534/genetics.116.196972
- Jensen J. D., Kim Y., DuMont V. B., Aquadro C. F., and Bustamante C. D., 2005. Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 170: 1401–1410. 10.1534/genetics.104.038224
- Jensen J. D., Thornton K. R., Bustamante C. D., and Aquadro C. F., 2007. On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in nonequilibrium populations. Genetics 176: 2371–2379. 10.1534/genetics.106.069450
- Jensen J. D., Payseur B. A., Stephan W., Aquadro C. F., Lynch M. et al., 2019. The importance of the neutral theory in 1968 and 50 years on: a response to Kern and Hahn 2018. Evolution 73: 111–114. 10.1111/evo.13650
- Kaplan N. L., Hudson R., and Langley C., 1989. The “hitchhiking effect” revisited. Genetics 123: 887–899.
- Karolchik D., Hinrichs A. S., Furey T. S., Roskin K. M., Sugnet C. W. et al., 2004. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32: D493–D496. 10.1093/nar/gkh103
- Kelleher J., Thornton K., Ashander J., and Ralph P., 2018. Efficient pedigree recording for fast population genetics simulation. PLoS Comput. Biol. 14: e1006581. 10.1371/journal.pcbi.1006581
- Kelly J. K., 1997. A test of neutrality based on interlocus associations. Genetics 146: 1197–1206.
- Kern A. D., and Hahn M. W., 2018. The neutral theory in light of natural selection. Mol. Biol. Evol. 35: 1366–1371. 10.1093/molbev/msy092
- Kern A. D., and Schrider D. R., 2016. Discoal: flexible coalescent simulations with selection. Bioinformatics 32: 3839–3841. 10.1093/bioinformatics/btw556
- Kern A. D., and Schrider D. R., 2018. diploS/HIC: an updated approach to classifying selective sweeps. G3 (Bethesda) 8: 1959–1970. 10.1101/267229
- Kim Y., and Nielsen R., 2004. Linkage disequilibrium as a signature of selective sweeps. Genetics 167: 1513–1524. 10.1534/genetics.103.025387
- Kim Y., and Stephan W., 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160: 765–777.
- Kimura M., 1968. Evolutionary rate at the molecular level. Nature 217: 624–626. 10.1038/217624a0
- Kimura M., 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge. 10.1017/CBO9780511623486
- Kong A., Thorleifsson G., Gudbjartsson D. F., Masson G., Sigurdsson A. et al., 2010. Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467: 1099–1103. 10.1038/nature09525
- Kong A., Frigge M. L., Masson G., Besenbacher S., Sulem P. et al., 2012. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488: 471–475. 10.1038/nature11396
- Lack J. B., Cardeno C. M., Crepeau M. W., Taylor W., Corbett-Detig R. B. et al., 2015. The Drosophila Genome Nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics 199: 1229–1241. 10.1534/genetics.115.174664
- Lander E. S., Linton L. M., Birren B., Nusbaum C., Zody M. C. et al., 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921. 10.1038/35057062
- Langley C. H., Stevens K., Cardeno C., Lee Y. C. G., Schrider D. R. et al., 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192: 533–598. 10.1534/genetics.112.142018
- Leffler E. M., Gao Z., Pfeifer S., Ségurel L., Auton A. et al., 2013. Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science 339: 1578–1582. 10.1126/science.1234070
- Lewontin R., 1974. The Genetic Basis of Evolutionary Change. Columbia University Press, New York.
- Li H., 2011. A new test for detecting recent positive selection that is free from the confounding impacts of demography. Mol. Biol. Evol. 28: 365–375. 10.1093/molbev/msq211
- Li H., and Stephan W., 2006. Inferring the demographic history and rate of adaptive substitution in Drosophila. PLoS Genet. 2: e166. 10.1371/journal.pgen.0020166
- Lin K., Li H., Schlötterer C., and Futschik A., 2011. Distinguishing positive selection from neutral evolution: boosting the performance of summary statistics. Genetics 187: 229–244. 10.1534/genetics.110.122614
- Lohmueller K. E., Albrechtsen A., Li Y., Kim S. Y., Korneliussen T. et al., 2011. Natural selection affects multiple aspects of genetic variation at putatively neutral sites across the human genome. PLoS Genet. 7: e1002326. 10.1371/journal.pgen.1002326
- Marth G. T., Czabarka E., Murvai J., and Sherry S. T., 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351–372. 10.1534/genetics.166.1.351
- Maruyama T., 1974. The age of a rare mutant gene in a large population. Am. J. Hum. Genet. 26: 669–673.
- Matthey-Doret R., and Whitlock M. C., 2019. Background selection and FST: consequences for detecting local adaptation. Mol. Ecol. 28: 3902–3914. 10.1111/mec.15197
- Maynard Smith J., and Haigh J., 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23–35. 10.1017/S0016672300014634
- McGaugh S. E., Heil C. S., Manzano-Winkler B., Loewe L., Goldstein S. et al., 2012. Recombination modulates how selection affects linked sites in Drosophila. PLoS Biol. 10: e1001422. 10.1371/journal.pbio.1001422
- McVicker G., Gordon D., Davis C., and Green P., 2009. Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet. 5: e1000471. 10.1371/journal.pgen.1000471
- Mughal M. R., and DeGiorgio M., 2018. Localizing and classifying adaptive targets with trend filtered regression. Mol. Biol. Evol. 36: 252–270. 10.1101/320523
- Nei M., and Li W.-H., 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76: 5269–5273. 10.1073/pnas.76.10.5269
- Nielsen R., Williamson S., Kim Y., Hubisz M. J., Clark A. G. et al., 2005. Genomic scans for selective sweeps using SNP data. Genome Res. 15: 1566–1575. 10.1101/gr.4252305
- Nordborg M., Charlesworth B., and Charlesworth D., 1996. The effect of recombination on background selection. Genet. Res. 67: 159–174. 10.1017/S0016672300033619
- Pennings P. S., and Hermisson J., 2006. Soft sweeps II—molecular population genetics of adaptation from recurrent mutation or migration. Mol. Biol. Evol. 23: 1076–1084. 10.1093/molbev/msj117
- Peters A., Halligan D., Whitlock M., and Keightley P., 2003. Dominance and overdominance of mildly deleterious induced mutations for fitness traits in Caenorhabditis elegans. Genetics 165: 589–599.
- Pybus M., Luisi P., Dall’Olio G. M., Uzkudun M., Laayouni H. et al., 2015. Hierarchical boosting: a machine-learning framework to detect and classify hard selective sweeps in human populations. Bioinformatics 31: 3946–3952.
- Racimo F., and Schraiber J. G., 2014. Approximation to the distribution of fitness effects across functional categories in human segregating polymorphisms. PLoS Genet. 10: e1004697. 10.1371/journal.pgen.1004697
- Ronen R., Udpa N., Halperin E., and Bafna V., 2013. Learning natural selection from the site frequency spectrum. Genetics 195: 181–193. 10.1534/genetics.113.152587
- Sabeti P. C., Reich D. E., Higgins J. M., Levine H. Z., Richter D. J. et al., 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837. 10.1038/nature01140
- Schrider D. R., and Kern A. D., 2016. S/HIC: robust identification of soft and hard sweeps using machine learning. PLoS Genet. 12: e1005928 10.1371/journal.pgen.1005928 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schrider D. R., and Kern A. D., 2017. Soft sweeps are the dominant mode of adaptation in the human genome. Mol. Biol. Evol. 34: 1863–1877. 10.1093/molbev/msx154 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schrider D. R., and Kern A. D., 2018. On the well-founded enthusiasm for soft sweeps in humans: a reply to Harris, Sackman, and Jensen, Zenodo; 10.5281/zenodo.1473856 [DOI] [Google Scholar]
- Schrider D. R., Houle D., Lynch M., and Hahn M. W., 2013. Rates and genomic consequences of spontaneous mutational events in Drosophila melanogaster. Genetics 194: 937–954. 10.1534/genetics.113.151670 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schrider D. R., Mendes F. K., Hahn M. W., and Kern A. D., 2015. Soft shoulders ahead: spurious signatures of soft and partial selective sweeps result from linked hard sweeps. Genetics 200: 267–284. 10.1534/genetics.115.174912 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schrider D. R., Shanku A. G., and Kern A. D., 2016. Effects of linked selective sweeps on demographic inference and model selection. Genetics 204: 1207–1223. 10.1534/genetics.116.190223 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sheehan S., and Song Y. S., 2016. Deep learning for population genetic inference. PLoS Comput. Biol. 12: e1004845 10.1371/journal.pcbi.1004845 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Siepel A., Bejerano G., Pedersen J. S., Hinrichs A. S., Hou M. et al. , 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15: 1034–1050. 10.1101/gr.3715005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Simonsen K. L., Churchill G. A., and Aquadro C. F., 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141: 413–429. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Smith N. G., and Eyre-Walker A., 2002. Adaptive protein evolution in Drosophila. Nature 415: 1022–1024. 10.1038/4151022a [DOI] [PubMed] [Google Scholar]
- Smukowski C., and Noor M., 2011. Recombination rate variation in closely related species. Heredity 107: 496–508. 10.1038/hdy.2011.44 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stephan W., 2010. Genetic hitchhiking vs. background selection: the controversy and its implications. Philos. Trans. R. Soc. Lond. B Biol. Sci. 365: 1245–1253. 10.1098/rstb.2009.0278
- Tachida H., 2000. DNA evolution under weak selection. Gene 261: 3–9. 10.1016/S0378-1119(00)00475-3
- Tajima F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460.
- Tajima F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.
- Tennessen J. A., Bigham A. W., O’Connor T. D., Fu W., Kenny E. E., et al., 2012. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337: 64–69.
- Teshima K. M., Coop G., and Przeworski M., 2006. How reliable are empirical genomic scans for selective sweeps? Genome Res. 16: 702–712. 10.1101/gr.5105206
- Thornton K. R., 2014. A C++ template library for efficient forward-time population genetic simulation of large populations. Genetics 198: 157–166. 10.1534/genetics.114.165019
- Thornton K. R., 2019. Polygenic adaptation to an environmental shift: temporal dynamics of variation under Gaussian stabilizing selection and additive effects on a single trait. Genetics 213: 1513–1530. 10.1534/genetics.119.302662
- Torres R., Szpiech Z. A., and Hernandez R. D., 2018. Human demographic history has amplified the effects of background selection across the genome. PLoS Genet. 14: e1007387 [corrigenda: PLoS Genet. 15: e1007898 (2019)]. 10.1371/journal.pgen.1007387
- Torres R., Stetter M. G., Hernandez R. D., and Ross-Ibarra J., 2019. The temporal dynamics of background selection in non-equilibrium populations. bioRxiv (Preprint posted October 17, 2019). 10.1101/505750
- Voight B. F., Kudaravalli S., Wen X., and Pritchard J. K., 2006. A map of recent positive selection in the human genome. PLoS Biol. 4: e72 [corrigenda: PLoS Biol. 5: e14 (2007)]. 10.1371/journal.pbio.0040072
- Watterson G., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276. 10.1016/0040-5809(75)90020-9
- Wilson B. A., Petrov D. A., and Messer P. W., 2014. Soft selective sweeps in complex demographic scenarios. Genetics 198: 669–684. 10.1534/genetics.114.165571
- Zeng K., 2013. A coalescent model of background selection with recombination, demography and variation in selection coefficients. Heredity 110: 363–371. 10.1038/hdy.2012.102
Data Availability Statement
All code for generating forward and coalescent simulations, calculating and visualizing summary statistics, and training and applying diploS/HIC is available at https://github.com/SchriderLab/posSelVsBgs. In addition, all simulated data, along with statistics from each replicate in both text and graph form, are available at https://figshare.com/projects/posSelVsBgs/72209. Supplemental material is available at figshare: https://doi.org/10.25386/genetics.12863981.