Abstract
Selective sweeps are frequent and varied signatures in the genomes of natural populations, and detecting them is consequently important in understanding mechanisms of adaptation by natural selection. Following a selective sweep, haplotypic diversity surrounding the site under selection decreases, and this deviation from the background pattern of variation can be applied to identify sweeps. Multiple methods exist to locate selective sweeps in the genome from haplotype data, but none leverages the power of a model-based approach to make their inference. Here, we propose a likelihood ratio test statistic T to probe whole-genome polymorphism data sets for selective sweep signatures. Our framework uses a simple but powerful model of haplotype frequency spectrum distortion to find sweeps and additionally make an inference on the number of presently sweeping haplotypes in a population. We found that the T statistic is suitable for detecting both hard and soft sweeps across a variety of demographic models, selection strengths, and ages of the beneficial allele. Accordingly, we applied the T statistic to variant calls from European and sub-Saharan African human populations, yielding primarily literature-supported candidates, including LCT, RSPH3, and ZNF211 in CEU, SYT1, RGS18, and NNT in YRI, and HLA genes in both populations. We also searched for sweep signatures in Drosophila melanogaster, finding expected candidates at Ace, Uhg1, and Pimet. Finally, we provide open-source software to compute the T statistic and the inferred number of presently sweeping haplotypes from whole-genome data.
Keywords: maximum likelihood, selective sweep, haplotype
Introduction
A selective sweep is a genomic signature resulting from positive selection in which the linked variants surrounding the site under selection rise to high frequency together in a population, thereby yielding a footprint of reduced diversity that can span across megabases (Przeworski 2002; Gillespie 2004; Kim and Nielsen 2004; Garud et al. 2015; Hermisson and Pennings 2017). Thus, a recent selective event is identifiable in polymorphism data from a region of extended haplotype homozygosity, and the signal of a selective sweep accordingly decays over time as mutation and recombination break up long haplotypes (Sabeti et al. 2002; Schweinsberg and Durrett 2005; Voight et al. 2006). Selective sweeps can arise from multiple processes, including the de novo emergence of a selectively advantageous allele, selection on standing population haplotypic variation, and recurrent mutation to a selectively advantageous allele (Hermisson and Pennings 2005; Pennings and Hermisson 2006a, 2006b). The former scenario is a hard sweep, in which a single haplotype rises to high population frequency, gradually replacing all other haplotypes as the sweep proceeds to fixation. The latter two scenarios are soft sweeps, in which multiple haplotypes simultaneously rise to high population frequency, and a greater haplotypic diversity underlies the sweep.
Identifying selective sweeps is important because sweeps serve as indicators of recent rapid adaptation in a population, providing insight into the pressures that shaped its present-day levels of genetic diversity (Vatsiou et al. 2016; Librado et al. 2017). These pressures can vary considerably in their intensity and duration, resulting in selection signals of varying magnitude ranging from prominent, such as LCT in Europeans (Bersaglieri et al. 2004), to the more subtle ASPM, implicated in the development of human brain size (Kouprina et al. 2004; Peter et al. 2012). Whereas strong sweeps are typically easy to detect, weaker sweeps typically require a large sample size for detection (Jensen et al. 2007; Pavlidis et al. 2013) and may only be identifiable through sophisticated approaches (Chen et al. 2010). Selective sweeps, although not the only signature of adaptation in natural populations, are likely to occur at loci where mutations have a large effect size, little negative pleiotropic effects, and contribute to phenotypes that are either monogenic or controlled by few genes (Pritchard and Di Rienzo 2010). In addition, identifying selective sweeps is important to make inferences about the relative contributions of hard and soft sweeps to adaptive events in study organisms (Garud et al. 2015; Schrider and Kern 2016; Harris et al. 2018), which is a topic of continued debate (Hernandez et al. 2011; Jensen 2014; Schrider and Kern 2017; Harris, Sackman, et al. 2018; Mughal and DeGiorgio 2019).
Multiple powerful methods have been proposed to characterize selective sweeps, and well established among these are composite likelihood ratio (CLR) methods (Kim and Stephan 2002; Kim and Nielsen 2004; Nielsen et al. 2005; Chen et al. 2010; Pavlidis et al. 2013; Vy and Kim 2015; DeGiorgio et al. 2016; Huber et al. 2016; Racimo 2016) and haplotype homozygosity-based methods (Voight et al. 2006; Ferrer-Admetlla et al. 2014; Garud et al. 2015; Harris et al. 2018). The former category of methods represents approaches in which the probability of neutrality in a genomic region under analysis is compared with the probability of a selective sweep in that region, based on a model of distortion in the site frequency spectrum (SFS) expected under a sweep. A CLR statistic quantifies support for the alternative hypothesis of selection, with larger values indicating greater support. Although CLR methods make simplifying assumptions in their models (Beaumont et al. 2010; Pavlidis and Alachiotis 2017), they have demonstrated a powerful capacity for identifying multiple different signatures of selection without the need for computationally intense calculations of full likelihood functions (Kim and Stephan 2002; DeGiorgio et al. 2014; Huber et al. 2016). However, because they are typically allele frequency-based approaches, the CLR methods may lack in power to detect soft sweeps in comparison to haplotype-based methods, which can generally detect both (Pennings and Hermisson 2006b; Ferrer-Admetlla et al. 2014). Accordingly, the need exists for methods that leverage the power and efficiency of CLR approaches, while providing the sensitivity of haplotype-based approaches.
We introduce an approach for identifying selective sweep signatures using a likelihood ratio framework T that is the first haplotype-based method of its kind, intended to address the limitations of previous methods. Our T statistic (see Definition of Statistic) has high power to detect recent sweeps from genome-wide polymorphism data and additionally infers the number of presently sweeping haplotypes as a model parameter, providing an additional layer of insight not shared with other CLR methods. This attribute is especially important because it eliminates the need for time- and computation-heavy alternatives, such as training a machine-learning classifier (Lin et al. 2011; Kern and Schrider 2018; Mughal and DeGiorgio 2019), or drawing inferences from a posterior distribution by approximate Bayesian computation (Garud et al. 2015; Harris et al. 2018; Harris and DeGiorgio 2020). We demonstrate with simulated data that the T statistic identifies recent hard and soft sweeps and performs especially well for population size expansion models. As such, our application of the T statistic to human and Drosophila melanogaster data sets recovered multiple previously characterized candidate sweeps in both organisms, allowing us to corroborate and enhance our understanding of adaptation in each of their histories. We implement the T statistic in an open-source software package termed LASSI (Likelihood-based Approach for Selective Sweep Inference), which can be downloaded at http://degiorgiogroup.fau.edu/LASSI.html (last accessed May 25, 2020).
Definition of Statistic
The goal of our approach is to identify genomic signatures of selective sweeps. We achieve this by assigning a T statistic to each single-nucleotide polymorphism (SNP)-delimited window of analysis in the genome. The T statistic is a measure of the likelihood that an analysis window is consistent with a selective sweep rather than neutrality. We base this inference on the sample haplotype frequency spectrum, reasoning that a spectrum with few high-frequency haplotypes indicates a sweep, and a spectrum with no moderate- or high-frequency haplotypes indicates neutrality. For this reason, genomic regions with low mutation and recombination rates can resemble selective sweep regions owing to their reduced nucleotide and haplotypic diversity (Pollinger et al. 2005; Wiehe et al. 2007; O’Reilly et al. 2008; Pavlidis and Alachiotis 2017), and so caution is warranted as with any approach. The T statistic is a likelihood ratio test in which the model of neutrality, based on the genome-wide haplotype frequency spectrum, is nested within the model of selection, based on a distortion of the genome-wide haplotype frequency spectrum toward few moderate- or high-frequency haplotypes. We illustrate examples of haplotype frequency spectra for neutrality and sweeps in figure 1 and also provide a schematic showing how key model parameters relate to distortions in the haplotype frequency spectrum.
To begin, we must first define the haplotype spectrum on which we will base our neutral expectation. That is, the spectrum that we will assign as representative for a genomic window under neutrality. For all genomic windows in the sample, we extract the haplotype frequency spectrum, arrange frequencies in descending order, and truncate the spectrum at an arbitrary value K most frequent haplotypes (compare top and middle panels of fig. 1, first column). Thus, for each window for L windows, we have a truncated spectrum , where , and normalized such that . Next, we define the vector , such that for . We now use p as our neutral expectation for likelihood computations.
From the vector p, we define the vector , which represents a hypothetical distorted frequency spectrum consistent with a model of m sweeping haplotypes in an analysis window, with . Accordingly, the choice of K, although flexible for any analysis, most directly affects the resolution with which we can classify soft sweeps. For example, identifying a sweep on nine distinct, presently sweeping haplotypes requires at minimum a K = 10 truncation, whereas a sweep on 14 haplotypes requires K = 15 and would likely appear as neutral under smaller truncations. We generate by increasing the frequency of sweeping haplotype classes at the expense of nonsweeping haplotype classes in a heuristic manner. We note therefore that our approach is purely statistical and does not feature an underlying population-genetic model but is an attempt to capture the features in the haplotype frequency spectrum consistent with those expected from a sweep. The vector is related to p by
(1) |
where fi, with and for , is a term defining the manner in which the mass associated with haplotype frequencies in the neutral frequency spectrum is distributed among to generate the sweep frequency spectrum of the alternative model; U and ε are respectively the frequencies of the most and least frequent nonsweeping haplotype classes, and .
We can define fi in multiple ways. Choosing (model A) generates an alternative model in which value is uniformly added to each of . We can also specify a distortion in which value is added proportionally to each sweeping haplotype frequency, where , such as (model B), (model C), (model D), or (model E). The latter nonuniform models may provide a more accurate representation of the haplotype frequency spectrum following a sweep, as sweeping haplotypes, in contrast to neutral haplotypes, may not exist at similar frequencies to one another (see fig. 1, right column). As such, we primarily use model D for inferences in the Results. The choices of U and ε determine the frequency of the nonsweeping haplotype classes in the alternative model. For , we make the simplifying assumption that the value of decreases linearly for , whereas constrains all to equal ε for . Regardless of the choice of U and ε, their relationship with each other and is necessarily . We also note that by definition, illustrating that the null (neutral) model is nested within the alternative (sweep distortion) model.
For each analysis window, we must finally obtain a vector of counts x, observed for the most frequent K haplotypes. We define , where elements are once again arranged in descending order, with . We normalize each xi to satisfy the constraint , where n is the number of haplotypes in the sample.
Using the model haplotype frequency spectra p and in conjunction with the observed vector of counts x for the most frequent K haplotypes in a particular genomic window, we define likelihood functions, which are based on the multinomial distribution. Our use of the multinomial distribution is reasonable as it describes the probability of observing the vector of haplotype counts (x) across K haplotype categories given the vector of respective haplotype frequencies (p or ). The likelihood of the model parameters under the null hypothesis (neutrality) given the haplotype counts in an analysis window, equivalent to the probability of obtaining the observed haplotype counts x given p and K, is
(2) |
whereas the likelihood under the alternative hypothesis (sweep distortion) is
(3) |
Therefore, the log-likelihoods are
(4) |
and
(5) |
We optimize over and , keeping U fixed, to find
Thus, our test statistic is defined as
(6) |
Each analysis window in the genome is assigned a test statistic in this manner, and larger test statistics indicate greater support for a sweep in the window (i.e., greater distortion toward few moderate- or high-frequency haplotypes). Because in the process we also infer the most likely number of presently sweeping haplotypes to yield the underlying distorted haplotype spectrum, our approach can also be used to quantify the softness of an identified sweep.
Results
We first performed experiments with simulated data in which we generated populations based on nonequilibrium human demographic models inferred with smc++ (Terhorst et al. 2017), covering a variety of neutral and selection scenarios. These demographic models consisted of a history based on that of the CEU European population, featuring a prominent bottleneck about 2,000 generations prior to sampling, and a sub-Saharan African history resembling that of the YRI population, characterized by large population size with a recent expansion; these attributes of both models are consistent with previous estimates (Gravel et al. 2011; Gronau et al. 2011). Using these simulations, we measured the power of the T statistic and contextualized our results by comparing T to other popular methods, comprising (in order of increasing sophistication) H12 (Garud et al. 2015), nSL (Ferrer-Admetlla et al. 2014), SweepFinder2 (Nielsen et al. 2005; DeGiorgio et al. 2016; Huber et al. 2016), and Trendsetter (Mughal and DeGiorgio 2019), across hard and soft sweep scenarios. We applied Trendsetter, a machine-learning method that uses information on the spatial autocorrelation of statistics, using both its standard six-statistic approach—incorporating pairwise sequence difference , squared correlation coefficient r2, number of distinct haplotypes , and the expected haplotype homozygosity statistics H1, H12, and H2/H1—and using contiguous T statistic analysis windows as input (“T-Trendsetter”).
To test the versatility of the T statistic, we probed the effects of various confounding factors on T statistic inferences. Foremost among these was admixture, which can mimic the signature of a sweep when a donor population of small effective size contributes ancestry to the sampled population (Lohmueller et al. 2009; Harris et al. 2018). We also computed the value of the T statistic in regions of low mutation and recombination rates to evaluate whether their associated reductions in haplotypic diversity could be mistaken for sweeps. Due to its pervasiveness in genomes, we then generated models of background selection to determine whether it can affect the value of the T statistic, as background selection has been implicated as a confounding factor when searching for selective sweeps (Charlesworth et al. 1993, 1995; Seger et al. 2010; Cutter and Payseur 2013; Nicolaisen and Desai 2013; Huber et al. 2016). Additionally, we evaluated the effects of data set confounding factors, exploring the impact of missing data and small sample size on power. Complementing the power analyses, we evaluated the performance of our method in terms of its ability to infer the number of sweeping haplotypes at the time of sampling (). We use as a proxy for the number of distinct presently sweeping haplotypes in the population (model parameter m), which itself is a proxy for the true number of initially sweeping haplotypes (ν), an unknown parameter. Finally, we applied our method to data from the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015) and the Drosophila Genetic Reference Panel (DGRP; Mackay et al. 2012) to measure our ability to properly identify and classify selective sweep candidates.
Detection and Characterization of Selective Sweeps
We measured the power of our likelihood ratio test statistic (T) to differentiate selective sweeps from neutrality across diverse simulated scenarios using a sliding analysis window approach. For hard sweeps, as well as soft sweeps on ν = 4 initially sweeping haplotypes, we compare the power of T with that of the four alternate methods. Larger values of the T statistic for an analysis window indicate a greater departure from the neutral haplotype frequency spectrum, and therefore a greater probability of a sweep within that genomic region. To measure power, we first simulated 1,000 neutral replicates of 1-Mb chromosomes under the CEU and YRI demographic models. From these simulations, we obtained each model’s expected truncated neutral haplotype frequency spectrum , which was the basis of our likelihood computations (see Definition of Statistic). The spectrum p for a model represents the mean across all genomic windows of all replicates, truncated at a particular value of K. Thus, K = 10 indicates the spectrum of the most frequent ten haplotypes in a genomic window, whose frequencies are labeled p1–p10. To assess power, we computed the T statistic for each 117-SNP (see Materials and Methods) genomic window of each simulated neutral replicate. We solely retained the maximum value of the T statistic across all windows for each neutral replicate, and similarly retained the maximum T statistic across each replicate of each selection scenario we tested. In our experiments, we assessed power at the 1% and 5% false positive rates (FPRs), meaning that we measured the proportion of selection replicates respectively exceeding the top 1% or 5% of T statistics within the neutral distribution.
The T statistic has high power to detect a recent hard sweep (ν = 1 sweeping haplotype) affecting the CEU-based demographic history, provided that the selection coefficient is at least s = 0.005 (fig. 2, top). At both the 1% and 5% (supplementary fig. S1, Supplementary Material online) FPRs, the T statistic reliably detects hard sweeps beginning between 1,000 and 1,500 generations before sampling, with the strongest sweeps extending the lower bound of this range to 200 generations (fig. 2, rightmost column). The power of the T statistic attenuates for more ancient sweep events because haplotype identity surrounding the selected site decays over time in the population as mutation and recombination generate new haplotypes. Additionally, power to detect the most recent and weakest sweeps is low because sufficient time has not elapsed for the selected haplotype to reach high frequency. Sweeps on s < 0.005 are specifically difficult to detect due to their smaller footprint and shorter time over which elevated haplotype homozygosity persists as selection proceeds. For these reasons, there is no point in time at which the T statistic can detect these sweeps (supplementary fig. S3, Supplementary Material online).
Across simulated CEU selection scenarios, each of the alternate methods we examined was subject to the same power limitations as the T statistic, which outperformed all except for the more sophisticated Trendsetter. The relative performance of all methods indicates that pairwise sequence identity tract length, which nSL measures, is the most volatile sweep signal, decaying more rapidly than others. nSL consistently had the lowest power of all tested methods and reached high power only for the most recent strong hard sweeps (fig. 2 and supplementary fig. S1, top-right, Supplementary Material online), quickly losing power as the sweep footprint eroded. Likewise, H12 never matched the power of the T statistic except in detecting the strongest sweeps, but in using a fixed window size retained somewhat more power than did nSL. SweepFinder2 displayed greater power than H12, with higher maxima and longer signal duration. Despite not using haplotype information, SweepFinder2 incorporates a population-genetic model of a recent hard sweep, which results in more power than methods which do not. Finally, Trendsetter had easily the greatest power to detect hard sweeps under the CEU model, losing little resolution for sweeps up to 4,000 generations old. Using evidence from multiple signals may therefore be necessary to maximize power, as the strengths of each component statistic can complement the others’ weaknesses across different parameter configurations.
The T statistic achieves greater power for hard sweeps on simulated YRI demographic models than for CEU models across all tested scenarios (fig. 2 and supplementary fig. S1, bottom, Supplementary Material online). This increased power is due to the greater effective size of African relative to European human populations, which results in greater background haplotype diversity and therefore increased prominence of selective sweeps. Furthermore, because the size of the YRI population is an order of magnitude greater than that of the CEU for hundreds of generations, the population-scaled selection coefficient for YRI remains much larger, resulting in a stronger sweep. Accordingly, power declines more slowly for older sweeps and remains for sweeps as old as 4,000 generations before sampling. All methods therefore show greater power when applied to simulated YRI data. Notably, although Trendsetter and the T statistic display excellent performance once again, SweepFinder2 demonstrates consistently superior power for older sweeps under the YRI model, and this power scarcely decays. This suggests that SweepFinder2 may be more susceptible to demographic history than other methods, losing considerable power under the CEU bottleneck (Jensen et al. 2005; Huber et al. 2016). Meanwhile, the choice of K truncation—15, 20, or 25 (supplementary fig. S5, Supplementary Material online)—yielded little change in power to detect simulated sweeps from relative to our highlighted value of K = 10 for either population model (fig. 2 and supplementary fig. S1, fourth column, Supplementary Material online). However, we note that power is slightly greater as K decreases, and so we recommend the use of smaller K truncations where possible.
For soft sweeps from selection on standing genetic variation (; supplementary fig. S4, Supplementary Material online), the power of the T statistic attenuates more rapidly than for hard sweeps, and T rarely reaches values as large, especially for weaker sweeps. Under both CEU (supplementary fig. S4, top, Supplementary Material online) and YRI (supplementary fig. S4, bottom, Supplementary Material online) demographic histories, trends in power remain consistent regardless of the number of sweeping haplotypes, with maximum power of T achieved once again for sweeps between 1,000 and 1,500 generations old (or up to 3,000 for YRI); however, power declines as the number of sweeping haplotypes increases. Assessing power at the 5% FPR indicates that we nonetheless maintain sufficient differentiation between sweeps and neutrality for up to ν = 4 distinct initially sweeping haplotypes for CEU models, or up to ν = 8 for YRI models. To better understand the relationship between power and ν, we tracked the mean number of distinct sweeping haplotypes through time for simulated soft sweeps across each selection strength range and choice of ν using an in silico barcoding approach (see Materials and Methods). We found that weak soft sweeps frequently lose most of their sweeping haplotypes by the time of sampling, undergoing a hardening (Wilson et al. 2014) during the early stages of the sweep when the beneficial allele’s frequency is low, and still subject to genetic drift (supplementary figs. S13 and S14, Supplementary Material online). After the beneficial allele becomes established in the population, the number of sweeping haplotypes remains generally stable. Because weaker sweeps require more time to establish, this provides more time for haplotypes to be lost, and for fewer sweeping haplotypes to be sampled, relative to stronger sweeps. Thus, the T statistic can have greater power for our weaker simulated soft sweeps from larger ν than for stronger soft sweeps because the former case ultimately yields a more distinct sweep signal with fewer high-frequency haplotypes, whereas the latter often results in scenarios of high haplotypic diversity that are difficult to distinguish from neutrality.
The power of each alternative method responded to soft sweep (ν = 4) scenarios in the same manner as that of T. Methods generally had poor to middling performance at the 1% FPR for the CEU history (fig. 3, top), but decent power at the 5% FPR, especially for sweep strengths between 0.005 and 0.1 (supplementary fig. S2, top, Supplementary Material online), whereas the power of all methods was improved for the YRI model (fig. 3 and supplementary fig. S2, bottom, Supplementary Material online). However, SweepFinder2 retains little power to detect soft sweeps, and lost power proportionally to the number of sweeping haplotypes at the time of sampling, as it is specifically formulated to detect hard sweeps through distortions in the SFS, and soft sweeps do not dramatically alter the SFS (Pennings and Hermisson 2006b). Expectedly, Trendsetter was still the most powerful method, with T and H12 following closely behind for recent sweeps, and nSL lagging once again. Thus, the demographic and selective histories of the sampled population play an important role in the power of the T statistic and other sweep detection methods. Our results nonetheless indicate that the T statistic is flexible as to the selection scenarios it can distinguish from neutrality and detects recent sweeps especially well for the relatively little computation time it requires.
Selective sweeps produce elevated values of the T statistic along the simulated chromosome that on average peaks in the region surrounding the site under selection (supplementary figs. S6 and S7, first and third rows, Supplementary Material online). Furthermore, T remains elevated beyond the 900-kb bounds that we examined, indicating that on average, the shape of its distribution in a genomic region, as well as its overall elevated value, may be used to distinguish selection from neutrality. A signal peak often exists even for scenarios in which we do not have high power, though its maximum associated value remains small on average. Because neutral regions are likely to feature plateaus rather than peaks in the value of the T statistic, our observations illustrate the potential importance of considering the correlation in signal between windows to identify more subtle selection signatures. This is especially important for soft sweeps, which lose prominence proportionally to the number of sweeping haplotypes but still produce a peak-like distortion of local T statistic values.
To motivate this point, we reevaluated the power of our T statistic for sweeps on ν = 1 and ν = 4 haplotypes with the penalized multinomial regression method, Trendsetter (Mughal and DeGiorgio 2019), as it directly incorporates the genomic spatial distribution of sweep (summary) statistics in its inferences (supplementary figs. S8 and S9, Supplementary Material online). Using T as the sole statistic in the statistical learning protocol across 201 11-SNP windows (Trendsetter’s standard approach; see Materials and Methods), we considerably improved our ability to detect more ancient sweep signatures than in our standard application of T, especially under the CEU model, yielding power comparable to the full six-statistic approach. Moreover, we saw little change in power to detect recent sweeps, and power was improved overall for soft sweeps. When we instead used our typical 117-SNP windows for T, and 75 windows total, the T-Trendsetter approach yielded a uniformly enhanced power profile whose trends matched our original results with the standard T statistic, but with greater overall power, especially for soft sweeps (supplementary fig. S9, Supplementary Material online). Thus, the spatial distribution of the T statistic provides an informative sweep signature that can be used to isolate sweep regions from neutrality. By learning this spatial distribution, we can enhance the power of T to detect sweeps that may be overlooked using an isolated per-window approach.
In addition to evaluating the power of the T statistic, we measured the ability of our approach to infer the number of presently sweeping haplotypes () at the site under selection. The ability to infer is a result of optimizing the likelihood function over all possible m distortion models for the chosen truncation K (see Definition of Statistic). In figure 4, we show the distribution of T statistics with their associated haplotype frequency spectra and , for each of 1,000 neutral, mixed hard sweep (, ν = 1), and mixed soft sweep (ν = 4) replicates, under both the CEU and YRI models (same data as figs. 2 and 3; t = 1,500 for CEU and t = 2,500 for YRI, representing points of maximum power). Relative to neutrality (fig. 4, left), we more often assign smaller to sweep simulations (fig. 4, center and right). This result fits with the expectation that under a sweep, the first few haplotype classes exist at elevated frequency relative to the remaining classes, and this also translates to larger values of T for those replicates. Accordingly, sweeps that have weaker signatures due to their age or selection coefficient are not only difficult to distinguish from neutrality but also difficult to accurately classify with , yielding patterns that fit within the neutral distribution. We found that trends were highly congruent between the CEU and YRI sweep models, but the large neutral background diversity for YRI made it less likely that we would infer a small in the absence of a sweep.
To further understand the sweep classification properties of the T statistic, we generated box plots summarizing the distribution of across the more prominent strong sweep scenarios () we previously analyzed (figs. 2 and 3 and supplementary fig. S4, Supplementary Material online). In this way, we were able to better understand our ability to correctly classify sweeps as hard or soft, especially because our trajectory results (supplementary figs. S13 and S14, Supplementary Material online) provided us with an expectation of the number of remaining sweeping haplotypes at the time of sampling for prominent sweep scenarios. We found that sweeps initiated from larger ν were more likely to be classified as soft using our K = 10 truncated spectra, but frequently we found that the median inferred for prominent soft sweeps was one, consistent with a hard sweep (supplementary figs. S10 and S11, Supplementary Material online). Regardless of ν, the spatial signature of along the chromosome forms a valley surrounding the site of selection that mirrors the signal peak when a sweep is detectable (supplementary figs. S6 and S7, Supplementary Material online). These results suggest that our present approach may therefore be more accurate as a binary classifier (hard versus soft), though we still assign soft sweeps on a continuum.
Because phasing haplotypes may not be possible in all cases, such as in the study of nonmodel organisms, we sought to expand our application of the T statistic to unphased multilocus genotype (MLG) data. To evaluate power for MLGs, we used the previous simulated human demographic model replicates of prior experiments (represented in figs. 2 and 3), merging each individual’s two haplotypes. Whereas haplotypes are character strings indicating the state of a biallelic SNP as either reference or alternate along a region of one copy of an individual’s genome, MLGs have three possible states for each biallelic SNP—homozygous reference, homozygous alternate, or heterozygous—and half the sample size of phased haplotypes. We found that, as with the transition between phased and unphased data for haplotype homozygosity statistics (Harris et al. 2018; Harris and DeGiorgio 2020), trends in power for the unphased application of the T statistic were wholly consistent with those of the phased application, for both hard and soft sweeps (supplementary fig. S15, Supplementary Material online). The smaller size of the MLG samples resulted in slight decreases in power for each sweep scenario, as well as smaller values of the T statistic relative to the phased application, but our results indicate that selective sweeps may be reliably identified nonetheless without the need to phase haplotypes. Likewise, we found that the T statistic applied to MLGs could generate inferences of sweep softness from that matched those of haplotype data, further underscoring the parallel performance of our approach on unphased data (supplementary fig. S16, Supplementary Material online).
Effects of Confounding Factors of the T Statistic
There are multiple genetic and nongenetic factors that may affect the ability of sweep statistics to properly localize a selection signature. Though these factors are varied in origin, they may each reduce genetic diversity locally and spuriously generate patterns similar to those of selective sweeps. Accordingly, we examined the effects of introducing these confounding factors to our simulated data, allowing us to understand the scenarios for which T is robust and susceptible to misclassifying sweeps. Among genomic factors, we observed the impact of admixture into the sampled population, reductions in mutation and recombination rates, and background selection. Common nongenomic factors that can change the interpretation of sweep statistics are missing data, small sample sizes, and reliance on a misspecified demographic model.
We begin with admixture, which can be pervasive in natural populations (Hudjashov et al. 2017; Kopatz et al. 2017; Browning et al. 2018; Barría et al. 2019). Previous work has shown that under certain scenarios, admixture from an unsampled donor population can lead to reductions in haplotypic diversity across the genome of a study population (Harris et al. 2018). Specifically, admixture from a diverged donor population of small effective size into the sampled study population may introduce large tracts of homozygous sequences that methods may interpret as a selection signal. To assess the extent to which the T statistic—which makes use of the study genome’s background haplotypic patterns—is affected by admixture, we simulated neutral replicates under the CEU and YRI models receiving pulse admixture from a highly diverged donor ( generations prior to sampling). Our tested admixture proportions were at generations prior to sampling, and we examined donor population effective sizes of diploids (roughly 1/10, equal to, and 10-fold the effective size of the sampled populations).
We found that admixture had a considerable effect on the haplotype frequency spectrum of the target population and assessed this effect in two ways. First, we computed the T statistic for each admixture scenario, but using an unadmixed background p spectrum (supplementary figs. S17 and S19, Supplementary Material online). Our results demonstrate that admixture from a donor with small effective size yields the expected haplotypic diversity reduction in the sampled population, producing inflated T statistics. In contrast, gene flow from medium- and large-sized donors rarely produced large values of T, except for large-donor admixture at . We attempted to address the confounding effects of admixture by computing T from an appropriately matched admixed background spectrum for each tested value of α (supplementary figs. S18 and S20, Supplementary Material online). Using a matched resulted in T statistic distributions under admixture that more closely resembled unadmixed distributions. Regardless of scenario, the T distribution deriving from a spectrum informed by admixture resulted in median values of T for admixed scenarios closer to the median for unadmixed replicates, especially for small-donor admixture. Although this was uniformly beneficial for the CEU model, overcorrected for large-donor scenarios under the YRI model. The susceptibility of our T statistic to confounding admixture is a consequence of using a model that does not account for mixed ancestry in the target population. Instead, the reliance of our approach on an average neutral background spectrum means that we are not capturing the higher moment effects of admixture on the admixed T statistic distribution, such as its variance. Even so, we expect that in human populations, admixture is unlikely to be as extreme as in our simulated examples and is likely to feature populations that are less diverged from one another, and of closer effective size to one another, reducing its overall detrimental impact when searching for sweeps.
Because natural genomes may feature wide variation in recombination and mutation rates, we sought to determine the effect of such variation on the value of T in the absence of selection in order to quantify its potential misleading effect on T. In comparison to standard simulations featuring and mean drawn from an exponential distribution, we generated simulated replicates with and mean , reduced by one order of magnitude (see Materials and Methods). Reducing the mean recombination rate had the anticipated effect of slightly inflating the distribution of T relative to normal values of r, a result of the reduction in genetic diversity in regions where new haplotypes rarely arise (supplementary fig. S21, Supplementary Material online). In contrast, reducing the mutation rate by an order of magnitude resulted in a deflation of T statistic values relative to the original rate. This is because our SNP-delimited windows become physically wider when SNP density is reduced, leading to the incorporation of SNPs in lower mean linkage disequilibrium (LD), and therefore more haplotypic diversity. Reducing both μ and mean r led to an intermediately deflated T statistic distribution. This result may suggest that mutation rate variation is more important than recombination rate variation in determining the T statistic value under SNP-delimited windows (supplementary fig. S21, Supplementary Material online). However, because we already draw recombination rates from a distribution in our default protocol, it is possible that we observed a greater effect by changing μ because our reduced-μ scenarios represent a greater departure from standard simulations. Thus, we caution that recombination rate variation should not be ignored as a source of false signals in analyses with any sweep statistic.
We next examined background selection scenarios for both haplotype and MLG data to determine whether the loss of genetic diversity associated with linked purifying selection could spuriously yield elevated values of the T statistic. Simulating 1-Mb chromosomes as previously under both human demographic models, we found that background selection had no discernible effect on the distribution of T relative to neutrality, even as we reduced mean r by two orders of magnitude to across the central gene. We determined this by observing the receiver operating characteristic curves comparing neutral scenarios to those in which a central gene experiences strong () background selection for the duration of the simulation (see Materials and Methods). For both the CEU (supplementary fig. S22, top, Supplementary Material online) and YRI (supplementary fig. S22, bottom, Supplementary Material online) populations, across central genes of size 11 kb (supplementary fig. S22, left, Supplementary Material online), 55 kb (supplementary fig. S22, center, Supplementary Material online), and 110 kb (supplementary fig. S22, right, Supplementary Material online), we see that all curves fit tightly along the diagonal, indicating no difference between compared replicate sets. Therefore, we expect that the presence of background selection, for which we do not explicitly account in our model, should not affect inferences with the T statistic.
Among the most common nongenomic confounders that may be encountered in analyses of natural populations is the presence of missing data. That is, sites for which the allelic state is indeterminate. We evaluated the performance of the T statistic for sequences with missing data at polymorphic sites by randomly removing alleles from our existing neutral replicates (see Materials and Methods). To address missingness in our data, we modified our scan using two corrective approaches. First, we removed polymorphic sites with >5% missing data, and second, we incorporated all remaining missing alleles as a new character “N,” thereby conservatively diversifying the remaining set of haplotypes relative to no missing data. Following this approach, we compared the distributions of T with and without missing data. We found that introducing missing data had little effect on the distribution of T statistic values under neutrality (supplementary fig. S23, Supplementary Material online). We expect that as long as sufficient polymorphism remains in the sample, that missing data are unlikely to yield false sweep inferences, and in extreme cases, are likely to resemble the effect of decreased mutation rate due to the depletion of allelic information.
A limitation of haplotype-based approaches is that their power derives from sample size (n). That is, sufficient diversity must be captured in a sample in order to distinguish between the unique signals of neutrality and selection. As sample size decreases, subtle signatures of selection recede into the genomic background and become imperceptible. To determine the minimum sample size to which the T statistic should be confidently applied, we resampled our existing hard sweep replicates (see fig. 2) for reduced sample sizes haplotypes, corresponding to 1/10, 1/4, and 1/2 the original sample size of n = 200. Reducing sample sizes led to a reduction in the minimum detectable range of selection coefficients s, and in the range of sweep ages t over which the T statistic had power (supplementary fig. S24, Supplementary Material online). Although power scarcely changed for larger samples of size relative to n = 200 (as in our application to MLG data), we are unable to reliably detect even strong sweeps older than t = 500 generations for the CEU model when only haplotypes are sampled (supplementary fig. S24, top-right, Supplementary Material online). Analysis of the YRI model was more accommodating to sample size reduction owing to the preexisting greater ease of detecting sweeps for populations with large effective sizes, but power quickly drops when small selection coefficients (s < 0.005) are included (supplementary fig. S24, Supplementary Material online).
A practical consideration when applying sweep statistics is inferring an accurate demographic model. A proper model can be used to generate simulated replicates from which P value cutoffs and false discovery rate (FDR) thresholds may be assigned. We subsequently demonstrate this in our own Application to Empirical Data Sets. To motivate the selection of an appropriate model, we show the effect on the neutral T statistic distribution of using nonideal demographic models. First, we generated T statistic distributions for CEU and YRI under constant-size models. Here, the constant sizes were equal to the effective size of the populations under each model (see Materials and Methods). Because these models included no population size fluctuations, they provided uniformly deflated T statistic distributions relative to smc++ models, with the constant-size CEU model especially underestimating values relative to its more accurate counterpart (supplementary fig. S25, Supplementary Material online). We also examined T statistic distributions resulting from the popular Gravel et al. (2011) model, which is based on the SFS. Relative to the smc++ model, the T statistic distribution for the Gravel et al. (2011) CEU model was comparable, but the YRI model, consisting of only two phases (constant size followed by 2-fold expansion), resulted in much smaller values of T (supplementary fig. S26, Supplementary Material online). Choosing a demographic model that captures both recent and ancient history is therefore important (Beichman et al. 2017), and model choice should be approached with caution.
Application to Empirical Data Sets
We searched for candidate selective sweeps in human and D. melanogaster data sets using the T statistic, choosing these data sets because of their high quality, size, and availability of phased haplotypes. Specifically, the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015) data set contains no missing data, as all allelic states have been imputed. Meanwhile, the DGRP data set (Mackay et al. 2012) provides a classic invertebrate model whose properties deviate considerably in history and genomic architecture from the mammalian model of humans. For each protein- and RNA-coding gene in each study population, we obtained values of T using inferences from a K = 10 truncation and assigned a P value to each of the top 40 candidate genes based on the window of maximum T overlapping that gene (supplementary tables S1–S3, Supplementary Material online). For windows to be associated with a gene, their central SNP must lie between the transcription start and stop sites of the gene. Additionally, we assigned an value to each gene using both K = 10 and K = 20 truncations. Analysis windows for scans of human data were of size 117 SNPs, advancing by 12-SNP increments, whereas windows for D. melanogaster analysis were 400 SNPs in size (as in the analyses of Garud et al. [2015] and Harris, Sackman, et al. [2018]) with a step size of 40 SNPs. To eliminate the effect of background LD on inferences, window sizes were based on the minimum physical interval across which LD decayed beyond one-third of its value between SNPs separated by 1 kb (see Materials and Methods). Following our successful application of T to simulated unphased MLG data, we analyzed the human 1000 Genomes Project data as MLGs by manually merging individuals’ two haplotypes. This was unnecessary for the DGRP data, as individuals are inbred. For haplotype data, we also determined FDR thresholds for each population based on simulated replicates, inferring T statistic cutoffs at a 5% FDR for all study populations (supplementary table S4, Supplementary Material online).
For human data, we examined the CEU and YRI populations (supplementary tables S1 and S2, Supplementary Material online), matching the demographic models used in our simulations. Though few of our top 40 sweep candidates produced a significant P value, all easily exceeded their population’s 5% FDR (supplementary table S4, Supplementary Material online). Among these candidates, hard sweeps predominated within either population, and we found that this depended somewhat on our K truncation. For K = 20, hard sweeps comprised all but two top candidates among the CEU, and 67.5% of top candidates among the YRI, whereas for K = 10, we did not classify any top CEU or YRI candidates as soft from phased haplotypes. Additionally, K = 20 candidate soft sweeps, except for BTNL2 in YRI (), featured only three or fewer sweeping haplotypes. These results indicate that the T statistic is more sensitive to harder sweeps than to softer ones, which is a consequence of the greater distortion in the haplotype frequency spectrum of hard sweeps relative to soft sweeps. This finding matches our simulated results, in which the value of T was proportional to the number of sweeping haplotypes in the population. Moreover, we find that our choice of K truncation impacts our ability to classify sweeps as soft, with a greater distortion of haplotype classes two through m required for a sweep in a K = 10 truncated spectrum to be classified as soft relative to K = 20. Regardless of truncation, the increased presence of candidate soft sweeps in YRI relative to CEU mirrors our observation from simulated data that the T statistic has greater power to detect softer sweeps for populations that have not experienced a bottleneck in their history. Furthermore, these patterns corroborate results from the H12 analysis of this data set (Harris et al. 2018), which found more hard sweeps than soft in the CEU population, and among top candidates generally.
Across both the CEU and YRI populations, we were able to recover most of the top 40 candidates from the haplotype data within the MLG data, indicating the reliability of using MLGs for inference with the T statistic in natural populations when phased data are unavailable. The MLG results primarily deviated from the haplotype results when classifying candidates as hard or soft. Multiple candidates that were inferred to be hard sweeps from the haplotype data were classified as soft from their MLG spectra, particularly for the K = 20 truncation. These candidates include XIRP2 and BCAS3 in the CEU population, as well as ITGAE, SUGCT, NNT, and HLA-DPB2 in the YRI population. We examine the latter candidate more closely in figure 5 and supplementary figure S30, Supplementary Material online. These differing inferences may arise from the slightly different interpretation of between phased and unphased data. For phased data, refers to the number of sweeping haplotypes, whereas for unphased, it measures the number of MLGs involved in the sweep, which may be different for the same genomic window between the different data types if MLG frequencies are at or near their expected proportions under Hardy–Weinberg equilibrium. We also note that multiple top candidates in the MLG data inferred as soft are simply not present among top haplotype candidates, indicated by the absence of a turquoise-colored background in supplementary tables S1 and S2, Supplementary Material online. We consider the application of the T statistic to MLGs further in the Discussion.
Among top sweep candidates in human data were expected results, including a hard sweep () at the cluster of genes on CEU chromosome 2 comprising LCT, MCM6, DARS, and ZRANB3 (minimum P value ), related to a well-documented adaptation to milk-based diets in European populations (Bersaglieri et al. 2004). Additionally, we found two noteworthy top candidates in CEU that have not been explicitly described as sweeps previously, RSPH3 and ZNF211 (both ). RSPH3 encodes a radial spoke protein that is integral in the structure of 9 + 2 motile cilia across diverse cell types, including flagellated cells (Teves et al. 2016), and so we speculate that selection here could be related to ancient sperm competition in humans (Leivers et al. 2014). ZNF211 is among a diverse set of zinc-finger genes whose products are believed to participate in the inactivation of endogenous retroviruses, parasitic mobile DNA whose effects can be deleterious to their hosts (Lukic et al. 2014). We recovered SYT1 ( for K = 20, for K = 10; ), NNT (), HEMGN (), and RGS18 ( for K = 20, for K = 10) in YRI, which have all received attention as potential adaptive targets (Voight et al. 2006; Pickrell et al. 2009; Fagny et al. 2014; Harris et al. 2018). Our significant candidates, SPRED3 and ITGAE have also been previously identified (Granka et al. 2012; Ayub et al. 2013; Grossman et al. 2013), though the effect of selection at these genes has not yet been elucidated. Both populations yielded HLA genes as top sweep candidates, overlapping at HLA-DRB5 (), whereas HLA-DPB1 () was exclusive to CEU and HLA-DPB2 () was exclusive to YRI. This shared signal supports the recent evidence (Albrechtsen et al. 2010; Goeury et al. 2018) that sweeps at HLA loci, including those which we describe here, were important in the development of modern genetic diversity in human immune-related genes.
In figure 5 and supplementary figure S30, Supplementary Material online, we take a closer look at top candidate sweeps uncovered in our scan of the 1000 Genomes data set (1000 Genomes Project Consortium et al. 2015) for both K = 10 and K = 20 truncations, across both haplotypes and MLGs. Each top candidate fell within a well-defined T statistic peak region surrounded by regions of low signal, and T statistic spatial signatures were consistent between all data types. First, we found SYT1 as a near-significant () top sweep candidate in the YRI population, featuring both sweeping haplotypes and MLGs involved in the sweep at the window of maximum signal for K = 20 (fig. 5 and supplementary fig. S30, first row of first column, Supplementary Material online). For the K = 10 truncation, the window of maximum signal contains only a single sweeping haplotype and is located within an adjacent, upstream subpeak. SYT1 is the cell surface receptor through which the type B botulinum neurotoxin of Clostridium botulinum bacteria enters human neurons (Connan et al. 2017), and so a sweep here may be involved in resistance to this infection (Harris et al. 2018). Next, we identified HLA-DPB2 as an outlying hard sweep in YRI () based on haplotypes and K = 10 data, but featuring three elevated MLGs within the window of maximum signal for K = 20 (fig. 5 and supplementary fig. S30, second row, Supplementary Material online). Looking at the K = 20 haplotype frequency spectrum, it is clear that one haplotype predominates, and equivalently, only one MLG predominates, but individuals heterozygous for the first haplotype and either the second or third comprise just under 20% of the population, leading to an inference of . COL5A2 was the most outlying soft sweep candidate we identified in CEU using the K = 20 truncation, harboring inferred sweeping haplotypes, but with a 7-fold disparity between their frequencies, which could occur due to a recombination event during the sweep, or a recurrent selected mutation (Hermisson and Pennings 2017). This gene has received little attention but is located within a significantly overrepresented run of homozygosity associated with schizophrenia (Lencz et al. 2007). Additionally, selection on collagen-related genes has been implicated in cold adaptations in European populations (Yudin et al. 2017). Finally, we propose the spermatogenesis-associated protein SPATA6L as a hard sweep candidate in CEU. Our finding here of an isolated T peak fits with existing evidence of selection at other spermatogenesis proteins (Schrider and Kern 2017), and with the result that European and sub-Saharan African populations are diverged at this locus, with selection in the hunter-gatherer Batwa population inferred here (Bergey et al. 2018).
We contextualize our results for outlying sweep candidates by illustrating the background haplotype frequency spectrum patterns we observed in regions of low T. In supplementary figure S31, Supplementary Material online, we highlight four regions each in the CEU and YRI populations with , 10, 20, or 30. In accordance with the expectation that classic selective sweep patterns are rare in the human genome (Hernandez et al. 2011), we observe that the majority of analysis windows had a small associated T, and accordingly resembled our example windows. We see from these examples that small peaks in the T statistic are common, and associated with haplotype frequency spectra that are distinct from those under selection, containing no high-frequency classes and an abundance of low-frequency classes of similar size. As we increase from to , we see the spectra begin to distort and contain higher frequency classes, but this distortion is far from what we expect under a sweep.
Our scan of the North American DGRP population of D. melanogaster also identified expected sweep candidates among the top genic T statistic peaks. We note that although we were unable to establish statistical significance against a neutral model based on the DGRP demographic history of Duchen et al. (2013) (see Materials and Methods), and only our top candidate, CG11902, exceeded the 5% FDR threshold (supplementary table S4, Supplementary Material online), our top candidates have literature support as potential adaptive targets. Foremost among functionally characterized candidates was Ace, which encodes the acetylcholinesterase enzyme and has long been implicated in the development of resistance to organophosphate and carbamate insecticides within D. melanogaster (Menozzi et al. 2004; Karasov et al. 2010; Garud et al. 2015). Contrary to previous studies alleging a soft sweep at Ace (Karasov et al. 2010; Garud et al. 2015), we found the greatest support for a model of only one sweeping haplotype. We identified another candidate hard sweep of similar magnitude at Uhg1, which also contributes to insecticide resistance, but to the organochlorine dichlorodiphenyltrichloroethane (Pedra et al. 2004). The methyltransferase-encoding gene Pimet emerged as the most prominent candidate soft sweep () in our search using K = 20 ( for K = 10) and is central to the viral RNA degradation pathway that is subject to ongoing coevolution against pathogen incursion and deleterious transposable element activity (Kolaczkowski et al. 2011; Lee and Langley 2012). We finally highlight ana3 as a candidate for adaptation in D. melanogaster. This prospective hard sweep affects a highly conserved gene encoding a centriole protein fundamental to the structural integrity of basal bodies within cells (Stevens et al. 2009). A sweep here may contribute to enhanced success in sperm competition and fits with the expectation that sperm gene evolution is an ongoing and central part of positive selection in D. melanogaster (Nurminsky et al. 1998; Dorus et al. 2008; Wong et al. 2008; Yeh et al. 2012).
Discussion
We have proposed a likelihood-based approach to detect selective sweeps in whole-genome polymorphism data that is applicable to a variety of different demographic scenarios, classifies detected sweeps as hard or soft without relying on additional analyses or statistics, and is the first likelihood-based method to leverage distortions in the haplotype frequency spectrum to make these inferences. Each of these attributes is important because selective sweeps are multifaceted genomic signatures that are not always characterized by the presence of a single high-frequency haplotype (Jones et al. 2013; Wilson et al. 2017), may be ongoing or incomplete at the time of sampling (Vy and Kim 2015; Vy et al. 2017), and may range in strength across multiple orders of magnitude (Messer and Neher 2012; Nam et al. 2017). Thus, our simulation experiments probed a realistically diverse complement of sweep scenarios likely to be relevant in a variety of study systems. Most importantly, the T statistic demonstrated high and consistent power and classification ability across examined parameters, highlighting its suitability to make inferences within variable contexts.
Expectedly, the T statistic achieved its maximum power for recent selective sweeps on fewer haplotypes and lost power proportional to the extent of departure from these ideal conditions (figs. 2 and 3 and supplementary fig. S4, Supplementary Material online). Because it is haplotype based, the T statistic captures distortions in the haplotype frequency spectrum relative to neutral expectations. These distortions require time to establish and decay over time as well. Thus, we found that for human demographic models, the T statistic could reliably identify sweeps that initiated between 500 and 2,000 generations before sampling. For stronger sweeps (), power was consistently elevated across this range, but because weaker sweeps require more time to establish, this range narrows and power peaks for older sweeps as s decreases. Additionally, we uniformly had more power to detect sweeps under the YRI demographic model than the CEU. This is due to the severe bottleneck underlying the history of the CEU, as well as all non-African human populations. Bottlenecks may reduce the diversity of haplotypes within a population, reducing the distinctiveness of sweeps relative to neutrality, whereas population expansions have the opposite effect (Jensen et al. 2005; Campbell and Tishkoff 2008). Nonetheless, the T statistic could generally detect sweep strengths across all but our weakest selection coefficient range for sweeps aged between 1,000 and 2,000 generations under either demographic model, comprising events that in humans cover the period from 25,000 to 58,000 years ago, between the out-of-Africa event and the spread of agriculture (Lukić and Hey 2012; Nakagome et al. 2016; Haber et al. 2019).
Importantly, ours is a powerful single-statistic approach that provides an ideal balance of detection capability and computational efficiency. Compared with popular recent methods, we found that the T statistic is generally more powerful than other single-statistic approaches and is also sensitive to the same range of sweep times as they are (figs. 2 and 3 and supplementary figs. S1 and S2, Supplementary Material online). This highlights the usefulness of T as a substitute for other single-statistic approaches, which may miss sweeps that the T statistic can detect. Although T is likely to be underpowered relative to machine-learning methods, such as Trendsetter, analyses with T have no need to train a classifier, which may be computationally intensive when training a composite of multiple signals, and must be undertaken for each study scenario. However, using the T statistic within a machine-learning framework can greatly enhance its performance. By learning the spatial distribution of T statistic signals within our T-Trendsetter construct, we were able to enhance the performance of our method considerably. Using a large number of small windows, we could extend our range of sweep detection to identify simulated selective events up to 4,000 generations in age, reflecting ancient sweeps far older than the out-of-Africa expansion. In contrast, training a classifier from a smaller amount of standard 117-SNP windows greatly improved performance for recent soft sweeps relative to the unassisted T statistic (supplementary figs. S8 and S9, Supplementary Material online). Our results suggest that optimizing the power of a machine-learning approach relies not only on the choice of input statistics but also on the manner in which those statistics are applied to make inferences.
The choice of simulated human demographic history did not impact our inferences on the number of currently sweeping haplotypes () in a population. Under equivalent sweep scenarios (s, t, and ν), our analyses yielded similar distributions of for both the CEU and YRI models and could accurately identify soft sweeps, provided that at least two sweeping haplotypes remained in the population at the time of sampling (fig. 4 and supplementary figs. S10 and S11, Supplementary Material online). We found that simulated soft sweeps were frequently assigned an inferred that was smaller than the ν with which we initiated the sweep. Furthermore, we observed that T statistic signal peaks were on average associated with valleys in regardless of ν (supplementary figs. S6 and S7, Supplementary Material online). To better understand these results, we devised an in silico barcoding approach to complement our existing simulations. We found that soft sweeps from selection on standing variation frequently lose the majority of their selected haplotypes within generations of selection start time t, meaning that many soft sweeps, especially for smaller s, appear hard at the time of sampling (supplementary figs. S13 and S14, Supplementary Material online). This hardening effect (Wilson et al. 2014), due to genetic drift at the early stage of selection, affected both simulated CEU and YRI populations equally, corroborating our consistently similar observations between the two populations. This means that even if soft sweeps are the dominant mode of adaptation in human history (Schrider and Kern 2017), there may be considerably more that can never be identified as soft.
As an attempt to improve the performance of the T statistic, we sought to examine whether the choice of sweep distortion model, based on the choice of fi (see Definition of Statistic), would affect our inferences. Ultimately, using our YRI simulations as a basis, we found that all of our tested models yielded little difference in the power of T to identify sweeps (supplementary fig. S28, Supplementary Material online). The five models we examined, consisting of (A) , (B) , (C) , (D, our primary model for analyses) , and (E) , differ in the amount of weight allocated to the secondary sweeping haplotypes for relative to when distorting p. In model A, each sweeping haplotype gains the same amount of weight after distortion, ensuring that each is prominent within spectrum . Models B–E represent increasingly uneven weight distributions that favor frequency at the expense of . We believe this is reasonable based on the observation in simulated data that soft sweeps do not affect each sweeping haplotype evenly, and one sweeping haplotype may still be considerably more prominent than the rest (fig. 1; see also, fig. 3 of Garud et al. [2015]). Furthermore, the different T statistic variants demonstrated little difference in sweep classification with (supplementary fig. S29, Supplementary Material online), suggesting that the most important consideration in constructing our sweep models lies in simply distinguishing sweeping from nonsweeping haplotype classes, and not the manner in which they sweep.
An important feature of the T statistic is its ability to detect sweeps from unphased MLG data. Our ability to extend the power of our approach to MLGs is meaningful because it provides the ability to interrogate polymorphism data from nonmodel organisms for which phased haplotypes are unavailable, difficult to obtain, or unreliable (Browning SR and Browning BL 2011; O’Connell et al. 2014; Castel et al. 2016; Laver et al. 2016; Zhang et al. 2017; Harris et al. 2018). Overall, we found no difference in power trends between the two data types, such that scenarios under which we have high power with phased data are scenarios of high power with unphased data (compare figs. 2 and 3 with supplementary fig. S15, Supplementary Material online). However, we find that the T statistic applied to haplotypes always matched or exceeded power for MLG data. This is to be expected because MLGs are a more diverse data type drawn from a smaller sample size. Under a random mating assumption, the presence of a single high-frequency haplotype implies that only one MLG should exist at high frequency, but in the case of two high-frequency haplotypes, both homozygotes, as well as their heterozygote, will be prominent in the population. In this way, a sweep on two haplotypes can appear as a sweep on three MLGs, and sweeps on larger numbers of haplotypes will result in even larger numbers of elevated MLGs, which may be more difficult to separate from neutrality proportional to their . Likewise, one high-frequency haplotype and one medium frequency haplotype can yield two high-frequency MLGs, meaning that an inferred in MLG data could underlie a true hard sweep. Our results indicate that this may be a common occurrence (compare fig. 4 and supplementary fig. S16, Supplementary Material online), and so we recommend scrutinizing results obtained from MLG data more carefully.
Our extensive testing revealed that T is overall robust to the most common confounding scenarios that affect sweep statistics. Making use of the sample average background haplotype frequency spectrum for inference allows T to account for the effects of mutation rate, recombination rate, and sample size on inferences by creating an expectation specific to the study data (supplementary figs. S21 and S24, Supplementary Material online). Using haplotype information provides complete resistance to the effect of background selection on nucleotide diversity (supplementary fig. S22, Supplementary Material online), as background selection does not cause haplotypes to rise to high frequency (Enard et al. 2014). The choice of a SNP-delimited window, meanwhile, is ideal for analyzing data sets with missing sites (supplementary fig. S23, Supplementary Material online) or low polymorphism density because by fixing the number of SNPs included in a window, we avoid generating windows that have low diversity simply because they contain few SNPs. SNP-delimited windows may also be more robust to the effect of population bottlenecks on inferences (Harris et al. 2018). Despite these strengths, we found that T may be misled by certain admixture scenarios (supplementary figs. S17–S20, Supplementary Material online). Although the admixture we simulated was extreme, our results are still informative as to the limits of our model. We found that, as expected, admixture from a donor of small size could inflate the neutral T statistic distribution because small-sized donors have a smaller haplotypic diversity, but we also found that low-level admixture from a large-sized donor also had this effect. We attribute this to the lack of admixture parameter in our model and expect that models directly incorporating admixture could overcome the confounding effects that we have observed.
The results from our empirical analyses with T served as a validation of our method, yielding expected sweep candidate genes in agreement with previous investigations (supplementary tables S1–S3, Supplementary Material online). More specifically, our top candidates matched extensively with those inferred with H12 and G123 (Garud et al. 2015; Harris et al. 2018). Our top candidates in the European-descent CEU population were centered on the LCT locus, associated with adaptation to dairy consumption (Bersaglieri et al. 2004). In the sub-Saharan African YRI population, we saw commonly recurring candidates at SYT1, NNT, LONP2, and HEMGN (Voight et al. 2006; Pickrell et al. 2009; Fagny et al. 2014; Pierron et al. 2014). The largest discrepancy between the H12 and T statistic analyses of the 1000 Genomes Project data we observed was the absence of SLC12A1 from the top 40 candidates in CEU haplotype data. SLC12A1 is a proxy for SLC24A5 (which was filtered out), a solute transporter has been implicated in the shift toward lighter skin pigmentation among Indo-Europeans (Lamason et al. 2005), and still yields an outlying value of T = 163.35, but there are dozens of genes with larger T, whereas only ten genes produced a larger H12 (Harris et al. 2018). Even so, SLC12A1 easily passes our 5% FDR threshold, and even our more stringent 1% threshold (supplementary table S4, Supplementary Material online; all top YRI candidates also pass a 1% threshold). Though our simulation experiments show that T has greater power than H12 for the same data, each method prioritizes sweep signatures differently. H12 is most sensitive to the sum of the two most frequent haplotypes in the frequency spectrum, whereas T places high emphasis on the relative values of all haplotypes in a truncated spectrum. Similarly, SweepFinder2 is unlikely to find any soft sweeps but will find older hard sweeps than what H12 and T could find, particularly for the YRI population (fig. 2).
All of our top candidates for the DGRP scan overlapped with one of the top 10 H12 peaks that Garud et al. (2015) identified, except for Uhg1, CG8552, Skeletor/CG14681 (which were in lower-ranked peaks), and corn. However, none of our top candidates was statistically significant when compared against the neutral distribution we generated under the Duchen et al. (2013) model of North American D. melanogaster population history, and only our top candidate, CG11902 (located within the second-largest signal peak of Garud et al. [2015]), passed the 5% FDR threshold. This result matches that of Harris, Sackman, et al. (2018), who also found that they could not reject neutrality when using a model that incorporates uncertainty in key parameters (supplementary fig. S27, Supplementary Material online). In contrast, the original interpretation of Garud et al. (2015) found the top signal peaks to be significant but used fixed model parameters rather than drawing them from a posterior distribution. Even so, the top D. melanogaster candidates uncovered in recent analyses show literature support for selection, especially in response to pressure from pesticide application (Menozzi et al. 2004; Pedra et al. 2004; Karasov et al. 2010). Furthermore, future analyses with more certain demographic parameter estimates may aid in ultimately rejecting neutrality for these genes.
Finally, empirical analysis allowed us to understand the practical effect of using different K truncations to generate inferences. The most apparent difference between scans with K = 10 and K = 20 truncations was each configuration’s inference on . Using a larger K truncation had the effect of classifying a greater number of candidate sweeps as soft (), and we observed this for both phased and unphased data, and all study populations. Without exception, for K = 10 truncations was less than or identical to for K = 20. We find that when changes, it is for borderline cases, such as that of COL5A2 in CEU. For this candidate, one high-frequency haplotype predominates, but there is clearly a second haplotype at an elevated frequency as well. In the K = 20 spectrum, the contrast in size between the second haplotype frequency and the others is sufficient to assign an inference of , whereas this is not the case for the K = 10 truncation, where haplotype frequencies 3–10 are relatively large enough to the second that a hard sweep serves as a better explanation of the data (fig. 5). The MLG spectrum underlying COL5A2, in contrast, reflects regardless of truncation, and that is because we no longer have a borderline case, and two high-frequency MLGs are evident. Ultimately, T is more sensitive to hard sweeps, and any scan with T is likely to yield a greater number of hard sweeps, especially when choosing a smaller K. This is not to say that soft sweeps are uncommon or rarer than hard sweeps (Hernandez et al. 2011; Jensen 2014; Schrider and Kern 2017), but that we may simply be overlooking these more often due to the nature of our approach.
We believe that our T statistic will serve as an important contribution to the field of selective sweep detection methods, providing the first maximum-likelihood approach that exploits a haplotype and MLG frequency spectrum distortion model. As such, the T statistic offers high power for recent selective sweeps with little computation time and can additionally assign an value to candidates with no additional analysis required. This makes it an appropriate complement to methods such as the singleton density score (Field et al. 2016), which detects sweeps occurring within the past 100 generations of human history (outside the range of detection of T). The T statistic also complements machine-learning methods (Lin et al. 2011; Schrider and Kern 2016; Sheehan and Song 2016; Kern and Schrider 2018; Sugden et al. 2018; Mughal and DeGiorgio 2019; Mughal et al. 2019), which are more powerful in exchange for more computation time (and can also incorporate T into their algorithms). Our lack of dependence on phased data provides the opportunity to search for sweep signatures in any nonmodel organism for which whole-genome polymorphism data exist. We expect that our simple yet powerful statistical model of selective sweeps will yield novel insights into the adaptive histories of diverse populations. Even in well-studied species such as humans, there are yet understudied populations for which future analyses of selection signatures will provide important insights about population history that is missing from the current literature. To motivate this point, we highlight that insights into local adaptation within human populations continue to emerge (Buckley et al. 2017; Hu et al. 2017; Fan et al. 2019), more than a decade after the first investigations began (Bustamante et al. 2005; Ronald and Akey 2005; Sabeti et al. 2006). Finally, we make available the open-source software package LASSI, which implements the T statistic protocol in a single efficient pipeline.
Materials and Methods
General Simulation Protocol
We applied the T statistic to simulated data based on demographic models consistent with recent estimates of human (Terhorst et al. 2017) and D. melanogaster (Duchen et al. 2013) population history. For some experiments evaluating power under human models, we also applied the T statistic to unphased MLG data, which we produced by manually merging each simulated individual’s two haplotypes. We generated these data using the population-genetic simulation software SLiM (Haller and Messer 2017), as well as with the coalescent simulator ms (Hudson 2002). For power experiments based on human models, we exclusively performed simulations forward in time using a Wright–Fisher model implemented in SLiM (Fisher 1930; Wright 1931; Hartl and Clark 2007). These simulations lasted for a total of 200,000 generations, of which the former 100,000 (equivalent to 10N, where is the diploid effective population size of the simulated population) was a burn-in period to achieve equilibrium values of neutral variation, and the latter 100,000 was the period over which population size variation occurred. To speed up run time, human-modeled simulations were scaled by a factor of λ = 20, whereas D. melanogaster simulations, which featured larger population sizes, were scaled by λ = 100. In order to scale, we multiplied mutation rates, recombination rates, and selection coefficients by λ, whereas the size of the simulated population and the duration of the simulation in generations were divided by λ. Thus, simulation duration was reduced by a factor of 400 (λ = 20) or 104 (.
For all simulations other than for the above human-model power experiments, we generated data for each replicate population using ms, either for use as input into SLiM (burn-in), or to simulate neutrality. We used the former approach to generate FDR thresholds under the human and D. melanogaster models (see Results and Selection Protocols below). Here, we used ms to run the majority of each simulation. For the D. melanogaster model, we ran ms up to the earliest point in time that selection could occur, which was the time of admixture between the African and European populations (supplementary fig. S27, Supplementary Material online; see also below). For human models, we simulated the first 10N generations prior to sampling. Simulations then proceeded forward in time with SLiM, and selection was allowed to take place. Meanwhile, we outputted neutral simulations to compute P values (see below) for both human and D. melanogaster models directly from ms. For human simulations, we chose a mutation rate of per site per generation (Narasimhan et al. 2017), and an exponentially distributed recombination rate with mean per site per generation, with maximum value truncated at , as in Schrider and Kern (2017) and Mughal and DeGiorgio (2019). For D. melanogaster, our recombination rate was a uniform per site per generation (equivalent to cM per base), and our mutation rate was per site per generation, specifically chosen to match Garud et al. (2015) and Harris, Sackman, et al. (2018), but smaller than the value inferred by Keightley et al. (2009).
Our simulated human demographic histories consisted of a European-descended CEU model (individuals of northern and western European descent sampled in Utah, USA) and a sub-Saharan African YRI model (Yoruba individuals from Ibadan, Nigeria). Both models were inferred by Terhorst et al. (2017) using smc++. The CEU model features a severe bottleneck reducing population effective size by an order of magnitude ∼2,000 generations before sampling, followed by a population expansion over two orders of magnitude leading to present day. The YRI model contains population size fluctuations, but with no severe bottlenecks, and also includes an expansion similarly to the CEU model (see fig. 5 of Terhorst et al. [2017]). Thus, the simulated CEU population has an ∼2-fold reduction in its level of background genetic diversity relative to the YRI. For every replicate within each simulated human-model selection scenario (see below), we outputted a simulated chromosome in SLiM of length 1 megabase (Mb) and scanned it with a sliding analysis window of size 117 SNPs, advancing by 12 SNPs per iteration. A window of 117 SNPs roughly corresponds to the number of SNPs expected in a physical window of size 40 kb for our sample size of 100 European diploid individuals, or 20 kb for 100 sub-Saharan African diploid individuals (Watterson 1975). We selected this window size because it is over this interval that pairwise LD between SNPs decays by more than one-third on average in the human genome (Jakobsson et al. 2008). This makes it unlikely that elevated values of the T statistic are due to background LD.
We simulated the DGRP (Mackay et al. 2012) D. melanogaster demographic history following the protocol of Harris, Sackman, et al. (2018), adapting the model of Duchen et al. (2013) (supplementary fig. S27, Supplementary Material online). Here, an ancestral African population (effective size N1) experiences a bottleneck at time , contracting to size for 1,000 generations before expanding to size . After the bottleneck, the ancestral European population diverges from the ancestral African population at time τ1 and begins with an effective size N2. The European population grows exponentially to its modern size, . At time τ2, the North American population ancestral to the modern DGRP sample is generated with initial size N3 from the admixture of the European and African populations, modeled as a single event, with a proportion α of North American genomes deriving from African ancestors, and a proportion deriving from European ancestors, where . The North American population grows exponentially to its final size, . We draw each of the aforementioned model parameters according to their posterior probability density (Harris, Sackman, et al. 2018; supplementary table S1, Supplementary Material online), thereby incorporating uncertainty into the demographic model. We only output data from the North American population, and selection only occurs in that population. We used analysis windows of size 400 SNPs and a step size of 40 SNPs for D. melanogaster simulations. This represents the expected number of SNPs in a 10-kb window, over which a pairwise LD decay of greater than one-third occurs for the DGRP data set (Garud et al. 2015).
Across all experiments, we primarily used a haplotype frequency spectrum truncation of K = 10, but for human-model analyses, we also examined . As described in the Definition of Statistic section, we generate an averaged truncated haplotype frequency spectrum from each neutral replicate analysis window and use this as an estimate of the baseline variation in the absence of a selective sweep. Except where otherwise mentioned, we used weight allocation scheme (D) from the Definition of Statistic section, with , favoring a greater increase in the frequency of the first sweeping haplotype class relative to the neutral baseline. In practice, all schemes showed similar power, however (supplementary fig. S28, Supplementary Material online). Additionally, we only optimized models with U = pK and did so over a grid of (with intervals of ), providing us with a range of different U and ε combinations. Fixing U = pK ensured that we would under all circumstances have frequencies smaller than their equivalents in the p spectrum, though in doing so we often underestimate the empirically observed putatively nonsweeping classes. We emphasize, however, that due to the structure of our likelihood equations, the value of the T statistic depends more on the fit of the model’s sweeping classes than its nonsweeping classes and thus did not include an optimization of U because it would be unlikely to provide additional power to T but could potentially require considerably more computational burden.
Selection Protocols
To assess the power of the T statistic to differentiate between neutrality and sweeps, we simulated a variety of human selective sweep scenarios, defined primarily by their combination of selection time t, selection strength s, and simulated number of initially sweeping haplotypes ν. We simulated selection on de novo mutations arising at times generations prior to sampling. We chose this range of t because the T statistic is suited to detecting recent sweeps that have established in a population, and so this range spans sweeps that are too recent to detect (as they are unlikely to have established), too ancient to detect (as their footprints have likely eroded), and optimal to detect. We simulated sweeps on distinct sweeping haplotypes, drawn uniformly at random across the set of haplotypes in the population at time t, and conditioned that at least one copy of the selected allele would remain in the population for the duration of the simulation. We chose selection coefficients uniformly at random following five schemes to illustrate the effect of the selection coefficient on sweep detection. Weak selection protocols covered and . Strong selection comprised and . We drew mixed selection coefficients as a combination of the weak and strong ranges, with drawn uniformly at random specifically from a log scale. We computed the T statistic for each simulated genomic analysis window using its truncated counts spectrum as input for the likelihood functions (eq. 4) and (eq. 5) and retained the largest T for a replicate as its score. We also computed the score in this manner for each existing neutral replicate. We assessed the power of our approach for each parameter set at 1% and 5% FPRs as the proportion of sweep replicates whose scores exceeded the top 1% or 5% of scores under neutrality, respectively.
For experiments to identify FDR thresholds, selection parameters were drawn at random from distributions of all parameters described above. Sweeps under the human model were initiated uniformly at random across generations prior to sampling, and selection coefficients uniformly at random from on a log scale. We raised the lower bound on s relative to power experiments because we found that the T statistic has little ability to identify sweeps on s < 0.005. Likewise, we drew to account for the range over which T has the most power. For the D. melanogaster model, we simulated hard and soft sweeps once again from but chose and . The constraint on selection time here derives from the requirement that sweeps must occur in the North American D. melanogaster population, which we have fixed to arise 7,943 unscaled generations before sampling (this is the sole parameter that we do not draw from the posterior distribution of Duchen et al. [2013]), whereas we expect that the larger size of the D. melanogaster population should allow us to detect sweeps from smaller s due to the larger population-scaled selection coefficient .
Because we were interested in tracking the number of haplotypes carrying the selected allele in a soft sweep over the course of a sweep, we implemented an in silico barcoding protocol in SLiM. This approach allowed us to observe the effect of hardening on soft sweeps due to genetic drift (Wilson et al. 2014), as well as the relationship among selection coefficient, , and method power. We augmented the simulation procedure of previous power experiments by additionally introducing a unique neutral mutation adjacent to the site of selection for each selected haplotype at time t. Thus, no two selected haplotypes could be identical at time t. Then, at the end of each generation, we measured the frequency of both the selected allele and the count of each unique mutation, which served as a proxy for each unique haplotype’s count in the population. We quantified selected allele frequency and selected haplotype count trajectories by taking the mean of 1,000 replicates for each scenario. Supplementary figure S12, Supplementary Material online, summarizes our in silico barcoding protocol for ν = 4 initially selected haplotypes. For haplotype tracking experiments, we focused specifically on human scenarios across our five selection strength schemes and but only studied the selection times for which we had the greatest power. For CEU, these were t = 2,000 (), t = 1,500 ( and ), t = 1,000 (), and t = 500 (). Across the same s ranges for YRI, we respectively used t = 2,500, 2,000, 2,500, 1,500, and 500.
Scans of Simulated Data with Multiple Methods
In addition to the T statistic, we also applied other popular and powerful methods to our simulated data in order to thoroughly contextualize the power of our approach. Here, we reused the data generated for power experiments across all ranges of s for hard (ν = 1) and soft (ν = 4) sweeps. We elected to compare H12 (Garud et al. 2015), nSL (Ferrer-Admetlla et al. 2014), SweepFinder2 (Nielsen et al. 2005; DeGiorgio et al. 2016; Huber et al. 2016), and Trendsetter (Mughal and DeGiorgio 2019), as each presents a unique approach in identifying signatures of selective sweeps. H12 and nSL represent perhaps the most similar alternatives to T, as they are summary statistics that also leverage measures of haplotype frequency to make inferences. H12 is a haplotype homozygosity-based method that detects sweeps based on their reduced haplotypic diversity, whereas nSL identifies sweeps based on their large tracts of sequence identity. Both approaches also have power to detect soft sweeps in addition to hard sweeps (this being the primary purpose of H12). SweepFinder is a likelihood method that detects sweeps from distortions in the SFS and has high power to detect hard sweeps but is underpowered for soft sweeps because it does not use haplotype information. Finally, Trendsetter is a sophisticated machine-learning approach that leverages the spatial autocorrelation of summary statistic signals along the chromosome to identify and classify sweeps and is therefore likely to outperform any single-statistic approach.
We applied these statistics in ways that allowed us to most directly compare their performance with that of T. For H12, we used the same SNP-based window and step sizes as with the T statistic, meaning that we computed H12 using the exact data as for T. Next, we implemented unstandardized nSL with selscan (Szpiech and Hernandez 2014) running default options and obtained a value of the statistic for each SNP within a replicate. For SweepFinder2, we scanned with a step size of 1,000 nucleotides to ensure a dense grid of analysis windows. We generated a helper file from our neutral replicates, which served a similar purpose to our neutral haplotype frequency spectrum p, and only included polymorphic sites in computations (Huber et al. 2016). Finally, we implemented two versions of Trendsetter, each trained on simulated data across three classes—neutrality, hard sweeps, and soft sweeps ( drawn uniformly at random)—and 5,000 replicates per training class, with and as previously. First, we used the standard approach, which studies the spatial signatures of six statistics—mean pairwise sequence difference , squared correlation coefficient of LD r2, the number of distinct haplotypes , and expected haplotype homozygosity H1, H12, and H2/H1 (Garud et al. 2015). Second, we applied Trendsetter to employ the spatial signature of only the T statistic (“T-Trendsetter”). This experiment served to highlight the power gain from incorporating spatial autocorrelation of T signals across genomic sequence tracts. In both applications, we used the default behavior of Trendsetter, drawing inferences across 201 windows of size 11 SNPs spaced apart by 5 SNPs for a total of 1,010 SNPs. For T-Trendsetter, we also trained a classifier on 75 117-SNP windows to better recapitulate our approach with the unaided solitary T statistic. To measure power, we retained the maximum value of each summary statistic (or for Trendsetter, the probability of assigning the replicate as a sweep, which is the sum of the hard and soft sweep class probabilities) as the score.
Modeling Confounding Scenarios
We considered admixture as a confounding scenario because its effect can mimic that of a selective sweep under certain circumstances (Harris et al. 2018). To determine the impact of admixture on the T statistic, we implemented contrived models overlaid onto our existing CEU and YRI models in which a distantly related donor population (diverged generations before sampling) unidirectionally admixes into the sampled CEU or YRI population as a single pulse 200 generations before sampling. We varied the size of the admixture pulse from 0.05 to 0.4, meaning between 5% and 40% of the subsequent generation derive their ancestry from the admixing donor population, at increments of 0.05, and tested admixing population sizes of , 104, and 105. For all scenarios, we generated neutral background haplotype frequency spectra matching the admixture scenario but additionally computed the T statistic using the unadmixed p spectra to demonstrate the effect of not accounting for admixture. For these experiments, all scenarios included no selection in order to highlight the range of T statistics emerging from admixture in the absence of a sweep.
To evaluate background selection as a potential confounding factor that may also produce spurious sweep signals, we performed simulations in which we allowed for deleterious mutations to arise within the simulated chromosome throughout the simulation while maintaining all other parameters identical to neutrality. Our protocol was similar to that of Harris et al. (2018) and covered the human CEU and YRI models. As with our previous simulations, we generated a genomic region of length 500 kb with identical mutation rate and population sizes as previously, evolving once again for a duration of 20N generations ( diploids, the effective size during the burn-in period). At the center of the simulated sequence, we introduced a gene of length either 11 kb (small), 55 kb (medium), or 110 kb (large) consisting of a 5′ untranslated region (UTR, length 200 bases), either 10 kb (small), 50 kb (medium), or 100 kb (large) exons (100 bases each) and 9 kb (small), 49 kb (medium), or 99 kb (large) introns (1 kb each) alternating for 10 kb (small), 54 kb (medium), or 109 kb (large), and a 3′ UTR (800 bases). We based the sizes of genetic elements on human genome-wide mean values (Mignone et al. 2002; Sakharkar et al. 2004). Within the gene, strongly deleterious mutations (; gamma distribution of fitness effects with shape parameter 0.2) arose at rates of 50% within the UTRs, 75% within the exons, and 10% within the introns, whereas all other mutations within the gene and across the rest of the chromosome were selectively neutral. To enhance the effect of background selection under this scenario, we reduced the mean recombination rate from to per site per generation within the central gene.
Within natural genomes, and across different study systems, there can be considerable variability in mutation and recombination rates, which ultimately affects the density of SNPs and number of distinct haplotypes within a genomic analysis window. To understand this effect with respect to the value of the T statistic, we performed neutral simulations under both of our human models in which we reduced the mutation and mean recombination rates by an order of magnitude, both separately and simultaneously. Thus, altered mutation rates were lowered to , and recombination rates were drawn as previously but centered on a mean of . We performed 1,000 simulations of 1-Mb sequences once again, recorded the value of the T statistic to generate a distribution, and measured the proportion of false signals deriving from the above scenarios as a function of the true positive rate.
Similarly, analyses of data sets with missing sites can also reduce the number of SNPs and therefore haplotypes within an analysis window relative to ideal data, and so we performed a similar analysis measuring the distribution of the T statistic under neutrality after removing polymorphic sites with >5% missing data. To generate a data set with missing sites, we followed two types of approaches. First, we selected a number of SNPs, drawn uniformly at random for each existing replicate, and removed data from these SNPs in 5–20 diploid individuals (both haplotypes were treated identically). Second, we performed five iterations of data removal, with SNPs from between one and four individuals removed per iteration. For each approach, we performed three different intensities of data removal. The lowest intensity involved the removal of between 200 and 500 SNPs for the single-iteration approach or between 40 and 100 SNPs per iteration for the five-iteration approach. The middle intensity approach removed or SNPs, and the most intense approach removed or . The single-iteration approach is more likely to force the removal of sites with missing data, since the possibility is greater that enough missing alleles will be present at an affected site. The five-iteration approach results in a scattering of missing alleles but a lower chance of any SNP having >5% missing data. Sites with ≤5% missing data were incorporated into haplotypes, encoded as a third character state “N” (in contrast to the binary 0/1 for nonmissing sites). Doing so allowed us to conservatively lower the value of the T statistic in such cases by introducing greater haplotypic diversity to samples with missing data.
An important component of analysis with the T statistic is the computation of statistical thresholds, such as P value and FDR cutoffs for determining candidate significance. In order to compute these thresholds, it is necessary to perform simulations under an appropriate demographic history and compare the T statistics of sweep candidates with the distribution of T statistics under simulated replicates. If the demographic history is improperly inferred, then P values and FDR cutoffs will also be incorrect, and may result in unwarranted emphasis on nonadaptive regions, or spurious disregard for true selective events. To demonstrate the effect of misspecifying the demographic history of a study population, we show the impact of generating neutral replicates from CEU and YRI demographic histories wherein population size remains constant and equal to the effective size of smc++-derived models, computed as the harmonic mean of the population size through the simulation. We also compared distributions of the T statistic generated under the Gravel et al. (2011) CEU and YRI models, which were inferred solely from SFS information in contrast to the potentially more accurate hybrid approach of smc++, which uses the SFS and whole-genome sequence information (Beichman et al. 2017).
Finally, we also examined the power of the T statistic for variably small sample sizes. Haplotype-based methods are sensitive to the number of sampled individuals because sufficient variation needs to be captured in a sample for the difference between neutral and sweep haplotypic diversity to become apparent (Harris et al. 2018). That is, reductions in diversity following a sweep become more apparent as more individuals are sampled, and more subtle signatures of sweeps can be elucidated. Accordingly, we resampled all of our existing sweep replicates used in the previous power analysis by drawing n = 100, 50, or 20 haplotypes uniformly at random and applying the T statistic as previously. We note that we computed a size-adjusted neutral background haplotype frequency spectrum p for each reduced-size sample as well.
Application of T Statistic to Empirical Data
We applied the T statistic to human empirical data from the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015), as well as to the DGRP inbred D. melanogaster data set (Mackay et al. 2012). The former application served primarily as a validation of our approach, as positive selection in the human genome has been widely explored. The latter application represented a typical insect model system that has also been well studied and diverges in size, genome architecture, and population history from humans. The complete outputs for our scans of these data sets are available at http://degiorgiogroup.fau.edu/LASSI.html. Our protocols for analyzing either data set were identical in approach. For each, we searched for candidate peaks by applying a sliding window to all autosomes in the subject genome, basing window size on the interval over which LD, measured as r2, decayed below one-third of its original value relative to pairs of loci separated by 1 kb. This matched the prior approaches of Garud et al. (2015) and Harris et al. (2018). A candidate peak simply refers to an elevated instance of T within an RNA- or protein-coding region, and we retained the most prominent peak for each gene. For humans, our window size was 117 SNPs, and for D. melanogaster, it was 400 SNPs, both matching our values for simulation experiments.
After performing each scan of the CEU and YRI data sets, we filtered windows overlapping chromosomal regions of low alignability and mappability, removing windows overlapping with chromosomal regions of mean CRG100 score <0.9. For D. melanogaster, we removed strains 49, 85, 101, 109, 136, 153, 237, 309, 317, 325, 338, 352, 377, 386, 426, 563, and 802 from our analysis due to their high number of heterozygous sites and treated remaining heterozygous sites as missing data, as in Garud et al. (2015). We also only used SNPs that had a quality score (reported in the DGRP data) between 1 and 30. We computed T statistic for the K = 10 truncation and assigned values based on both K = 10 and K = 20 to examine the practical effect of truncation on candidate classifications.
We intersected the locations of computed T statistic values with the coordinates for protein- and RNA-coding genes based on hg19 and Dmel 5.13 annotations for humans and D. melanogaster, respectively. We assigned P values based on K = 10 truncations to the 40 genes with the largest associated values of T by generating 106 neutral replicates simulated in ms (Hudson 2002). For humans, we generated neutral replicates under demographic models inferred with smc++ (Terhorst et al. 2017), and for D. melanogaster, neutral replicates were based on the Duchen et al. (2013) model, drawing parameters as previously. For each replicate, we simulated a sequence of length drawn uniformly at random from the set of all gene lengths, appended with the minimum number of nucleotides necessary to allow the application of full analysis windows centered across the entire length of the simulated gene. As an example, for a simulated human gene of length L nucleotides, we appended additional sequence length guaranteeing that 117-SNP windows centered at the first SNP and the last SNP of the simulated gene could be constructed. This allowed us to obtain a T statistic for at least one whole analysis window centered on the simulated gene during each replicate.
The P value for a selection candidate is the proportion of T statistics across all neutral replicates (using the maximum value for a replicate if there was more than one analysis window) that exceeded the maximum T associated with the candidate. All P values were Bonferroni corrected for multiple testing (Neyman and Pearson 1928), where a significant P value was and where G is the number of genes for which we assigned a score in the organism. Accordingly, we have for humans , and , whereas and for D. melanogaster. We ultimately examined fewer than the total number of genes for each study system as a consequence of our filtering and sliding window step size. Filtering had the effect of removing SNPs, which can lead to genes losing representation in the final data set, whereas the choice of window step size may result in genes being skipped over. The total number of protein- and RNA-coding autosomal genes in hg19 is 23,735, of which ∼20% were uncounted, whereas the total number of autosomal genes in Dmel 5.13 is 12,215, meaning that once again ∼20% were omitted. Finally, we computed FDR cutoffs for each population by generating simulated 106 neutral and sweep replicates as described in Selection Protocols, generating a sample of size . The 5% FDR cutoff, which we assigned to all populations, was the T statistic value for which 5% of the replicates exceeding that value were neutral, and 95% were sweeps. There was no value of T which served as a 1% cutoff for the D. melanogaster model (i.e., there was no T value for which only 1% of replicates were neutral), but we did assign a 1% FDR cutoff to human populations (supplementary table S4, Supplementary Material online).
Supplementary Material
Acknowledgments
We thank two anonymous reviewers for their insightful comments that helped improve this manuscript. We also thank Mehreen Mughal for her help with Trendsetter. This work was funded by the National Institutes of Health (grant R35-GM128590), by the National Science Foundation (grants DEB-1753489, DEB-1949268, and BCS-2001063), and by the Alfred P. Sloan Foundation. Computations for this research were performed on the Pennsylvania State University’s Institute for Computational and Data Sciences Advanced CyberInfrastructure (ICDS-ACI).
References
- 1000 Genomes Project ConsortiumAuton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, et al. 2015. A global reference for human genetic variation. Nature 526(7571):68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Albrechtsen A, Moltke I, Nielsen R.. 2010. Natural selection and the distribution of identity-by-descent in the human genome. Genetics 186(1):295–308. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ayub Q, Yngvadottir B, Chen Y, Xue Y, Hu M, Vernes SC, Fisher SE, Tyler-Smith C.. 2013. FOXP2 targets show evidence of positive selection in European populations. Am J Hum Genet. 92(5):696–706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barría A, Christensen KA, Yoshida G, Jedlicki A, Leong JS, Rondeau EB, Lhorente JP, Koop BF, Davidson WS, Yáñez JM.. 2019. Whole genome linkage disequilibrium and effective population size in a coho salmon (Oncorhynchus kisutch) breeding population using a high-density SNP array. Front Genet. 10: 498. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beaumont MA, Nielsen R, Robert C, Hey J, Gaggiotti O, Knowles L, Estoup A, Panchal M, Corander J, Hickerson M, et al. 2010. In defence of model-based inference in phylogeography. Mol Ecol. 19(3):436–446. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beichman AC, Phung TN, Lohmueller KE.. 2017. Comparison of single genome and allele frequency data reveals discordant demographic histories. G3 (Bethesda) 7:3605–3620. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bergey CM, Lopez M, Harrison GF, Patin E, Cohen JA, Quintana-Murci L, Barreiro LB, Perry GH.. 2018. Polygenic adaptation and convergent evolution on growth and cardiac genetic pathways in African and Asian rainforest hunter-gatherers. Proc Natl Acad Sci U S A. 115(48):E11256–E11263. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bersaglieri T, Sabeti PC, Patterson N, Vanderploeg T, Schaffner SF, Drake JA, Rhodes M, Reich DE, Hirschhorn JN.. 2004. Genetic signatures of strong recent positive selection at the lactase gene. Am J Hum Genet. 74(6):1111–1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Browning SR, Browning BL.. 2011. Haplotype phasing: existing methods and new developments. Nat Rev Genet. 12(10):703–714. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Browning SR, Browning BL, Daviglus ML, Durazo-Arvizu RA, Schneiderman N, Kaplan RC, Laurie CC.. 2018. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 14(5):e1007385. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Buckley MT, Racimo F, Allentoft ME, Jensen MK, Jonsson A, Huang H, Hormozdiari F, Sikora M, Marnetto D, Eskin E, et al. 2017. Selection in Europeans on fatty acid desaturases associated with dietary changes. Mol Biol Evol. 34(6):1307–1318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, Glanowski S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez RD, et al. 2005. Natural selection on protein-coding genes in the human genome. Nature 437(7062):1153–1157. [DOI] [PubMed] [Google Scholar]
- Campbell MC, Tishkoff SA.. 2008. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu Rev Genom Hum Genet. 9(1):403–433. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Castel SE, Mohammadi P, Chung WK, Shen Y, Lappalainen T.. 2016. Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nat Commun. 7:12817. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Charlesworth B, Charlesworth D, Morgan MT.. 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141(4):1619–1632. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Charlesworth B, Morgan MT, Charlesworth D.. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134(4):1289–1303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen H, Patterson NJ, Reich DE.. 2010. Population differentiation as a test for selective sweeps. Genome Res. 20(3):393–402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Connan C, Voillequin M, Chavez CV, Mazuet C, Leveque C, Vitry S, Vandewalle A, Popoff MR.. 2017. Botulinum neurotoxin type B uses a distinct entry pathway mediated by CDC42 into intestinal cells versus neuronal cells. Cell Microbiol. 19(8):e12738. [DOI] [PubMed] [Google Scholar]
- Cutter AD, Payseur BA.. 2013. Genomic signatures of selection at linked sites: unifying the disparity among species. Nat Rev Genet. 14(4):262–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DeGiorgio M, Huber CD, Hubisz MJ, Hellmann I, Nielsen R.. 2016. SWEEPFINDER2: increased sensitivity, robustness and flexibility. Bioinformatics 32(12):1895–1897. [DOI] [PubMed] [Google Scholar]
- DeGiorgio M, Lohmueller KE, Nielsen R.. 2014. A model-based approach for identifying signatures of ancient balancing selection in genetic data. PLoS Genet. 10(8):e1004561. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dorus S, Freeman ZN, Parker ER, Heath BD, Karr TL.. 2008. Recent origins of sperm genes in Drosophila. Mol Biol Evol. 25(10):2157–2166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Duchen P, Živković D, Hutter S, Stephan W, Laurent S.. 2013. Demographic inference reveals African and European admixture in the North American Drosophila melanogaster population. Genetics 193(1):291–301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Enard D, Messer PW, Petrov DA.. 2014. Genome-wide signals of positive selection in human evolution. Genome Res. 24(6):885–895. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fagny M, Patin E, Enard D, Barreiro LB, Quintana-Murci L, Laval G.. 2014. Exploring the occurrence of classic selective sweeps in humans using whole-genome sequencing data sets. Mol Biol Evol. 31(7):1850–1868. [DOI] [PubMed] [Google Scholar]
- Fan S, Kelly DE, Beltrame MH, Hansen MEB, Mallick S, Ranciaro A, Hirbo J, Thompson S, Beggs W, Nyambo T, et al. 2019. African evolutionary history inferred from whole genome sequence data of 44 indigenous African populations. Genome Biol. 20(1):82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ferrer-Admetlla A, Liang M, Korneliussen T, Nielsen R.. 2014. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol Biol Evol. 31(5):1275–1291. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Field Y, Boyle EA, Telis N, Gao Z, Gaulton KJ, Golan D, Yengo L, Rocheleau G, Froguel P, McCarthy MI, et al. 2016. Detection of human adaptation during the past 2000 years. Science 354(6313):760–764. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fisher RA. 1930. The genetical theory of natural selection. 1st ed.Oxford: Oxford University Press, Inc. and Clarendon [Google Scholar]
- Garud NR, Messer PW, Buzbas EO, Petrov DA.. 2015. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 11(2):e1005004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gillespie JH. 2004. Population genetics: a concise guide. 2nd ed Baltimore (MD: ): The Johns Hopkins University Press. [Google Scholar]
- Goeury T, Creary LE, Brunet L, Galan M, Pasquier M, Kervaire B, Langaney A, Tiercy J-M, Fernández-Viña MA, Nunes JM, et al. 2018. Deciphering the fine nucleotide diversity of full HLA class I and class II genes in a well-documented population from sub-Saharan Africa. HLA 91(1):36–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Granka JM, Henn BM, Gignoux CR, Kidd JM, Bustamante CD, Feldman MW.. 2012. Limited evidence for classic selective sweeps in African populations. Genetics 192(3):1049–1064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, Yu F, Gibbs RAThe 1000 Genomes ProjectBustamante CD.. 2011. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci U S A. 108(29):11983–11988. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gronau I, Hubisz MJ, Gulko B, Danko CG, Siepel A.. 2011. Bayesian inference of ancient human demography from individual genome sequences. Nat Genet. 43(10):1031–1034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grossman SR, Andersen KG, Shlyakhter I, Tabrizi S, Winnicki S, Yen A, Park DJ, Griesemer D, Karlsson EK, Wong SH, et al. 2013. Identifying recent adaptations in large-scale genomic data. Cell 152(4):703–713. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haber M, Jones AL, Connell BA, Asan Asan, Arciero E, Yang H, Thomas MG, Xue Y, Tyler-Smith C.. 2019. A rare deep-rooting D0 African Y-chromosomal haplogroup and its implications for the expansion of modern humans out of Africa. Genetics 212:1421–1428. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haller BC, Messer PW.. 2017. SLiM 2: flexible, interactive forward genetic simulations. Mol Biol Evol. 34(1):230–240. [DOI] [PubMed] [Google Scholar]
- Harris AM, DeGiorgio M.. 2020. Identifying and classifying shared selective sweeps from multilocus data. Genetics 215:143–171. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Harris AM, Garud NR, DeGiorgio M.. 2018. Detection and classification of hard and soft sweeps from unphased genotypes by multilocus genotype identity. Genetics 210(4):1429–1452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Harris RB, Sackman A, Jensen JD.. 2018. On the unfounded enthusiasm for soft selective sweeps II: examining recent evidence from humans, flies, and viruses. PLoS Genet. 14(12):e1007859. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hartl DL, Clark AG.. 2007. Principles of population genetics. 4th ed.Sunderland (MA: ): Sinauer Associates, Inc. [Google Scholar]
- Hermisson J, Pennings PS.. 2005. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics 169(4):2335–2352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hermisson J, Pennings PS.. 2017. Soft sweeps and beyond: understanding the patterns and probabilities of selection footprints under rapid adaptation. Methods Ecol Evol. 8(6):700–716. [Google Scholar]
- Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G 1000 Genomes ProjectSella G, Przeworski M.. 2011. Classic selective sweeps were rare in recent human evolution. Science 331(6019):920–924. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hu H, Petousi N, Glusman G, Yu Y, Bohlender R, Tashi T, Downie JM, Roach JC, Cole AM, Lorenzo FR, et al. 2017. Evolutionary history of Tibetans inferred from whole-genome sequencing. PLoS Genet. 13(4):e1006675. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huber CD, DeGiorgio M, Hellmann I, Nielsen R.. 2016. Detecting recent selective sweeps while controlling for mutation rate and background selection. Mol Ecol. 25(1):142–156. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hudjashov G, Karafet TM, Lawson DJ, Downey S, Savina O, Sudoyo H, Lansing JS, Hammer MF, Cox MP.. 2017. Complex patterns of admixture across the Indonesian archipelago. Mol Biol Evol. 34(10):2439–2452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hudson RR. 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18(2):337–338. [DOI] [PubMed] [Google Scholar]
- Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung H, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, et al. 2008. Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451(7181):998–1003. [DOI] [PubMed] [Google Scholar]
- Jensen JD. 2014. On the unfounded enthusiasm for soft selective sweeps. Nat Commun. 5:5281. [DOI] [PubMed] [Google Scholar]
- Jensen JD, Kim Y, DuMont VB, Aquadro CF, Bustamante CD.. 2005. Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 170(3):1401–1410. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jensen JD, Thornton KD, Bustamante CD, Aquadro CF.. 2007. On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in nonequilibrium populations. Genetics 176(4):2371–2379. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jones BL, Raga TO, Liebert A, Zmarz P, Bekele E, Danielsen ET, Olsen AK, Bradman N, Troelsen JT, Swallow DM.. 2013. Diversity of lactase persistence alleles in Ethiopia: signature of a soft selective sweep. Am J Hum Genet. 93(3):538–544. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Karasov T, Messer PW, Petrov DA.. 2010. Evidence that adaptation in Drosophila is not limited by mutation at single sites. PLoS Genet. 6(6):e1000924. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Keightley PD, Trivedi U, Thomson M, Oliver F, Kumar S, Blaxter ML.. 2009. Analysis of the genome sequences of three Drosophila melanogaster spontaneous mutation accumulation lines. Genome Res. 19(7):1195–1201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kern AD, Schrider DR.. 2018. diploS/HIC: an updated approach to classifying selective sweeps. G3 (Bethesda) 8:1959–1970. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim Y, Nielsen R.. 2004. Linkage disequilibrium as a signature of selective sweeps. Genetics 167(3):1513–1524. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim Y, Stephan W.. 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160(2):765–777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kolaczkowski B, Hupalo DN, Kern AD.. 2011. Recurrent adaptation in RNA interference genes across the Drosophila phylogeny. Mol Biol Evol. 28(2):1033–1042. [DOI] [PubMed] [Google Scholar]
- Kopatz A, Eiken HG, Schregel J, Aspi J, Kojola I, Hagen SB.. 2017. Genetic substructure and admixture as important factors in linkage disequilibrium based estimation of effective number of breeders in recovering wildlife populations. Ecol Evol. 7(24):10721–10732. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kouprina N, Pavlicek A, Mochida GH, Solomon G, Gersch W, Yoon Y, Collura R, Ruvolo M, Barrett JC, Woods CG, et al. 2004. Accelerated evolution of the ASPM gene controlling brain size begins prior to human brain expansion. PLoS Biol. 2(5):e126. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lamason RL, Mohideen MPK, Mest JR, Wong AC, Norton HL, Aros MC, Jurynec MJ, Mao X, Humphreville VR, Humbert JE, et al. 2005. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310(5755):1782–1786. [DOI] [PubMed] [Google Scholar]
- Laver TW, Caswell RC, Moore KA, Poschmann J, Johnson MB, Owens MM, Ellard S, Paszkiewicz KH, Weedon MN.. 2016. Pitfalls of haplotype phasing from amplicon-based long-read sequencing. Sci Rep. 6:21746. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee YCG, Langley CH.. 2012. Long-term and short-term evolutionary impacts of transposable elements on Drosophila. Genetics 192(4):1411–1432. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leivers S, Rhodes G, Simmons LW.. 2014. Sperm competition in humans: mate guarding behavior negatively correlates with ejaculate quality. PLoS One 9(9):e108099. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lencz T, Lambert C, DeRosse P, Burdick KE, Morgan TV, Kane JM, Kucherlapati R, Malhotra AK.. 2007. Runs of homozygosity reveal highly penetrant recessive loci in schizophrenia. Proc Natl Acad Sci U S A. 104(50):19942–19947. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Librado P, Gamba C, Gaunitz C, Der Sarkissian C, Pruvost M, Albrechtsen A, Fages A, Khan N, Schubert M, Jagannathan V, et al. 2017. Ancient genomic changes associated with domestication of the horse. Science 356(6336):442–445. [DOI] [PubMed] [Google Scholar]
- Lin K, Li H, Schlötterer C, Futschik A.. 2011. Distinguishing positive selection from neutral evolution: boosting the performance of summary statistics. Genetics 187(1):229–244. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lohmueller KE, Bustamante CD, Clark AG.. 2009. Methods for human demographic inference using haplotype patterns from genomewide single-nucleotide polymorphism data. Genetics 182(1):217–231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lukić S, Hey J.. 2012. Demographic inference using spectral methods on SNP data, with an analysis of the human out-of-Africa expansion. Genetics 192(2):619–639. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lukic S, Nicolas J-C, Levine AJ.. 2014. The diversity of zinc-finger genes on human chromosome 19 provides an evolutionary mechanism for defense against inherited endogenous retroviruses. Cell Death Differ. 21(3):381–387. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mackay TFC, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu D, Casillas S, Han Y, Magwire MM, Cridland JM, et al. 2012. The Drosophila melanogaster genetic reference panel. Nature 482(7384):173–178. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Menozzi P, Shi MA, Lougarre A, Tang ZH, Fournier D.. 2004. Mutations of acetylcholinesterase which confer insecticide resistance in Drosophila melanogaster populations. BMC Evol Biol. 4(1):4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Messer PW, Neher RA.. 2012. Estimating the strength of selective sweeps from deep population diversity data. Genetics 191(2):593–605. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mignone F, Gissi C, Liuni S, Pesole G.. 2002. Untranslated regions of mRNAs. Genome Biol. 3(3):reviews0004-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mughal MR, DeGiorgio M.. 2019. Localizing and classifying adaptive targets with trend filtered regression. Mol Biol Evol. 36(2):252–270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mughal MR, Koch H, Huang J, Chiaromonte F, DeGiorgio M.. 2019. Learning the properties of adaptive regions with functional data analysis. bioRxiv. doi: 10.1101/834010. [DOI] [PMC free article] [PubMed]
- Nakagome S, Alkorta-Aranburu G, Amato R, Howie B, Peter BM, Hudson RR, Di Rienzo A.. 2016. Estimating the ages of selection signals from different epochs in human history. Mol Biol Evol. 33(3):657–669. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nam K, Munch K, Mailund T, Nater A, Greminger MP, Krützen M, Marquès-Bonet T, Schierup MH.. 2017. Evidence that the rate of strong selective sweeps increases with population size in the great apes. Proc Natl Acad Sci U S A. 114(7):1613–1618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Narasimhan VM, Rahbari R, Scally A, Wuster A, Mason D, Xue Y, Wright J, Trembath RC, Maher ER, van Heel DA, et al. 2017. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat Commun. 8: 303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neyman J, Pearson ES.. 1928. On the use and interpretation of certain test criteria for purposes of statistical inference: part I. Biometrika 20A(1–2):175–240. [Google Scholar]
- Nicolaisen LE, Desai MM.. 2013. Distortions in genealogies due to purifying selection and recombination. Genetics 195(1):221–230. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamante C.. 2005. Genomic scans for selective sweeps using SNP data. Genome Res. 15(11):1566–1575. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nurminsky DI, Nurminskaya MV, De Aguiar D, Hartl DL.. 1998. Selective sweep of a newly evolved sperm-specific gene in Drosophila. Nature 396(6711):572–575. [DOI] [PubMed] [Google Scholar]
- O’Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, Traglia M, Huang J, Huffman JE, Rudan I, et al. 2014. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet. 10(4):e1004234. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O’Reilly PF, Birney E, Balding DJ.. 2008. Confounding between recombination and selection, and the Ped/Pop method for detecting selection. Genome Res. 18(8):1304–1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pavlidis P, Alachiotis N.. 2017. A survey of methods and tools to detect recent and strong positive selection. J Biol Res (Thessalon). 24:7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pavlidis P, Živković D, Stamatakis A, Alachiotis N.. 2013. SweeD: likelihood-based detection of selective sweeps in thousands of genomes. Mol Biol Evol. 30(9):2224–2234. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pedra JHF, McIntyre LM, Scharf ME, Pittendrigh BR.. 2004. Genome-wide transcription profile of field- and laboratory-selected dichlorodiphenyltrichloroethane (DDT)-resistant Drosophila. Proc Natl Acad Sci U S A. 101(18):7034–7039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pennings PS, Hermisson J.. 2006. a. Soft Sweeps II: molecular population genetics of adaptation from recurrent mutation or migration. Mol Biol Evol. 23(5):1076–1084. [DOI] [PubMed] [Google Scholar]
- Pennings PS, Hermisson J.. 2006. b. Soft sweeps III: the signature of positive selection from recurrent mutation. PLoS Genet. 2(12):e186. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peter BM, Huerta-Sánchez E, Nielsen R.. 2012. Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genet. 8(10):e1003011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pickrell JK, Coop G, Novembre J, Kudaravalli S, Li JZ, Absher D, Srinivasan BS, Barsh GS, Myers RM, Feldman MW, et al. 2009. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19(5):826–837. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pierron D, Razafindrazaka H, Pagani L, Ricaut F, Antao T, Capredon M, Sambo C, Radimilahy C, Rakotoarisoa J, Blench RM, et al. 2014. Genome-wide evidence of Austronesian-Bantu admixture and cultural reversion in a hunter-gatherer group of Madagascar. Proc Natl Acad Sci U S A. 111(3):936–941. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pollinger JP, Bustamante CD, Fledel-Alon A, Schmutz S, Gray MM, Wayne RK.. 2005. Selective sweep mapping of genes with large phenotypic effects. Genome Res. 15(12):1809–1819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pritchard JK, Di Rienzo A.. 2010. Adaptation—not by sweeps alone. Nat Rev Genet. 11(10):665–667. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Przeworski M. 2002. The signature of positive selection at randomly chosen loci. Genetics 160(3):1179–1189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Racimo F. 2016. Testing for ancient selection using cross-population allele frequency differentiation. Genetics 202(2):733–750. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ronald J, Akey JM.. 2005. Genome-wide scans for loci under selection in humans. Hum Genomics. 2(2):113–125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sabeti PC, Reich DE, Higgins JM, Levine HZP, Richter DJ, Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, McDonald GJ, et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419(6909):832–837. [DOI] [PubMed] [Google Scholar]
- Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma A, Mikkelsen TS, Altshuler D, Lander ES.. 2006. Positive natural selection in the human lineage. Science 312(5780):1614–1620. [DOI] [PubMed] [Google Scholar]
- Sakharkar MK, Chow VTK, Kangueane P.. 2004. Distributions of exons and introns in the human genome. In Silico Biol. 4:387–393. [PubMed] [Google Scholar]
- Schrider DR, Kern AD.. 2016. S/HIC: robust identification of soft and hard sweeps using machine learning. PLoS Genet. 12(3):e1005928. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schrider DR, Kern AD.. 2017. Soft sweeps are the dominant mode of adaptation in the human genome. Mol Biol Evol. 34(8):1863–1877. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schweinsberg J, Durrett R.. 2005. Random partitions approximating the coalescence of lineages during a selective sweep. Ann Appl Probab. 15(3):1591–1651. [Google Scholar]
- Seger J, Smith WA, Perry JJ, Hunn J, Kaliszewska ZA, La Sala L, Pozzi L, Rowntree VJ, Adler FR.. 2010. Gene genealogies strongly distorted by weakly interfering mutations in constant environments. Genetics 184(2):529–545. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sheehan S, Song YS.. 2016. Deep learning for population genetic inference. PLoS Comput Biol. 12(3):e1004845. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stevens NR, Dobbelaere J, Wainman A, Gergely F, Raff JW.. 2009. Ana3 is a conserved protein required for the structural integrity of centrioles and basal bodies. J Cell Biol. 187(3):355–363. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sugden LA, Atkinson EG, Fischer AP, Rong S, Henn BM, Ramachandran S.. 2018. Localization of adaptive variants in human genomes using averaged one-dependence estimation. Nat Commun. 9(1):703. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Szpiech ZA, Hernandez RD.. 2014. selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol. 31(10):2824–2827. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Terhorst J, Kamm JA, Song YS.. 2017. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet. 49(2):303–309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Teves ME, Nagarkatti-Gude DR, Zhang Z, Strauss JF III. 2016. Mammalian axoneme central pair complex proteins: broader roles revealed by gene knockout phenotypes. Cytoskeleton 73(1):3–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vatsiou AI, Bazin E, Gaggiotti OE.. 2016. Detection of selective sweeps in structured populations: a comparison of recent methods. Mol Ecol. 25(1):89–103. [DOI] [PubMed] [Google Scholar]
- Voight BF, Kudaravalli S, Wen X, Pritchard JK.. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4(3):e72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vy HMT, Kim Y.. 2015. A composite-likelihood method for detecting incomplete selective sweep from population genomic data. Genetics 200(2):633–649. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vy HMT, Won Y, Kim Y.. 2017. Multiple modes of positive selection shaping the patterns of incomplete selective sweeps over African populations of Drosophila melanogaster. Mol Biol Evol. 34(11):2792–2807. [DOI] [PubMed] [Google Scholar]
- Watterson GA. 1975. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 7(2):256–276. [DOI] [PubMed] [Google Scholar]
- Wiehe T, Nolte V, Zivkovic D, Schlötterer C.. 2007. Identification of selective sweeps using a dynamically adjusted number of linked microsatellites. Genetics 175(1):207–218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wilson BA, Pennings PS, Petrov DA.. 2017. Soft selective sweeps in evolutionary rescue. Genetics 205(4):1573–1586. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wilson BA, Petrov DA, Messer PW.. 2014. Soft selective sweeps in complex demographic scenarios. Genetics 198(2):669–684. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wong A, Turchin MC, Wolfner MF, Aquadro CF.. 2008. Evidence for positive selection on Drosophila melanogaster seminal fluid protease homologs. Mol Biol Evol. 25(3):497–506. [DOI] [PubMed] [Google Scholar]
- Wright S. 1931. Evolution in Mendelian populations. Genetics 16:97–159. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yeh S, Do T, Chan C, Cordova A, Carranza F, Yamamoto EA, Abbassi M, Gandasetiawan KA, Librado P, Damia E, et al. 2012. Functional evidence that a recently evolved Drosophila sperm-specific gene boosts sperm competition. Proc Natl Acad Sci U S A. 109(6):2043–2048. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yudin NS, Larkin DM, Ignatieva EV.. 2017. A compendium and functional characterization of mammalian genes involved in adaptation to Arctic or Antarctic environments. BMC Genet. 18(S1):111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang F, Christiansen L, Thomas J, Pokholok D, Jackson R, Morrell N, Zhao Y, Wiley M, Welch E, Jaeger E, et al. 2017. Haplotype phasing of whole human genomes using bead-based barcode partitioning in a single tube. Nat Biotechnol. 35:852–857. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.