Abstract
Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory—for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.
Author Summary
Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying genomic regions under positive selection. However, methods that detect positive selective sweeps do not typically identify the favored allele, or even the haplotypes carrying the favored allele. The main contribution of this paper is the development and analysis of a new statistic (the HAF score), assigned to individual haplotypes. Using both theoretical analyses and simulations, we describe how the HAF scores differ for carriers and non-carriers of the favored allele, and how they change dynamically during a selective sweep. We also develop an algorithm, PreCIOSS, for separating carriers and non-carriers. Our tool has broad applicability as carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory—for example, in contexts involving drug-resistant pathogen strains or cancer subclones.
Introduction
With genome sequencing, we now have an opportunity to more completely sample genetic diversity in human populations, and probe deeper for signatures of adaptive evolution [1–3]. Genetic data from diverse human populations in recent years have revealed a multitude of genomic regions believed to be evolving under recent positive selection [4–16].
Methods for detecting selective sweeps from DNA sequences have examined a variety of signatures, including patterns represented in variant allele frequencies as well as in haplotype structure. Initially, the problem of detecting selective sweeps was approached primarily by considering variant allele frequencies, exploiting the shift in frequency at ‘hitchhiking’ sites linked to a favored allele relative to non-hitchhiking sites [17, 18]. The site frequency spectrum (SFS) within and across populations is often used as a basis for such inference [4, 6, 19–25]. More recently, methods based on haplotype structure have been developed using a variety of approaches, including the frequency of the most common haplotype [26], the number and diversity of distinct haplotypes [27], the haplotype frequency spectrum [28], and the popular approach of long-range haplotype homozygosity [29–32].
In general, haplotype-based methods seek to characterize the population with summary statistics that capture the frequency and length of different haplotypes. However, the haplotypes are related through a genealogy, and relationships among them are inherently lost in such analyses. In addition, data on the site frequency spectrum can be lost or hidden in analyses focused on haplotype spectra. In this paper, we connect related measures of haplotype frequencies and the site frequency spectrum by merging information describing haplotype relationships with variant allele frequencies. Our main contribution is a statistic that we term the haplotype allele frequency (HAF) score, which captures many of the properties shared by haplotypes carrying a favored allele.
Consider a sample of haplotypes in a genomic region. We assume that all sites are biallelic, and at each site, we denote ancestral alleles by 0 and derived alleles by 1. We also assume that all sites are polymorphic in the sample. The HAF vector of a haplotype h, denoted c, is obtained by taking the binary haplotype vector and replacing non-zero entries (derived alleles carried by the haplotype) with their respective frequencies in the sample (Fig 1A). For parameter ℓ, we define the ℓ-HAF score of c as:
$$\ell\text{-HAF}(c) = \sum_{j} c_j^{\ell} \tag{1}$$
where the sum proceeds over all segregating sites j in the genomic region. The 1-HAF score of a haplotype amounts to the sum of frequencies of all derived alleles carried by the haplotype. The ℓ-HAF score is equivalent to the ℓ-norm of c raised to the ℓth power, or ‖c‖_ℓ^ℓ. We will show that during a selective sweep, the HAF score of a haplotype serves as a proxy for its relative fitness.
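The definition above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `haf_scores` and the toy matrix are hypothetical, and "frequency" is taken to mean the count of sampled haplotypes carrying the derived allele, in line with the later convention w ∈ {1, …, n − 1}.

```python
def haf_scores(haplotypes, ell=1):
    """Return the ell-HAF score of every haplotype (row) in a 0/1 matrix.

    Assumes biallelic sites coded 0 (ancestral) / 1 (derived), as in the text.
    """
    n_sites = len(haplotypes[0])
    # Derived-allele count at each site (the column sums of the matrix).
    counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    # Each derived allele a haplotype carries contributes its count raised to ell.
    return [
        sum(counts[j] ** ell for j in range(n_sites) if h[j] == 1)
        for h in haplotypes
    ]


# Hypothetical toy sample: 4 haplotypes, 3 segregating sites.
sample = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]
print(haf_scores(sample))         # 1-HAF scores
print(haf_scores(sample, ell=2))  # 2-HAF scores
```

In this toy sample the derived-allele counts are 3, 1 and 2, so the first haplotype has a 1-HAF score of 3 + 1 = 4.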
Selective sweeps
The classical model for selection, and the one that has received most attention, is the “hard sweep” model, in which a single mutation conveys higher fitness immediately upon occurrence, and rapidly rises in frequency, eventually reaching fixation [17, 33]. Under this model, we can partition the haplotypes into carriers of the favored allele, and non-carriers. In the absence of recombination, the favored haplotypes form a single clade in the genealogy. As a sweep progresses, HAF scores in the favored clade will rise due to the increasing frequencies of alleles hitchhiking along with the favored allele. HAF scores of non-carrier haplotypes will decrease, as many of the derived alleles they carry become rare (Fig 1B). After fixation of the favored and hitchhiking alleles, HAF scores will decline sharply (Fig 1C), as the selected site and other linked sites are no longer polymorphic. Thus, this reduction in the HAF score results from the sudden loss of many high-frequency derived alleles from the pool of segregating sites [18, 20, 24]. Finally, as the site-frequency spectrum recovers to its neutral state due to new mutations and drift [23], so will the HAF scores.
Recombination is a source of ‘noise’ for the properties of the HAF score, predicted under the assumption of a hard sweep and no recombination, as it allows haplotypes to cross into and out of the favored clade. Recombination can lead to (i) haplotypes that carry the favored allele but little of the hitchhiking variation, thus having relatively low HAF scores despite their high fitness, or (ii) haplotypes that do not carry the favored allele but do carry much of the hitchhiking variation, thus having relatively high HAF scores despite their low fitness. By the same logic, recombination adds ‘noise’ after fixation by making the otherwise sharp decline in HAF scores more subtle and gradual. This more gradual decline occurs due to recombination weakening the linkage between the favored allele and hitchhiking variants.
Recently, “soft sweeps” have generated significant interest [34–36]. A soft sweep occurs when multiple sets of hitchhiking alleles in a given region increase in frequency, rather than a single favored haplotype. Soft sweeps may take place by one or more of the following mechanisms: (i) selection from standing variation: a neutral segregating mutation, which exists on several haplotypic backgrounds, becomes favored due to a change in the environment; (ii) recurrent mutation: the favored mutation arises several times on different haplotypic backgrounds; or, (iii) multiple adaptations: multiple favored mutations occur on multiple haplotypic backgrounds. Several methods have been developed for detecting soft sweeps [37, 38], as well as for distinguishing between soft and hard sweeps [39–41]. In soft sweeps, multiple sets of hitchhiking alleles rise to intermediate frequencies as the favored allele fixes. This could cause the pre-fixation peak and post-fixation trough in HAF scores to be less pronounced and to occur more gradually compared to a hard sweep.
We find (see Results) that the properties of the HAF score remain robust to many soft sweep scenarios. Moreover, the HAF score could potentially be used to detect soft sweeps. However, in this paper, we focus on the foundations, developing theoretical analysis and empirical work that predicts the dynamics of the HAF score. We also develop a single application. Recall that most existing methods for characterizing selective sweeps focus on identifying regions under selection. Here, given a region already identified to be undergoing a selective sweep, we ask if we can accurately predict which haplotypes carry the favored allele, without knowledge of the favored site. Successfully doing so may provide insight into the future evolutionary trajectory of a population. Haplotypes in future generations are more likely to be descended from, and therefore to resemble, extant carriers of a favored allele. This predictive perspective is of particular importance when a sweep is undesirable and measures may be taken to prevent it. For instance, consider a set of tumor haplotypes isolated from single cells, some of which are drug-resistant and therefore favored under drug exposure. Given a genetic sample of the tumor haplotypes, the HAF statistic may be applied to identify the resistant haplotypes—carriers of a favored allele—before they clonally expand and metastasize.
Below, we start with a theoretical explanation of the behavior of the HAF score under different evolutionary scenarios, validating our results using simulation. We then develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to detect carriers of selective sweeps based on the HAF score. We demonstrate the power of PreCIOSS on simulations of both hard and soft sweeps, as well as on real genetic data from well-known sweeps in human populations. While our theoretical derivations make use of coalescent theory, and explicitly use tree-like genealogies, we note that HAF scores can be computed for any haplotype matrix including those with recombination events. Our results on simulated and real data imply that the utility of the HAF score extends to cases with recombination as well as other evolutionary scenarios.
Results
Theoretical and empirical modeling of HAF scores
We consider a sample of n haploid individuals chosen at random from a larger haploid population of size N. Let μ denote the mutation rate per generation per nucleotide, and let θ = 2NμL denote the population-scaled mutation rate in a region of length L bp. We consider both constant-sized and exponentially growing populations. For exponentially growing populations, let N_0 denote the final population size, let r denote the growth rate per generation, and let α = 2N_0 r denote the population-scaled growth rate. Let ρ denote the population-scaled recombination rate. In our theoretical calculations, we assume no recombination (ρ = 0). We use simulations to demonstrate the concordance of theoretical and empirical values of the ℓ-HAF score, and show that the values are robust to the presence of recombination (see ‘Simulations’ in Methods for parameter choices). Although some of our theoretical calculations below derive expressions for the general ℓ-HAF score, we primarily use 1-HAF in the applied sections. Applications of ℓ-HAF with ℓ > 1 will be explored in future work.
Expected ℓ-HAF score under neutrality, constant population size
First, we assume that the genomic region of interest is evolving neutrally, that the population size remains constant at N, and that the ancestral states are known or can be derived. In a sample of size n, let c^(v) denote the HAF vector c for the v-th haplotype (v ∈ {1, …, n}). Let ξ_w be the number of sites with derived allele frequency w. We only consider polymorphic sites in the sample, so the frequency is in the range w ∈ {1, …, n − 1}; a mutation present in all or none of the haplotypes in the sample would not be detectable. Each of the ξ_w sites of frequency w contributes w^ℓ to the ℓ-HAF score of each of the w haplotypes with the mutation, and contributes 0^ℓ = 0 for each of the other n − w haplotypes. The mean of the ℓ-HAF scores of all n haplotypes in the sample is
$$\frac{1}{n}\sum_{v=1}^{n} \ell\text{-HAF}\big(c^{(v)}\big) = \frac{1}{n}\sum_{w=1}^{n-1} \xi_w\, w^{\ell+1} \tag{2}$$
Under the coalescent model, Eq. (22) of [42] shows that 𝔼[ξ_w] = θ/w for all 1 ≤ w ≤ n − 1. By averaging over all haplotypes in all genealogies, the expected ℓ-HAF score is computed as
$$\mathbb{E}[\ell\text{-HAF}] = \frac{1}{n}\sum_{w=1}^{n-1} \frac{\theta}{w}\, w^{\ell+1} = \frac{\theta}{n}\sum_{w=1}^{n-1} w^{\ell} \tag{3}$$
The first two cases (ℓ = 1, 2) yield
$$\mathbb{E}[1\text{-HAF}] = \frac{\theta\,(n-1)}{2}, \qquad \mathbb{E}[2\text{-HAF}] = \frac{\theta\,(n-1)(2n-1)}{6} \tag{4}$$
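As a quick numerical check of the neutral expectation, the sketch below evaluates (θ/n) Σ w^ℓ for w = 1, …, n − 1. That sum follows from 𝔼[ξ_w] = θ/w, with each site of frequency w contributing w^ℓ to w haplotypes. The code then compares the sum against the closed forms θ(n − 1)/2 and θ(n − 1)(2n − 1)/6 obtained from the standard power sums. The function name `expected_haf` is hypothetical, and exact rational arithmetic is used only to avoid rounding in the comparison.

```python
from fractions import Fraction


def expected_haf(theta, n, ell=1):
    """Neutral, constant-size expectation: (theta / n) * sum_{w=1}^{n-1} w**ell."""
    return Fraction(theta, n) * sum(w ** ell for w in range(1, n))


theta, n = 48, 200  # parameter values used in the paper's validation example
e1 = expected_haf(theta, n, ell=1)
e2 = expected_haf(theta, n, ell=2)

# Closed forms for ell = 1 and ell = 2.
assert e1 == Fraction(theta * (n - 1), 2)
assert e2 == Fraction(theta * (n - 1) * (2 * n - 1), 6)

print(float(e1))  # 4776.0
```

For θ = 48 and n = 200 this gives 4776, the expected 1-HAF value quoted in the empirical validation.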
Expected ℓ-HAF score, variable population size
Our derivation of expected ℓ-HAF scores for constant, neutrally evolving populations does not immediately extend to other demographic scenarios. We describe a second approach that separates coalescence times from the genealogy, and we apply it to compute the expected ℓ-HAF in an exponentially growing population.
For a sample of size n, partition the time spanning from the present back to the sample MRCA into n − 1 epochs. Let epoch k ∈ {2, …, n} be the span of time during which the genealogy contains exactly k lineages (Fig 1). Note that mutations on a given lineage in a given epoch share the same frequency, as they appear in exactly the same leaves. For example, mutations 2 and 3 in Fig 1A occur on the same lineage in epoch 2, and they share the frequency 3. Consider the path leading from a randomly chosen haplotype back to the sample MRCA. We can write the ℓ-HAF score of the haplotype as
$$\ell\text{-HAF}(c) = \sum_{k=2}^{n} m_k\, w_k^{\ell} \tag{5}$$
where m_k is the number of mutations that occurred on the path in the k-th epoch, and w_k is the frequency or weight of those mutations. For a given genealogy with haplotypes v ∈ {1, …, n}, let c^(v) denote c (the HAF vector) for the v-th haplotype. Similarly, let m_k^(v) and w_k^(v) denote the number of mutations and their frequency in the k-th epoch for the v-th haplotype. Epoch k splits the haplotypes into k equivalence classes, which we call k-clades. Let m_{k,i} and w_{k,i} denote the corresponding values on the i-th lineage of the k-th epoch. We compute the expected value by summing over all haplotypes and genealogies and dividing by n. For a given genealogy, the i-th lineage of epoch k lies on the paths of exactly w_{k,i} haplotypes, so the sum over haplotypes is
$$\sum_{v=1}^{n} \ell\text{-HAF}\big(c^{(v)}\big) = \sum_{v=1}^{n}\sum_{k=2}^{n} m_k^{(v)} \big(w_k^{(v)}\big)^{\ell} = \sum_{k=2}^{n}\sum_{i=1}^{k} m_{k,i}\, w_{k,i}^{\ell+1} \tag{6}$$
Let M_{k,i} and W_{k,i} be random variables denoting the number of mutations, and their frequency respectively, on the i-th lineage of the k-th epoch. As the genealogy of a neutrally evolving sample is independent of branch lengths [43], M_{k,i} and W_{k,i} are independent random variables. Thus, we can compute the expected ℓ-HAF score of a randomly chosen haplotype as
$$\mathbb{E}[\ell\text{-HAF}] = \frac{1}{n}\sum_{k=2}^{n}\sum_{i=1}^{k} \mathbb{E}[M_{k,i}]\;\mathbb{E}\big[(W_{k,i})^{\ell+1}\big] \tag{7}$$
To compute the required moments of W_{k,i}, we start with a related quantity. For positive integer ℓ, denote the rising factorial
$$w^{(\ell)} = w\,(w+1)\cdots(w+\ell-1) \tag{8}$$
and set w^(0) = 1. We show in S1 Text that
$$\mathbb{E}\big[(W_{k,i})^{(\ell)}\big] = \ell!\,\frac{n^{(\ell)}}{k^{(\ell)}} \tag{9}$$
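We have w^(1) = w and w^(2) = w(w+1) = w^2 + w, so w^2 = w^(2) − w^(1), which leads to:
$$\mathbb{E}\big[(W_{k,i})^{2}\big] = \mathbb{E}\big[(W_{k,i})^{(2)}\big] - \mathbb{E}\big[(W_{k,i})^{(1)}\big] = \frac{2n(n+1)}{k(k+1)} - \frac{n}{k} \tag{10}$$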
In S1 Text, we generalize this equation to compute 𝔼[(W_{k,i})^ℓ]. In addition, we show that for a constant-sized population, the general form in Eq (7) produces the same result as Eq (3).
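The agreement between the epoch-based calculation and the site-frequency calculation can be checked numerically for small samples. The sketch below is illustrative and rests on two assumptions that are not spelled out in this section. First, the number of sampled descendants W of one of k lineages follows P(W = w) = C(n−w−1, k−2)/C(n−1, k−1) for w = 1, …, n−k+1, a standard coalescent result. Second, under constant population size, 𝔼[M_{k,i}] = θ/(k(k−1)). All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def weight_pmf(n, k):
    """Assumed distribution of W: P(W = w) = C(n-w-1, k-2) / C(n-1, k-1)."""
    denom = comb(n - 1, k - 1)
    return {w: Fraction(comb(n - w - 1, k - 2), denom)
            for w in range(1, n - k + 2)}


def rising(x, ell):
    """Rising factorial x (x+1) ... (x+ell-1), with rising(x, 0) = 1."""
    out = 1
    for j in range(ell):
        out *= x + j
    return out


def expected_haf_by_epoch(theta, n, ell=1):
    """Sum over epochs k and their k lineages of E[M] * E[W**(ell+1)], over n."""
    total = Fraction(0)
    for k in range(2, n + 1):
        e_m = Fraction(theta, k * (k - 1))  # assumed constant-size E[M_{k,i}]
        e_w = sum(p * w ** (ell + 1) for w, p in weight_pmf(n, k).items())
        total += k * e_m * e_w              # k exchangeable lineages per epoch
    return total / n


def expected_haf_sfs(theta, n, ell=1):
    """Site-frequency form: (theta / n) * sum_{w=1}^{n-1} w**ell."""
    return Fraction(theta, n) * sum(w ** ell for w in range(1, n))


theta, n = 48, 20
for ell in (1, 2, 3):
    # The two routes to the expectation agree exactly.
    assert expected_haf_by_epoch(theta, n, ell) == expected_haf_sfs(theta, n, ell)

for k in range(2, n + 1):
    for ell in (1, 2, 3):
        # Rising-factorial moments match ell! * n^(ell) / k^(ell).
        moment = sum(p * rising(w, ell) for w, p in weight_pmf(n, k).items())
        assert moment == factorial(ell) * Fraction(rising(n, ell), rising(k, ell))

print(expected_haf_by_epoch(theta, n))
```

Exact fractions are used so that the equalities hold without floating-point tolerance. The check says nothing about non-constant demography, where only 𝔼[M_{k,i}] changes.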
Exponential population growth
Eq (7) can potentially be used to obtain 𝔼[ℓ-HAF] under arbitrarily complex demographics. Consider a population of current size N_0 that has been growing exponentially at a rate r. The population size at time t in the past is given by N(t) = N_0 e^(−rt). Exponential population growth is of particular interest, as it has been used to analyze the state of a population under a selective sweep shortly after fixation. This is a low point (or trough) of observed ℓ-HAF scores, as early hitchhiking sites have fixed by this time point, and the (relatively recent) sample MRCA is a carrier of the favored mutation. Immediately after fixation, the population (all of which are carriers of the favored allele) has been growing for the duration of the sweep at a rate that is approximately exponential (with growth rate related to the selection coefficient s). In addition, all extant and ancestral haplotypes since the sample MRCA are carriers and therefore equally favored, implying that the independence between W_{k,i} and M_{k,i} is preserved. While the branch lengths and the distribution of M_{k,i} values change under exponential growth, the distribution of W_{k,i} remains unchanged as described in Eq (10). This key insight allows us to use Eq (7) to estimate the expected HAF scores under exponential population growth.
In order to use 𝔼[M_{k,i}] = μ𝔼[T_k] under exponential growth, we implement two numerical methods to compute 𝔼[T_k]: a ‘cumulative time’ method that uses an approximate distribution of T_k given in [44, p. 559], and a ‘conditional expectation’ method (see S1 Text for details). In the conditional expectation method, we compute the expected value of T_k conditioned on T_{k+1}, …, T_n, as follows (in the order k = n, n − 1, …, 2):
(11) |
where α = 2N_0 r is the scaled growth rate, τ_k = t_{k+1} + ⋯ + t_n (with τ_n = 0), and E_1(x) = ∫_x^∞ (e^(−u)/u) du is the exponential integral.
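To see where an exponential integral enters, the following sketch derives the conditional expectation in unscaled generations. It assumes a haploid population in which k lineages coalesce at rate C(k,2)/N(t) per generation, with N(t) = N_0 e^(−rt) and τ_k measured in generations. The shorthand a_k and the choice of time unit are assumptions made here for illustration; the paper's expression is stated in scaled time using α and may differ by constant factors.

```latex
\begin{align*}
\Pr(T_k > t \mid \tau_k)
  &= \exp\!\left(-\binom{k}{2}\int_{\tau_k}^{\tau_k+t}\frac{e^{ru}}{N_0}\,du\right)
   = \exp\!\bigl(-a_k\,(e^{rt}-1)\bigr),
  \qquad a_k = \binom{k}{2}\frac{e^{r\tau_k}}{N_0\,r},\\[4pt]
\mathbb{E}[T_k \mid \tau_k]
  &= \int_0^{\infty}\Pr(T_k > t \mid \tau_k)\,dt
   = \frac{e^{a_k}}{r}\int_{a_k}^{\infty}\frac{e^{-u}}{u}\,du
   = \frac{e^{a_k}}{r}\,E_1(a_k).
\end{align*}
```

The second line uses the substitution u = a_k e^(rt). Because τ_k depends on the later epochs, the expectations are naturally evaluated in the order k = n, n − 1, …, 2.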
We then use 𝔼[(W_{k,i})^ℓ] (evaluated in Eq. (S21) in S1 Text) to evaluate Eq (7), yielding 𝔼[ℓ-HAF] for exponential population growth as
(12) |
where S(ℓ, q) denotes the Stirling number of the second kind [45, Ch. 6.1]. We describe these procedures fully in S1 Text.
Empirical validation of expected HAF score computation
We tested our theoretical calculations against empirical observations of HAF scores using simulations for neutral evolution with constant population size N = 20000 (see ‘Simulations’ in Methods). For example, for θ = 48 and n = 200, the expected 1-HAF is exactly 4776.0 (Eqs (3) and (7)), whereas the empirically observed mean 1-HAF score of 20000 simulated samples is 4786 ± 3956 (sample mean ± sample standard deviation). Interestingly, the estimates improve when simulating with recombination, with an observed mean of 4780 ± 1684 (S1 Fig).
We also modeled exponential growth in population size using the scaled growth rate α, using the conditional expectation method in Eq (12) (see S1 Text). As expected, the HAF score is much lower than for constant population size. For α = 80, θ = 48, n = 200, the theoretical mean 1-HAF score is 126.9, whereas the empirical mean of 20000 simulations is 128 ± 131.1 for ρ = 0, and 127.6 ± 127.4 for ρ = 25 (S2 Fig).
We compared the simulations with theoretical expected ℓ-HAF scores for multiple values of ℓ ∈ {1, 2, 3, 4} and different choices of the population-genetic parameters: scaled mutation rate θ ∈ {24, 48}, scaled growth rate α ∈ {0, 30, 60, 80}, and scaled recombination rate ρ ∈ {0, 25, 50} (see ‘Simulations’ in Methods). While theoretical expected values were computed assuming ρ = 0, S3 Fig shows the concordance between theoretical and empirical means for each choice of parameters. The concordance improves slightly for increasing values of n and ℓ. In each case, the values are robust to the choice of ρ, and the variance even decreases slightly for higher ρ (S3D Fig).
As ℓ increases, the distribution of the normalized HAF score, (ℓ-HAF)^(1/ℓ) (S4 Fig), becomes more left-skewed and has generally smaller values (upper bound of range approaching n − 1), with reduced variance. Increasing ℓ increases the relative weight of ancient mutations. As an extreme example, the normalized ∞-HAF score is simply the weight of the highest frequency mutation on the haplotype, and is not very informative. However, very recent mutations, including those that appear post-selection among the carriers of the favored allele, add ‘noise’ to the HAF score, and an appropriate choice of ℓ > 1 may perform better for some applications. We will explore this in future work.
HAF score dynamics in selective sweeps
We now consider the dynamics of HAF scores in a population undergoing a selective sweep. To do this, we use data simulated under several scenarios. Fig 2 illustrates the HAF score dynamics in a single simulated population undergoing a hard sweep, with selection coefficient s = 0.05. See ‘Simulations’ in Methods for a detailed description of the simulation parameters. Initially (leftmost, time 0) the HAF scores of carriers and non-carriers of the favored allele are similar. As the sweep progresses (times 100–450), carrier HAF scores increase to a peak value (HAF-peak). Soon after fixation (time ∼450), we observe a sharp decline in HAF scores (HAF-trough), followed by slow and steady recovery due to new mutation and drift (times 500–50000). We observe similar behavior for the HAF score dynamics in an exponentially growing population, and soft sweep scenarios (S5 and S6 Figs). Though soft sweeps can arise under different circumstances, we restrict our attention to soft sweeps arising from standing variation. While the behavior is similar, we note that during a soft sweep, the HAF scores do not have as sharp a decline as in the hard sweep scenarios.
Below, we provide a theoretical description of these dynamics, as well as empirical validation using simulations. This allows us to predict HAF scores in (a) the post-fixation trough; (b) the pre-fixation peak; and (c) the rate of growth of HAF scores from pre-sweep to peak value.
Empirical validation of the post-fixation HAF-trough
We showed using simulations that the HAF score computations for an exponentially growing population (Eq (12)) also approximate a population evolving under a selective sweep shortly after fixation. This enables prediction of the HAF-trough value.
The HAF-trough of a sweep is the value of 1-HAF at fixation. We took the mean of the HAF-trough values over 200 populations simulated under selective sweeps with coefficients s ∈ [0.005, 0.040] (see ‘Simulations’ in Methods), and compared it to 1-HAF values in simulated neutral populations growing exponentially at rates α ∈ [100, 600]. Fig 3A shows a close similarity between the 1-HAF values under exponential growth (blue) and the selective sweep trough (red).
The pre-fixation 1-HAF-peak
As the selective sweep progresses, the value of the HAF score of haplotypes carrying the favored allele increases, eventually reaching a peak value. Consider n haplotypes sampled from a fixed population of N haploid individuals under a selective sweep. Let μ denote the mutation rate per base per generation in the genomic region of interest, and assume that there is no recombination. The scaled mutation rate is given by θ = 2Nμ.
We let ν denote the fraction of carrier haplotypes in the sample. When ν ≤ 1/n (i.e., 0 or 1 carriers), there is no selection going back in time, and the time to MRCA can be computed using the neutral Wright-Fisher model [46]. The expected 1-HAF scores for carriers and non-carriers are identical (Eq (3)). At the time when ν first equals 1, there are no non-carriers, and the HAF scores are given by the exponential growth model. In S1 Text, we model the 1-HAF scores for all intermediate values of ν.
Let 1-HAF_car (respectively, 1-HAF_non) denote the 1-HAF score of a random haplotype carrying the favored allele (respectively, a non-carrier) when a fraction ν of the n sampled haplotypes carry the favored allele. In S1 Text, we show that under strong selection (Ns ≫ 1) and no recombination (ρ = 0),
(13) |
(14) |
For any sample of size n, the carrier haplotypes reach a peak value of 1-HAF_car as ν varies along its trajectory. We do not compute the expected value of this peak, 𝔼[max_ν 1-HAF_car(ν, n)], directly. Instead, we compute the peak value of 𝔼[1-HAF_car(ν, n)] (maximizing over all ν ∈ [0, 1]) as
(15) |
Note that under strong selection, this peak does not depend on s (see Fig 3B).
The trough for each trajectory is computed as the 1-HAF score at fixation (when ν = 1 is first reached).
Empirical validation
Simulated data under selective sweeps with coefficients s ∈ [0.005, 0.040] show that for strong selection (Ns ≫ 1) (i) the pre-fixation HAF peak scores appear to be independent of the selection coefficient (Fig 3B), and (ii) as predicted by Eq (15), the mean value of the HAF peak score is approximately θn. We also simulated (1-HAF_car)/(nθ) as a function of ν (Fig 3C and S15 Fig). The results show a tight correspondence between theory and empirical observations.
HAF score application: Characterizing carriers and non-carriers
Our understanding of the dynamics of HAF scores of a haplotype during a selective sweep has many potential applications. For example, we could compare the dynamics of hard and soft sweeps to distinguish between the two events. Second, HAF scores of haplotypes in a region under selection might help predict the future MRCA of a population. Finally, by conditioning on known or deduced selective sweeps in a population sample, we can predict the state (carrier/non-carrier) of the favored allele in its haplotypes. Below we explore the last application, leaving the first two to future work.
In Fig 4A, we show the distributions of haplotype 1-HAF scores aggregated from 500 simulated populations undergoing a hard selective sweep (see ‘Simulations’ in Methods for detailed parameter choices). Scores were computed for random samples of n = 200 haplotypes taken at regular time intervals. They are stratified by the frequency of the favored allele at the time of sampling. Further, scores are stratified into carrier and non-carrier classes (of the favored allele). As with a single population, HAF scores of carriers and non-carriers diverge as the sweep progresses in frequency. We note, however, that even close to fixation (frequencies 80–100%) the distributions of HAF scores between carriers and non-carriers maintain considerable overlap. The high variance in HAF scores makes them only weakly informative of sweep carrier status when comparing across population samples (or genomic regions within a single population). Within a single population sample, however, the HAF scores are highly informative of the carrier status. This is illustrated in Fig 4B, showing the distributions of HAF score percentile rank within their respective samples. We observe that the rank distributions have minimal overlap for carriers and non-carriers of the favored allele. Any remaining overlap in the percentile rank distributions in the final stages of a sweep (favored allele frequency ≥ 70%) stems mostly from recombination, which allows the favored allele to recombine onto haplotypes outside the selected clade (creating low HAF score carriers) and vice-versa (creating high HAF score non-carriers). The overall strong separation between carriers and non-carriers is further illustrated by the highly significant P-values of Wilcoxon rank sum tests rejecting the null hypothesis of identically distributed HAF scores among carriers and non-carriers within each population sample (Fig 4C).
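The within-sample ranking described above can be illustrated with a short sketch. The exact percentile-rank convention used for Fig 4B is not stated here, so this version assumes one simple choice: the percentage of haplotypes in the same sample with a strictly lower score. The function name and the toy scores are hypothetical.

```python
from bisect import bisect_left


def percentile_ranks(scores):
    """Percent of haplotypes in the same sample with a strictly lower score."""
    ordered = sorted(scores)
    n = len(scores)
    # bisect_left counts how many sorted scores are strictly below each score.
    return [100.0 * bisect_left(ordered, s) / n for s in scores]


# Two hypothetical samples whose raw HAF scores sit on very different scales.
sample_a = [4, 3, 5, 2]
sample_b = [4000, 3000, 5000, 2000]

print(percentile_ranks(sample_a))
print(percentile_ranks(sample_b))
```

Both samples yield the same ranks even though their raw scores differ a thousandfold, which is the sense in which ranks are comparable within a sample when raw scores are not comparable across samples.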
Fig 4 does not show how HAF scores are distributed following fixation of the sweep. Starting at fixation, we see a strong decline in HAF scores owing to the loss of many high frequency derived alleles from the pool of segregating sites. However, crossover events may unlink hitchhiking alleles from the favored allele, and they may remain segregating in the population even after fixation of the favored allele. Therefore, the decline in HAF scores may be abrupt or gradual, depending on the linkage between the favored and hitchhiking alleles. Finally, after reaching a trough, HAF scores gradually recover to their neutral levels over time. The post-fixation dynamics of HAF scores are shown in S7 Fig.
PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps
Our simulations suggest that, in a region undergoing a selective sweep, we could use HAF scores to predict whether a haplotype is carrying the favored allele. We implemented a simple algorithm (PreCIOSS) to carry out this prediction by clustering HAF scores in a sample. PreCIOSS takes as input a set of binary haplotypes sampled from a population undergoing a selective sweep. For each haplotype, the ℓ-HAF score is computed (ℓ = 1 by default). We then fit a Gaussian Mixture Model (GMM) with exactly two Gaussians to the haplotype HAF scores. The fit is performed using Expectation Maximization (EM). Finally, we apply the fitted model to label each haplotype by the Gaussian component to which the model assigns it. Haplotypes in the component with the higher HAF scores are labeled ‘carriers’.
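The clustering step can be sketched with a small, dependency-free EM fit of a two-component Gaussian mixture to one-dimensional scores. This is a minimal sketch under stated assumptions, not the PreCIOSS implementation: the initialization, the fixed iteration count, the variance floor, and all function names are choices made here for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(scores, n_iter=200, min_var=1e-6):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    lo, hi = min(scores), max(scores)
    # Illustrative initialization: means at the quarter points of the range.
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    grand = sum(scores) / len(scores)
    var0 = max(sum((x - grand) ** 2 for x in scores) / len(scores), min_var)
    variances, weights = [var0, var0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each score belongs to component 1.
        resp = []
        for x in scores:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            resp.append(p1 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate weight, mean and variance of each component.
        for c in (0, 1):
            r = [ri if c == 1 else 1.0 - ri for ri in resp]
            total = sum(r)
            if total == 0:
                continue
            weights[c] = total / len(scores)
            means[c] = sum(ri * x for ri, x in zip(r, scores)) / total
            spread = sum(ri * (x - means[c]) ** 2 for ri, x in zip(r, scores))
            variances[c] = max(spread / total, min_var)
    return weights, means, variances


def predict_carriers(scores):
    """Label as carrier (True) each score assigned to the higher-mean component."""
    weights, means, variances = fit_two_gaussians(scores)
    high = 0 if means[0] > means[1] else 1
    labels = []
    for x in scores:
        p = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in (0, 1)]
        labels.append(p[high] >= p[1 - high])
    return labels


# Hypothetical 1-HAF scores: five low-scoring and five high-scoring haplotypes.
toy_scores = [10, 11, 12, 9, 10, 100, 101, 99, 102, 100]
print(predict_carriers(toy_scores))
```

In practice a library mixture model, for example scikit-learn's `GaussianMixture` with two components, could replace the hand-written EM loop; the version above only avoids external dependencies.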
We apply PreCIOSS to data from simulated populations undergoing hard and soft sweeps (see ‘Simulations’ in Methods). The haplotypes predicted as carriers might in fact be carriers (True Positives, TP) or non-carriers (False Positives, FP). Similarly, the haplotypes predicted as non-carriers could be True Negatives (TN) or False Negatives (FN). We measure the balanced accuracy
$$\text{balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) \tag{16}$$
which is more appropriate to use than Rand accuracy (TP+TN)/(total predictions) when the positive and negative classes appear at different proportions in the sample [48].
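A short sketch shows why balanced accuracy, the mean of the true positive rate and the true negative rate, is preferred under class imbalance. The function names and the toy labels are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Return (TP, TN, FP, FN) for boolean carrier predictions and true states."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, tn, fp, fn


def balanced_accuracy(predicted, actual):
    """Mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both carriers and non-carriers must be present")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def rand_accuracy(predicted, actual):
    """Fraction of all predictions that are correct."""
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    return (tp + tn) / (tp + tn + fp + fn)


# Early in a sweep: 2 carriers among 10 haplotypes, all predicted non-carriers.
actual = [True, True] + [False] * 8
predicted = [False] * 10
print(rand_accuracy(predicted, actual))      # 0.8
print(balanced_accuracy(predicted, actual))  # 0.5
```

A classifier that never predicts a carrier scores 0.8 on Rand accuracy here but only 0.5 on balanced accuracy, the value expected from an uninformative predictor.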
While there are no tools currently available that directly predict the carrier state of a haplotype, some approaches are relevant. For example, Grossman et al. [49] developed a ‘composite of multiple signals’ (CMS) statistic to reduce the number of candidates for the favored mutation, but CMS cannot directly be used to identify carriers of the favored mutation. Similarly, the iHS statistic uses the dominant haplotype frequency decay in a window centered around each locus, as a test for recent positive selection [30]. As a comparison, we used iHS to distinguish carriers from non-carriers based on segregating alleles at the locus with peak iHS score. The balanced accuracy of PreCIOSS on hard sweeps is shown in Fig 5A for a specific choice of parameters (200 samples with n = 200, θ = 48, ρ = 25, s = 0.01). Once the sweep reaches frequencies above 30%, the balanced accuracy increases (median ∼70%) and remains high (median ∼90%) for the remainder of the sweep. At the beginning of the sweep, the balanced accuracy, despite being asymptotically unbiased, suffers from high variance due to the severe class imbalance (few carriers in the beginning, few non-carriers at the end). The accuracy is reduced for soft sweeps (Fig 5B, run with similar parameters), as increasing the carrier haplotype frequency leads to higher variance in 1-HAF scores.
We tested PreCIOSS under a wide range of population-genetic parameters (S1 Table), and observed consistently high balanced accuracy in carrier-state prediction as the sweeps progressed (S8 Fig). Specifically, PreCIOSS is quite robust to changes in sample size (S8A–S8D Fig). A higher recombination rate has only a limited impact (S8A and S8H Fig), while setting ρ = 0 shows reduced performance at an early stage of the sweep (S8A and S8G Fig). This is consistent with selection acting more efficiently in the presence of recombination.
We tested the effect of the position of the carrier mutation (unknown to PreCIOSS) on the performance of PreCIOSS. We considered different 50 kb windows, with the carrier mutation located at one end (0 kb), and moving towards the middle (25 kb). For each location of the carrier, we simulated 200 samples with n = 200, θ = 48, ρ = 25, s = 0.01 (S9 Fig), but did not observe a marked change in accuracy. However, when the favored allele is in the middle of the window, the median balanced accuracy is generally higher and has lower variance (S10 Fig).
Finally, we tested PreCIOSS on a popular model of European demography [50]. The model (S11 Fig) suggests an Out-of-Africa migration 51 kya (51 thousand years ago), followed by a European and East Asian split 23 kya. It also suggests bottlenecks that reduced the effective population sizes of the European (N_Eu0 = 1032) and East Asian (N_As0 = 550) populations, and exponential growth in the populations following the bottleneck events. We simulated populations based on this model, as well as selection events (hard sweeps) at different times after the Out-of-Africa migration, and partitioned all samples into two categories depending on whether the selection event happened before or after the bottleneck. These scenarios are challenging for most tests of adaptation (see, e.g., [23]). However, there are still significant differences in the 1-HAF scores of carriers and non-carriers. The balanced accuracy of PreCIOSS is shown in Fig 6A for ancient selection and Fig 6B for recent (after bottleneck) selection. The performance is quite robust, although somewhat worse in the early stages of the sweep. Once the favored allele frequency reaches 60%, the median accuracy is at 0.9. Even for very recent sweeps, where the carrier frequency is 30–40%, the median balanced accuracy is close to 0.8. We used a lower selection coefficient for ancient selection compared to recent selection to ensure that we have sufficient cases of incomplete sweeps. Not surprisingly, the performance of PreCIOSS is worse for ancient selection than for recent selection.
Our results suggest that for cases of recent adaptation (e.g., lactase adaptation, shown in Fig 7A, which happened between 2 kya and 20 kya and rapidly spread to high frequencies in the European population), PreCIOSS would show good performance in separating the carriers and non-carriers.
Applying PreCIOSS to human selective sweeps
To evaluate the effectiveness of PreCIOSS in distinguishing carriers of a selective sweep from non-carriers, we applied it to several genomic regions (e.g., [39]) where (i) there is strong evidence of a selective sweep, and (ii) the favored allele has been characterized. In applying PreCIOSS to the datasets, we assumed that the region was known, but did not supply the favored allele to PreCIOSS. In each case, we tested if PreCIOSS could separate the haplotypes that carried the favored allele. We use phased haplotypes from the HapMap project [51], setting the ancestral allele to that observed in orthologous Chimpanzee sequence [52].
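As a reminder of what is computed from the polarized haplotypes, the following is a minimal sketch of the HAF score. It assumes the definition used earlier in the paper: the 1-HAF score of a haplotype is the sum, over the derived alleles it carries, of the number of sampled haplotypes carrying each allele, with the ℓ-HAF score raising each count to the power ℓ. The function name and the 0/1 list encoding are illustrative.

```python
def haf_scores(haplotypes, power=1):
    """Return one HAF score per haplotype.

    haplotypes: list of equal-length lists of 0/1, where 1 marks a derived allele.
    power: the exponent applied to each derived-allele count (1 gives 1-HAF).
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    # Number of sampled haplotypes carrying the derived allele at each site.
    derived_counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    # Each haplotype sums the (powered) counts of the derived alleles it carries.
    return [
        sum(derived_counts[j] ** power for j in range(n_sites) if h[j])
        for h in haplotypes
    ]


example = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
print(haf_scores(example))  # [5, 3, 1, 5]
```

In the toy sample, the two haplotypes sharing both common derived alleles receive the highest score, mirroring the behavior expected of carriers in a sweep.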
LCT
We consider the well-known sweep in the lactase (LCT) gene region in Northern Europeans. The best characterized variant is C/T-13910 (rs4988235), for which the T allele was found to be 100% associated with lactase persistence in the Finnish population [53]. T-13910 was further shown to be causal by in vitro analysis, where it was found to increase enhancer activity [54, 55]. We considered haplotypes from the CEU population, applying PreCIOSS to a 50 kb window centered at C/T-13910 (Fig 7A). This yielded 100% accuracy in separating carriers from non-carriers. As the window size increased above 50 kb, the balanced classification accuracy dropped to ∼90% (Fig 7B). LCT shows the highest and most prolonged (with increasing distance from the causal site) statistical significance in separating carrier and non-carrier HAF scores, remaining highly significant for haplotypes of 1.5 Mb (Fig 7B, blue line). Despite this highly significant separation, the classification accuracy is initially unstable, alternating between ∼100% (perfect classification) and ∼90%. This is due to the pattern of HAF scores observed in the LCT region, where carriers form a tight cluster with the highest scores, but several non-carriers cluster closer to carriers than to the majority of other non-carriers (Fig 7A). These haplotypes are therefore sometimes included (90% accuracy) and sometimes excluded (100% accuracy) from the reported ‘carriers’ class.
TRPV6
Transient Receptor Potential Cation Channel, Subfamily V, Member 6 (TRPV6) is a membrane calcium channel thought to mediate the rate-limiting step of dietary calcium absorption. It is reportedly under strong positive selection in several non-African populations [56, 57]. Following Peter et al. (2012) [39], we focus on the CEU population and set rs4987682 as the favored allele. This site, of the three non-synonymous SNPs with highest allele frequency differentiation among human populations, is the only one located in the N-terminal region of TRPV6, thought to be the target of selection [58]. Applying PreCIOSS to a 50 kb window centered at this site, we obtain ∼99% balanced classification accuracy (Fig 7C), which gradually decays to ∼80% when considering a 1.5 Mb window (Fig 7D). As with LCT, the separation in HAF scores between carriers and non-carriers of the allele is highly statistically significant (P < 10^−7; see Fig 7D). Unlike LCT, accuracy decays stably with distance from the favored site. This appears to be due to the less complex clustering pattern of non-carrier HAF scores in the region (Fig 7C).
PSCA
Prostate Stem Cell Antigen (PSCA) has been proposed to be under selection in a global analysis of allele frequency differentiation [59]. The putative causal site is a non-synonymous SNP (rs2294008) known to be involved in several cancer types [60, 61]. Interestingly, the derived allele is observed in all human populations, but at vastly different frequencies. It is most common in West Africa and East Asia, where it segregates at ∼75% frequency. We consider haplotypes from the YRI population and apply PreCIOSS to a 50 kb window centered at rs2294008, yielding balanced accuracy of 97% (Fig 7E). Unlike LCT and TRPV6, accuracy decays more noticeably with distance, reaching 63% at 1.25 Mb (Fig 7F). This sharper decay in accuracy is even more pronounced when considering the sweep in the CHB population (S12 Fig). Such decay is consistent with a (soft) sweep from the standing variation, which would allow more time for recombination to break the linkage between the favored allele and hitchhiking variation. Indeed, the sweep in PSCA was proposed to be from the standing variation by Bhatia et al. (2011) [59], and further substantiated as such by Peter et al. (2012) [39].
ADH1B
ADH1B encodes one of four subunits of Alcohol dehydrogenase (ADH1), which plays a key role in alcohol degradation. ADH1 genes (including ADH1B) have been studied extensively on both a functional and a population-genetic level, as they are thought to be one of the major drivers of alcoholism risk [62]. These genes have also been suggested to cause the “alcohol flush” phenotype common in Asian populations [63]. A specific non-synonymous mutation in ADH1B (Arg47His, rs1229984) has been proposed to be the target of selection. This is because (i) the derived allele has been shown to cause increased enzymatic activity [64, 65], and (ii) the estimated age of the allele coincides with rice domestication [63, 66] and the availability of fermented beverages [67]. Computing HAF scores for phased haplotypes from East Asian populations (CHB+JPT) and applying PreCIOSS, we obtained balanced classification accuracy of 92% using a 50 kb window centered at rs1229984 (Fig 7G). Both accuracy and statistical significance (of class separation) gradually decay with increasing window size (Fig 7H). As before, statistical significance decays more stably than classification accuracy.
EDAR
EDAR encodes a cell-surface receptor and has been associated with development of distinct hair and teeth morphologies [68, 69]. Specifically, a non-synonymous SNP (rs3827760, V370A) has been associated with these phenotypes [70]. The SNP is located within a DEATH-domain, which is highly conserved in mammals [71], and has been experimentally confirmed (in vitro) to increase EDAR activity [70]. It is found at very high frequencies in East Asian and American populations, while being completely absent from Europeans and Africans [70]. The EDAR gene has been found to be under selection in multiple studies [30, 56, 72], showing one of the strongest signatures of selection genome wide among the 1000 Genomes populations [39]. Applying PreCIOSS to phased CHB+JPT haplotypes in a 50 kb region centered at rs3827760, we obtained 98% balanced accuracy in predicting carriers vs. non-carriers of the allele.
In each case, PreCIOSS was applied to a 50 kb window centered at the favored allele, and separated the carriers and non-carriers with high balanced accuracy of 92–100% (Fig 7). The accuracy decayed with increasing window size, but in many cases stayed high even for windows of 1.5 Mb.
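The window experiments above amount to restricting the haplotype matrix to sites within a given distance of the favored site before scoring. A minimal sketch follows, assuming sites are described by base-pair positions and haplotypes by 0/1 lists; both function names are hypothetical and not taken from the PreCIOSS software.

```python
def sites_in_window(positions, center, window_bp=50_000):
    """Indices of sites lying within a window of window_bp centered at `center`."""
    half = window_bp / 2
    return [i for i, pos in enumerate(positions) if abs(pos - center) <= half]


def restrict_to_sites(haplotypes, site_indices):
    """Keep only the selected columns (sites) of each haplotype."""
    return [[h[i] for i in site_indices] for h in haplotypes]


positions = [100, 20_000, 30_000, 60_000, 130_000]
print(sites_in_window(positions, center=50_000))  # [2, 3]
```

Growing `window_bp` toward 1.5 Mb adds sites that are more weakly linked to the favored allele, which is consistent with the decay in accuracy reported for larger windows.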
Discussion
This paper introduces a new perspective on the genetic signatures of selective sweeps. From identifying and characterizing sweeps in a population sample—the topic of typical studies of selective sweeps—we progress to considering the role of individual haplotypes within an ongoing sweep. Using both simulated and real data, we show that the HAF score is well-correlated with the relative fitness of individual haplotypes, and that our algorithm (PreCIOSS) is highly effective at predicting carriers of selective sweeps.
The HAF framework has many natural extensions and potential applications. On the theoretical side, we have obtained the expected HAF score in both constant-sized and exponentially growing populations evolving neutrally (Eqs (3) and (12)). However, we do not yet know the variance. This quantity would provide a better understanding of the respective distributions, and a means to statistically test for deviations from neutrality. Moreover, although we have observed in simulation and in practice that our theoretical argument is robust to recombination (genealogies violating a tree structure), a theoretical argument supporting these observations would be valuable.
In terms of application, several additional directions are worth investigating. The HAF framework is potentially useful in distinguishing hard from soft sweeps. Intuitively, hard sweep genealogies will likely have a single hitchhiking branch dominating the HAF scores, and leading to near-uniform scores in favored haplotypes. However, soft sweep genealogies may have several hitchhiking branches, potentially leading to distinct HAF score peaks. Even if the different favored clades happen to have similar scores, the haplotypes within them will not form a highly-related group as expected in hard sweeps.
Our results on known selective sweeps in humans already illustrate this idea (Fig 7). A recent study by Peter et al. (2012) [39] assigned posterior probabilities to hard vs. soft sweeps occurring in the same genes. Peter et al. assigned the highest likelihood of a hard sweep to LCT (0.99), followed by EDAR (0.89), ADH1B (0.78), TRPV6 (0.45), and finally PSCA (0.24). This is in striking concordance with the spread in HAF scores in Fig 7. The clusters capturing the carriers in LCT and EDAR have tightly distributed HAF scores (Fig 7A and 7I). The cluster for ADH1B (Fig 7G) has more variance by comparison, and the variance increases for TRPV6 (Fig 7C) and PSCA (Fig 7E), with PSCA showing the highest variance of HAF scores in carriers.
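One crude way to put a number on the spread discussed above is the coefficient of variation of HAF scores among carriers. The sketch below is only an illustration of that idea under our own assumptions; it is not a statistic used in this paper or by Peter et al., and the function name is hypothetical.

```python
from statistics import mean, pstdev


def carrier_haf_spread(scores, is_carrier):
    """Coefficient of variation (std / mean) of HAF scores among carriers.

    scores: HAF score of each haplotype.
    is_carrier: booleans of the same length, True for carriers of the favored allele.
    """
    carrier_scores = [s for s, c in zip(scores, is_carrier) if c]
    if len(carrier_scores) < 2:
        raise ValueError("need at least two carriers")
    center = mean(carrier_scores)
    if center == 0:
        raise ValueError("carrier scores have zero mean")
    return pstdev(carrier_scores) / center
```

A tight carrier cluster, as seen for LCT, would give a value near zero, while several distinct score peaks among carriers would give a larger value. Dividing by the mean makes values comparable across regions whose HAF scores differ in scale.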
Finally, perhaps the highest potential impact of the HAF score could be in predicting the ‘MRCA of the future’. We know that future haplotypes are more likely to be similar to favored individuals than to unfavored ones, and that HAF scores correlate well with relative fitness in ongoing selective sweeps. Therefore, high-HAF haplotypes are more likely to be similar to future generations. This relationship is particularly valuable when action may be taken based on such predictions. For instance, rapid influenza viral evolution is known to change the strain composition from year to year. The mutations are a mix of favored and deleterious mutations. The fitness and frequency of the current year’s strain have been used to predict the next year’s dominant strain [73]. The HAF score may allow for a careful look at the dynamics of the current strain and possibly offer better insight into the problem. As a second example, tumor cells show great heterogeneity and much variation occurs at the single-cell level. This intra-tumor variation allows sub-populations of cells to resist therapy and proliferate [74]. Once again, HAF scores of haplotypes in cells undergoing treatment can potentially distinguish between carriers and non-carriers of drug resistance mutations, and thereby improve our insight into mechanisms of drug resistance.
Methods
Simulations
We simulated data for various evolutionary scenarios. Neutral samples and sweep samples were generated using the simulator msms [47]. All simulations generated samples of n ∈ [20, 400] haplotypes from a larger effective population of N = 20000 haplotypes, each of length 50 kb. A mutation rate of approximately μ = 2.4⋅10^−8 mutations per bp per generation was used [75, 76]. For our simulations, we chose a population-scaled mutation rate θ ∈ {24, 48}. For human recombination events, a population-scaled rate of ρ = 1.32θ has been proposed (e.g., [77]). We used simulations either with no recombination, or with ρ ∈ {25, 50} in a 50 kb region to approximate human rates.
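The larger of the two θ values can be recovered from the other parameters. The sketch below assumes the scaling θ = 2NμL for a population of N haplotypes and a region of L base pairs (equivalently 4N_e μL with N_e = N/2 diploid individuals); this scaling is our assumption, and the function name is illustrative.

```python
def scaled_mutation_rate(n_haplotypes, mu_per_bp, length_bp):
    """Population-scaled mutation rate for a region, assuming theta = 2 * N * mu * L."""
    return 2 * n_haplotypes * mu_per_bp * length_bp


# N = 20000 haplotypes, mu = 2.4e-8 per bp per generation, L = 50 kb
theta = scaled_mutation_rate(20000, 2.4e-8, 50_000)
print(round(theta, 6))  # 48.0
```

Under the same assumption, θ = 24 corresponds to halving the product μL, for example half the mutation rate over the same 50 kb.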
For exponential growth, we used N = 20000 as the size of the final (current) population. Let r denote the growth rate per generation, so that at t generations prior to the current generation, the population size was N(t) = N e^(−rt). Define the scaled growth rate α = 2Nr. We set α to a range of values in [200, 1600].
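The growth parameterization can be made concrete with a few lines of arithmetic. The sketch uses only the relations stated above, N(t) = N e^(−rt) and α = 2Nr; the function names are illustrative.

```python
import math


def growth_rate_per_generation(alpha, n_current):
    """Invert alpha = 2 * N * r to obtain the per-generation growth rate r."""
    return alpha / (2 * n_current)


def population_size(t, n_current, r):
    """Population size t generations before the present: N(t) = N * exp(-r * t)."""
    return n_current * math.exp(-r * t)


for alpha in (200, 1600):
    r = growth_rate_per_generation(alpha, 20000)
    print(alpha, r, round(population_size(100, 20000, r)))
```

With N = 20000, the range α ∈ [200, 1600] corresponds to r between 0.005 and 0.04 per generation.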
For selective sweeps, we used forward simulations assuming a diploid population with recombination and mutation parameters as described above. While diploid populations were simulated to incorporate recombination, we used phased haplotypes for our analysis. We assumed a single favored allele under selection coefficient s ∈ [0.005, 0.050] and heterozygosity 0.5 (haploid carriers get half the fitness advantage of diploid carriers). When s is ‘low’ (0.001 ≤ s ≤ 0.01), the available tests do not detect a selective sweep with reasonable power [21, 23]. Selection with s ≥ 0.08 is considered ‘high’ (e.g., see [21]). For high values of s, the carrier haplotypes are identical or very similar in simulations, making the problem of detecting carriers easy. Therefore, we chose intermediate values (s ∈ [0.005, 0.050]) in our simulations.
Soft sweeps can arise either due to standing variation or due to multiple favored alleles. Here, we focus on the former, where the favored allele is present in at least one carrier in the population (ν_0 ∈ [1/N, 1]), and drifting at the onset of selection. In our simulations, we set ν_0 at the beginning of the sweep to ν_0 = 1/N = 5⋅10^−5 for hard sweeps and ν_0 ∈ {0.001, 0.02} for soft sweeps, corresponding to 20–400 carrier haplotypes at the onset of selection.
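The correspondence between the starting frequency ν_0 and the number of carrier haplotypes is a direct multiplication by N, sketched here with an illustrative function name.

```python
def carriers_at_onset(nu0, n_haplotypes):
    """Number of haplotypes carrying the favored allele when selection begins."""
    return round(nu0 * n_haplotypes)


N = 20000
print(carriers_at_onset(1 / N, N))   # hard sweep: 1
print(carriers_at_onset(0.001, N))   # soft sweep: 20
print(carriers_at_onset(0.02, N))    # soft sweep: 400
```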
In comparing the performance of PreCIOSS against iHS, we used the software selscan [78] to compute iHS scores.
To investigate the performance of HAF scores on human populations, we used a popular demographic model (S11 Fig) with parameters suggested by Gravel et al. (2011) [50]. Among its properties, the model assumes an Out-of-Africa migration at 51 kya and a split between the European and East Asian populations at 23 kya. This split was accompanied by a bottleneck event that reduced the effective population sizes of the European and East Asian populations, and was followed by exponential growth in these populations. We used msms to simulate populations according to this model.
In modeling selection, we partitioned the onset of selection into two epochs: a ‘pre-bottleneck’ epoch between 51 kya and 23 kya, and a ‘post-bottleneck’ epoch between 23 kya and the current generation. For each epoch, we picked 10000 onset times for selection, chosen uniformly from the time interval, and performed forward simulations with a sample size of n = 200. Samples were taken during the sweeps and partitioned according to carrier allele frequency, with 200 samples randomly chosen for each bin. Samples in the pre-bottleneck epoch were simulated with s = 0.005 to reduce the chance of fixation, and samples in the post-bottleneck epoch with s = 0.02. The balanced accuracy measurements were done independently for the two epochs.
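The sampling and binning scheme described above can be sketched as follows. The bin width of 10% of carrier frequency and the fixed random seed are assumptions made for illustration, as are the function names; the forward simulation itself is not shown.

```python
import random


def sample_onset_times_kya(older_kya, younger_kya, n_times, seed=0):
    """Draw onset times (in kya) uniformly between the two epoch boundaries."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [rng.uniform(younger_kya, older_kya) for _ in range(n_times)]


def carrier_frequency_bin(n_carriers, sample_size, n_bins=10):
    """Index of the carrier-frequency bin (0 = lowest), using integer arithmetic."""
    if not 0 <= n_carriers <= sample_size:
        raise ValueError("n_carriers must lie between 0 and sample_size")
    # A sample in which every haplotype is a carrier falls in the top bin.
    return min(n_carriers * n_bins // sample_size, n_bins - 1)


pre_bottleneck_onsets = sample_onset_times_kya(51, 23, 10000)
post_bottleneck_onsets = sample_onset_times_kya(23, 0, 10000)
print(carrier_frequency_bin(60, 200))  # 3, i.e. the 30-40% bin
```

Counting carriers as integers avoids floating-point edge effects at bin boundaries, for example a sample with exactly 30% carriers.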
Data preprocessing
We downloaded pre-phased haplotype data from the HapMap project website [51]. Both HapMap 3 [51] and HapMap 2 [79] project data were used, depending on whether the causal allele was sampled or not. For LCT (rs4988235), we used 88 CEU individuals from HapMap 3; for PSCA (rs2294008), 60 YRI individuals from HapMap 2; for TRPV6 (rs4987682), 88 CEU individuals from HapMap 3; for ADH1B (rs1229984), 90 CHB+JPT individuals from HapMap 2; and for EDAR (rs3827760), 90 CHB+JPT individuals from HapMap 2. The number of phased haplotypes was twice the number of individuals in each case.
We downloaded Chimpanzee genome alignments [52] to identify the ancestral allele. A total of ∼93% of the sites analyzed were covered by the Chimpanzee data. For these sites, we set the ancestral allele to the Chimpanzee allele, and we discarded sites that were not covered.
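The polarization step can be sketched as below. The input encoding (one allele string per haplotype, and a per-site Chimpanzee allele that is `None` where the alignment gives no coverage) is an assumption made for illustration, as is the function name.

```python
def polarize(haplotypes, chimp_alleles):
    """Convert allele strings to 0/1 (ancestral/derived) using the Chimpanzee allele.

    haplotypes: list of equal-length strings, one character per site.
    chimp_alleles: list with one allele per site, or None where the site is not covered.
    Returns (kept_site_indices, derived_matrix).
    """
    # Sites without Chimpanzee coverage are discarded.
    kept = [j for j, allele in enumerate(chimp_alleles) if allele is not None]
    # Any allele that differs from the Chimpanzee allele is treated as derived.
    derived = [
        [0 if h[j] == chimp_alleles[j] else 1 for j in kept]
        for h in haplotypes
    ]
    return kept, derived


kept, derived = polarize(["ACG", "ATG", "GTG"], ["A", None, "G"])
print(kept)     # [0, 2]
print(derived)  # [[0, 0], [0, 0], [1, 0]]
```

The sketch does not handle sites where neither human allele matches the Chimpanzee allele; it would mark both alleles as derived at such sites.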
Software
The PreCIOSS software is available from the website http://bix.ucsd.edu/projects/precioss/.
Supporting Information
Data Availability
All relevant data are within the paper and its Supporting Information files.
Funding Statement
This work was supported in part by National Science Foundation grants CCF-1115206, IIS-1318386, DBI-1458557, and DBI-1458059. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Fu W, Akey JM. Selection and adaptation in the human genome. Annu Rev Genomics Hum Genet. 2013;14:467–489. doi:10.1146/annurev-genom-091212-153509
- 2. Lachance J, Tishkoff SA. Population Genomics of Human Adaptation. Annu Rev Ecol Evol Syst. 2013 November;44:123–143. doi:10.1146/annurev-ecolsys-110512-135833
- 3. Vitti JJ, Grossman SR, Sabeti PC. Detecting natural selection in genomic data. Annu Rev Genet. 2013;47:97–120. doi:10.1146/annurev-genet-111212-133526
- 4. Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamante C. Genomic scans for selective sweeps using SNP data. Genome Res. 2005 November;15(11):1566–1575. doi:10.1101/gr.4252305
- 5. Pickrell JK, Coop G, Novembre J, Kudaravalli S, Li JZ, Absher D, et al. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 2009 May;19(5):826–837. doi:10.1101/gr.087577.108
- 6. Chen H, Patterson N, Reich D. Population differentiation as a test for selective sweeps. Genome Res. 2010 March;20(3):393–402. doi:10.1101/gr.100545.109
- 7. Berg JJ, Coop G. A population genetic signal of polygenic adaptation. PLoS Genet. 2014 August;10(8):e1004412. doi:10.1371/journal.pgen.1004412
- 8. Jeong C, Di Rienzo A. Adaptations to local environments in modern human populations. Curr Opin Genet Dev. 2014 December;29C:1–8. doi:10.1016/j.gde.2014.06.011
- 9. Tekola-Ayele F, Adeyemo A, Chen G, Hailu E, Aseffa A, Davey G, et al. Novel genomic signals of recent selection in an Ethiopian population. Eur J Hum Genet. 2014 November; advance online publication. doi:10.1038/ejhg.2014.233
- 10. Yi X, Liang Y, Huerta-Sanchez E, Jin X, Cuo ZXP, Pool JE, et al. Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude. Science. 2010;329(5987):75–78. Available from: http://www.sciencemag.org/content/329/5987/75.abstract. doi:10.1126/science.1190371
- 11. Simonson TS, Yang Y, Huff CD, Yun H, Qin G, Witherspoon DJ, et al. Genetic evidence for high-altitude adaptation in Tibet. Science. 2010 July;329(5987):72–75. doi:10.1126/science.1189406
- 12. Scheinfeldt LB, Soi S, Thompson S, Ranciaro A, Woldemeskel D, Beggs W, et al. Genetic adaptation to high altitude in the Ethiopian highlands. Genome Biol. 2012;13(1):R1. doi:10.1186/gb-2012-13-1-r1
- 13. Alkorta-Aranburu G, Beall CM, Witonsky DB, Gebremedhin A, Pritchard JK, Di Rienzo A. The genetic architecture of adaptations to high altitude in Ethiopia. PLoS Genet. 2012;8(12):e1003110. doi:10.1371/journal.pgen.1003110
- 14. Huerta-Sanchez E, Degiorgio M, Pagani L, Tarekegn A, Ekong R, Antao T, et al. Genetic signatures reveal high-altitude adaptation in a set of ethiopian populations. Mol Biol Evol. 2013 August;30(8):1877–1888. doi:10.1093/molbev/mst089
- 15. Udpa N, Ronen R, Zhou D, Liang J, Stobdan T, Appenzeller O, et al. Whole genome sequencing of Ethiopian highlanders reveals conserved hypoxia tolerance genes. Genome Biol. 2014 February;15(2):R36. doi:10.1186/gb-2014-15-2-r36
- 16. Zhou D, Udpa N, Ronen R, Stobdan T, Liang J, Appenzeller O, et al. Whole-genome sequencing uncovers the genetic basis of chronic mountain sickness in Andean highlanders. Am J Hum Genet. 2013 September;93(3):452–462. doi:10.1016/j.ajhg.2013.07.011
- 17. Kaplan NL, Hudson RR, Langley CH. The “hitchhiking effect” revisited. Genetics. 1989 December;123(4):887–899.
- 18. Smith JM, Haigh J. The hitch-hiking effect of a favourable gene. Genet Res. 1974 February;23(1):23–35. doi:10.1017/S0016672300014634
- 19. Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989 November;123(3):585–595.
- 20. Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics. 2000 July;155:1405–1413.
- 21. Pavlidis P, Jensen JD, Stephan W. Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations. Genetics. 2010 July;185(3):907–922. doi:10.1534/genetics.110.116459
- 22. Lin K, Li H, Schlotterer C, Futschik A. Distinguishing positive selection from neutral evolution: boosting the performance of summary statistics. Genetics. 2011 January;187(1):229–244. doi:10.1534/genetics.110.122614
- 23. Ronen R, Udpa N, Halperin E, Bafna V. Learning natural selection from the site frequency spectrum. Genetics. 2013 September;195(1):181–193. doi:10.1534/genetics.113.152587
- 24. Simonsen KL, Churchill GA, Aquadro CF. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics. 1995 September;141(1):413–429.
- 25. Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics. 1995 June;140(2):783–796.
- 26. Hudson RR, Bailey K, Skarecky D, Kwiatowski J, Ayala FJ. Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster. Genetics. 1994 April;136(4):1329–1340.
- 27. Depaulis F, Mousset S, Veuille M. Haplotype tests using coalescent simulations conditional on the number of segregating sites. Mol Biol Evol. 2001 June;18(6):1136–1138. doi:10.1093/oxfordjournals.molbev.a003885
- 28. Innan H, Zhang K, Marjoram P, Tavare S, Rosenberg NA. Statistical tests of the coalescent model based on the haplotype frequency distribution and the number of segregating sites. Genetics. 2005 March;169(3):1763–1777. doi:10.1534/genetics.104.032219
- 29. Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, Schaffner SF, et al. Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002 October;419(6909):832–837. doi:10.1038/nature01140
- 30. Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006 March;4(3):e72. doi:10.1371/journal.pbio.0040072
- 31. Toomajian C, Hu TT, Aranzana MJ, Lister C, Tang C, Zheng H, et al. A nonparametric test reveals selection for rapid flowering in the Arabidopsis genome. PLoS Biol. 2006 May;4(5):e137. doi:10.1371/journal.pbio.0040137
- 32. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, et al. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007 October;449(7164):913–918. doi:10.1038/nature06250
- 33. Kim Y, Stephan W. Selective sweeps in the presence of interference among partially linked loci. Genetics. 2003 May;164(1):389–398.
- 34. Messer PW, Petrov DA. Population genomics of rapid adaptation by soft selective sweeps. Trends Ecol Evol (Amst). 2013 November;28(11):659–669. doi:10.1016/j.tree.2013.08.003
- 35. Hermisson J, Pennings PS. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics. 2005 April;169(4):2335–2352. doi:10.1534/genetics.104.036947
- 36. Pennings PS, Hermisson J. Soft sweeps II–molecular population genetics of adaptation from recurrent mutation or migration. Mol Biol Evol. 2006 May;23(5):1076–1084. doi:10.1093/molbev/msj117
- 37. Ferrer-Admetlla A, Liang M, Korneliussen T, Nielsen R. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol Biol Evol. 2014 May;31(5):1275–1291. doi:10.1093/molbev/msu077
- 38. Garud NR, Messer PW, Buzbas EO, Petrov DA. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 2015 February;11(2):e1005004. doi:10.1371/journal.pgen.1005004
- 39. Peter BM, Huerta-Sanchez E, Nielsen R. Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genet. 2012;8(10):e1003011. doi:10.1371/journal.pgen.1003011
- 40. Schrider DR, Mendes FK, Hahn MW, Kern AD. Soft Shoulders Ahead: Spurious Signatures of Soft and Partial Selective Sweeps Result from Linked Hard Sweeps. Genetics. 2015 Feb; advance online publication.
- 41. Wilson BA, Petrov DA, Messer PW. Soft selective sweeps in complex demographic scenarios. Genetics. 2014 October;198(2):669–684. doi:10.1534/genetics.114.165571
- 42. Fu YX. Statistical properties of segregating sites. Theor Popul Biol. 1995 October;48(2):172–197. doi:10.1006/tpbi.1995.1025
- 43. Hudson RR. Gene genealogies and the coalescent process. In: Futuyma D, Antonovics J, editors. Oxford Surveys in Evolutionary Biology. Oxford: Oxford University Press; 1990. p. 1–44.
- 44. Slatkin M, Hudson RR. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics. 1991 October;129(2):555–562.
- 45. Graham R, Knuth DE, Patashnik O. Concrete Mathematics: A Foundation for Computer Science. 2nd ed. Reading, Mass: Addison-Wesley; 1994.
- 46. Nordborg M. Coalescent Theory. In: Balding DJ, Bishop M, Cannings C, editors. Handbook of statistical genetics. 3rd ed. John Wiley & Sons, Ltd; 2008. p. 843–877.
- 47. Ewing G, Hermisson J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics. 2010 August;26(16):2064–2065. doi:10.1093/bioinformatics/btq322
- 48. Brodersen KH, Ong CS, Stephan KE, Buhmann JM. The Balanced Accuracy and Its Posterior Distribution. In: Pattern Recognition (ICPR), 2010 20th International Conference on; 2010. p. 3121–3124.
- 49. Grossman SR, Shlyakhter I, Shylakhter I, Karlsson EK, Byrne EH, Morales S, et al. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science. 2010 February;327(5967):883–886. doi:10.1126/science.1183863
- 50. Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, et al. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci USA. 2011 July;108(29):11983–11988. doi:10.1073/pnas.1019276108
- 51. Altshuler DM, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010 September;467(7311):52–58. doi:10.1038/nature09298
- 52. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005 September;437(7055):69–87. doi:10.1038/nature04072
- 53. Kuokkanen M, Enattah NS, Oksanen A, Savilahti E, Orpana A, Jarvela I. Transcriptional regulation of the lactase-phlorizin hydrolase gene by polymorphisms associated with adult-type hypolactasia. Gut. 2003 May;52(5):647–652. doi:10.1136/gut.52.5.647
- 54. Olds LC, Sibley E. Lactase persistence DNA variant enhances lactase promoter activity in vitro: functional role as a cis regulatory element. Hum Mol Genet. 2003 September;12(18):2333–2340. doi:10.1093/hmg/ddg244
- 55. Troelsen JT, Olsen J, Møller J, Sjöström H. An upstream polymorphism associated with lactase persistence has increased enhancer activity. Gastroenterology. 2003 December;125(6):1686–1694. doi:10.1053/j.gastro.2003.09.031
- 56. Akey JM, Eberle MA, Rieder MJ, Carlson CS, Shriver MD, Nickerson DA, et al. Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2004 October;2(10):e286. doi:10.1371/journal.pbio.0020286
- 57. Stajich JE, Hahn MW. Disentangling the effects of demography and selection in human history. Mol Biol Evol. 2005 January;22(1):63–73. doi:10.1093/molbev/msh252
- 58. Akey JM, Swanson WJ, Madeoy J, Eberle M, Shriver MD. TRPV6 exhibits unusual patterns of polymorphism and divergence in worldwide populations. Hum Mol Genet. 2006 July;15(13):2106–2113. doi:10.1093/hmg/ddl134
- 59. Bhatia G, Patterson N, Pasaniuc B, Zaitlen N, Genovese G, Pollack S, et al. Genome-wide comparison of African-ancestry populations from CARe and other cohorts reveals signals of natural selection. Am J Hum Genet. 2011 September;89(3):368–381. doi:10.1016/j.ajhg.2011.07.025
- 60. Sakamoto H, Yoshimura K, Saeki N, Katai H, Shimoda T, Matsuno Y, et al. Genetic variation in PSCA is associated with susceptibility to diffuse-type gastric cancer. Nat Genet. 2008 June;40(6):730–740. doi:10.1038/ng.152
- 61. Wu X, Ye Y, Kiemeney LA, Sulem P, Rafnar T, Matullo G, et al. Genetic variation in the prostate stem cell antigen gene PSCA confers susceptibility to urinary bladder cancer. Nat Genet. 2009 September;41(9):991–995. doi:10.1038/ng.421
- 62. Whitfield JB. Alcohol dehydrogenase and alcohol dependence: variation in genotype-associated risk between populations. Am J Hum Genet. 2002 November;71(5):1247–1250. doi:10.1086/344287
- 63. Peng Y, Shi H, Qi XB, Xiao CJ, Zhong H, Ma RL, et al. The ADH1B Arg47His polymorphism in east Asian populations and expansion of rice domestication in history. BMC Evol Biol. 2010;10:15. doi:10.1186/1471-2148-10-15
- 64. Osier MV, Pakstis AJ, Soodyall H, Comas D, Goldman D, Odunsi A, et al. A global perspective on genetic variation at the ADH genes reveals unusual patterns of linkage disequilibrium and diversity. Am J Hum Genet. 2002 July;71(1):84–99. doi:10.1086/341290
- 65. Eng MY, Luczak SE, Wall TL. ALDH2, ADH1B, and ADH1C genotypes in Asians: a literature review. Alcohol Res Health. 2007;30(1):22–27.
- 66. Li H, Mukherjee N, Soundararajan U, Tarnok Z, Barta C, Khaliq S, et al. Geographically separate increases in the frequency of the derived ADH1B*47His allele in eastern and western Asia. Am J Hum Genet. 2007 October;81(4):842–846. doi:10.1086/521201
- 67. McGovern PE, Zhang J, Tang J, Zhang Z, Hall GR, Moreau RA, et al. Fermented beverages of pre- and proto-historic China. Proc Natl Acad Sci USA. 2004 December;101(51):17593–17598. doi:10.1073/pnas.0407921102
- 68. Fujimoto A, Ohashi J, Nishida N, Miyagawa T, Morishita Y, Tsunoda T, et al. A replication study confirmed the EDAR gene to be a major contributor to population differentiation regarding head hair thickness in Asia. Hum Genet. 2008 September;124(2):179–185. doi:10.1007/s00439-008-0537-1
- 69. Kimura R, Yamaguchi T, Takeda M, Kondo O, Toma T, Haneji K, et al. A common variation in EDAR is a genetic determinant of shovel-shaped incisors. Am J Hum Genet. 2009 October;85(4):528–535. doi:10.1016/j.ajhg.2009.09.006
- 70. Bryk J, Hardouin E, Pugach I, Hughes D, Strotmann R, Stoneking M, et al. Positive selection in East Asians for an EDAR allele that enhances NF-kappaB activation. PLoS ONE. 2008;3(5):e2209. doi:10.1371/journal.pone.0002209
- 71. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, et al. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007 October;449(7164):913–918. doi:10.1038/nature06250
- 72. Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci USA. 2005 May;102(22):7882–7887. doi:10.1073/pnas.0502300102
- 73. Luksza M, Lassig M. A predictive fitness model for influenza. Nature. 2014 March;507(7490):57–61. doi:10.1038/nature13087
- 74. Lee MC, Lopez-Diaz FJ, Khan SY, Tariq MA, Dayn Y, Vaske CJ, et al. Single-cell analyses of transcriptional heterogeneity during drug tolerance transition in cancer cells by RNA sequencing. Proc Natl Acad Sci USA. 2014 November;111(44):E4726–4735. doi:10.1073/pnas.1404656111
- 75. Nachman MW, Crowell SL. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000 September;156(1):297–304.
- 76. Campbell CD, Chong JX, Malig M, Ko A, Dumont BL, Han L, et al. Estimating the human mutation rate using autozygosity in a founder population. Nat Genet. 2012. November;44(11):1277–1281. 10.1038/ng.2418 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77. Hey J, Wakeley J. A coalescent estimator of the population recombination rate. Genetics. 1997. March;145(3):833–846. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78. Szpiech ZA, Hernandez RD. selscan: An Efficient Multithreaded Program to Perform EHH-Based Scans for Positive Selection. Mol Biol Evol. 2014. October;31(10):2824–2827. 10.1093/molbev/msu211 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79. Frazer KA, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007. October;449(7164):851–861. 10.1038/nature06258 [DOI] [PMC free article] [PubMed] [Google Scholar]
Data Availability Statement
All relevant data are within the paper and its Supporting Information files.