Abstract
Several methods have been proposed to test for introgression across genomes. One method tests for a genome-wide excess of shared derived alleles between taxa using Patterson’s D statistic, but does not establish which loci show such an excess or whether the excess is due to introgression or ancestral population structure. Several recent studies have extended the use of D by applying the statistic to small genomic regions, rather than genome-wide. Here, we use simulations and whole-genome data from Heliconius butterflies to investigate the behavior of D in small genomic regions. We find that D is unreliable in this situation as it gives inflated values when effective population size is low, causing D outliers to cluster in genomic regions of reduced diversity. As an alternative, we propose a related statistic f̂d, a modified version of a statistic originally developed to estimate the genome-wide fraction of admixture. f̂d is not subject to the same biases as D, and is better at identifying introgressed loci. Finally, we show that both D and f̂d outliers tend to cluster in regions of low absolute divergence (dXY), which can confound a recently proposed test for differentiating introgression from shared ancestral variation at individual loci.
Keywords: ABBA–BABA, gene flow, introgression, population structure, Heliconius, simulation
Introduction
Hybridization and gene flow between taxa play a major role in evolution, acting as a force against divergence, and as a potential source of adaptive novelty (Abbott et al. 2013). Although identifying gene flow between species has been a long-standing problem in population genetics, the issue has received considerable recent attention with the analysis of shared ancestry between humans and Neanderthals (e.g., Yang et al. 2012; Wall et al. 2013; Sankararaman et al. 2014). With genomic data sets becoming available in a wide variety of other taxonomic groups, there is a need for reliable, computationally tractable methods that identify, quantify, and date gene flow between species in large data sets.
A sensitive and widely used approach to test for gene flow is to fit coalescent models using maximum-likelihood or Bayesian methods (Pinho and Hey 2010). However, simulation and model fitting are computationally intensive tasks and are not easily applied on a genomic scale. A simpler and more computationally efficient approach that is gaining in popularity is to test for an excess of shared derived variants using a four-taxon test (Kulathinal et al. 2009; Green et al. 2010; Durand et al. 2011). The test considers ancestral (“A”) and derived (“B”) alleles and is based on the prediction that two particular single nucleotide polymorphism (SNP) patterns, termed “ABBA” and “BABA” (see Materials and Methods), should be equally frequent under a scenario of incomplete lineage sorting without gene flow. An excess of ABBA or BABA patterns is indicative of gene flow between two of the taxa and can be detected using Patterson’s D statistic (Green et al. 2010; Durand et al. 2011; see Materials and Methods for details). However, an excess of shared derived variants can arise from factors other than recent introgression, in particular nonrandom mating in the ancestral population due to population structure (Eriksson and Manica 2012). It is therefore important to make use of additional means to distinguish between these alternative hypotheses, for example, by examining the size of introgressed tracts (Wall et al. 2013), or the level of absolute divergence in introgressed regions (Smith and Kronforst 2013).
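As an illustration of the four-taxon test described above, the following Python sketch computes D from per-site derived allele frequencies. It assumes the frequency-based form of the statistic given by Durand et al. (2011), in which each site contributes (1 − p1)p2p3(1 − pO) to the ABBA total and p1(1 − p2)p3(1 − pO) to the BABA total. The function and argument names are hypothetical and are not taken from the original analysis.

```python
def abba_baba_sums(p1, p2, p3, p_out):
    """Sum ABBA and BABA contributions over sites.

    Each argument is a list of derived allele frequencies (0..1),
    one value per site, for P1, P2, P3 and the outgroup O.
    """
    abba = 0.0
    baba = 0.0
    for a, b, c, o in zip(p1, p2, p3, p_out):
        abba += (1 - a) * b * c * (1 - o)   # derived allele shared by P2 and P3
        baba += a * (1 - b) * c * (1 - o)   # derived allele shared by P1 and P3
    return abba, baba


def patterson_d(p1, p2, p3, p_out):
    """Return D = (ABBA - BABA) / (ABBA + BABA), or None if uninformative."""
    abba, baba = abba_baba_sums(p1, p2, p3, p_out)
    total = abba + baba
    if total == 0:
        return None  # no ABBA or BABA sites in this region
    return (abba - baba) / total
```

With equal ABBA and BABA totals the function returns 0, the expectation under incomplete lineage sorting without gene flow; an excess of ABBA gives a positive value.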
The D statistic was originally designed to be applied on a genome-wide or chromosome-wide scale, with block-jackknifing used to overcome the problem of nonindependence between loci (Green et al. 2010). However, many researchers are interested in identifying particular genomic regions subject to gene flow, rather than simply estimating a genome-wide parameter. Theory predicts that the rate of gene flow should vary across the genome, both in the case of secondary contact after isolation (Barton and Gale 1993) as well as continuous gene flow during speciation (Wu 2001). Indeed, a maximum-likelihood test for speciation with gene flow devised by Yang (2010) is based on detecting this underlying heterogeneity. Moreover, adaptive introgression might lead to highly localized signals of introgression, limited to the particular loci under selection.
Many methods for characterizing heterogeneity in patterns of introgression across the genome have been proposed. Several genomic studies have used FST to characterize heterogeneity in divergence across the genome, often interpreting the variation in FST as indicative of variation in rates of gene flow (e.g., Ellegren et al. 2012). However, it is well established that, as a relative measure of divergence, FST is dependent on within-population genetic diversity (Charlesworth 1998), and is therefore an unreliable indicator of how migration rates vary across the genome. In particular, heterogeneity in purifying selection and recombination rate could confound FST-based studies (Noor and Bennett 2009; Hahn et al. 2012; Roesti et al. 2012; Cruickshank and Hahn 2014). Various studies of admixture among human populations, or between humans and Neanderthals, have used probabilistic methods to assign ancestry to haplotypes, and infer how this ancestry changes across a chromosome (Sankararaman et al. 2008; Price et al. 2009; Henn et al. 2012; Lawson et al. 2012; Omberg et al. 2012; Churchhouse and Marchini 2013; Maples et al. 2013; Sankararaman et al. 2014). Other methods have modeled speciation with the allowance for variable introgression rates among loci (Garrigan et al. 2012; Roux et al. 2013), allowing the detection of more ancient gene flow.
There have also been recent attempts to characterize heterogeneity in patterns of introgression across the genome using the D statistic, calculated either in small windows (Kronforst et al. 2013; Smith and Kronforst 2013) or for individual SNPs (Rheindt et al. 2014). The robustness of the D statistic for detecting a genome-wide excess of shared derived alleles has been thoroughly explored (Green et al. 2010; Durand et al. 2011; Yang et al. 2012; Eaton and Ree 2013; Martin et al. 2013). However, it has not been established whether D provides a robust and unbiased means to identify individual loci with an excess of shared derived alleles, or to demonstrate that these loci have been subject to introgression. Any inherent biases of the D statistic when applied to specific loci have implications for methods that assume its robustness.
For example, Smith and Kronforst (2013) made use of the D statistic in a proposed test to distinguish between the hypotheses of introgression and shared ancestral variation at wing-patterning loci of Heliconius butterflies. Two wing-patterning loci are known to show an excess of shared derived alleles between comimetic populations of Heliconius melpomene and H. timareta (Heliconius Genome Consortium 2012). At one of these loci, phylogenetic evidence and patterns of linkage disequilibrium are consistent with recent gene flow (Pardo-Diaz et al. 2012). Nevertheless, Smith and Kronforst (2013) argue that this shared variation might represent an ancestral polymorphism that was maintained through the speciation event by balancing selection. Conceptually, this is not unlike the population structure argument of Eriksson and Manica (2012), except that here structure is limited to one or a few individual loci.
Smith and Kronforst proposed that the alternative explanations of introgression or ancestral polymorphism could be distinguished by considering absolute divergence within and outside of the loci of interest. Both hypotheses predict an excess of shared derived alleles at affected loci, but introgression should lead to reduced absolute divergence due to more recent coalescence at these loci, whereas the locus-specific population structure hypothesis predicts no reduction in absolute divergence at these loci compared with other loci in the genome. Loci with an excess of shared derived alleles, and therefore showing evidence of shared ancestry, were located by calculating the D statistic in nonoverlapping 5-kb windows across genomic regions of interest, and identifying outliers using an arbitrary cutoff (the 10% of windows with the highest D values). The mean absolute genetic divergence (dXY) was then compared between the outliers and nonoutliers, and found to be significantly lower in outlier windows, consistent with recent introgression (Smith and Kronforst 2013). This method makes two assumptions. First, that the D statistic can accurately identify regions that carry a significant excess of shared variation, and second, that D outliers do not have inherent biases leading to their cooccurrence with regions of low absolute divergence. These assumptions, which extend the use of D beyond its original definition, may be made by other researchers for similar purposes, but they remain to be tested.
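For reference, absolute divergence in a window can be sketched as follows. This assumes the common definition of dXY as the mean number of pairwise differences per site between one sequence from each population, computed here from biallelic allele frequencies; the names `dxy`, `p_x`, `p_y`, and `n_sites` are hypothetical.

```python
def dxy(p_x, p_y, n_sites):
    """Mean pairwise differences per site between populations X and Y.

    p_x, p_y: allele frequencies at the variable sites in the window.
    n_sites:  total number of sites in the window, including invariant
              sites, which contribute zero differences.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    differences = sum(x * (1 - y) + y * (1 - x) for x, y in zip(p_x, p_y))
    return differences / n_sites
```

Because the denominator counts all sites in the window, dXY does not depend on the relative level of within-population diversity in the way that FST does.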
Here, we first assess the reliability of the D statistic as a means to quantify introgression at individual loci. Using simulations of small sequence windows, we compare D to a related statistic that was developed by Green et al. (2010) specifically for estimating f, the proportion of the genome that has been shared, and we propose improvements to this statistic. We then use whole-genome data from several Heliconius species to investigate how these statistics perform on empirical data, and specifically how they are influenced by underlying heterogeneity in diversity across the genome. Lastly, we use a large range of simulated data sets to test the proposal that recent gene flow can be distinguished from shared ancestral variation based on absolute divergence in D outlier regions.
Results
The D Statistic Is Not an Unbiased Estimator of Gene Flow
Patterson’s D statistic was developed to detect, but not to quantify introgression. We used the deterministic derivation of the expected value of D (E[D]) provided by Durand et al. (2011, eq. 5) to test how sensitive the value of D is to other factors apart from the proportion of introgression. We define the proportion of introgression (f) as the proportion of haplotypes in the recipient population (P2) that trace their ancestry through the donor population (P3) at the time of gene flow (fig. 1A). The expected D value increases with the proportion of introgression (f), but not linearly (fig. 1B) and expected D increases as population size decreases (fig. 1B and C). The split times between populations also have a small effect, with a more recent split between P1 and P2 leading to higher expected values of D (fig. 1C). This implies that empirically calculated values of the D statistic will depend on various parameters other than the amount of gene flow, irrespective of the number of sites analyzed.
Direct Estimators of f Outperform D on Simulated Data
Analysis of simulated data confirmed that the D statistic is not an appropriate measure for quantifying introgression over small genomic windows, but that direct estimation of the proportion of introgression (f) provides a more robust alternative. The D statistic (eq. 1, Materials and Methods) was compared with the f estimator of Green et al. (2010) (eq. 4, Materials and Methods), which is referred to as f̂G below, along with two proposed modified versions of this statistic (eqs. 5 and 6, Materials and Methods). The first, f̂hom (eq. 5), is similar to f̂G in that it explicitly assumes unidirectional gene flow from P3 to P2, but makes a further assumption that maximal introgression would lead to complete homogenization of allele frequencies in P2 and P3. This is a conservative assumption, as an extremely high rate of migration would be necessary to attain a maximal value of f̂hom. The second, f̂d (eq. 6), is dynamic in that it allows for bidirectional introgression on a site-by-site basis, setting the donor population at each site as that which has the higher frequency of the derived allele. These f estimators are distinct from the F2, F3, and F4 statistics of Reich et al. (2009, 2012) and Patterson et al. (2012), which all test for correlated allele frequencies associated with introgression (much like the D statistic). However, Patterson’s (2012) F4-ratio is conceptually very similar to the f estimators discussed here, in that it estimates the proportional contribution of a donor population.
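The two modified estimators can be sketched in Python as follows. The sketch assumes that each estimator is a ratio of summed ABBA − BABA differences: the numerator is computed for (P1, P2, P3, O), and the denominator is the value expected under complete introgression, obtained by substituting P3 for P2 (f̂hom) or, at each site, whichever of P2 and P3 has the higher derived allele frequency for both (f̂d). The function names are hypothetical, and f̂G is omitted because it requires splitting the P3 sample into two subsamples.

```python
def s_stat(p1, p2, p3, p_out):
    """Sum over sites of the ABBA minus BABA contributions."""
    return sum(
        (1 - a) * b * c * (1 - o) - a * (1 - b) * c * (1 - o)
        for a, b, c, o in zip(p1, p2, p3, p_out)
    )


def f_hom(p1, p2, p3, p_out):
    """Estimator assuming complete introgression makes P2 identical to P3."""
    denom = s_stat(p1, p3, p3, p_out)
    return s_stat(p1, p2, p3, p_out) / denom if denom else None


def f_d(p1, p2, p3, p_out):
    """Dynamic estimator: the donor at each site is the population
    (P2 or P3) with the higher derived allele frequency."""
    donor = [max(b, c) for b, c in zip(p2, p3)]
    denom = s_stat(p1, donor, donor, p_out)
    return s_stat(p1, p2, p3, p_out) / denom if denom else None
```

For a single site with derived allele frequencies of 0 in P1, 0.8 in P2, and 0.4 in P3, `f_hom` returns 2 whereas `f_d` returns 0.5, which illustrates how choosing the donor per site keeps the estimate from exceeding 1.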
To compare the utility of these statistics for quantifying introgression in small genomic windows, we simulated sequences from four populations: P1, P2, and P3 and outgroup O, with the relationship (((P1,P2),P3),O), with a single instantaneous gene flow event, either from P3 to P2 or from P2 to P3. Simulations were performed over a range of different values of f (the probability that any particular haplotype is shared during the introgression event), and with various window sizes, recombination rates, and times of gene flow.
A subset of the results are shown in figure 2, and full results are provided in supplementary figure S1, Supplementary Material online. In general, the D statistic proved sensitive to the occurrence of introgression, with strongly positive values for any nonzero value of f. However, it was a poor estimator of the amount of introgression, as defined by the simulated value of f (fig. 2). Moreover, D values showed dramatic variance, particularly at low simulated values of f. Even in the absence of any gene flow, a considerable proportion of windows had intermediate D values. This variance decreased with increasing window size and recombination rate (supplementary fig. S1, Supplementary Material online).
In simulations of gene flow from P3 to P2, all three f estimators gave fairly accurate estimates of the simulated f value, provided gene flow was recent (fig. 2 and supplementary fig. S1A, Supplementary Material online). When gene flow occurred further back in time, f estimators tended to give underestimates, but were nevertheless well correlated with the simulated f value (supplementary fig. S1A, Supplementary Material online). This is unsurprising, as genetic drift and the accumulation of mutations in these lineages after the introgression event should dilute the signal of introgression. In simulations of gene flow in the opposite direction, from P2 to P3, both f̂G and f̂hom showed considerable stochasticity, particularly when recombination rates were low and gene flow was recent (supplementary fig. S1, Supplementary Material online). The size of the window had little effect on this behavior (supplementary fig. S1, Supplementary Material online), implying that it was not an effect of the number of sites analyzed, but rather the level of independence among sites. This is also to be expected, as these statistics can give values greater than 1 where derived allele frequencies happen by chance to be higher in P2 than P3 (see Materials and Methods). Unlike these two statistics, f̂d behaved predictably at all recombination rates and times of gene flow, giving estimates that were fairly well correlated with the simulated f, but underestimating its absolute value (fig. 2 and supplementary fig. S1, Supplementary Material online).
Generally, the variance in f̂d was lower than in the other two f estimators (supplementary fig. S1A–I, Supplementary Material online). Unlike the D statistic, f̂d displayed minimal variance at low simulated values of f (fig. 2). However, all four statistics showed greater variance and more extreme values when recombination rates were lower (supplementary fig. S2, Supplementary Material online), as expected given that decreased recombination reduces the number of independent sites analyzed.
Although none of the examined measures were able to accurately quantify both forms of introgression in all cases, f̂d has some appealing characteristics as a measure to identify introgressed loci in a genome scan approach. It has low variance and is not prone to false positives when gene flow is absent and recombination rare. In all of our simulations, it provided estimates that were proportional to the simulated level of introgression. Although it tended toward underestimates, genome scans for introgressed loci would primarily be interested in relative rates of introgression across the genome, rather than absolute rates.
f Estimators Are Robust to Variation in Nucleotide Diversity Across the Genome
Analysis of published whole-genome data from Heliconius species confirmed that Patterson’s D statistic was prone to extreme values in regions of low diversity, whereas f estimators were not (fig. 3A–C). We reanalyzed published whole-genome sequence data from two closely related Heliconius butterfly species, H. melpomene and H. timareta, and four outgroup species from the related silvaniform clade. The races H. melpomene amaryllis and H. timareta thelxinoe are sympatric in Peru, and show genome-wide evidence of gene flow (Martin et al. 2013), with particularly strong signals at two wing-patterning loci: HmB, which controls red pattern elements, and HmYb, which controls yellow and white pattern elements (Heliconius Genome Consortium 2012; Pardo-Diaz et al. 2012). To determine whether heterogeneity in diversity across the genome may influence D and the f estimators, we calculated D, f̂G, f̂hom, f̂d, and nucleotide diversity (π) in nonoverlapping 5-kb windows across the genome. Variance in the D statistic was highest among windows with low nucleotide diversity and decreased rapidly with increasing diversity (fig. 3A and supplementary fig. S4, Supplementary Material online). Windows from the wing-patterning loci were among those with the highest D values, but there were many additional windows with D values approaching or equal to 1. By contrast, f̂d, calculated for all windows with positive D, was far less sensitive to the level of diversity, with most outlying windows showing intermediate levels of diversity (fig. 3B). Notable exceptions were windows located within the wing-patterning regions, which tended to have high f̂d values and below average diversity. This is consistent with the strong selection known to act upon the patterning loci. The lack of extreme f̂d values in windows with low diversity suggests that most of the D outliers are spurious, and that f̂d provides a better measure of whether a locus has shared ancestry between species.
Finally, we also tested the other two f estimators described here: f̂G and f̂hom (eqs. 4 and 5, Materials and Methods). Both performed similarly to f̂d except that both had higher variance (supplementary figs. S3 and S4, Supplementary Material online), consistent with the simulations reported above (supplementary fig. S1, Supplementary Material online), and both gave a considerable number of values greater than 1, confirming that f̂d was the most conservative and stable statistic.
Taken together, these findings demonstrate that, when small genomic windows are analyzed, a high D value alone is not sufficient evidence for introgression. Many of the D outlier loci probably represent statistical noise, concentrated in regions of low diversity, whereas outliers for the f estimates, and particularly f̂d, tend to be less biased.
This effect could also be observed on the scale of whole chromosomes. The variance in D among 5-kb windows for each of the Heliconius chromosomes (n = 21) was strongly negatively correlated with the average diversity per chromosome (r[19] = −0.936, P < 0.001; fig. 3C). This relationship was most clearly illustrated by the Z chromosome: It had the lowest diversity by some margin, as expected given its reduced effective population size, and the highest variance among D values for 5-kb windows, despite the fact that previous chromosome-wide analyses suggest very limited gene flow affecting this chromosome (Martin et al. 2013). By contrast, the variance in f̂d, estimated for all windows with positive D, had a weak positive correlation with the mean diversity per chromosome (r[19] = 0.440, P < 0.05). This was driven by the fact that the Z chromosome had the lowest diversity and also the lowest variance in f̂d, as expected given the reduced gene flow affecting this chromosome. When the Z chromosome was excluded, there was no significant relationship between the variance in f̂d values and average diversity (r[18] = 0.092, P > 0.05). We also considered the effect of window size on the variance of D and the estimators of f. As window size increases, the higher variance of D in regions of lower diversity persists, but becomes less extreme (supplementary fig. S4, Supplementary Material online). In summary, these data show that extreme D values, both positive and negative, occur disproportionately in genomic regions with lower diversity, whereas f̂d values are less biased by underlying heterogeneity in genetic variation.
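The chromosome-level comparison can be sketched as below: the variance of a windowed statistic is computed per chromosome and then correlated with mean diversity across chromosomes. This is a minimal illustration using the sample variance and Pearson’s r, with n − 2 degrees of freedom (19 for 21 chromosomes); the function names and the dictionary-based input layout are assumptions rather than a description of the original pipeline.

```python
from math import sqrt


def sample_variance(values):
    """Unbiased sample variance of a list with at least two values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def variance_vs_diversity(stat_by_chrom, pi_by_chrom):
    """Correlate per-chromosome variance of a windowed statistic with
    per-chromosome mean diversity.

    stat_by_chrom: {chromosome: [windowed statistic values]}
    pi_by_chrom:   {chromosome: mean nucleotide diversity}
    Returns (r, degrees_of_freedom).
    """
    chroms = sorted(stat_by_chrom)
    variances = [sample_variance(stat_by_chrom[c]) for c in chroms]
    diversity = [pi_by_chrom[c] for c in chroms]
    return pearson_r(variances, diversity), len(chroms) - 2
```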
Inherent Biases in the D and f̂ Statistics Confound a Test to Distinguish between Introgression and Shared Ancestral Variation
The biases associated with the D statistic described above may have important consequences for methods that use D to identify candidate introgressed regions. For example, Smith and Kronforst (2013) proposed a method to discriminate between gene flow and shared ancestral variation that relies upon D values calculated for small genomic regions (see Introduction). Briefly, the Smith and Kronforst test calculated D for all nonoverlapping 5-kb windows. Absolute divergence (dXY) was then compared between the set of windows that were outliers for the D statistic (defined as the windows with the top 10% of D values) and the remaining 90% of nonoutlier windows. The method predicts that introgression between species at a specific genomic region should reduce the between-species divergence in this region as compared with the rest of the genome, whereas shared ancestry due to ancestral population structure would not lead to lower divergence. We first confirmed this prediction using simulations, and then assessed whether biases in the D statistic might affect the power of the method.
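The outlier-based comparison described above can be sketched in Python as follows. This is a simplified illustration of the partitioning step only: windows are ranked by D, the top 10% are labeled outliers, and dXY values are returned for the two groups. A rank-based test such as the Wilcoxon rank-sum test used in the text would then be applied to the two lists. The function names are hypothetical.

```python
def split_by_outlier_status(d_values, dxy_values, top_fraction=0.10):
    """Partition per-window dXY values into outlier and nonoutlier groups.

    Outliers are the windows with the highest D values, making up
    `top_fraction` of all windows (at least one window).
    """
    if len(d_values) != len(dxy_values):
        raise ValueError("d_values and dxy_values must have equal length")
    n_windows = len(d_values)
    n_outliers = max(1, round(n_windows * top_fraction))
    # Indices of windows sorted from highest to lowest D.
    ranked = sorted(range(n_windows), key=lambda i: d_values[i], reverse=True)
    outlier_idx = set(ranked[:n_outliers])
    outliers = [dxy_values[i] for i in range(n_windows) if i in outlier_idx]
    nonoutliers = [dxy_values[i] for i in range(n_windows) if i not in outlier_idx]
    return outliers, nonoutliers


def mean(values):
    """Arithmetic mean of a nonempty list."""
    return sum(values) / len(values)
```

A lower mean dXY among outliers than among nonoutliers is the pattern that the test interprets as evidence of introgression.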
To test the prediction that introgression and ancestral population structure leave distinct footprints in terms of absolute divergence, 10,000 sequence windows for three populations (P1,P2,P3) and an outgroup (O) were simulated. In total, 9,000 windows were defined as “Background,” having the topology (((P1,P2),P3),O), without any gene flow or population structure. The remaining 1,000 windows were defined as “Alternate” and were subject to either gene flow or structure (see Materials and Methods for details). Ten percent of windows were defined as Alternate to match Smith and Kronforst’s design, wherein the top 10% of D values are taken as outliers. Three different Alternate scenarios were considered: Gene flow from P2 to P3, gene flow from P3 to P2 and ancestral structure leading to shared ancestry between P2 and P3. For all scenarios, Alternate windows were defined with the topology ((P1,(P2,P3)),O). In the gene flow scenarios, the split time between P2 and P3 in the Alternate topology was set to be more recent than the split between P1 and P2 in the Background topology (fig. 4A and B). In the ancestral structure scenario, the split time between P1 and P2 in the Alternate topology was set to be more ancient than the split between P2 and P3 in the Background topology (fig. 4C). This was designed to model a region of the genome undergoing balancing selection or some other process that maintains polymorphism at particular loci before the speciation event. Gene flow or structure in the Alternate windows can be considered to be complete (f = 1). For example, under gene flow from P2 to P3, all P3 alleles trace their ancestry through P2 at the time of gene flow.
This simplified design, where gene flow or structure is absent in 90% of the sequences and complete in 10%, allowed for the most straightforward and predictable test of Smith and Kronforst’s method; if the logic of the method does not follow in this extreme scenario, it is unlikely to do so in more complex situations.
For each of the three evolutionary scenarios, 120 different permutations of split times and times of gene flow or structure were simulated. The split times and times of gene flow for all models are given in the first three columns of supplementary tables S1–S3, Supplementary Material online. To simplify our comparisons between models, we focused specifically on dXY between P2 and P3, the most relevant parameter when testing for introgression between P2 and P3. We tested whether P2–P3 dXY was significantly lower in the Alternate windows (those that had experienced gene flow or structure) compared with the Background windows, using a Wilcoxon rank-sum test, with Bonferroni correction over all 120 models of the same type, and a significance threshold of 99%. We first performed simulations with a recombination rate parameter (4Nr) of 0.01, and later repeated all simulations at 4Nr = 0.001. The results of all tests are given in full in supplementary tables S1–S3, Supplementary Material online. These results are summarized in table 1 and figure 5, and a single illustrative example for each model type is given in figure 4E–L.
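The multiple-testing correction can be made concrete with a short sketch. It assumes that the 99% significance threshold corresponds to α = 0.01 and that the Bonferroni correction divides α by the number of models tested, giving a per-model threshold of about 8.3 × 10⁻⁵ for 120 models. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P value threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def count_significant(p_values, alpha=0.01):
    """Number of tests whose P value passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sum(1 for p in p_values if p < threshold)
```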
Table 1. Number of Models with Significantly Reduced Mean P2–P3 dXY.
Model Type (No. of models) | Alternate versus Background | D Outliers versus Nonoutliers | f̂d Outliers versus Nonoutliers | f̂G Outliers versus Nonoutliers | f̂hom Outliers versus Nonoutliers |
---|---|---|---|---|---|
4Nr = 0.01 | | | | | |
Gene flow P3 → P2 (120) | 120 | 119 | 120 | 120 | 120 |
Gene flow P2 → P3 (120) | 120 | 120 | 120 | 120 | 120 |
Ancestral structure (120) | 0 | 105 | 72 | 67 | 66 |
Null (45) | 0 | 39 | 44 | 45 | 44 |
4Nr = 0.001 | | | | | |
Gene flow P3 → P2 (120) | 120 | 120 | 120 | 120 | 120 |
Gene flow P2 → P3 (120) | 120 | 120 | 120 | 120 | 120 |
Ancestral structure (120) | 0 | 120 | 119 | 118 | 118 |
Null (45) | 0 | 32 | 45 | 45 | 45 |
Note.—Significantly lower dXY was evaluated using a Wilcoxon rank-sum test, with a 99% significance threshold after Bonferroni correction over the 120 models of each type (except for null models, of which there were 45).
As predicted, in all models simulating gene flow, average dXY between P2 and P3 was significantly lower in Alternate windows compared with Background windows. In contrast, in all models simulating ancestral population structure, there was no significant difference in P2–P3 dXY between the Background and Alternate windows, again in agreement with predictions. These findings therefore demonstrate that the intuitive premise of Smith and Kronforst’s (2013) method is justified.
We then tested whether introgression could be distinguished from shared ancestral variation where loci with shared ancestry are not known (as would be the situation with empirical data), but are instead inferred by selecting the top 10% of D values (outliers), following the Smith and Kronforst method. We also tested this method using the top 10% of f estimates among windows with positive D (using f̂G, f̂hom, and f̂d). Using the D statistic to identify outliers, mean dXY between P2 and P3 was significantly reduced in outlier windows as compared with nonoutlier windows in all 120 models simulating gene flow from P2 to P3, and all but one of the models simulating gene flow from P3 to P2 (table 1). The single nonsignificant case had the most ancient possible t23 and the most recent possible t12, and only 11.9% of D outlier windows were genuine Alternate windows, the lowest such proportion of any model. Using any of the three f estimators, mean dXY between P2 and P3 was significantly reduced in outlier windows in all gene flow models.
However, mean P2–P3 dXY was also significantly reduced in D and f̂ outlier windows in more than half of the 120 models simulating ancestral population structure (figs. 4C, 4G, and 5; table 1 and supplementary table S3, Supplementary Material online). This demonstrates that a simple test for reduced divergence in P2–P3 dXY among D or f̂ outlier windows would, under a range of ancestral structure scenarios, produce results consistent with introgression. The fact that this bias was similar whether D or f estimators were used to identify outliers indicates that there is an inherent tendency in all of these statistics toward regions with below-average divergence between P2 and P3. To confirm this finding, we analyzed a set of simulations using a null model, with no gene flow or structure in any of the 10,000 windows, over 45 permutations of split times (supplementary table S4, Supplementary Material online). Outlier windows showed significantly reduced dXY between P2 and P3 in most or all of the null models. Finally, we repeated all of these simulations with a lower within-window recombination rate parameter (4Nr) of 0.001. This tended to increase the reduction in P2–P3 dXY for outliers in ancestral structure and null models (figs. 4I–L and 5), with at most two of the ancestral structure models showing nonsignificant drops in P2–P3 dXY for outliers, and most or all null models showing significantly lower P2–P3 dXY for outliers, regardless of the statistic used (table 1).
In summary, although shared ancestral variation and introgression can theoretically be distinguished based on the fact that only the latter should reduce dXY between P2 and P3, an inherent bias in both the D and f̂ statistics makes a simple test for a statistical difference in dXY between outliers and nonoutliers problematic. Both D and f̂ outliers tended toward windows with lower P2–P3 dXY, regardless of the underlying evolutionary history, and particularly when recombination rates were low. In the absence of any gene flow, the outliers must therefore be identifying windows that coalesce more recently in the ancestral population. However, even when the reduction in P2–P3 dXY was significant for ancestral structure or null models, it was typically smaller than the reductions in dXY seen in the gene flow models (fig. 5). In the presence of gene flow, some windows coalesce more recently than the species split, so the magnitude of the reduction in P2–P3 dXY is greater. This difference could potentially be used to distinguish introgression from shared ancestral variation, but this cannot be done with a simple significance test, and will require a more sophisticated model-fitting approach.
Discussion
With the advent of population genomics, studies of species divergence have moved from simply documenting interspecific gene flow, toward the identification of specific genomic regions that show strong signals of either introgression or divergence (Garrigan et al. 2012; Heliconius Genome Consortium 2012; Staubach et al. 2012; Roux et al. 2013; Bosse et al. 2014; Huerta-Sánchez et al. 2014; Sankararaman et al. 2014). This is a useful goal for many reasons. It can permit the identification of large-scale trends, such as chromosomal differences, and the fine-scale localization of putative targets of adaptive introgression for further characterization. Therefore, simple and easily computable statistics that can be used to identify loci with a history of introgression have considerable appeal.
Previous studies have explored the behavior of Patterson’s D statistic, a test for gene flow based on detecting an inequality in the numbers of ABBA and BABA patterns, using whole-genome analyses across large numbers of informative sites (Green et al. 2010; Yang et al. 2012; Eaton and Ree 2013; Martin et al. 2013; Wall et al. 2013). These studies have shown that D is a robust method when applied as intended: To test for an excess of shared variation on a genome-wide scale. Indeed, a major strength of the ABBA–BABA test is that it combines data from across the genome, accounting for chance fluctuations among loci, and therefore is able to detect the net effect of gene flow. Moreover, the nonindependence among linked sites can be accounted for by block-jackknifing (Green et al. 2010).
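The block-jackknife procedure mentioned above can be sketched as follows. This is a simplified, unweighted delete-one-block version, assuming that ABBA and BABA totals have already been summed within each genomic block and that every leave-one-out subset retains at least one informative site; it is not a reproduction of the weighted procedure of Green et al. (2010), and the function name is hypothetical.

```python
from math import sqrt


def block_jackknife_d(block_abba, block_baba):
    """Genome-wide D and its delete-one-block jackknife standard error.

    block_abba, block_baba: per-block ABBA and BABA totals
    (at least two blocks).
    """
    n_blocks = len(block_abba)
    if n_blocks < 2 or n_blocks != len(block_baba):
        raise ValueError("need two or more blocks of equal-length input")
    total_abba = sum(block_abba)
    total_baba = sum(block_baba)
    d_all = (total_abba - total_baba) / (total_abba + total_baba)

    # Recompute D with each block removed in turn.
    leave_one_out = []
    for abba, baba in zip(block_abba, block_baba):
        rest_abba = total_abba - abba
        rest_baba = total_baba - baba
        leave_one_out.append((rest_abba - rest_baba) / (rest_abba + rest_baba))

    mean_loo = sum(leave_one_out) / n_blocks
    squared_dev = sum((d - mean_loo) ** 2 for d in leave_one_out)
    standard_error = sqrt((n_blocks - 1) / n_blocks * squared_dev)
    return d_all, standard_error
```

Dividing D by this standard error gives a Z score for the genome-wide test; treating linked sites within a block as a single unit is what accounts for their nonindependence.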
However, it is not clear whether D can be extended beyond its original use to identify specific loci with introgressed variation. We have documented two main problems with this approach. First, D is not an unbiased estimator of the amount of introgression that has occurred. In particular, it is influenced by effective population size (Ne), leading to more extreme values when Ne is low. Second, when calculated over small windows, it is highly stochastic, particularly in genomic regions of low diversity and low recombination rate, such that D outliers will tend to be clustered within these regions. Local reductions in genetic diversity along a chromosome can come about through neutral processes, such as population bottlenecks, but also through directional selection. Therefore, these problems may be exacerbated in studies specifically interested in loci that experience strong selective pressures, as this would increase the likelihood of detecting chance outliers at such loci.
Direct estimation of f, the proportion of introgression, holds more promise as a robust method for detecting introgressed loci. Green et al. (2010) proposed that f could be estimated by comparing the observed difference in the number of ABBA and BABA patterns to that which would be expected in the event of complete introgression. As this expected value is calculated from the observed data, this method controls for differences in the level of standing variation, making it more suitable for application to small regions. In Green et al.’s approach, complete introgression from P3 to P2 was taken to mean that P2 would come to resemble a subpopulation of lineage P3. Here, we make the conservative assumption that complete introgression would lead to homogenization of allele frequencies, such that the frequency of the derived allele in P2 would be identical to that in P3. Green et al.’s approach assumed unidirectional introgression from P3 to P2, but can lead to spurious values when introgression occurs in the opposite direction. We have therefore proposed a new dynamic estimator of f, f̂d, in which the donor population can differ between sites, and is always the population with the higher frequency of the derived allele. Although this conservative estimator leads to slight underestimation of the amount of introgression that has occurred, it provides an estimate that is roughly proportional to the level of introgression, regardless of the direction. It is therefore a more suitable measure for identifying introgressed loci. This is supported by our analysis of whole-genome data from Heliconius butterflies, where many 5-kb windows had maximal D values (D = 1), but only a few had high f̂d values, the vast majority of which were located around the wing-patterning loci previously identified as being shared between these species through adaptive introgression (Heliconius Genome Consortium 2012; Pardo-Diaz et al. 2012).
The sensitivity of D to heterogeneous genomic diversity is likely to affect studies that have drawn conclusions from D statistics calculated for particular genome regions. For example, Wall et al. (2013) identified regions carrying putative long (8–100 kb) haplotypes segregating in European humans, and then found that these regions showed evidence of a Neanderthal origin, as indicated by elevated D statistics. However, it may be that such haplotypes would be overrepresented in low-recombination regions, which also tend to have reduced diversity in humans and many other species (Cutter and Payseur 2013). In another recent Heliconius study, FST was calculated for 5-kb windows across the genome. Windows showing increased differentiation between H. melpomene and H. pachinus (according to FST) also showed significantly elevated D statistics in a test for introgression between the same species pair (Kronforst et al. 2013). This illustrates how the sensitivity of both D and FST to within-species diversity can produce conflicting results. This sensitivity is likely to be particularly problematic in studies using very small genomic regions. At the extreme, Rheindt et al. (2014) calculated D for single SNPs and predicted that genes linked to SNPs with outlying D values are more likely to have been introgressed.
In the present study, to investigate whether biases in D when calculated over small regions could influence subsequent analyses, we investigated a recently proposed method to distinguish between introgression and shared ancestral variation (Smith and Kronforst 2013). The premise of this test is that introgression should result in an excess of shared derived alleles and a reduction in absolute divergence (dXY), whereas shared ancestral variation will exhibit the former but not the latter signature. Our simulations confirmed that the intuitive predictions of this method are valid, but also showed that this test can be misled by the use of D to identify outliers. Windows that were outliers for D exhibited below average dXY in simulations with gene flow, but also in most simulations with ancestral structure, or where both gene flow and ancestral population structure were absent. All three f estimators also failed to distinguish between introgression and ancestral structure in many models. This implies that all of these statistics are systematically biased toward regions that coalesce more recently, regardless of whether gene flow has occurred.
We predict that D would have additional problems in real genomes, where selective constraint leads to a correlation between within-species diversity and between-species divergence, causing D outliers to be even more strongly associated with reduced dXY. However, it is notable that the reduction in divergence among D and f̂ outliers was almost always greater in simulations with introgression than in simulations with ancestral structure or with no Alternate topology, across a large range of split times and dates of gene flow. There may, therefore, be considerable information about the evolutionary history of DNA sequences present in the joint distribution of dXY and f̂. On the other hand, in real data, levels of divergence can vary dramatically due to heterogeneity in selective constraint, mutation rate, and recombination rate, which would exaggerate the problems described here. Even an unbiased statistic, when applied to small genomic windows, would be confounded by heterogeneity in recombination rate across the genome. In regions of reduced recombination, fewer independent data points are sampled by each window, so extreme estimates become more likely. Heterogeneity in recombination rate is therefore an essential consideration in any study that aims to scan the genome for regions of interest.
Conclusions
In an era of increasing availability of genomic data, there is a demand for simple summary statistics that can reliably identify genomic regions that have been subject to selection, introgression, and other evolutionary processes. It seems unlikely, however, that any single summary statistic will be able to reliably distinguish these processes from noise introduced by demography, drift, and heterogeneity in recombination rate. Here, we have shown that, while Patterson’s D statistic provides a robust signal of shared ancestry across the genome, it should not be used for naïve scans to ascribe shared ancestry to small genomic regions, due to its tendency toward extreme values in regions of reduced variation. Estimation of f, the proportion of introgression, particularly using our proposed statistic f̂d, provides a better means of identifying putatively introgressed regions. Nevertheless, both D and f̂d tend to identify regions of reduced interspecies divergence, even in the absence of gene flow, which may confound tests to distinguish between recent introgression and shared ancestral variation based on absolute divergence (dXY) in outlier regions. However, the joint distribution of dXY and f̂ statistics may be a useful summary statistic for model-fitting approaches to distinguish between these evolutionary hypotheses.
Materials and Methods
Statistics Used to Detect Shared Ancestry
In this study, we focused on an approach to identify an excess of shared derived polymorphisms, indicated by the relative abundance of two SNP patterns termed ABBAs and BABAs (Green et al. 2010). Given three populations and an outgroup with the relationship (((P1, P2), P3), O) (fig. 1A), ABBAs are sites at which the derived allele B is shared between the nonsister taxa P2 and P3, whereas P1 carries the ancestral allele, as defined by the outgroup. Similarly, BABAs are sites at which the derived allele is shared between P1 and P3, whereas P2 carries the ancestral allele. Under a neutral coalescent model without gene flow, both patterns can only result from incomplete lineage sorting or recurrent mutation, and should be equally abundant in the genome (Durand et al. 2011). A significant excess of ABBAs over BABAs is indicative either of gene flow between P2 and P3, or some form of nonrandom mating or structure in the population ancestral to P1, P2, and P3. This excess can be tested for, using Patterson’s D statistic,
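$$D(P_1, P_2, P_3, O) = \frac{\sum_i C_{ABBA}(i) - \sum_i C_{BABA}(i)}{\sum_i C_{ABBA}(i) + \sum_i C_{BABA}(i)} \tag{1}$$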
where CABBA(i) and CBABA(i) are counts of either 1 or 0, depending on whether or not the specified pattern (ABBA or BABA) is observed at site i in the genome. Under the null hypothesis of no gene flow and random mating in the ancestral population, D will approach zero, regardless of differences in effective population sizes (Durand et al. 2011). Hence, a D significantly greater than zero is indicative of a significant excess of shared derived alleles between P2 and P3.
If population samples are used, then rather than binary counts of fixed ABBA and BABA sites, the frequency of the derived allele at each site in each population can be used (Green et al. 2010; Durand et al. 2011), effectively weighting each segregating site according to its fit to the ABBA or BABA pattern, with
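$$C_{ABBA}(i) = (1 - p_{i1})\, p_{i2}\, p_{i3}\, (1 - p_{i4}) \tag{2}$$
$$C_{BABA}(i) = p_{i1}\, (1 - p_{i2})\, p_{i3}\, (1 - p_{i4}) \tag{3}$$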
where pij is the frequency of the derived allele at site i in population j. These values are then used in equation 1 to calculate D (Durand et al. 2011).
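The frequency-based calculation of D can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this study: the function names and the input format (one tuple of derived allele frequencies in P1, P2, P3, and O per site) are assumptions made for the example.

```python
def abba_baba(p1, p2, p3, po):
    """Per-site ABBA and BABA weights (eqs. 2 and 3) from derived allele frequencies."""
    c_abba = (1 - p1) * p2 * p3 * (1 - po)
    c_baba = p1 * (1 - p2) * p3 * (1 - po)
    return c_abba, c_baba


def patterson_d(sites):
    """Patterson's D (eq. 1) for an iterable of (p1, p2, p3, pO) tuples."""
    sum_abba = sum_baba = 0.0
    for p1, p2, p3, po in sites:
        c_abba, c_baba = abba_baba(p1, p2, p3, po)
        sum_abba += c_abba
        sum_baba += c_baba
    total = sum_abba + sum_baba
    # D is undefined when no site carries ABBA or BABA information.
    return (sum_abba - sum_baba) / total if total > 0 else float("nan")
```

When every frequency is 0 or 1, the weights reduce to the binary ABBA and BABA counts described for single sequences. A window with no informative sites returns NaN, which becomes more common as windows get smaller.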
Green et al. (2010) also proposed a related method to estimate f, the fraction of the genome shared through introgression (see also Durand et al. 2011). This method makes use of the numerator of equation 1, the difference between sums of ABBAs and BABAs, which is called S. In the example described above, with (((P1,P2),P3),O), the proportion of the genome that has been shared between P2 and P3 subsequent to the split between P1 and P2 can be estimated by comparing the observed value of S to a value estimated under a scenario of complete introgression from P3 to P2. P2 would then resemble a lineage of the P3 taxon, and so the value of S expected under complete introgression (the denominator of equation 4) can be estimated by replacing P2 in equations 2 and 3 with a second lineage sampled from P3, or by splitting the P3 sample into two,
$$\hat{f}_G = \frac{S(P_1, P_2, P_3, O)}{S(P_1, P_{3a}, P_{3b}, O)} \tag{4}$$
where P3a and P3b are the two lineages sampled from P3. Splitting P3 arbitrarily in this way may lead to stochastic errors at individual sites, particularly with small sample sizes. These should be negligible when whole-genome data are analyzed but could easily lead to erroneous values of f̂ (including f̂ > 1) when small genomic windows are analyzed, as in the present study. We therefore used a more conservative version, f̂hom, in which we assume that complete introgression from P3 to P2 would lead to complete homogenization of allele frequencies. Hence, in the denominator, P3a and P3b are both substituted by P3:
$$\hat{f}_{hom} = \frac{S(P_1, P_2, P_3, O)}{S(P_1, P_3, P_3, O)} \tag{5}$$
Although this conservative assumption may lead to underestimation of the proportion of sites shared, it also reduces the rate of stochastic error. Moreover, in the present study, we are less concerned with the absolute value of f̂, and more with the relative values of f̂ between genomic regions.
The f̂hom statistic (eq. 5) assumes unidirectional gene flow from P3 to P2 (i.e., P3 is the donor and P2 is the recipient). Because the branch leading to P3 is longer than that leading to P2 (fig. 1A), gene flow in the opposite direction (P2 to P3) is likely to generate fewer ABBAs. Thus, in the presence of gene flow from P2 to P3, or in both directions, equation 5 should lead to an underestimate. However, when small genomic windows are analyzed, the assumption of unidirectional gene flow could lead to overestimates, because any region in which derived alleles are present in both P2 and P3, but happen to be at higher frequency in P2, will yield f estimates that are greater than 1. Thus, we propose a dynamic estimator, f̂d, in which the denominator is calculated by defining a donor population (PD) for each site independently. For each site, PD is the population (either P2 or P3) that has the higher frequency of the derived allele, thus maximizing the denominator and eliminating f estimates greater than 1:
$$\hat{f}_d = \frac{S(P_1, P_2, P_3, O)}{S(P_1, P_D, P_D, O)} \tag{6}$$
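The difference between the homogenization-based estimator (eq. 5) and the dynamic estimator (eq. 6) can be illustrated with a short Python sketch. The function names and the input format (one tuple of derived allele frequencies in P1, P2, P3, and O per site) are assumptions made for this example. The estimator of equation 4 is omitted because it needs frequencies from two separate subsamples of P3.

```python
def s_stat(sites):
    """S = sum(ABBA) - sum(BABA) over (p1, p2, p3, pO) frequency tuples."""
    total = 0.0
    for p1, p2, p3, po in sites:
        abba = (1 - p1) * p2 * p3 * (1 - po)
        baba = p1 * (1 - p2) * p3 * (1 - po)
        total += abba - baba
    return total


def f_hom(sites):
    """Eq. 5: the denominator substitutes P3 for P2 at every site."""
    denom = s_stat([(p1, p3, p3, po) for p1, _p2, p3, po in sites])
    return s_stat(sites) / denom if denom != 0 else float("nan")


def f_d(sites):
    """Eq. 6: the donor PD is chosen per site as the higher of the P2 and P3 frequencies."""
    donor_sites = [(p1, max(p2, p3), max(p2, p3), po) for p1, p2, p3, po in sites]
    denom = s_stat(donor_sites)
    return s_stat(sites) / denom if denom != 0 else float("nan")
```

For a single site with derived allele frequencies of 0, 0.8, 0.4, and 0 in P1, P2, P3, and O, `f_hom` gives approximately 2, whereas `f_d` gives approximately 0.5. This is the case described above, in which the derived allele happens to be at higher frequency in P2 than in P3.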
Assessing the Ability of D and f Estimators to Quantify Introgression in Small Sequence Windows
To assess how reliably Patterson’s D statistic and the f estimators quantify the actual rate of introgression, we simulated sequence data sets with differing rates of introgression using ms (Hudson 2002). For each data set, we simulated 100 sequence windows for eight haplotypes each from four populations with the relationship (((P1,P2),P3),O). The split times t12 and t23 (as on fig. 1A) were set to 1 × 4N generations and 2 × 4N generations ago, respectively, and the root was set to 3 × 4N generations ago. An instantaneous, unidirectional admixture event, either from P3 to P2 or from P2 to P3, was simulated at a time tGF with a value f, which determines the probability that each haplotype is shared. We tested two different values for tGF: 0.1 and 0.5 × 4N generations ago. For each direction of gene flow and each tGF, 11 simulated data sets were produced, with f values ranging from 0 (no gene flow) to 1 (all haplotypes are shared). Finally, the entire set of simulations was repeated with three different window sizes: 1, 5, and 10 kb, and with three different recombination rates: 0.001, 0.01, and 0.1, in units of 4Nr, the population recombination rate. DNA sequences were generated from the simulated trees using Seq-Gen (Rambaut and Grassly 1997), with the Hasegawa-Kishino-Yano substitution model and a branch scaling factor of 0.01. Simulations were run using the provided script compare_f_estimators.r, which generates the ms and Seq-Gen commands automatically. An example set of commands to simulate a single 5-kb sequence using the split times mentioned above, with gene flow from P3 to P2 at tGF = 0.1 and f = 0.2, and with a recombination rate parameter of 0.01 would be:
ms 32 1 -I 4 8 8 8 8 -ej 1 2 1 -ej 2 3 1 -ej 3 4 1 -es 0.1 2 0.8 -ej 0.1 5 3 -r 50 5000 -T | tail -n +4 | grep -v '//' > treefile
partitions=($(wc -l treefile))
seq-gen -mHKY -l 5000 -s 0.01 -p $partitions < treefile > seqfile
We then compared the mean and standard error for D (eq. 1) and the three f estimators (eqs. 4, 5, and 6), calculated for all 100 windows in each data set.
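To make the mapping from simulation parameters to the ms call explicit, the following Python sketch assembles the command string shown above. It is not the provided compare_f_estimators.r script: the function name and argument names are hypothetical. Only the P3 to P2 direction is taken from the example command; passing `recipient=3, donor=2` for the opposite direction is an assumption about how that case would be encoded.

```python
def ms_command(f, t_gf, rho_per_site=0.01, window=5000, n_hap=8,
               t12=1.0, t23=2.0, t_root=3.0, recipient=2, donor=3):
    """Build an ms call for four populations (((P1,P2),P3),O) with one admixture pulse."""
    n_pops = 4
    split_pop = n_pops + 1  # -es creates a new population, numbered 5 here
    parts = ["ms", str(n_hap * n_pops), "1", "-I", str(n_pops)]
    parts += [str(n_hap)] * n_pops
    # Population splits, backward in time, in units of 4N generations.
    parts += ["-ej", f"{t12:g}", "2", "1",
              "-ej", f"{t23:g}", "3", "1",
              "-ej", f"{t_root:g}", "4", "1"]
    # At t_gf each recipient lineage stays with probability 1 - f;
    # the remainder moves to the new population, which joins the donor.
    parts += ["-es", f"{t_gf:g}", str(recipient), f"{1 - f:g}",
              "-ej", f"{t_gf:g}", str(split_pop), str(donor)]
    # ms takes the recombination parameter for the whole window.
    parts += ["-r", f"{rho_per_site * window:g}", str(window), "-T"]
    return " ".join(parts)
```

With `f=0.2` and `t_gf=0.1`, the per-site rate of 0.01 over 5,000 sites gives the `-r 50 5000` argument, and the retained fraction 1 − f gives the 0.8 passed to `-es`.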
Analysis of Heliconius Whole-Genome Sequence Data
To investigate how the D and f̂ statistics are affected by underlying diversity in a given window, we reanalyzed whole-genome data from Martin et al. (2013). For ABBA–BABA analyses, populations were defined as follows: P1 = H. m. aglaope (four diploid samples), P2 = H. m. amaryllis (4), P3 = H. timareta thelxinoe (4), O = H. hecale (1), H. ethilla (1), H. pardalinus sergestus (1), and H. pardalinus ssp. nov. (1). Patterson’s D (eq. 1) and the three f estimators (eqs. 4–6) were calculated, along with nucleotide diversity (π) and absolute divergence (dXY), for nonoverlapping 5-kb windows across the genome. Both π and dXY were calculated as the mean number of differences between each pair of individuals, sampled either from the same population (π), or from separate populations (dXY). Sites with missing data were excluded in a pairwise manner, and each pair of individuals contributed equally to the mean. Windows were restricted to single scaffolds, and windows for which fewer than 3,000 sites had genotype calls for at least half of the individuals were discarded. To calculate D and the f estimators, only biallelic sites were considered. The ancestral state was inferred using the outgroup taxa, except when the four outgroup taxa were not fixed for the same allele, in which case the most common allele overall was taken as ancestral. The HmB locus was defined as positions 300000–450000 on scaffold HE670865 and the HmYb locus as positions 650000–900000 on scaffold HE667780 of version 1.1 of the H. m. melpomene genome sequence. We also analyzed windows from each of the 21 chromosomes of the H. m. melpomene genome sequence separately. Scaffolds were assigned to chromosomes according to the Heliconius Genome Consortium (2012), and incorporating the improved assignment of Z-linked scaffolds by Martin et al. (2013) (details available in Dryad repositories http://dx.doi.org/10.5061/dryad.m27qq and http://dx.doi.org/10.5061/dryad.dk712).
This analysis was performed using egglib_sliding_windows.py, and figures were generated using figures_3_S3.R and figure_S4.R.
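The pairwise calculation of π and dXY described above can be sketched as follows. This is not the code in egglib_sliding_windows.py. The function names, the use of haploid sequence strings rather than diploid genotypes, the "N" code for missing data, and the scaling of differences per compared site are all assumptions made for the example.

```python
from itertools import combinations, product

MISSING = "N"  # assumed code for a missing base call


def pair_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping sites missing in either sequence."""
    compared = diffs = 0
    for a, b in zip(seq_a, seq_b):
        if a == MISSING or b == MISSING:
            continue  # pairwise exclusion of missing data
        compared += 1
        diffs += a != b
    return diffs / compared if compared else None


def mean_distance(pairs):
    """Unweighted mean over pairs, so each pair contributes equally."""
    dists = [d for d in (pair_distance(a, b) for a, b in pairs) if d is not None]
    return sum(dists) / len(dists) if dists else float("nan")


def nucleotide_diversity(pop):
    """pi: mean distance between all pairs sampled within one population."""
    return mean_distance(combinations(pop, 2))


def dxy(pop_x, pop_y):
    """dXY: mean distance between all pairs sampled from two populations."""
    return mean_distance(product(pop_x, pop_y))
```

Because missing sites are dropped per pair rather than per window, a site that is missing in one sequence still contributes to comparisons among the others.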
Assessing a Test to Distinguish Introgression from Shared Ancestral Variation Based on Absolute Divergence
Smith and Kronforst (2013) proposed a simple test to distinguish between the hypotheses of pre- and postspeciation shared ancestry based on absolute divergence. To assess this method on data of known history, we generated a large range of sequence data sets using ms (Hudson 2002) and Seq-Gen (Rambaut and Grassly 1997). For the simplest (“null”) model, 10,000 5-kb sequence windows were simulated for eight haplotypes each from three populations and an outgroup, with the relationship (((P1,P2),P3),O), without gene flow or population structure. To approximate a scenario in which a subset of the genome has a distinct phylogenetic history, either due to gene flow or genomically localized ancestral population structure, we used a combined model approach. This entailed combining 9,000 5-kb windows from the null model (90% Background windows), with 1,000 5-kb windows simulated with the topology ((P1,(P2,P3)),O), consistent with shared ancestry between P2 and P3 (10% Alternate windows). By altering the split times, three distinct scenarios were emulated: Gene flow from P2 to P3, gene flow from P3 to P2, and ancestral structure (fig. 4A–D). Using entirely distinct topologies in this way is equivalent to making the probability of gene flow (or structure) equal to one in the 1,000 Alternate windows. Although this approach of partitioning each data set into two somewhat arbitrarily sized subsets with evolutionary histories at two extremes is biologically unlikely, it provided a simple and powerful framework in which to evaluate Smith and Kronforst’s approach, with clear expectations. Model combination data sets were generated using run_model_combinations.py and shared_ancestry_simulator.R, which generate the ms and Seq-Gen commands automatically, in a similar form to those given above. For example, if t12 = 1, t23 = 2, 4Nr = 0.01, and gene flow is from P3 to P2 at tGF = 0.2, the ms calls for Background and Alternate models, respectively, would be:
ms 32 1 -I 4 8 8 8 8 -ej 1 2 1 -ej 2 3 1 -ej 3 4 1 -r 50 5000 -T
ms 32 1 -I 4 8 8 8 8 -ej 0.2 2 3 -ej 2 3 1 -ej 3 4 1 -r 50 5000 -T
We calculated Patterson’s D (eq. 1) and the three f estimators (eqs. 4–6) for all windows, and identified the top 1,000 “outliers” (10%) with the most extreme values. For D, only positive values were included as outliers, as negative values indicate an excess of BABAs, consistent with introgression between P1 and P3. Similarly, for f estimators, only windows with D ≥ 0 were considered, as these values only give meaningful quantification of introgression when there is an excess of ABBAs. To compare P2–P3 divergence between the Background and Alternate windows, or between outlier and nonoutlier windows, we calculated dXY for each window as described above, for each pair of populations. Average dXY was compared between subsets of windows using a Wilcoxon rank-sum test, as values tended to be nonnormally distributed (confirmed with Bonferroni-corrected Shapiro–Wilk tests).
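The outlier selection described above can be sketched in Python as follows. The function names and the representation of each window as a dictionary with keys such as "D", "fd", and "dxy" are assumptions made for this example. The rank-sum test itself is not reproduced here, and the sketch only returns the two subsets of windows and their mean dXY.

```python
def top_outliers(windows, stat, fraction=0.10):
    """Split windows into outliers and the rest, ranking by `stat`.

    Only windows with D >= 0 can be outliers, and the number of outliers
    is a fixed fraction of all windows (e.g. 1,000 of 10,000).
    """
    n_out = int(round(fraction * len(windows)))
    # Windows with an undefined statistic (NaN) are never ranked.
    eligible = [w for w in windows if w["D"] >= 0 and w[stat] == w[stat]]
    ranked = sorted(eligible, key=lambda w: w[stat], reverse=True)
    outliers = ranked[:n_out]
    outlier_ids = {id(w) for w in outliers}
    rest = [w for w in windows if id(w) not in outlier_ids]
    return outliers, rest


def mean_dxy(windows):
    """Mean P2-P3 dXY across a set of windows."""
    return sum(w["dxy"] for w in windows) / len(windows) if windows else float("nan")
```

Comparing `mean_dxy(outliers)` with `mean_dxy(rest)` gives the direction of the difference that the rank-sum test then evaluates.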
These tests were repeated over a large range of split times. In all cases the root was set to 3.0 × 4N generations ago, and the other splits ranged from 0.2 to 2.0. Times of gene flow and structure also varied on the same scale. In total, this gave 45 null models and 120 models each for the two gene flow scenarios and ancestral structure scenario (405 overall). The analyzed models therefore covered a vast range of biologically relevant scales. In all cases, the Seq-Gen branch scaling factor was set to 0.01. Full parameters for all models are provided in supplementary tables S1-S4, Supplementary Material online. Finally, to examine the effects of recombination rate, the entire simulation study was repeated using population recombination rate (4Nr) values of 0.01 and 0.001. Summary statistics for all models were compiled using generate_summary_statistics.R.
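The model counts given above can be reproduced with a simple enumeration, shown here as a Python sketch. The grid of ten time points from 0.2 to 2.0 in steps of 0.2 is an assumption: it is one parameterization that yields 45 null models and 120 models per scenario. The parameters actually used are those listed in supplementary tables S1–S4.

```python
from itertools import combinations

# Assumed grid of times in units of 4N generations: 0.2, 0.4, ..., 2.0
times = [round(0.2 * k, 1) for k in range(1, 11)]

# Null models need two ordered split times (t12 < t23).
null_models = list(combinations(times, 2))

# Each gene flow or structure model needs three ordered times:
# the two splits plus the time of gene flow or structure.
three_time_models = list(combinations(times, 3))

n_scenarios = 3  # P3 to P2 gene flow, P2 to P3 gene flow, ancestral structure
total = len(null_models) + n_scenarios * len(three_time_models)
print(len(null_models), len(three_time_models), total)
```

Under this assumed grid, the totals match the 45, 120, and 405 models reported in the text.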
Software
All scripts mentioned in the text, along with the generated datasets, are available at http://dx.doi.org/10.5061/dryad.j1rm6. This work was made possible by the free, open source software packages EggLib (De Mita and Siol 2012), phyclust (Chen 2011), R (R Core Team 2013), ggplot2 (Wickham 2009), plyr (Wickham 2011), reshape (Wickham 2007), and Inkscape (http://www.inkscape.org, last accessed August 20, 2014).
Supplementary Material
Supplementary figures S1–S4 and tables S1–S4 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
The authors are grateful to Milan Malinsky for helpful and detailed discussions about this work. The authors also thank Jim Mallet, Marcus Kronforst, and two anonymous reviewers for their comments on the manuscript. Richard Merrill and Richard Wallbank contributed to initial discussions that inspired the work. This work was supported by the Leverhulme Trust (F/09364/E to C.D.J.), the Herchel Smith Fund (Postdoctoral Fellowship to J.W.D.) and BBSRC (BB/H01439X/1, to C.D.J.).
References
- Abbott R, Albach D, Ansell S, Arntzen JW, Baird SJ, Bierne N, Boughman J, Brelsford A, Buerkle CA, Buggs R, et al. Hybridization and speciation. J Evol Biol. 2013;26:229–246. doi: 10.1111/j.1420-9101.2012.02599.x.
- Barton NH, Gale KS. Genetic analysis of hybrid zones. In: Price J, Harrison RG, editors. Hybrid zones and the evolutionary process. New York: Oxford University Press; 1993.
- Bosse M, Megens H-J, Frantz LAF, Madsen O, Larson G, Paudel Y, Duijvesteijn N, Harlizius B, Hagemeijer Y, Crooijmans RP, et al. Genomic analysis reveals selection for Asian genes in European pigs following human-mediated introgression. Nat Commun. 2014;5:4392. doi: 10.1038/ncomms5392.
- Charlesworth B. Measures of divergence between populations and the effect of forces that reduce variability. Mol Biol Evol. 1998;15:538–543. doi: 10.1093/oxfordjournals.molbev.a025953.
- Chen W-C. Overlapping codon model, phylogenetic clustering, and alternative partial expectation conditional maximization algorithm [Ph.D. Dissertation]. Ames (IA): Iowa State University; 2011.
- Churchhouse C, Marchini J. Multiway admixture deconvolution using phased or unphased ancestral panels. Genet Epidemiol. 2013;37:1–12. doi: 10.1002/gepi.21692.
- Cruickshank TE, Hahn MW. Reanalysis suggests that genomic islands of speciation are due to reduced diversity, not reduced gene flow. Mol Ecol. 2014;23:3133–3157. doi: 10.1111/mec.12796.
- Cutter AD, Payseur BA. Genomic signatures of selection at linked sites: unifying the disparity among species. Nat Rev Genet. 2013;14:262–274. doi: 10.1038/nrg3425.
- De Mita S, Siol M. EggLib: processing, analysis and simulation tools for population genetics and genomics. BMC Genet. 2012;13:27. doi: 10.1186/1471-2156-13-27.
- Durand EY, Patterson N, Reich D, Slatkin M. Testing for ancient admixture between closely related populations. Mol Biol Evol. 2011;28:2239–2252. doi: 10.1093/molbev/msr048.
- Eaton D, Ree R. Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae). Syst Biol. 2013;62:689–706. doi: 10.1093/sysbio/syt032.
- Ellegren H, Smeds L, Burri R. The genomic landscape of species divergence in Ficedula flycatchers. Nature. 2012;491:756–760. doi: 10.1038/nature11584.
- Eriksson A, Manica A. Effect of ancient population structure on the degree of polymorphism shared between modern human populations and ancient hominins. Proc Natl Acad Sci U S A. 2012;109:13956–13960. doi: 10.1073/pnas.1200567109.
- Garrigan D, Kingan SB, Geneva AJ, Andolfatto P, Clark AG, Thornton K, Presgraves DC. Genome sequencing reveals complex speciation in the Drosophila simulans clade. Genome Res. 2012;22:1499–1511. doi: 10.1101/gr.130922.111.
- Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Fritz MH, et al. A draft sequence of the Neandertal genome. Science. 2010;328:710–722. doi: 10.1126/science.1188021.
- Hahn MW, White BJ, Muir CD, Besansky NJ. No evidence for biased co-transmission of speciation islands in Anopheles gambiae. Philos Trans R Soc Lond B Biol Sci. 2012;367:374–384. doi: 10.1098/rstb.2011.0188.
- Heliconius Genome Consortium. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature. 2012;487:94–98. doi: 10.1038/nature11041.
- Henn BM, Botigué LR, Gravel S, Wang W, Brisbin A, Byrnes JK, Fadhlaoui-Zid K, Zalloua PA, Moreno-Estrada A, Bertranpetit J, et al. Genomic ancestry of North Africans supports back-to-Africa migrations. PLoS Genet. 2012;8:e1002397. doi: 10.1371/journal.pgen.1002397.
- Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337.
- Huerta-Sánchez E, Jin X, Asan, Bianba Z, Peter BM, Vinckenbosch N, Liang Y, Yi X, He M, Somel M, et al. Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA. Nature. 2014;512:194–197. doi: 10.1038/nature13408.
- Kronforst MR, Hansen MEB, Crawford NG, Gallant JR, Zhang W, Kulathinal RJ, Kapan DD, Mullen SP. Hybridization reveals the evolving genomic architecture of speciation. Cell Rep. 2013;5:666–677. doi: 10.1016/j.celrep.2013.09.042.
- Kulathinal RJ, Stevison LS, Noor MAF. The genomics of speciation in Drosophila: diversity, divergence, and introgression estimated using low-coverage genome sequencing. PLoS Genet. 2009;5(7):e1000550. doi: 10.1371/journal.pgen.1000550.
- Lawson DJ, Hellenthal G, Myers S, Falush D. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453.
- Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 2013;93:278–288. doi: 10.1016/j.ajhg.2013.06.020.
- Martin SH, Dasmahapatra KK, Nadeau NJ, Salazar C, Walters JR, Simpson F, Blaxter M, Manica A, Mallet J, Jiggins CD. Genome-wide evidence for speciation with gene flow in Heliconius butterflies. Genome Res. 2013;23:1817–1828. doi: 10.1101/gr.159426.113.
- Noor MAF, Bennett SM. Islands of speciation or mirages in the desert? Examining the role of restricted recombination in maintaining species. Heredity. 2009;103:439–444. doi: 10.1038/hdy.2009.151.
- Omberg L, Salit J, Hackett N, Fuller J, Matthew R, Chouchane L, Rodriguez-Flores JL, Bustamante C, Crystal RG, Mezey JG. Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations. BMC Genet. 2012;13:49. doi: 10.1186/1471-2156-13-49.
- Pardo-Diaz C, Salazar C, Baxter SW, Merot C, Figueiredo-Ready W, Joron M, McMillan WO, Jiggins CD. Adaptive introgression across species boundaries in Heliconius butterflies. PLoS Genet. 2012;8:e1002752. doi: 10.1371/journal.pgen.1002752.
- Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics. 2012;192:1065–1093. doi: 10.1534/genetics.112.145037.
- Pinho C, Hey J. Divergence with gene flow: models and data. Annu Rev Ecol Evol Syst. 2010;41:215–230.
- Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, Beaty TH, Mathias R, Reich D, Myers S. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5:e1000519. doi: 10.1371/journal.pgen.1000519.
- R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2013. [cited 2014 Aug 20]. Available from: http://www.R-project.org/
- Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics. 1997;13:235–238. doi: 10.1093/bioinformatics/13.3.235.
- Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N, Parra MV, Rojas W, Duque C, Mesa N, et al. Reconstructing Native American population history. Nature. 2012;488:370–374. doi: 10.1038/nature11258.
- Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461:489–494. doi: 10.1038/nature08365.
- Rheindt FE, Fujita MK, Wilton PR, Edwards SV. Introgression and phenotypic assimilation in Zimmerius flycatchers (Tyrannidae): population genetic and phylogenetic inferences from genome-wide SNPs. Syst Biol. 2014;63:134–152. doi: 10.1093/sysbio/syt070.
- Roesti M, Hendry AP, Salzburger W, Berner D. Genome divergence during evolutionary diversification as revealed in replicate lake-stream stickleback population pairs. Mol Ecol. 2012;21:2852–2862. doi: 10.1111/j.1365-294X.2012.05509.x.
- Roux C, Tsagkogeorga G, Bierne N, Galtier N. Crossing the species barrier: genomic hotspots of introgression between two highly divergent Ciona intestinalis species. Mol Biol Evol. 2013;30:1574–1587. doi: 10.1093/molbev/mst066.
- Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, Pääbo S, Patterson N, Reich D. The genomic landscape of Neanderthal ancestry in present-day humans. Nature. 2014;507:354–357. doi: 10.1038/nature12961.
- Sankararaman S, Sridhar S, Kimmel G, Halperin E. Estimating local ancestry in admixed populations. Am J Hum Genet. 2008;82:290–303. doi: 10.1016/j.ajhg.2007.09.022.
- Smith J, Kronforst MR. Do Heliconius butterfly species exchange mimicry alleles? Biol Lett. 2013;9:20130503. doi: 10.1098/rsbl.2013.0503.
- Staubach F, Lorenc A, Messer PW, Tang K, Petrov DA, Tautz D. Genome patterns of selection and introgression of haplotypes in natural populations of the house mouse (Mus musculus). PLoS Genet. 2012;8:e1002891. doi: 10.1371/journal.pgen.1002891.
- Wall JD, Yang MA, Jay F, Kim SK, Durand EY, Stevison LS, Gignoux C, Woerner A, Hammer MF, Slatkin M. Higher levels of Neanderthal ancestry in East Asians than in Europeans. Genetics. 2013;194:199–209. doi: 10.1534/genetics.112.148213.
- Wickham H. Reshaping data with the reshape package. J Stat Softw. 2007;21(12):1–20.
- Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer; 2009. p. 213.
- Wickham H. The split-apply-combine strategy for data analysis. J Stat Softw. 2011;40(1):1–29.
- Wu C. The genic view of the process of speciation. J Evol Biol. 2001;14:851–865.
- Yang MA, Malaspinas A-S, Durand EY, Slatkin M. Ancient structure in Africa unlikely to explain Neanderthal and non-African genetic similarity. Mol Biol Evol. 2012;29:2987–2995. doi: 10.1093/molbev/mss117.
- Yang Z. A likelihood ratio test of speciation with gene flow using genomic sequence data. Genome Biol Evol. 2010;2:200–211. doi: 10.1093/gbe/evq011.