Abstract
It is estimated that a large proportion of amino acid substitutions in Drosophila have been fixed by natural selection, and as organisms are faced with an ever-changing array of pathogens and parasites to which they must adapt, we have investigated the role of parasite-mediated selection as a likely cause. To quantify the effect, and to identify which genes and pathways are most likely to be involved in the host–parasite arms race, we have re-sequenced population samples of 136 immunity and 287 position-matched non-immunity genes in two species of Drosophila. Using these data, and a new extension of the McDonald-Kreitman approach, we estimate that natural selection fixes advantageous amino acid changes in immunity genes at nearly double the rate of other genes. We find the rate of adaptive evolution in immunity genes is also more variable than other genes, with a small subset of immune genes evolving under intense selection. These genes, which are likely to represent hotspots of host–parasite coevolution, tend to share similar functions or belong to the same pathways, such as the antiviral RNAi pathway and the IMD signalling pathway. These patterns appear to be general features of immune system evolution in both species, as rates of adaptive evolution are correlated between the D. melanogaster and D. simulans lineages. In summary, our data provide quantitative estimates of the elevated rate of adaptive evolution in immune system genes relative to the rest of the genome, and they suggest that adaptation to parasites is an important force driving molecular evolution.
Author Summary
All organisms are attacked by an ever-changing array of pathogens and parasites, and it is widely supposed that the ensuing host–parasite “arms race” must drive extensive adaptive evolution in genes of the immune system. Here we have taken advantage of new sequencing technologies and analytical approaches to quantify the amount of adaptation that is occurring in immunity genes relative to the rest of the genome. We sampled two species of fruit fly (D. melanogaster and D. simulans) from eight different populations around the world, and sequenced 136 immunity and 287 non-immunity genes from these samples. Based on the differences in the sequences between the two species, and the genetic diversity within each species, we have estimated that natural selection drives twice as much change in immune-related proteins as in proteins with no immune function. Interestingly, the rate of adaptation is also more variable among immunity genes than among other genes in the genome, with a small subset of immunity genes evolving under intense natural selection. We suggest that these genes may represent hotspots of host–parasite coevolution within the genome.
Introduction
Hosts face an ever-changing array of parasites to which they must adapt, and parasites are widely believed to be one of the most important and universal selection pressures in natural populations. Consistent with this view, immune genes in several taxa are known to evolve faster than other genes, and sometimes significantly faster than the neutral rate – a signature of adaptive evolution [1],[2],[3]. Indeed, many studies of one or a few immune genes have identified the action of positive selection in Drosophila, including Relish [4], the Scavenger Receptors [5], RNAi genes [6], TEPs [7], Persephone [8] and others [2]. More recently, complete genome sequencing of multiple Drosophila species found that immune-related genes have high rates of amino-acid substitution, and are more likely to show evidence of adaptive evolution than other genes [1],[9]. Here we go beyond the yes/no detection of selection, to quantify the additional adaptation that occurs in proteins of the immune system over and above that which occurs in the rest of the genome.
The rate at which natural selection fixes new mutations can be estimated by comparing the amount of polymorphism within populations to divergence between species at synonymous and nonsynonymous sites [10],[11],[12],[13],[14]. Approaches of this kind have been used to estimate the genome-wide rate of adaptive evolution, and found that it is often surprisingly high [10],[13],[15],[16],[17]. However, the nature of the selection pressures underlying this evolution remains unknown.
One approach to answering this question is to compare estimated rates of adaptive evolution between proteins with different functions. Focussing on genes where we have a strong expectation of elevated positive selection has a further benefit: there is an ongoing debate about the extent to which the high genomic estimates represent artefacts of processes such as population demography [15],[18],[19], and testing the a priori hypothesis that immunity genes will have increased adaptive rates can address this issue.
To assess the role of pathogens and other parasites as a cause of molecular evolution we have resequenced population samples of most of the best-characterised immunity genes in the Drosophila melanogaster genome, together with position-matched ‘control’ genes with no known immune function. This provides a quantitative estimate of the impact of parasite-mediated selection on the rate of adaptive evolution, and suggests that immunity genes have double the genome-average rate (Figure 1). We found that this was not caused by a generally elevated rate in immunity genes. Instead, most immunity genes show similar rates of adaptive evolution to the rest of the genome, with only a small subset evolving under very intense selection (Figure 2). These genes tend to be concentrated in a few pathways, which we argue are likely to be hotspots of host-parasite coevolution (Figure 3). Interestingly, these pathways are known to be suppressed by pathogens, and this suggests that active parasite-suppression of the immune system is an important cause of this adaptive evolution. Furthermore, when independent lineages are compared, similar genes show accelerated rates of adaptation (Figure 4). This suggests that despite their dynamic nature, host-parasite interactions may create similar selective pressures in related species, leading to replicable signatures at the molecular level.
Results
We have resequenced 136 of the best characterised immunity genes in Drosophila melanogaster and D. simulans. To get an unbiased estimate of the background rate of adaptive evolution, we also sampled position-matched ‘control’ genes with no known immune function. We sampled flies from six D. melanogaster populations and two D. simulans populations, and pooled genomic DNA from four outcrossed flies (eight alleles of each gene) from each population. We then amplified the target genes by PCR, and sequenced them using the Solexa-Illumina platform. After excluding sites with less than 20-fold coverage (Figure S1) and genes represented by less than 100 bp of sequence, there remained a total of 462.7 kbp of protein coding sequence from D. melanogaster representing 415 genes, and 335.6 kbp from D. simulans representing 309 genes. In this coding sequence we identified 12,974 putative SNPs in D. melanogaster and 10,759 in D. simulans. Raw data are available from the NCBI Short Read Archive under accession number SRA009020, or on request from the authors, and data for individual genes is given in Table S1.
Short-read sequencing of long PCR products provides a cost-efficient approach to identifying polymorphic sites and to estimating levels of genetic diversity, and has been shown to be as accurate as, or more accurate than, traditional Sanger sequencing [20]. By pooling template DNA from multiple individuals, cost-efficiency can be improved even further, though this may come at the cost of reduced accuracy. To assess the quality of our pooled-template short-read data, we re-sequenced 11 loci in two populations from diploid genomic DNA of the same individuals, using traditional Sanger sequencing (a total of 12,415 bp; see Text S1 and Figures S2, S3, S4, S5, S6, S7, S8, S9 for a detailed analysis of data quality and a comparison of the methods). We found that our pooled-template short-read approach successfully recovered ∼90% of the polymorphisms identified by Sanger sequencing, and more than 94% of short-read polymorphisms were verified by the Sanger data. Assuming the Sanger sequences are correct, on a per-site basis, this is an accuracy of 99.8%. Although estimates of allele-frequency are relatively poor (the correlation between Sanger and short-read estimates was Pearson's ρ = 0.71), our estimates of genetic diversity are highly correlated between the two methods (Pearson's ρ = 0.94 and 0.90 for per locus estimates of θw and θπ respectively). Our approach compares favourably with automated Sanger-sequencing of diploid genomic DNA, which is reported to have an error rate of ∼7% of SNPs [reviewed in 20]. However, as with related methods [21], the majority of our sequencing errors appear to result from PCR (allelic dropout and misincorporation of bases) or unequal mixing of template DNA. Because of this, future mixed-template studies may be improved by the use of direct DNA-capture in place of PCR, and/or mixing larger numbers of individuals, so that read-frequency better reflects population allele-frequency.
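The two diversity statistics compared above, θw and θπ, can be illustrated with a minimal Python sketch. It uses the standard textbook definitions (Watterson's estimator and average pairwise diversity with the usual finite-sample correction) and is not the pipeline used in this study. The function names and the input format are hypothetical, and it assumes that allele frequencies in a pooled sample are approximated from read frequencies.

```python
def harmonic(n_alleles):
    """Return a_n = sum_{i=1}^{n-1} 1/i for n sampled alleles."""
    return sum(1.0 / i for i in range(1, n_alleles))


def watterson_theta(num_segregating, n_alleles, n_sites):
    """Per-site Watterson estimator: S / (a_n * L)."""
    return num_segregating / (harmonic(n_alleles) * n_sites)


def nucleotide_diversity(minor_allele_freqs, n_alleles, n_sites):
    """Per-site pairwise diversity from the frequencies at segregating sites.

    Each biallelic site contributes 2p(1-p), scaled by n/(n-1) to correct
    for the finite number of sampled alleles (an assumption of this sketch;
    pooled read frequencies only approximate the true sample frequencies).
    """
    correction = n_alleles / (n_alleles - 1)
    heterozygosity = sum(2.0 * p * (1.0 - p) for p in minor_allele_freqs)
    return correction * heterozygosity / n_sites
```

With eight alleles per population, as sampled here, `harmonic(8)` is approximately 2.593, so ten segregating sites in 1,000 bp would give a per-site θw of roughly 0.0039.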
For the following analyses of adaptive rates we focus on Kenyan populations of each species, as these are thought to be representative of their ancestral range [22], and should minimise demographic artefacts associated with recent colonisation [14],[15],[18]. However, analyses of combined data, which give very similar results, are presented in Figures S10, S11, S12, S13, S14, S15.
Immunity genes show higher rates of adaptive evolution than other genes
The proportion of amino acid substitutions that were fixed by natural selection (denoted α) can be estimated using extensions of the McDonald-Kreitman test [16], which compares non-synonymous and synonymous changes, and contrasts within-species polymorphism to fixed differences between species. We have extended existing maximum likelihood approaches [15],[23],[24] to estimate separate α values for immunity and non-immunity genes, and for different classes of immunity genes (see Materials and Methods).
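For orientation, the simplest count-based form of this estimator can be sketched in Python. This is the classical point estimate, α = 1 − (dS·pN)/(dN·pS), applied to counts summed across genes. It is not the maximum likelihood extension used in this study, and the function names and tuple layout are hypothetical.

```python
def mk_alpha(dn, ds, pn, ps):
    """Count-based McDonald-Kreitman estimate of alpha.

    dn, ds: non-synonymous and synonymous fixed differences.
    pn, ps: non-synonymous and synonymous polymorphisms.
    """
    if dn == 0 or ps == 0:
        raise ValueError("alpha is undefined when dn or ps is zero")
    return 1.0 - (ds * pn) / (dn * ps)


def pooled_alpha(genes):
    """Estimate alpha for a class of genes by summing counts across genes.

    genes: iterable of (dn, ds, pn, ps) tuples, one per gene.
    """
    genes = list(genes)
    dn = sum(g[0] for g in genes)
    ds = sum(g[1] for g in genes)
    pn = sum(g[2] for g in genes)
    ps = sum(g[3] for g in genes)
    return mk_alpha(dn, ds, pn, ps)
```

For example, `mk_alpha(dn=40, ds=100, pn=10, ps=50)` returns 0.5: the polymorphism data predict 20 non-adaptive non-synonymous substitutions, so half of the 40 observed are attributed to positive selection.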
We found that the proportion of substitutions attributable to positive selection in immune genes is approximately 50% greater than the genome average. Based on the divergence between D. simulans and D. melanogaster and polymorphism in Kenyan populations of both species, we estimated that 65% of amino acid substitutions in immunity genes have been fixed by selection (95% bounds bootstrapping across genes within categories: 55–72%, Figure 1A). This is significantly higher than our estimate for non-immunity genes, which is very close to previous genome-wide estimates (reviewed in [10]) (α = 41%; 95% bounds are 31–50%; difference from immunity genes: p = 0.004, inferred by bootstrapping).
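Bootstrapping across genes within categories can be sketched as follows. This is a minimal illustration, not the authors' software: it assumes the simple count-based α rather than the maximum likelihood estimate, a one-sided comparison, and percentile bounds, none of which are specified in the text above. All function names are hypothetical.

```python
import random


def pooled_alpha(genes):
    """Count-based alpha from (dn, ds, pn, ps) tuples summed across genes.

    Assumes the summed dn and ps are non-zero in every resample.
    """
    dn = sum(g[0] for g in genes)
    ds = sum(g[1] for g in genes)
    pn = sum(g[2] for g in genes)
    ps = sum(g[3] for g in genes)
    return 1.0 - (ds * pn) / (dn * ps)


def bootstrap_alpha(genes, n_boot=1000, seed=0):
    """Resample genes with replacement and re-estimate alpha each time."""
    rng = random.Random(seed)
    return [pooled_alpha([rng.choice(genes) for _ in genes])
            for _ in range(n_boot)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Approximate percentile bounds from a list of bootstrap replicates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


def bootstrap_difference_p(immune, control, n_boot=1000, seed=0):
    """Fraction of replicates in which immune alpha is not above control alpha."""
    rng = random.Random(seed)
    not_higher = 0
    for _ in range(n_boot):
        a_immune = pooled_alpha([rng.choice(immune) for _ in immune])
        a_control = pooled_alpha([rng.choice(control) for _ in control])
        if a_immune <= a_control:
            not_higher += 1
    return not_higher / n_boot
```

Resampling whole genes, rather than individual sites, keeps linked sites together, which is why the bounds reflect variation among genes within each category.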
The effect remained highly significant when data from all populations were combined, though absolute estimates of α were slightly lower (immune: α = 58%; non-immune: α = 33%; p = 0.004; Figure S10). Since the exclusion of rare variants led to slightly higher estimates of α (Figure S16), this effect is probably caused by the enlarged sample size containing a higher proportion of (low-frequency) mildly-deleterious non-synonymous variants, which can cause α to be underestimated [23]. Estimates of α in the Greek (Athens) populations had greater variance and failed to detect a significant difference between immunity and non-immunity genes (Figure S10B), as might be expected because the relatively low genetic diversity of this population means we have little statistical power to accurately infer α [14].
The proportion of amino acid substitutions fixed by selection (α) will clearly be affected by the number of substitutions not fixed by selection, i.e., the number of effectively neutral substitutions fixed through genetic drift. Therefore, it is possible that the higher α of immunity genes does not reflect any increase in the absolute number of adaptive substitutions per non-synonymous site (denoted a [16]). This possibility has been little explored, because a, unlike α, is difficult to estimate as a multi-gene average, and because single-gene estimates of either statistic tend to be imprecise. Here we use an approach that allows us to obtain relatively stable estimates of a for individual genes (see Materials and Methods), which can then be averaged across immune and non-immune genes. Using Kenyan populations of D. melanogaster and D. simulans, we estimated that since their common ancestor, selection has fixed an average of 10.6×10⁻³ adaptive substitutions per non-synonymous site in immunity genes, but only 5.7×10⁻³ in other genes (difference between immunity and control genes: p = 0.02; Figure 1B). This difference in the absolute number of adaptive substitutions corresponds to the 50% increase in the proportion (α) described above, and suggests that natural selection is fixing adaptive substitutions in immunity genes at nearly double the genome average rate.
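The arithmetic behind "nearly double" and "approximately 50% greater" can be made explicit using only the point estimates reported above. The identity on the last line follows from the definitions of α and a, where d_N/l_N denotes non-synonymous divergence per non-synonymous site. It holds exactly only for a single set of counts, so it is offered as a guide to interpretation and not as the estimation procedure.

```latex
\[
\frac{a_{\mathrm{immunity}}}{a_{\mathrm{control}}}
  = \frac{10.6\times10^{-3}}{5.7\times10^{-3}} \approx 1.86,
\qquad
\frac{\alpha_{\mathrm{immunity}}}{\alpha_{\mathrm{control}}}
  = \frac{0.65}{0.41} \approx 1.59,
\]
\[
a = \alpha \,\frac{d_N}{l_N}.
\]
```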
Immune genes show more variation in rates of adaptive evolution than other genes
The high rate of adaptive evolution that we found in immunity genes could be driven either by a general elevation in the strength of selection across all immunity genes, or by a few key genes experiencing intense selection pressures. To investigate this, we examined the distribution of a across genes. Although mean a is higher for immunity genes than other genes (Figure 1B), the modal class is the same, i.e., centred on zero in both cases (Figure 2A versus Figure 2B), and the difference in mean is driven by a subset of immune genes with unusually high a (Figure 2C; this results in a significantly higher variance for immunity genes). The wider distribution of a across immunity genes suggests that most of these genes experience similar selection pressures to the rest of the genome, while a small subset are under substantially stronger selection. This is consistent with the analyses of D. simulans genome sequences that found little evidence that immunity genes as a group are outliers in terms of recurrent adaptive evolution [17]. Thus it appears that host-parasite arms races may involve a relatively small subset of the immune system.
This analysis could be confounded if our estimates were less accurate for immune genes than control genes, but this is unlikely for two reasons. First, the immunity genes tend to be longer than control genes, which will reduce the variance of a estimates and make our analysis conservative (Figure 2C). Second, the pattern remains significant and quantitatively almost identical if the analysis is restricted to genes with more than 500 non-synonymous sites (Figure S17, S18).
Immune genes with different functions show different rates of adaptive evolution
Clues as to the nature of the selection pressures acting on immune genes can be gained from looking at which functional classes of immune gene are experiencing the strongest selection [1],[2]. To examine how selection pressures differ between immune genes with different functions, we classified the genes in two different ways.
First, we classified genes according to the branch of the immune system in which they function: the humoral, cellular, melanisation and antiviral RNAi responses. We found little variation between the first three categories (α = 51%, 62% and 63%; per-site a = 0.009, 0.010 and 0.012, respectively), and individually no category was significantly different from non-immunity genes (Figure 1A and Figure 1B). However, RNAi genes were an exception to this, showing approximately twice the proportion of adaptive substitutions as compared to non-immune genes (α = 88% vs. 41%; p<0.001), and seven times the number of adaptive substitutions per site (a = 0.042 vs. 0.0057; p<0.001; Figure 1). This is consistent with previous results, which found that some RNAi genes evolve rapidly under positive selection [6],[25].
Second, we classified immune genes (excluding those involved in RNAi) according to their mode of action: pathogen recognition, signalling cascade, and antimicrobial peptides (AMPs). This categorisation gave a superior fit to the data according to model selection techniques (see Materials and Methods, and Table S2) and was also a significantly better fit than randomly assigning genes to categories of the same size (randomization test: p<10⁻³). Using this alternative categorisation, no group was significantly higher than non-immune genes, although signalling molecules did have a marginally higher α but not a (estimated α = 57% vs. 41%; p = 0.085). Consistent with previous results [26],[27], AMPs showed no evidence of adaptive evolution (were not detectably different from α = 0; Figure 1A), undergo significantly less adaptive evolution than RNAi, signalling and cellular recognition genes (p<0.014 in each case), and undergo marginally less adaptive evolution than non-immune genes (estimated α = −13% vs. 41%; p = 0.082). Alternative analyses using other populations and outgroups resulted in a qualitatively identical pattern (Figures S10, S11, S12, S13, S14, S15), except that the use of D. yakuba as an outgroup resulted in the signalling molecules having a significantly higher α than the controls (p<0.031; Figure S14A and S14B).
Some genes and pathways are under exceptionally strong selection
Because the high rate of adaptive evolution in immune system genes is caused mainly by a subset of genes under very strong selection (Figure 1 and Figure 2), we investigated how these genes are distributed across the immune system (Figure 3). The two main signalling pathways in the immune system are the Toll and IMD pathways, and of these the IMD pathway has a much higher rate of adaptive evolution than the Toll pathway (IMD: mean estimated a = 0.023; Toll: mean a = 0.009; difference between Toll and IMD p = 0.039 by bootstrapping within classes). Within the Toll pathway, the extracellular molecules are under stronger selection than the cytoplasmic ones (extracellular: mean a = 0.015, cytoplasmic: mean a = 0.005, p = 0.033). The antiviral RNAi genes again show strong adaptive evolution [6] (mean estimated a = 0.032). Elsewhere, TEP I and PGRP-LD are also under exceptionally strong selection [1],[7]. It has been suggested that the phagocytosis receptor Dscam, which can produce up to 18,000 differently spliced isoforms, may allow Drosophila to mount specific immune responses [28],[29]. However, despite having over 22 kbp of coding sequence from Dscam, we were unable to find any evidence of adaptive evolution in this gene, indicating that this gene is not subject to arms-race selection.
Genes experience correlated selection pressures in different species
If the immune system adapts to parasites in similar ways in related species, then we would expect to see the same genes experiencing positive selection in different lineages [30]. Alternatively, each species could respond differently, resulting in different genes being positively selected in different lineages [30].
To address this question, we estimated the rate of adaptive evolution separately for each of the lineages leading to D. simulans and D. melanogaster from the common ancestor of the two species. The pattern of α (and a) across different pathways and functional categories of genes was very similar between the two lineages (Figures S12, S13), suggesting that the broad distribution of selection pressures between immune functions is the same. For example, in both lineages antiviral RNAi genes have the highest rates of adaptive evolution and antimicrobial peptides have the lowest rates.
Estimates of a along these individual lineages are associated with high levels of noise due to the short length of the branches; furthermore, the measurement error will be negatively correlated across the two lineages. Despite these sources of error, however, the data show a significant positive correlation in immunity gene a estimates between the two lineages (Figure 4), and this suggests that individual genes, and not just categories of gene, are under similar selection pressures in both lineages. This correlation was not significantly different to that found in the non-immunity genes, indicating that there is no greater tendency for parasites to cause lineage-specific selection than other selective agents (Figure 4).
Immunity genes have similar levels of polymorphism and population structure to other genes
The analyses presented above can identify selection that has occurred over millions of years, but recent selective sweeps can also be detected through reductions in genetic diversity. In both D. melanogaster and D. simulans there was no significant difference in the diversity of synonymous sites (πs) between immunity and non-immunity genes (Kenyan D. melanogaster: πs = 1.60% vs. 1.55%; Kenyan D. simulans: 2.46% vs. 2.62%; Figure S19, Figure S20, Table S3). Furthermore, if the immune genes are split into functional categories, only the diversity of the antiviral RNAi genes is significantly lower than the control genes (D. melanogaster πs = 0.80%, p<0.001; D. simulans πs = 1.01%, p<0.001; Figure S19, Figure S20, Table S3). This is consistent with RNAi genes having the highest rates of adaptive substitution in the immune system, and suggests a high proportion of them may have recently experienced selective sweeps in both species. Furthermore, none of the immune genes had unusually high levels of polymorphism, suggesting host-parasite coevolution in Drosophila has not resulted in ancient polymorphisms like those seen in vertebrate MHC genes and some plant resistance genes [31],[32].
It is known that flies are infected by different parasites in different populations, and this could lead to local adaptation where different alleles of a gene are favoured in different populations [33],[34],[35],[36],[37]. However, we could not detect any differences between immune genes and the controls in the amount of population structure in either D. melanogaster or D. simulans (Figure S21) providing no evidence to suggest that local adaptation of immune genes is common. However, it should be noted that our statistical power to detect genetic structure may be extremely low, and the effects of local adaptation on patterns of nucleotide variation may be small [38].
We also compared the amino acid diversity (πa) of the immunity and control genes, as this may reflect differences in selective constraint or the effects of balancing selection. In all eight populations πa was slightly higher in the immune genes, and in three populations the difference was significant (Figure S22, Figure S23, Table S3). Compared to the control genes, immune signalling molecules tend to have lower amino acid diversity, while antimicrobial peptides and recognition molecules in the cellular immune system have significantly higher amino acid diversity (Figure S22, S23). These differences correspond to the estimated number of substitutions occurring by genetic drift (Figure S24), but not to differences in πs, implying that they are caused by differences in selective constraint, rather than long-term balancing selection maintaining amino acid polymorphisms.
Discussion
We have found that the rate of adaptive substitution in immunity genes is nearly double the genome average. This is the first quantitative estimate of the rate at which natural selection drives protein evolution in genes of the immune system relative to the genome as a whole, and confirms that adaptation to parasites is an important force driving evolution. There are several reasons why parasites may be a powerful selection pressure. Firstly, parasites can cause high rates of mortality and morbidity, and therefore have a large impact on the fitness of their hosts. Secondly, the direction of parasite-mediated selection continually changes, due to coevolutionary arms races between hosts and parasites [39], and ecological factors altering the composition of the parasite community. Finally, parasites generally have shorter generation times, and (in the case of viruses) elevated mutation rates, potentially giving them an edge in the ‘arms-race’. This means that hosts may often be maladapted to their current set of parasites, and therefore under strong selection to evolve resistance.
We have also found that the high rate of adaptive substitution of immunity genes is driven by a small subset of immune genes under strong selection, while the majority of immunity genes have similar rates of adaptive evolution to the rest of the genome. This suggests that rapid ‘arms-race’ coevolution may only involve a small subset of molecules in the immune system. Since there is a tendency for these strongly-selected genes to cluster by pathway or protein-family, these clusters may reflect hotspots for coevolutionary interaction with parasites.
By examining the function of these groups of strongly-selected genes, we can gain clues regarding the underlying molecular processes that drive this coevolution. It is striking that almost all of these genes fall within the IMD signalling pathway and the antiviral RNAi pathway (Figure 3). It is known that both signalling pathways and RNAi are targeted by parasite molecules that suppress the immune response, and it has been suggested that this suppression may cause much of the adaptive evolution seen in immunity molecules [1],[2],[4],[25],[40]. The Toll pathway tends to have lower rates of adaptive evolution. It is unclear why this is, although it may reflect the pathogens with which it interacts, or constraint from its other functions in development [41]. In contrast to the signalling pathways, the PGRPs and GNBPs that act as receptors for the Toll and IMD pathways are not positively selected, possibly reflecting their role in binding to highly conserved pathogen molecules [7]. Unlike many other organisms (especially vertebrates [42]), AMPs in Drosophila show less adaptive evolution than most genes. This contrasts with the high rate of AMP gain and loss in the Drosophila phylogeny [1], and suggests that whatever process favours the duplication of AMPs does not result in strong selection on their protein sequence. Our results also imply that AMPs may be weakly constrained, with genetic drift fixing amino acid substitutions at a relatively high rate. This may be a consequence of gene duplication, as duplicated genes often have elevated rates of amino acid substitution [43].
It is interesting to note that components of the antiviral RNAi pathway also mediate defence against transposable elements [44],[45],[46], and these ‘genomic parasites’ may be an important selective force on these genes [25]. Indeed, several RNAi genes with no reported anti-viral function [25],[47],[48], and other genes involved in chromatin function [17], show evidence of rapid adaptive evolution in Drosophila.
At the phenotypic level, many organisms show evidence of convergent evolution, with different species evolving similar adaptations in response to similar selection pressures. However, it is unclear whether convergence is also common in molecular evolution, or whether molecular evolution is idiosyncratic, with each species following a unique evolutionary pathway [30]. One way to address this question is to test whether the same genes are evolving adaptively in different species [30]. At a broad level, we found that similar functional classes of immunity genes tend to have elevated rates of adaptive evolution in both the D. melanogaster lineage and the D. simulans lineage. At a finer scale, the rate of adaptive evolution of individual genes is correlated in the two lineages (despite the very high levels of noise associated with these single-lineage estimates). Because this correlation was not significantly different in immunity genes and our control genes, this suggests the fluctuating selection pressures associated with host-parasite coevolution do not result in unusually high rates of lineage-specific selection. Together these results suggest that the immune systems of these two closely related species experience similar selection pressures, and adapt to those selection pressures in similar ways.
Previous studies on immunity genes have applied various tests of adaptive evolution, and found that a higher than average fraction of immunity genes test ‘positive’ (e.g., [1],[2]). However, the statistical power of these tests will depend on factors such as selective constraint and gene length, and these could differ between immunity and non-immunity genes, even if their rates of adaptive substitution were identical. Furthermore, such confounding factors will be even more important if adaptive substitution is frequent across the genome, meaning that a large proportion of all genes evolve under some degree of positive selection [10]. Therefore a particular strength of the current approach, which can compare the estimated rates of adaptive evolution across different groups of genes, is that it provides quantitative estimates of the effect size rather than simply counting the number of ‘significant’ tests.
Estimates of the rate of adaptive substitution based on the McDonald-Kreitman test have been subject to some recent criticism as they can be influenced by factors such as population demography [18],[19]. However, it seems unlikely the differences observed here are artefacts. First, we compared loci where we have a strong a priori expectation of adaptive substitution to position-matched control loci. Second, we found no significant differences in the rate at which genetic drift causes non-adaptive evolution at these loci, such as could mislead the tests (Figure S24). Finally, false signatures of adaptive substitution can occur in populations that have experienced bottlenecks or recent expansions, and yet the signal we observed was much stronger in the ancestral Kenyan populations (Figure S10A), and weakest in the more derived populations (Figure S10B), while quantitative estimates of a differed surprisingly little between datasets. As new sequencing technologies result in ever larger datasets, this approach promises to be a powerful way to identify the selection pressures driving molecular evolution.
Our data not only confirm that parasites are an important driving force in molecular evolution [1],[2], they quantify the magnitude of this effect, and show that the rate of adaptive protein evolution in immunity genes is nearly twice the genome average. This elevated rate in the immune system is due to a subset of genes evolving under intense positive selection, and many of these genes are strongly selected in both D. melanogaster and D. simulans, suggesting that our results may reveal general principles of immune system evolution. In particular, some of the most strongly selected genes may be targeted by parasite suppressors of the immune response, and this may be a key battlefield in coevolution. These data add to the growing evidence that much adaptive protein sequence evolution is driven by co-evolutionary conflicts within or between genomes [49],[50].
Materials and Methods
Sequencing and sequence analysis
Flies were sampled from six populations of D. melanogaster and two populations of D. simulans, covering both their original range in Africa and more recent global expansion. In each population we extracted genomic DNA from four female flies that were either collected from the wild or were the progeny of crosses between pairs of isofemale lines (i.e. we sampled eight chromosomes from each population). Targeted genes were amplified by PCR in ∼5 kbp products, and the PCR products from each population were then mixed together, purified on a gel, and sequenced using the Solexa-Illumina sequencing platform to high coverage (mean >130-fold; Figure S1). The 36 bp sequencing reads were aligned to the D. melanogaster or D. simulans genome using MAQ [51] allowing for up to 2 mismatches per read, which resulted in 5–16 million mapped reads in each population. The sites were then assigned to coding or non-coding sequence using the genome annotation, and coding sites were classified as synonymous or non-synonymous. Positions with less than 20-fold coverage were excluded, as were genes represented by less than 100 bp; however, our results were not strongly affected by the exclusion of sites with less than 50-fold or 100-fold coverage (Figure S25). Full details of the Solexa-Illumina sequencing, together with a detailed comparison with traditional Sanger sequencing, are given in Text S1. A full listing of loci, their positions and polymorphism counts is given in Table S1.
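The two exclusion rules described above (per-site coverage and minimum gene length) can be expressed as a short Python sketch. The function name, the dictionary input format and the gene labels in the example are hypothetical, and the sketch assumes that per-site read depth has already been extracted from the alignments.

```python
def filter_by_coverage(gene_depths, min_depth=20, min_sites=100):
    """Apply the coverage and length filters to per-site read depths.

    gene_depths: dict mapping a gene name to a list of per-site depths.
    Returns a dict mapping each retained gene to the indices of its
    retained sites. Sites below min_depth are dropped first; genes left
    with fewer than min_sites retained sites are then dropped entirely.
    """
    kept = {}
    for gene, depths in gene_depths.items():
        sites = [i for i, depth in enumerate(depths) if depth >= min_depth]
        if len(sites) >= min_sites:
            kept[gene] = sites
    return kept


if __name__ == "__main__":
    example = {
        "geneA": [130] * 150,            # well covered, long enough
        "geneB": [130] * 50 + [5] * 100, # only 50 sites pass coverage
    }
    print(sorted(filter_by_coverage(example)))
```

Raising `min_depth` to 50 or 100 reproduces the kind of sensitivity check mentioned above, with the stricter thresholds trading retained sequence for more reliable polymorphism calls.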
Adaptive substitutions
To estimate the rate of adaptive substitution, we used a multi-locus, maximum likelihood extension of the McDonald-Kreitman test. This method is based on Welch 2006 (ref. [15], see also [23],[24]), but contains several new features and models. Software that implements the new methods is available on request from the authors, or from http://tree.bio.ed.ac.uk/software/.
We compared non-synonymous and synonymous divergence between D. melanogaster and D. simulans with polymorphism from both species. For each locus, the six observations (dN, dS, and pN and pS for each species), were assumed to have the following expected values:
where lS and lN are the number of synonymous and non-synonymous sites, λ = μt is the expected neutral divergence between the species, θi = 4Neμ is the expected neutral polymorphism for species i, ni is the number of alleles sampled for species i (taken here to be 8 per sampled population), and f is the fraction of non-synonymous mutants that are effectively neutral [15].
The parameters of greatest interest here, α or a, quantify the multiplicative or additive deviation of the observed dN from its expectation under neutrality and purifying selection. Positive estimates of either α or a are consistent with adaptive protein evolution, while negative values result either from sampling error, or from the presence of mildly deleterious mutations (which violate the assumptions of the test, contributing to pN but rarely reaching fixation [16],[52]). This violation can be mitigated by excluding low frequency synonymous and non-synonymous polymorphisms, as this is expected to remove the great majority of mildly deleterious mutations while leaving the neutral pN/pS ratio unaltered [52],[53]. To explore this phenomenon, we repeated our analyses excluding all putative polymorphisms with an estimated minor-allele frequency below a range of threshold frequencies (Figure S3). Our results were qualitatively unaltered, and so in the main text we report only results with all sampled polymorphisms included in the counts.
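To make the two kinds of deviation concrete, the sketch below computes point estimates from a single table of counts. It uses the classic single-table estimator α = 1 − (dS·pN)/(dN·pS) [16] and an additive analogue, not the multi-locus maximum likelihood method described here. The function name, the example counts, and the per-site scaling of a by the number of non-synonymous sites are assumptions made for illustration.

```python
def mk_point_estimates(d_n, d_s, p_n, p_s, l_n):
    """Simple McDonald-Kreitman-style point estimates (hypothetical helper).

    d_n, d_s: non-synonymous and synonymous divergence counts
    p_n, p_s: non-synonymous and synonymous polymorphism counts
    l_n:      number of non-synonymous sites (assumed scaling for a)
    """
    if d_n <= 0 or p_s <= 0:
        raise ValueError("d_n and p_s must be positive")
    # Non-synonymous divergence expected with no adaptive substitutions.
    expected_d_n = d_s * p_n / p_s
    alpha = 1.0 - expected_d_n / d_n          # multiplicative deviation
    a_per_site = (d_n - expected_d_n) / l_n   # additive deviation per site
    return alpha, a_per_site


alpha, a = mk_point_estimates(d_n=30, d_s=40, p_n=10, p_s=40, l_n=1500)
# An excess of non-synonymous polymorphism gives a negative estimate.
alpha_neg, a_neg = mk_point_estimates(d_n=5, d_s=40, p_n=10, p_s=40, l_n=1500)
```

The second call illustrates how an excess of non-synonymous polymorphism relative to divergence, as expected with mildly deleterious mutations, pushes both estimates below zero.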
To estimate the model parameters it was assumed that observed quantities were Poisson distributed around their expected values [15],[23],[24]. This distribution is derived under the assumption that substitutions and polymorphisms occur as independent events, but this assumption can be violated, e.g., by linked selection causing the clustering of substitution events in time. We used three approaches to reduce the impact of such violations. First, for some parameter types (selective constraint f and/or adaptive substitution a), we assigned separate parameters to each locus, making the extent of stochastic variation irrelevant to the parameter estimates obtained. Second, we obtained confidence intervals by bootstrapping across loci, rather than using the curvature of the likelihood surface. Third, we used model-selection criteria that allow for un-modeled over-dispersion (such as that arising from the clustering of events in time). To avoid over-parameterization associated with assigning large numbers of locus-specific parameters, we assumed that λ (the neutral mutation rate multiplied by divergence time) took a single value across all loci.
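The Poisson assumption and the bootstrap across loci can be illustrated as follows. This is a sketch under stated assumptions: the statistic being bootstrapped is a simple pooled α rather than the maximum likelihood estimate used here, the percentile interval is one common choice, and all names and counts are hypothetical.

```python
import math
import random


def poisson_loglik(observed, expected):
    """Sum of Poisson log-probabilities for paired observed/expected counts."""
    total = 0.0
    for k, mu in zip(observed, expected):
        # log P(k | mu) = k*log(mu) - mu - log(k!)
        total += k * math.log(mu) - mu - math.lgamma(k + 1)
    return total


def pooled_alpha(loci):
    """Pooled alpha from loci given as (dN, dS, pN, pS) tuples (stand-in statistic)."""
    d_n, d_s, p_n, p_s = (sum(col) for col in zip(*loci))
    return 1.0 - (d_s * p_n) / (d_n * p_s)


def bootstrap_ci(loci, stat, n_boot=1000, level=0.95, seed=1):
    """Percentile interval obtained by resampling whole loci with replacement."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(loci) for _ in loci]) for _ in range(n_boot))
    lower = reps[int((1.0 - level) / 2.0 * n_boot)]
    upper = reps[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper


loci = [(30, 40, 10, 40), (12, 35, 8, 30), (50, 45, 9, 38), (7, 20, 6, 25)]
low, high = bootstrap_ci(loci, pooled_alpha)
```

Resampling whole loci, rather than individual sites, keeps any within-locus clustering of events intact in each replicate, which is the motivation given above for bootstrapping across loci.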
To model neutral polymorphism, we exploited the correlation between θ at a locus, and its local recombination rate [54], by fitting the model θ = mr+b, where r is the local D. melanogaster recombination rate [55]. Maximum likelihood estimates of m and b were then obtained for each of the two species. This model has the advantage of providing appropriate estimates of θ for loci where the synonymous polymorphism is not at equilibrium, such as after a recent selective sweep. Model selection techniques (see below) also showed that it was significantly preferred to models in which θ did not vary between loci, and in which each locus had a separate parameter. Importantly, however, estimates of a were very similar under all three parameterizations (Figure S26). Given our chosen model, a data set of k loci was used to fit k+5 nuisance parameters, plus the a or α values of interest.
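A short sketch of the linear model for θ follows. The expected-count function assumes the standard Watterson form (site count × θ × a harmonic sum over the sample size), which is not spelled out in this section, and the breakdown of the k+5 nuisance parameters is one reading of the text. All function names and numerical values are hypothetical.

```python
def theta_from_recombination(r, m, b):
    """Expected neutral polymorphism at a locus with local recombination rate r."""
    return m * r + b


def expected_syn_polymorphisms(l_s, theta, n_alleles):
    """Expected synonymous segregating sites, assuming the Watterson form."""
    harmonic = sum(1.0 / j for j in range(1, n_alleles))
    return l_s * theta * harmonic


def nuisance_parameter_count(k_loci, n_species=2):
    """One reading of 'k+5': a locus-specific f for each of k loci,
    one shared lambda, and (m, b) for each of the two species."""
    return k_loci + 1 + 2 * n_species


theta = theta_from_recombination(r=2.0, m=0.003, b=0.004)
expected_ps = expected_syn_polymorphisms(l_s=100, theta=theta, n_alleles=8)
n_params = nuisance_parameter_count(k_loci=10)
```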
To choose between different parameterizations of the likelihood model (see Table S2) we used the Akaike Information Criterion, corrected for finite sample size and over-dispersion in the count data [56]. This criterion is given by QAICc = −2lnL/c+2K+2K(K+1)/(n-K-1) where lnL is the maximized log-likelihood for the model, K is the number of parameters it contains, and n is the number of data points (taken to be 6 times the number of loci). The factor c is the correction for overdispersion, and was estimated by c = (2lnLfull-2lnLsat)/n full, where “full” denotes the largest model in the set of models being compared, and “sat” denotes the saturated model, in which the expected values of all data points were set to their observed values. The conditional likelihood of each model was obtained by converting the QAICc values into Akaike weights [56].
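The model-selection step can be sketched as below. The QAICc function uses the small-sample term 2K(K+1)/(n−K−1) in the form given by Burnham and Anderson [56], and the over-dispersion factor c is passed in as an input rather than estimated. Function names and the example log-likelihoods are hypothetical.

```python
import math


def qaicc(log_lik, k, n, c_hat):
    """Quasi-likelihood AIC with small-sample correction.

    log_lik: maximized log-likelihood of the model
    k:       number of parameters
    n:       number of data points
    c_hat:   over-dispersion factor (taken as given here)
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return -2.0 * log_lik / c_hat + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


def akaike_weights(scores):
    """Convert information-criterion scores into weights that sum to 1."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [x / total for x in rel]


# Two hypothetical models fitted to 20 loci (n = 6 * 20 data points).
scores = [qaicc(-410.0, 25, 120, 1.3), qaicc(-400.0, 44, 120, 1.3)]
weights = akaike_weights(scores)
```

Subtracting the best score before exponentiating avoids numerical underflow when the raw scores are large.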
To compare estimates of adaptive substitution along two independent lineages, we used a variant of the method above, including polymorphism from a single species, and polarizing substitutions on to the D. melanogaster or D. simulans branch based on the inferred ancestral sequence. Ancestral sequences were inferred using maximum likelihood under a codon-based model and the tree (((Dmel,Dmel), (Dsim,Dsim)), ((Dyak), (Dere))) using PAML [57].
Genetic diversity and differentiation statistics
Genetic diversity was quantified in two ways. First, we estimated θ from the number of polymorphic sites, calculated as Watterson's θw under the assumption that all eight chromosomes in each population were sampled [58]. Although sites with low read depth may not sample all chromosomes, even at 20-fold coverage (our minimum threshold for inclusion), given equal representation of the chromosomes, there is a >90% chance that at least 7 of the 8 chromosomes have been sampled. Given the observed read depths, this effect would lead us to underestimate Watterson's θ by less than 0.5% of its correct value for most loci (Figure S9). Second, an estimate of θ based on π (the average number of pairwise differences per site) was calculated from read frequencies (rather than allelic frequencies) at each site, on the assumption that read frequencies reflect underlying allele frequencies. In fact, although significantly correlated, read frequencies do not provide a good estimate of allele frequencies in our data (Pearson's ρ = 0.71; Figure S4, see Text S1 for a full discussion). However, when averaged over multiple sites, π based on read-depth is very highly correlated with that based on true allele frequencies from Sanger sequence data, suggesting that this is an excellent measure of diversity (Pearson's ρ = 0.90; Figure S26).
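The quantities in this paragraph can be sketched in a few lines. The probability calculation assumes each read is drawn independently and uniformly from the eight chromosomes, which matches the "equal representation" condition above. The π function applies no finite-sample correction, which is a simplifying assumption, and all function names are hypothetical.

```python
from math import comb


def watterson_theta(n_segregating, n_chromosomes, n_sites):
    """Watterson's estimator per site: S / (a_n * L)."""
    a_n = sum(1.0 / i for i in range(1, n_chromosomes))
    return n_segregating / (a_n * n_sites)


def pi_from_read_frequencies(minor_read_freqs, n_sites):
    """Per-site diversity from read frequencies at polymorphic sites.

    Uses 2p(1-p) per polymorphic site with no finite-sample correction.
    """
    return sum(2.0 * p * (1.0 - p) for p in minor_read_freqs) / n_sites


def prob_at_least_sampled(n_reads, n_chromosomes, min_sampled):
    """Probability that n_reads uniform draws hit >= min_sampled chromosomes."""
    total = 0.0
    for m in range(min_sampled, n_chromosomes + 1):
        # Inclusion-exclusion: all reads fall on a fixed set of m
        # chromosomes and every one of those m is hit at least once.
        onto = sum(
            (-1) ** j * comb(m, j) * ((m - j) / n_chromosomes) ** n_reads
            for j in range(m + 1)
        )
        total += comb(n_chromosomes, m) * onto
    return total


p_seven_of_eight = prob_at_least_sampled(n_reads=20, n_chromosomes=8, min_sampled=7)
```

Under these assumptions `p_seven_of_eight` comes out a little above 0.9, consistent with the >90% figure quoted for 20-fold coverage.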
The degree of population structure was quantified using a sequence-based estimate of FST derived from πs calculated within and between populations: FST = (πtotal–πsub)/πtotal [e.g. 59] where πsub is the average genetic diversity of a gene within a population and πtotal is diversity across all populations. Averages across genes were calculated as the ratio between the mean of the numerator and the mean of the denominator for those genes, rather than the mean of the ratios. The significance of differences between classes of genes in FST and genetic diversity was assessed by bootstrapping. Genes were re-sampled with replacement within each category, and the statistic was recalculated 1000 times to produce a null distribution.
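The averaging and resampling described above can be sketched as follows. Genes are represented as (πtotal, πsub) pairs; the function names and example values are hypothetical, and how significance is read off the resampled distribution is left open because it is not specified here.

```python
import random


def mean_fst(genes):
    """Average FST across genes as a ratio of means, not a mean of ratios.

    genes: list of (pi_total, pi_sub) pairs, one per gene.
    """
    numerator = sum(total - sub for total, sub in genes)
    denominator = sum(total for total, _ in genes)
    # Both means share the same number of genes, so it cancels.
    return numerator / denominator


def bootstrap_fst_difference(class_a, class_b, n_boot=1000, seed=1):
    """Resample genes with replacement within each class and recompute
    the difference in mean FST for every replicate."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = [rng.choice(class_a) for _ in class_a]
        b = [rng.choice(class_b) for _ in class_b]
        diffs.append(mean_fst(a) - mean_fst(b))
    return diffs


immunity = [(0.020, 0.017), (0.012, 0.010), (0.015, 0.012)]
control = [(0.018, 0.017), (0.011, 0.010), (0.016, 0.015)]
diffs = bootstrap_fst_difference(immunity, control)
```

Taking the ratio of means gives genes with higher diversity more weight, so a few low-diversity genes with noisy per-gene FST values do not dominate the average.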
Supporting Information
Acknowledgments
We thank Urmi Trivedi, Marian Thomson, and Sujai Kumar for help with the short-read sequencing; Floh Bayer for the Sanger re-sequencing; and all the people who provided Drosophila samples. We thank David Begun and three anonymous reviewers for valuable comments that substantially improved the manuscript.
Footnotes
The authors have declared that no competing interests exist.
DJO is supported by Wellcome Trust Research Career Development Fellowship 085064/Z/08/Z (www.wellcome.ac.uk), FMJ by a Royal Society University Research Fellowship (royalsociety.org), and JJW by BBSRC grant DO17750 awarded to Andrew Rambaut (www.bbsrc.ac.uk). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Sackton TB, Lazzaro BP, Schlenke TA, Evans JD, Hultmark D, et al. Dynamic evolution of the innate immune system in Drosophila. Nature Genetics. 2007;39:1461–1468. doi: 10.1038/ng.2007.60.
- 2. Schlenke TA, Begun DJ. Natural Selection Drives Drosophila Immune System Evolution. Genetics. 2003;164:1471–1480. doi: 10.1093/genetics/164.4.1471.
- 3. Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, et al. A Scan for Positively Selected Genes in the Genomes of Humans and Chimpanzees. PLoS Biology. 2005;3:e170. doi: 10.1371/journal.pbio.0030170.
- 4. Begun DJ, Whitley P. Adaptive Evolution of Relish, a Drosophila NF-kB/IkB Protein. Genetics. 2000;154:1231–1238. doi: 10.1093/genetics/154.3.1231.
- 5. Lazzaro BP. Elevated Polymorphism and Divergence in the Class C Scavenger Receptors of Drosophila melanogaster and D. simulans. Genetics. 2005;169:2023–2034. doi: 10.1534/genetics.104.034249.
- 6. Obbard DJ, Jiggins FM, Halligan DL, Little TJ. Natural Selection Drives Extremely Rapid Evolution in Antiviral RNAi Genes. Current Biology. 2006;16:580. doi: 10.1016/j.cub.2006.01.065.
- 7. Jiggins FM, Kim K-W. Contrasting evolutionary patterns in Drosophila immune receptors. J Mol Evol. 2006;63:769–780. doi: 10.1007/s00239-006-0005-2.
- 8. Jiggins FM, Kim KW. A screen for immunity genes evolving under positive selection in Drosophila. Journal of Evolutionary Biology. 2007;20:965–970. doi: 10.1111/j.1420-9101.2007.01305.x.
- 9. Heger A, Ponting CP. Evolutionary rate analyses of orthologs and paralogs from 12 Drosophila genomes. Genome Research. 2007;17:1837–1849. doi: 10.1101/gr.6249707.
- 10. Eyre-Walker A. The genomic rate of adaptive evolution. Trends in Ecology & Evolution. 2006;21:569–575. doi: 10.1016/j.tree.2006.06.015.
- 11. McDonald JH, Kreitman M. Adaptive Protein Evolution at the Adh Locus in Drosophila. Nature. 1991;351:652–654. doi: 10.1038/351652a0.
- 12. Sawyer SA, Parsch J, Zhang Z, Hartl DL. Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila. Proceedings of the National Academy of Sciences. 2007;104:6504–6510. doi: 10.1073/pnas.0701572104.
- 13. Shapiro JA, Huang W, Zhang C, Hubisz MJ, Lu J, et al. Adaptive genic evolution in the Drosophila genomes. Proceedings of the National Academy of Sciences. 2007;104:2271–2276. doi: 10.1073/pnas.0610385104.
- 14. Parsch J, Zhang Z, Baines JF. The Influence of Demography and Weak Selection on the McDonald-Kreitman Test: An Empirical Study in Drosophila. Mol Biol Evol. 2009;26:691–698. doi: 10.1093/molbev/msn297.
- 15. Welch JJ. Estimating the genomewide rate of adaptive protein evolution in Drosophila. Genetics. 2006;173:821–837. doi: 10.1534/genetics.106.056911.
- 16. Smith NGC, Eyre-Walker A. Adaptive protein evolution in Drosophila. Nature. 2002;415:1022. doi: 10.1038/4151022a.
- 17. Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh Y-P, et al. Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans. PLoS Biol. 2007;5:e310. doi: 10.1371/journal.pbio.0050310.
- 18. Eyre-Walker A. Changing Effective Population Size and the McDonald-Kreitman Test. Genetics. 2002;162:2017–2024. doi: 10.1093/genetics/162.4.2017.
- 19. Hughes AL. Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level. Heredity. 2007;99:364–373. doi: 10.1038/sj.hdy.6801031.
- 20. Harismendy O, Ng P, Strausberg R, Wang X, Stockwell T, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology. 2009;10:R32. doi: 10.1186/gb-2009-10-3-r32.
- 21. Wilding C, Weetman D, Steen K, Donnelly M. High, clustered, nucleotide diversity in the genome of Anopheles gambiae revealed through pooled-template sequencing: implications for high-throughput genotyping protocols. BMC Genomics. 2009;10:320. doi: 10.1186/1471-2164-10-320.
- 22. Stephan W, Li H. The recent demographic and adaptive history of Drosophila melanogaster. 2006;98:65–68. doi: 10.1038/sj.hdy.6800901.
- 23. Bierne N, Eyre-Walker A. The genomic rate of adaptive amino acid substitution in Drosophila. Molecular Biology and Evolution. 2004;21:1350–1360. doi: 10.1093/molbev/msh134.
- 24. Sawyer SA, Hartl DL. Population-Genetics of Polymorphism and Divergence. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161.
- 25. Obbard DJ, Gordon KHJ, Buck AH, Jiggins FM. The evolution of RNAi as a defence against viruses and transposable elements. Philosophical Transactions of the Royal Society B: Biological Sciences. 2009;364:99–115. doi: 10.1098/rstb.2008.0168.
- 26. Lazzaro BP, Clark AG. Molecular population genetics of inducible antibacterial peptide genes in Drosophila melanogaster. Molecular Biology and Evolution. 2003;20:914–923. doi: 10.1093/molbev/msg109.
- 27. Jiggins FM, Kim K-W. The Evolution of Antifungal Peptides in Drosophila. Genetics. 2005;171:1847–1859. doi: 10.1534/genetics.105.045435.
- 28. Watson FL, Puttmann-Holgado R, Thomas F, Lamar DL, Hughes M, et al. Extensive Diversity of Ig-Superfamily Proteins in the Immune System of Insects. Science. 2005;309:1874–1878. doi: 10.1126/science.1116887.
- 29. Schmucker D, Chen B. Dscam and DSCAM: complex genes in simple animals, complex animals yet simple genes. Genes & Development. 2009;23:147–156. doi: 10.1101/gad.1752909.
- 30. Levine MT, Begun DJ. Comparative Population Genetics of the Immunity Gene Relish: Is Adaptive Evolution Idiosyncratic? PLoS ONE. 2007;2:e442. doi: 10.1371/journal.pone.0000442.
- 31. Bergelson J, Kreitman M, Stahl EA, Tian D. Evolutionary Dynamics of Plant R-Genes. Science. 2001;292:2281–2285. doi: 10.1126/science.1061337.
- 32. Hughes AL, Nei M. Pattern of Nucleotide Substitution at Major Histocompatibility Complex Class-I Loci Reveals Overdominant Selection. Nature. 1988;335:167–170. doi: 10.1038/335167a0.
- 33. Kraaijeveld AR, Godfray HCJ. Geographic Patterns in the Evolution of Resistance and Virulence in Drosophila and Its Parasitoids. The American Naturalist. 1999;153:S61–S74. doi: 10.1086/303212.
- 34. Corby-Harris V, Pontaroli AC, Shimkets LJ, Bennetzen JL, Habel KE, et al. Geographical Distribution and Diversity of Bacteria Associated with Natural Populations of Drosophila melanogaster. Appl Environ Microbiol. 2007;73:3470–3479. doi: 10.1128/AEM.02120-06.
- 35. Carpenter JA, Obbard DJ, Maside X, Jiggins FM. The recent spread of a vertically transmitted virus through populations of Drosophila melanogaster. Molecular Ecology. 2007;16:3947–3954. doi: 10.1111/j.1365-294X.2007.03460.x.
- 36. Brun G, Plus N. The viruses of Drosophila. In: Ashburner M, Wright TRF, editors. The genetics and biology of Drosophila. New York: Academic Press; 1980. pp. 625–702.
- 37. Johnson KN, Christian PD. Molecular Characterization of Drosophila C Virus Isolates. Journal of Invertebrate Pathology. 1999;73:248–254. doi: 10.1006/jipa.1998.4830.
- 38. Kelly JK. Geographical Variation in Selection, from Phenotypes to Molecules. The American Naturalist. 2006;167:481–495. doi: 10.1086/501167.
- 39. Woolhouse MEJ, Webster JP, Domingo E, Charlesworth B, Levin BR. Biological and biomedical implications of the coevolution of pathogens and their hosts. Nature Genetics. 2002;32:569–577. doi: 10.1038/ng1202-569.
- 40. Thoetkiattikul H, Beck MH, Strand MR. Inhibitor kappa B-like proteins from a polydnavirus inhibit NF-kappa B activation and suppress the insect immune response. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:11426–11431. doi: 10.1073/pnas.0505240102.
- 41. Lemaitre B, Nicolas E, Michaut L, Reichhart J-M, Hoffmann JA. The Dorsoventral Regulatory Gene Cassette spatzle/Toll/cactus Controls the Potent Antifungal Response in Drosophila Adults. Cell. 1996;86:973–983. doi: 10.1016/s0092-8674(00)80172-5.
- 42. Tennessen JA. Molecular evolution of animal antimicrobial peptides: widespread moderate positive selection. Journal of Evolutionary Biology. 2005;18:1387–1394. doi: 10.1111/j.1420-9101.2005.00925.x.
- 43. Wagner A. Selection and gene duplication: a view from the genome. Genome Biology. 2002;3. doi: 10.1186/gb-2002-3-5-reviews1012.
- 44. Czech B, Malone CD, Zhou R, Stark A, Schlingeheyde C, et al. An endogenous small interfering RNA pathway in Drosophila. 2008;453:798–802. doi: 10.1038/nature07007.
- 45. Chung W-J, Okamura K, Martin R, Lai EC. Endogenous RNA Interference Provides a Somatic Defense against Drosophila Transposons. 2008;18:795–802. doi: 10.1016/j.cub.2008.05.006.
- 46. Zambon RA, Vikram VN, Wu LP. RNAi is an antiviral immune response against a dsRNA virus in Drosophila melanogaster. Cellular Microbiology. 2006;8:880–889. doi: 10.1111/j.1462-5822.2006.00688.x.
- 47. Klattenhoff C, Xi H, Li C, Lee S, Xu J, et al. The Drosophila HP1 Homolog Rhino Is Required for Transposon Silencing and piRNA Production by Dual-Strand Clusters. 2009. Cell In Press, Corrected Proof.
- 48. Vermaak D, Henikoff S, Malik HS. Positive Selection Drives the Evolution of rhino a Member of the Heterochromatin Protein 1 Family in Drosophila. PLoS Gen. 2005;1:e9. doi: 10.1371/journal.pgen.0010009.
- 49. Swanson WJ, Vacquier VD. The rapid evolution of reproductive proteins. Nature Reviews Genetics. 2002;3:137–144. doi: 10.1038/nrg733.
- 50. Presgraves DC. Does genetic conflict drive rapid molecular evolution of nuclear transport genes in Drosophila? BioEssays. 2007;29:386–391. doi: 10.1002/bies.20555.
- 51. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008;18:1851–1858. doi: 10.1101/gr.078212.108.
- 52. Fay JC, Wyckoff GJ, Wu CI. Positive and negative selection on the human genome. Genetics. 2001;158:1227–1234. doi: 10.1093/genetics/158.3.1227.
- 53. Charlesworth J, Eyre-Walker A. The McDonald-Kreitman Test and Slightly Deleterious Mutations. Mol Biol Evol. 2008;25:1007–1015. doi: 10.1093/molbev/msn005.
- 54. Begun DJ, Aquadro CF. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature. 1992;356:519–520. doi: 10.1038/356519a0.
- 55. Singh ND, Arndt PF, Petrov DA. Genomic Heterogeneity of Background Substitutional Patterns in Drosophila melanogaster. Genetics. 2005;169:709–722. doi: 10.1534/genetics.104.032250.
- 56. Burnham KP, Anderson DR. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag; 1998.
- 57. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Computer Applications in the Biosciences. 1997;13:555–556. doi: 10.1093/bioinformatics/13.5.555.
- 58. Watterson GA. On the number of segregating sites in models without recombination. Theor Popn Biol. 1975;7:256–276. doi: 10.1016/0040-5809(75)90020-9.
- 59. Pannell JR, Charlesworth B. Effects of metapopulation processes on measures of genetic diversity. Philosophical Transactions of the Royal Society of London Series B: Biological Sciences. 2000;355:1851–1864. doi: 10.1098/rstb.2000.0740.