Proceedings of the National Academy of Sciences of the United States of America. 2007 Feb 6;104(7):2271–2276. doi: 10.1073/pnas.0610385104

Adaptive genic evolution in the Drosophila genomes

Joshua A Shapiro *, Wei Huang, Chenhui Zhang, Melissa J Hubisz, Jian Lu *, David A Turissini *, Shu Fang *, Hurng-Yi Wang *, Richard R Hudson *, Rasmus Nielsen §, Zhu Chen †,¶, Chung-I Wu *,**
PMCID: PMC1892965  PMID: 17284599

Abstract

Determining the extent of adaptive evolution at the genomic level is central to our understanding of molecular evolution. A suitable observation for this purpose would consist of polymorphic data on a large and unbiased collection of genes from two closely related species, each having a large and stable population. In this study, we sequenced 419 genes from 24 lines of Drosophila melanogaster and its close relatives. Together with data from Drosophila simulans, these data reveal the following. (i) Approximately 10% of the loci in regions of normal recombination are much less polymorphic at silent sites than expected, hinting at the action of selective sweeps. (ii) The level of polymorphism is negatively correlated with the rate of nonsynonymous divergence across loci. Thus, even under strict neutrality, the ratio of amino acid to silent nucleotide changes (A:S) between Drosophila species is expected to be 25–40% higher than the A:S ratio for polymorphism when data are pooled across the genome. (iii) The observed A/S ratio between species among the 419 loci is 28.9% higher than the (adjusted) neutral expectation. We estimate that nearly 30% of the amino acid substitutions between D. melanogaster and its close relatives were adaptive. (iv) This signature of adaptive evolution is observable only in regions of normal recombination. Hence, the low level of polymorphism observed in regions of reduced recombination may not be driven primarily by positive selection. Finally, we discuss the theories and data pertaining to the interpretation of adaptive evolution in genomic studies.

Keywords: McDonald–Kreitman test, selection, polymorphism


Recent studies based on DNA sequence data from large numbers of genes have increasingly suggested the prevalence of adaptive evolution in coding (1–5) as well as noncoding (6, 7) regions. The extent to which positive selection influences DNA polymorphism and divergence appears to be incompatible with the Neutral Theory of Molecular Evolution (8). This theory posits that the overall pattern of DNA evolution can be accounted for by mutation, genetic drift, and negative selection. It does not deny the operation of positive selection on some loci but only asserts that the overall pattern of genomic evolution can be explained without invoking adaptive evolution. Presumably, adaptive changes at any given time involve too small a fraction of the genome to be a statistically significant factor, despite their overwhelming biological significance.

The evidence used to test the Neutral Theory can be classified as divergence among species (9–11), polymorphism within species (12–14), or a combination of these (15, 16). The combined approach, as exemplified by the McDonald–Kreitman (MK) test and its derivatives, can separate the effects of negative and positive selection and is especially informative about adaptive evolution. Many such studies have concluded that positive selection may play a significant role in driving amino acid substitutions in the human and Drosophila melanogaster lineages (1–5).

However, as pointed out (4), the MK test requires all sites to be from the same genealogy (i.e., no recombination among sites). When there has been recombination between sites, the genealogies of each segment will differ, and the possibility arises that there will be an association between the rate of evolution in a segment and the time to its most recent common ancestor. Such an association, even if weak, may substantially affect the results of the MK test. To the best of our knowledge, no report has attempted to estimate such a correlation, although variants of the MK test insensitive to such a relationship have been developed (3, 4). Hence, many estimates of adaptation by the MK test may require recalibration. For noncoding regions, recalibration may be an even more challenging task because of recombination between different types of sites.

Previous applications of the MK test were based on a relatively small sample of genes, which were often chosen without heeding possible biases. Indeed, earlier Drosophila studies contained a large fraction of rapidly evolving genes (many pertaining to male reproduction), which are more likely to be driven by positive selection than the average locus. To have an unbiased view of the average effect of positive selection on the entire genome, it is necessary to conduct an extensive survey of polymorphism and divergence with the following sampling considerations.

  1. Choice of genes should not have any overt bias. Criteria based on functional category, disease etiology, and prior knowledge of polymorphism or rate of evolution are likely to create biases. A strategy without biases is to choose genes by location and to cover a large part of the genome evenly. Furthermore, because the efficacy of selection depends critically on the rate of recombination (17), the contrast between loci from regions of low and high recombination will inform about the operation of selection.

  2. The number of loci should be large enough to yield several thousand SNPs.

  3. The number of individuals has to be large enough to reveal the difference in the frequency of deleterious and neutral polymorphism. It is also desirable to concentrate the sampling in the center of the species' origin for two reasons. First, the diversity of the species can be more accurately captured. Second, the confounding effects of drastic demographic changes associated with colonization can be minimized.

  4. There should be multiple outgroup sequences to minimize misinference of ancestral states.

In this study, we report a survey of 419 loci on the autosomes of D. melanogaster. These loci are mainly from the third chromosome and are evenly distributed on that chromosome, which accounts for 40% of the Drosophila genome. We used 21 isogenic lines of D. melanogaster, 15 of them of African origin. For outgroups, we sequenced one line from each of the three sibling species of D. melanogaster: Drosophila simulans, Drosophila mauritiana, and Drosophila sechellia. For comparison, we also analyzed polymorphisms in the seven sequenced D. simulans genomes.

Results

Polymorphism.

Summaries of polymorphism and nucleotide diversity for 419 autosomal loci are shown in Table 1. Genes in normal and low recombination regions are separately tallied, and sites are divided by mutation class into synonymous, nonsynonymous, and noncoding sites. The data presented are for all samples, which largely reflect African polymorphism. In general, the cosmopolitan samples showed lower levels of polymorphism than those from Africa, as reported (12, 18–20), with the bulk of cosmopolitan polymorphisms forming a subset of the African polymorphisms. Overall, the full combined sample was not significantly different from the African group alone [see supporting information (SI) Fig. 5]; we therefore chose to use the full set of samples for further analysis.

Table 1.

Estimates of nucleotide diversity, Tajima's D, and standardized Fay and Wu's H for 419 autosomal loci, separated by recombination rate

Site class   θΠ                θW                θL                D         H
Normal recombination, n = 252
    A        0.0019 (0.0024)   0.0024 (0.0028)   0.0020 (0.0024)   −0.585*   0.320*
    S        0.0232 (0.0130)   0.0233 (0.0120)   0.0260 (0.0159)   −0.134    −0.311
    NC       0.0172 (0.0165)   0.0180 (0.0157)   0.0183 (0.0152)   −0.147    −0.158
Low recombination, n = 167
    A        0.0013 (0.0021)   0.0014 (0.0020)   0.0014 (0.0020)   −0.468*   0.188
    S        0.0113 (0.0114)   0.0118 (0.0106)   0.0134 (0.0131)   −0.288    −0.364
    NC       0.0094 (0.0242)   0.0096 (0.0200)   0.0112 (0.0189)   −0.211    −0.120

Low recombination is defined as <0.002 cM/kb. Unweighted means are shown, with the standard deviation among loci for estimates of θ given in parentheses. P values are for a two-tailed Wilcoxon rank sum test where the true mean is 0. A, amino acid changes; S, silent changes; NC, noncoding.

*P < 10⁻⁸.

P < 10⁻².

We used three measures for nucleotide diversity, θW, θΠ, and θL, which estimate the population mutation rate per site, 4Nμ, by using different weightings of low-, intermediate-, and high-frequency variants. At neutral equilibrium, all three measures should yield comparable values. The normalized difference between θW and θΠ is the D statistic of Tajima (21), and the normalized difference between θL and θΠ is the H statistic of Fay and Wu, as modified by Zeng et al. (13, 22). Although the separate estimates of polymorphism (θΠ, θW, and θL) are quite comparable over these many loci, the average Tajima's D is somewhat negative for all mutation classes, both in the full dataset and in the African sample alone (data not shown), implying that even within Africa, D. melanogaster is not at the neutral equilibrium because of demography and/or selective effects.
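The three estimators and Tajima's D can be sketched from an unfolded site frequency spectrum as follows. This is a minimal illustration under stated assumptions, not the custom C++/libsequence code used in the study: the function name `diversity_from_sfs` and its input layout are hypothetical, the values returned are per-locus totals (divide by the number of sites for per-site estimates), and the normalized H of Zeng et al. is omitted because its variance term is considerably longer.

```python
from math import sqrt

def diversity_from_sfs(sfs):
    """sfs[i-1] = number of sites whose derived allele occurs in i of n sequences (i = 1..n-1)."""
    n = len(sfs) + 1                      # sample size implied by the spectrum
    s = sum(sfs)                          # number of segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))

    # Watterson's estimator: weights all frequency classes equally.
    theta_w = s / a1
    # Pairwise diversity: weights intermediate-frequency variants most.
    theta_pi = sum(x * 2.0 * i * (n - i) for i, x in enumerate(sfs, 1)) / (n * (n - 1))
    # theta_L: weights high-frequency derived variants most.
    theta_l = sum(i * x for i, x in enumerate(sfs, 1)) / (n - 1)

    # Standard variance constants for Tajima's D.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    tajima_d = (theta_pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")

    return {"theta_w": theta_w, "theta_pi": theta_pi, "theta_l": theta_l, "tajima_d": tajima_d}

# A spectrum proportional to 1/i (the neutral expectation) gives equal estimates and D near 0.
neutral = diversity_from_sfs([12, 6, 4, 3])
# A spectrum made only of singletons (excess of rare variants) gives a negative D.
rare_excess = diversity_from_sfs([10, 0, 0, 0])
```

With the neutral-shaped spectrum all three estimators agree, which is the sense in which they "should yield comparable values" at equilibrium; departures between them are what D and H summarize.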

Demography alone would be expected to affect patterns of polymorphism in the different classes of sites similarly. The contrast among classes of sites can therefore be informative about the specific effects of selection on each class. For example, the amino acid changing sites are only ≈1/10 as variable as the silent sites, indicating strong selective constraint, as expected. Furthermore, they have significantly negative D and positive H, indicating a deficit of moderate- and high-frequency polymorphisms. Apparently, many nonsynonymous polymorphisms are deleterious and reach only low frequency. We also note that the noncoding regions (which consist mainly of intronic and 5′ untranslated regions) are less variable than the synonymous sites of the coding regions (Mann–Whitney U; P = 3 × 10⁻⁹). These noncoding regions are likely under moderate selective constraint, as reported (7, 23, 24). Both silent and noncoding sites have negative D and H values, indicating an excess of high-frequency variants in each class. (Negative D alone can mean too many low- and/or high-frequency variants, but negative D and H together should mean the latter.) This pattern of too many high-frequency variants may be indicative of either hitchhiking under positive selection (25) or population structure (26). We consider the second explanation less likely, because D. melanogaster populations are not in general strongly structured (FST = 0.04 between African and cosmopolitan samples). In addition, when the African samples are considered alone, the same patterns remain (data not shown). Nevertheless, polymorphism data are merely suggestive of positive selection. Other aspects of the data need to be carefully evaluated to form strong conclusions.

The Effect of Recombination on Polymorphism.

An important feature of Table 1 is that the level of variation depends on recombination rate (12, 27). The full relationship for silent site variation in D. melanogaster is given in Fig. 1a (P < 10⁻¹⁵; see figure legends). In contrast, the level of divergence between species (D. melanogaster vs. D. simulans group) is not influenced by recombination rate (P = 0.28; see SI Fig. 6). The cause(s) of this reduced level of variation in regions of low recombination has been subject to debate. On the one hand, hitchhiking with advantageous mutations, the so-called selective sweep, is expected to reduce the level of variation as recombination decreases (27–29). On the other hand, the background selection model (30) suggests that deleterious mutations, by reducing the effective population size in regions of reduced recombination, may also remove linked neutral variants.

Fig. 1.

Relationship between recombination rate and θW for silent sites (Spearman rank correlation R = 0.53, P < 10⁻¹⁵) (Upper) and Tajima's D for silent sites (R = 0.16, P = 8.6 × 10⁻⁴) (Lower).

Although background selection is clearly operating, the question is whether background selection can be rejected as the sole contributing force, or whether selective sweeps must be invoked to account for the observations. In Fig. 1b, we show that Tajima's D is also correlated with the recombination rate, indicating that lower recombination regions have an excess of rare variants, confirming earlier observations (27). Although this correlation has been suggested to support the selective sweep hypothesis, the effect is quite weak.

In regions of normal recombination, there should also be pockets of reduced variation created by selective sweeps (13, 28). Under background selection alone, the level of variation should be relatively homogeneous (31). We note in Fig. 1a that a substantial portion of the loci show very low silent-site diversity, even as the recombination rate increases, whereas others appear to have elevated polymorphism. To test the significance of this spread in nucleotide variation among loci with comparable recombination rates, we used the Hudson–Kreitman–Aguade (HKA) test, which compares the relative levels of nucleotide diversity among loci, calibrated against divergence. In this application, we used a variation of the HKA test that incorporates demographic parameters (17).

The demographic history of populations can have dramatic effects on the observed pattern of noncoding polymorphism, and recent work has suggested that much of the observed pattern of noncoding polymorphism in Drosophila can be explained by purely demographic models (18–20). To address this question, we fit a demographic model to the data under the assumption that the African and cosmopolitan flies represent two diverged populations with migration between them. We then estimated the number of loci with a significant deficit or excess of polymorphism relative to divergence by simulation under the inferred model. Of the 247 loci in the normal recombination region, 26 have significantly lower polymorphism than expected, whereas 31 appear to have an excess of polymorphism (SI Table 3). Although the excess polymorphism may be due to balancing selection, differential selection among populations, or weak background selection, the deficit of polymorphism relative to divergence is far more suggestive of positive selection.

We also attempted to directly identify loci that had undergone a selective sweep by using the method of Nielsen et al. (32), which has been shown to be fairly robust to assumptions about demography. This method estimates the location and strength of a sweep at each locus, then uses simulation to determine whether that model fits the data better than a model without selection. Under this set of models, 33 of the 419 autosomal loci show significant evidence of selective sweeps, 23 of which are in the “normal” recombination region (of 252 in that region).

These tests support the idea that there has been a substantial amount of positive selection in the history of D. melanogaster. However, they remain potentially susceptible to demographic effects. Because the set of possible demographic models is enormous, we cannot completely exclude the possibility that we have simply not looked at the correct one. A better method is to use tests that have an internal control of sites known to be neutral.

Polymorphism vs. Divergence.

Here, we contrast the patterns of amino acid divergence and polymorphism by using the MK test (15). The MK test determines whether there are more amino acid substitutions (DivA) among species than expected as compared with the level of the amino acid polymorphism (PolyA) within species. The ratio of silent site divergence (DivS) to polymorphism (PolyS) changes is used to calibrate the divergence/polymorphism ratio for neutral changes. We summarize the contingency tables of divergent/polymorphic and amino acid/silent changes and define the Fixation Index (FI) as FI = (DivA/DivS)/(PolyA/PolyS). FI is expected to be equal to 1 in the neutral case when all polymorphic sites come from a single genealogy. When the observed FI is larger than the neutral expectation, positive selection is usually invoked (1–4).
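The definition of FI translates directly into a few lines of code. The sketch below is only illustrative; the function name `fixation_index` is hypothetical, and the counts plugged in are the pooled autosomal totals reported in Table 2.

```python
def fixation_index(div_a, div_s, poly_a, poly_s):
    """FI = (DivA/DivS) / (PolyA/PolyS) for one 2 x 2 MK table."""
    return (div_a / div_s) / (poly_a / poly_s)

# Pooled counts for all 419 loci (Table 2, "All sites" polymorphism column).
fi_all_loci = fixation_index(div_a=2307, div_s=6286, poly_a=1462, poly_s=4695)
print(round(fi_all_loci, 2))  # 1.18
```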

Our sampling was designed to detect selection at the genomic level with as little bias as possible. Therefore, each individual gene segment was chosen not to maximize the sensitivity of detection but rather to collectively capture the overall signal of selection across the whole genome. In total, 33 of the 419 loci show a significant excess of amino acid substitutions between species (SI Table 4), but only two of these remain significant when corrected for a false discovery rate of 0.05 (33).

To extract the information from the entire collection of genes, we pooled data across loci and analyzed the combined dataset (1, 2). Although combining contingency tables in the MK framework may yield a dataset that is intuitive and efficient, there are potential biases in this practice. The assumption of FI = 1 under neutrality, as applied in the MK test, is true only when all of the polymorphisms come from a single genealogy with little or no recombination among sites, or when all polymorphism is unlinked. When the nucleotide sites are derived from regions with different genealogies, caution should be exercised.

A simple example of this effect is shown in Fig. 2. Although the last table is the sum of the two neutral cases, the combination of the two produces an A/S ratio for divergence that is much larger than that for polymorphism. Under the standard interpretation of the MK test, these ratios are indicative of positive selection during the divergence of the two species. However, this is merely an artifact of the fact that the more constrained locus [amino acid/silent changes (A/S) = 0.2] has a higher level of polymorphism. Although this example may exaggerate the relationship between levels of constraint and polymorphism, any true correlation in the data will cause an apparent deviation from neutrality and affect conclusions about the nature of selection. If no correlation exists, adding more loci should cancel out random effects of the type observed in Fig. 2 (although type I error may be increased).

Fig. 2.

MK tables with FI = 1 can combine to produce a table of FI > 1, thus appearing nonneutral. Both hypothetical loci appear neutral as the ratio of nonsynonymous (A) to synonymous (S) changes within each table is the same for polymorphism (P) and divergence (D). The level of polymorphism between loci, however, is not the same, with a P:D ratio of 0.3 and 1.0 for the first and second tables, respectively. Although the different levels of polymorphism may be incompatible with strict neutrality, this rejection, if true, applies only to the polymorphic portion of the data. By contrast, the MK test is usually used to infer positive selection in the divergence among species.
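The pooling artifact can be reproduced with two made-up loci. The counts below are hypothetical (they are not the values drawn in Fig. 2); they were chosen only so that each locus has FI = 1, the more constrained locus has A/S = 0.2 and P:D = 1.0, and the less constrained locus has P:D = 0.3.

```python
def fixation_index(table):
    """table = (poly_a, poly_s, div_a, div_s); returns (DivA/DivS)/(PolyA/PolyS)."""
    poly_a, poly_s, div_a, div_s = table
    return (div_a / div_s) / (poly_a / poly_s)

def pool(tables):
    """Sum MK tables cell by cell, as is done when loci are combined."""
    return tuple(sum(cells) for cells in zip(*tables))

# Hypothetical loci, each individually neutral (FI = 1).
constrained = (10, 50, 10, 50)   # A/S = 0.2, P:D = 60/60 = 1.0
relaxed = (15, 15, 50, 50)       # A/S = 1.0, P:D = 30/100 = 0.3

pooled = pool([constrained, relaxed])          # (25, 65, 60, 100)
print(fixation_index(constrained), fixation_index(relaxed))  # 1.0 1.0
print(round(fixation_index(pooled), 2))        # 1.56
```

Because the constrained locus contributes most of the polymorphism but only part of the divergence, the pooled polymorphism A/S is pulled toward 0.2 while the pooled divergence A/S is not, and the combined table appears to show an excess of amino acid divergence.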

In Fig. 3, we show there is indeed a negative correlation in our dataset (Pearson's R = −0.174, P = 0.0005). To minimize the effects of this correlation, as well as other effects that may exist due simply to the variation in the relationship between polymorphism and divergence (P/D) and A/S among loci, we calculated the expected FI for the aggregate of all loci by summing the expected contingency table values for each locus under neutrality with no recombination (i.e., the product of the marginal sums divided by the total count for each locus). We assessed the significance of the observed FI by bootstrapping, as described in Materials and Methods.
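The expected-FI calculation described above can be sketched as follows. The function names and the two example loci are hypothetical; the only step taken from the text is that each expected cell is the product of its marginal sums divided by the locus total, and that these expected cells are summed over loci before FI is computed.

```python
def expected_table(table):
    """Expected MK cells for one locus under independence of rows and columns.

    table = (poly_a, poly_s, div_a, div_s).
    """
    poly_a, poly_s, div_a, div_s = table
    total = poly_a + poly_s + div_a + div_s
    poly, div = poly_a + poly_s, div_a + div_s        # row margins (P, D)
    repl, silent = poly_a + div_a, poly_s + div_s     # column margins (A, S)
    return (poly * repl / total, poly * silent / total,
            div * repl / total, div * silent / total)

def expected_fi(tables):
    """FI of the table obtained by summing per-locus expected cells."""
    e_poly_a, e_poly_s, e_div_a, e_div_s = (
        sum(cells) for cells in zip(*(expected_table(t) for t in tables))
    )
    return (e_div_a / e_div_s) / (e_poly_a / e_poly_s)

# Two hypothetical neutral loci whose P/D and A/S are negatively associated.
loci = [(10, 50, 10, 50), (15, 15, 50, 50)]
print(round(expected_fi(loci), 2))  # 1.56: the neutral expectation exceeds 1
```

For a single locus the expected FI is always 1; it departs from 1 only when loci with different P/D and A/S are summed, which is why the observed pooled FI is compared against this value rather than against 1.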

Fig. 3.

Correlation between P/D and A/S, where P (polymorphism), D (divergence), A (amino acid altering), and S (silent) are the margins for the 2 × 2 contingency table of the MK test. P/D denotes the size of the genealogy, and A/S denotes the selective constraint of the gene. Pearson's R = −0.188 and P = 0.0001. Seven genes have a P/D ratio >5 and are shown on the top (solid diamonds). When the contingency tables are summed up, genes with a larger P/D ratio generally contribute proportionately more to the total.

When all autosomal segments are combined, the A/S ratios for divergence and polymorphism are 2,307/6,286 (=0.367) and 1,462/4,695 (=0.311), respectively (see Table 2 and Fig. 4a), and FI = 1.18. Although this value initially appears to show an excess of amino acid divergence (i.e., >1), the expected FI for this set of loci is 1.25, and the observed value actually represents a marginally significant deficit of amino acid divergence (two-tailed P = 0.06). The source of this deficit is explained below.

Table 2.

Results of the MK test, separated into normal and low recombination regions

Locus set              Statistic               Polymorphism (P)              Divergence (D)
                                               All sites    High frequency
All loci               Replacement sites (A)   1,462        465              2,307
                       Silent sites (S)        4,695        2,411            6,286
                       A/S                     0.311        0.193            0.367
                       FI                      1.18         1.90
                       FIExp                   1.25         1.38
                       P                       0.968        <10⁻⁴
Normal recombination   A                       981          297              1,315
                       S                       3,566        1,883            3,442
                       A/S                     0.275        0.158            0.382
                       FI                      1.39         2.30
                       FIExp                   1.29         1.44
                       P                       0.050        <10⁻⁴
Low recombination      A                       481          152              992
                       S                       1,129        528              2,844
                       A/S                     0.426        0.288            0.349
                       FI                      0.82         1.21
                       FIExp                   1.12         1.18
                       P                       1.00         0.404

High-frequency sites are those with derived allele frequency > 0.19 (corresponding to sites with a derived allele present in more than three copies). These sites may more faithfully represent neutral polymorphisms (see Results). P values are P(FIsimulated ≥ FIobs) from 10,000 simulations.

Fig. 4.

Polymorphism A/S ratios in D. melanogaster, binned by the frequency of the derived variants. The first and last three gray bars represent singletons, doubletons, and tripletons. The middle six gray bars have a frequency increment of 0.1. The nine bars with frequency >0.19 are designated high-frequency classes (see Results). Their A/S ratios, being statistically indistinguishable, may be considered the ratio for neutral polymorphic variants. The A/S ratio for high-frequency variants, as well as that for all variants, is shown by the open bar. The black bar gives the divergence A/S ratio between D. melanogaster and D. simulans.

So far, we have made no distinction between polymorphisms that are truly neutral and deleterious ones that have merely not yet been eliminated by selection. In Fig. 4, we divide the polymorphic sites into bins by the frequency of the derived variant (1). A/S declines as the frequency of the derived variant increases. In particular, A/S for the three lowest-frequency classes, which correspond to singletons, doubletons, and tripletons, is higher than in the higher-frequency classes. Many of these low-frequency amino acid mutations are likely to be slightly deleterious, are destined to be removed by selection, and do not contribute to long-term evolutionary trends. To test the significance of this observation, we calculated χ² statistics for the table of silent and replacement polymorphisms binned by frequency. Here again, we used randomly generated tables to determine the appropriate null distribution of the χ² statistic. The observed χ² for the full table of frequencies is significantly greater than expected (P < 10⁻⁴), indicating an excess of heterogeneity in A/S across all classes. After removing the singleton and doubleton classes from the table, there remains significant heterogeneity among frequency classes (P = 0.04). However, when tripletons are also removed (leaving sites of frequency >0.2), the remainder of the table does not show significant heterogeneity (P = 0.07).
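A heterogeneity test of this kind can be sketched as below. How the random tables were generated in the study is described only in the SI, so the scheme used here (shuffling the replacement/silent labels across sites while keeping both margins fixed) is an assumption, as are the function names and the example counts.

```python
import random

def chi_square(table):
    """Pearson chi-square for a list of (replacement, silent) counts per frequency bin."""
    total_a = sum(a for a, _ in table)
    total_s = sum(s for _, s in table)
    total = total_a + total_s
    stat = 0.0
    for a, s in table:
        row = a + s
        for observed, column in ((a, total_a), (s, total_s)):
            expected = row * column / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def randomization_p(table, n_random=1000, seed=1):
    """Share of margin-preserving random tables with chi-square >= the observed value."""
    rng = random.Random(seed)
    observed = chi_square(table)
    bins = [b for b, (a, s) in enumerate(table) for _ in range(a + s)]  # one bin label per site
    n_replacement = sum(a for a, _ in table)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(bins)
        a_counts = [0] * len(table)
        for b in bins[:n_replacement]:       # sites randomly labeled as replacement
            a_counts[b] += 1
        shuffled = [(a_counts[b], sum(table[b]) - a_counts[b]) for b in range(len(table))]
        if chi_square(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)

# Hypothetical bins: a "flat" A/S profile versus one with excess low-frequency replacements.
flat = [(10, 50), (20, 100), (5, 25)]
skewed = [(60, 40), (20, 80), (10, 90)]
p_flat = randomization_p(flat, n_random=200)
p_skewed = randomization_p(skewed, n_random=200)
```

Dropping the lowest-frequency bins from the input and re-running the test mirrors the stepwise removal of singletons, doubletons, and tripletons described above.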

This “flat” profile is expected if most mutations more common than 20% are effectively neutral. We can first use the A/S for these sites to calculate the proportion of negatively selected amino acid polymorphisms. The classes with frequency >0.19 (corresponding to variants with more than three copies in our sample) have an A/S ratio of 0.186 (= 450/2,413), whereas the A/S ratio for all sites is 0.311. We estimate that the proportion of nonsynonymous polymorphic sites that are deleterious and destined to be removed is (0.311−0.186)/0.311 = 40%.

Similarly, we can calculate the proportion of amino acid fixations driven by positive selection. As shown in Table 2, the A/S ratio of the high-frequency polymorphic sites (A/S = 0.186) is significantly lower than the divergence A/S ratio (A/S = 0.367), even after correcting for biases resulting from combining tables as described [FI = 1.97 vs. E(FI) = 1.40; P < 10⁻⁴]. The proportion of amino acid fixations driven by positive selection is therefore approximately (1.97−1.40)/1.97 = 28.9%.
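Both proportions are the same kind of calculation: the excess of an observed value over its neutral baseline, expressed as a fraction of the observed value. The short sketch below simply redoes the arithmetic with the figures quoted in the text; the helper name `excess_fraction` is hypothetical.

```python
def excess_fraction(observed, baseline):
    """Fraction of the observed value that exceeds the neutral baseline."""
    return (observed - baseline) / observed

# Deleterious share of amino acid polymorphism: A/S for all sites vs. high-frequency sites.
deleterious = excess_fraction(observed=0.311, baseline=0.186)
# Adaptive share of amino acid fixations: observed FI vs. expected FI.
adaptive = excess_fraction(observed=1.97, baseline=1.40)

print(f"{deleterious:.1%}")  # 40.2%
print(f"{adaptive:.1%}")     # 28.9%
```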

Effect of Recombination on Adaptive Divergence.

The reduced level of nucleotide diversity in regions of low recombination (Table 1) has two very different explanations, background selection and selective sweep. Analyses of polymorphism data alone have not been able to delineate the two effects unambiguously. In Table 2, genes from regions of low and normal recombination are grouped separately. Interestingly, for low recombination regions, the A:S ratio for divergence is not much different from polymorphism. In contrast, the differential in the A/S ratio in regions of normal recombination is quite pronounced. The same pattern has been observed in previous studies (34, 35), and the implication is that positive selection is associated with genes in normal but not in low recombination regions. We shall return to this topic in Discussion.

Alternative Explanations for the MK Test Results.

Positive results from the MK test do not necessarily indicate positive selection. If the effect of selective constraint is constant in divergence and polymorphism, positive selection is indeed a plausible explanation. However, there are at least two scenarios under which selective constraints may not be constant.

First, it is possible that the higher A/S ratio in divergent sites is the result of a reduced population size at some point in the divergence of D. melanogaster from the D. simulans clade. This would increase the number of effectively neutral mutations (4Nₑs < 1), allowing more of the slightly deleterious amino acid changes to become fixed in the past. To address this concern, we estimated the ancestral effective population size. Takahata and Satta (36) and many others (37–39) showed that neutral divergence across loci between two closely related species can be informative about the level of polymorphism at the time of speciation. The maximum likelihood estimate of θ at silent sites in the common ancestor of D. melanogaster and D. simulans is 0.0396 per base pair (K. Zeng, personal communication), which is larger than the extant level of polymorphism. Moreover, the cosmopolitan sister species, D. simulans, is comparably polymorphic with D. melanogaster. The two species of Drosophila therefore do not appear to show evidence of a population size smaller than the current size since the time of speciation.

Second, it is possible that selection intensity fluctuates for reasons unrelated to population-size changes. For example, when a species colonizes a new territory, selective constraint may change for some, but not all, loci. As a result, the reduced A/S ratios for some loci in one species, relative to the A/S ratio between species, may reflect changes in the strength of selective constraints, rather than the operation of positive selection. Comparing divergence with polymorphism in both species may alleviate, albeit not eliminate, that concern. This was an element of the original proposal of the MK test (15), but the practice has rarely been followed.

We therefore examined polymorphism data available from the D. simulans sequencing project. Because the sequencing was performed on seven separate lines of D. simulans, we can identify polymorphic sites from a homologous set of genes. We find the same patterns as in the D. melanogaster data. The A/S ratio for divergence is higher than for common polymorphism, suggesting that the A/S ratio in the polymorphism of D. melanogaster is not at a historical low; thus, the level for divergence indeed appears high.

Discussion

Many studies have attempted to determine the extent of adaptive evolution at the genomic level (1–7) by contrasting divergence with polymorphism. There have been two major concerns about these studies: random gene sampling from the genome and the analysis of multiple loci simultaneously. We presented the criteria for random sampling in the Introduction and addressed the second issue extensively in Results.

The difficulty with the analysis of multiple genes in the MK test framework is that variants should either be on the same genealogy or completely unlinked, as with the Poisson Random Field model (40, 41). When there is partial linkage among sites, variation in P/D and A/S across linkage units (or genes) should be incorporated into the analysis. The negative correlation between P/D and A/S observed here is intriguing and deserves further investigation. It is plausible that the faster evolving genes (higher A/S) may have experienced more recent sweeps and hence are less variable (lower P/D). If true, previous conclusions about the presence of adaptive evolution in regions will likely stand, but the proof should be of a different kind.

By and large, previous conclusions of extensive adaptive evolution in the coding regions of Drosophila (1, 3–5) do turn out to be correct. Recently, the MK test has been applied to noncoding (N) sites, pooling variants from unlinked genomic segments (7). Because the observed FI is only 1.2 in that study, we caution against relying on an expected FI of 1 for interpreting deviation from neutrality. Even a weak correlation of P/D with N/S can generate a neutral expectation of FI > 1.2, leaving no excess of noncoding divergence.

Another noteworthy observation of this report is the contrasting MK test results between regions of low and normal recombination. Loci in regions of normal recombination bear the signature of adaptive evolution, whereas loci in regions of reduced recombination do not. In the discussions on background selection vs. selective sweep as an explanation for the effect of recombination on nucleotide diversity (Fig. 1a), the focus is generally on the influence of selection on the linked neutral variants. Below, we shall further consider the mutual influences among sites under selection.

Under the so-called Hill–Robertson (HR) effect (17), a reduction in recombination would tend to reduce the efficacy of selection. In regions of low recombination, deleterious mutations may thus linger longer and have a greater effect in removing polymorphism. In contrast, positively selected sites have a reduced probability of becoming fixed. As a result, regions of low recombination may be subjected to even stronger background selection but weaker hitchhiking, when the HR effect is in operation. This interference between hitchhiking and background selection has been partially analyzed (42) and deserves to be modeled more realistically in the future.

Our analysis indicates that adaptive positive selection on amino acid changes is prevalent in D. melanogaster. A large unbiased sample of genes and unbiased statistical analyses are both indispensable for reaching such a conclusion.

Materials and Methods

We sequenced 419 loci in 21 lines of D. melanogaster and one line each of D. simulans, D. mauritiana, and D. sechellia. A complete list of the lines sampled and their geographic origin is shown in SI Table 5. The sequenced segments consisted of 367 third-chromosome loci and 54 second-chromosome loci. Approximately one-third of the genes were selected from a set of genes broadly annotated as being involved in behavior, reproduction, and response to stimuli (43). Most of the remainder were chosen randomly from a list of all functionally annotated genes on the third chromosome, weighted by recombination rate so that coverage was denser in regions of higher recombination. The final set (≈10%) consisted of genes with no functional annotation, selected by the same random method. A complete list of loci sequenced appears in SI Table 6. Although these selection criteria do not conform to our stated ideal sampling strategy, differences between the genes selected by ontology and those selected randomly were small. In general, the ontology set showed less evidence of selection in our analyses than the remainder, making many of our conclusions conservative.

Population genetic statistics (θW, θΠ, and θL) were calculated by using custom C++ software based on the libsequence package (44). Numbers of silent and replacement sites in an alignment and the values Ka and Ks were calculated according to the methods of Comeron (45). Recombination rates for each chromosomal band were assigned according to genome-wide estimates (46). Polymorphisms were assigned as ancestral or derived based on the most conservative outgroup (i.e., if a polymorphic site shared an allele with one outgroup and another with a different outgroup, the higher-frequency allele was assigned as ancestral).
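The ancestral-state rule in the last sentence can be sketched as follows. Only the tie-breaking rule for conflicting outgroups comes from the text; the function name, the input layout, and the choice to leave a site unpolarized (returning `None`) when no outgroup matches a segregating allele or when the conflicting alleles are equally frequent are assumptions made for illustration.

```python
from collections import Counter

def assign_ancestral(ingroup_alleles, outgroup_alleles):
    """Return the allele treated as ancestral at one polymorphic site, or None.

    ingroup_alleles:  alleles observed in the D. melanogaster lines
    outgroup_alleles: one allele per outgroup species
    """
    counts = Counter(ingroup_alleles)
    # Segregating alleles that are also seen in at least one outgroup.
    supported = {allele for allele in outgroup_alleles if allele in counts}
    if not supported:
        return None                      # assumption: leave the site unpolarized
    if len(supported) == 1:
        return next(iter(supported))     # outgroups agree on one segregating allele
    # Outgroups disagree: take the higher-frequency allele as ancestral (the conservative call).
    ranked = sorted(supported, key=lambda allele: (-counts[allele], allele))
    if counts[ranked[0]] == counts[ranked[1]]:
        return None                      # assumption: exact ties are left unresolved
    return ranked[0]

site = ["A", "A", "A", "A", "G"]
print(assign_ancestral(site, ["A", "A", "A"]))  # A
print(assign_ancestral(site, ["A", "G", "A"]))  # A (conflict, A is more frequent)
```

Calling the more frequent allele ancestral is conservative in the sense that it never creates a high-frequency derived variant from an ambiguous site.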

Demographic parameters were estimated under a composite maximum-likelihood model. The model assumes that the African and cosmopolitan populations diverged T generations ago, with migration between the populations at rate m and a ratio of population sizes (Afr:Cos) r. Likelihood of the observed silent-site frequency spectrum was estimated for a grid of parameters, with one million coalescent trees simulated for each grid point (47, 48).

For more detailed methods, please see SI Text.

Acknowledgments

We thank D. J. Begun and C. H. Langley (Center for Population Biology, University of California, Davis) for permission to use the D. simulans polymorphism data and K. Thornton (Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY) and A. Eyre-Walker (Centre for the Study of Evolution and School of Life Sciences, University of Sussex, Brighton, U.K.) for providing software, code, and assistance with usage. M.-L. Wu assisted with fly stock maintenance and sample preparation. We also thank the associates from the Department of Genetics, Chinese National Human Genome Center at Shanghai, and Shanghai South Gene Technology Co., Ltd., for technical support. A. Eyre-Walker, P. Andolfatto, J. Fay, M. Przeworski, C. Bustamante, J. Zhang, M. Kreitman, and members of the C.-I.W. laboratory provided critical commentary and helpful discussions. This work was supported by National Institutes of Health grants and an OOCS grant from Chinese Academy of Sciences (to C.-I.W.) and by a predoctoral fellowship from the Howard Hughes Medical Institute (to J.A.S.). W.H. is supported by Chinese High-Tech Program (no. 863; Grants 2002BA711A10 and 2004CB518605) and by Shanghai Municipal Foundation for Sciences and Technologies (Grants 03DJ14008 and 04DJ14003).

Abbreviations

MK: McDonald–Kreitman
FI: Fixation Index
P/D: polymorphism/divergence
A/S: amino acid/silent changes

Footnotes

The authors declare no conflict of interest.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. EF384276–EF392362).

This article contains supporting information online at www.pnas.org/cgi/content/full/0610385104/DC1.
