Abstract
The phylogenetic relationship of the now fully sequenced species Drosophila erecta and D. yakuba with respect to the D. melanogaster species complex has been a subject of controversy. All three possible groupings of the species have been reported in the past, though recent multi-gene studies suggest that D. erecta and D. yakuba are sister species. Using the whole genomes of each of these species as well as the four other fully sequenced species in the subgenus Sophophora, we set out to investigate the placement of D. erecta and D. yakuba in the D. melanogaster species group and to understand the cause of the past incongruence. Though we find that the phylogeny grouping D. erecta and D. yakuba together is the best supported, we also find widespread incongruence in nucleotide and amino acid substitutions, insertions and deletions, and gene trees. The time inferred to span the two key speciation events is short enough that under the coalescent model, the incongruence could be the result of incomplete lineage sorting. Consistent with the lineage-sorting hypothesis, substitutions supporting the same tree were spatially clustered. Support for the different trees was found to be linked to recombination such that adjacent genes support the same tree most often in regions of low recombination and substitutions supporting the same tree are most enriched roughly on the same scale as linkage disequilibrium, also consistent with lineage sorting. The incongruence was found to be statistically significant and robust to model and species choice. No systematic biases were found. We conclude that phylogenetic incongruence in the D. melanogaster species complex is the result, at least in part, of incomplete lineage sorting. Incomplete lineage sorting will likely cause phylogenetic incongruence in many comparative genomics datasets. Methods to infer the correct species tree, the history of every base in the genome, and comparative methods that control for and/or utilize this information will be valuable advancements for the field of comparative genomics.
Synopsis
To take full advantage of the growing number of genome sequences from different organisms, it is necessary to understand the evolutionary relationships (phylogeny) between organisms. Unfortunately, phylogenies inferred from individual genes often conflict, reflecting either poor inferences or real variation in the history of genes. In this study, the authors examine relationships within the Drosophila melanogaster species subgroup, a group of flies with three fully sequenced species in which phylogeny has been a source of controversy. Although the bulk of the data support a phylogeny with Drosophila melanogaster as an outgroup to sister species Drosophila erecta and Drosophila yakuba, large portions of their genes support alternative phylogenies. According to the authors, the most plausible explanation for these observations is that polymorphisms in the ancestral population were maintained during the two rapid speciation events that led to these species. Subsequent to speciation, polymorphisms were randomly fixed in each species, and in some cases non-sister species fixed the same ancestral polymorphisms, while sister species did not. In these cases the genes are correctly inferred to have conflicting phylogenies. The authors note that rapid speciation events will often lead to such conflict, which needs to be accounted for in evolutionary analyses.
Introduction
With the sequencing of 12 species from the genus Drosophila, the field of comparative genomics is now presented with the opportunity and challenge of understanding the function and history of every base in the model organism Drosophila melanogaster (Dmel). This process will hopefully result in the discovery of new biological phenomena and the development of new methodologies that will eventually help with the task of annotating other clades in the tree of life, particularly the human genome. Because most analyses of multiple genome sequences involve inferences about evolutionary history, they require an accurate description of the relationship of the species being analyzed.
The species history of the genus Drosophila has been the subject of numerous studies, and the consensus from the literature suggests that the relationship of the 12 sequenced species is well resolved, with the exception of the species within the Dmel species subgroup and perhaps the placement of the Hawaiian species, D. grimshawi, and the virilis-repleta species, D. virilis and D. mojavenis [1–5]. Within the Dmel species group, the placement of D. erecta (Dere) and D. yakuba (Dyak) relative to the Dmel lineage has been the subject of numerous conflicting studies [1–3,6–15]. Considering the placement of Dmel, Dere, and Dyak, all three of the possible phylogenies (Figure 1) have received support. The topology (Dmel,(Dere,Dyak)), which we shall refer to as tree 1, was supported by studies of polytene chromosome banding sequences [6], satellite DNA [7], the COI and COII mitochondrial genes [3], mitochondrial DNA [16], the fru gene [17], the Cu/Zn SOD gene [18], the H3 gene family [19], a concatenation of mitochondrial and nuclear genes [20], a concatenation of the genes Adh, Adhr, Gld, and ry [8], and a concatenation of the genes Adh, Amyrel, janA, janB, and Sod [9]. The topology ((Dmel,Dere),Dyak), which we shall refer to as tree 2, was supported by studies of an internal transcribed spacer region of ribosomal RNA genes [10], nucleotide sequences 5′ of the Amy gene [15] and the Adh gene [8,21]. The topology ((Dmel,Dyak),Dere), which we shall refer to as tree 3, was supported by studies of protein electrophoresis [11], mitochondrial DNA [12], single-copy nuclear and mitochondrial DNA hybridization [13], the Adh gene [1,14] and the Amy gene [15]. The support that each of these studies provides for the three phylogenies, however, is not uniformly strong. The most recent study by Ko et al. using the concatenation of multiple nuclear genes provides the most compelling evidence, with 100% bootstrap support, for the placement of Dere and Dyak as sister taxa relative to the Dmel lineage. That Ko et al. found such strong support for tree 1, despite using the Adh gene, which on its own has been found to support the other two trees, suggests that the past incongruence was likely the result of sampling variance [22,23]. Incongruence, however, can also be the result of numerous systematic biases [24–28] that are not overcome by increased sampling [29–31], as well as phylogenetically meaningful phenomena, such as lateral transfer [32] and incomplete linage sorting [25,33–48].
In this study, we set out to examine the possible causes of incongruence in this phylogeny and to investigate the placement of Dere and Dyak in the Dmel species subgroup, using the newly sequenced genomes in the genus Drosophila. Although we found that tree 1, placing Dere and Dyak as sister species, is the best-supported tree, we found genome-wide incongruence in substitutions, insertions/deletions (indels), and gene trees. We show that the branch separating the split of Dmel from the split of Dere and Dyak is sufficiently short that incomplete lineage sorting is a plausible explanation for the incongruence. We further show that the support for the three possible trees is nonrandomly distributed across the genome such that adjacent genes supporting the same tree are more likely in regions of low recombination, and substitutions supporting the same tree are most enriched roughly on the same scale as estimates of linkage disequilibrium, consistent with theoretical predictions under the coalescent [49]. We tested for obvious systematic biases and found that no factor we examined could account for the incongruence. We conclude by suggesting that incongruence due to incomplete lineage sorting has important implications for comparative genomics research.
Results
Comparative Annotation of Drosophila Species
To analyze the phylogenetic history of the gene compliment of each of the seven fully sequenced species in the subgenus Sophophora, we mapped Dmel gene annotations onto each unannotated genome. Dmel coding sequences (19,186) were mapped to potential orthologous regions in each species using TBLASTN, and GeneWise was used to build gene models based on the Dmel gene in each region. These GeneWise models were matched back to Dmel translations using BLASTP, and genes for which clear orthologs could be found were used in downstream analysis (see Methods). Peptide sequences from orthologs were aligned using TCoffee [50] and cDNA alignments were mapped onto the peptide alignments.
Species and Trees
Of these seven subgenus Sophophora species, we chose to use Dmel, Dere, Dyak, and D. ananassae (Dana) for our initial analysis of the placement of Dere and Dyak within the Dmel species subgroup (we examine the effects of species choice on our results below). Dmel was chosen because the annotations were mapped from Dmel, and it is the primary model organism of the subgenus. D. simulans (Dsim) and D. sechellia (Dsec) were excluded from initial analysis because they were assumed to provide mostly redundant information to Dmel and they reduced the number of clear orthologs spanning the species by 2,544 genes, presumably because of lower sequence coverage and issues regarding the assembly of polymorphic reads in Dsim. Dana was chosen over D. pseudoobscura (Dpse) because it is the closest fully sequenced outgroup to the Dmel species subgroup. More than 9,000 genes (9,405) were found to have clear orthologs in all four of the chosen species. Figure 1 shows the three possible unrooted trees relating the species.
Genome-Wide Incongruence
We began our analysis looking directly at the genome-wide counts of amino acid substitutions, nucleotide substitutions, and indel events that were informative with respect to each of the three possible trees (see Methods). For all three characters, tree 1, which groups Dere and Dyak together, was found to have the most support (Figure 2A–2C). By a majority-rule consensus, tree 1 would be inferred to be the species tree, consistent with the findings of Ko et al. [8]. The high proportion of substitutions and indels supporting the alternate trees, however, suggests a poorly resolved tree and pervasive incongruence.
What is the cause of this incongruence? The incongruent substitutions could be the product of any of a number of systematic biases, but the incongruent indels are unambiguous characters that are more difficult to explain as methodological artifacts [51,52]. The population genetic theory of the coalescent states that sufficiently close speciation events will lead to incongruence due to incomplete lineage sorting (Figure 3) [38]. Below we explore the compatibility of our data with the coalescent as well as test for possible systematic biases.
Maximum Likelihood Gene Trees Show Incongruence
We first repeated our analysis using maximum likelihood (ML) methods [53,54] to measure the informative divergence spanning the inferred speciation events and to test the robustness of the incongruent substitutions using more complex models of sequence evolution. ML analysis is not currently scalable to entire genomes in a single calculation, so we partitioned the genome into individual genes. If incomplete lineage sorting is the underlying cause of the incongruence, such a partition might also reveal variation in allelic histories that multigene concatenations could obscure [27,45,55]. Wanting to capture both the observed nucleotide and amino acid differences across the species [56], we used the F3×4 codon-based model from the PAML package [57] to compare the likelihood of each tree given each cDNA alignment (we test other models below). Consistent with the parsimony-based analysis, the majority of genes (57.8%) support tree 1, while a high proportion (42.2%) support the other two trees (Figure 2D).
The median synonymous divergence trees for the sets of genes supporting each tree are: (dmel:0.1301,(dere:0.1095,dyak:0.1201):0.0664,dana:1.3246) for tree 1, ((dmel:0.1744,dere:0.1076):0.0498,dyak:0.0757,dana:1.2871) for tree 2, and ((dmel:0.1801,dyak:0.1163):0.0454,dere:0.0719,dana:1.3147) for tree 3 (Figure 4). The branches between the speciation events are quite short, with the tree 1 branch being the longest at only 0.066, suggesting that these species split in rapid succession.
Incongruence Is Expected for These Species under the Coalescent
Is the time spanning these speciation events short enough to expect the observed levels of incongruence? Using the coalescent, the probability of congruence, or monophyly, can be directly calculated for the three-taxon case using the equation p(congruence) = 1 − 2 / 3exp(−t), where t is the time between speciation events in units of generations / 2Ne and Ne is the effective population size [58–60]. Figure 5 shows this probability graphically as a function of t. In order to go from an estimate of the informative divergence to this probability, the substitutions per site per year, the ancestral generation time and the ancestral population size must be known. Synonymous substitutions per site per year has been estimated to be in the range of 1–2 × 10−8 in Drosophila [1,13,61,62]. Generations per year for the extant taxa in the Dmel species subgroup is about ten and can be used as an estimate for the ancestral generation time [63]. The ancestral population size has been estimated in the range of 106 to 107, but this should be considered a poorly resolved parameter [64]. Theoretically, the median informative branch length measured above includes both divergence prior to the first speciation event and divergence between the two speciation events. If we take the informative divergence estimated from genes supporting the alternative trees to represent the expected amount of divergence prior to the first speciation event (0.05 and 0.045 for trees 2 and 3, respectively) and subtract their average (0.0475) from the tree 1 total informative divergence (0.066), we can get an estimate of the informative divergence spanning the two speciation events (0.019). This leads to an estimate of 9.5 × 105 to 1.9 × 106 years, or 9.5 × 106 to 1.9 × 107 generations. The range of values for t becomes 0.48 to 9.5, which produces probabilities for congruence in the range of 0.59 to 0.99995 (Figure 5). Although the uncertainty in these parameter estimates does not permit us to say that incongruence would be guaranteed, they do allow us to say that incongruence due to incomplete lineage sorting is expected under plausible assumptions about these species' ancestral population and speciation events.
Spatial Structure of Tree Support
Given that we observed incongruence in individual sites as well as for whole genes, we wanted to better understand the extent to which sites supporting the same tree are spatially correlated, with a particular interest in the compatibility of this structure with the incomplete lineage-sorting hypothesis. The above analysis of gene trees suggests that sites can be correlated out to the length of genes. To see if this correlation extends beyond individual genes we looked for blocks of adjacent genes supporting the same gene and tested for unusual block lengths. Using permutations of ML gene tree states to obtain significance, we found gene tree block lengths at expected frequencies, with the exception of an excess of long blocks supporting tree 3 in the range of 250 kb to 700 kb, three of which were highly significant (p < 0.05).
If the blocks of genes supporting the same tree were the product of incomplete lineage sorting, then regions of low recombination ought to have larger blocks [65]. Although the ancestral recombination rates are not known, we looked to see if block lengths are correlated with Dmel recombination rates [66]. We found a weak negative correlation for all blocks (Pearson's R = −0.13, p < 0.1) as well for blocks for each specific tree, with tree 2 blocks showing the strongest correlation (Pearson's R = −0.30, p < 0.05). These weak correlations suggest a minor role for recombination rates in determining the spatial structure of support for different trees across the genome; however, there are many reasons for why strong correlations would not be expected, including poorly conserved recombination rates across these species [67–69] and gene conversion in regions of low recombination [70–72]. Nonetheless, these weak correlations establish a connection between recombination and the spatial structure of support that is at least consistent with lineage sorting. We next looked at the spatial correlation of individual sites to understand the spatial correlation at a finer scale.
Using the whole-genome frequencies of informative amino acid and nucleotide substitutions supporting each tree, we looked to see if sites supporting the same tree are locally enriched across chromosomes (see Methods for more details). Figure 6 shows that informative amino acid and nucleotide substitutions supporting the same tree cluster together on the scale of less than 8 kb for trees 1 and 2 and less than 2 kb for tree 3. These local deviations in the frequencies of informative substitutions from the expected frequencies are quite highly significant (X 2 test, p < 10−10).
What forces might have shaped these clusters of informative sites supporting the same tree? Under the coalescent, linked neutrally evolving sites supporting the same tree have been proposed to be correlated at an expected distance equal to linkage disequilibrium [49]. Linkage disequilibrium in Dmel has been estimated to extend to the length of a few kilobases [73], suggesting that our results are consistent with theoretical expectations [49]. Theoretical considerations together with recent empirical evidence from Dmel, however, imply that neutral sites would not be expected to be in disequilibrium at distances greater than a few hundred base pairs [74,75], suggesting that perhaps selection has acted to increase the scale of these correlations [65]. Regardless of the influence of selection, the structure of the support for different trees across the genome is consistent with recombination acting within the context of incomplete lineage sorting.
Additional support for this conclusion comes from the observation that mitochondrial genes exhibit no incongruence (K. Montooth and D. Rand, personal communication). This is expected, as recombination is not thought to occur in the mitochondrial genome. While mitochondrial evolution differs from nuclear evolution in more ways than just recombination [76], the complete lack of incongruence is nevertheless striking.
Thus far we have presented results suggesting that incomplete lineage sorting is a plausible explanation for the observed incongruence. We next sought to rule out alternate explanations.
Statistical Support for Incongruence
Is the incongruence in gene trees unexpected given the strength of support for each inference? To address this question, we used the bootstrap [77] value, RELL [78], from 10,000 replicates as an estimate of the expected incongruence due to chance alone. Taylor and Piel have shown that for a large set of yeast genes, originally reported by Rokas et al [79], there is no significant difference between nonparametric bootstrap values and accuracy, as measured by congruence [80]. Earlier work suggests that bootstrap values are conservative and likely to underestimate accuracy [81,82]. Figure 7A shows the proportion of genes supporting each tree in bins of bootstrap value. Unlike the yeast phylogeny, our observed incongruence consistently exceeds that expected by bootstrap values. Thus, the incongruence for these four species using the F3×4 codon model appears to be statistically significant.
Incongruence Is Robust to Model Choice
We next tested whether the incongruence is robust to model choice. An empirical study of model choice and accuracy by Ren et al found that codon-based models are able to recover both recent and deep divergences well, while nucleotide-based models are less efficient at deep divergences and amino acid–based models are less efficient at recent divergences [56]. They also found that while more complex models fit the data better, they are not necessarily more accurate, a conclusion that has been made by other studies [83,84]. We looked at six models: nucleotide-based (HKY, HKY+G), codon-based (F3×4, F3×4+G), and amino acid–based (WAG+F, WAG+F+G) models both with and without a discrete gamma model of variable rates among sites (see Methods). Incongruence was found to exceed expected levels from bootstrap values across all models, suggesting that the incongruence is indeed robust to model choice (Figure S1).
Comparing congruence across models, simpler models seem to produce more congruence than more complex models (Table 1). For each of the three types of models, addition of a discrete gamma resulted in lower congruence. For the models without discrete gamma, HKY was more congruent than F3×4, which was more congruent than WAG+F, perhaps due to the relatively recent divergences in this phylogeny. Interestingly, the more complex models, F3×4+G for nucleotides and WAG+F+G for amino acids, fit the alignments better for most genes, according to Akaike's information criterion (Table 1) [85]. Thus, consistent with the finding of Ren et al. with the yeast dataset [56], more complex models fit the data better but produce less congruence.
Table 1.
Species Choice Does Not Explain the Observed Incongruence
To evaluate the robustness of the incongruence to species choice we examined the set of 5,784 genes for which a clear ortholog could be found in all seven fully sequenced species in the subgenous Sophophora: Dmel, Dsim, Dsec, Dere, Dyak, Dana, and Dpse. All 21 possible species combinations that include Dere and Dyak and at least one of Dmel, Dsim, and Dsec, as well as at least one of Dana and Dpse, were considered. The HKY model was used both because it was found to produce the most congruence in the original four species as well as because it is considerably more computationally efficient than the codon models. Across all species combinations, incongruence is consistently greater than expected from bootstrap values, suggesting that incongruence is not species choice dependent (Figures S1A and S2).
Ranking species combinations by levels of congruence reveals that our original species choice produces the most congruence (Table 2), suggesting that our estimates are conservative. The relative congruence of the species combinations appears nonrandom, with respect to presence or absence of individual species, so we calculated the average congruence for each species across the combinations containing that species. Although the average congruence is very similar for each species, we found that Dana (82.4%) contributes most to congruence, while Dsim (80.8%), Dsec (80.4%), and Dpse (79.7%) contribute roughly equally and Dmel (78.9%) actually contributes least to congruence. We note that the presence of Dmel in the most congruent species combination goes against this general trend, perhaps reflecting further complexities in the impact of species choice on congruence.
Table 2.
Consistency
Although the incongruence appears to be robust to model and species choice, a much more stringent test is to look at incongruence in the partition of genes that consistently support the same tree across all models and across all species combinations [86]. Of the 5,784 genes analyzed, 2,347 are consistent across all models and of those, 1,600 (68.2%) are congruent while 443 (18.9%) support tree 2 and 304 (12.9%) support tree 3. Similarly, 1,918 genes are consistent across species combinations and of those, 1,474 (76.8%) are congruent while 291 (15.2%) support tree 2 and 153 (8%) support tree 3. Finally, 970 genes are consistent across all models and all species combinations and of those, 804 (82.9%) are congruent, while 101 (10.4%) support tree 2 and 61 (6.3%) support tree 3. This conservative partitioning reduces the amount of incongruence but does not eliminate it. We note that under the incomplete lineage-sorting hypothesis, incongruent genes are expected to have accumulated fewer informative substitutions (Figure 4) and therefore might be expected to be less robust to such a consistency test.
To assess the statistical significance of the incongruence in the partition of genes consistent across all models and species combinations [31], we used the HKY model bootstrap values from the Dmel, Dere, Dyak, and Dana species combination to look at congruence as a function of bootstrap value. As shown in figure 7B, the congruence is less than expected for the highest bootstrap values. For the 521 genes with bootstrap values between 0.9 and 1.0, which is more than half of consistent genes, the incongruence was highly significant (X 2 test, p < 10−3).
To further test whether the statistical support from the incongruent genes is the result of consistent signal, as opposed to having hidden support [87] for tree 1, we concatenated the 804 consistent tree 1 genes, 101 consistent tree 2 genes, and 61 tree 3 genes into three large alignments and repeated the ML analysis for the Dmel, Dere, Dyak, and Dana species combination and the HKY model. Interestingly, each tree-specific concatenation supported its tree with 100% bootstrap support [88]. Thus, the signal for incongruence appears to be consistent, highly significant, and robust to model and species choice consistency partitioning.
Sequence and Evolutionary Properties
We next looked at sequence and evolutionary properties of the genes supporting each tree to see if any clear biases could explain the incongruence. The properties we examined are sequence quality, gene length (measured in ungapped codons in the alignment), base composition (GC content) across the species at each position in the codon, transition–transversion ratio (kappa), ratio of nonsynonymous to synonymous divergence (dN/dS), informative synonymous divergence (ISD), ratio of informative synonymous divergence to noninformative synonymous divergence (RINSD), and total synonymous divergence (TSD). Table S1 shows the correlation of bootstrap values to each of these properties for the whole set of genes, genes supporting each tree, the set of genes found to be consistent across models and species combinations, the genes that consistently supported each tree, and the set of inconsistent genes. Distributions for each property are shown in Figures 8 and S3–S8.
The strongest and most consistent correlations with bootstrap value are for ISD and RINSD (Table S1), which are in essence the signal and signal to noise. We've already shown that the median informative divergence in the genes supporting tree 1 is greater than that for the genes supporting trees 2 and 3 (Figure 4). Reflecting this, the distributions of ISD and RINSD for genes supporting trees 2 and 3 are shifted toward lower values compared to genes supporting tree 1 (Figures 8A and S3A). Comparing consistent genes and inconsistent genes reveals that nearly all genes with ISD values close to zero are classified as inconsistent (Figure 8B). Among consistent genes, those supporting trees 2 and 3 still have distributions of ISD and RINSD shifted slightly toward lower values compared to those supporting tree 1 (Figures 8C and S3B). The fact that incongruent genes are expected to have lower ISDs than congruent genes under the incomplete lineage-sorting model (see above), and the fact the ISD and RINSD distributions are highly overlapping for each of the three trees, suggests that lack of signal or low signal to noise cannot explain the observed incongruence.
The long branch out to Dana (Figure 4) presents the concern that the incongruence may be due to homoplasy and perhaps long-branch attraction. TSD is distributed nearly identically across all sets of genes, including consistent and inconsistent genes, with a very slight bias toward trees 2 and 3 genes; inconsistent genes have lower TSDs (Figures 8D and S4). Although this does not rule out homoplasy as a source for noise in the inference of gene trees, it appears that regions with high mutational rates are not biased toward supporting incongruent or inconsistent genes [89], making it a less likely explanatory factor. In addition, although the trees in Figure 4 are not ultrametric (leaves equidistant from internal nodes), they are biased in the opposite direction as would be expected under long branch attraction, with the shortest branch in the Dmel species subgroup pairing with the longest branch out to Dana. Thus, homoplasy and long-branch attraction do not appear to be responsible for the incongruence.
Another possibility is that sampling variance in short genes is leading to the incongruence [90]. We've already shown that a concatenation of the consistent genes supporting each tree gives 100% bootstrap support, making sampling variance an unlikely explanation. Gene length is very similar across the sets of genes supporting each tree, but tree 1 genes tend to be slightly longer than genes supporting trees 2 and 3 (Figure 8E). Gene length is also weakly correlated with bootstrap value for the whole set, consistent genes, and tree 1 genes (both inconsistent and consistent) (Table S1). Our above results on the spatial correlation of sites, however, suggest that genes that extend more than a few kilobases would not be expected to be enriched for sites supporting the same tree above their background frequencies. We also found that enrichment is most pronounced for tree 1 sites and less so for incongruent sites. This increased mosaic structure [91] in incongruent genes is likely to be responsible for most of the shift to slightly larger genes in the tree 1 genes. The influence of sampling variance, however, is reflected in the shift of inconsistent genes compared to consistent genes toward shorter lengths. Thus, the small decrease in long genes in the incongruent set is probably a result of the spatial clustering of sites, while the small increase in short genes may be a combination of that effect and noise from sampling variance. Regardless, gene length is so similar across trees that it is unlikely to explain the incongruence.
GC content has been estimated to vary considerably across the species in the Dmel species subgroup [92] and is therefore a major concern for systematic bias. We found that GC content is highly similar across species at first and second codon positions, but varied systematically at the third codon position (Figures 8F, S5A, and S5B). Dmel and Dana have nearly identical distributions of third codon position GC content, which is shifted toward lower values compared to Dere and Dyak, which also have nearly identical distributions. This bias in GC content across species is very conservative with respect to the inference of incongruent genes because the incongruence would need to overcome the signal from base composition alone [29]. To further verify that this bias only works to decrease the incongruence, we converted the cDNA alignments into Rs and Ys, for purines and pyrimidines, respectively, and repeated the ML analysis using the F81 model of evolution, effectively averaging the contribution of GC and AT content and only measuring transversions [29,93,94]. As expected, incongruence actually increases (45.2%) under the RY coding and is still statistically significant (Figure S9). Other methods, for example those of Galtier and Gouy [95,96] and Gu and Li [97], attempt to explicitly model nonstationary evolution, rather than control for it. These methods might reveal more precisely the underestimation of incongruence due to the base composition bias in these species but are not expected to provide an explanation for the observed incongruence.
Sequence qualities, transition–transversion ratios, and dN/dS values were found be distributed similarly across trees, suggesting they are unlikely factors for systematic bias (Figures S6–S8).
Sequence Properties Associated with Spatial Clustering
We last looked to see if the spatial clustering of sites supporting the same tree could be explained by evolutionary rate or base composition variation. To examine the relationship of evolutionary rate and the clustering of sites supporting each tree, we measured total divergence and the fraction of sites supporting each tree in overlapping windows across the chromosomes. For windows of sizes 5 kb or 1 kb, no correlation could be found between divergence and the fraction of sites supporting each tree, suggesting that evolutionary rate is unlikely to explain the spatial clustering. To test whether changes in GC content could explain the clustering of sites we used the RY-coded alignments (described above) [29,93,94] and repeated the spatial clustering analysis. Figure S10 shows that sites are still correlated in a similar range of a few kilobases, suggesting that variance in GC content is unlikely to be causing the spatial clustering of sites. Thus, both the incongruence as well as the spatial clustering of sites appear to be robust to the sequence and evolutionary properties examined.
Discussion
We initially set out to confirm the placement of Dere and Dyak as sister species, relative to the Dmel lineage, in the Dmel species subgroup, using the fully sequenced genomes of seven species in the subgenus Sophophora. Although we did find that the best-supported phylogeny is that which places Dere and Dyak as sister species, we also found pervasive incongruence of substitutions, indels, and gene trees (figure 2). While incongruence in substitutions and gene trees could be the result of systematic biases, the incongruent indels, particularly unique insertions, presented strong enough evidence for unbiased incongruence that we also considered incomplete lineage sorting as a possible explanation. Assuming plausible values of substitution rate, generation time, and ancestral population size, we found that the time between the split of Dmel and the split of Dere and Dyak is sufficiently short that incomplete lineage sorting would be expected (Figures 3–5). Interestingly, we observed that the support for each of the three trees has a spatial structure across the genome, which is related to low recombination, both locally and globally (Figure 6). This further supports the hypothesis that the observed incongruence is due, at least in part, to incomplete lineage sorting.
To test for other plausible explanations we examined model choice, species choice, and variation in sequence and evolutionary properties and found no obvious candidate factors to explain the incongruence or the spatial structure of support for trees (Tables 1 and 2; Figures 7, 8, S1–S10). We therefore conclude that incomplete lineage sorting is the best-going explanation for the lack of resolution in this phylogeny.
Nevertheless, we likely did not exhaust the possible tests for alternate hypotheses for incongruence and suspect that this dataset will prove an interesting area for systematic research, much as the Rokas et al. yeast dataset has [69]. Comparing our results to the yeast dataset reveals important differences: there is significant incongruence beyond what would be expected by chance (Figure 7A), the level of incongruence is relatively robust to model choice (Tables 1 and 2; Figures 7B and S1), and basic sequence properties, like GC content, vary in ways that are conservative with respect to the incongruence (Figures 8, S3–S10) [29]. Similar to the yeast dataset, however, we find that the evolutionary model that maximizes the congruence (or accuracy, as Ren et al. refer to it) is typically the simplest (HKY), while the model that fits the data best is the most complex (F3×4 +G) (Table 1) [56].
To further understand the extent and nature of incomplete lineage sorting in the Dmel species subgroup, we suggest several types of future studies. First, to further test the agreement of the observed incongruence with theoretical predictions, better estimates of the ancestral effective population size, mutation rates, time between speciation events, ancestral recombination events [98], and examining the effects of selection (both directional and balancing [99]) would be of clear benefit. In addition, of great interest will be studies of lineage sorting across all taxa in the species group (especially the Dsim species complex [39]) and the influence of migration and gene flow on the symmetry of lineage sorting (because tree 2 is asymmetrically favored). Genome-wide population data already exist for Dsim and are expected for Dmel, which have the potential to help in the effort to understand these processes. Finally, methodological improvements might include increased large-scale taxon sampling, particularly from closely related taxa outside the species subgroup, such as the D. suzukii and D. takahashii subgroups [3], would alleviate potential biases introduced by the long branches out to Dana and Dpse.
Although this study should prove quite valuable to the increasing numbers of comparative genomics researchers studying the genus Drosophila, we believe our findings have important implications for comparative genomics as a whole. The idea that speciation events have occurred in rapid bursts throughout the tree of life [100–102] is likely broadly understood (for example, the short branch connecting the human, mouse, and dog lineages [103]), but the idea that genomes may be mosaics of conflicting genealogies as a result of rapid speciation is perhaps less well appreciated. As more species are sequenced, particularly the dense taxon sampling that is currently beginning in model organism clades, increasing numbers of close speciation events will likely result in many cases of incomplete lineage sorting in genome-scale data. As many methods used in comparative genomics require an accurate phylogeny, the comparative genomics community must develop methods that are robust to or take into account variation in phylogeny.
We envision three types of methods that will need to be developed to appropriately account for this kind of variation. The first are methods that can infer the most likely species tree using an entire genome in a single calculation, considering lineage sorting explicitly. The second are methods that can infer the most likely history of every base in every species, given the species tree. Last, comparative genomics methods that use phylogenies would need to be altered to control for and utilize the output from the second kind of method. Progress is being made in the first two categories [27,38,47,48,98,104–113], although no currently available method can deal with a whole-genome dataset such as this one. Though well appreciated in the systematics and population genetics communities, the issue of incomplete lineage sorting is rarely considered in the bioinformatics and comparative genomics communities, so the third category of method is virtually nonexistent. Accounting for variation in evolutionary histories will have different effects on different classes of methods, but we suggest that parsimony-based methods would be most strongly affected. An important example of such a phylogeny-based method is genome-wide multiple alignment using a guide tree (i.e., [114,115]), which is the first step in nearly all comparative genomic analyses. The availability of genome-scale datasets such as the one analyzed here should allow rapid progress in all three of these types of methods; we suggest that their development will be of great benefit to the evolutionary and comparative genomics community in the near future.
Materials and Methods
Assemblies.
Dmel release 4.2 genome, cDNA, and translation sequences were downloaded from Flybase (http://www.flybase.net). Prepublication assemblies for Dere and Dana (dated August 1, 2005), sequenced and assembled by Agencourt Bioscience (http://www.agencourt.com), and for Dsec (dated October 28, 2005), assembled and sequenced by the Broad Institute (http://www.broad.mit.edu), were downloaded from the Berkeley AAA website (http://rana.lbl.gov/drosophila). The prepublication assemblies for Dyak (dated July 4, 2004) and Dsim (dated June 2, 2005) were downloaded from the Washington University School of Medicine Genome Sequencing Center's website (ftp://genome.wustl.edu/pub). The Dpse v1.04 assembly was downloaded from Flybase. Assemblies can be found in Datasets S1–S6. Sequencing traces corresponding to these genomes are in the National Center for Biotechnology Information (NCBI) trace archive (http://ncbi.nlm.nih.gov/Traces/trace.cgi; species_code, “DROSOPHILA ERECTA,” “DROSOPHILA YAKUBA,” “DROSOPHILA ANANASSAE,” “DROSOPHILA SIMULANS,” “DROSOPHILA SECHELLIA,” “DROSOPHILA PSEUDOOBSCURA”).
Comparative annotation.
Each of the sequence assemblies were annotated separately by mapping Dmel gene models onto the unannotated genome in a pairwise fashion using a modified reciprocal–BLAST approach [116] to assign orthology/paralogy relationships, and a comparative gene finder, GeneWise [117,118], to build gene models. The annotation pipeline consisted of three steps. (1) For each Dmel translation, we used the protein sequence as a NCBI TBLASTN [119] query (e-value threshold, 1 × 10−3) against the scaffolds of the target assembly. (2) The scaffolds were ordered by the hit e-value reported by TBLASTN, and up to two regions were selected from the two best scaffolds and used as input to construct gene models using GeneWise. To improve the chance of constructing a complete gene model using GeneWise, the regions were selected by clustering high-scoring pairs on the scaffold such that every high-scoring pair within 100 kb of another high-scoring pair was included in the same region, and a buffer of 10 kb was included at the ends of the regions. (3) The predicted translations of the models reported by GeneWise were then used as BLASTP queries against a database of Dmel translations, with an e-value threshold of 1 × 10−3.
We then assigned orthology/paralogy relationships using a heuristic algorithm that takes into account (1) the rank of the starting Dmel translation in the BLASTP results, (2) the rank of alternative translations from the gene corresponding to the starting Dmel translation, and (3) whether or not there were highly ranked hits to genes other than the gene corresponding to the starting Dmel translation. One-to-one orthology was assigned when the only top-ranked hits in the BLASTP results were translations from the gene corresponding to the starting Dmel translation. Hits that had e-values within one order of magnitude were considered equivalently ranked. For genes with more than one translation with clear orthologs in each species, the first historically annotated (translation with the lowest letter ID) was used to represent the gene.
cDNA and translation sequences can be found in Datasets S7–S18.
Informative substitutions and indels.
Informative substitutions supporting each tree were counted across all cDNA and peptide alignments. Only single substitutions that split the four species into two groups of two were considered. Informative substitutions for tree 1 grouped Dmel and Dana together and Dere and Dyak together. Likewise, tree 2 grouped Dmel and Dere together and tree 3 grouped Dmel and Dyak together.
Informative indels supporting each tree were counted across all peptide alignments. Indels were classified as informative in the same way that substitutions were. Indels were further filtered to avoid artifacts from alignment errors. Only indels with five amino acids of perfect identity in flanking sequences, with no mono-, di-, or tri-amino acid repeats, were included. Insertions were inferred based on an absence in Dana and one of the ingroup species. Such insertions, where the inserted sequence is the same in the two species containing it, provided strong, unambiguous characters.
ML gene trees.
The Codeml program of the PAML package (version 3.14) [57,120] was run on each gene using the following three unrooted trees: tree 1, ((Dmel,(Dere,Dyak),Dana); tree 2, ((Dmel,Dere),Dyak,Dana); and tree 3, ((Dmel,Dyak),Dere,Dana) (see Figure 1). Codeml was run using the F3×4 model, such that equilibrium codon frequencies were calculated from the average nucleotide frequencies at the three codon positions (CodonFreq = 2), amino amino acid distances were equal (aaDist = 0), one dN/dS value was estimated for all lineages using an initial value of 0.4 (model = 0, fix_omega = 0, omega = 0.4), the transition–transversion ratio was estimated with an initial value of 2 (fix_kappa = 0, kappa = 2), substitution rates across sites were set to be equal (fix_alpha = 1, alpha = 0), substitution rates were allowed to vary freely across lineages (clock = 0), and codons with ambiguous positions (gaps or Ns) were ignored (cleandata = 1).
Spatial analysis.
Based on the ML tree for each gene, the genome was divided up into blocks supporting each tree. A ten-gene sliding window was used to calculate a running average of the support for each tree along each chromosome. Each window was assigned a tree based on the most frequent genealogy in the window. Each gene was then reassigned a tree based on the most frequent tree of all the windows that contained it. This effectively allows the neighbors of a gene to influence its assignment, and near neighbors have more influence than far neighbors. Adjacent genes that support the same tree were combined together into blocks. To measure the significance of the size of the blocks, the labels for each gene in the genome were randomized 1,000 times and the blocks were recalculated for each replicate, using the windowing method described above. Recombination rates for a subset of genes in Dmel, calculated by Hey and Kliman [66] using the R statistic, were downloaded. The average R in each block was calculated where a gene could be found in their set. The Pearson correlation of the average R within blocks and the length of blocks was calculated using the R statistics package [121].
Informative substitutions in genes were used to look at the structure of support for the different trees across the genome independent of the likelihood inference. The counts of each type of informative substitution were calculated in 60 nonoverlapping 1-kb windows surrounding each informative substitution across all chromosomes. The frequency of each kind of informative substitution across the whole genome was used to calculate an expected count for each 1-kb window. In each window, the enrichment of informative substitutions supporting the same tree was calculated. The X 2 significance of windows was calculated by comparing the observed frequencies of informative mutations supporting each tree with the genome averages of those frequencies.
Bootstrap values.
RELL bootstrap values [78] from 10,000 replicates were taken from the Codeml output.
PAML models.
All models were run using the same settings as described above for F3×4 except where HKY (model = 4) or WAG+F (model = 3) was specified and where the gamma function was used (fix_alpha = 0, alpha = 1.0, ncatg = 8).
Akaike's information criterion.
Akaike's information criterion (AIC) was calculated as AIC = −2 ln L + 2 N, where L is the likelihood of the model given the data, and N is the degrees of freedom [85]. Only consistent genes were used in this analysis, so the tree was the same across all models. The likelihood and degrees of freedom were taken directly from PAML output. HKY, HKY+G, F3×4, and F3×4+G were compared, and WAG+F and WAG+F+G were compared.
Sequence and evolutionary properties analysis.
The sequence quality in each species was calculated as the mean sequence quality score of the coding bases. Bootstrap value, length, GC content, transition–transversion ratio, dN/dS, ISD, NSD, and TSD were taken directly from the PAML output for the ML tree from the original analysis using the F3×4 model and the Dmel, Dere, Dyak, and Dana species combination. The Spearman rank correlations were calculated using the R statistics package [121].
Divergence windows.
To examine the correlation of divergence with the proportion of sites supporting each tree in local areas across the genome we used 5-kb and 1-kb windows, overlapping by 2.5 kb and 0.5 kb, respectively. Using the synonymous site divergences reported by Codeml from the original analysis, we calculated the synonymous divergence per coding site in each window. We also calculated the proportion of sites supporting each tree in each window. Windows with no synonymous coding sites were excluded.
Supporting Information
Acknowledgments
We thank Agencourt Bioscience, the Broad Institute, and Washington University School of Medicine Genome Sequencing Center for prepublication access to the genome sequence data. We thank Hiroshi Akashi and two anonymous reviewers for critical reading of the manuscript and helpful suggestions. We thank Peter Andolfatto, Doris Bachtrog, John Novembre, Joshua Pollack, and Montgomery Slatkin for discussions regarding the coalescent and the spatial correlation of substitutions. We thank Matt Hahn and Alisha Holloway for advice on phylogenetics. We thank Angela Depace, Justin Fay, Hunter Fraser, Emily Hare, Suzanne Lee, Richard Lusk, Stewart McArthur, John Novembre, Joshua Pollack, Montgomery Slatkin, Erica Rosenblum, and Jody Westbrook for comments on the manuscript.
Abbreviations
- Dana,
Drosophila ananassae;
- Dere,
D. erecta;
- Dmel
D. melanogaster;
- dN/dS
ratio of nonsynonymous to synonymous divergence
- Dpse,
D. pseudoobscura
- Dsec,
D. sechellia;
- Dsim,
D. simulans;
- Dyak,
D. yakuba;
- indel
insertion/deletion
- ISD
informative synonymous divergence
- ML
maximum likelihood
- RINSD
ratio of informative synonymous divergence to non-informative synonymous divergence
- TSD
total synonymous divergence
Footnotes
Competing interests. The authors have declared that no competing interests exist.
A previous version of this article appeared as an Early Online Release on August 28, 2006 (DOI: 10.1371/journal.pgen.0020173.eor).
Author contributions. DAP designed the research, performed the research, analyzed the data, and wrote the paper. VNI contributed to the research design, the comparative annotations, and the gene blocks analysis. AMM contributed to the research design and the coalescent analysis. MBE contributed to the research design. All authors contributed to the writing of the paper.
Funding. This work was supported by a National Institutes Grant R01-HG002779 to MBE and by a Pew Biomedical Scholars award to MBE.
References
- Russo CA, Takezaki N, Nei M. Molecular phylogeny and divergence times of drosophilid species. Mol Biol Evol. 1995;12:391–404. doi: 10.1093/oxfordjournals.molbev.a040214. [DOI] [PubMed] [Google Scholar]
- Powell JR. Progress and prospects in evolutionary biology: The Drosophila model. New York: Oxford University Press; 1997. 562 [Google Scholar]
- Lewis RL, Beckenbach AT, Mooers AO. The phylogeny of the subgroups within the melanogaster species group: Likelihood tests on COI and COII sequences and a Bayesian estimate of phylogeny. Mol Phylogenet Evol. 2005;37:15–24. doi: 10.1016/j.ympev.2005.02.018. [DOI] [PubMed] [Google Scholar]
- O'Grady PM, Kidwell MG. Phylogeny of the subgenus sophophora (Diptera: drosophilidae) based on combined analysis of nuclear and mitochondrial sequences. Mol Phylogenet Evol. 2002;22:442–453. doi: 10.1006/mpev.2001.1053. [DOI] [PubMed] [Google Scholar]
- Remsen J, O'Grady P. Phylogeny of Drosophilinae (Diptera: Drosophilidae), with comments on combined analysis and character support. Mol Phylogenet Evol. 2002;24:249–264. doi: 10.1016/s1055-7903(02)00226-9. [DOI] [PubMed] [Google Scholar]
- Lemeunier F, Ashburner MA. Relationships within the melanogaster species subgroup of the genus Drosophila (Sophophora). II. Phylogenetic relationships between six species based upon polytene chromosome banding sequences. Proc R Soc Lond B Biol Sci. 1976;193:275–294. doi: 10.1098/rspb.1976.0046. [DOI] [PubMed] [Google Scholar]
- Barnes SR, Webb DA, Dover G. The distribution of satellite and main-band DNA components in the melanogaster species subgroup of Drosophila. I. Fractionation of DNA in actinomycin D and distamycin A density gradients. Chromosoma. 1978;67:341–363. doi: 10.1007/BF00285965. [DOI] [PubMed] [Google Scholar]
- Ko WY, David RM, Akashi H. Molecular phylogeny of the Drosophila melanogaster species subgroup. J Mol Evol. 2003;57:562–573. doi: 10.1007/s00239-003-2510-x. [DOI] [PubMed] [Google Scholar]
- Parsch J. Selective constraints on intron evolution in Drosophila . Genetics. 2003;165:1843–1851. doi: 10.1093/genetics/165.4.1843. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schlotterer C, Hauser MT, von Haeseler A, Tautz D. Comparative evolutionary analysis of rDNA ITS regions in Drosophila . Mol Biol Evol. 1994;11:513–522. doi: 10.1093/oxfordjournals.molbev.a040131. [DOI] [PubMed] [Google Scholar]
- Eisses K. Genetic differntiation within melanogaster species group of the genus Drosophila (Sophophora) . Evolution. 1979;33:1063–1068. doi: 10.1111/j.1558-5646.1979.tb04762.x. [DOI] [PubMed] [Google Scholar]
- Solignac M, Monnerot M, Mounolou JC. Mitochondrial DNA evolution in the melanogaster species subgroup of Drosophila . J Mol Evol. 1986;23:31–40. doi: 10.1007/BF02100996. [DOI] [PubMed] [Google Scholar]
- Caccone A, Amato GD, Powell JR. Rates and patterns of scnDNA and mtDNA divergence within the Drosophila melanogaster subgroup. Genetics. 1988;118:671–683. doi: 10.1093/genetics/118.4.671. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jeffs PS, Holmes EC, Ashburner M. The molecular evolution of the alcohol dehydrogenase and alcohol dehydrogenase-related genes in the Drosophila melanogaster species subgroup. Mol Biol Evol. 1994;11:287–304. doi: 10.1093/oxfordjournals.molbev.a040110. [DOI] [PubMed] [Google Scholar]
- Shibata H, Yamazaki T. Molecular evolution of the duplicated Amy locus in the Drosophila melanogaster species subgroup: Concerted evolution only in the coding region and an excess of nonsynonymous substitutions in speciation. Genetics. 1995;141:223–236. doi: 10.1093/genetics/141.1.223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nigro L, Solignac M, Sharp PM. Mitochondrial DNA sequence divergence in the Melanogaster and oriental species subgroups of Drosophila . J Mol Evol. 1991;33:156–162. doi: 10.1007/BF02193630. [DOI] [PubMed] [Google Scholar]
- Gailey DA, Ho SK, Ohshima S, Liu JH, Eyassu M, et al. A phylogeny of the Drosophilidae using the sex-behaviour gene fruitless. Hereditas. 2000;133:81–83. doi: 10.1111/j.1601-5223.2000.00081.x. [DOI] [PubMed] [Google Scholar]
- Arhontaki K, Eliopoulos E, Goulielmos G, Kastanis P, Tsakas S, et al. Functional constraints of the Cu,Zn superoxide dismutase in species of the Drosophila melanogaster subgroup and phylogenetic analysis. J Mol Evol. 2002;55:745–756. doi: 10.1007/s00239-002-2370-9. [DOI] [PubMed] [Google Scholar]
- Matsuo Y. Molecular evolution of the histone 3 multigene family in the Drosophila melanogaster species subgroup. Mol Phylogenet Evol. 2000;16:339–343. doi: 10.1006/mpev.2000.0810. [DOI] [PubMed] [Google Scholar]
- Kopp A, True JR. Phylogeny of the oriental Drosophila melanogaster species group: A multilocus reconstruction. Syst Biol. 2002;51:786–805. doi: 10.1080/10635150290102410. [DOI] [PubMed] [Google Scholar]
- Moriyama EN, Gojobori T. Rates of synonymous substitution and base composition of nuclear genes in Drosophila . Genetics. 1992;130:855–864. doi: 10.1093/genetics/130.4.855. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rokas A, Carroll SB. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol Biol Evol. 2005;22:1337–1344. doi: 10.1093/molbev/msi121. [DOI] [PubMed] [Google Scholar]
- Gadagkar SR, Rosenberg MS, Kumar S. Inferring species phylogenies from multiple genes: Concatenated sequence tree versus consensus gene tree. J Exp Zoolog B Mol Dev Evol. 2005;304:64–74. doi: 10.1002/jez.b.21026. [DOI] [PubMed] [Google Scholar]
- Zwickl DJ, Hillis DM. Increased taxon sampling greatly reduces phylogenetic error. Syst Biol. 2002;51:588–598. doi: 10.1080/10635150290102339. [DOI] [PubMed] [Google Scholar]
- Arbogast BS, Edwards SV, Wakeley J, Beerli P, Slowinski JB. Estimating divergence times from molecular data on phylogenetic and population genetic timescales. Annu Rev Ecol Syst. 2002;33:707–740. [Google Scholar]
- Sanderson MJ, Shafer HB. Troubleshooting molecular phylogenetic analyses. Annu Rev Ecol Syst. 2002;33:49–72. [Google Scholar]
- Felsenstein J. Inferring phylogenies. Sunderland (Massachusetts): Sinauer Associates; 2004. 664 [Google Scholar]
- Jermiin LS, Poladian L, Charleston MA. Evolution. Is the “Big Bang” in animal evolution real? Science. 2005;310:1910–1911. doi: 10.1126/science.1122440. [DOI] [PubMed] [Google Scholar]
- Phillips MJ, Delsuc F, Penny D. Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 2004;21:1455–1458. doi: 10.1093/molbev/msh137. [DOI] [PubMed] [Google Scholar]
- Brinkmann H, van der Giezen M, Zhou Y, Poncelin de Raucourt G, Philippe H. An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst Biol. 2005;54:743–757. doi: 10.1080/10635150500234609. [DOI] [PubMed] [Google Scholar]
- Jeffroy O, Brinkmann H, Delsuc F, Philippe H. Phylogenomics: The beginning of incongruence? Trends Genet. 2006;22:225–231. doi: 10.1016/j.tig.2006.02.003. [DOI] [PubMed] [Google Scholar]
- Andersson JO. Lateral gene transfer in eukaryotes. Cell Mol Life Sci. 2005;62:1182–1197. doi: 10.1007/s00018-005-4539-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Avise JC, Shapira JF, Daniel SW, Aquadro CF, Lansman RA. Mitochondrial DNA differentiation during the speciation process in Peromyscus . Mol Biol Evol. 1983;1:38–56. doi: 10.1093/oxfordjournals.molbev.a040301. [DOI] [PubMed] [Google Scholar]
- Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988;5:568–583. doi: 10.1093/oxfordjournals.molbev.a040517. [DOI] [PubMed] [Google Scholar]
- Takahata N. Gene genealogy in three related populations: Consistency probability between gene and population trees. Genetics. 1989;122:957–966. doi: 10.1093/genetics/122.4.957. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu CI. Inferences of species phylogeny in relation to segregation of ancient polymorphisms. Genetics. 1991;127:429–435. doi: 10.1093/genetics/127.2.429. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hudson RR. Gene trees, species trees and the segregation of ancestral alleles. Genetics. 1992;131:509–513. doi: 10.1093/genetics/131.2.509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maddison WP. Gene trees in species trees. Syst Biol. 1997;46:523–536. [Google Scholar]
- Kliman RM, Andolfatto P, Coyne JA, Depaulis F, Kreitman M, et al. The population genetics of the origin and divergence of the Drosophila simulans complex species. Genetics. 2000;156:1913–1931. doi: 10.1093/genetics/156.4.1913. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen FC, Li WH. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet. 2001;68:444–456. doi: 10.1086/318206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rosenberg NA. The probability of topological concordance of gene trees and species trees. Theor Popul Biol. 2002;61:225–247. doi: 10.1006/tpbi.2001.1568. [DOI] [PubMed] [Google Scholar]
- Rosenberg NA, Nordborg M. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet. 2002;3:380–390. doi: 10.1038/nrg795. [DOI] [PubMed] [Google Scholar]
- Rosenberg NA. The shapes of neutral gene genealogies in two species: Probabilities of monophyly, paraphyly, and polyphyly in a coalescent model. Evolution Int J Org Evolution. 2003;57:1465–1477. doi: 10.1111/j.0014-3820.2003.tb00355.x. [DOI] [PubMed] [Google Scholar]
- Holland BR, Huber KT, Moulton V, Lockhart PJ. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol Biol Evol. 2004;21:1459–1461. doi: 10.1093/molbev/msh145. [DOI] [PubMed] [Google Scholar]
- Mossel E, Vigoda E. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science. 2005;309:2207–2209. doi: 10.1126/science.1115493. [DOI] [PubMed] [Google Scholar]
- Osada N, Wu CI. Inferring the mode of speciation from genomic data: A study of the great apes. Genetics. 2005;169:259–264. doi: 10.1534/genetics.104.029231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Degnan JH, Salter LA. Gene tree distributions under the coalescent process. Evolution Int J Org Evolution. 2005;59:24–37. [PubMed] [Google Scholar]
- Maddison WP, Knowles LL. Inferring phylogeny despite incomplete lineage sorting. Syst Biol. 2006;55:21–30. doi: 10.1080/10635150500354928. [DOI] [PubMed] [Google Scholar]
- Slatkin M, Pollack JL. The concordance of gene trees and species trees at two linked loci. Genetics. 2006;172:1979–1984. doi: 10.1534/genetics.105.049593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302:205–217. doi: 10.1006/jmbi.2000.4042. [DOI] [PubMed] [Google Scholar]
- Rokas A, Holland PW. Rare genomic changes as a tool for phylogenetics. Trends Ecol Evol. 2000;15:454–459. doi: 10.1016/s0169-5347(00)01967-4. [DOI] [PubMed] [Google Scholar]
- Ogurtsov AY, Sunyaev S, Kondrashov AS. Indel-based evolutionary distance and mouse-human divergence. Genome Res. 2004;14:1610–1616. doi: 10.1101/gr.2450504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981;17:368–376. doi: 10.1007/BF01734359. [DOI] [PubMed] [Google Scholar]
- Huelsenbeck JP. The robustness of two phylogenetic methods: Four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol Biol Evol. 1995;12:843–849. doi: 10.1093/oxfordjournals.molbev.a040261. [DOI] [PubMed] [Google Scholar]
- Holland BR, Jermiin LS, Moulton V. Proceedings of the SMBE Tri-National Young Investigators' Workshop 2005. Improved consensus network techniques for genome-scale phylogeny. Mol Biol Evol. 2006;23:848–855. doi: 10.1093/molbev/msj061. [DOI] [PubMed] [Google Scholar]
- Ren F, Tanaka H, Yang Z. An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst Biol. 2005;54:808–818. doi: 10.1080/10635150500354688. [DOI] [PubMed] [Google Scholar]
- Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13:555–556. doi: 10.1093/bioinformatics/13.5.555. [DOI] [PubMed] [Google Scholar]
- Kingman JFC. The coalescent. Stoch Proc Appl. 1982;13:235–248. [Google Scholar]
- Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–460. doi: 10.1093/genetics/105.2.437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nei M. Stochastic errors in DNA evolution and molecular phylogeny. Prog Clin Biol Res. 1986;218:133–147. [PubMed] [Google Scholar]
- Sharp PM, Li WH. On the rate of DNA sequence evolution in Drosophila . J Mol Evol. 1989;28:398–402. doi: 10.1007/BF02603075. [DOI] [PubMed] [Google Scholar]
- Li YJ, Satta Y, Takahata N. Paleo-demography of the Drosophila melanogaster subgroup: Application of the maximum likelihood method. Genes Genet Syst. 1999;74:117–127. doi: 10.1266/ggs.74.117. [DOI] [PubMed] [Google Scholar]
- Sawyer SA, Hartl DL. Population genetics of polymorphism and divergence. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Singh RS. Population genetics and evolution of species related to Drosophila melanogaster . Annu Rev Genet. 1989;23:425–453. doi: 10.1146/annurev.ge.23.120189.002233. [DOI] [PubMed] [Google Scholar]
- Wiuf C, Zhao K, Innan H, Nordborg M. The probability and chromosomal extent of trans-specific polymorphism. Genetics. 2004;168:2363–2372. doi: 10.1534/genetics.104.029488. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hey J, Kliman RM. Interactions between natural selection, recombination and gene density in the genes of Drosophila . Genetics. 2002;160:595–608. doi: 10.1093/genetics/160.2.595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- True JR, Mercer JM, Laurie CC. Differences in crossover frequency and distribution among three sibling species of Drosophila . Genetics. 1996;142:507–523. doi: 10.1093/genetics/142.2.507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Takano-Shimizu T. Local changes in GC/AT substitution biases and in crossover frequencies on Drosophila chromosomes. Mol Biol Evol. 2001;18:606–619. doi: 10.1093/oxfordjournals.molbev.a003841. [DOI] [PubMed] [Google Scholar]
- Wang W, Thornton K, Emerson JJ, Long M. Nucleotide variation and recombination along the fourth chromosome in Drosophila simulans . Genetics. 2004;166:1783–1794. doi: 10.1534/genetics.166.4.1783. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Langley CH, Lazzaro BP, Phillips W, Heikkinen E, Braverman JM. Linkage disequilibria and the site frequency spectra in the su(s) and su(w(a)) regions of the Drosophila melanogaster X chromosome. Genetics. 2000;156:1837–1852. doi: 10.1093/genetics/156.4.1837. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Andolfatto P, Wall JD. Linkage disequilibrium patterns across a recombination gradient in African Drosophila melanogaster . Genetics. 2003;165:1289–1305. doi: 10.1093/genetics/165.3.1289. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Braverman JM, Lazzaro BP, Aguade M, Langley CH. DNA sequence polymorphism and divergence at the erect wing and suppressor of sable loci of Drosophila melanogaster and D. simulans . Genetics. 2005;170:1153–1165. doi: 10.1534/genetics.104.033456. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zapata C, Alvarez G. On the detection of nonrandom associations between DNA polymorphisms in natural populations of Drosophila . Mol Biol Evol. 1993;10:823–841. doi: 10.1093/oxfordjournals.molbev.a040045. [DOI] [PubMed] [Google Scholar]
- Ohta T, Kimura M. Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics. 1971;68:571–580. doi: 10.1093/genetics/68.4.571. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thornton K, Andolfatto P. Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster . Genetics. 2006;172:1607–1619. doi: 10.1534/genetics.105.048223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bazin E, Glemin S, Galtier N. Population size does not influence mitochondrial genetic diversity in animals. Science. 2006;312:570–572. doi: 10.1126/science.1122033. [DOI] [PubMed] [Google Scholar]
- Felsenstein J. Confidence limits on phylogenies: An approach using the bootstrap. Evolution. 1985;39:783–791. doi: 10.1111/j.1558-5646.1985.tb00420.x. [DOI] [PubMed] [Google Scholar]
- Kishino H, Thorne JL, Bruno WJ. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol Biol Evol. 2001;18:352–361. doi: 10.1093/oxfordjournals.molbev.a003811. [DOI] [PubMed] [Google Scholar]
- Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. doi: 10.1038/nature02053. [DOI] [PubMed] [Google Scholar]
- Taylor DJ, Piel WH. An assessment of accuracy, error, and conflict with support values from genome-scale phylogenetic data. Mol Biol Evol. 2004;21:1534–1537. doi: 10.1093/molbev/msh156. [DOI] [PubMed] [Google Scholar]
- Hillis DM, Bull JJ. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol. 1993;42:182–192. [Google Scholar]
- Soltis PS, Soltis DE. Applying the bootstrap in phylogeny reconstruction. Stat Sci. 2003;18:256–267. [Google Scholar]
- Yang Z. How often do wrong models produce better phylogenies? Mol Biol Evol. 1997;14:105–108. doi: 10.1093/oxfordjournals.molbev.a025695. [DOI] [PubMed] [Google Scholar]
- Sullivan J, Swofford DL. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst Biol. 2001;50:723–729. doi: 10.1080/106351501753328848. [DOI] [PubMed] [Google Scholar]
- Akaike H. A new look at the statistical model identification. IEEE Trans Autom Contr. 1974;19:716–723. [Google Scholar]
- Gatesy J, Milinkovitch M, Waddell V, Stanhope M. Stability of cladistic relationships between Cetacea and higher-level artiodactyl taxa. Syst Biol. 1999;48:6–20. doi: 10.1080/106351599260409. [DOI] [PubMed] [Google Scholar]
- Gatesy J, Baker RH. Hidden likelihood support in genomic data: Can forty-five wrongs make a right? Syst Biol. 2005;54:483–492. doi: 10.1080/10635150590945368. [DOI] [PubMed] [Google Scholar]
- Cunningham CW. Can three incongruence tests predict when data should be combined? Mol Biol Evol. 1997;14:733–740. doi: 10.1093/oxfordjournals.molbev.a025813. [DOI] [PubMed] [Google Scholar]
- Wilcox TP, Garcia de Leon FJ, Hendrickson DA, Hillis DM. Convergence among cave catfishes: Long-branch attraction and a Bayesian relative rates test. Mol Phylogenet Evol. 2004;31:1101–1113. doi: 10.1016/j.ympev.2003.11.006. [DOI] [PubMed] [Google Scholar]
- Pollock DD, Zwickl DJ, McGuire JA, Hillis DM. Increased taxon sampling is advantageous for phylogenetic inference. Syst Biol. 2002;51:664–671. doi: 10.1080/10635150290102357. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hare M. Prospects for nuclear gene phylogeography. Trends Ecol Evol. 2001;16:700–706. [Google Scholar]
- Akashi H, Ko WY, Piao S, John A, Goel P, et al. Molecular evolution in the Drosophila melanogaster species subgroup: Frequent parameter fluctuations on the timescale of molecular divergence. Genetics. 2006;172:1711–1726. doi: 10.1534/genetics.105.049676. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. doi: 10.1016/s1055-7903(03)00057-5. [DOI] [PubMed] [Google Scholar]
- Delsuc F, Phillips MJ, Penny D. Comment on “Hexapod origins: Monophyletic or paraphyletic?”. Science. 2003;301:1482. doi: 10.1126/science.1086558. [DOI] [PubMed] [Google Scholar]
- Galtier N, Gouy M. Inferring phylogenies from DNA sequences of unequal base compositions. Proc Natl Acad Sci U S A. 1995;92:11317–11321. doi: 10.1073/pnas.92.24.11317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Galtier N, Gouy M. Inferring pattern and process: Maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol. 1998;15:871–879. doi: 10.1093/oxfordjournals.molbev.a025991. [DOI] [PubMed] [Google Scholar]
- Gu X, Li WH. Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution. Proc Natl Acad Sci U S A. 1998;95:5899–5905. doi: 10.1073/pnas.95.11.5899. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Husmeier D. Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics. 2005;21((Suppl 2)):ii166–172. doi: 10.1093/bioinformatics/bti1127. [DOI] [PubMed] [Google Scholar]
- Charlesworth D. Balancing selection and its effects on sequences in nearby genome regions. PLoS Gen. 2006;2:e64. doi: 10.1371/journal.pgen.0020064. DOI: 10.1371/journal.pgen.0020064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Adoutte A, Balavoine G, Lartillot N, Lespinet O, Prud'homme B, et al. The new animal phylogeny: Reliability and implications. Proc Natl Acad Sci U S A. 2000;97:4453–4456. doi: 10.1073/pnas.97.9.4453. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rokas A, Kruger D, Carroll SB. Animal evolution and the molecular signature of radiations compressed in time. Science. 2005;310:1933–1938. doi: 10.1126/science.1116759. [DOI] [PubMed] [Google Scholar]
- Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature. 2006;440:341–345. doi: 10.1038/nature04562. [DOI] [PubMed] [Google Scholar]
- Kirkness EF, Bafna V, Halpern AL, Levy S, Remington K, et al. The dog genome: Survey sequencing and comparative analysis. Science. 2003;301:1898–1903. doi: 10.1126/science.1086432. [DOI] [PubMed] [Google Scholar]
- Nielsen R. Maximum likelihood estimation of population divergence times and population phylogenies under the infinite sites model. Theor Popul Biol. 1998;53:143–151. doi: 10.1006/tpbi.1997.1348. [DOI] [PubMed] [Google Scholar]
- Edwards SV, Beerli P. Perspective: Gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution Int J Org Evolution. 2000;54:1839–1854. doi: 10.1111/j.0014-3820.2000.tb01231.x. [DOI] [PubMed] [Google Scholar]
- Nielsen R, Wakeley J. Distinguishing migration from isolation: A Markov chain Monte Carlo approach. Genetics. 2001;158:885–896. doi: 10.1093/genetics/158.2.885. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics. 2002;162:2025–2035. doi: 10.1093/genetics/162.4.2025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Knowles LL, Maddison WP. Statistical phylogeography. Mol Ecol. 2002;11:2623–2635. doi: 10.1046/j.1365-294x.2002.01637.x. [DOI] [PubMed] [Google Scholar]
- Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003;164:1645–1656. doi: 10.1093/genetics/164.4.1645. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wall JD. Estimating ancestral population sizes and divergence times. Genetics. 2003;163:395–404. doi: 10.1093/genetics/163.1.395. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beaumont MA, Rannala B. The Bayesian revolution in genetics. Nat Rev Genet. 2004;5:251–261. doi: 10.1038/nrg1318. [DOI] [PubMed] [Google Scholar]
- Hey J, Nielsen R. Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis . Genetics. 2004;167:747–760. doi: 10.1534/genetics.103.024182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Felsenstein J. Accuracy of coalescent likelihood estimates: Do we need more sites, more sequences, or more loci? Mol Biol Evol. 2006;23:691–700. doi: 10.1093/molbev/msj079. [DOI] [PubMed] [Google Scholar]
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, et al. LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731. doi: 10.1101/gr.926603. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715. doi: 10.1101/gr.1933104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wall DP, Fraser HB, Hirsh AE. Detecting putative orthologs. Bioinformatics. 2003;19:1710–1711. doi: 10.1093/bioinformatics/btg213. [DOI] [PubMed] [Google Scholar]
- Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004;14:988–995. doi: 10.1101/gr.1865504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Birney E, Durbin R. Using GeneWise in the Drosophila annotation experiment. Genome Res. 2000;10:547–548. doi: 10.1101/gr.10.4.547. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005;33:D39–D45. doi: 10.1093/nar/gki062. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Z, Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 2000;17:32–43. doi: 10.1093/oxfordjournals.molbev.a026236. [DOI] [PubMed] [Google Scholar]
- Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996;5:299–314. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.