Abstract
Comparative genomics provides a general methodology for discovering functional DNA elements and understanding their evolution. The availability of many related genomes enables more powerful analyses, but requires rigorous phylogenetic methods to resolve orthologous genes and regions. Here, we use 12 recently sequenced Drosophila genomes and nine fungal genomes to address the problem of accurate gene-tree reconstruction across many complete genomes. We show that existing phylogenetic methods that treat each gene tree in isolation show large-scale inaccuracies, largely due to insufficient phylogenetic information in individual genes. However, we find that gene trees exhibit common properties that can be exploited for evolutionary studies and accurate phylogenetic reconstruction. Evolutionary rates can be decoupled into gene-specific and species-specific components, which can be learned across complete genomes. We develop a phylogenetic reconstruction methodology that exploits these properties and achieves significantly higher accuracy, addressing the species-level heterotachy and enabling studies of gene evolution in the context of species evolution.
Comparative genomics of multiple related species has emerged as a powerful approach for the systematic discovery of evolutionarily conserved functional elements (Mouse Genome Sequencing Consortium 2002; Kellis et al. 2003; Ureta-Vidal et al. 2003; Miller et al. 2004; Richards et al. 2005), and for the identification of duplicated and rapidly evolving genes involved in the emergence of new functions (Jaillon et al. 2004; Kellis et al. 2004; Dehal and Boore 2005). Both types of analysis rely on an accurate mapping of orthologous and paralogous genes and regions across the species compared, accounting for all duplication and loss events (Fitch 1970).
Phylogenetic trees provide a rigorous framework for genome comparison (Woese et al. 1990; Baldauf et al. 2000; Murphy et al. 2001), naturally capturing gene duplication and loss, and allowing varying rates of sequence divergence across evolutionary time (Goodman et al. 1979; Page 1994; Eisen 1998). Phylogenies of orthologous genes across species can be used to study species evolution, each internal node representing a speciation event (Fig. 1A). Similarly, phylogenies of paralogous genes within a species can be used to study gene-family expansions, each internal node representing a gene-duplication event (Fig. 1B). Phylogenetics in the context of multiple complete genomes, known as phylogenomics (Eisen 1998), combines multiple orthologs and paralogs across many species in a general gene tree (Fig. 1C) and enables a much richer set of questions than ortholog trees or paralog trees alone (Ma et al. 2000; Zmasek and Eddy 2002; Storm and Sonnhammer 2003; Arvestad et al. 2004; Dufayard et al. 2005; Durand et al. 2006; Li et al. 2006; Huerta-Cepas et al. 2007). Its internal nodes thus represent both speciation and duplication events, and their ordering dictates the evolutionary history of a gene family across the species compared (Goodman et al. 1979; Page 1994). Ortholog and paralog relationships can be readily inferred by mapping general gene trees to the known phylogeny relating the species (Fig. 1D), in a process known as reconciliation. Reconciliation assumes that the species tree is known, and that the gene tree is correct. However, these assumptions have been found to be frequently violated (Rokas et al. 2003; Li et al. 2006), and erroneous gene trees can lead to incorrect ortholog and paralog assignments and many extraneous duplications and losses (Fig. 1E), thus distorting inferred patterns of gene-family expansion and contraction (Hahn 2007).
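To make the reconciliation step concrete, the following Python sketch maps each gene-tree node to the smallest species-tree clade containing its species (a last-common-ancestor mapping) and labels a node as a duplication when it maps to the same clade as one of its children. This is a minimal illustration under stated assumptions, not the procedure of any cited tool: the nested-tuple tree encoding, the function names `species_clades` and `reconcile`, and the three-species example are all hypothetical.

```python
def species_clades(tree, out):
    """Return the leaf set of `tree`, appending the leaf set of every node to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(species_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def reconcile(gene_tree, species_tree, gene_to_species):
    """Label each internal gene-tree node as 'speciation' or 'duplication'."""
    clades = []
    species_clades(species_tree, clades)

    def lca(species):
        # Smallest species clade that contains every species in the set.
        return min((c for c in clades if species <= c), key=len)

    events = []

    def visit(node):
        if isinstance(node, str):
            return lca(frozenset([gene_to_species[node]]))
        child_maps = [visit(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        # A node mapping to the same clade as a child implies a duplication.
        kind = "duplication" if here in child_maps else "speciation"
        events.append((kind, sorted(here)))
        return here

    visit(gene_tree)
    return events


# Hypothetical example: three species and one gene per species.
species_tree = (("A", "B"), "C")
gene_to_species = {"a1": "A", "b1": "B", "c1": "C"}
congruent = reconcile((("a1", "b1"), "c1"), species_tree, gene_to_species)
erroneous = reconcile((("a1", "c1"), "b1"), species_tree, gene_to_species)
```

In this toy case the congruent gene tree yields only speciation events, whereas the misplaced topology forces a duplication at the root (and, implicitly, compensating losses), which is the kind of extraneous event an erroneous gene tree introduces.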
In this work, we use the 12 recently sequenced Drosophila genomes (Drosophila 12 Genomes Consortium 2007) and nine publicly available fungal genomes (Wolfe and Shields 1997) to study the properties and reconstruction of gene family evolution in the context of complete genomes. Our work has three key contributions:
We show that many gene-tree incongruences in both flies and fungi are likely due to inaccuracies in phylogenetic reconstruction stemming from the lack of informative sites in the short alignments of individual genes. Indeed, we find that incongruences are most pronounced for short alignments and slow-evolving genes, and lead to the same alternate topologies as found in simulation due to reconstruction inaccuracies, suggesting that they are primarily methodological rather than biological.
We show that the substitution rate of any gene can be expressed as the product of a gene-specific rate, dictated by the selective constraints on the gene’s function (Dickerson 1971; Bromham and Penny 2003), and a species-specific rate, dictated by the reproductive and population dynamics of each lineage (Ohta and Kimura 1971), and we provide specific distributions for the two. This decomposition provides a surprisingly good fit to actual phylogenies in both flies and fungi, and can be used for accurate gene-tree reconstruction.
We present a probabilistic framework for distance-based gene-tree reconstruction in complete genomes, based on rate-distribution parameters learned from alignments of unambiguous orthologs and implemented in a publicly available tool, SPIDIR. We used SPIDIR to infer gene trees in both flies and fungi, and show that it leads to significantly higher reconstruction accuracies. In particular, we find that our strategy can address the long-branch attraction problem for species-level heterotachy (Bergsten 2005; Philippe et al. 2005), by learning to expect longer branches for faster-evolving lineages.
Results
Incongruences and inaccuracies of gene trees for syntenic orthologs
Numerous studies have addressed the accuracy of phylogeny reconstruction methods using mainly simulated alignments (Saitou and Imanishi 1989; Kuhner and Felsenstein 1994; Tateno et al. 1994; Philippe et al. 2005) and, in some cases, microevolution observed experimentally (Hillis et al. 1994; Bull et al. 1997; Woods et al. 2006). With multiple complete genomes, regions of conserved gene order (synteny) provide a natural test for phylogenetic methods (Rokas et al. 2003; Ciccarelli et al. 2006), since all genes within these regions are typically coinherited from a single gene in the common ancestor of the species (Fig. 2A). Therefore, in the absence of horizontal gene transfer, gene conversions, and incomplete lineage sorting (Avise et al. 1983; Koonin et al. 2001), their phylogenies should be perfectly congruent to the species phylogeny. However, recent studies have shown that phylogenetic trees obtained for different orthologs frequently disagree with the species phylogeny, resulting in large-scale incongruences (Rokas et al. 2003; Li et al. 2006; Hahn 2007; Huerta-Cepas et al. 2007).
Indeed, using 5154 syntenic one-to-one orthologs across 12 Drosophila genomes and 739 syntenic one-to-one orthologs across nine fungal genomes (see Methods), we found that existing phylogenetic methods recovered the known species topology (Drosophila 12 Genomes Consortium 2007; Stark et al. 2007), denoted T1, for only a small minority of gene trees (Fig. 2B), between 24% and 42% for flies and between 22% and 31% for fungi. This was true across all methods tested: PHYML (Guindon and Gascuel 2003), DNAML (Felsenstein 2005), MrBayes (Ronquist and Huelsenbeck 2003), BIONJ (Gascuel 1997), and Parsimony (Felsenstein 2005). It also held for both protein-coding and nucleotide alignments, and for various substitution models: HKY (Hasegawa et al. 1985), JTT (Jones et al. 1992), and synonymous-substitution dS (Yang 1997) (see Supplemental information). Moreover, no alternate topology was systematically favored: The next most frequent topologies for PHYML, denoted T2–T5, covered between 4% and 11% of fly trees, and an additional 305 topologies accounted for the remaining 31% of trees (Fig. 2B).
Biological mechanisms proposed for gene-tree incongruence, such as incomplete lineage sorting of pre-speciation alleles (Pollard et al. 2006; Wong et al. 2007), may be contributing to the observed incongruences, but are unlikely to explain all incongruent gene trees. Instead, we found multiple lines of evidence suggesting that algorithmic inaccuracies, rather than biological reasons, are likely responsible for a large fraction of the incongruent gene trees.
First, we found a clear, monotonic increase in recovery of congruent gene trees with the length of the corresponding genes (Fig. 2C), as expected for algorithmic accuracy based on simulation studies (Huelsenbeck 1995). For the length of a typical gene alignment (940 ungapped nucleotides), all methods showed accuracies around 40%. These were as low as 25% for shorter genes (<800 nt) and rose up to 60% for the longest genes (>2300 nt, corresponding to <10% of genes). The observed recovery vs. length correlation continued with increasing alignment lengths (90% for 20,000 nt, obtained by concatenating 20 randomly chosen genes) in agreement with lengths typically recommended to produce accurate species trees (Rokas et al. 2003; Ciccarelli et al. 2006). Of course, such lengths are unrealistic for individual genes, and concatenation is not an option for accurate gene-tree reconstruction.
Second, we found that genes with moderate divergence rates showed the highest performance, while most errors were found in very slow and very fast evolving genes. Reconstruction accuracy peaked for genes with 40%–50% sequence identity (reaching 48% accuracy), but was significantly reduced for slower evolving genes (25% accuracy for 70% identity) or faster evolving genes (35% accuracy for 20% identity; Supplemental Fig. S6). This can also be attributed to a lack of phylogenetically informative sites in slow-evolving genes (lacking sufficient events to resolve phylogenetic divergence order) and also in fast-evolving genes (as sites with many independent substitutions do not distinguish between different topologies). In contrast, incomplete lineage sorting is not expected to show such correlations.
Third, simulated phylogenies with the known species topology and similar branch lengths resulted in the same alternate topologies T2–T5 at comparable frequencies (e.g., 4%–11% vs. 3%–5% for PHYML in flies, Supplemental Fig. S3C), suggesting that even the most frequent incongruent topologies may result from reconstruction errors. In fact, the frequency of T2 + T4 corresponding to previously reported incomplete lineage sorting (Pollard et al. 2006; Wong et al. 2007) only differed by 8% between simulation and real data (9% vs. 17%), providing an estimate of the extent of incomplete lineage sorting. The correct phylogeny was recovered for 72% of simulated gene trees on average (an ∼30% increase over real data), potentially reflecting reduced discrepancies from model assumptions in simulated alignments (since the same model of evolution was used for reconstruction and simulation, while real alignments may violate this model), and potentially attributable to incomplete lineage sorting in true phylogenies. However, even if the increase is entirely due to incomplete lineage sorting in true phylogenies, it would only explain incongruences in, at most, 30% of trees, while 62% of fly trees and 76% of fungal trees were found to be incongruent. Thus, a significant portion of incongruences are likely due to reconstruction inaccuracies.
Lastly, if alternate topologies were due to biological reasons rather than methodological inaccuracies, we would expect them to be recovered with multiple methods, show high bootstrap support, and have significantly higher likelihood, none of which was the case. The frequencies of T2–T5 were reduced from 4%–11% to 1%–5% when all methods were required to agree (Supplemental Fig. S3D), and the phylogenetic trees that disagreed with the species topology showed significantly lower bootstrap support values (Supplemental Fig. S17). Moreover, among the 3102 gene trees where an alternative topology was selected by PHYML, only 5.7% had a significantly higher likelihood than the topology congruent to the species tree (SH test P < 0.01; PAUP) (Shimodaira and Hasegawa 1999), suggesting that many of these alternative topologies have insufficient support.
We conclude that a significant fraction of observed phylogenetic incongruences are due to inaccuracies in phylogenetic reconstruction (attributable to a lack of informative sites in the typical gene), and that additional information is necessary to increase the accuracy of gene-tree reconstruction. When gene trees are studied in isolation, it is likely that such information may not exist. However, in the phylogenomic setting, where thousands of gene trees involve only a relatively small number of species, there is an opportunity to learn common features shared among different gene trees, which can be used to guide gene-tree reconstruction. In the following section, we study fly and fungal gene trees and propose a model capturing their common properties. We then develop a novel inference algorithm that can use this information for accurate gene-tree reconstruction in the phylogenomic setting.
Gene- and species-specific substitution rates in phylogenomics
To take advantage of the phylogenomic setting, we sought to capture the fact that thousands of gene trees all evolve within the same species tree, and explicitly model their common properties. We expressed the substitution rate bi of each gene in each lineage as the product of two independent rates (Fig. 3A): a gene-specific substitution rate g dictated by the selective constraints imposed on the function of the gene (Bromham and Penny 2003), and a species-specific substitution rate si dictated by the time interval and evolutionary dynamics of each lineage i (e.g., population size, generation time, mating behavior, overall mutation rate). Our gene-specific rate is similar to site-specific scaling factors used in previous studies (Yang 1994; Felsenstein and Churchill 1996; Siepel et al. 2005; Kim and Pritchard 2007), and the independence of the two rates agrees with recently reported correlations in mammals and hominids (Cooper et al. 2003; Chimpanzee Sequencing and Analysis Consortium 2005).
To derive the properties and specific distributions for gene- and species-specific substitution rates, we revisited our 5154 syntenic fly orthologs, this time requiring each gene-tree topology to be congruent to the species-tree topology, and inferring branch lengths from pairwise distances by maximum likelihood (see Methods). From our model definition, we expect the gene rate to be proportional to the total branch length for any tree. Thus, for each resulting gene tree, we can estimate the gene rate g as the sum of all “absolute” branch lengths bi (representing the overall substitution rate across the entire tree), and the species-specific rate si for each branch as its “relative” length bi/g after normalization by the gene rate (representing the fraction of substitutions attributable to that lineage). We found that the gene rate g was distributed as a gamma distribution (Fig. 3B), as expected for a rate (Uzzell and Corbin 1971). We also found that each si was distributed as a normal distribution (Fig. 3C), reflecting small fluctuations around the expected mean rate μi for each branch, given the stochastic nature of nucleotide substitution. Relative branch lengths bi/g showed tighter distributions than absolute branch lengths bi (typically with standard deviations between 1/3 and 1/4 of the mean; Supplemental Figs. S7–S12).
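The estimation of g and si from a single congruent gene tree can be sketched in a few lines of Python. This is only an illustration of the normalization described in the text: the function name `decompose_branch_lengths`, the branch labels, and the branch-length values are hypothetical.

```python
def decompose_branch_lengths(branch_lengths):
    """Split absolute branch lengths b_i into a gene rate g and relative lengths b_i / g.

    branch_lengths maps a species-tree branch name to its absolute length b_i.
    Returns (g, relative), where g is the total tree length and relative[i] = b_i / g.
    """
    g = sum(branch_lengths.values())
    if g <= 0:
        raise ValueError("total tree length must be positive")
    relative = {name: b / g for name, b in branch_lengths.items()}
    return g, relative


# Hypothetical branch lengths (substitutions per site) for one gene tree.
example_tree = {"A": 0.10, "B": 0.30, "C": 0.20, "AB": 0.40}
g, rel = decompose_branch_lengths(example_tree)

# A uniformly faster gene (every branch doubled) has twice the gene rate
# but the same relative branch lengths.
fast_tree = {name: 2 * b for name, b in example_tree.items()}
g_fast, rel_fast = decompose_branch_lengths(fast_tree)
```

Because the relative lengths are unchanged when every branch is scaled by the same factor, they isolate the lineage-specific component from the overall speed of the gene.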
Our model makes the assumption that species-specific rates are independent of each other and of the gene rate. This assumption implies specific properties of gene trees, which we found to hold in the fungal and fly genomes. First, we would expect gene trees to be uniformly longer or shorter across different orthologs, appearing as scaled versions of an average gene tree, due to the common species rates si across gene trees. Indeed, pairs of gene trees showed strong correlations to each other: For example, Merlin and abnormal spindle showed correlation r = 0.96 (Fig. 3A), and 93% of genes showed correlations above r = 0.8 to the average gene tree (tree with average branch lengths across all genes) (Supplemental Fig. S14). Second, we would expect strong correlations between the absolute branch lengths from any pair of species, stemming from the common gene rate g. Indeed, the average pairwise correlation of absolute branch lengths was 0.61 across all pairs of species (Fig. 3F). Lastly, we would expect relative branch lengths, representing species-specific rates, to be independent of each other if their correlation was truly due primarily to the common gene-specific rate g, and indeed, we found that the average pairwise correlation of relative branch lengths dropped to 0.09 after normalization by the gene rate (Fig. 3G). For example, the correlation between Drosophila ananassae and Drosophila virilis was r = 0.81 for absolute branch lengths and 0.082 for relative branch lengths after normalization (Fig. 3D,E). All of these relationships, reported here for flies, also held for mammals and fungi (see Supplemental material).
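The expected behavior of absolute and relative branch lengths can be checked on simulated data. The sketch below assumes the model b_i = g × s_i with a gamma-distributed gene rate and independent normal species rates; all parameter values (shape, scale, branch means, and a standard deviation of one quarter of the mean) are hypothetical choices for illustration and are not the fitted fly parameters.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def simulate_branch_lengths(n_genes, mus, alpha, scale, rng):
    """Simulate absolute branch lengths b_i = g * s_i for n_genes gene trees."""
    trees = []
    for _ in range(n_genes):
        g = rng.gammavariate(alpha, scale)  # gene-specific rate
        # Species-specific rates, clamped at zero to keep lengths non-negative.
        trees.append([g * max(0.0, rng.gauss(mu, mu / 4.0)) for mu in mus])
    return trees


rng = random.Random(0)
mus = [0.05, 0.10, 0.15, 0.20, 0.05, 0.10, 0.10, 0.05, 0.10, 0.10]  # hypothetical means
absolute = simulate_branch_lengths(2000, mus, alpha=2.0, scale=0.5, rng=rng)
relative = [[b / sum(tree) for b in tree] for tree in absolute]

# Correlation between the first two branches, before and after normalization.
r_abs = pearson([t[0] for t in absolute], [t[1] for t in absolute])
r_rel = pearson([t[0] for t in relative], [t[1] for t in relative])
```

Under these assumptions, `r_abs` should be high because both branches share the same gene rate, while `r_rel` should be close to zero once that shared factor is divided out, mirroring the pattern reported for the fly data.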
SPIDIR: A machine-learning framework for phylogenomic gene-tree reconstruction
Our results suggest that substitution rates can be decoupled into gene-specific and species-specific rates g and si, and that these are well approximated by independent gamma and normal distributions, respectively. Based on these properties, we develop a generative model for gene-tree evolution across multiple complete genomes: The branch lengths of a gene tree are generated as the product of a gene-specific rate g, sampled from a gamma distribution g∼G = Γ(α, β), and a species-specific rate si for each lineage, sampled from a normal distribution si∼Si = N(μi, σi²), with all rates sampled independently of one another.
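A minimal sketch of this generative process in Python follows. It assumes that β in Γ(α, β) is a rate parameter (so the scale passed to the sampler is 1/β), which the text does not specify; the function name `sample_gene_tree`, the lineage names, and all parameter values are hypothetical.

```python
import random


def sample_gene_tree(alpha, beta, species_params, rng):
    """Draw one set of branch lengths b_i = g * s_i from the generative model.

    alpha, beta: shape and (assumed) rate of the gamma distribution for g.
    species_params: dict mapping lineage name to (mu_i, sigma_i).
    """
    # random.gammavariate takes (shape, scale), so an assumed rate beta becomes 1/beta.
    g = rng.gammavariate(alpha, 1.0 / beta)
    branches = {}
    for name, (mu, sigma) in species_params.items():
        s = rng.gauss(mu, sigma)            # species-specific rate
        branches[name] = g * max(s, 0.0)    # clamp so lengths stay non-negative
    return g, branches


# Hypothetical parameters for a three-lineage tree.
species_params = {"A": (0.2, 0.05), "B": (0.5, 0.10), "C": (0.3, 0.08)}
rng = random.Random(1)
g, branches = sample_gene_tree(2.0, 4.0, species_params, rng)

# Averaged over many draws, the gene rate should approach alpha / beta.
draws = [sample_gene_tree(2.0, 4.0, species_params, rng)[0] for _ in range(5000)]
mean_g = sum(draws) / len(draws)
```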
We used this generative model to develop a novel phylogenetic reconstruction method, called SPIDIR, for SPecies-Informed DIstance-based Reconstruction. Similarly to other likelihood-based methods, we search through a large number of gene tree topologies (Huelsenbeck and Ronquist 2001; Guindon and Gascuel 2003), evaluate the likelihood of each, and guide the search toward a maximum-likelihood tree. In contrast to existing methods, both phylogenetic (Gascuel 1997; Guindon and Gascuel 2003; Ronquist and Huelsenbeck 2003; Felsenstein 2005) and phylogenomic (Ma et al. 2000; Zmasek and Eddy 2002; Storm and Sonnhammer 2003; Arvestad et al. 2004; Dufayard et al. 2005; Durand et al. 2006), our algorithm works in two stages, first learning a model of gene and species evolution based on unambiguous orthologs, and then using this model for gene-tree reconstruction.
In the first stage (learning), we estimate the parameters of gene- and species-rate distributions based on alignments of unambiguous one-to-one orthologs across the species compared. As we focus on gene-tree reconstruction, we assume that the species tree is known, or can be reliably inferred using genome-scale information (e.g., using multigene analyses) (Rokas et al. 2003; Gadagkar et al. 2005; Ané et al. 2007; Edwards et al. 2007), or based on unique transposon insertion events (Kriegs et al. 2006). We also assume that a training set of genes with clear one-to-one orthology can be established, with phylogenies that are most likely congruent to the species tree (e.g., by using syntenic one-to-one orthologs, which, in the 12 fly species, include about one-third of all genes). Using the known species tree and multiple alignments of these unambiguous orthologs, we construct gene trees that are congruent to the species topology and estimate their absolute branch lengths bi using least-square error (see Methods). As each gene tree has exactly one gene from each species, we use the total tree length Σi(bi) as an estimate of the gene rate g, and the relative branch lengths bi/g as estimates of individual si (see Methods). This results in thousands of g and si estimates, to which we fit gamma and normal distributions, respectively, to infer (α, β, μi, σi) parameters.
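The learning stage can be illustrated as follows. The text says only that gamma and normal distributions are fitted to the g and si estimates; the sketch below uses method-of-moments estimates for the gamma parameters (with β treated as a rate) as a simple stand-in, which may differ from the fitting procedure actually used. The function name `fit_rate_distributions` and the toy training trees are hypothetical.

```python
import statistics


def fit_rate_distributions(training_trees):
    """Estimate (alpha, beta) for the gene rate and (mu_i, sigma_i) per branch.

    training_trees: list of dicts mapping branch name to absolute length b_i,
    each from a gene tree congruent to the species tree.
    """
    gene_rates = []
    relative = {}
    for tree in training_trees:
        g = sum(tree.values())                  # gene rate = total tree length
        gene_rates.append(g)
        for name, b in tree.items():
            relative.setdefault(name, []).append(b / g)

    # Method of moments for a gamma distribution with shape alpha and rate beta:
    # mean = alpha / beta and variance = alpha / beta**2.
    mean_g = statistics.fmean(gene_rates)
    var_g = statistics.variance(gene_rates)
    alpha = mean_g ** 2 / var_g
    beta = mean_g / var_g

    species = {
        name: (statistics.fmean(values), statistics.stdev(values))
        for name, values in relative.items()
    }
    return alpha, beta, species


# Hypothetical two-branch training trees (real training sets contain hundreds of trees).
training = [
    {"A": 0.1, "B": 0.3},
    {"A": 0.3, "B": 0.5},
    {"A": 0.2, "B": 0.4},
]
alpha, beta, species = fit_rate_distributions(training)
```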
In the second stage (inference), we use our model to reconstruct phylogenies of the remaining genes, which may contain duplication and loss events (Fig. 4). In this work, we use our model for distance-based reconstruction, and thus, the input for the inference stage is a pairwise distance matrix M (Fig. 4A) inferred from multiple sequence alignments of the genes in question (extensions directly incorporating sequence characters are possible, but will be the subject of future work). We then search across many proposed topologies to find the maximum-likelihood gene tree. For each proposed gene-tree topology T, branch lengths b are estimated from the distance matrix M using least-square error (Bryant and Waddell 1998) (Fig. 4B,E), and the likelihoods of these branch lengths bi are calculated according to our model, based on the learned parameters (α, β, μi, σi). When the proposed gene tree is congruent to the species tree (Fig. 4B–D), the rate estimation and probability calculation are straightforward: The probability of the observed branch lengths is simply the product of probabilities of the overall gene rate and the observed relative branch lengths: P(b|G,S) = P(g|G) ∏i [P(bi/g|Si)], each gene-tree branch bi uniquely mapping to a species-tree branch. When the proposed gene tree contains duplications and losses (Fig. 4E–G), our gene-rate estimate accounts for the missing data (see Methods), and the probability of relative branch lengths is estimated according to derived rate distributions, possibly spanning multiple species branches (see Methods). The nature of gene and species rate distributions within our model enables efficient likelihood computations for any gene-tree to species-tree reconciliation (see Methods).
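For the congruent case, the likelihood above can be evaluated directly from the learned parameters. The following Python sketch works in log space and makes two assumptions that the text leaves open: β is treated as the rate of the gamma distribution, and g is estimated as the total tree length. It does not cover trees with duplications and losses. The function names and the parameter values are hypothetical.

```python
import math


def gamma_logpdf(x, alpha, beta):
    """Log density of a gamma distribution with shape alpha and rate beta."""
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + (alpha - 1.0) * math.log(x) - beta * x)


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def congruent_tree_loglik(branch_lengths, alpha, beta, species_params):
    """log P(b|G,S) = log P(g|G) + sum_i log P(b_i/g | S_i) for a congruent gene tree."""
    g = sum(branch_lengths.values())  # assumed gene-rate estimate: total tree length
    loglik = gamma_logpdf(g, alpha, beta)
    for name, b in branch_lengths.items():
        mu, sigma = species_params[name]
        loglik += normal_logpdf(b / g, mu, sigma)
    return loglik


# Hypothetical learned parameters: lineage B is expected to be three times longer than A.
species_params = {"A": (0.25, 0.05), "B": (0.75, 0.05)}
expected_shape = {"A": 0.15, "B": 0.45}    # relative lengths match the learned means
unexpected_shape = {"A": 0.45, "B": 0.15}  # same total length, proportions swapped

ll_good = congruent_tree_loglik(expected_shape, 9.0, 15.0, species_params)
ll_bad = congruent_tree_loglik(unexpected_shape, 9.0, 15.0, species_params)
```

Both candidate branch-length sets have the same total length, so the gene-rate term is identical and the difference in log-likelihood comes entirely from how well the relative lengths match the learned species-rate distributions.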
As each tree is evaluated according to the distributions learned across the genomes, this framework allows us to distinguish between gene trees with unlikely branch lengths and gene trees whose observed branch lengths fit the learned distributions. For example, the correct gene-tree topology T1 (Kriegs et al. 2006) for orthologous mammalian hemoglobin-beta genes showed more than a 3.5-fold higher likelihood than an alternative topology T2 (Fig. 4), each branch providing a much closer fit to the expected rate distributions, and thus resulting in consistently higher likelihood values (Fig. 4I): The ancestral rodent branch alone showed a twofold increase in likelihood for the correct topology, as the observed length b is much closer to the mean of the corresponding distribution, while the corresponding branch length z in the alternate topology is significantly shorter than would be expected if the gene-tree topology were truly T2. In contrast, all traditional methods systematically selected the incorrect topology T2 for the hemoglobin-beta genes because of long-branch attraction due to the faster-evolving rodent branch (tree rooted by fish ortholog): Neighbor-joining showed eightfold higher bootstrap support for T2, parsimony showed a slight preference for T2, and traditional maximum likelihood showed 100-fold higher likelihood for T2 (Fig. 4H; Supplemental Fig. S18). This effect has also been observed for other genes (Cannarozzi et al. 2006). Our method was able to resolve the correct topology T1 because it expected a longer branch for the rodent lineage, which has an overall longer species-rate distribution Si.
When paralogs were compared, and the correct gene-tree topology differed from the species-tree topology, our method again led to the correct answer (Supplemental Fig. S16): Comparing rodent hemoglobin-beta to the paralogous human and dog hemoglobin-alpha correctly resulted in T2, since hemoglobin-alpha and -beta are paralogs resulting from an ancestral duplication well before the mammalian speciation. In this case, the likelihood of topology T1 dropped 50-fold, while the likelihood of topology T2 increased, leading to 14-fold higher likelihood for T2. Thus, our method was not biased to always select the species topology when the correct gene-tree topology differed, and was able to resolve paralogous gene trees even in the presence of gene duplication and loss.
Learning species-specific rates leads to increased accuracy
We implemented and tested two versions of SPIDIR (available online at http://compbio.mit.edu/spidir), one using solely rate-based information to guide the reconstruction without penalizing duplication and loss, and one with an explicit penalty for duplication, similar in nature to those described elsewhere (Goodman et al. 1979; Page and Charleston 1997; Durand et al. 2006) and derived from previously reported gene duplication rates of 0.0023 and 0.0013 dup/gene/myr (Lynch and Conery 2000; Hahn et al. 2007) (see Methods).
We tested both versions extensively using the 12 fly and nine fungal genomes. We trained our evolutionary models using 500 randomly selected fly trees and 200 fungal trees, and used the remaining 4654 fly trees and 539 fungal trees to test our performance compared with existing algorithms. As the vast majority of these gene trees are likely congruent to the species tree, we evaluated accuracy as the ability to recover the expected gene tree in each case. We evaluated accuracy separately for the nine fly genomes with >7× sequence coverage from a single strain, and for the complete set of 12 fly species, which includes two species sequenced at 3× coverage and one mosaic genome assembled using seven different strains (Drosophila 12 Genomes Consortium 2007) (see Methods).
We found significantly increased performance over existing methods for both flies and fungi, and for both single-copy and duplicated genes (Fig. 5A). For the nine fly genomes, SPIDIR recovered the correct gene tree for 62% of genes, significantly higher than the leading existing reconstruction methods (BIONJ at 48% and PHYML at 40%); this increased to 74% and 86% with inclusion of an explicit parameter for a gene-duplication probability (0.5 and 0.1, respectively). For the full set of 12 genomes, SPIDIR also showed a clear improvement (5% over BIONJ and 9% over PHYML), although the low-coverage lineages showed increased reconstruction errors (Supplemental Fig. S3B), which may be due to sequencing errors affecting our rate estimates for the very short branches of low-coverage species. For the nine fungi, SPIDIR recovered the correct gene tree for 42% of orthologs, a 10% increase over MrBayes, and 18% increase over PHYML, the leading existing methods; again, this increased to 62% and 78% with use of an explicit duplication parameter of 0.5 and 0.1. Lastly, we used doubly syntenic orthologs arising from whole-genome duplication (“ohnologs”) (Wolfe and Shields 1997; Kellis et al. 2004), to test SPIDIR’s ability to capture gene duplication and loss, inferring model parameters from 739 single-copy syntenic genes and testing performance on 138 duplication-containing gene families; again, we found a 10% improvement on the correct placement of each duplicate pair, which increased by an additional 10% with inclusion of an explicit duplication parameter. Each of these performance improvements was also seen for partial correctness of the gene tree, measured using Robinson-Foulds error (Robinson and Foulds 1981) (Supplemental Fig. S5).
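The Robinson-Foulds error used for partial correctness counts the bipartitions (splits) present in one tree but not the other. A compact Python sketch for unrooted comparison of two trees on the same leaf set is given below; the nested-tuple encoding, the function names, and the five-taxon example are hypothetical, and no normalization of the distance is applied.

```python
def clades(tree, out):
    """Return the leaf set of `tree`, appending the leaf set of each internal node to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset()
    for child in tree:
        leaves |= clades(child, out)
    out.append(leaves)
    return leaves


def splits(tree):
    """Set of nontrivial bipartitions, each stored as the side lacking a reference leaf."""
    out = []
    all_leaves = clades(tree, out)
    ref = min(all_leaves)
    result = set()
    for clade in out:
        side = clade if ref not in clade else all_leaves - clade
        # Skip trivial splits that separate zero or one leaf from the rest.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def robinson_foulds(tree1, tree2):
    """Number of splits found in exactly one of the two trees (same leaf set assumed)."""
    return len(splits(tree1) ^ splits(tree2))


# Hypothetical five-taxon example.
t_ref = ((("A", "B"), "C"), ("D", "E"))
t_alt = ((("A", "C"), "B"), ("D", "E"))
t_rerooted = (("A", "B"), ("C", ("D", "E")))  # same unrooted tree as t_ref

d_alt = robinson_foulds(t_ref, t_alt)
d_same = robinson_foulds(t_ref, t_rerooted)
```

Unlike the all-or-nothing topology match, this measure gives partial credit: in the example, `t_alt` shares one of its two internal splits with `t_ref`, while the rerooted tree is at distance zero.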
Moreover, SPIDIR accuracy correlated with the number of informative sites, suggesting that it uses available information fully: Performance monotonically increased with increasing gene lengths (Fig. 5B), and peaked for genes with moderate sequence divergence (Supplemental Fig. S6), surpassing existing methods for all length and divergence intervals. In addition, we found that reconstruction accuracy was consistently high, regardless of the gene function: Of 3700 GO terms, SPIDIR had higher reconstruction accuracy than PHYML for 3469 (93%), and of the remaining 231 GO terms (7%), none showed significant enrichment for alternate topologies (P > 0.185 hypergeometric). This suggests that the evolutionary parameters learned in our training set held across all genes tested, regardless of their specific function.
Finally, we found that our method showed no systematic biases toward the species topology. We simulated evolution according to the 10 most frequent ML topologies T1–T10, and asked which topology was inferred by the different prediction algorithms, summarizing the results in a “confusion matrix” (Fig. 5C; Methods). We found that T1 constituted only 6.2% of the SPIDIR-inferred trees for simulated topologies T2–T10, which is similar to PHYML (also 6.2%), confirming that our method is not biased. With duplication and loss penalties, the percentage of T1 increased to 15% for D = 0.5 and 30% for D = 0.1, as expected, but this may be desirable when, in fact, the species tree is known. In addition, both SPIDIR and PHYML identified 62%–65% of all alternate topologies T2–T10 correctly, although SPIDIR was only trained on T1. These trends should be a necessary test for phylogenetic methods that use species-level information to ensure lack of systematic biases.
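The bias test described above amounts to tabulating simulated (true) topologies against inferred ones. The Python sketch below builds a row-normalized confusion matrix and computes the fraction of non-T1 simulations that were inferred as T1; the function names and the short label lists are hypothetical and do not reproduce the reported results.

```python
from collections import Counter, defaultdict


def confusion_matrix(true_labels, inferred_labels):
    """Row-normalized confusion matrix: matrix[true][inferred] = fraction of cases."""
    counts = defaultdict(Counter)
    for true, inferred in zip(true_labels, inferred_labels):
        counts[true][inferred] += 1
    return {
        true: {inferred: n / sum(row.values()) for inferred, n in row.items()}
        for true, row in counts.items()
    }


def bias_toward(target, true_labels, inferred_labels):
    """Fraction of cases with a true label other than `target` that were inferred as `target`."""
    others = [inf for true, inf in zip(true_labels, inferred_labels) if true != target]
    return sum(1 for inf in others if inf == target) / len(others)


# Hypothetical simulation outcomes for three topologies.
simulated = ["T1", "T2", "T2", "T3", "T3", "T3", "T2", "T3"]
inferred = ["T1", "T2", "T1", "T3", "T3", "T2", "T2", "T3"]

matrix = confusion_matrix(simulated, inferred)
t1_bias = bias_toward("T1", simulated, inferred)
```

A method with no systematic pull toward the species topology should show a `bias_toward("T1", ...)` value comparable to that of a method that uses no species information, which is the comparison made with PHYML in the text.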
Discussion
We showed that gene trees are subject to two complementary forces of evolution: a gene-specific component, summarizing the selective pressures on individual gene functions, and a species-specific component, reflecting the divergence times and evolutionary dynamics of the species compared. We found gene- and species-specific substitution rates are independent and can be described by simple distributions that provide a very good fit to actual phylogenies. For both fly and fungal species, we found that a single gene rate was sufficient to model gene trees across the entire clades studied, although larger evolutionary distances and more diverse species groups may require modeling lineage-specific variations in this rate. More generally, we expect that the study of diverse groups of multiple complete genomes will reveal additional properties of gene and species phylogenies, enabling further increases in accuracy and potentially revealing new insights into gene evolution.
We used the decoupling of gene and species rates to introduce a novel approach for phylogenetic reconstruction that is specifically tailored for application in complete genomes. In contrast to existing methods, which treat each gene-tree reconstruction problem in isolation, our approach enables learning across hundreds of phylogenies to improve the accuracy of reconstructing any gene tree involving these species. We tested our method extensively and showed consistent improvements over existing methods for both flies and fungi, lack of bias with respect to the species topology, and increased performance across all lengths, functional categories, and in the presence of gene duplication and loss. Although we have applied our model solely for distance-based reconstruction, it is also applicable to character-based reconstruction. Specifically, our model can be viewed as specifying a prior probability on gene-tree branch lengths, which could replace the uniform branch length prior that is commonly used in maximum likelihood and Bayesian approaches, providing a promising direction for future development.
Several other models have been developed for modeling gene and species evolution simultaneously. One class of models has primarily addressed the inference of a species tree from many gene trees (Ané et al. 2007; Edwards et al. 2007), typically by considering only orthologous genes and assuming that every incongruent node is due to deep coalescence and incomplete lineage sorting. The second class of models has addressed gene-tree reconstruction, typically by assuming that the prevalent reasons for incongruences are gene duplication and loss (Ma et al. 2000; Zmasek and Eddy 2002; Storm and Sonnhammer 2003; Arvestad et al. 2004; Dufayard et al. 2005; Durand et al. 2006). Our model fits within the second class, demonstrating the effectiveness of learning branch-length distributions for gene-tree reconstruction in a generative model for gene tree evolution. A potential future direction for both types of work may be a joint modeling of deep coalescence and gene duplication/loss.
The methodology introduced here, although general, allowed us to address the problem of long-branch attraction at the species level (Bergsten 2005; Philippe et al. 2005). It is known that when fast-evolving lineages are intermixed with slowly evolving lineages, the longer branches tend to cluster together and join further back in evolutionary time, due to increased rates of homoplasy in rapidly evolving lineages. However, addressing long-branch attraction is still a major challenge in phylogenetics. Our decoupling of evolutionary rates allows us to capture heterotachy at the species level, since fast-evolving lineages are uniformly faster across the entire genome. As illustrated in our mammalian example, our model can learn to expect longer branches for faster-evolving lineages, thus recovering the true topology even when all existing methods suffer from long-branch attraction.
Although in this study we have focused our attention on phylogenetic reconstruction accuracy, decoupling gene-specific and species-specific rates can also be used to identify unusual cases of evolutionary change. In particular, it enables us to distinguish whether a long branch is simply due to an overall faster gene rate, a fast-evolving species, or specific acceleration for a particular gene in a given lineage. This is applicable at the level of individual genes, or for sets of genes within a functional category, to recognize evolutionary adaptation of individual genes or pathways. Such studies of acceleration or deceleration can be coupled with studies of positive selection (e.g., Ka/Ks), to detect lineage-specific changes in selective pressures.
The Saccharomycete and Drosophila groups are only the first two in an increasingly long series of groups of related species scheduled for dense sequencing, including 32 mammals, five worms, dozens of fungi, hundreds of bacteria, and thousands of viruses. The increasing number of species in comparative studies should lead to increased power, both for biological signal discovery and for evolutionary studies, but these studies will require increasingly rigorous methods for genome comparison that can scale to many species. Single-gene phylogenetic methods are unlikely to scale reliably to dozens of species, while phylogenomic methods should benefit from the abundance of information in complete genomes. The methodologies presented here are general and are likely to contribute significantly to the comparison and understanding of many complete genomes.
Methods
Genomic sequences
We selected the two largest groups of fully sequenced closely related species with long-range synteny across the entire group. The Drosophila genus includes D. melanogaster (Adams 2000), D. pseudoobscura (Richards 2005), and the 10 recently sequenced species D. sechellia, D. simulans, D. yakuba, D. erecta, D. ananassae, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi (Drosophila 12 Genomes Consortium 2007). Our fungal clade includes nine species: Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. bayanus, S. castellii (Cliften 2003; Kellis 2003), Candida glabrata, Kluyveromyces lactis (Dujon 2004), Ashbya gossypii (Dietrich 2004), and Kluyveromyces waltii (Kellis 2004).
Identifying syntenic orthologous genes
We constructed syntenic regions for the 12 flies by defining a syntenic block to be at least three genes within 200 kb of each other with no other blocks in between. For the fungal data set, we required at least three genes per block with a maximum uninterrupted gene separation of 20 kb. Syntenic blocks were filtered to keep only those containing exactly one gene from each species, thus removing potential segmental duplications. For our whole-genome duplication data set, we used S. cerevisiae, S. castellii, and C. glabrata ohnologs from the Yeast Gene Order Browser (YGOB) (Byrne and Wolfe 2005). Ohnologs were clustered by best reciprocal BLAST hits. Clusters were filtered such that exactly one ohnolog pair from each species is present. Clusters were extended to include genes from the remaining species using our syntenic alignments. Alignments were manually curated to remove possible gene conversion events.
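The gap-based block definition above can be illustrated with a minimal sketch. It assumes that the input is a list of already orthology-anchored genes on a single chromosome of one species, and it reads "within 200 kb of each other" as a limit on the distance between consecutive genes; the function name `syntenic_blocks` and the toy coordinates are hypothetical and are not taken from the original pipeline.

```python
from typing import List, Tuple

Gene = Tuple[str, int]  # (gene identifier, start coordinate in bp) on one chromosome


def syntenic_blocks(genes: List[Gene], max_gap: int = 200_000,
                    min_genes: int = 3) -> List[List[str]]:
    """Chain position-sorted genes into blocks; keep blocks with >= min_genes genes."""
    blocks: List[List[str]] = []
    current: List[str] = []
    last_pos = None
    for name, pos in sorted(genes, key=lambda g: g[1]):
        # A gap larger than max_gap closes the current block.
        if last_pos is not None and pos - last_pos > max_gap:
            if len(current) >= min_genes:
                blocks.append(current)
            current = []
        current.append(name)
        last_pos = pos
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


example = [("g1", 0), ("g2", 50_000), ("g3", 120_000),
           ("g4", 900_000), ("g5", 950_000)]
print(syntenic_blocks(example))  # [['g1', 'g2', 'g3']]
```

Under the same assumptions, the fungal setting would correspond to calling the function with `max_gap=20_000`.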
Species phylogeny
The currently accepted fly species phylogeny (Drosophila 12 Genomes Consortium 2007) is shown in Supplemental Figure S1. The major features of the fungal phylogeny are also widely accepted (Rokas et al. 2003; Hittinger et al. 2004; Byrne and Wolfe 2005); however, there is less agreement on the branch orders of the preduplication species K. waltii, K. lactis, and A. gossypii. The branching order we used (Supplemental Fig. S1) is the most frequent topology for all methods (ML, MAP, MP) on nucleotide alignments.
Alignments
We studied phylogenies of 5154 unambiguous fly orthologs and 739 unambiguous fungal orthologs. These genes were selected from regions of synteny that were filtered to be free of tandem duplications. For each of these ortholog sets, we produced multiple alignments of their protein sequences using MUSCLE (Edgar 2004). To obtain nucleotide alignments, we mapped the nucleotide sequences onto the peptide alignments, substituting every amino acid by the corresponding codon and every gap by a triplet of gaps.
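The mapping from a peptide alignment to a nucleotide alignment can be sketched as follows. This is an illustrative sketch only: the function name `back_translate` is hypothetical, gaps are assumed to be written as "-", and the coding sequence is assumed to be in frame with the peptide (any trailing stop codon is simply ignored).

```python
def back_translate(aligned_peptide: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto one row of a peptide alignment."""
    n_residues = sum(1 for aa in aligned_peptide if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than 3 x number of residues")
    out = []
    pos = 0  # index of the next unused codon in the coding sequence
    for aa in aligned_peptide:
        if aa == "-":
            out.append("---")  # every peptide gap becomes a triplet of gaps
        else:
            out.append(cds[pos:pos + 3])  # every residue becomes its codon
            pos += 3
    return "".join(out)


print(back_translate("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

Applying the function to every row of the peptide alignment yields a nucleotide alignment exactly three times as wide, with gaps only at codon boundaries.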
Model parameter learning
For each ortholog alignment, a rooted tree congruent to the species tree was constructed and fitted with our implementation of least-squares error on distances estimated by TREE-PUZZLE (Schmidt et al. 2002) using an HKY model. For consistency, the root was placed at the midpoint of the rooting branch. The branches of each tree were normalized by the total tree length. To estimate the parameters of our model, the mean and variance of the relative branch lengths were calculated for each species (4n-2 parameters for n species), and the alpha- and beta-parameters for the gene-specific rate were calculated with maximum-likelihood estimates from the total absolute branch lengths.
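The parameter-learning step can be sketched as below. Two simplifications are assumptions of this sketch and not the procedure used in the study: the gamma parameters are obtained by the method of moments as a simple stand-in for the maximum-likelihood fit, and the gamma distribution is parameterized by shape alpha and rate beta. The function name `learn_parameters` and the toy trees are hypothetical.

```python
from statistics import mean, variance
from typing import Dict, List, Tuple

Tree = Dict[str, float]  # species-branch name -> fitted absolute branch length


def learn_parameters(trees: List[Tree]) -> Tuple[Dict[str, Tuple[float, float]], float, float]:
    """Return per-branch (mean, variance) of relative lengths and gamma (alpha, beta)."""
    totals = [sum(t.values()) for t in trees]  # total length of each gene tree
    species_params = {}
    for branch in trees[0]:
        # Relative branch length: branch length divided by the total tree length.
        rel = [t[branch] / total for t, total in zip(trees, totals)]
        species_params[branch] = (mean(rel), variance(rel))
    # Method-of-moments gamma fit (assumption): mean = alpha/beta, var = alpha/beta^2.
    m, v = mean(totals), variance(totals)
    alpha = m * m / v
    beta = m / v
    return species_params, alpha, beta


toy_trees = [{"A": 0.1, "B": 0.3}, {"A": 0.2, "B": 0.2}, {"A": 0.3, "B": 0.5}]
params, alpha, beta = learn_parameters(toy_trees)
print(params, alpha, beta)
```

Because every tree is normalized by its own total length, the per-branch means of the relative lengths sum to 1, which provides a quick sanity check on the learned species-specific parameters.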
Generative model of gene-tree evolution
To define a generative model for gene-tree evolution with an arbitrary number of duplications and losses, we use a more general definition of reconciliation than is commonly used (Goodman et al. 1979; Page 1994). We define a reconciliation $R$ to be a mapping from each gene node $b_l$ to a species node $i$ and a duplication point $k_l$: $R(b_l) = (i, k_l)$.
If gene node $b_l$ is a duplication, $k_l$ defines the fraction along the species branch at which the gene duplication occurred: $k_l = \varepsilon$ if the duplication occurs immediately after the speciation of species $\mathrm{parent}(R(b_l))$, and $k_l = 1 - \varepsilon$ if the duplication occurs immediately before species $R(b_l)$. If $b_l$ represents a speciation, we define $k_l$ to be 1. For a duplication, we define $k_l$ to be distributed uniformly over $(0, 1)$, unless an ancestor $b_{l_2}$ of $b_l$ reconciles to the same species with duplication point $k_{l_2}$, in which case $k_l \sim \mathrm{Uniform}(k_{l_2}, 1)$.
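The nested-duplication rule can be illustrated with a short sketch for a chain of duplications that all reconcile to the same species branch, listed from the most ancestral to the most recent. The function name `sample_duplication_points` is hypothetical, and the chain structure (each duplication having the previous one as its nearest ancestor on that branch) is an assumption made for illustration.

```python
import random
from typing import List


def sample_duplication_points(n_nested: int, rng: random.Random) -> List[float]:
    """Sample duplication points for n_nested duplications on one species branch."""
    points: List[float] = []
    lower = 0.0  # the first duplication is uniform over the whole branch
    for _ in range(n_nested):
        k = rng.uniform(lower, 1.0)
        points.append(k)
        lower = k  # a descendant duplication cannot precede its ancestor
    return points


print(sample_duplication_points(3, random.Random(0)))
```

The sampled points are non-decreasing by construction, which reflects the requirement that a descendant duplication occurs later along the branch than its ancestor.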
For our model, we also define a reconciliation $R_b$ that maps gene branches to species branches. One complication is that a gene branch may map to a path of species branches and may use only a portion of the starting and ending species branch. Thus, we define: $R_b(b_l) = ((s_1, s_2, \ldots, s_m), (p_1, p_2, \ldots, p_m))$, where the vector $s_1, \ldots, s_m$ defines the path of branches in the species tree and $p_1, \ldots, p_m$ defines the portion of each species branch used by $R_b(b_l)$. Notice that the internal branch portions $p_2, \ldots, p_{m-1}$, if they exist, are always 1. Defining the duplication points $k_l$ immediately implies the values of $p_j$, and vice versa (see Supplemental Methods).
Above, we have presented the generative model for a gene branch that reconciles to exactly one species branch $((s_i), (p_1 = 1))$, namely, $b_l \sim G \cdot S_i = \Gamma(\alpha, \beta) \cdot \mathcal{N}(\mu_i, \sigma_i^2)$, where $G$ and $S_i$ are the gene- and species-specific rates. Here, we specify how to generate a branch length $b_l$ that reconciles across multiple species branches. We model such a $b_l$ as the product of a gene rate $g$ and a relative rate $x_l$, which is itself the sum of $m$ independent random variables $y_j$, each with the distribution $y_j \sim \mathcal{N}(p_j \mu_j, p_j \sigma_j^2)$. Thus, each branch $b_l$ in a gene tree is distributed as $b_l \sim G \cdot \sum_j y_j = \Gamma(\alpha, \beta) \cdot \sum_j \mathcal{N}(p_j \mu_j, p_j \sigma_j^2) = \Gamma(\alpha, \beta) \cdot \mathcal{N}(\sum_j p_j \mu_j, \sum_j p_j \sigma_j^2)$.
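The combination of partial species branches into a single relative-rate distribution can be sketched as follows. The sketch assumes that the gamma distribution is parameterized by shape alpha and rate beta (the text does not state the parameterization), and the names `path_rate_params` and `sample_branch_length`, as well as all numerical values, are hypothetical.

```python
import math
import random
from typing import List, Tuple

PathEntry = Tuple[float, float, float]  # (portion p_j, mean mu_j, variance sigma_j^2)


def path_rate_params(path: List[PathEntry]) -> Tuple[float, float]:
    """Mean and variance of the relative rate summed over the species-branch path."""
    mu = sum(p * m for p, m, _ in path)
    var = sum(p * v for p, _, v in path)
    return mu, var


def sample_branch_length(path: List[PathEntry], alpha: float, beta: float,
                         rng: random.Random) -> float:
    """Draw one gene-branch length as (gene rate) x (relative rate)."""
    mu, var = path_rate_params(path)
    g = rng.gammavariate(alpha, 1.0 / beta)  # gammavariate expects shape and scale
    x = rng.gauss(mu, math.sqrt(var))        # gauss expects a standard deviation
    return g * x


# A gene branch using 40% of the first, all of the second, and 50% of the last branch.
path = [(0.4, 0.10, 0.0004), (1.0, 0.05, 0.0001), (0.5, 0.20, 0.0009)]
print(path_rate_params(path))
print(sample_branch_length(path, alpha=2.0, beta=1.5, rng=random.Random(1)))
```

For a gene branch that reconciles to exactly one species branch with portion 1, the path has a single entry and the parameters reduce to those of that species branch.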
In the Supplemental Methods, we present an algorithm that uses this generative model to search for the gene-tree topology with maximum likelihood given its branch lengths, based on a heuristic search over gene-tree topologies.
Performance comparison
We compared our algorithm against a variety of the most popular and successful phylogeny programs. For a maximum-likelihood method, we used PHYML v2.4.4 (Guindon and Gascuel 2003). Nucleotide substitutions were modeled with the HKY model and peptide substitutions were modeled with JTT. For parsimony methods, we used PHYLIP's DNAPARS and PROTPARS programs. MrBayes v3.1.1, a Bayesian method, was used to find the maximum a posteriori phylogenetic tree (Ronquist and Huelsenbeck 2003). We used four chains, an automatic stop rule, a 25% burn-in, sampling every 10 generations from a total of 10,000 generations, a fixed BLOSUM model for peptides, and a 4by4 model for nucleotides, and we ensured that MrBayes reported the most likely binary tree. Lastly, we used the Neighbor-Joining program BIONJ (Gascuel 1997) with a variety of substitution models, including HKY built with TREE-PUZZLE (Schmidt et al. 2002), JTT built with PROTDIST (Felsenstein 2005), and dS built with the YN00 program from PAML v3.15 (Yang 1997). For all programs, default options were used unless otherwise stated. Parsimony methods perform consistently worse than the other methods (parsimony is not statistically consistent; Felsenstein 1978), as does the synonymous-substitution metric dS, which unfortunately saturates at the evolutionary distances studied.
Duplication probability
Lynch and Conery (2000) and Hahn et al. (2007) estimated the rate of gene duplication to be 0.0023 dup/gene/myr and 0.0013 dup/gene/myr, respectively. Given that the fly tree has a depth of roughly 40 myr and a total length that is approximately six times that, we expect a gene duplication to occur in any gene family with a probability of approximately 0.552 (0.0023 * 40 * 6) and 0.312 (0.0013 * 40 * 6), respectively.
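The arithmetic behind these two values can be reproduced with a short sketch. The product of rate and total tree length is, strictly, an expected number of duplications per gene family; the second function converts it to a probability of at least one duplication under a Poisson assumption, which is an assumption of this sketch and is not stated in the text. Function names are hypothetical.

```python
import math


def expected_duplications(rate_per_gene_per_myr: float, depth_myr: float = 40.0,
                          length_factor: float = 6.0) -> float:
    """Duplication rate multiplied by the total tree length (depth x length factor)."""
    return rate_per_gene_per_myr * depth_myr * length_factor


def prob_at_least_one(expected: float) -> float:
    """P(at least one event) if the event count is Poisson (assumption of this sketch)."""
    return 1.0 - math.exp(-expected)


for label, rate in [("Lynch and Conery (2000)", 0.0023), ("Hahn et al. (2007)", 0.0013)]:
    print(f"{label}: {expected_duplications(rate):.3f}")
# Lynch and Conery (2000): 0.552
# Hahn et al. (2007): 0.312
```

Under the Poisson reading, the probability of at least one duplication is always somewhat smaller than the product itself, so the quoted values are best read as upper-end approximations.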
Simulation
We used simulated sequence evolution to evaluate the accuracy of all of the phylogenetic methods we tested. Unlike other phylogenetic methods, SPIDIR uses a generative model of gene-tree evolution to calculate the likelihood of a phylogeny. Therefore, we must simulate sequences such that they behave like real gene families. To do this, we combined an existing sequence simulation program, ROSE (Stoye et al. 1998), with our generative model of gene-tree evolution. If we use a model trained on fly one-to-one orthologs, we can create simulated fly gene trees that have the same gene-specific and species-specific substitution rates as real fly gene families.
For each simulation, we fix the desired topology (T1–T10). We then use our generative model to choose branch lengths for the topology, as described above (see Generative model of gene-tree evolution). If a negative branch length is generated, it is discarded and a new one is drawn from the distribution.
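Drawing branch lengths for a fixed topology, with negative draws discarded and redrawn, can be sketched as below. The sketch assumes one gene rate shared by all branches of a simulated tree, a shape/rate parameterization of the gamma distribution, and redrawing of only the normal factor when a draw is negative; the function name `simulate_branch_lengths` and the parameter values are hypothetical.

```python
import math
import random
from typing import Dict, Tuple


def simulate_branch_lengths(species_params: Dict[str, Tuple[float, float]],
                            alpha: float, beta: float,
                            rng: random.Random) -> Dict[str, float]:
    """Draw a length for every branch of a fixed topology, redrawing negative values."""
    g = rng.gammavariate(alpha, 1.0 / beta)  # gene rate, shared across the tree (assumption)
    lengths: Dict[str, float] = {}
    for branch, (mu, var) in species_params.items():
        b = -1.0
        while b < 0.0:  # discard negative lengths and draw again
            b = g * rng.gauss(mu, math.sqrt(var))
        lengths[branch] = b
    return lengths


toy_params = {"A": (0.30, 0.01), "B": (0.50, 0.02), "AB": (0.20, 0.04)}
print(simulate_branch_lengths(toy_params, alpha=2.0, beta=1.5, rng=random.Random(7)))
```

The resulting dictionary of branch lengths could then be attached to the fixed topology and passed to a sequence simulator.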
Once we have a tree with branch lengths, ROSE is used to simulate sequence evolution down each branch using the HKY model. Base frequencies (A = 0.258, C = 0.267, G = 0.266, T = 0.209), transition bias (3.18), gene length (861 bp), total tree length (1.82 sub/site), and alignment percent identity (0.368) were matched to those of the real data.
Acknowledgments
We thank Marcia Lara, Antonis Rokas, and Bruce Birren at the Broad Institute for useful comments and feedback. We thank Mike Lin, Alex Stark, Joshua Grochow, Pouya Kheradpour, and Radek Sklarczyk for help, advice, and comments. We are indebted to the fly community for early release of data and annotations for use in our benchmarks. We also thank the NIAID and our collaborators at the Broad Institute, and especially Bruce Birren and Christina Cuomo for discussions. This work was funded by the Ruth L. Kirschstein National Research Service Award (NRSA).
Footnotes
[Supplemental material is available online at www.genome.org and http://compbio.mit.edu/spidir/.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.7105007
References
- Adams M.D., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. doi: 10.1126/science.287.5461.2185.
- Ané C., Larget B., Baum D.A., Smith S.D., Rokas A. Bayesian estimation of concordance among gene trees. Mol. Biol. Evol. 2007;24:412–426. doi: 10.1093/molbev/msl170.
- Arvestad L., Berglund A., Lagergren J., Sennblad B. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. Proceedings of the Eighth Annual International Conference on Computational Molecular Biology. 2004:326–335.
- Avise J.C., Shapira J., Daniel S.W., Aquadro C.F., Lansman R.A. Mitochondrial DNA differentiation during the speciation process in Peromyscus. Mol. Biol. Evol. 1983;1:38–56. doi: 10.1093/oxfordjournals.molbev.a040301.
- Baldauf S.L., Roger A.J., Wenk-Siefert I., Doolittle W.F. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science. 2000;290:972–977. doi: 10.1126/science.290.5493.972.
- Bergsten J. A review of long-branch attraction. Cladistics. 2005;21:163–193. doi: 10.1111/j.1096-0031.2005.00059.x.
- Bromham L., Penny D. The modern molecular clock. Nat. Rev. Genet. 2003;4:216–224. doi: 10.1038/nrg1020.
- Bryant D., Waddell P. Rapid evaluation of least-squares and minimum-evolution criteria on phylogenetic trees. Mol. Biol. Evol. 1998;15:1346–1359.
- Bull J.J., Badgett M.R., Wichman H.A., Huelsenbeck J.P., Hillis D.M., Gulati A., Ho C., Molineux I.J. Exceptional convergent evolution in a virus. Genetics. 1997;147:1497–1507. doi: 10.1093/genetics/147.4.1497.
- Byrne K.P., Wolfe K.H. The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 2005;15:1456–1461. doi: 10.1101/gr.3672305.
- Cannarozzi G.M., Schneider A., Gonnet G. A phylogenomic study of human, dog and mouse. PLoS Computat. Biol. 2006 doi: 10.1371/journal.pcbi.0030002. e2.eor.
- Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87. doi: 10.1038/nature04072.
- Ciccarelli F.D., Doerks T., v Mering C., Creevey C.J., Snel B., Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–1287. doi: 10.1126/science.1123061.
- Cliften P., Sudarsanam P., Desikan A., Fulton L., Fulton B., Majors J., Waterston R., Cohen B.A., Johnston M. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003;301:71–76. doi: 10.1126/science.1084337.
- Cooper G.M., Brudno M., NISC Comparative Sequencing Program, Green E.D., Batzoglou S., Sidow A. Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res. 2003;13:813–820. doi: 10.1101/gr.1064503.
- Dehal P., Boore J.L. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005;3:e314. doi: 10.1371/journal.pbio.0030314.
- Dickerson R.E. The structures of cytochrome c and the rates of molecular evolution. J. Mol. Evol. 1971;1:26–45. doi: 10.1007/BF01659392.
- Dietrich F.S., Voegeli S., Brachat S., Lerch A., Gates K., Steiner S., Mohr C., Pöhlmann R., Luedi P., Choi S., et al. Comparison of the sequence of laboratory yeast to that of a distant relative, a small filamentous fungus, reveals that yeast evolved via genome duplication and separate evolution of homologous genes. Science. 2004;9:304–307.
- Drosophila 12 Genomes Consortium. Genomics on phylogeny: Evolution of genes and genomes in the genus Drosophila. Nature. 2007 doi: 10.1038/nature06341. (in press)
- Dufayard J.-F., Duret L., Penel S., Gouy M., Rechenmann F., Perriere G. Tree pattern matching in phylogenetic trees: Automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics. 2005;21:2596–2603. doi: 10.1093/bioinformatics/bti325.
- Dujon B., Sherman D., Fischer G., Durrens P., Casaregola S., Lafontaine I., de Montigny J., Marck C., Neuvéglise C., Talla E., et al. Genome evolution in yeasts. Nature. 2004;430:35–44. doi: 10.1038/nature02579.
- Durand D., Halldorsson B.V., Vernot B. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. 2006;13:320–335. doi: 10.1089/cmb.2006.13.320.
- Edgar R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
- Edwards S.V., Liu L., Pearl D.K. High-resolution species trees without concatenation. Proc. Natl. Acad. Sci. 2007;104:5936–5941. doi: 10.1073/pnas.0607004104.
- Eisen J.A. Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998;8:163–167. doi: 10.1101/gr.8.3.163.
- Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 1978;27:401–410.
- Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. 2005.
- Felsenstein J., Churchill G.A. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 1996;13:93–104. doi: 10.1093/oxfordjournals.molbev.a025575.
- Fitch W.M. Distinguishing homologous from analogous proteins. Syst. Zool. 1970;19:99–113.
- Gadagkar S.R., Rosenberg M.S., Kumar S. Inferring species phylogenies from multiple genes: Concatenated sequence tree versus consensus gene tree. J. Exp. Zoolog. B Mol. Dev. Evol. 2005;304:64–74. doi: 10.1002/jez.b.21026.
- Gascuel O. BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 1997;14:685–695. doi: 10.1093/oxfordjournals.molbev.a025808.
- Goodman M., Czelusniak J., Moore G., Romero-Herrera A., Matsuda G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 1979;28:132–163.
- Guindon S., Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 2003;52:696–704. doi: 10.1080/10635150390235520.
- Hahn M. Bias in phylogenetic tree reconciliation methods: Implications for vertebrate genome evolution. Genome Biol. 2007;8:R141. doi: 10.1186/gb-2007-8-7-r141.
- Hahn M.W., Han M.V., Han S.-G. Gene family evolution across 12 Drosophila genomes. PLoS Genet. 2007;3:e197. doi: 10.1371/journal.pgen.0030197.
- Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985;22:160–174. doi: 10.1007/BF02101694.
- Hillis D.M., Huelsenbeck J.P., Cunningham C.W. Application and accuracy of molecular phylogenies. Science. 1994;264:671–677. doi: 10.1126/science.8171318.
- Hittinger C.T., Rokas A., Carroll S.B. Parallel inactivation of multiple GAL pathway genes and ecological diversification in yeasts. Proc. Natl. Acad. Sci. 2004;101:14144–14149. doi: 10.1073/pnas.0404319101.
- Huelsenbeck J. The robustness of two phylogenetic methods: Four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol. Biol. Evol. 1995;12:843–849. doi: 10.1093/oxfordjournals.molbev.a040261.
- Huelsenbeck J.P., Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17:754–755. doi: 10.1093/bioinformatics/17.8.754.
- Huerta-Cepas J., Dopazo H., Dopazo J., Gabaldon T. The human phylome. Genome Biol. 2007;8:R109. doi: 10.1186/gb-2007-8-6-r109.
- Jaillon O., Aury J.M., Brunet F., Petit J.L., Stange-Thomann N., Mauceli E., Bouneau L., Fischer C., Ozouf-Costaz C., Bernot A., et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature. 2004;431:946–957. doi: 10.1038/nature03025.
- Jones D.T., Taylor W.R., Thornton J.M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 1992;8:275–282. doi: 10.1093/bioinformatics/8.3.275.
- Kellis M., Patterson N., Endrizzi M., Birren B., Lander E.S. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003;423:241–254. doi: 10.1038/nature01644.
- Kellis M., Birren B.W., Lander E.S. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428:617–624. doi: 10.1038/nature02424.
- Kim S.Y., Pritchard J.K. Adaptive evolution of conserved noncoding elements in mammals. PLoS Genet. 2007;3:e147. doi: 10.1371/journal.pgen.0030147.
- Koonin E.V., Makarova K.S., Aravind L. Horizontal gene transfer in prokaryotes: Quantification and classification. Annu. Rev. Microbiol. 2001;55:709–742. doi: 10.1146/annurev.micro.55.1.709.
- Kriegs J.O., Churakov G., Kiefmann M., Jordan U., Brosius J., Schmitz J. Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol. 2006;4:e91. doi: 10.1371/journal.pbio.0040091.
- Kuhner M.K., Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 1994;11:459–468. doi: 10.1093/oxfordjournals.molbev.a040126.
- Li H., Coghlan A., Ruan J., Coin L.J., Heriche J.-K., Osmotherly L., Li R., Liu T., Zhang Z., Bolund L., et al. TreeFam: A curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–D580. doi: 10.1093/nar/gkj118.
- Lynch M., Conery J.S. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155. doi: 10.1126/science.290.5494.1151.
- Ma B., Li M., Zhang L. From gene trees to species trees. SIAM J. Comput. 2000;30:729–752.
- Miller W., Makova K.D., Nekrutenko A., Hardison R.C., Makova K.D., Nekrutenko A., Hardison R.C., Nekrutenko A., Hardison R.C., Hardison R.C. Comparative genomics. Annu. Rev. Genomics Hum. Genet. 2004;5:15–56. doi: 10.1146/annurev.genom.5.061903.180057. [DOI] [PubMed] [Google Scholar]
- Mouse Genome Sequencing Consortium Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. doi: 10.1038/nature01262. [DOI] [PubMed] [Google Scholar]
- Murphy W.J., Eizirik E., Johnson W.E., Zhang Y.P., Ryder O.A., O'Brien S.J., Eizirik E., Johnson W.E., Zhang Y.P., Ryder O.A., O'Brien S.J., Johnson W.E., Zhang Y.P., Ryder O.A., O'Brien S.J., Zhang Y.P., Ryder O.A., O'Brien S.J., Ryder O.A., O'Brien S.J., O'Brien S.J. Molecular phylogenetics and the origins of placental mammals. Nature. 2001;409:614–618. doi: 10.1038/35054550. [DOI] [PubMed] [Google Scholar]
- Ohta T., Kimura M., Kimura M. On the constancy of the evolutionary rate of cistrons. J. Mol. Evol. 1971;1:18–25. doi: 10.1007/BF01659391. [DOI] [PubMed] [Google Scholar]
- Page R. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol. 1994;43:58–77. [Google Scholar]
- Page R.D., Charleston M.A., Charleston M.A. From gene to organismal phylogeny: Reconciled trees and the gene tree/species tree problem. Mol. Phylogenet. Evol. 1997;7:231–240. doi: 10.1006/mpev.1996.0390. [DOI] [PubMed] [Google Scholar]
- Philippe H., Zhou Y., Brinkmann H., Rodrigue N., Delsuc F., Zhou Y., Brinkmann H., Rodrigue N., Delsuc F., Brinkmann H., Rodrigue N., Delsuc F., Rodrigue N., Delsuc F., Delsuc F. Heterotachy and long-branch attraction in phylogenetics. BMC Evol. Biol. 2005;5:50. doi: 10.1186/1471-2148-5-50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pollard D.A., Iyer V.N., Moses A.M., Eisen M.B., Iyer V.N., Moses A.M., Eisen M.B., Moses A.M., Eisen M.B., Eisen M.B. Widespread discordance of gene trees with species tree in Drosophila: Evidence for incomplete lineage sorting. PLoS Genet. 2006;2:e173. doi: 10.1371/journal.pgen.0020173. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Richards S., Liu Y., Bettencourt B.R., Hradecky P., Letovsky S., Nielsen R., Thornton K., Hubisz M.J., Chen R., Meisel R.P., Liu Y., Bettencourt B.R., Hradecky P., Letovsky S., Nielsen R., Thornton K., Hubisz M.J., Chen R., Meisel R.P., Bettencourt B.R., Hradecky P., Letovsky S., Nielsen R., Thornton K., Hubisz M.J., Chen R., Meisel R.P., Hradecky P., Letovsky S., Nielsen R., Thornton K., Hubisz M.J., Chen R., Meisel R.P., Letovsky S., Nielsen R., Thornton K., Hubisz M.J., Chen R., Meisel R.P., Nielsen R., Thornton K., Hubisz M.J., Chen R., Meisel R.P., Thornton K., Hubisz M.J., Chen R., Meisel R.P., Hubisz M.J., Chen R., Meisel R.P., Chen R., Meisel R.P., Meisel R.P., et al. Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution. Genome Res. 2005;15:1–18. doi: 10.1101/gr.3059305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Robinson D.F., Foulds L.R. Comparison of phylogenetic trees. Math. Biosci. 1981;53:131–147.
- Rokas A., Williams B.L., King N., Carroll S.B. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. doi: 10.1038/nature02053.
- Ronquist F., Huelsenbeck J.P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. doi: 10.1093/bioinformatics/btg180.
- Saitou N., Imanishi T. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol. 1989;6:514–525.
- Schmidt H.A., Strimmer K., Vingron M., von Haeseler A. TREE-PUZZLE: Maximum-likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics. 2002;18:502–504. doi: 10.1093/bioinformatics/18.3.502.
- Shimodaira H., Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 1999;16:1114–1116.
- Siepel A., Bejerano G., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050. doi: 10.1101/gr.3715005.
- Stark A., Lin M.F., Kheradpour P., Pedersen J.S., Parts L., Carlson J.W., Crosby M.A., Rasmussen M.D., Roy S., Deoras A.N., et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature. 2007 (in press). doi: 10.1038/nature06340.
- Storm C.E.V., Sonnhammer E.L.L. Comprehensive analysis of orthologous protein domains using the HOPS database. Genome Res. 2003;13:2353–2362. doi: 10.1101/gr1305203.
- Stoye J., Evers D., Meyer F. Rose: Generating sequence families. Bioinformatics. 1998;14:157–163. doi: 10.1093/bioinformatics/14.2.157.
- Tateno Y., Takezaki N., Nei M. Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 1994;11:261–277. doi: 10.1093/oxfordjournals.molbev.a040108.
- Ureta-Vidal A., Ettwiller L., Birney E. Comparative genomics: Genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet. 2003;4:251–262. doi: 10.1038/nrg1043.
- Uzzell T., Corbin K.W. Fitting discrete probability distributions to evolutionary events. Science. 1971;172:1089–1096. doi: 10.1126/science.172.3988.1089.
- Woese C.R., Kandler O., Wheelis M.L. Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. 1990;87:4576–4579. doi: 10.1073/pnas.87.12.4576.
- Wolfe K.H., Shields D.C. Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997;387:708–713. doi: 10.1038/42711.
- Wong A., Jensen J.D., Pool J.E., Aquadro C.F. Phylogenetic incongruence in the Drosophila melanogaster species group. Mol. Phylogenet. Evol. 2007;43:1138–1150. doi: 10.1016/j.ympev.2006.09.002.
- Woods R., Schneider D., Winkworth C.L., Riley M.A., Lenski R.E. Tests of parallel molecular evolution in a long-term experiment with Escherichia coli. Proc. Natl. Acad. Sci. 2006;103:9107–9112. doi: 10.1073/pnas.0602917103.
- Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 1994;39:306–314. doi: 10.1007/BF00160154.
- Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 1997;13:555–556. doi: 10.1093/bioinformatics/13.5.555.
- Zmasek C.M., Eddy S.R. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics. 2002;3:14. doi: 10.1186/1471-2105-3-14.