Author manuscript; available in PMC: 2013 Nov 28.
Published in final edited form as: Methods Mol Biol. 2012;856:10.1007/978-1-61779-585-5_3. doi: 10.1007/978-1-61779-585-5_3

GENOME-WIDE COMPARATIVE ANALYSIS OF PHYLOGENETIC TREES: THE PROKARYOTIC FOREST OF LIFE

Pere Puigbò 1, Yuri I Wolf 1, Eugene V Koonin 1,§
PMCID: PMC3842619  NIHMSID: NIHMS519038  PMID: 22399455

Abstract

Genome-wide comparison of phylogenetic trees is becoming an increasingly common approach in evolutionary genomics, and a variety of approaches for such comparison have been developed. In this article we present several methods for comparative analysis of large numbers of phylogenetic trees. To compare phylogenetic trees taking into account the bootstrap support for each internal branch, the Boot-Split Distance (BSD) method is introduced as an extension of the previously developed Split Distance (SD) method for tree comparison. The BSD method implements the straightforward idea that comparison of phylogenetic trees can be made more robust by treating tree splits differentially depending on the bootstrap support. Approaches are also introduced for detecting tree-like and net-like evolutionary trends in the phylogenetic Forest of Life (FOL), i.e., the entirety of the phylogenetic trees for conserved genes of prokaryotes. The principal method employed for this purpose includes mapping quartets of species onto trees to calculate the support of each quartet topology and so to quantify the tree and net contributions to the distances between species. We describe the application of these methods to the analysis of the FOL and the results obtained with them. These results support the concept of the Tree of Life (TOL) as a central evolutionary trend in the FOL as opposed to the traditional view of the TOL as a ‘species tree’.

Keywords: Forest of life, tree of life, phylogenomic methods, tree comparison, map of quartets

1. INTRODUCTION

With the advances of genomics, phylogenetics entered a new era that is marked by the availability of extensive collections of phylogenetic trees for thousands of individual genes. Examples of such tree collections are the phylomes that encompass trees for all sufficiently widespread genes in a given genome (1-4) or the “Forest of Life” (FOL) that consists of all trees for widespread genes in a representative set of organisms (5). It has been known since the early days of phylogenetics that trees built on the same set of species often have different topologies, especially when the set includes distant species, most notably, in prokaryotes (6, 7). The availability of “forests” consisting of numerous phylogenetic trees exacerbated the problem as an enormous diversity of tree topologies has been revealed. The inconsistency between trees has several major sources: 1) problems with ortholog identification caused primarily by cryptic paralogy, 2) various artifacts of phylogenetic analysis, such as long branch attraction (LBA), 3) horizontal gene transfer (HGT), and 4) other evolutionary processes distorting the vertical, tree-like pattern, such as incomplete lineage sorting and hybridization (1, 8-10). In order to obtain robust results in genome-level phylogenetic analysis, for instance, to classify phylogenetic trees into clusters with (partially) congruent topologies or to identify common trends among multiple trees, reliable methods for comparing trees are indispensable.

The number and diversity of tree comparison methods and software have substantially increased in the last few years. The tree comparison methods variously use tree bipartitions, such as the partition, or symmetric difference, metric (11) and split distance (12); distance between nodes such as the path length metrics (13), nodal distance (12, 14) and nodal distance for rooted trees (15); comparison of evolutionary units such as triplets and quartets (16); subtransfer operations such as subtree transfer distance (17), nearest-neighbor interchanging (18), Subtree Prune and Regraft (SPR) using a rooted reference tree (19), SPR for unrooted trees (20) and Tree Bisection and Reconnection (TBR) (17); (dis)agreement methods such as agreement subtrees (21), disagree (12), corresponding mapping (22) and congruence index (23); tree reconciliation (24); and topological and branch lengths methods such as K-tree score (25). Several algorithms have been proposed to analyze multi-family trees. For example, the FMTS algorithm systematically prunes each gene copy from a multi-family tree to obtain all possible single-gene trees (12), and an algorithm implemented in TreeKO prunes nodes from the input rooted trees in which duplication and speciation events are labeled (26). However, to the best of our knowledge, none of the available metrics for tree comparison takes into account the robustness of the branches, a feature that appears important to minimize the impact of artefacts (unreliable parts of a tree) on the outcome of comparative tree analysis. Here, we present the Boot-Split Distance (BSD) method that calculates distances between phylogenetic trees with weighting based on bootstrap values. This method is implemented in the program TOPD/FMTS (12).
In our recent research, we used the BSD method combined with classical multidimensional scaling (CMDS) analysis to explore the main trends in the phylogenetic FOL and to explore the “Tree of Life” (TOL) concept in light of comparative genomics (5, 27).

Since the time (ca 1838) when Darwin drew the famous sketch of an evolutionary tree in his notebook on transmutation of species, with the legend “I think…”, the thinking on the “Tree of Life” (TOL) has evolved substantially. The first phylogenetic revolution, brought about by the pioneering work of Zuckerkandl and Pauling (28), and later Woese and coworkers (29), was the establishment of molecular sequences as the principal material for phylogenetic tree construction. The second revolution was triggered by the advent of comparative genomics, when it was realized that HGT, at least among prokaryotes, is much more common than previously suspected. The first revolution was a triumph of tree thinking, when a well resolved TOL started to appear within reach. The second revolution undermines the very foundation of the TOL concept and threatens to destroy it altogether (30-32).

The current views of evolutionary biologists on the TOL span the entire range from acceptance to complete rejection, with a host of moderate positions. The following rough classification may be used to summarize these positions: a) acceptance of the TOL as the dominant trend in evolution: HGT is considered to be rare and overhyped, and most of the observed “transfers” are deemed to be artifacts (33-36); b) the TOL is the common history of the (nearly) non-transferable core of genes, surrounded by “vines” of HGT (37-48); c) each gene has its own evolutionary history blending HGT and vertical inheritance; a statistical trend might exist in the maze of gene histories, and it could even be tree-like (5, 49-51); and d) the ubiquity of HGT renders the TOL concept totally obsolete (prokaryotic species and higher taxa do not exist, and microbial “taxonomy” is created by a pattern of biased HGT) (30, 32, 52-57).

We found that, although different trends and patterns have to be invoked to describe the FOL in its entirety, the main, most robust trend is the “statistical TOL”, i.e., the signal of coherent topology that is discernible in a large fraction of the trees in the FOL, in particular, among the Nearly Universal Trees (NUTs).

Recently, we further explored the FOL by analysis of species quartets (58). A quartet is a group of 4 species which is the minimum evolutionary unit in unrooted phylogenetic trees; each quartet can assume three unrooted tree topologies (16). We described a quantitative measure of the tree and net signals in evolution that is derived from an analysis of all quartets of species in all trees of the FOL. The results of this analysis indicate that, although diverse routes of net-like evolution jointly dominate the FOL, the pattern of tree-like evolution that recapitulates the consensus topology of the NUTs is the single most prominent, coherent trend. Here, we report an extended version of these methodologies introduced to analyze the FOL and its trends, as well as new concepts of prokaryotic evolution under the FOL perspective (Supplementary figure 1).

2. MATERIALS

2.1. The Forest of Life (FOL) and Nearly Universal Trees (NUTs)

We analyzed the set of 6901 phylogenetic trees from (5) that were obtained as follows. Clusters of orthologous genes were obtained from the COG (59) and EggNOG (60) databases from 100 prokaryotic species (59 bacteria and 41 archaea). The species were selected to represent the taxonomic diversity of Archaea and Bacteria (for the complete list of species, see Table S1). The BeTs algorithm (59) was used to identify the orthologs with the highest mean similarity to other members of the same cluster (“index orthologs”), so the final clusters contained 100 or fewer genes, with no more than one representative of each species. The sequences in each cluster were aligned using the Muscle program (61) with default parameters and refined using Gblocks (62). The program Multiphyl (63), which selects the best of 88 amino acid substitution models, was used to reconstruct the maximum likelihood tree of each cluster. The Nearly Universal Trees (NUTs) are defined as trees from COGs that are represented in more than 90% of the species included in the study.

3. METHODS

3.1. Boot-split distance: a method to compare phylogenetic trees taking into account bootstrap support

Boot Split Distance (BSD)

The BSD method is based on the original Split Distance (SD) method (12). Both methods work by collecting all possible binary splits of the two compared trees and calculating the fraction of equal splits, i.e. those splits that are present in both trees (different splits are those that are present in only one of the two trees). Instead of considering all branches as being equal, as is the case in SD, the BSD method takes into account the bootstrap values to increase or decrease the SD value proportionally to the robustness of individual internal branches. The BSD value is the average of the BSD in the equal splits (eBSD) and the BSD in the different splits (dBSD) [Equation 1]. Equations 2 and 3 give the formulas to calculate the eBSD and dBSD values, respectively.

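$$\mathrm{BSD} = \frac{\mathrm{eBSD} + \mathrm{dBSD}}{2} \tag{1}$$
$$\mathrm{eBSD} = 1 - \left[\frac{e}{a}\, M_e\right] \tag{2}$$
$$\mathrm{dBSD} = \frac{d}{a}\, M_d \tag{3}$$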

Here e is the sum of bootstrap values of equal splits, d is the sum of bootstrap value of different splits, a is the sum of all bootstrap values, Me is the mean bootstrap value of equal splits, and Md is the mean bootstrap value of different splits.
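As a minimal sketch of Equations 1 to 3, the function below computes SD and BSD for two trees that have already been reduced to their splits. The representation of a tree as a dictionary from a split label to a bootstrap value (0 to 1 scale), the function name, and the example values are assumptions made for illustration and are not the TOPD/FMTS implementation. A split shared by both trees is counted once per tree, which reproduces SD = 0.33 for two 6-species trees that differ in one split.

```python
def boot_split_distance(splits1, splits2):
    """Return (SD, BSD) for two trees given as {split: bootstrap} dicts.

    Bootstrap values are expected on the 0-1 scale. A split present in both
    trees contributes one bootstrap value per tree (assumption of this sketch).
    """
    shared = splits1.keys() & splits2.keys()
    equal, different = [], []
    for tree in (splits1, splits2):
        for split, bootstrap in tree.items():
            (equal if split in shared else different).append(bootstrap)

    e, d = sum(equal), sum(different)   # e and d of Equations 2 and 3
    a = e + d                           # sum of all bootstrap values
    if a == 0:
        raise ValueError("BSD is undefined when all bootstrap values are zero")
    mean_equal = e / len(equal) if equal else 0.0              # Me
    mean_different = d / len(different) if different else 0.0  # Md

    ebsd = 1 - (e / a) * mean_equal     # Equation 2
    dbsd = (d / a) * mean_different     # Equation 3
    sd = len(different) / (len(equal) + len(different))
    return sd, (ebsd + dbsd) / 2        # Equation 1


# Hypothetical 6-species trees that differ in one split.
tree1 = {"AB|CDEF": 1.0, "ABC|DEF": 0.8, "ABCD|EF": 0.6}
tree2 = {"AB|CDEF": 0.9, "ABC|DEF": 0.7, "ABCE|DF": 0.5}
sd, bsd = boot_split_distance(tree1, tree2)
```

With fully supported identical trees the sketch returns BSD = 0, and with fully supported trees that share no split it returns BSD = 1, which matches the limits implied by the equations.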

The BSD algorithm proceeds in 4 basic steps to compare pairs of trees (Supplementary figure 2). The first step is to obtain all possible splits from both trees. This procedure implies a binary split of the tree at each internal branch, so that the tree is partitioned into two parts each of which contains at least two species. In the second step, the common leaf-set of the two trees is obtained, that is, the set of shared species. Only trees with a common leaf-set of at least 4 species can be compared. The third step consists in pruning all splits to the common leaf-set of species; at this step, species that are present in only one of the two compared trees are removed from the split list. After this procedure, in partially overlapping trees, the algorithm checks whether each of the splits remains a valid partition, that is, a partition that separates at least two species from the rest of the tree. If a split is not a valid partition, it is removed. Finally, the algorithm calculates the BSD using equations 1, 2 and 3.
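Steps 2 and 3 of this procedure can be sketched as follows. The split representation (an unordered pair of species sets) and all function names are hypothetical choices for illustration; the sketch handles topology only and leaves the bootstrap value of merged splits to Equation 4.

```python
def make_split(side_a, side_b):
    """Represent a split as an unordered pair of species sets."""
    return frozenset((frozenset(side_a), frozenset(side_b)))


def leaf_set(splits):
    """All species mentioned in a collection of splits."""
    return frozenset().union(*(side for split in splits for side in split))


def prune_to_common_leaves(splits, common):
    """Restrict splits to the shared species and drop invalid partitions."""
    if len(common) < 4:
        raise ValueError("trees must share at least 4 species to be compared")
    pruned = set()
    for split in splits:
        sides = [side & common for side in split]
        # A valid partition separates at least two species from the rest.
        if all(len(side) >= 2 for side in sides):
            pruned.add(frozenset(sides))
    return pruned


# Hypothetical example: a 6-species tree compared with a 4-species tree.
tree1 = {make_split("AB", "CDEF"), make_split("ABC", "DEF"),
         make_split("ABCD", "EF")}
tree2 = {make_split("AB", "CD")}
common = leaf_set(tree1) & leaf_set(tree2)
pruned1 = prune_to_common_leaves(tree1, common)
```

In this example only the split AB|CD survives in the larger tree: ABC|DEF becomes ABC|D and ABCD|EF loses one side entirely, so neither remains a valid partition.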

The BSD algorithm

There are three possible types of comparisons for trees that do not include paralogs, that is, include one and only one sequence from each of the constituent species (Figure 1). In the first case, the two trees completely overlap, that is, consist of the same set of species (Figure 1a). In this case, step 3, the pruning procedure, is not necessary, and the comparison involves only obtaining all possible splits and the calculation of the BSD. In the second case, one of the compared trees is a subset of the other tree (Figure 1b). In this case, splits are pruned, and occasionally removed, only in the bigger tree. In the third case, the two trees partially overlap (Figure 1c). Both in this case and when one tree is a subset of the other, a pruning procedure is required. In the example shown in Figure 2, after the pruning procedure (step 3), there is only one remaining split (split: AB∣CD) that is repeated several times in both trees. The two sides of the remaining AB∣CD split in Tree 1 are separated by 4 nodes that have different bootstrap values. In this case, the bootstrap of the remaining split is calculated using Equation 4, where n is the total number of nodes between the two sides of the split and BSi is the bootstrap value (adjusted to the 0 to 1 range) of node i.

$$\mathrm{Bootstrap} = 1 - \prod_{i=1}^{n} \left(1 - BS_i\right) \tag{4}$$
Figure 1. Examples of the BSD algorithm in single family trees.


a) Two trees of the same size.

b) Tree 1 is a subtree of the tree 2.

c) Two trees that partially overlap

SD: Split Distance, BSD: Boot Split Distance, eBSD: BSD of equal splits, dBSD: BSD of different splits, p: number of equal splits, q: number of different splits, m: total number of splits, a: sum of bootstraps in all splits, e: sum of bootstraps in equal splits, d: sum of bootstraps in different splits, Ma: mean bootstrap value, Me: mean bootstrap value in equal splits, Md: mean bootstrap value in different splits.

Figure 2. Calculation of BSD for trees with unequal numbers of species.


The larger tree (1) is pruned prior to the calculation of BSD. The bootstrap value for the only shared internal branch is calculated according to the Equation (4).

The bootstrap value associated with a particular branch of a binary tree is taken as a measure of the probability that the four subtrees on the opposite ends of this branch are partitioned correctly. To estimate the probability of the correct partitioning of an arbitrary set of four subtrees, the internal branch of the quartet tree is mapped onto each of the internal branches of the original tree. The quartet is considered to be resolved correctly if it is resolved correctly relative to any of these branches. Under the assumption that bootstrap probabilities on individual branches are independent, Eq 4 is obtained as the estimate of the bootstrap probability for the internal branch of the quartet tree.
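Under the independence assumption stated above, the bootstrap of a split that collapses several internal branches is one minus the probability that none of those branches is resolved correctly, that is, 1 minus the product of (1 - BSi). A small sketch of this calculation follows; the function name and the input values are hypothetical and are not taken from Figure 2.

```python
import math


def merged_split_bootstrap(bootstraps):
    """Equation 4: 1 - prod(1 - BS_i), with each BS_i on the 0-1 scale."""
    if not all(0.0 <= bs <= 1.0 for bs in bootstraps):
        raise ValueError("bootstrap values must be rescaled to the 0-1 range")
    return 1.0 - math.prod(1.0 - bs for bs in bootstraps)


# Four hypothetical nodes between the two sides of a pruned split.
combined = merged_split_bootstrap([0.9, 0.8, 0.5, 0.4])
```

Because each factor (1 - BSi) is at most 1, the combined value is never lower than the highest individual bootstrap, and a single fully supported branch (BSi = 1) makes the merged split fully supported.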

Using a bootstrap threshold: pros and cons

The key question regarding the BSD method is which approach to phylogenetic tree comparison is best: using all branches, reliable or not, with the appropriate weighting, or using only branches supported by high bootstrap values. The first option is illustrated in Figure 1, whereas Figure 3 shows an example of a tree comparison that employs a bootstrap threshold of 70, i.e. only branches supported by a higher bootstrap are taken into account in the comparison. The second procedure appears reasonable and can be recommended in some cases. However, it is not advisable as a general approach because, when two large trees with varying bootstrap values are compared, using a strict threshold restricts the comparison to a small subset of robust branches, resulting in an artificially low BSD value. In other words, this procedure artificially inflates the similarity between the two trees by discarding a large fraction of the branches. In addition, before considering the use of only the most strongly supported branches, one should take into account that the BSD method already uses bootstrap values to adjust the distance between trees, so if two trees are topologically similar (low SD) but supported by low bootstrap values, the distance value increases (higher BSD), which is one of the advantages of the BSD method (see equations 2 and 3).

Figure 3. Example of the BSD algorithm using a bootstrap cut-off.


The figure shows the comparison of two phylogenetic trees that takes into account only those branches with bootstrap support greater than 70. SD: Split Distance, BSD: Boot Split Distance, eBSD: BSD of equal splits, dBSD: BSD of different splits, p: number of equal splits, q: number of different splits, m: total number of splits, a: sum of bootstraps in all splits, e: sum of bootstraps in equal splits, d: sum of bootstraps in different splits, Ma: mean bootstrap value, Me: mean bootstrap value in equal splits, Md: mean bootstrap value in different splits.

Testing the BSD method

The performance of the BSD method was compared with that of the original SD method implemented in the TOPD/FMTS program (12). Supplementary figure 3 shows the correlation of SD and BSD for trees with a number of species from 4 to 15 (a) and from 16 to 100 (b) from a recent large scale analysis of the FOL (5). The three-way comparison of SD, BSD and tree size (number of species) shows a positive correlation between SD and BSD for all tree sizes (R² = 0.8613 for trees with 4 to 16 species and R² = 0.7055 for trees with 16 to 100 species) (Supplementary figure 3c). However, the SD follows a discrete distribution, which is most conspicuous in the comparisons of small trees (Supplementary figure 3a), whereas, thanks to the use of the bootstrap values, the BSD distribution is continuous (Figure 4).

Figure 4. Comparisons of trees with 6 taxa.


Bootstrap values were assigned randomly in each comparison.

Figure 4 shows an example of the comparison (all-against-all) of three trees with 6 species each that differ in 1, 2 and 3 splits, resulting in SD values of 0.33, 0.67 and 1, respectively (Figure 4a). Also, each tree was compared to itself, resulting in an SD of 0. Then, bootstrap values were assigned randomly to the trees in order to compare the trees using the BSD method, and this procedure was repeated 1000 times. The resulting plot (Figure 4b) shows that, for the comparison of trees with SD of 0 and 1, the BSD values ranged from 0 to 0.5 and from 0.5 to 1, respectively, and in principle, could assume all intermediate values. In the case of the comparisons that differed in one split (SD=0.33), the BSD value was greater than 0.33 in 75% of the comparisons, whereas for the comparisons that differed in two splits (SD=0.67), 25% of the BSD values were greater than 0.67. Thus, the BSD method for tree comparison offers a better resolution than the SD method, especially for trees with a small number of species.
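The ranges reported for SD = 0 and SD = 1 follow directly from Equations 1 to 3 and can be checked with a small simulation in the spirit of Figure 4. The sketch below assumes bootstrap values drawn uniformly from 0.01 to 1 (the distribution used for the figure is not stated in the text), three internal splits per 6-species tree, and hypothetical function names.

```python
import random


def bsd_from_bootstraps(equal, different):
    """BSD (Equations 1-3) from the bootstrap values of equal/different splits."""
    e, d = sum(equal), sum(different)
    a = e + d
    mean_equal = e / len(equal) if equal else 0.0
    mean_different = d / len(different) if different else 0.0
    return ((1 - (e / a) * mean_equal) + (d / a) * mean_different) / 2


def simulate_bsd(differing_splits, replicates=1000, splits_per_tree=3, seed=0):
    """BSD values for two trees that differ in `differing_splits` splits each."""
    rng = random.Random(seed)
    n_equal = 2 * (splits_per_tree - differing_splits)  # counted in both trees
    n_different = 2 * differing_splits
    values = []
    for _ in range(replicates):
        equal = [rng.uniform(0.01, 1.0) for _ in range(n_equal)]
        different = [rng.uniform(0.01, 1.0) for _ in range(n_different)]
        values.append(bsd_from_bootstraps(equal, different))
    return values


# One list of simulated BSD values per SD class (SD = 0, 0.33, 0.67, 1).
results = {k: simulate_bsd(k) for k in range(4)}
```

For SD = 0 there are no different splits, so dBSD = 0 and eBSD = 1 - Me, which confines BSD to the interval from 0 to 0.5; for SD = 1 there are no equal splits, so eBSD = 1 and dBSD = Md, which confines BSD to the interval from 0.5 to 1.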

Figure 5a shows the results of analysis of 6 simulated alignments with an increasing level of noise (divergence with respect to the initial alignment) in each alignment, i.e. from alignment 0 (without noise and producing trees with bootstrap values of 100) to alignment 5 with the maximum level of noise. For each alignment, a tree was constructed using the UPGMA method from the web-server DendroUPGMA (http://genomes.urv.cat/UPGMA). Distances were calculated using the Jaccard coefficient, and bootstraps were generated from 100 replicates. The results of the tree comparison (Figure 5b) using three different methods, namely, Nodal Distance (ND), SD and BSD, show that the BSD method presents a continuous distribution, resulting in a better resolution of the distances than the other two methods. Indeed, the SD and ND methods fail to discern the similarity between trees after 6 changes, whereas the BSD method still reports discernible similarity (Figure 5b). In order to compare the three tree comparison methods, the distances reported by each method were normalized to the maximum value in each case, i.e. after 46 changes (maximum number of changes in the simulation), the distance to the initial tree is 1.41, 0.30 and 0.42 for ND, SD and BSD, respectively. All three distance values indicate that the trees are similar far above the random expectation, supporting the robustness of all methods, but the BSD method presents a better resolution in the tree comparison.

Figure 5. Comparison of 6 trees constructed from alignments with increasing noise levels.


a) Comparison of trees from 6 simulated alignments. The UPGMA tree from each alignment was reconstructed with the web server DendroUPGMA (http://genomes.urv.cat/UPGMA) using the Jaccard coefficient as the measure of distance and generating 100 bootstrap replicates. Alignment 0 corresponds to the initial alignment without noise that perfectly separates all branches, resulting in a tree with bootstrap values of 100 for all internal nodes. Alignments 1 to 5 correspond to the derivatives of the initial alignment with increasing noise levels at each step.

b) Results of the comparison of each tree (1 to 5) with the initial tree (0). The trees were compared using three methods: Split distance (SD), Nodal Distance (ND), and Boot Split Distance (BSD). For the purpose of comparison, the results obtained with each of the three methods were normalized to the maximum value in each case.

Analysis of random trees and the significance of BSD results

To assess the significance of the tree comparison by the BSD method, we performed several tree comparisons using random trees containing between 4 and 100 species (Figure 6). Each test is an all-against-all comparison of 1000 random trees (for complete results see Additional file 1). The results from random tree comparison have to be used to determine whether the detected similarities or differences between trees are significantly different from chance (12). Figure 6 shows that the distance between random trees monotonically increases with the tree size up to a value of approximately 0.75 for BSD and approximately 0.999 for SD. In other words, although BSD is an extension of the SD method, the results obtained by the two methods are not directly comparable. Therefore, to assess whether the similarity between two trees is better than chance, one must consider the method used for the tree comparison (e.g. SD or BSD) and the size of the tree. For example, consider two trees with 15 species each for which the SD method reports a distance of 0.75. This value is far below randomness (Figure 6), so the conclusion would be that the two trees are non-randomly similar. However, if the same distance value (0.75) is reported by the BSD method, the conclusion would be the opposite, namely, that the two trees are no more similar than two random trees of 15 species.
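The random baseline for SD can be approximated with a short simulation. The sketch below is not the procedure used to produce Figure 6: it assumes that random topologies are generated by repeatedly joining two randomly chosen clusters (one of several possible random tree models), it computes SD only (no bootstrap values are simulated), and all names and the number of trees are hypothetical.

```python
import random
from itertools import combinations


def random_tree_splits(species, rng):
    """Splits of a random unrooted binary tree built by random cluster joining."""
    everything = frozenset(species)
    clusters = [frozenset([s]) for s in species]
    splits = set()
    while len(clusters) > 3:            # an unrooted tree ends in a trifurcation
        i, j = rng.sample(range(len(clusters)), 2)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        splits.add(frozenset((merged, everything - merged)))
    return splits


def split_distance(splits1, splits2):
    """SD: splits found in only one tree, divided by all splits of both trees."""
    total = len(splits1) + len(splits2)
    if total == 0:
        raise ValueError("trees need at least 4 species to have internal splits")
    return len(splits1 ^ splits2) / total


def mean_random_sd(n_species, n_trees=30, seed=0):
    """Mean all-against-all SD among random trees of a given size."""
    rng = random.Random(seed)
    species = list(range(n_species))
    trees = [random_tree_splits(species, rng) for _ in range(n_trees)]
    pairs = list(combinations(trees, 2))
    return sum(split_distance(t1, t2) for t1, t2 in pairs) / len(pairs)


baseline = {n: mean_random_sd(n) for n in (4, 8, 15, 30)}
```

An observed SD can then be compared with the baseline for the matching tree size; the analogous baseline for BSD would additionally require a model for the bootstrap values, which this sketch does not attempt.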

Figure 6. Random BSD and SD depending on the tree size.


Results of the tree comparison of random trees (with different sizes ranging from 4 to 100 species) show that the BSD and SD increase up to 0.75 and 0.999 respectively.

Another and probably the most important problem of the comparison of phylogenetic trees is how to interpret the results from a biological perspective. To address this issue, we generated random trees containing from 4 to 100 species and performed 1 to 100 permutations (swap of a pair of branches) in each tree. The resulting tree was then compared with the source tree (Figure 7a,b). The results show the number of permutations required to obtain a particular BSD value for different tree sizes (number of species). For instance, BSD=0.3 in the comparison of two trees with 20 species indicates that the two trees are separated by one permutation whereas BSD=0.6 indicates that the trees are separated by approximately 9 permutations (for the complete listing of equivalences between BSD, SD and the number of permutations, see Additional file 2). Considering that each permutation corresponds to an HGT event, the BSD may be construed as the measure of the extent of HGT contributing to the topological difference between the compared trees. Given the discrete distribution of SD values, this measure cannot be used to infer the number of permutations with the same precision as BSD.

Figure 7. The number of permutations and the BSD.


a) BSD depending on the number of permutations and tree size.

b) Mean and standard deviation of the BSD for up to 100 permutations for trees with 20 species.

3.2. Analysis of topological trends in a set of phylogenetic trees

Calculation of the tree inconsistency

A key characteristic of the FOL is the degree of the topological (in)consistency between the constituent trees. To quantify this trend, we introduced the inconsistency score (IS), which is the fraction of the times that the splits from a given tree are found in all N trees that comprise the FOL (cite). Thus, the IS may be naturally taken as a measure of how representative of the entire FOL is the topology of the given tree. The IS is calculated using equations 5-7, where N is the total number of trees, X is the number of splits in the given tree, and Y is the number of times the splits from the given tree are found in all trees of the FOL.

$$IS = \frac{\frac{1}{Y} - IS_{min}}{IS_{max}} \tag{5}$$
$$IS_{min} = \frac{1}{XN} \tag{6}$$
$$IS_{max} = \frac{1}{X} - IS_{min} \tag{7}$$
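Reading Equations 5 to 7 as IS = (1/Y - ISmin)/ISmax with ISmin = 1/(XN) and ISmax = 1/X - ISmin, the score is 0 when every split of the tree is found in all N trees (Y = XN) and 1 when the splits are found only in the tree itself (Y = X). The sketch below illustrates this reading; it assumes that the forest includes the given tree and that all trees are already reduced to comparable sets of split labels, and the function name is hypothetical.

```python
def inconsistency_score(tree, forest):
    """IS of `tree` (a set of splits) against `forest` (a list of split sets).

    Assumes the forest contains the tree itself, so that Y >= X.
    """
    x = len(tree)                       # X: number of splits in the tree
    n = len(forest)                     # N: number of trees in the forest
    # Y: number of times the splits of the tree are found in all trees
    y = sum(1 for other in forest for split in tree if split in other)
    if x == 0 or y == 0 or n < 2:
        raise ValueError("IS needs a resolved tree and at least two trees")
    is_min = 1 / (x * n)                # Equation 6
    is_max = 1 / x - is_min             # Equation 7
    return (1 / y - is_min) / is_max    # Equation 5


# Hypothetical forest of three small trees described by split labels.
tree_a = {"AB|CDE", "ABC|DE"}
tree_b = {"AB|CDE", "ABD|CE"}
tree_c = {"AC|BDE", "ACD|BE"}
score = inconsistency_score(tree_a, [tree_a, tree_b, tree_c])
```

In this example the two splits of the first tree are found three times in total (twice in the tree itself and once in the second tree), so the score falls between the two extremes.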

In addition to the calculation of a single value of IS for a given tree by comparing its topology to the topologies of the rest of the trees in the FOL, IS can be calculated along the depth of the trees, measured either as split depth or as phylogenetic depth. The split depth was calculated for each unrooted tree according to the number of splits from the tips to the center of the tree. The value of split depth ranged from 1 to 49 ([100 species/2] – 1). The phylogenetic depth was obtained from the branch lengths of a rescaled ultrametric tree, rooted between archaeal and bacterial species, and ranged from 0 to 1. The topology of the ultrametric tree was obtained from the supertree of the 102 NUTs using the CLANN program (64). The branch lengths from each of the 6901 trees were used to calculate the average distance between each pair of species. The obtained matrix was used to calculate the branch lengths of the supertree of the NUTs. This supertree with branch lengths was then used to construct an ultrametric tree using the program KITSCH from the Phylip package (65) and rescaled to the depth range from 0 to 1. The resulting ultrametric tree was used for the analysis of the dependence of tree inconsistency on phylogenetic depth.

Classical multidimensional scaling analysis

The Classical MultiDimensional Scaling (CMDS), also known as principal coordinate analysis, is the multifactorial method best suited to analyze matrices obtained from tree comparison methods like BSD, and to identify the main trends in a large set of phylogenetic trees. The CMDS embeds n data points implied by an [n × n] distance matrix into an m-dimensional space (m < n) such that, for any k ∈ [1, m], the embedding into the first k dimensions is the best in terms of preserving the original distances between the points (66, 67). In our analysis, the data points are trees, and the distances between them are obtained using the BSD method. The choice of the optimal number of clusters is made using the gap statistics algorithm (68). The number of clusters k for which the value of the gap function for k + 1 clusters is not significantly higher than that for k clusters (z-score below 1.96, corresponding to 0.05 significance level) is considered optimal. The clustering of the embedded points was performed using the kmeans function of R, which implements the K-means algorithm. The CMDS approach has been previously employed by Hillis et al. for phylogenetic tree comparison, with the distances between trees calculated using the Robinson-Foulds distance (69).
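The embedding step can be illustrated with a dependency-free sketch of classical scaling: the squared distance matrix is double-centered, and the leading eigenvectors of the resulting matrix, scaled by the square roots of their eigenvalues, give the coordinates. The analysis described here was carried out in R; the code below is only an illustration, it substitutes simple power iteration with deflation for a full eigendecomposition, it does not implement the gap statistic or K-means, and all names are hypothetical.

```python
import math


def double_center(dist):
    """Matrix B = -1/2 * J * D^2 * J implied by a symmetric distance matrix D."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def dominant_eigenpair(matrix, iterations=500):
    """Largest-magnitude eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]   # fixed, non-uniform start vector
    for _ in range(iterations):
        image = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in image))
        if norm < 1e-12:                      # matrix is numerically zero
            return 0.0, [0.0] * n
        vec = [x / norm for x in image]
    image = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * w for v, w in zip(vec, image)), vec   # Rayleigh quotient


def classical_mds(dist, k):
    """Coordinates of the n points in the first k CMDS dimensions."""
    n = len(dist)
    gram = double_center(dist)
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        value, vec = dominant_eigenpair(gram)
        if value <= 1e-12:                    # no positive eigenvalue left
            break
        for i in range(n):
            coords[i][dim] = math.sqrt(value) * vec[i]
        # Deflate so that the next pass finds the next eigenpair.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords


# Hypothetical distance matrix (a 3-4-5 triangle) embedded in two dimensions.
triangle = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
points = classical_mds(triangle, 2)
```

For distances that are exactly Euclidean, as in this toy example, the embedding reproduces the input distances; tree distances such as BSD are generally not Euclidean, so negative eigenvalues can occur, and this sketch simply stops at the first non-positive dominant eigenvalue.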

3.3. Analysis of quartets of species

Definition of quartets and mapping quartets onto trees

The minimum evolutionary unit in unrooted phylogenetic trees is defined by groups of 4 species (or quartets), and each quartet may be represented by one of the three possible unrooted tree topologies (Supplementary figure 4a). A quartet defined by the set of species A, B, C and D has three possible unrooted topologies: 1) AB∣CD; 2) AC∣BD and 3) AD∣BC. To analyze which quartet topology (QT) best represents the relationships among the 4 species in a quartet, each quartet was compared against the entire set of phylogenetic trees from 100 species (the FOL).

For 100 species, there are 3,921,225 quartets, and accordingly, 11,763,675 topologies (Figure S11b). A mapping of quartets onto trees is produced using the SD method (12). A binary version of this method was employed to compare quartets and trees (a quartet is represented in a tree when SD=0 and not represented when SD>0). Figure 8a shows an example of quartet mapping onto a set of 10 trees. Here q1 is a resolved quartet, with the topology q1t1 supported by 8 of the 10 trees. By contrast, for q2, three quartet topologies are equally supported, i.e., the topology of this quartet remains unresolved.
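The binary mapping of a quartet onto a tree can be sketched as follows: a tree supports a quartet topology if one of its splits separates two of the four species from the other two. The representation of a tree as a list of splits (pairs of species sets), the function names, and the toy trees are assumptions for illustration and do not reproduce the SD-based implementation in TOPD/FMTS.

```python
import math
from collections import Counter


def quartet_topology(splits, quartet):
    """Topology of `quartet` displayed by a tree, or None if absent/unresolved.

    `splits` is an iterable of (side_a, side_b) pairs of species sets.
    """
    members = frozenset(quartet)
    for side_a, side_b in splits:
        part_a, part_b = members & side_a, members & side_b
        if len(part_a) == 2 and len(part_b) == 2:
            return frozenset((part_a, part_b))
    return None


def count_support(trees, quartet):
    """Number of trees that support each topology of one quartet."""
    support = Counter()
    for splits in trees:
        topology = quartet_topology(splits, quartet)
        if topology is not None:
            support[topology] += 1
    return support


# Number of quartets for 100 species, as quoted in the text.
N_QUARTETS = math.comb(100, 4)

# Two hypothetical 5-species trees given by their internal splits.
tree1 = [(frozenset("AB"), frozenset("CDE")), (frozenset("ABC"), frozenset("DE"))]
tree2 = [(frozenset("AC"), frozenset("BDE")), (frozenset("ABC"), frozenset("DE"))]
support = count_support([tree1, tree1, tree2], "ABCD")
```

A tree that lacks one of the four species never yields two species on each side of a split, so it contributes no support, which corresponds to the quartet being not represented in that tree.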

Figure 8. Mapping quartets.


a) Mapping quartets onto a set of 10 trees.

b) A schematic of the procedure used to reconstruct a species matrix from the map of quartets.

To analyze which of the three possible topologies best represents the almost four million quartets in the FOL, each quartet topology was compared with the entire set of 6901 trees, resulting in a total number of 8.12×10^10 tree comparisons (Supplementary figure 4b), and the number of trees that support each quartet topology was counted for the entire FOL or for the set of 102 NUTs (Supplementary figure 4b).

Distance matrices and heat maps

Using the quartet support values for each quartet, a 100×100 between-species distance matrix was calculated as $d_{ij} = 1 - S_{ij}/Q_{ij}$, where $d_{ij}$ is the distance between two species, $S_{ij}$ is the number of trees containing quartets in which the two species are neighbors, and $Q_{ij}$ is the total number of quartets containing the given two species. Then, this distance matrix was used to construct different heat maps using the matrix2png web-server (70) (Figure 8b). In contrast to the BSD method, which is best suited for the analysis of the evolution of individual genes, the distance matrices derived from maps of quartets are used to analyze the evolution of species and to disambiguate tree-like evolutionary relationships and ‘highways’ (preferential routes) of HGT.
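A possible way to turn a map of quartets into between-species distances is sketched below. The interpretation is an assumption of this sketch: Sij is accumulated as the tree support of all quartet topologies in which species i and j are neighbors, and Qij as the tree support of all topologies of the quartets that contain both species, so that equal support for the three topologies gives a distance of about 0.67. The data structure, names and numbers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def quartet_distances(support):
    """Distances d_ij = 1 - S_ij / Q_ij from quartet topology support.

    `support` maps a topology, written as a pair of species pairs such as
    (("A", "B"), ("C", "D")), to the number of trees that support it.
    """
    neighbor_support = defaultdict(int)   # S_ij
    total_support = defaultdict(int)      # Q_ij
    for (pair1, pair2), n_trees in support.items():
        for i, j in combinations(sorted(pair1 + pair2), 2):
            total_support[(i, j)] += n_trees
        for pair in (pair1, pair2):
            neighbor_support[tuple(sorted(pair))] += n_trees
    return {pair: 1 - neighbor_support[pair] / total
            for pair, total in total_support.items() if total > 0}


# Hypothetical support for one quartet mapped onto 10 trees.
support = {
    (("A", "B"), ("C", "D")): 8,
    (("A", "C"), ("B", "D")): 1,
    (("A", "D"), ("B", "C")): 1,
}
distances = quartet_distances(support)
```

In this toy example, A and B are neighbors in 8 of the 10 supporting trees and therefore end up close to each other, whereas A and C are neighbors in only 1 of 10.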

The Tree-Net Trend (TNT)

The quartet-based between-species distances were used to calculate the Tree-Net Trend (TNT) score. The TNT score is calculated by re-scaling each matrix of quartet distances to a 0 to 1 scale between the supertree-derived matrix (which is taken to represent solely the tree-like evolution signal, hence the distance of 0) and the matrix obtained from permuted trees, with distance values around the random expectation of 0.67 (Supplementary figure 5). Two situations may occur in the calculation of the TNT score depending on the relationship between the distance in the supertree matrix (Ds) and the distance in the random matrix (Dr = 0.67). When Ds > Dr (e.g., in comparisons of archaea versus bacteria), STNT = (d - Dr) / (Ds - Dr), where STNT is the TNT score and d is the distance between the two compared species in the matrix. When Ds < Dr (in comparisons between closely related species), STNT = 1 - (d - Ds) / (Dr - Ds).
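The two cases of the TNT score can be written as a short function. The function and parameter names below are hypothetical; the formulas are those given in the text. As a side observation, the two expressions are algebraically identical, since 1 - (d - Ds)/(Dr - Ds) = (Dr - d)/(Dr - Ds) = (d - Dr)/(Ds - Dr), so the case distinction affects only how the formula is written, not its value.

```python
def tnt_score(d, ds, dr=0.67):
    """Tree-Net Trend score for one pair of species.

    d  : quartet-based distance between the two species
    ds : distance between the same species in the supertree-derived matrix
    dr : random expectation of the distance (0.67 in the text)
    """
    if ds > dr:
        return (d - dr) / (ds - dr)
    if ds < dr:
        return 1 - (d - ds) / (dr - ds)
    raise ValueError("the TNT score is undefined when Ds equals Dr")


# Hypothetical pair of closely related species (Ds below the random value).
score = tnt_score(d=0.435, ds=0.2)
```

With either formula, a distance equal to the supertree distance gives a score of 1 and a distance equal to the random expectation gives a score of 0.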

4. PHYLOGENETIC CONCEPTS IN LIGHT OF PERVASIVE HORIZONTAL GENE TRANSFER

4.1. Patterns in the phylogenetic Forest of Life

The reconstruction of the evolutionary trends in the FOL is based on the idea that prokaryotes, effectively, share a common gene pool. This gene pool consists of genes with widely different ranges of phyletic spread, from universal to rare ones only present in a few species (71). Thus, genes, as the elements of this gene pool, have their distinct evolutionary histories blending HGT and vertical inheritance (Figure 9). In principle, the Forest of Life (FOL) encompasses the complete set of phylogenetic trees for all genes from all genomes. However, a comprehensive analysis of the entire FOL is computationally prohibitive (with over 1000 archaeal and bacterial genomes now available and the computational resources accessible to the authors, estimation of the phylogenetic tree for each gene represented in all these genomes would take weeks of computer time), so a representative subset of the trees needs to be selected and analyzed. Previously (5), we defined such a subset by selecting 100 archaeal and bacterial genomes, which are representative of all major prokaryote groups, and building 6901 maximum likelihood (ML) trees for all genes with a sufficient number of homologs and sufficient level of sequence conservation in this set of genomes; for brevity, we refer to this set of trees as the FOL. In this set of almost 7000 trees, only a very small portion of the forest is represented by nearly universal trees (Figure 9). Furthermore, bacterial and archaeal universal trees are rare as well, as reflected in Figure 9 by the small peaks around 41 and 59 species, i.e. all archaea and all bacteria, respectively. The dominant pattern in the major part of the FOL is completely different: the FOL is best represented by numerous small trees, with about 2/3 of the trees including <20 species (Figure 9).

Figure 9. The Forest of Life (FOL).

The distribution of the trees in the FOL by the number of species. Modified from ref. (5).

4.2. The nearly universal trees (NUTs)

We define the Nearly Universal Trees (NUTs) as trees for those COGs that were represented in more than 90% of the included prokaryotes. This definition yielded 102 NUTs. Not surprisingly, the great majority of the NUTs are genes encoding proteins involved in translation and the core aspects of transcription (Figure 10). Among the NUTs, only 14 corresponded to COGs that consist of strict 1:1 orthologs (all of them ribosomal proteins), whereas the rest of the NUTs included paralogs in some organisms (only the most conserved paralogs were used for tree construction (5)). The 1:1 NUTs were similar to the rest of the NUTs in terms of the connectivity in tree similarity (1-BSD) networks and their positions in the single cluster of NUTs obtained using CMDS.
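The 90% criterion amounts to a simple filter over the species content of each COG. The sketch below is only an illustration of that criterion; the function name `nearly_universal`, the dictionary layout and the toy identifiers are assumptions, not part of the original analysis pipeline.

```python
def nearly_universal(cog_species, n_genomes, threshold=0.90):
    """Return the ids of COGs present in more than `threshold` of the genomes.

    cog_species: dict mapping a COG id to an iterable of species names
    (hypothetical input layout chosen for this example).
    """
    return sorted(
        cog
        for cog, species in cog_species.items()
        if len(set(species)) / n_genomes > threshold
    )


# Toy data with 10 genomes: only a COG found in more than 9 of them qualifies.
toy = {
    "cogA": [f"sp{i}" for i in range(10)],
    "cogB": [f"sp{i}" for i in range(9)],
    "cogC": ["sp0", "sp1", "sp2"],
}
print(nearly_universal(toy, n_genomes=10))
```

With the 100 genomes used for the FOL, the same strict inequality selects COGs represented in at least 91 species.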

Figure 10. Distribution of the gene functions among the NUTs.

The functional classification of genes was from the COG database (59).

The 102 NUTs were compared to trees produced by analysis of concatenations of universal proteins (47). The results showed that most of the NUTs were topologically similar to a tree obtained by the concatenation of 31 universal orthologous genes (5); in other words, the ‘Universal Tree of Life’ constructed by Ciccarelli et al. (47) was statistically indistinguishable from the NUTs and showed properties of a consensus topology. Not surprisingly, the 1:1 ribosomal protein NUTs were even more similar to the universal tree than the rest of the NUTs, in part because these proteins were used for the construction of the universal tree and, in part, presumably because of the low level of HGT among ribosomal proteins.

4.3. The Tree of Life (TOL) as a central trend in the FOL

We analyzed the matrix of all-against-all tree comparisons of the NUTs by embedding them into a 30-dimensional tree space using the CMDS procedure (66, 67). The gap statistics analysis (68) reveals a lack of significant clustering among the NUTs in the tree space. Thus, all the NUTs seem to belong to one unstructured cloud of points scattered around a single centroid. This organization of the tree space is best compatible with individual trees randomly deviating from a single, dominant topology (which may be denoted the TOL), apparently as a result of random HGT (but in part possibly due to random errors in the tree-construction procedure). Therefore, there is an unequivocal general trend among the NUTs. Although the topologies of the NUTs were, for the most part, not identical, so that the NUTs could be separated by their degree of inconsistency (a proxy for the amount of HGT), the overall high consistency level indicated that the NUTs are scattered in the close vicinity of a consensus tree, with HGT events distributed randomly (5).
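The embedding step can be illustrated with a minimal sketch of classical multidimensional scaling (Torgerson scaling) applied to a distance matrix. This is not the R routine used in the original analysis; the function name `classical_mds` and the toy distance matrix are assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def classical_mds(dist, n_dims):
    """Embed points with pairwise distances `dist` into `n_dims` dimensions.

    Hypothetical helper illustrating classical MDS; not the authors' code.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram (inner-product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the eigenvectors that belong to the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against small negatives
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy matrix standing in for all-against-all tree distances (e.g., BSD values).
toy_dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.6],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.6, 0.2, 0.0],
]
coords = classical_mds(toy_dist, n_dims=2)
print(coords.shape)  # one row of coordinates per tree
```

Each row of the returned array gives the coordinates of one tree in the tree space; clustering methods such as K-means can then be applied to these coordinates.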

Thus, the NUTs present a unique and strong signal of unity that seems to reflect the TOL pattern of evolution. The inconsistency score (IS) among the NUTs ranged from 1.4 to 4.3%, whereas the mean IS value for an equivalent set (102) of randomly generated trees with the same number of species was approximately 80%, indicating that the topologies of the NUTs are highly consistent and non-random (5).

To further assess the potential contribution of phylogenetic analysis artifacts to observed inconsistencies between the NUTs, we analyzed these trees with different bootstrap support thresholds (that is, only splits supported by bootstrap values above the respective threshold value were compared). Particularly low IS levels were detected for splits with high-bootstrap support, but the inconsistency was never eliminated completely, suggesting that HGT is a significant contributor to the observed inconsistency among the NUTs (IS ranges from 0.3% to 2.1% and 0.3% to 1.8% for splits with a bootstrap value higher than 70 and 90 respectively) (5).

Analysis of the supernetwork built from the 102 NUTs (5) showed that the incongruence among these trees is mainly concentrated at the deepest levels, with a much greater congruence at shallow phylogenetic depths. The major exception is the unambiguous archaeal-bacterial split that is observed despite the apparent substantial interdomain HGT. Evidence of probable HGT between archaea and bacteria was obtained for approximately 44% of the NUTs (13% from archaea to bacteria, 23% from bacteria to archaea and 8% in both directions), with the implication that HGT is likely to be even more common between the major branches within the archaeal and bacterial domains (5). These results are compatible with previous reports on the apparently random distribution of HGT events in the history of highly conserved genes, in particular those encoding proteins involved in translation (72, 73), and on the difficulty of resolving the phylogenetic relationships between the major branches of bacteria (74-76) and archaea (5, 77, 78). More specifically, archaeal-bacterial HGT has been inferred for 83% of the genes encoding aminoacyl-tRNA synthetases (compared with the overall 44%), essential components of the translation machinery that are known for their horizontal mobility (40, 79). In contrast, no HGT has been predicted for any of the ribosomal proteins, which belong to an elaborate molecular complex, the ribosome, and hence appear to be non-exchangeable between the two prokaryotic domains (40, 73). In addition to the aminoacyl-tRNA synthetases, and in agreement with many previous observations ((80) and references therein), evidence of HGT between archaea and bacteria was seen also for the few metabolic enzymes that belonged to the NUTs, including undecaprenyl pyrophosphate synthase, glyceraldehyde-3-phosphate dehydrogenase, nucleoside diphosphate kinase, thymidylate kinase, and others.

4.4. The NUTs topologies as the central trend and detection of distinct evolutionary patterns in the FOL

Using the BSD method, we compared the topologies of the NUTs to those of the rest of the trees in the FOL. Notably, 2,615 trees (~38% of the FOL) showed a greater than 50% similarity (P-value < 0.05) to at least one of the NUTs, with the mean similarity of the trees to the NUTs being approximately 50% (Figure 11). For a set of 102 randomized trees of the same size as the NUTs, only about 10% of the trees in the FOL showed the same or greater similarity, indicating that the NUTs were strongly and non-randomly connected to the rest of the FOL.

Figure 11. Topological similarity between the NUTs and the rest of the FOL.

Percentage of trees connected to the NUTs at different levels of similarity (%). (Modified from Puigbò et al. 2009.)

We then analyzed the structure of the FOL by embedding the 3,789 COG trees into a 669-dimensional space using the CMDS procedure (66, 67). A CMDS clustering of the entire set of 6,901 trees in the FOL was beyond the capacity of the R software package used for this analysis; however, the set of COG trees included most of the trees with a large number of species for which the topology comparison is most informative. A gap statistics analysis (68) of K-means clustering of these trees in the tree space revealed distinct clusters of trees in the forest. The FOL is optimally partitioned into 7 clusters of trees (the smallest number of clusters for which the gap function did not significantly increase with the increase of the number of clusters) (Figure 12). Clusters 1, 4, 5 and 6 were enriched for bacterial-only trees, all archaeal-only trees belonged to clusters 2 and 3, and cluster 7 consisted entirely of mixed archaeal-bacterial trees; notably, all the NUTs form a compact group inside cluster 6.

Figure 12. Clusters and patterns in the FOL.

The 7 clusters identified in the FOL using the CMDS method and the mean similarity values between the 102 NUTs and all trees from each of the 7 clusters are shown. (Modified from Puigbò et al. 2009.)

The results of the CMDS clustering (Figure 12) support the existence of several distinct ‘attractors’ in the FOL. However, this clustering must be interpreted with caution because trivial separation of the trees by size could be an important contribution. The approaches to the delineation of distinct ‘groves’ within the forest merit further investigation. The most salient observation for the purpose of the present study is that all the NUTs occupy a compact and contiguous region of the tree space and, unlike the complete set of the trees, are not partitioned into distinct clusters by the CMDS procedure. Taken together with the high mean topological similarity between the NUTs and the rest of the FOL, these findings indicate that the NUTs represent a valid central trend in the FOL.

4.5. The tree and net components of prokaryote evolution

The TNT map of the NUTs was dominated by the tree-like signal (green in Figure 13a): the mean TNT score for the NUTs was 0.63 (Figure 14b), so the evolution of the nearly universal genes of prokaryotes appears to be almost “two-thirds tree-like” (i.e., reflects the topology of the supertree). The rest of the FOL stood in stark contrast to the NUTs, being dominated by net-like evolution, with a mean TNT value of 0.39 (Figure 14c) (about “60% net-like”). Remarkably, areas of tree-like evolution were interspersed with areas of net-like evolution across different parts of the FOL (Figure 13b). The major net-like areas observed among the NUTs were retained but additional ones became apparent, including Crenarchaeota that showed a pronounced signal of a non-tree-like relationship with diverse bacteria as well as some Euryarchaeota (Figure 13b). The distribution of the tree and net evolutionary signals among different groups of prokaryotes showed a striking split among the NUTs: among the archaea, the tree signal was heavily dominant (mean TNT_NUTs_Archaea = 0.80 ± 0.20) whereas among bacteria the contributions of the tree and net signals were nearly equal (mean TNT_NUTs_Bacteria = 0.51 ± 0.38). Among the rest of the trees in the FOL, archaea also showed a stronger tree signal than bacteria but the difference was much less pronounced than it was among the NUTs (mean TNT_FOL_Archaea = 0.47 ± 0.11 and mean TNT_FOL_Bacteria = 0.34 ± 0.08). The conclusions on the tree-like and net-like components of evolution made here are based on the assumption that the supertree of the NUTs represents the tree-like (vertical) signal. We did not perform direct tests of the robustness of these conclusions to the supertree topology. However, observations presented previously (5) suggest that the results are likely to be robust given the coherence of the NUTs topologies as well as the similarity of the supertree topology and the topologies of the individual NUTs to the ‘tree of life’ obtained from concatenated sequences of universally conserved ribosomal proteins (47).

Figure 13. The Tree-Net Trend (TNT) score heatmaps.

(A) The 102 NUTs.

(B) The FOL without the NUTs (6,799 trees). The TNT score increases from red (low score, close to random, an indication of net-like evolution) to green (high score, close to the supertree topology, an indication of tree-like evolution). The species are ordered according to the topology of the supertree of the 102 NUTs. In (A), the major groups of archaea and bacteria are denoted. (Modified from Puigbò et al. 2010.)

Figure 14. The Tree-Net Trends in the FOL and in the NUTs.

a) A hypothetical equilibrium between the tree and net trends.

b) A schematic representation of the tree tendency in the NUTs.

c) A schematic representation of the net tendency in the FOL.

5. CONCLUSIONS

The analysis of the phylogenetic FOL is a logical strategy for studying the evolution of prokaryotes because each set of orthologous genes presents its own evolutionary history and no single topology may represent the entire forest. Thus, the methods introduced in this article that compare trees without the use of a pre-conceived representative topology for the entire FOL may be of wide utility in phylogenomics.

We have shown that, although no single topology may represent the entire FOL and several distinct evolutionary trends are detectable, the NUTs contain a strong tree-like signal. Although the tree-like signal is quantitatively weaker than the sum total of the signals from HGT, it is the most pronounced single pattern in the entire FOL.

Under the FOL perspective, the traditional TOL concept (a single “true” tree topology) is invalidated and should be replaced by a statistical definition. In other words, the TOL only makes sense as a central trend in the phylogenetic forest.

Supplementary Material

Additional file 1
Additional file 2
Figures S1-S5

Supplementary figure 1 – A schematic of the methods and concepts involved in the FOL analysis

Supplementary figure 2 – The main algorithm of the BSD method

The algorithm to calculate the BSD between two trees includes four basic steps: 1) split both trees into all possible partitions, 2) read the common set of species of both trees, 3) prune the splits according to the common leaf set and 4) calculate the BSD.
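Steps 1 to 3 of this procedure can be sketched with trees written as nested tuples of leaf names. The code below is an illustrative assumption, not the published BSD implementation: bootstrap values are ignored, the helper names (`leaves`, `splits`, `split_distance`) are hypothetical, and the final step uses a plain normalization (differing splits divided by the total number of splits in both trees) as a stand-in for the bootstrap-weighted BSD formula, which is not reproduced here.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree, keep):
    """Non-trivial splits of `tree` after pruning to the leaf set `keep`."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node) & keep
        other = keep - side
        # A split is informative only if both sides retain at least two leaves.
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset([side, other]))  # orientation-free bipartition
        for child in node:
            visit(child)

    visit(tree)
    return found


def split_distance(tree_a, tree_b):
    """Fraction of splits not shared by the two trees (assumed normalization)."""
    common = leaves(tree_a) & leaves(tree_b)  # step 2: common leaf set
    splits_a = splits(tree_a, common)         # steps 1 and 3 for tree A
    splits_b = splits(tree_b, common)         # steps 1 and 3 for tree B
    total = len(splits_a) + len(splits_b)
    if total == 0:
        return 0.0
    return len(splits_a ^ splits_b) / total


tree_x = ((("A", "B"), "C"), "D", "E")
tree_y = ((("A", "B"), "F"), "D", "C")
print(split_distance(tree_x, tree_y))  # 0.0: the trees agree on the shared leaves A, B, C, D
```

Storing each split as an unordered pair of leaf sets means that splits duplicated by pruning collapse automatically and that the placement of the root does not affect the comparison.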

Supplementary figure 3 – Correlation of BSD and SD from the all-against-all tree comparisons of 6901 phylogenetic trees.

a) Trees containing 4 to 15 species.

b) Trees containing 16 to 100 species.

c) SD, BSD and tree size for trees containing between 16 and 100 species.

Supplementary figure 4 – Quartets and quartet topologies

a) Each quartet (qi) is defined by a set of four species (different colors denote species) and may be represented by three possible unrooted tree topologies (qiti).

b) Quartet topologies (QT). In 100 species, the total number of quartets (Q) is 3,921,225. Each quartet may be represented by 3 distinct QTs, resulting in a total of 11,763,675 QTs. Each QT was mapped onto the FOL, i.e., for each QT, it was determined which of the three topologies is represented in each phylogenetic tree in the FOL (8.12×10^10 comparisons). (Modified from Puigbò et al. 2010.)
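The counts in this caption follow from elementary combinatorics and can be checked directly. The short calculation below is an illustration only; the variable names are assumptions, and it treats every quartet topology as being compared against every one of the 6901 trees.

```python
from math import comb

n_species = 100   # genomes in the FOL data set
n_trees = 6901    # trees in the FOL

n_quartets = comb(n_species, 4)          # unordered sets of four species
n_topologies = 3 * n_quartets            # three unrooted topologies per quartet
n_comparisons = n_topologies * n_trees   # every topology checked against every tree

print(n_quartets, n_topologies, f"{n_comparisons:.2e}")
```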

Supplementary figure 5 – The Tree-Net Trend (TNT)

The figure shows a schematic of the TNT calculation and the re-scaling procedure. (Modified from Puigbò et al. 2010.)

ACKNOWLEDGEMENTS

The authors’ research is supported by the Department of Health and Human Services intramural program (NIH, National Library of Medicine).

ABBREVIATIONS

CMDS: classical multidimensional scaling
COG: clusters of orthologous genes
BSD: boot-split distance
FOL: forest of life
HGT: horizontal gene transfer
ND: nodal distance
NUTs: nearly universal trees
QT: quartet topology
TNT: tree-net trend
TOL: tree of life
SD: split distance

Footnotes

Additional file 1 – Results of random analysis

Random tests were performed for trees with 4 to 100 species; each test is the result of 1000 random trees compared all-against-all.

Additional file 2 – Results of the equivalence between topological distance and permutations.

Number of permutations required to obtain any distance value (BSD or SD).

EXERCISES

1. Calculate the split distance (SD) and boot-split distance (BSD) of the following two trees:

(((A,B)61,C)53,D,E); (((A,C)76,B)38,D,E);

2. Calculate the Inconsistency Score of the tree X in the ‘forest of trees’ Y.

X = (((A,B),C),D,E);

Y = (((A,B),C),D,E); (A,B,(E,D)); (((A,C),B),D,E); (A,C,(B,D)); (A,B,(C,D)); (A,B,(C,E)); (A,E,(B,D)); (((A,C),D),E,F); (((A,B),D),E,C); (((E,F),A),B,C);

REFERENCES

1. Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldon T. The human phylome. Genome Biol. 2007;8:R109. doi: 10.1186/gb-2007-8-6-r109.
2. Huerta-Cepas J, Bueno A, Dopazo J, Gabaldon T. PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res. 2008;36:D491–496. doi: 10.1093/nar/gkm899.
3. Frickey T, Lupas AN. PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res. 2004;32:5231–5238. doi: 10.1093/nar/gkh867.
4. Sicheritz-Ponten T, Andersson SG. A phylogenomic approach to microbial evolution. Nucleic Acids Res. 2001;29:545–552. doi: 10.1093/nar/29.2.545.
5. Puigbo P, Wolf YI, Koonin EV. Search for a Tree of Life in the thicket of the phylogenetic forest. J Biol. 2009;8:59. doi: 10.1186/jbiol159.
6. Felsenstein J. Inferring Phylogenies. Sinauer Associates; Sunderland, MA: 2004.
7. Nei M, Kumar S. Molecular Evolution and Phylogenetics. Oxford Univ.; Oxford: 2001.
8. Castresana J. Topological variation in single-gene phylogenetic trees. Genome Biol. 2007;8:216. doi: 10.1186/gb-2007-8-6-216.
9. Soria-Carrasco V, Castresana J. Estimation of phylogenetic inconsistencies in the three domains of life. Mol Biol Evol. 2008;25:2319–2329. doi: 10.1093/molbev/msn176.
10. Marcet-Houben M, Gabaldon T. The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS ONE. 2009;4:e4357. doi: 10.1371/journal.pone.0004357.
11. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–147.
12. Puigbo P, Garcia-Vallve S, McInerney JO. TOPD/FMTS: a new software to compare phylogenetic trees. Bioinformatics. 2007;23:1556–1558. doi: 10.1093/bioinformatics/btm135.
13. Steel MA, Penny D. Distribution of tree comparison metrics - some new results. Systematic Biol. 1993;42:126–141.
14. Bluis J, Shin D-G. Nodal distance algorithm: calculating a phylogenetic tree comparison metric. Proceedings of the third IEEE symposium on bioInformatics and bioEngineering. IEEE Computer Society; 2003. pp. 87–94.
15. Cardona G, Llabres M, Rossello F, Valiente G. Nodal distances for rooted phylogenetic trees. J Math Biol. 2009 doi: 10.1007/s00285-009-0295-2.
16. Estabrook GF, McMorris FR, Meachan A. Comparison of undirected phylogenetic trees based on subtree of four evolutionary units. Syst Zool. 1985;34:193–200.
17. Allen L, Steel M. Subtree Transfer Operations and Their Induced Metrics on Evolutionary Trees. Annals of Combinatorics. 2001;5:1–15.
18. Waterman MS, Steel M. On the similarity of dendrograms. J Theor Biol. 1978;73:789–800. doi: 10.1016/0022-5193(78)90137-6.
19. Beiko RG, Hamilton N. Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol. 2006;6:15. doi: 10.1186/1471-2148-6-15.
20. Hickey G, Dehne F, Rau-Chaplin A, Blouin C. SPR Distance Computation for Unrooted Trees. Evol Bioinform Online. 2008;4:17–27. doi: 10.4137/ebo.s419.
21. Kubicka E, Kubicki G, McMorris FR. An algorithm to find agreement subtrees. J Classification. 1995;12:91–99.
22. Nye TM, Lio P, Gilks WR. A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics. 2006;22:117–119. doi: 10.1093/bioinformatics/bti720.
23. de Vienne DM, Giraud T, Martin OC. A congruence index for testing topological similarity between trees. Bioinformatics. 2007;23:3119–3124. doi: 10.1093/bioinformatics/btm500.
24. Cotton JA, Page RD. Going nuclear: gene family evolution and vertebrate phylogeny reconciled. Proc Biol Sci. 2002;269:1555–1561. doi: 10.1098/rspb.2002.2074.
25. Soria-Carrasco V, Talavera G, Igea J, Castresana J. The K tree score: quantification of differences in the relative branch length and topology of phylogenetic trees. Bioinformatics. 2007;23:2954–2956. doi: 10.1093/bioinformatics/btm466.
26. Marcet-Houben M, Gabaldon T. TreeKO: a duplication-aware algorithm for the comparison of phylogenetic trees. Nucleic Acids Res. 2011;39:e66. doi: 10.1093/nar/gkr087.
27. Koonin EV, Wolf YI, Puigbo P. The phylogenetic forest and the quest for the elusive tree of life. Cold Spring Harb Symp Quant Biol. 2009;74:205–213. doi: 10.1101/sqb.2009.74.006.
28. Zuckerkandl E, Pauling L. Molecular evolution. In: Kasha M, B. P, editors. Horizons in Biochemistry. Academic Press; New York: 1962. pp. 189–225.
29. Woese CR. Bacterial evolution. Microbiol Rev. 1987;51:221–271. doi: 10.1128/mr.51.2.221-271.1987.
30. Bapteste E, O’Malley MA, Beiko RG, Ereshefsky M, Gogarten JP, Franklin-Hall L, et al. Prokaryotic evolution and the tree of life are two different things. Biol Direct. 2009;4:34. doi: 10.1186/1745-6150-4-34.
31. Doolittle WF. Uprooting the tree of life. Sci Am. 2000;282:90–95. doi: 10.1038/scientificamerican0200-90.
32. Doolittle WF, Bapteste E. Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci U S A. 2007;104:2043–2049. doi: 10.1073/pnas.0610699104.
33. Kurland CG, Canback B, Berg OG. Horizontal gene transfer: A critical view. Proc Natl Acad Sci U S A. 2003;100:9658–9662. doi: 10.1073/pnas.1632870100.
34. Kurland CG. What tangled web: barriers to rampant horizontal gene transfer. Bioessays. 2005;27:741–747. doi: 10.1002/bies.20258.
35. Logsdon JM, Faguy DM. Thermotoga heats up lateral gene transfer. Curr Biol. 1999;9:R747–751. doi: 10.1016/s0960-9822(99)80474-6.
36. Genereux DP, Logsdon JM., Jr. Much ado about bacteria-to-vertebrate lateral gene transfer. Trends Genet. 2003;19:191–195. doi: 10.1016/S0168-9525(03)00055-6.
37. Kunin V, Goldovsky L, Darzentas N, Ouzounis CA. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 2005;15:954–959. doi: 10.1101/gr.3666505.
38. Daubin V, Moran NA, Ochman H. Phylogenetics and the cohesion of bacterial genomes. Science. 2003;301:829–832. doi: 10.1126/science.1086568.
39. Lerat E, Daubin V, Moran NA. From Gene Trees to Organismal Phylogeny in Prokaryotes: The Case of the gamma-Proteobacteria. PLoS Biol. 2003;1:E19. doi: 10.1371/journal.pbio.0000019.
40. Woese CR, Olsen GJ, Ibba M, Soll D. Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiol Mol Biol Rev. 2000;64:202–236. doi: 10.1128/mmbr.64.1.202-236.2000.
41. Fitz-Gibbon ST, House CH. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999;27:4218–4222. doi: 10.1093/nar/27.21.4218.
42. Hanage WP, Fraser C, Spratt BG. Sequences, sequence clusters and bacterial species. Philos Trans R Soc Lond B Biol Sci. 2006;361:1917–1927. doi: 10.1098/rstb.2006.1917.
43. Eisen JA, Fraser CM. Phylogenomics: intersection of evolution and genomics. Science. 2003;300:1706–1707. doi: 10.1126/science.1086292.
44. Salzberg SL, White O, Peterson J, Eisen JA. Microbial genes in the human genome: lateral transfer or gene loss? Science. 2001;292:1903–1906. doi: 10.1126/science.1061036.
45. Galtier N. A model of horizontal gene transfer and the bacterial phylogeny problem. Syst Biol. 2007;56:633–642. doi: 10.1080/10635150701546231.
46. Galtier N, Daubin V. Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci. 2008;363:4023–4029. doi: 10.1098/rstb.2008.0144.
47. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–1287. doi: 10.1126/science.1123061.
48. Choi IG, Kim SH. Global extent of horizontal gene transfer. Proc Natl Acad Sci U S A. 2007;104:4489–4494. doi: 10.1073/pnas.0611557104.
49. Koonin EV, Wolf YI, Puigbo P. The Phylogenetic Forest and the Quest for the Elusive Tree of Life. Cold Spring Harb Symp Quant Biol. 2009 doi: 10.1101/sqb.2009.74.006.
50. Dagan T, Martin W. Getting a better picture of microbial evolution en route to a network of genomes. Philos Trans R Soc Lond B Biol Sci. 2009;364:2187–2196. doi: 10.1098/rstb.2009.0040.
51. Boucher Y, Douady CJ, Papke RT, Walsh DA, Boudreau ME, Nesbo CL, et al. Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet. 2003;37:283–328. doi: 10.1146/annurev.genet.37.050503.084247.
52. Bucknam J, Boucher Y, Bapteste E. Refuting phylogenetic relationships. Biol Direct. 2006;1:26. doi: 10.1186/1745-6150-1-26.
53. Schliep K, Lopez P, Lapointe FJ, Bapteste E. Harvesting evolutionary signals in a forest of prokaryotic gene trees. Mol Biol Evol. 2011;28:1393–1405. doi: 10.1093/molbev/msq323.
54. Beiko RG, Doolittle WF, Charlebois RL. The impact of reticulate evolution on genome phylogeny. Syst Biol. 2008;57:844–856. doi: 10.1080/10635150802559265.
55. Doolittle WF, Zhaxybayeva O. On the origin of prokaryotic species. Genome Res. 2009;19:744–756. doi: 10.1101/gr.086645.108.
56. Gogarten JP, Townsend JP. Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol. 2005;3:679–687. doi: 10.1038/nrmicro1204.
57. Gogarten JP, Doolittle WF, Lawrence JG. Prokaryotic evolution in light of gene transfer. Mol Biol Evol. 2002;19:2226–2238. doi: 10.1093/oxfordjournals.molbev.a004046.
58. Puigbo P, Wolf YI, Koonin EV. The tree and net components of prokaryote evolution. Genome Biol Evol. 2010;2:745–756. doi: 10.1093/gbe/evq062.
59. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. doi: 10.1186/1471-2105-4-41.
60. Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, et al. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 2008;36:D250–254. doi: 10.1093/nar/gkm796.
61. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
62. Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17:540–552. doi: 10.1093/oxfordjournals.molbev.a026334.
63. Keane TM, Naughton TJ, McInerney JO. MultiPhyl: a high-throughput phylogenomics webserver using distributed computing. Nucleic Acids Res. 2007;35:W33–37. doi: 10.1093/nar/gkm359.
64. Creevey CJ, McInerney JO. Clann: investigating phylogenetic information through supertree analyses. Bioinformatics. 2005;21:390–392. doi: 10.1093/bioinformatics/bti020.
65. Felsenstein J. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 1996;266:418–427. doi: 10.1016/s0076-6879(96)66026-1.
66. Torgerson WS. Theory and Methods of Scaling. Wiley; New York: 1958.
67. Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53:325–328.
68. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2001;63:411–423.
69. Hillis DM, Heath TA, St John K. Analysis and visualization of tree space. Syst Biol. 2005;54:471–482. doi: 10.1080/10635150590946961.
70. Pavlidis P, Noble WS. Matrix2png: a utility for visualizing matrix data. Bioinformatics. 2003;19:295–296. doi: 10.1093/bioinformatics/19.2.295.
71. Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008;36:6688–6719. doi: 10.1093/nar/gkn668.
72. Ge F, Wang LS, Kim J. The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol. 2005;3:e316. doi: 10.1371/journal.pbio.0030316.
73. Brochier C, Bapteste E, Moreira D, Philippe H. Eubacterial phylogeny based on translational apparatus proteins. Trends Genet. 2002;18:1–5. doi: 10.1016/s0168-9525(01)02522-7.
74. Wolf YI, Rogozin IB, Grishin NV, Koonin EV. Genome trees and the tree of life. Trends Genet. 2002;18:472–479. doi: 10.1016/s0168-9525(02)02744-0.
75. Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evolutionary Biology. 2001;1 doi: 10.1186/1471-2148-1-8.
76. Creevey CJ, Fitzpatrick DA, Philip GK, Kinsella RJ, O’Connell MJ, Pentony MM, et al. Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc Biol Sci. 2004;271:2551–2558. doi: 10.1098/rspb.2004.2864.
77. Brochier-Armanet C, Boussau B, Gribaldo S, Forterre P. Mesophilic Crenarchaeota: proposal for a third archaeal phylum, the Thaumarchaeota. Nat Rev Microbiol. 2008;6:245–252. doi: 10.1038/nrmicro1852.
78. Elkins JG, Podar M, Graham DE, Makarova KS, Wolf Y, Randau L, et al. A korarchaeal genome reveals new insights into the evolution of the Archaea. Proc Natl Acad Sci USA. 2008 doi: 10.1073/pnas.0801980105. in press.
79. Wolf YI, Aravind L, Grishin NV, Koonin EV. Evolution of aminoacyl-tRNA synthetases--analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res. 1999;9:689–710.
80. Koonin EV. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Rev Microbiol. 2003;1:127–136. doi: 10.1038/nrmicro751.
