Genome Biology and Evolution. 2023 May 29;15(6):evad096. doi: 10.1093/gbe/evad096

Phylogenomic Testing of Root Hypotheses

Fernando D K Tria 1,b,#, Giddy Landan 2,#, Devani Romero Picazo 3, Tal Dagan 4,
Editor: Davide Pisani
PMCID: PMC10262964  PMID: 37247390

Abstract

The determination of the last common ancestor (LCA) of a group of species plays a vital role in evolutionary theory. Traditionally, an LCA is inferred by the rooting of a fully resolved species tree. From a theoretical perspective, however, inference of the LCA amounts to the reconstruction of just one branch, the root branch, of the true species tree and should therefore be a much easier task than the full resolution of the species tree. Discarding the reliance on a hypothesized species tree and its rooting leads us to reevaluate what phylogenetic signal is directly relevant to LCA inference and to recast the task as that of sampling the total evidence from all gene families at the genomic scope. Here, we reformulate LCA and root inference in the framework of statistical hypothesis testing and outline an analytical procedure to formally test competing a priori LCA hypotheses and to infer confidence sets for the earliest speciation events in the history of a group of species. Applying our methods to two demonstrative data sets, we show that our inference of the opisthokonta LCA agrees well with common knowledge. Inference of the proteobacteria LCA shows that it is most closely related to modern Epsilonproteobacteria, raising the possibility that it may have been characterized by a chemolithoautotrophic and anaerobic lifestyle. Our inference is based on data comprising between 43% (opisthokonta) and 86% (proteobacteria) of all gene families. Approaching LCA inference within a statistical framework renders the phylogenomic inference powerful and robust.

Keywords: phylogenetics, species tree, rooting, last common ancestor (LCA), proteobacteria


Significance.

Inference of the last common ancestor (LCA) for a group of taxa is central in the study of species evolution. We present a novel approach for the inference of the LCA without reconstructing a species tree and demonstrate its applicability using opisthokonta and proteobacteria data sets. Approaching LCA inference within a statistical framework renders the phylogenomic inference powerful and robust.

Introduction

Inference of the last common ancestor (LCA) for a group of taxa is central to the study of the evolution of genes, genomes, and organisms. Prior to the assignment of an LCA, represented by the root node, phylogenetic trees are devoid of a time direction, and the temporal order of divergences is undetermined. Discoveries and insights from the reconstruction of LCAs span a wide range of taxonomic groups and time scales. For example, studies of the last universal common ancestor (LUCA) of all living organisms inferred that it was an anaerobic, thermophilic prokaryote that was CO2 fixing, H2 dependent with a Wood–Ljungdahl pathway, and N2 fixing (Weiss et al. 2016). Nonetheless, due to the inherent difficulty of ancient LCA inference, it is frequently the center of evolutionary controversies, for instance, the debate concerning the two versus three domains of life (Williams et al. 2020), the cyanobacterial root position (Hammerschmidt et al. 2021), the LCA of vertebrates (Okamoto et al. 2017), or the LCA of hominids (Lovejoy et al. 2009).

The identity of the LCA is traditionally inferred from a species tree that is reconstructed unrooted and is then rooted at the final step. The root may be inferred by several methods, including out-group rooting (Kluge and Farris 1969), midpoint rooting (Farris 1972), minimal ancestor deviation (MAD) (Tria et al. 2017), minimal variance rooting (Mai et al. 2017), relaxed clock models (Lepage et al. 2007), tree reconciliation (Szöllősi et al. 2012), and nonreversible substitution models for nucleotide data (Huelsenbeck et al. 2002; Williams et al. 2015; Cherlin et al. 2018; Bettisworth and Stamatakis 2021) or amino acid data (Naser-Khdour et al. 2022). A correct LCA inference in methods that rely on the availability of the species tree is thus dependent on the accuracy of the species tree topology. One approach for the reconstruction of a species tree is to use a single gene as a proxy for the species tree topology, for example, 16S ribosomal RNA subunit for prokaryotes (Fox et al. 1980) or the Cytochrome C for eukaryotes (Fitch and Margoliash 1967). This approach is, however, limited in its utility due to possible differences between the gene evolutionary history and the species phylogeny. Similarly, methods relying on ancient paralogs for root inference (e.g., Gogarten et al. 1989; Iwabe et al. 1989) depend on a small number of genes and assume that the gene and species evolutionary histories are identical.

Phylogenomics offers an alternative to the single-gene approach by aiming to utilize the whole genome rather than a single gene for the phylogenetic reconstruction (Eisen 2003). In the most basic approach, the species tree is reconstructed from the genes that are shared among all the species under study, termed here complete gene families. Approaches for the reconstruction of a species tree from multiple complete single-copy (CSC) genes include tree reconstruction from concatenated alignments (e.g., Ciccarelli et al. 2006; Hug et al. 2016; Parks et al. 2018) as well as calculation of consensus trees (e.g., Dagan et al. 2013). However, these approaches are often restricted in their data sample as they exclude partial gene families not present in all members of the species set (e.g., due to differential loss) and multicopy gene families present in multiple copies in one or more species (e.g., as a result of gene duplications or gene acquisition). Thus, the drawback of alignment concatenation or consensus tree approaches is that the inference becomes limited to gene sets that do not represent the entirety of genomes. This issue tends to become more acute the more diverse the species set is or when it includes species with reduced genomes. In extreme cases, no single-copy and complete gene family exists (Medini et al. 2005). Super-tree approaches offer an alternative as they enable the inclusion of partial gene families (Pisani et al. 2007; Whidden et al. 2014; Williams et al. 2017; Zhu et al. 2019); however, those approaches also exclude multicopy gene families (partial and complete). These requirements produce the phenomenon of “trees of 1%” of the genes (Dagan and Martin 2006; and see our table 1). Thus, although the major aim of phylogenomic approaches is to improve the accuracy of phylogenetic inference by increasing the sample size, methodologies based on single-copy genes suffer from several inference problems, with some elements that are common to all of them. The first is the limited sample size due to the small number of single-copy genes. In super-tree approaches, as in approaches based on concatenation and consensus, there is no room for the inclusion of multicopy gene families. Furthermore, reduction of families including paralogs into ortholog-only subsets, for example, using tree reconciliation (reviewed in Szöllősi et al. 2015; Smith and Hahn 2021), requires an a priori assumed species tree topology. Finally, because the aforementioned approaches yield unrooted species trees, the inference of the root is performed as the last step in the analysis; hence, the sample size of trees used for the LCA inference is essentially one (i.e., a single tree).

Table 1. Illustrative Data Sets Used in This Study

| | Opisthokonta | Proteobacteria |
|---|---|---|
| A. Number of species | 31 | 72 |
| B. Total number of gene families | 13,036 | 9,686 |
| CSC: Complete single-copy gene families, present as single-copy in all members of a species set. | 170 (1.30%) | 50 (0.52%) |
| CMC: Complete multicopy gene families, present in all species, but having multiple copies in at least one species. | 612 (4.69%) | 70 (0.72%) |
| PSC: Partial single-copy gene families, absent from some species and present as single-copy in the others. | 7,773 (59.63%) | 5,586 (57.67%) |
| PMC: Partial multicopy gene families, absent from some species and having multiple copies in at least one other species. | 4,481 (34.37%) | 3,980 (41.09%) |
| C. Consensus MAD root in CSC gene trees | 77.70% (132 out of 170 trees) {fungi, metazoa} | 33.17% (16.58* out of 50 trees) {ɛ-proteobacteria, other proteobacteria} |

Note.—(A) Number of species in each data set (for the complete list of species, see supplementary table S1, Supplementary Material online). (B) Classification of gene families in the data sets according to their presence and absence pattern in the species genomes. The proportion of gene families in each category from the total gene families is shown in parenthesis. (C) Number of CSC gene trees where the inferred root matches the consensus rooting using minimal ancestor deviation (MAD). *Note that the number of supporting trees may include fractions since trees with competing roots contribute the fraction of supporting root splits to the consensus MAD root.
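The four gene family classes in table 1B follow from two yes/no properties: whether every species is represented and whether any species carries more than one copy. The following minimal sketch illustrates that classification; the function name `classify_family` and the input layout (one species label per gene copy) are assumptions made for illustration and are not taken from the study's pipeline.

```python
from collections import Counter

def classify_family(copy_species, all_species):
    """Classify a gene family as CSC, CMC, PSC, or PMC.

    copy_species: one species label per gene copy (OTU) in the family.
    all_species:  every species in the data set.
    (Hypothetical input layout, chosen for illustration only.)
    """
    copies = Counter(copy_species)
    complete = set(copies) == set(all_species)          # present in all species
    single_copy = all(n == 1 for n in copies.values())  # no species has >1 copy
    if complete:
        return "CSC" if single_copy else "CMC"
    return "PSC" if single_copy else "PMC"

# Toy species set (invented labels).
species = {"A", "B", "C"}
for family in (["A", "B", "C"], ["A", "A", "B", "C"], ["A", "B"], ["A", "A", "B"]):
    print(family, classify_family(family, species))
```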

The inference of the LCA from a single species tree can be robust and accurate only if the underlying species tree is reliable. Unfortunately, this is rarely the case, as can be frequently seen in the plurality of gene tree topologies and their disagreement with species trees, especially for prokaryotes due to frequent gene transfers (e.g., Doolittle and Bapteste 2007; Linz et al. 2007; and see our fig. 5a). Indeed, recent implementations of tree reconciliation approaches aim to accommodate the presence of heterogeneous topologies due to gene duplication or gain by inferring the effect of such events on the tree topology (e.g., Coleman et al. 2021; Morel et al. 2022). However, such applications require an a priori assumption of the relative frequency of gene duplication and gene transfer that is bound to have a significant effect on the resulting tree topology (including the root position) (Bremer et al. 2022). We propose that the identification of an LCA for a group of species does not require the reconstruction of a fully resolved species tree. Instead, the LCA can be defined as the first speciation event, that is, tree node, for the group of species. In this formulation, the topological resolution of the entire species tree is immaterial, and the only phylogenetic conclusion needed is the partitioning of the species into two disjoint monophyletic groups, that is, the species root partition.

Fig. 5. LCA inference by sequential elimination in the proteobacteria data set. (a) Phylogenetic split network of the 50 CSC gene trees; (b) trace of the sequential elimination process (see supplementary table S1b, Supplementary Material online for candidate partition definitions). Selected partitions are indicated by gray arcs in a and bold numbers in b.

Here, we present a novel approach for the LCA inference without reconstructing a species tree. Our approach considers the total evidence from unrooted gene trees for all protein families from a set of taxa, including partial families as well as multicopy gene families. The approach utilizes the measure of ancestor deviation (AD) that is the basis for the MAD rooting method (Tria et al. 2017). The MAD rooting method assumes a clock-like evolutionary rate of protein families, which has been shown to be a reasonable assumption, at least for prokaryotic protein families where ∼70% of families do not deviate significantly from a clock-like evolutionary rate (Novichkov et al. 2004; Dagan et al. 2010). The AD measure is calculated for each branch in a given phylogenetic tree as the mean relative deviation from the molecular clock expectation, when the root is positioned on that branch. The branch that minimizes the relative deviations from the molecular clock assumption (i.e., is assigned the minimal AD) is the best candidate to contain the root node. The AD calculation can be performed for any given tree topology, regardless of the tree reconstruction approach. The comparison of branch ADs calculated for gene trees of all protein families in a group of species enables us to formulate and test hypotheses regarding the LCA of the species tree.

Results

For the presentation of the statistical framework for testing root hypotheses, we used illustrative rooting problems for two species sets: opisthokonta and proteobacteria (table 1A and B). The root partition is well established for the opisthokonta species set, and it serves here as a positive control. The root of the proteobacteria species set is still debated; hence, this data set serves to demonstrate the power of the proposed approach. The opisthokonta data set comprises 14 metazoa and 17 fungi species, and the known root is a partition separating fungi from metazoa species (Stechmann and Cavalier-Smith 2002; Katz 2012). The proteobacteria data set includes species from five taxonomic classes in that phylum (Ciccarelli et al. 2006; Pisani et al. 2007; Lang et al. 2013; but see Waite et al. 2017). The proteobacteria data set poses a harder root inference challenge than opisthokonta as previous results suggest that three different branches are comparably likely to harbor the root node, a situation best described as a root neighborhood (Tria et al. 2017).

Phylogenomic Rooting as Hypothesis Testing

Our LCA inference approach differs from existing ones in several aspects: 1) No species tree is reconstructed or assumed. 2) Phylogenetic information is extracted from gene trees reconstructed from partial and multicopy gene families in addition to CSC gene families. 3) The analysis uses unrooted gene trees, and no rooting operations are performed, of either gene trees or species trees. 4) Any LCA hypothesis can be tested, including species partitions that do not occur in any of the trees. LCA inference deals with abstractions of similar but distinct types of phylogenetic roots: species tree roots and gene tree roots. We reserve the terms root branch and root split to refer to gene tree roots, while species root partition refers to species trees.

Before describing our approach, we first demonstrate the limitations of a simpler phylogenomic rooting procedure that uses only CSC gene families and infers the root by a consensus derived from the rooted trees of the CSC genes. In this procedure, only the root branch in each gene tree is considered for the root inference. We then show how to consider all branches from each unrooted gene tree, not only the root branch of the rooted trees. The incorporation of all branches, not considered by a simple consensus of rooted trees, leads to a statistical test to decide between two competing root hypotheses.

Next, we show how information from partial and multicopy gene families can be used within the same statistical framework, greatly increasing the sample size and inference power. We then extend the pairwise formulation and consider multiple competing root partitions by testing all partition pairs, one pair at a time.

Finally, we modify the pairwise test to a comparison of one root partition against all alternatives simultaneously (a one-to-many test) and present a sequential elimination process that infers a minimal root neighborhood, that is, a confidence set of LCA partitions.

Phylogenomic Consensus Rooting

The consensus approach infers the root partition of a species set from a sample of rooted CSC gene trees. Root branches are collected from all trees, and the operational taxonomic units (OTUs) split induced by the most frequent root branch in the sample is the inferred species root partition for the species set. In species sets with a strong root signal, this majority-rule approach is sufficient to determine a clear root partition for the species set. This circumstance is observed in the opisthokonta illustrative data set. Using MAD (Tria et al. 2017) to root the individual gene trees, the consensus species root partition was inferred as the root branch found in >70% of the CSC gene trees (see table 1). In the proteobacteria, in contrast, the most frequent root branch was inferred in just 33% of the CSC gene trees. The performance of the consensus approach is thus hindered by three factors. First, majority-rule voting considers just one branch from each gene tree, ignoring a large measure of the phylogenetic signal present in the gene trees. In addition, the quality of the root inference varies among the gene trees and is quantifiable, but this information is not utilized by the consensus approach. Lastly, simple voting cannot be satisfactorily tested for statistical significance.
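The majority-rule vote described above, including the fractional votes mentioned in the note to table 1 for trees with tied roots, can be sketched as follows. The representation of a partition as an unordered pair of species sets and the names `partition` and `consensus_root` are illustrative assumptions; the toy species labels are invented.

```python
from collections import defaultdict

def partition(side_a, side_b):
    """Order-independent key for a two-sided species partition."""
    return frozenset({frozenset(side_a), frozenset(side_b)})

def consensus_root(roots_per_tree):
    """Majority-rule root partition from rooted CSC gene trees.

    roots_per_tree: one list per gene tree holding its inferred root
    partition(s); a tree with k tied roots gives 1/k of a vote to each.
    Returns the most frequent partition and its share of the trees.
    """
    votes = defaultdict(float)
    for tied_roots in roots_per_tree:
        for root in tied_roots:
            votes[root] += 1.0 / len(tied_roots)
    best = max(votes, key=votes.get)
    return best, votes[best] / len(roots_per_tree)

# Toy example with invented species labels: three gene trees, one with a tie.
fungi_metazoa = partition({"F1", "F2"}, {"M1", "M2"})
alternative = partition({"F1"}, {"F2", "M1", "M2"})
best, share = consensus_root([[fungi_metazoa], [fungi_metazoa],
                              [fungi_metazoa, alternative]])
print(best == fungi_metazoa, round(share, 3))
```

Note that the vote uses only one branch per tree and discards the AD values of all other branches, which is the limitation the subsequent tests address.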

The Root Support Test for Two Alternative Root Partitions

The first step in our approach is a formulation of a test to select between two competing species root partitions (see fig. 1 for a road-map of the procedure). To that end, we do not infer a single root for each gene tree, but consider every branch of a gene tree as a possible root branch, and assign it a score that quantifies the relative quality of different root positions. In the current study, we use the AD measure to assess the relative strength of alternative roots of the same gene tree. The AD statistic quantifies the amount of lineage rate heterogeneity that is induced by postulating a branch as harboring the root of the tree. We have previously shown that AD measures provide robust evidence for the inference of the root of a single gene tree (Tria et al. 2017).

Fig. 1. Outline of the analytical procedure. Stages are depicted clockwise from top-left. The input for the analysis is (1) gene trees of all protein families for a group of species, including the information of AD per branch as calculated by MAD. Protein families are classified into complete and partial, single-copy, or multicopy families according to the gene copy number per species. (2) Branch ADs in the gene trees supply evidence for hypothetical root partitions in the species tree; these are collected in the (3) root support matrix. The information in the root support matrix is used to identify candidates for the species root partition (including the consensus root partition, if it exists). (4) The comparison of root candidates is done by comparing the distribution of their ADs in all gene trees in a pairwise test. (5) If several root partitions are similarly supported by ADs, these can be analyzed in the context of a root neighborhood, where weakly supported partitions are sequentially eliminated from the root partitions set. (6) The remaining root partitions comprise the species LCA confidence set.

Collecting AD values from a set of gene trees, we obtain a paired sample of support values. In figure 2, we present the joint distribution of AD values for the two most likely root partitions in the opisthokonta data set. A null hypothesis of equal support can be tested by the Wilcoxon signed-rank test, and rejection of the null hypothesis indicates that the root partition with smaller AD values is significantly better supported than the competitor (fig. 2b).

Fig. 2. Pairwise testing of competing root hypotheses in the opisthokonta data set. (a) The two most frequent root branches among the CSC gene trees (supplementary table S1a, Supplementary Material online). (b) CSC gene families, and (c) all gene families. Colormaps are the joint distribution of paired AD values. Smaller ADs indicate better support, whereby candidate 1 outcompetes candidate 2 above the diagonal and candidate 2 wins below the diagonal. P values are for the two-sided Wilcoxon signed-rank test used to compare paired branch AD values. Note the gain in power concomitant to larger sample size.

As in all statistical inferences, the power of the test ultimately depends on the sample size. Considering only CSC gene families often limits rooting analyses to a small minority of the available sequence data (e.g., table 1). Paired AD support values, however, can be extracted also from partial and multicopy gene families, resulting in much larger sample size and statistical power (fig. 2c).
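To make the pairwise test concrete, the sketch below implements a two-sided Wilcoxon signed-rank test on paired AD values using only the standard library. It relies on the large-sample normal approximation with a tie correction and without a continuity correction, which is an assumption of this sketch; the text does not state which variant of the test was used, and the AD values in the example are invented.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test, normal approximation.

    x, y: paired AD values of two candidate root partitions, one pair per
    gene tree. Returns (z, p); z < 0 means x tends to be smaller than y,
    that is, candidate x is the better supported one.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t                  # tie correction for the variance
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided tail probability
    return z, p

# Invented AD values: candidate 1 has the smaller AD in every gene tree.
ad_candidate_1 = [0.10, 0.20, 0.15, 0.05, 0.12, 0.08, 0.11, 0.09, 0.13, 0.07]
ad_candidate_2 = [a + 0.01 * (i + 1) for i, a in enumerate(ad_candidate_1)]
print(wilcoxon_signed_rank(ad_candidate_1, ad_candidate_2))
```

Because the P value shrinks as more informative trees are paired, the same effect size becomes easier to detect once partial and multicopy families are added to the sample.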

Rooting Support from Partial and Multicopy Gene Trees

To deal with non-CSC gene trees, we must decouple the notion of OTU split (i.e., tree branch) from the notion of species partition. In CSC gene trees, the correspondence between tree branches and root partitions is direct and one to one (fig. 3a). In trees of partial gene families, a single OTU split may correspond to several species partitions. In multicopy gene families, some tree branches do not correspond to any possible species partition.

Fig. 3. Correspondence of OTU splits and tested root partitions. (a) In CSC gene trees; (b) in PSC gene trees; and (c) in CMC gene trees. PMC gene trees entail both the b and c operations. OTU splits refer to branches in gene trees and are represented as black circles and white squares. Species partitions refer to possible branches in the hypothetical species tree, with unknown topology, and are represented as gray shades. In CSC gene trees, all branches (including internal and external) can be mapped to species partitions in a one-to-one manner (green arrows in a; note that only several splits are illustrated). For mapping branches from PSC gene trees (b) to species partitions, we remove from the species partitions the species missing in the gene tree and term the reduced version of the species partitions OTU partitions. In CMC gene trees (c), only branches that form species splits can be mapped onto species partitions. A species split in a CMC gene tree is a branch for which all gene copies of any one species are present on the same side of the split.

In order to find the branches in a partial gene tree that correspond to the tested root partitions, we reduce root partitions from species to OTUs by removing the species that are missing in the gene tree (Semple and Steel 2001). The root partitions are then assigned AD support by matching their reduced OTU version to the OTU splits of the gene tree (fig. 3b).
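The reduction step can be sketched as a set intersection followed by a dictionary lookup. In this sketch, a gene tree is assumed to be summarized as a mapping from each OTU split to its AD value; that layout, the function names, and the numeric values are illustrative assumptions only.

```python
def split_key(side_a, side_b):
    """Order-independent key for a two-sided split."""
    return frozenset({frozenset(side_a), frozenset(side_b)})

def reduce_partition(species_partition, present_species):
    """Restrict a species root partition to the species present in a gene tree.

    Returns None when one side becomes empty, in which case the partition
    cannot be observed in this tree.
    """
    present = frozenset(present_species)
    side_a, side_b = (frozenset(side) & present for side in species_partition)
    if not side_a or not side_b:
        return None
    return split_key(side_a, side_b)

def matched_ad(species_partition, present_species, split_ads):
    """AD of the gene tree branch matching the reduced partition, if any."""
    reduced = reduce_partition(species_partition, present_species)
    return None if reduced is None else split_ads.get(reduced)

# Toy partial single-copy tree (invented values): species B is missing.
candidate = ({"A", "B"}, {"C", "D", "E"})
tree_species = {"A", "C", "D", "E"}
tree_split_ads = {
    split_key({"A"}, {"C", "D", "E"}): 0.12,
    split_key({"A", "C"}, {"D", "E"}): 0.30,
}
print(matched_ad(candidate, tree_species, tree_split_ads))
```

In the toy tree, the candidate {A, B} | {C, D, E} reduces to {A} | {C, D, E}; any other candidate that reduces to the same OTU split would receive the same AD value from this tree.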

In multicopy gene trees, one or more species are represented multiple times as an OTU (Swenson and El-Mabrouk 2012). Each branch of a multicopy gene tree splits the OTUs into two groups, and the two groups may be mutually exclusive or overlapping in terms of species. We refer to mutually exclusive splits in multicopy gene trees as species splits, which can be mapped to specific root partitions. Overlapping splits, on the other hand, cannot correspond to any root partition (fig. 3c). Mapping of tree splits from partial multicopy (PMC) gene trees entails both operations: identification of species splits and reduction of root partitions. We note that the ability to gain information on root partitions from multicopy gene trees depends on the quality of gene family clustering with regard to the accuracy of orthology assessment. The presence of paralogs (especially ancient paralogs) in the set of multicopy gene trees will lead to a low proportion of splits that can be mapped to root partitions.
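Whether a branch of a multicopy gene tree is a species split can be checked by comparing the species found on its two sides. The sketch below assumes a hypothetical mapping `species_of` from gene copy (OTU) labels to species labels; all names and labels are invented for illustration.

```python
def species_split(otus_side_a, otus_side_b, species_of):
    """Map an OTU split of a multicopy gene tree to a species partition.

    Returns the species partition when no species has gene copies on both
    sides of the branch (a species split), and None for an overlapping split.
    """
    species_a = {species_of[otu] for otu in otus_side_a}
    species_b = {species_of[otu] for otu in otus_side_b}
    if species_a & species_b:
        return None                      # overlapping: no root partition applies
    return frozenset({frozenset(species_a), frozenset(species_b)})

# Toy family (invented labels): species A has two copies, a1 and a2.
species_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
print(species_split({"a1", "a2"}, {"b1", "c1"}, species_of))  # species split
print(species_split({"a1", "b1"}, {"a2", "c1"}, species_of))  # overlapping
```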

Candidate root partitions, or their reduced versions, may be absent from some gene trees and will be missing support values from these trees. We distinguish between two such cases: informative and uninformative missing values. A gene family is uninformative relative to a specific species root partition when its species composition includes species from only one side of the species partition. In such cases, the candidate root partition cannot be observed in any tree reconstructed from the gene family. We label the gene trees of such families as uninformative relative to the specific candidate root partition and exclude them from tests involving that partition (note that such a gene tree may still be used to test other candidate root partitions). In contrast, when a gene family includes species from both sides of a candidate species root partition but the gene tree lacks a corresponding branch, we label the gene tree as informative relative to the partition. This constitutes evidence against the candidate partition and should not be ignored in the ensuing tests. In such cases, we replace the missing support values by a pseudocount consisting of the maximal (i.e., worst) AD value in the gene tree. This assignment of a default worst-case support value also serves to enable the pairwise testing of incompatible root partitions, where no gene tree can include both partitions (Semple and Steel 2001).
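The three outcomes described above (observed support, informative missing value, uninformative tree) can be expressed as a single lookup. As in the text, the pseudocount is the maximal AD in the gene tree; the data layout (a mapping from OTU splits to AD values), the function name, and the numbers are assumptions made for this sketch.

```python
def support_value(species_partition, family_species, split_ads):
    """AD support of a candidate root partition in one single-copy gene tree.

    Returns None for an uninformative tree (family species fall on only one
    side of the partition), the matching branch AD when the reduced partition
    is a branch of the tree, and otherwise the worst AD in the tree as a
    pseudocount for an informative missing value.
    """
    present = frozenset(family_species)
    side_a, side_b = (frozenset(side) & present for side in species_partition)
    if not side_a or not side_b:
        return None                        # uninformative: exclude from tests
    key = frozenset({side_a, side_b})
    if key in split_ads:
        return split_ads[key]              # partition observed in the tree
    return max(split_ads.values())         # informative missing value

# Toy gene tree (invented values) for a family with species A, C, D, E.
ads = {
    frozenset({frozenset({"A"}), frozenset({"C", "D", "E"})}): 0.12,
    frozenset({frozenset({"A", "C"}), frozenset({"D", "E"})}): 0.30,
}
family = {"A", "C", "D", "E"}
print(support_value(({"A", "B"}, {"C", "D", "E"}), family, ads))  # observed
print(support_value(({"A", "D"}, {"B", "C", "E"}), family, ads))  # pseudocount
print(support_value(({"B"}, {"A", "C", "D", "E"}), family, ads))  # uninformative
```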

Complete gene families are always informative relative to any candidate root partitions. Partial gene families, however, may be uninformative for some root candidates. When testing two candidate root hypotheses against each other, the exclusion of uninformative partial gene trees thus leads to a reduction of sample size from the full complement of gene families. Furthermore, one branch of a partial gene tree may be identical to the reduced versions of two or more species root partitions, whereby the tree is informative relative to the several candidates yet their support values are tied.

Root Inference and Root Neighborhoods

The pairwise test is useful when the two competing root hypotheses are given a priori, as often happens in specific evolutionary controversies. More generally, however, one wishes to infer the species LCA, or root partition, with no prior hypotheses. In principle, the pairwise test may be carried out over all pairs of possible root partitions, while controlling for multiple testing. Such an exhaustive approach is practically limited to very small rooting problems, as the number of possible partitions grows exponentially with the number of species. A possible simplification is to restrict the analysis to test only pairs of root partitions from a pool of likely candidates. We propose that a reasonable pool of candidate root partitions can be constructed by collecting the set of root splits that are inferred as the root in any of the CSC gene trees.
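The exponential growth can be made explicit: n species can be divided into two nonempty, unordered groups in 2^(n − 1) − 1 ways. The short sketch below evaluates this count for the two species set sizes of table 1; the function name is an illustrative choice.

```python
def n_root_partitions(n_species):
    """Number of ways to split n species into two nonempty, unordered groups."""
    return 2 ** (n_species - 1) - 1

for n in (4, 31, 72):   # 31 and 72 are the species counts in table 1
    print(n, n_root_partitions(n))
```

Against counts of this magnitude, a candidate pool drawn from the roots observed in the CSC gene trees (25 partitions for the proteobacteria data set) keeps the number of pairwise tests manageable.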

When one species root partition is significantly better supported than any of the other candidates, the root is fully determined. Such is the result for the opisthokonta data set, for which the known root partition is the best candidate among all pairwise comparisons (supplementary table S2a, Supplementary Material online). In more difficult situations, the interpretation of all pairwise P values is not straightforward due to the absence of a unanimous best candidate root partition. This situation is exemplified with the CSC subset of the proteobacteria data set where no candidate has better support than all the alternative candidates (fig. 4 and supplementary table S2b, Supplementary Material online). The absence of a clear best candidate suggests the existence of a root neighborhood in the species set. Thus, a rigorous procedure for the inference of a confidence set for LCA is required.

Fig. 4. Cumulative distribution plots of AD in the proteobacteria data set. Left: cumulative distribution of unpaired AD values for the 25 candidate root partitions. Right: cumulative distribution of paired differences to candidate 1 (i.e., the most frequent candidate), whereby positive differences indicate better support for candidate 1 and negative values better support for candidate i (see supplementary table S1b, Supplementary Material online for candidate partition definitions). The results from gene family (i.e., tree) classes are stacked vertically. P values of the least significant among the contrasts to candidate 1 are shown in red (FDR adjusted for all 300 pairwise comparisons; details in supplementary table S2b, Supplementary Material online).

One-to-Many Root Support Test

To assess the support for root partitions in the full context of all other candidate root partitions, we modify the pairwise test to a test contrasting one root partition to a set of many alternatives. The one-to-many test consists of comparing the distribution of root support values for one focal partition to the extreme support values among all the other candidates and is inherently asymmetric. A “better than best” version takes the minimal (i.e., best) value among the AD values of the alternatives, while the “worse than worst” version considers the maximal (i.e., worst) among the alternatives’ ADs. As expected, the “better than best” variant is always less powerful than any of the pairwise tests and will not be considered further. The “worse than worst” variant, on the other hand, can be used to trim down a set of candidates while being more conservative than the pairwise tests. In the one-to-many test, each gene tree provides one AD value for the focal partition and one AD value for the worst among the alternative root partitions. The worst AD values are assigned while considering only partitions with nonmissing values (see above). Note that the worst alternative root partition may vary across gene trees. To maximize the sample size, that is, number of trees used for the statistical tests, gene trees including informative missing values for all root candidate partitions are included in the analysis with the largest (i.e., worst) AD found in the entire tree. We test for differences in the magnitude of paired AD values using the one-sided Wilcoxon signed-rank test, with the null hypothesis that the focal ADs are equal or smaller than the maximal ADs for the complementary set and the alternative hypothesis that the focal ADs are larger still than the maximum. A rejection of the null hypothesis is interpreted to mean that the focal root partition is significantly worse supported than the complementary set of candidates taken as a whole.
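The paired sample for the "worse than worst" variant can be assembled as sketched below. The sketch assumes a root support table in which informative missing values have already been replaced by the worst-AD pseudocount and `None` marks trees that are uninformative for a partition; this layout, the function name, and the values are illustrative assumptions.

```python
def worse_than_worst_pairs(support, focal, alternatives):
    """Build the paired sample for the one-to-many "worse than worst" test.

    support: dict mapping each candidate partition to a list of AD values,
    one per gene tree (None = tree uninformative for that candidate).
    Returns one (focal AD, maximal AD among alternatives) pair per usable tree.
    """
    pairs = []
    for tree_index, focal_ad in enumerate(support[focal]):
        alternative_ads = [support[alt][tree_index] for alt in alternatives
                           if support[alt][tree_index] is not None]
        if focal_ad is None or not alternative_ads:
            continue                       # nothing to compare in this tree
        pairs.append((focal_ad, max(alternative_ads)))
    return pairs

# Toy support table (invented values) for three candidates and three trees.
support = {
    "p1": [0.10, 0.30, None],
    "p2": [0.20, 0.10, 0.40],
    "p3": [None, 0.50, 0.20],
}
print(worse_than_worst_pairs(support, "p1", ["p2", "p3"]))
```

The resulting pairs would then enter a one-sided signed-rank test in which a focal candidate is rejected only if its ADs tend to exceed even the worst alternative in each tree.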

Inference of a Minimal Root Neighborhood

To infer a root neighborhood, that is, a confidence set of LCA hypotheses, we start with a reasonably constructed large set of n candidate partitions and reduce it by a stepwise elimination procedure. At each step, we employ the one-to-many test to contrast each of the remaining candidates to its complementary set. We control for false discovery rate (FDR; Benjamini and Hochberg 1995) due to multiple testing, and if at least one test is significant at the specified FDR level, the focal partition with the smallest P value (i.e., largest z-statistic) is removed from the set of candidates. The iterative process is stopped when none of the retained candidates is significantly less supported than the worst support for the other members of the set or when the set is reduced to a single root partition. To be conservative, we use a cumulative FDR procedure where at the first step, we control for n tests, in the next round for 2n − 1 tests, and, when not stopped earlier, for n*(n + 1)/2 − 1 tests at the last iteration (the cumulative sum n + (n − 1) + … + 2).
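The elimination loop can be sketched as follows. The one-to-many test is passed in as a hypothetical callable `p_value(focal, others)`. How the cumulative test count enters the FDR control is an assumption of this sketch: the Benjamini and Hochberg step-up thresholds are applied to the P values of the current round, with the number of tests set to the cumulative count over all rounds so far.

```python
def sequential_elimination(candidates, p_value, alpha=0.05):
    """Reduce a candidate set to a root neighborhood by stepwise elimination.

    candidates: initial candidate root partitions (any hashable labels).
    p_value:    callable (focal, others) -> P value of the one-to-many test
                (hypothetical interface for this sketch).
    Returns the retained candidates and a trace of
    (removed candidate, its P value, cumulative number of tests).
    """
    remaining = list(candidates)
    tests_so_far = 0
    trace = []
    while len(remaining) > 1:
        pvals = {c: p_value(c, [o for o in remaining if o != c])
                 for c in remaining}
        tests_so_far += len(remaining)          # n, then 2n - 1, and so on
        ranked = sorted(pvals.values())
        # Step-up criterion with the cumulative test count (an assumption).
        significant = any(p <= alpha * (rank + 1) / tests_so_far
                          for rank, p in enumerate(ranked))
        if not significant:
            break                               # nothing left to reject
        rejected = min(remaining, key=pvals.get)
        trace.append((rejected, pvals[rejected], tests_so_far))
        remaining.remove(rejected)
    return remaining, trace

# Toy run with fixed, invented P values per candidate.
toy_p = {"a": 1e-6, "b": 1e-4, "c": 0.5}
print(sequential_elimination(["a", "b", "c"], lambda focal, others: toy_p[focal]))
```

In the toy run with three candidates, the cumulative test count is 3 in the first round and 5 in the second, matching n and 2n − 1 for n = 3.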

We demonstrate the sequential elimination procedure for the proteobacteria data set in figure 5. The splits network reconstructed for the proteobacteria data set exemplifies the plurality of incongruent splits in the CSC gene trees and hence the dangers in assuming a single species tree. In this data set, the initial candidate set consisted of the 25 different root partitions found in the 50 CSC gene trees, and the elimination process terminated with a neighborhood of size 1, a species root partition separating the Epsilonproteobacteria species from the other proteobacteria classes. This LCA is indeed the most frequent one among the CSC gene trees but with a low frequency of only one in three gene trees. It is noteworthy that the order of elimination does not generally follow the frequency of partitions in the CSC set. For example, the last alternative to be rejected (number 19) was inferred as a root branch in just one tree where it is tied with two other branches, whereas the second and third most frequent CSC roots are rejected already at iterations 19 and 20.

The elimination order is determined by the P value of the one-to-many test, which in turn reflects both the effect size of worse support and the power of the test, where the latter is a function of sample size. Hence, candidate partitions for which a smaller number of gene trees are informative are more difficult to reject. In particular, the testing of an LCA hypothesis of a single basal species partitioned from the other species is limited to those gene families that include the basal species. The last two partitions rejected in figure 5 are indeed single species partitions, and the number of gene trees that are informative relative both to them and to the remaining candidates drops drastically in comparison with the earlier iterations. Yet, even at the last iteration, the number of gene trees that bear upon the conclusion is an order of magnitude larger than the number of CSC gene families.

The full complement of the proteobacteria data set consists of 9,686 gene families. The final conclusion—determination of a single LCA partition—is arrived at by extracting ancestor–descendant information from 86% of the gene families. The gene families that do not provide any evidence consist of 1,113 partial single-copy (PSC) gene families, mainly very small ones (e.g., due to recent gene origin), and 214 PMC families, mostly small families and some with abundant paralogs (e.g., due to gene duplication prior to the LCA).
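The 86% figure follows directly from the counts given above, as this short arithmetic check shows (the symbol N_used is introduced here for illustration only):

```latex
N_{\text{used}} = 9{,}686 - (1{,}113 + 214) = 8{,}359,
\qquad
\frac{N_{\text{used}}}{9{,}686} \approx 0.863
```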

In the current analysis, we used the AD measure to provide the strength of the root signal in gene tree branches. We note that the statistical framework can accommodate other measures that quantify the root signal for all gene tree branches. A fundamental element in our approach is the prior definition of a pool of candidate root partitions. We advocate deriving the initial set from roots inferred for CSC gene trees. A yet larger but manageable initial set may be constructed of splits frequently observed in the CSC gene trees. Importantly, the initial set need not be limited to observed partitions but can be augmented by a priori hypotheses informed by current phylogenetic and taxonomical precepts.

The inferred species root partition for the proteobacteria data set indicates that the proteobacteria LCA was more closely related to modern Epsilonproteobacteria than to the other classes; characteristics of present-day species in that group can therefore be used to hypothesize about the biology of the proteobacteria LCA. Epsilonproteobacteria species show versatile biochemical strategies to fix carbon, enabling members of this class to colonize extreme environments such as deep-sea hydrothermal vents (Campbell et al. 2006). Epsilonproteobacteria residing in deep-sea habitats are generally anaerobes, and many are characterized as chemolithoautotrophs (Takai et al. 2005). The possibility that the proteobacteria LCA had a chemolithoautotrophic and anaerobic life style is in line with the scenario of life's early phase as predicted by the hydrothermal-vent theory for the origin of life (Martin et al. 2008).

Discussion

From a purely theoretical perspective, the inference of the LCA for a group of species amounts to the reconstruction of just one branch—the root branch—of the true unrooted species tree and should therefore be a much easier task than the full resolution of the rooted species tree. Approaches that pose the LCA problem in terms of rooting of a resolved species tree require the solution of a much harder problem as a prerequisite for addressing the easier task. Methods where the input information passes through a bottleneck of a single inferred species tree have the disadvantage that the actual inference of the LCA is based on a sample of size one.

An alternative approach for rooting species trees has emerged from the use of “gene-tree–species-tree reconciliation” models (or simply tree reconciliations) (Szöllősi et al. 2012; Williams et al. 2017; Coleman et al. 2021). These models operate on topological differences between gene trees and a species tree, and the analyses attempt to bring the tree differences into agreement by invoking gene transfers, gene duplications, and gene losses, in any combination as needed. To discern among alternative reconciliation scenarios, the evolutionary rates for gene transfers, gene duplications, and gene losses need to be estimated from the data. Rate estimation is, however, challenging, and a previous study showed that evolutionary rates automatically estimated by a popular tree reconciliation model are often unrealistic and that the use of incorrect rates leads to incorrect root inferences (Bremer et al. 2022). Indeed, evolutionary rates as estimated by tree reconciliation analyses often contradict more conservative estimates obtained by independent studies (Treangen and Rocha 2011; Tria and Martin 2021), which may indicate possible biases within reconciliation models. By contrast, the rooting approach presented here does not rely on a priori estimates of rates of gene duplication and transfer, and as such, it offers a simpler solution to the species root problem. Furthermore, and contrary to tree reconciliations, our approach can perform rooting when the species tree is altogether unknown (albeit a candidate root partition is still necessary), which makes our tests especially appealing to rooting prokaryotic phylogenies for which species tree inferences are typically challenging.

Avoiding the reliance on a species tree prompts us to reevaluate what phylogenetic signal is directly relevant to LCA inference and to recast the task as that of sampling the total evidence from all gene families at the genomic scope. Moreover, dispensing with a single rooting operation of a single species tree facilitates the reformulation of LCA and root inference in the framework of statistical hypothesis testing. The analytical procedure we outline allows us to formally test competing a priori LCA hypotheses and to infer confidence sets for the earliest speciation events in the history of a group of species. Admittedly, our approach relies on the AD measure, which may be sensitive to large deviations from clock-like evolutionary rates. Nonetheless, previous studies suggested that the majority of protein families in prokaryotic genomes are characterized by clock-like evolutionary rates (Novichkov et al. 2004; Dagan et al. 2010). Additionally, we propose that the use of large gene tree samples assists in overcoming biases in the root inference that are due to conflicting signals in the data. Possible biases in the inference of the species root partition may arise due to methodological artifacts (e.g., tree reconstruction errors) and confounding evolutionary processes such as lateral gene transfer, gene duplication, and gene loss. Indeed, our root inferences were consistent across samples of different gene tree categories (CSC, complete multicopy [CMC], PSC, and PMC), each category bearing different types and degrees of conflicting signals. For instance, gene duplications are more frequent in multicopy gene trees (CMC and PMC), whereas gene losses are likely more frequent in partial trees (PSC and PMC). Phylogenomic inferences utilizing all gene trees for a species set are expected to increase the robustness of the root inference.

Our analyses of the demonstrative data sets show that different species sets present varying levels of LCA signal: The opisthokonta data set shows a strong root signal, whereas the proteobacteria data set has a moderate LCA signal. Data sets with weak signal are better described in terms of confidence sets for root partitions, which reflect the inherent uncertainties and avoid the pitfalls of forcing a single-hypothesis result.

The LCA inferences presented here utilized 43–86% of the total number of gene families for root partition inferences. This is in stark contrast to the 0.5–1.3% of the gene families that are CSC and can be utilized by traditional approaches. The number of gene families considered in our tests corresponds to the number of genes encoded in modern genomes, supplying “total evidence” for LCA inferences.

Data and Methods

Protein families for the opisthokonta and proteobacteria data sets were extracted from EggNOG version 4.5 (Huerta-Cepas et al. 2016). Protein families were filtered based on the number of species, gene copy number, number of OTUs, and sequence length, as follows. Protein families present in fewer than four species were discarded. Suspected outlier sequences were detected based on their length relative to the median length of the family: Sequences were removed if shorter than half or longer than twice the median. Species with more than ten copies of a gene were removed from the corresponding gene family. Multicopy gene families were discarded if the number of species was smaller than half the total number of OTUs (table 1).
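The filtering rules can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the code used in this study: the function name `filter_family`, the representation of a family as (species, sequence) pairs, the order in which the filters are applied, and the reading of "total number of OTUs" as the number of species in the data set are all assumptions.

```python
from collections import Counter
from statistics import median
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (species identifier, protein sequence)


def filter_family(records: List[Record], total_species: int,
                  min_species: int = 4, max_copies: int = 10) -> Optional[List[Record]]:
    """Return the filtered family, or None if the family is discarded."""
    if not records:
        return None
    # Remove length outliers: shorter than half or longer than twice the median.
    med = median(len(seq) for _, seq in records)
    kept = [(sp, seq) for sp, seq in records if med / 2 <= len(seq) <= 2 * med]
    # Remove species that carry more than max_copies copies of the gene.
    copies = Counter(sp for sp, _ in kept)
    kept = [(sp, seq) for sp, seq in kept if copies[sp] <= max_copies]
    copies = Counter(sp for sp, _ in kept)
    n_species = len(copies)
    # Discard families present in fewer than min_species species.
    if n_species < min_species:
        return None
    # Discard multicopy families covering fewer than half of all species
    # (assumed interpretation of "half the total number of OTUs").
    is_multicopy = any(count > 1 for count in copies.values())
    if is_multicopy and n_species < total_species / 2:
        return None
    return kept
```

Applying the length filter before counting species means that a species represented only by an outlier sequence does not count toward the four-species minimum; a different order of operations could give slightly different family sets.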

Protein sequences of the resulting protein families were aligned using MAFFT version 7.027b with L-INS-i alignment strategy (Katoh and Standley 2013). Phylogenetic trees were reconstructed using iqtree version 1.6.6 with the model selection parameters “-mset LG -madd LG4X” (Nguyen et al. 2015). The phylogenetic network (fig. 5) was reconstructed using SplitsTree4 version 4.14.6 (Huson and Bryant 2006). Branch ancestral deviation (AD) values and roots for the consensus analysis were inferred using mad.py version 2.21 (Tria et al. 2017).
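For orientation, the alignment and tree reconstruction steps could be scripted along the following lines. This is a hedged sketch, not the pipeline used in this study: the helper names and file names are hypothetical, `--localpair --maxiterate 1000` is the long-form MAFFT invocation of the L-INS-i strategy, and only the IQ-TREE options quoted above are passed in addition to the alignment file. The invocation of mad.py is omitted because its command-line options are not specified in the text.

```python
from typing import List


def mafft_linsi_cmd(fasta_path: str) -> List[str]:
    """Command line for an L-INS-i alignment; MAFFT writes the alignment to stdout."""
    return ["mafft", "--localpair", "--maxiterate", "1000", str(fasta_path)]


def iqtree_cmd(alignment_path: str) -> List[str]:
    """Command line for IQ-TREE with the model selection parameters quoted in the text."""
    return ["iqtree", "-s", str(alignment_path), "-mset", "LG", "-madd", "LG4X"]


# Hypothetical file names, used only to show the resulting argument lists.
align_cmd = mafft_linsi_cmd("family_0001.faa")
tree_cmd = iqtree_cmd("family_0001.aln")
```

The argument lists can be passed to `subprocess.run`, with the standard output of the MAFFT call redirected to the alignment file.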

Acknowledgments

We thank Maxime Godfroid for fruitful discussions. The study was supported by CAPES (Coordination for the Improvement of Higher Education Personnel–Brazil) (awarded to F.D.K.T.) and the European Research Council (Grant No. 281357 awarded to T.D.). D.R.P. is supported by the DFG-funded CRC1182 “Origin and Function of Metaorganisms.”

Contributor Information

Fernando D K Tria, Institute of General Microbiology, Kiel University, Kiel, Germany.

Giddy Landan, Institute of General Microbiology, Kiel University, Kiel, Germany.

Devani Romero Picazo, Institute of General Microbiology, Kiel University, Kiel, Germany.

Tal Dagan, Institute of General Microbiology, Kiel University, Kiel, Germany.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

Data Availability

The source code to run the phylogenomic rooting as well as the unrooted trees with AD values used in this study are found in the following repository: https://github.com/deropi/PhyloRooting.git. Additionally, R code to replicate some of the figures in this paper is also provided.

Literature Cited

  1. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 57:289–300.
  2. Bettisworth B, Stamatakis A. 2021. Root digger: a root placement program for phylogenetic trees. BMC Bioinform. 22:225.
  3. Bremer N, Knopp M, Martin WF, Tria FDK. 2022. Realistic gene transfer to gene duplication ratios identify different roots in the bacterial phylogeny using a tree reconciliation method. Life 12:995.
  4. Campbell BJ, Engel AS, Porter ML, Takai K. 2006. The versatile ε-proteobacteria: key players in sulphidic habitats. Nat Rev Microbiol. 4:458–468.
  5. Cherlin S, Heaps SE, Nye TMW, Boys RJ, Williams TA, Embley TM. 2018. The effect of nonreversibility on inferring rooted phylogenies. Mol Biol Evol. 35:984–1002.
  6. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. 2006. Toward automatic reconstruction of a highly resolved tree of life. Science 311:1283–1287.
  7. Coleman GA, Davín AA, Mahendrarajah TA, Szánthó LL, Spang A, Hugenholtz P, Szöllősi GJ, Williams TA. 2021. A rooted phylogeny resolves early bacterial evolution. Science 372:eabe0511.
  8. Dagan T, Martin W. 2006. The tree of one percent. Genome Biol. 7:118.
  9. Dagan T, Roettger M, Bryant D, Martin W. 2010. Genome networks root the tree of life between prokaryotic domains. Genome Biol Evol. 2:379–392.
  10. Dagan T, Roettger M, Stucken K, Landan G, Koch R, Major P, Gould SB, Goremykin VV, Rippka R, Tandeau de Marsac N, et al. 2013. Genomes of Stigonematalean cyanobacteria (subsection V) and the evolution of oxygenic photosynthesis from prokaryotes to plastids. Genome Biol Evol. 5:31–44.
  11. Doolittle WF, Bapteste E. 2007. Pattern pluralism and the tree of life hypothesis. Proc Natl Acad Sci U S A. 104:2043–2049.
  12. Eisen JA. 2003. Phylogenomics: intersection of evolution and genomics. Science 300:1706–1707.
  13. Farris JS. 1972. Estimating phylogenetic trees from distance matrices. Am Nat. 106:645–668.
  14. Fitch WM, Margoliash E. 1967. Construction of phylogenetic trees. Science 155:279–284.
  15. Fox GE, Stackebrandt E, Hespell RB, Gibson J, Maniloff J, Dyer TA, Wolfe RS, Balch WE, Tanner RS, Magrum LJ, et al. 1980. The phylogeny of prokaryotes. Science 209:457–463.
  16. Gogarten JP, Kibak H, Dittrich P, Taiz L, Bowman EJ, Bowman BJ, Manolson MF, Poole RJ, Date T, Oshima T, et al. 1989. Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes. Proc Natl Acad Sci U S A. 86:6661–6665.
  17. Hammerschmidt K, Landan G, Domingues Kümmel Tria F, Alcorta J, Dagan T. 2021. The order of trait emergence in the evolution of cyanobacterial multicellularity. Genome Biol Evol. 13:evaa249.
  18. Huelsenbeck JP, Bollback JP, Levine AM. 2002. Inferring the root of a phylogenetic tree. Syst Biol. 51:32–43.
  19. Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, Rattei T, Mende DR, Sunagawa S, et al. 2016. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44:D286–D293.
  20. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, Butterfield CN, Hernsdorf AW, Amano Y, Ise K, et al. 2016. A new view of the tree of life. Nat Microbiol. 1:1–6.
  21. Huson DH, Bryant D. 2006. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 23:254–267.
  22. Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T. 1989. Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc Natl Acad Sci U S A. 86:9355–9359.
  23. Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 30:772–780.
  24. Katz LA. 2012. Origin and diversification of eukaryotes. Annu Rev Microbiol. 66:411–427.
  25. Kluge AG, Farris JS. 1969. Quantitative phyletics and the evolution of anurans. Syst Biol. 18:1–32.
  26. Lang JM, Darling AE, Eisen JA. 2013. Phylogeny of bacterial and archaeal genomes using conserved genes: supertrees and supermatrices. PLoS One 8:e62510.
  27. Lepage T, Bryant D, Philippe H, Lartillot N. 2007. A general comparison of relaxed molecular clock models. Mol Biol Evol. 24:2669–2680.
  28. Linz S, Radtke A, von Haeseler A. 2007. A likelihood framework to measure horizontal gene transfer. Mol Biol Evol. 24:1312–1319.
  29. Lovejoy CO, Suwa G, Simpson SW, Matternes JH, White TD. 2009. The great divides: Ardipithecus ramidus reveals the postcrania of our last common ancestors with African apes. Science 326:100–106.
  30. Mai U, Sayyari E, Mirarab S. 2017. Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction. PLoS One 12:e0182238.
  31. Martin W, Baross J, Kelley D, Russell MJ. 2008. Hydrothermal vents and the origin of life. Nat Rev Microbiol. 6:805–814.
  32. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. 2005. The microbial pan-genome. Curr Opin Genet Dev. 15:589–594.
  33. Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. 2022. SpeciesRax: a tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss. Mol Biol Evol. 39:msab365.
  34. Naser-Khdour S, Quang Minh B, Lanfear R. 2022. Assessing confidence in root placement on phylogenies: an empirical study using nonreversible models for mammals. Syst Biol. 71:959–972.
  35. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 32:268–274.
  36. Novichkov PS, Omelchenko MV, Gelfand MS, Mironov AA, Wolf YI, Koonin EV. 2004. Genome-wide molecular clock and horizontal gene transfer in bacterial evolution. J Bacteriol. 186:6575–6585.
  37. Okamoto E, Kusakabe R, Kuraku S, Hyodo S, Robert-Moreno A, Onimaru K, Sharpe J, Kuratani S, Tanaka M. 2017. Migratory appendicular muscles precursor cells in the common ancestor to all vertebrates. Nat Ecol Evol. 1:1731–1736.
  38. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, Hugenholtz P. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 36:996–1004.
  39. Pisani D, Cotton JA, McInerney JO. 2007. Supertrees disentangle the chimerical origin of eukaryotic genomes. Mol Biol Evol. 24:1752–1760.
  40. Semple C, Steel M. 2001. Tree reconstruction via a closure operation on partial splits. In: Gascuel O, Sagot M-F, editors. Computational Biology. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. p. 126–134.
  41. Smith ML, Hahn MW. 2021. New approaches for inferring phylogenies in the presence of paralogs. Trends Genet. 37:174–187.
  42. Stechmann A, Cavalier-Smith T. 2002. Rooting the eukaryote tree by using a derived gene fusion. Science 297:89–91.
  43. Swenson KM, El-Mabrouk N. 2012. Gene trees and species trees: irreconcilable differences. BMC Bioinform. 13:S15.
  44. Szöllősi GJ, Boussau B, Abby SS, Tannier E, Daubin V. 2012. Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations. Proc Natl Acad Sci U S A. 109:17513–17518.
  45. Szöllősi GJ, Tannier E, Daubin V, Boussau B. 2015. The inference of gene trees with species trees. Syst Biol. 64:e42–e62.
  46. Takai K, Campbell BJ, Cary SC, Suzuki M, Oida H, Nunoura T, Hirayama H, Nakagawa S, Suzuki Y, Inagaki F, et al. 2005. Enzymatic and genetic characterization of carbon and energy metabolisms by deep-sea hydrothermal chemolithoautotrophic isolates of Epsilonproteobacteria. Appl Environ Microbiol. 71:7310–7320.
  47. Treangen TJ, Rocha EPC. 2011. Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genet. 7:e1001284.
  48. Tria FDK, Landan G, Dagan T. 2017. Phylogenetic rooting using minimal ancestor deviation. Nat Ecol Evol. 1:0193.
  49. Tria FDK, Martin WF. 2021. Gene duplications are at least 50 times less frequent than gene transfers in prokaryotic genomes. Genome Biol Evol. 13:evab224.
  50. Waite DW, Vanwonterghem I, Rinke C, Parks DH, Zhang Y, Takai K, Sievert SM, Simon J, Campbell BJ, Hanson TE, et al. 2017. Comparative genomic analysis of the class Epsilonproteobacteria and proposed reclassification to Epsilonbacteraeota (phyl. nov.). Front Microbiol. 8:682.
  51. Weiss MC, Sousa FL, Mrnjavac N, Neukirchen S, Roettger M, Nelson-Sathi S, Martin WF. 2016. The physiology and habitat of the last universal common ancestor. Nat Microbiol. 1:16116.
  52. Whidden C, Zeh N, Beiko RG. 2014. Supertrees based on the subtree prune-and-regraft distance. Syst Biol. 63:566–581.
  53. Williams TA, Cox CJ, Foster PG, Szöllősi GJ, Embley TM. 2020. Phylogenomics provides robust support for a two-domains tree of life. Nat Ecol Evol. 4:138–147.
  54. Williams TA, Heaps SE, Cherlin S, Nye TMW, Boys RJ, Embley TM. 2015. New substitution models for rooting phylogenetic trees. Philos Trans R Soc B Biol Sci. 370:20140336.
  55. Williams TA, Szöllősi GJ, Spang A, Foster PG, Heaps SE, Boussau B, Ettema TJG, Embley TM. 2017. Integrative modeling of gene and genome evolution roots the archaeal tree of life. Proc Natl Acad Sci U S A. 114:E4602–E4611.
  56. Zhu Q, Mai U, Pfeiffer W, Janssen S, Asnicar F, Sanders JG, Belda-Ferre P, Al-Ghalith GA, Kopylova E, McDonald D, et al. 2019. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat Commun. 10:5477.


Articles from Genome Biology and Evolution are provided here courtesy of Oxford University Press
