Journal of Computational Biology. 2015 Dec 1;22(12):1129–1142. doi: 10.1089/cmb.2015.0146

Consequences of Common Topological Rearrangements for Partition Trees in Phylogenomic Inference

Olga Chernomor, Bui Quang Minh, Arndt von Haeseler
PMCID: PMC4663649  PMID: 26448206

Abstract

In phylogenomic analysis, collections of trees with identical score (maximum likelihood or parsimony score) may hamper tree search algorithms. Such collections are called phylogenetic terraces. For sparse supermatrices with a lot of missing data, the number of terraces and the number of trees on the terraces can be very large. If terraces are not taken into account, a lot of computation time might be spent unnecessarily on evaluating many trees that in fact have identical scores. To save computation time during the tree search, it is worthwhile to quickly identify such cases. The score of a species tree is the sum of the scores of all the so-called induced partition trees. Therefore, if a topological rearrangement applied to a species tree does not change the induced partition trees, the scores of these partition trees are unchanged. Here, we provide the conditions under which the three most widely used topological rearrangements (nearest neighbor interchange, subtree pruning and regrafting, and tree bisection and reconnection) change the topologies of induced partition trees. During the tree search, these conditions allow us to quickly identify whether we can save computation time on the evaluation of newly encountered trees. We also introduce the concept of partial terraces and demonstrate that they occur more frequently than the original “full” terraces. Hence, partial terraces contribute more to potential time savings than full terraces. Therefore, taking into account the above conditions and the partial terrace concept will help to speed up the tree search in phylogenomic inference.

Key words: nearest neighbor interchange, partial terraces, phylogenetic terraces, subtree pruning and regrafting, tree bisection and reconnection

1. Introduction

In phylogenomics, one aims to reconstruct a phylogenetic species tree from multiple genes. One popular approach is to infer the tree from the concatenated gene alignment, the so-called supermatrix (Sanderson et al., 1998; De Queiroz and Gatesy, 2007). Here, if a gene sequence is not available for some taxon, it is represented by a sequence of unknown characters and is referred to as missing data. Several studies (van der Linde et al., 2010; Pyron and Wiens, 2011; Pyron et al., 2011; Nyakatura and Bininda-Emonds, 2012; Springer et al., 2012; Hedtke et al., 2013) use quite sparse supermatrices, and the proportion of missing data can reach 95% (Peters et al., 2011).

Recently, it has been shown that missing data can hamper the tree search through the existence of phylogenetic terraces (Sanderson et al., 2011), collections of trees with exactly the same likelihood or parsimony score. Terraces occur in analyses with partitioned data, that is, when distinct blocks of a supermatrix are treated differently (e.g., when each gene corresponding to one block evolves under its own evolutionary model). Two trees are said to belong to one terrace if the collections of their induced partition trees are exactly the same. Here, the induced partition tree is obtained by pruning from the species tree all taxa that have no sequence for the corresponding partition block.

Since the number of trees on one terrace can be quite large (Sanderson et al., 2011), accounting for terraces in tree search algorithms can potentially save a lot of computation time. During the tree search, one explores the tree space by moving from one candidate tree to another by means of topological rearrangements. If a topological rearrangement does not change any of the induced partition trees, then the two trees belong to the same terrace, and the objective function (maximum likelihood or maximum parsimony) used in the tree search need not be recomputed to evaluate the new tree.

Here, we first specify the conditions under which the topological rearrangements applied to the species tree change the corresponding induced partition trees. Using these conditions, one can quickly identify whether it is necessary to recompute the objective function for a given partition or not as a consequence of one of the three widely used rearrangements: nearest neighbor interchange (NNI), subtree pruning and regrafting (SPR) and tree bisection and reconnection (TBR) (Felsenstein, 2004).

We further generalize the concept of terrace to partial terrace, which is even more useful in practical phylogenetic analysis. We analyze several published alignments by examining NNI neighborhoods of random trees and of trees encountered during the tree search using IQ-TREE (Nguyen et al., 2015). We show that for a large number of taxa, partial terraces are mainly determined by the missing data and are less dependent on the actual tree topology analyzed. By taking into account partial terraces, it will be possible to speed up tree search algorithms even in the absence of terraces.

The outline of the article is the following. We first introduce the notations and then discuss the important features of NNI, SPR, and TBR. Next, we specify the conditions when these topological rearrangements do not change the topology of induced partition trees. We further elucidate why such conditions are helpful even in the absence of terraces and define the concept of partial terrace. We analyze several published alignments to point out that partial terraces do occur in practice. Finally, we discuss the additional practical advantages of using induced partition trees in the maximum likelihood framework.

2. Background

2.1. Basic definitions and notations

In this section we provide basic definitions and notations used throughout the article. For a complete overview, see chapters 2, 3, and 6 in Semple and Steel (2003).

Definition 2.1. Let X be a taxon set. A phylogenetic tree T of X is a leaf-labeled tree with a bijection map from X into the set of leaves of T.

In the following, we work only with bifurcating phylogenetic trees; that is, all internal nodes have exactly three adjacent edges.

Definition 2.2. A split, denoted by A|B, is a bipartition of X into two nonempty, nonoverlapping sets A and B, where A ∪ B = X.

Note that A|B and B|A are equivalent. Every edge of T is associated with a split. When cutting an edge e of T, we obtain two subtrees with leaf labels X1 and X2, and the split corresponding to e is then defined as X1|X2. We denote this by e = X1|X2.

We denote by Σ(T) a collection of all splits corresponding to edges of T.

The symmetric difference of two sets A and B, denoted AΔB, is given by (A\B) ∪ (B\A), or the union of taxa present in A but not B, and vice versa.

Definition 2.3. Let T1 and T2 be two leaf-labeled trees with the same label set X, and let Σ(T1) and Σ(T2) be the collections of splits of T1 and T2, respectively. Then the Robinson–Foulds (RF) distance (Robinson and Foulds, 1981) between T1 and T2 is equal to |Σ(T1) Δ Σ(T2)|.

If the RF distance between two trees is 0, then they have the same collection of splits, and by the splits-equivalence theorem (Semple and Steel, 2003; p. 43), the trees are equivalent.

Definition 2.4. Let Y be a subset of X. An induced subtree of T, denoted by T|Y, is a leaf-labeled tree with the following collection of splits:

$$\Sigma(T|Y) = \{\, A \cap Y \mid B \cap Y \;:\; A|B \in \Sigma(T),\ A \cap Y \neq \emptyset,\ B \cap Y \neq \emptyset \,\}.$$

For a species tree T and a given partition with taxon set Y, a partition tree is an induced subtree T|Y.
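As an illustration of Definitions 2.2–2.4, the following Python sketch computes Σ(T), the induced splits Σ(T|Y), and the RF distance for small trees. The nested-tuple tree encoding, the function names (`leaf_set`, `splits`, `induced_splits`, `rf_distance`), and the five-taxon example are assumptions made for this illustration; they are not taken from the article or from any phylogenetic software.

```python
def leaf_set(tree):
    """Leaf labels of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Sigma(T): one split A|B per edge, stored as an unordered pair of sets."""
    taxa = leaf_set(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        below = leaf_set(node)
        if below != taxa:  # the root of the encoding is not an edge
            found.add(frozenset([below, taxa - below]))
        if not isinstance(node, str):
            stack.extend(node)
    return found


def induced_splits(split_set, subset):
    """Sigma(T|Y): restrict each split to Y, dropping splits with an empty side."""
    subset = frozenset(subset)
    induced = set()
    for side_a, side_b in map(tuple, split_set):
        a_y, b_y = side_a & subset, side_b & subset
        if a_y and b_y:
            induced.add(frozenset([a_y, b_y]))
    return induced


def rf_distance(splits_1, splits_2):
    """Robinson-Foulds distance: size of the symmetric difference of split sets."""
    return len(splits_1 ^ splits_2)


# Hypothetical five-taxon species tree and one NNI neighbour (b and c exchanged).
T = ((("a", "b"), "c"), ("d", "e"))
T_NNI = ((("a", "c"), "b"), ("d", "e"))
Y = {"a", "b", "d", "e"}  # a partition with no sequence for taxon c

print(rf_distance(splits(T), splits(T_NNI)))  # 2
print(rf_distance(induced_splits(splits(T), Y),
                  induced_splits(splits(T_NNI), Y)))  # 0
```

In this toy case the two species trees differ, yet their partition trees for Y coincide, which is the situation that the terrace concept exploits.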

2.2. Topological rearrangement operations

In this section we introduce the topological rearrangements on trees commonly used in phylogenetic inference.

The simplest possible operation that changes only one split on a tree is an NNI. It can only be applied to interior edges of the tree, since it requires the so-called quartet structure with an interior edge being the central edge of this structure (Fig. 1).

FIG. 1.

Visualization of NNI. Species tree T and the two NNIs around central edge e. NNI1 is obtained by exchanging subtrees below edges e1 and e3, while NNI2 by exchanging subtrees e1 and e4. NNI, nearest neighbor interchange.

Let e be an interior edge of T and e1, e2, e3, e4 its four incident edges with A, B, C, D being the taxon sets leading from them, respectively (Fig. 1). An NNI on T around e is obtained by exchanging the subtrees below two nonincident edges from e1, e2, e3, e4. We denote a new tree by TNNI.

For each interior edge e there are two possible NNIs obtained by exchanging a subtree below e1 with a subtree below either e3 or e4 (note that this is equivalent to swapping the subtree below e2 with either e4 or e3, respectively).

Let us assume that the NNI is applied to edge e by swapping e1 and e3. The splits corresponding to e1, e2, e3, and e4 stay unchanged:

$$e_1 = A \mid B \cup C \cup D, \quad e_2 = B \mid A \cup C \cup D, \quad e_3 = C \mid A \cup B \cup D, \quad e_4 = D \mid A \cup B \cup C.$$

This also holds true for the edges belonging to the subtrees below e1, e2, e3, and e4 (Fig. 1). Here, if e1 = A|B ∪ C ∪ D, the subtree below e1 is the subtree with leaf set A and not the one containing the union of the other sets. Hence, the splits corresponding to e1, e2, e3, e4 and to the edges below them will be shared by T and TNNI.

In terms of splits, the central edge e will be changed by the NNI from e = A ∪ B|C ∪ D to eNNI = A ∪ D|B ∪ C.

It follows from above that T and TNNI are different only in one split; that is,

$$\Sigma(T) \,\Delta\, \Sigma(T^{NNI}) = \{\, A \cup B \mid C \cup D,\ \ A \cup D \mid B \cup C \,\},$$

and the RF distance between T and TNNI is 2.

We now discuss SPR, a more general topological rearrangement that changes one or more splits of the tree.

An SPR on T is represented in Figure 2 (see also Hordijk and Gascuel, 2005). A new tree TSPR is obtained from T by pruning the subtree below edge a and regrafting it onto edge bn (we sometimes refer to such SPR as n-SPR). Note, that n is at least 3 and if n = 3, an SPR is equivalent to an NNI obtained by swapping subtrees belonging to edges a and b2. Let A, B1, …, Bn denote the corresponding taxon sets leading from a, b1, …, bn, respectively (Fig. 2).

FIG. 2.

Visualization of SPR. A new tree TSPR is obtained by pruning the subtree A below edge a and regrafting it onto edge bn (dashed red subtree). After SPR is applied, edges b1 and e1 are joined and edge bn is split into en−1 and bn. SPR, subtree pruning and regrafting.

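An SPR on T changes only the splits of the path edges. Namely, for 1 ≤ x ≤ n − 2,

$$e_x = A \cup B_1 \cup \dots \cup B_x \mid B_{x+1} \cup \dots \cup B_n$$

is changed to

$$e_x^{SPR} = B_1 \cup \dots \cup B_x \mid A \cup B_{x+1} \cup \dots \cup B_n,$$

where exSPR is the edge that corresponds to ex on the new tree TSPR (for x = 1 this split coincides with that of b1, since b1 and e1 are joined). Also, a new edge appears: en−1 = B1 ∪ … ∪ Bn−1|A ∪ Bn. The rest of the splits remain unchanged and are shared by both trees. Hence, for T and TSPR the symmetric difference Σ(T) Δ Σ(TSPR) consists of the following splits:

$$
\begin{aligned}
&A \cup B_1 \cup \dots \cup B_x \mid B_{x+1} \cup \dots \cup B_n, && 1 \le x \le n-2,\\
&B_1 \cup \dots \cup B_x \mid A \cup B_{x+1} \cup \dots \cup B_n, && 2 \le x \le n-1.
\end{aligned}
$$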

The RF distance between T and TSPR is equal to 2 (n − 2).

The last topological rearrangement we are going to discuss is the TBR. A TBR on T is shown in Figure 3, where a new tree TTBR is obtained from T (Fig. 3, in black) by cutting edge e and reconnecting edges bn and cm with a new edge eTBR (Fig. 3, red dashed line). Note that n or m must be greater than 2. W.l.o.g. assume that m ≤ n. If n = 3 and m = 2, then a TBR corresponds to an NNI around edge e1 by swapping subtrees below e and b2. If n > 3 and m = 2, then a TBR corresponds to an SPR.

FIG. 3.

Visualization of TBR. To obtain TTBR, species tree T is cut into two parts (by removing edge e), which are further reconnected by joining edges bn and cm with eTBR. Edge bn is split into en−1 and bn, while cm is split into zm−1 and cm. Edges b1 and e1 are joined, as well as c1 and z1. TBR, tree bisection and reconnection.

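TBR changes only the splits corresponding to the path edges (ei and zj), but not that of e. Namely,

$$e = e^{TBR} = B_1 \cup \dots \cup B_n \mid C_1 \cup \dots \cup C_m,$$

while for 1 ≤ i ≤ n − 2,

$$e_i = C_1 \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_i \mid B_{i+1} \cup \dots \cup B_n$$

is changed to

$$e_i^{TBR} = B_1 \cup \dots \cup B_i \mid B_{i+1} \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_m,$$

and for 1 ≤ j ≤ m − 2,

$$z_j = B_1 \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_j \mid C_{j+1} \cup \dots \cup C_m$$

is changed to

$$z_j^{TBR} = C_1 \cup \dots \cup C_j \mid C_{j+1} \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_n.$$

Also two new edges appear:

$$e_{n-1} = B_1 \cup \dots \cup B_{n-1} \mid B_n \cup C_1 \cup \dots \cup C_m, \qquad z_{m-1} = C_1 \cup \dots \cup C_{m-1} \mid C_m \cup B_1 \cup \dots \cup B_n.$$

The remaining splits stay unchanged (for i = 1 and j = 1 the changed splits coincide with those of b1 and c1, respectively). Hence, for T and TTBR the symmetric difference Σ(T) Δ Σ(TTBR) is the set consisting of the following splits:

$$
\begin{aligned}
&C_1 \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_i \mid B_{i+1} \cup \dots \cup B_n, && 1 \le i \le n-2,\\
&B_1 \cup \dots \cup B_i \mid B_{i+1} \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_m, && 2 \le i \le n-1,\\
&B_1 \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_j \mid C_{j+1} \cup \dots \cup C_m, && 1 \le j \le m-2,\\
&C_1 \cup \dots \cup C_j \mid C_{j+1} \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_n, && 2 \le j \le m-1.
\end{aligned}
$$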

Therefore, the RF distance between T and TTBR is 2 (n + m − 4).

3. Consequences of Topological Rearrangements Applied to a Species Tree

In the following we discuss how the topological rearrangement of the species tree T influences the topology of the partition trees and start with the simplest operation, an NNI.

Proposition 1. Let e be an interior edge and e1, e2, e3, e4 the four edges adjacent to e with A, B, C, D being the taxon sets leading from the corresponding edges (Fig. 1). Let a new tree TNNI be obtained from T via NNI. For a partition with a taxon set Y, the topologies of T|Y and TNNI|Y are different iff Y has at least one representative taxon in each subset A, B, C, D.

Proof.

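W.l.o.g. assume that TNNI is obtained from T by swapping the subtrees below e1 and e3. Then Σ(T) Δ Σ(TNNI) = {A ∪ B|C ∪ D, A ∪ D|B ∪ C} and, as a consequence, for the corresponding partition trees we have

$$\Sigma(T|Y) \,\Delta\, \Sigma(T^{NNI}|Y) \subseteq \{\, (A \cup B) \cap Y \mid (C \cup D) \cap Y,\ \ (A \cup D) \cap Y \mid (B \cup C) \cap Y \,\}.$$

It is easy to show that if at least one of the sets A ∩ Y, B ∩ Y, C ∩ Y, D ∩ Y were empty, then both splits (A ∪ B) ∩ Y|(C ∪ D) ∩ Y and (A ∪ D) ∩ Y|(B ∪ C) ∩ Y would coincide with splits shared by T|Y and TNNI|Y (e.g., see Fig. 4). Hence, Σ(T|Y) Δ Σ(TNNI|Y) = ∅ and the RF distance between these trees would be 0. Therefore, for T|Y and TNNI|Y to have different topologies, all of A ∩ Y, B ∩ Y, C ∩ Y, D ∩ Y must be nonempty, meaning that Y has to have at least one representative in each subset A, B, C, D. Conversely, if all four intersections are nonempty, the two splits are incompatible, so each of them belongs to exactly one of the two partition trees and the RF distance is 2.   ■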

FIG. 4.

An example when an NNI on T does not change the topology of T|Y. Solid lines correspond to two induced partition trees before (T|Y) and after (TNNI|Y) the NNI was applied to edge e on T by swapping the subtrees below e1 and e3 (Fig. 1). In this case, Y does not have a representative in A (i.e., A ∩ Y = ∅); therefore, (A ∪ B)∩Y|(C∪D)∩Y =B∩Y|(C∪D)∩Y and (A∪D)∩Y|(B∪C)∩Y = D∩Y|(B∪C)∩Y. Since the splits B∩Y|(C∪D)∩Y and D∩Y|(B∪C)∩Y are shared by T|Y and TNNI|Y, then Σ(T|Y) Δ Σ(TNNI|Y) = ∅ and RF distance between T|Y and TNNI|Y is 0.

In simple words, if some intersections of A, B, C, D with Y are empty, then a partition tree does not have a corresponding quartet structure for the NNI to be applied to and edge e loses its centrality or interior feature (see, e.g., Fig. 4). When this happens, the topology of the partition tree T|Y is not affected by the NNI applied to e on the species tree T.
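The condition of Proposition 1 reduces to four set intersections per partition. The sketch below shows one way to filter the partitions whose induced trees are altered by an NNI; the function names and the example taxon sets are hypothetical and serve only as an illustration.

```python
def nni_changes_partition_tree(a, b, c, d, y):
    """Proposition 1: T|Y changes iff Y meets each of A, B, C, and D."""
    y = set(y)
    return all(y & set(block) for block in (a, b, c, d))


def partitions_to_reevaluate(a, b, c, d, partitions):
    """Names of the partitions whose induced tree is altered by the NNI."""
    return [name for name, taxa in partitions.items()
            if nni_changes_partition_tree(a, b, c, d, taxa)]


# Hypothetical taxon sets around the central edge e and two gene partitions.
A, B, C, D = {"t1", "t2"}, {"t3"}, {"t4"}, {"t5", "t6"}
partitions = {
    "gene1": {"t1", "t3", "t4", "t6"},  # covers A, B, C, and D
    "gene2": {"t1", "t2", "t3", "t5"},  # no taxon from C
}
print(partitions_to_reevaluate(A, B, C, D, partitions))  # ['gene1']
```

Only gene1 would need a new evaluation here; the partition tree of gene2 is the same before and after the NNI.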

We next specify the condition when an SPR changes the topology of partition tree.

Proposition 2. Let tree T be in the form shown in Figure 2, and let a new tree TSPR be obtained by an SPR that prunes the subtree below edge a and regrafts it onto bn.

Then for a partition with a taxon set Y the following is true:

(i) the topologies of T|Y and TSPR|Y are different, if Y has at least one representative in A and in at least another three subsets from B1, B2, …, Bn;

(ii) this SPR will correspond to an SPR on T|Y obtained by pruning the subtree below the edge with split A ∩ Y|(B1 ∪ … ∪ Bn) ∩ Y and regrafting it onto the edge with split Bk ∩ Y|(A ∪ ⋃i∈{1,…,n}\{k} Bi) ∩ Y, where k = max1≤i≤n{i | Bi ∩ Y ≠ ∅}.

Proof.

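(i) The symmetric difference Σ(T) Δ Σ(TSPR) consists of the following splits:

$$
\begin{aligned}
&A \cup B_1 \cup \dots \cup B_x \mid B_{x+1} \cup \dots \cup B_n, && 1 \le x \le n-2,\\
&B_1 \cup \dots \cup B_x \mid A \cup B_{x+1} \cup \dots \cup B_n, && 2 \le x \le n-1.
\end{aligned}
$$

As a consequence, for the induced partition trees T|Y and TSPR|Y, the symmetric difference of Σ(T|Y) and Σ(TSPR|Y) consists of

$$
\begin{aligned}
&(A \cup B_1 \cup \dots \cup B_x) \cap Y \mid (B_{x+1} \cup \dots \cup B_n) \cap Y, && 1 \le x \le n-2,\\
&(B_1 \cup \dots \cup B_x) \cap Y \mid (A \cup B_{x+1} \cup \dots \cup B_n) \cap Y, && 2 \le x \le n-1.
\end{aligned}
$$

It is easy to see that if A ∩ Y = ∅, then all these splits would be shared by both partition trees, that is, Σ(T|Y) Δ Σ(TSPR|Y) = ∅ and the RF distance between T|Y and TSPR|Y would be 0. Therefore, Y must have at least one representative in A.

For T|Y and TSPR|Y to have different topologies, an SPR on T should correspond to at least an NNI on T|Y. Hence, T|Y must have a corresponding quartet structure, and together with A at least another three subsets from B1, B2, …, Bn should have at least one representative in Y. W.l.o.g. assume that together with A also Bm, Bh, Bk (1 ≤ m < h < k ≤ n) have at least one representative in Y, while ∀j ∈ {1, …, n}\{m, h, k}: Bj ∩ Y = ∅ (see, e.g., Fig. 5). Then

$$\Sigma(T|Y) \,\Delta\, \Sigma(T^{SPR}|Y) = \{\, (A \cup B_m) \cap Y \mid (B_h \cup B_k) \cap Y,\ \ (B_m \cup B_h) \cap Y \mid (A \cup B_k) \cap Y \,\}.$$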

FIG. 5.

An example when n-SPR on T is a 3-SPR (or NNI) on T|Y. There are two induced partition trees (solid lines): before (T|Y) and after (TSPR|Y) an SPR was applied on T by pruning the subtree below edge a and regrafting it onto bn (Fig. 2). The three dots denote all the subtrees between the corresponding pair of subtrees on the species trees T and TSPR. Here, only A, Bm, Bh, and Bk have at least one representative in Y, and ∀j ∈ {1, …, n}\{m, h, k}: Bj has no taxa in common with Y.

Thus, the RF distance between T|Y and TSPR|Y is 2.

(ii) Let I = {i1, …, ik} be the set of all indices i such that Bi ∩ Y ≠ ∅, and let 1 ≤ i1 < … < ik ≤ n.

For edge a = A|B1 ∪ … ∪ Bn its corresponding split on the partition tree T|Y is equal to

$$A \cap Y \mid (B_{i_1} \cup \dots \cup B_{i_k}) \cap Y.$$

Similarly for Inline graphic its corresponding split on T|Y

graphic file with name eq33.gif

and for Inline graphic its corresponding split on the partition tree TSPR|Y

graphic file with name eq35.gif

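The above means that the edge on T|Y with split $B_{i_k} \cap Y \mid (A \cup B_{i_1} \cup \dots \cup B_{i_{k-1}}) \cap Y$ was divided into two edges by the edge with split $A \cap Y \mid (B_{i_1} \cup \dots \cup B_{i_k}) \cap Y$ (see also Fig. 5, where I = {i1, i2, i3}). Therefore, regrafting onto edge bn on T corresponds to regrafting onto the edge with split $B_{i_k} \cap Y \mid (A \cup B_{i_1} \cup \dots \cup B_{i_{k-1}}) \cap Y$ on the partition tree T|Y. And since 1 ≤ i1 < … < ik ≤ n, then ik = max1≤i≤n{i | Bi ∩ Y ≠ ∅}.   ■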

In other words, Proposition 2 states that an SPR on T changes the topology of T|Y if the structure of T from Figure 2 corresponds to at least a quartet structure on T|Y (e.g., Fig. 5). In this case, n-SPR on T is a 3-SPR (or NNI) on T|Y.
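The following sketch checks the condition of Proposition 2(i) by brute force on a small instance. It assumes that every block A, B1, …, Bn consists of a single taxon, encodes the trees of Figure 2 as nested tuples, and treats the condition as both necessary and sufficient, as the proof suggests. All names (`spr_pair`, `spr_changes_partition_tree`, and so on) are hypothetical.

```python
from functools import reduce
from itertools import combinations


def leaf_set(tree):
    """Leaf labels of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def induced_splits(tree, subset):
    """Sigma(T|Y) for a nested-tuple tree T and a taxon subset Y."""
    kept = leaf_set(tree) & frozenset(subset)
    found, stack = set(), [tree]
    while stack:
        node = stack.pop()
        below = leaf_set(node) & kept
        if below and below != kept:  # both sides of the split are nonempty
            found.add(frozenset([below, kept - below]))
        if not isinstance(node, str):
            stack.extend(node)
    return found


def caterpillar(labels):
    """Comb-shaped tree in which the first two labels form a cherry."""
    return reduce(lambda subtree, leaf: (subtree, leaf), labels[2:],
                  (labels[0], labels[1]))


def spr_pair(n):
    """T and T_SPR of Figure 2 with single-taxon blocks A, B1, ..., Bn."""
    blocks = [f"B{i}" for i in range(1, n + 1)]
    before = caterpillar(["A"] + blocks)                   # A sits next to B1
    after = (caterpillar(blocks[:-1]), ("A", blocks[-1]))  # A regrafted onto b_n
    return before, after


def spr_changes_partition_tree(y):
    """Condition of Proposition 2(i) for single-taxon blocks."""
    return "A" in y and sum(1 for taxon in y if taxon != "A") >= 3


n = 5
T, T_SPR = spr_pair(n)
taxa = sorted(leaf_set(T))
print(len(induced_splits(T, taxa) ^ induced_splits(T_SPR, taxa)))  # 6 = 2 * (n - 2)

# Compare the condition with the induced split sets for every subset Y.
agree = all(
    (induced_splits(T, y) != induced_splits(T_SPR, y)) == spr_changes_partition_tree(y)
    for size in range(len(taxa) + 1)
    for y in combinations(taxa, size)
)
print(agree)  # True
```

On this instance the RF distance between the species trees matches 2(n − 2), and the set-based condition agrees with the direct comparison of induced split sets for all 64 subsets Y.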

We now discuss TBR and the topological change of a partition tree as a consequence of TBR on species tree.

Proposition 3. Let tree T be in the form shown in Figure 3, and let a new tree TTBR be obtained by cutting edge e and reconnecting bn and cm with a new edge.

Then for a partition with a taxon set Y the following is true:

(i) the topologies of T|Y and TTBR|Y are different if either of the following conditions is satisfied:

  • Y has at least one representative in at least one subset from B1, B2, …, Bn and in at least another three subsets from C1, C2, …, Cm

  • Y has at least one representative in at least one subset from C1, C2, …, Cm and in at least another three subsets from B1, B2, …, Bn

(ii) this TBR will correspond to a TBR on T|Y obtained by cutting the edge with split (B1 ∪ … ∪ Bn) ∩ Y|(C1 ∪ … ∪ Cm) ∩ Y and reconnecting the edges with splits Bk ∩ Y|(⋃i∈{1,…,n}\{k} Bi ∪ C1 ∪ … ∪ Cm) ∩ Y and Ch ∩ Y|(⋃j∈{1,…,m}\{h} Cj ∪ B1 ∪ … ∪ Bn) ∩ Y, where k = max1≤i≤n{i | Bi ∩ Y ≠ ∅} and h = max1≤j≤m{j | Cj ∩ Y ≠ ∅}.

Proof.

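(i) The symmetric difference Σ(T) Δ Σ(TTBR) consists of the following splits:

$$
\begin{aligned}
&C_1 \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_i \mid B_{i+1} \cup \dots \cup B_n, && 1 \le i \le n-2,\\
&B_1 \cup \dots \cup B_i \mid B_{i+1} \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_m, && 2 \le i \le n-1,\\
&B_1 \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_j \mid C_{j+1} \cup \dots \cup C_m, && 1 \le j \le m-2,\\
&C_1 \cup \dots \cup C_j \mid C_{j+1} \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_n, && 2 \le j \le m-1.
\end{aligned}
$$

As a consequence, the symmetric difference Σ(T|Y) Δ Σ(TTBR|Y) consists of

$$
\begin{aligned}
&(C_1 \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_i) \cap Y \mid (B_{i+1} \cup \dots \cup B_n) \cap Y, && 1 \le i \le n-2,\\
&(B_1 \cup \dots \cup B_i) \cap Y \mid (B_{i+1} \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_m) \cap Y, && 2 \le i \le n-1,\\
&(B_1 \cup \dots \cup B_n \cup C_1 \cup \dots \cup C_j) \cap Y \mid (C_{j+1} \cup \dots \cup C_m) \cap Y, && 1 \le j \le m-2,\\
&(C_1 \cup \dots \cup C_j) \cap Y \mid (C_{j+1} \cup \dots \cup C_m \cup B_1 \cup \dots \cup B_n) \cap Y, && 2 \le j \le m-1.
\end{aligned}
$$

It is easy to see that if ∀i ∈ {1, …, n}: Bi ∩ Y = ∅, then all these splits would be shared by both partition trees; that is, Σ(T|Y) Δ Σ(TTBR|Y) = ∅ and the RF distance between T|Y and TTBR|Y would be 0. Therefore, Y must have at least one representative in at least one of B1, B2, …, Bn. Similarly, Y must have at least one representative in at least one of C1, C2, …, Cm.

W.l.o.g. assume that Bk ∩ Y ≠ ∅ and Ch ∩ Y ≠ ∅, where 1 ≤ k ≤ n and 1 ≤ h ≤ m.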

Partition trees T|Y and TTBR|Y will have different topologies if a TBR on T corresponds to at least an NNI on T|Y. Hence, the partition tree T|Y must have a corresponding quartet structure and together with Bk and Ch at least other two subsets from the remaining Bi and Cj should have at least one representative in Y.

W.l.o.g. assume that together with Bk and Ch also Cp, Cq (1 ≤ p < q < h ≤ m) have at least one representative in Y (Fig. 6, right panel). Then it is easy to show that

$$\Sigma(T|Y) \,\Delta\, \Sigma(T^{TBR}|Y) = \{\, (B_k \cup C_p) \cap Y \mid (C_q \cup C_h) \cap Y,\ \ (C_p \cup C_q) \cap Y \mid (C_h \cup B_k) \cap Y \,\}$$

and the RF distance between T|Y and TTBR|Y is 2.

FIG. 6.

Examples of corresponding TBRs on partition trees. Two partition trees with topologies before (T|Y, in black) and after (TTBR|Y, in red) the TBR was applied to the species tree. For simplicity we do not show the pruned subtrees for which Bi ∩ Y = ∅ and Cj ∩ Y = ∅. On the left is an example case when the topology of the partition tree remains unchanged after the TBR. On the right is the simplest case when the TBR changes the topology of the partition tree. In this case a TBR on the species tree corresponds to an NNI on the partition tree.

Similarly, one can show that if together with Bk and Ch also Bp, Bq (1 ≤ p < q < k ≤ n) have at least one representative in Y, then the RF distance between T|Y and TTBR|Y is also 2.

In contrast, if Y has at least one representative in Bk, Ch and also in Bp, Cq (1 ≤ p < k ≤ n and 1 ≤ q < h ≤ m), then Σ(T|Y) Δ Σ(TTBR|Y) = ∅ and the RF distance is 0 (Fig. 6, left panel).

(ii) Let I = {i1, …, ik} be the set of all indices i such that Bi ∩ Y ≠ ∅ and let 1 ≤ i1 < … < ik ≤ n. Similarly, let J = {j1, …, jh} be the set of all indices j such that Cj ∩ Y ≠ ∅ and let 1 ≤ j1 < … < jh ≤ m. Then for edge

$$e = B_1 \cup \dots \cup B_n \mid C_1 \cup \dots \cup C_m$$

the corresponding split on T|Y is

$$(B_{i_1} \cup \dots \cup B_{i_k}) \cap Y \mid (C_{j_1} \cup \dots \cup C_{j_h}) \cap Y.$$

For edge

graphic file with name eq50.gif

its corresponding split on tree T|Y is

graphic file with name eq51.gif

Similarly, for the corresponding edge on Inline graphic its split on TTBR|Y is

graphic file with name eq53.gif

For edges

graphic file with name eq54.gif

and

graphic file with name eq55.gif

their corresponding splits on T|Y and TTBR|Y are

graphic file with name eq56.gif

and

graphic file with name eq57.gif

respectively. The above means that edges on T|Y with corresponding splits

graphic file with name eq58.gif

and

graphic file with name eq59.gif

were reconnected on TTBR|Y. Since 1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jh ≤ m, then ik = max1≤i≤n{i | Bi ∩ Y ≠ ∅} and jh = max1≤j≤m{j | Cj ∩ Y ≠ ∅}.   ■
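Proposition 3 can be evaluated from block memberships alone. The sketch below counts the blocks Bi and Cj that intersect Y and returns the indices k and h of part (ii). It treats the two conditions of part (i) as necessary as well as sufficient, which is how the proof uses them; the function names and the example blocks are hypothetical.

```python
def nonempty_indices(blocks, y):
    """1-based indices of the blocks that share at least one taxon with Y."""
    y = set(y)
    return [i for i, block in enumerate(blocks, start=1) if y & set(block)]


def tbr_effect(b_blocks, c_blocks, y):
    """Evaluate Proposition 3 for blocks B1..Bn, C1..Cm and a partition Y.

    Returns (changed, k, h): whether T|Y changes, and the indices k and h of
    the blocks whose edges are reconnected on T|Y (None if a side is empty).
    """
    b_idx = nonempty_indices(b_blocks, y)
    c_idx = nonempty_indices(c_blocks, y)
    changed = ((len(b_idx) >= 1 and len(c_idx) >= 3)
               or (len(c_idx) >= 1 and len(b_idx) >= 3))
    k = max(b_idx) if b_idx else None
    h = max(c_idx) if c_idx else None
    return changed, k, h


# Hypothetical blocks with one taxon each (n = 4, m = 4).
B = [{"b1"}, {"b2"}, {"b3"}, {"b4"}]
C = [{"c1"}, {"c2"}, {"c3"}, {"c4"}]

print(tbr_effect(B, C, {"b2", "c1", "c2", "c4"}))  # (True, 2, 4)
print(tbr_effect(B, C, {"b1", "b2", "c1", "c3"}))  # (False, 2, 3)
```

The first call mirrors the right panel of Figure 6 (one B block and three C blocks), the second the left panel (two blocks on each side), where the partition tree is unchanged.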

4. Partial Terraces

4.1. Definition of partial terraces

In this section we discuss partial terraces, which generalize the terrace concept of Sanderson et al. (2011); for clarity, we call the latter a full terrace. When comparing two trees in a partitioned framework, we compare the sets of their induced partition trees. If the sets are identical, then the two trees belong to one full terrace. Sanderson et al. (2011) showed that the number of trees on one full terrace can be quite large. Large full terraces pose a problem in phylogenetic inference: they may cause the tree search to stop prematurely, and even if an optimal tree has been found, this tree is by no means unique. To mitigate this problem, one can reduce the terrace size by, for example, choosing a different partition scheme (Sanderson et al., 2015) or excluding some taxa from the analysis.

Now, if two species trees T1 and T2 share only a subset of identical induced partition trees, then we say that they belong to the same partial terrace. The log-likelihoods and parsimony scores of identical partition trees T1|Yi and T2|Yi are the same. Obviously, partial terraces occur more frequently than full terraces (see below). Large partial terraces can be still problematic for tree search algorithms. On the other hand, partial terraces provide the potential to reduce computation time.

4.2. Occurrence of partial terraces in real data

In this section we evaluate how often partial terraces occur in real alignments. By no means do we intend to make a full exploration of potential computing time that may be saved since the performance of the particular software will depend on the data structures and particular implementation used for the tree space exploration.

To elucidate the occurrence of partial terraces and full terraces, we analyzed seven recently published alignments (Table 1). Alignments have different numbers of taxa ranging from 69 to 404 taxa. The number of partitions (here, genes) varies from 11 to 79.

Table 1.

Alignments Used to Study the Occurrence of Partial Terraces During the Tree Search

Type and ID | No. of species | No. of genes | Missing data (%) | Source
DNA1 | 128 | 32 | 30 | Stamatakis and Alachiotis (2010)
DNA2 | 237 | 74 | 72 | Nyakatura and Bininda-Emonds (2012)
DNA3 | 372 | 79 | 66 | Springer et al. (2012)
DNA4 | 404 | 11 | 60 | Stamatakis and Alachiotis (2010)
AA1 | 69 | 31 | 35 | De Queiroz et al. (1995)
AA2 | 70 | 35 | 34 |
AA3 | 72 | 51 | 35 |

For each alignment we performed a maximum likelihood tree search using IQ-TREE (Nguyen et al., 2015) under the edge-unlinked (EUL) partition model assuming a GTR+Γ (Lanave et al., 1984; Yang, 1994) model for all partitions. We collected all the intermediate trees encountered during the search. For each intermediate tree T, we explored all trees TNNI in its NNI neighborhood. We examined partial terraces of each TNNI and T by computing how many induced partition trees are shared between them.

Apart from intermediate trees collected during the tree search, we also analyzed NNI neighborhoods for 1000 random Yule–Harding (YH) trees (Harding, 1971) for each tested alignment.

We defined 12 bins based on the percentage of shared induced partition trees between T and TNNI (Table 2) and counted how many TNNI trees fall into each bin. Table 3 shows the mean percentage of TNNI trees that fall into the corresponding bin for the intermediate trees. Figure 7 displays the boxplots for the first three alignments from Table 1 either for the IQ-TREE search trees (left column) or the random YH trees (right column) (see Supplementary Figs. S1–S4 for the remaining alignments; Supplementary Material is available online at www.liebertonline.com/cmb).

Table 2.

Partial Terrace Bins Based on the Percentage of the Shared Partition Trees Between T and TNNI

Name | Percentage of shared partition trees out of the total number of partition trees
No partial terrace (PT) | = 0%, the topologies of all partition trees are pairwise different between T and TNNI
PT1 | (0%, 10%]
PT2 | (10%, 20%]
PT3 | (20%, 30%]
… | …
PT9 | (80%, 90%]
PT10 | (90%, 100%)
Full terrace | = 100%, T and TNNI belong to one terrace
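The binning of Table 2 can be written as a small function. The sketch below assumes that the bins not listed explicitly (PT4 to PT8) follow the same 10% pattern as the listed ones, which is consistent with the columns of Table 3; the function name and the example counts are hypothetical.

```python
from collections import Counter


def partial_terrace_bin(shared, total):
    """Bin label for `shared` identical partition trees out of `total`."""
    if not 0 <= shared <= total or total <= 0:
        raise ValueError("need 0 <= shared <= total and total > 0")
    if shared == 0:
        return "No PT"
    if shared == total:
        return "Full terrace"
    # Ceiling of 10 * shared / total in integer arithmetic:
    # (0%, 10%] -> PT1, (10%, 20%] -> PT2, ..., (90%, 100%) -> PT10.
    return f"PT{-(-10 * shared // total)}"


# Hypothetical numbers of shared partition trees for six NNI neighbours
# of a tree inferred from a 32-gene alignment.
shared_counts = [0, 3, 16, 17, 31, 32]
print(Counter(partial_terrace_bin(s, 32) for s in shared_counts))
```

Integer arithmetic is used for the bin index so that boundary values such as exactly 50% fall into the closed upper end of their interval without floating-point rounding issues.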

Table 3.

Mean Percentage of Trees from NNI Neighborhood of Intermediate Trees Falling into Corresponding Partial Terrace Bin

Alignment | No PT (%) | PT1 (%) | PT2 (%) | PT3 (%) | PT4 (%) | PT5 (%) | PT6 (%) | PT7 (%) | PT8 (%) | PT9 (%) | PT10 (%) | Full terrace (%)
DNA1 | 7.14 | 12.97 | 4.85 | 1.80 | 3.55 | 32.59 | 37.11 | 0 | 0 | 0 | 0 | 0
DNA2 | 0 | 0 | 0.02 | 0.63 | 1.82 | 5.07 | 11.31 | 8.69 | 18.19 | 10.77 | 41.75 | 1.75
DNA3 | 0 | 2.75 | 5.38 | 9.22 | 10.06 | 6.32 | 4.34 | 1.33 | 0.23 | 6.38 | 50.36 | 3.63
DNA4 | 0.35 | 0.26 | 1.88 | 4.06 | 5.20 | 6.56 | 8.77 | 11.34 | 16.48 | 23.37 | 17.68 | 4.05
AA1 | 12.11 | 10.64 | 7.47 | 8.10 | 6.35 | 15.42 | 11.28 | 8.20 | 10.50 | 7.18 | 2.76 | 0
AA2 | 8.73 | 11.90 | 6.77 | 9.10 | 11.22 | 9.08 | 16.27 | 10.95 | 11.63 | 2.92 | 1.44 | 0
AA3 | 12.25 | 11.62 | 4.07 | 7.15 | 3.04 | 15.47 | 15.43 | 10.40 | 7.55 | 7.92 | 4.85 | 0.26

FIG. 7.

NNI neighborhood analysis for alignments DNA1 (top), DNA2 (middle), and DNA3 (bottom).

Intermediate and random trees have similar percentages of TNNI trees across different bins (Fig. 7 and Supplementary Figs. S1–S4). This suggests that the general picture of partial terraces is mainly determined by the spread of missing data in the supermatrix and is less dependent on the actual tree topology. Moreover, increasing the number of taxa tends to decrease the variance of TNNI percentage within each bin (for both intermediate and random trees).

Figure 8 integrates the information from Tables 2 and 3 and provides rough estimates of potential computational savings if accounting for partial and full terraces. The green bars reflect the average percentage of identical induced partition trees when TNNI is compared to T. For example, for DNA1 there is no full terrace, but we observe partial terraces that may lead to a reduction of about 38% (the percentage of green bars) in computation time.

FIG. 8.

Visualization of NNI neighborhoods and potential computational savings. Each horizontal line reflects the NNI neighborhood for one test alignment, that is, 100% of TNNI trees. These neighborhoods are divided into partial terrace bins (Table 2), depicted here by horizontal segments. The length of each segment corresponds to the mean percentage of TNNI trees falling into the bin (Table 3). Each segment is composed of a green and a red bar, corresponding to the fractions of partition trees that are shared and not shared between T and TNNI, respectively. Green bars thus indicate potential computational savings when accounting for partial and full terraces during the tree search.

There is a full terrace for DNA2, but it accounts for only 1.75% of the NNI neighborhood, whereas partial terraces constitute the remaining 98.25% and lead to a potential reduction in computation of about 80% (the percentage of green bars). In fact, since no TNNI tree falls into the “no PT” bin, we can save some computation time for all the trees encountered during the tree search. A similar trend is observed for DNA3 and DNA4, with a predicted time saving of 71% for each alignment.
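A rough version of these estimates can be recomputed from Table 3. The sketch below assigns each bin a representative shared fraction: 0 for "No PT", the interval midpoint for PT1 to PT10, and 1 for a full terrace. The midpoint rule is an assumption made for this illustration; the article does not state how the green bars of Figure 8 were computed, so the results should only be expected to approximate the quoted percentages.

```python
# Mean percentage of NNI neighbours per bin, copied from Table 3
# (columns: No PT, PT1, ..., PT10, Full terrace).
TABLE_3 = {
    "DNA1": [7.14, 12.97, 4.85, 1.80, 3.55, 32.59, 37.11, 0, 0, 0, 0, 0],
    "DNA2": [0, 0, 0.02, 0.63, 1.82, 5.07, 11.31, 8.69, 18.19, 10.77, 41.75, 1.75],
    "DNA3": [0, 2.75, 5.38, 9.22, 10.06, 6.32, 4.34, 1.33, 0.23, 6.38, 50.36, 3.63],
    "DNA4": [0.35, 0.26, 1.88, 4.06, 5.20, 6.56, 8.77, 11.34, 16.48, 23.37, 17.68, 4.05],
}

# Assumed representative shared fraction per bin: 0 for "No PT", the interval
# midpoint (0.05, 0.15, ..., 0.95) for PT1..PT10, and 1 for a full terrace.
BIN_FRACTIONS = [0.0] + [(2 * i - 1) / 20 for i in range(1, 11)] + [1.0]


def estimated_saving(bin_percentages):
    """Weighted mean shared fraction over the NNI neighbourhood, in percent."""
    weighted = sum(p * f for p, f in zip(bin_percentages, BIN_FRACTIONS))
    return 100 * weighted / sum(bin_percentages)


for name, row in TABLE_3.items():
    print(f"{name}: {estimated_saving(row):.1f}%")
```

Under the midpoint assumption, DNA1 comes out at roughly 38% and DNA2 at roughly 79%, close to the values quoted in the text.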

5. Advantages of Using Induced Partition Trees in Maximum Likelihood Inference

In maximum likelihood inference, after applying a topological rearrangement on T, one needs to optimize the edge lengths of a new tree TNEW. Therefore, together with the topological changes of partition trees, it is important to consider how topological rearrangement on T influences edge length optimization.

In the following we discuss two partition models commonly used in likelihood inferences, EUL and edge-linked (EL), and the advantages of using induced partition trees for either model (Yang, 1996).

We start by considering the most general partition model, EUL. Given a species tree T, we first obtain the corresponding induced partition trees. Under the EUL model, the edge lengths of the partition trees are optimized separately. The edge lengths of T are then computed from the corresponding edge lengths inferred on the partition trees, for example, as the mean edge length.

Therefore, if the topological rearrangement on T does not change the topology of a partition tree T|Y, no edge length optimization is necessary and, as a result, the optimal partition tree likelihood remains unchanged after such a topological rearrangement on T. Let TNEW be a tree obtained from T by some topological rearrangement.

Under EUL partition model there is no need to optimize the edge lengths of partitioned trees shared between T and TNEW. As a result, the log-likelihood of the corresponding partition trees is the same.

In contrast to the EUL model, the edges of T and of the partition trees are linked in the EL model. That means that there is only one set of edge lengths for T and the partition trees, with the possibility of rescaling the edge lengths of each partition tree by a partition-specific evolutionary rate. Therefore, the optimization of edge lengths is done on the species tree. Even if a topological rearrangement on T does not change the topology of a partition tree, it still affects the optimal partition tree likelihood via the optimization of edge lengths. This is also the reason why full terraces cannot occur under the EL model (Sanderson et al., 2015). Theoretically, one would need to optimize each edge on the species tree, which would influence the partition tree edge lengths and also the likelihood. In practice, however, to save computations, one optimizes only those edges in the vicinity of the topological change (Stamatakis et al., 2005; Guindon et al., 2010; Nguyen et al., 2015). For example, for an NNI, one reoptimizes only five edge lengths (e, e1, e2, e3, e4) around the swap. Under the EL model, this feature of practical optimization can be exploited when considering the induced partition trees.

Given a partition tree with taxon set Y and an edge e on T with the corresponding split A|B, if A ∩ Y = ∅ or B ∩ Y = ∅, then the optimization of e does not affect the likelihood of T|Y.

In this case, a split A|B does not have a corresponding split in Σ(T|Y), and therefore edge e is not linked to any edge on T|Y. This observation can be exploited to save computing time.
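A minimal sketch of this test is given below, assuming that each reoptimized edge is stored as one side A of its split A|B and that each partition is given by its taxon set Y. The function names and the eight-taxon example are hypothetical and serve only to illustrate the condition.

```python
from typing import Dict, FrozenSet, Iterable, Set


def edge_affects_partition(a: FrozenSet[str], taxa: FrozenSet[str],
                           y: FrozenSet[str]) -> bool:
    """True unless A ∩ Y or B ∩ Y is empty for the split A|B of edge e."""
    b = taxa - a
    return bool(a & y) and bool(b & y)


def partitions_to_reevaluate(edges: Iterable[FrozenSet[str]],
                             taxa: FrozenSet[str],
                             partitions: Dict[str, FrozenSet[str]]) -> Set[str]:
    """Partitions linked to at least one of the reoptimized edges."""
    edges = list(edges)
    return {pid for pid, y in partitions.items()
            if any(edge_affects_partition(a, taxa, y) for a in edges)}


# NNI on the edge abcde|fgh with the four subtrees {a,b,c,d}, {e}, {f}, {g,h}.
TAXA = frozenset("abcdefgh")
NNI_EDGES = [frozenset("abcde"),            # central edge e
             frozenset("abcd"), frozenset("e"),
             frozenset("f"), frozenset("gh")]  # four adjacent edges
PARTITIONS = {
    "p1": frozenset("abcd"),  # lies entirely inside one subtree
    "p2": frozenset("abeg"),
    "p3": frozenset("acgh"),
}

affected = partitions_to_reevaluate(NNI_EDGES, TAXA, PARTITIONS)
print(sorted(affected))  # ['p2', 'p3']
```

Partition p1 is skipped because all of its taxa lie on one side of each of the five edges. Partition p3 must still be re-evaluated under the EL model even though the NNI does not change the topology of its induced tree, since the reoptimized edges are linked to edges of that tree.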

6. Discussion

We have shown that it is advantageous to identify and account for full and partial terraces during the tree search in phylogenomics. One main advantage is the saving of computation time. If two trees belong to the same full or partial terrace, then one needs to compute the objective function for the identical partition trees only once, because its value is the same for these partition trees. The larger the number of identical partition trees between species trees, the more computation time can be saved.

From the conditions discussed in the previous sections, the topological rearrangement that benefits the most from partial terraces is NNI. Intuitively, an NNI applied to the species tree leaves the topology of the partition trees unchanged more often than an SPR or a TBR does. However, in tree searches one typically applies short SPRs (e.g., RAxML); that is, the number of edges between the pruning and the regrafting edges is much smaller than the number of taxa. The same is true for TBR. Since one also expects short SPRs and short TBRs to leave the partition trees unchanged quite often for sparse supermatrices, partial terraces are also beneficial for these rearrangements.

Moreover, the use of induced partition trees has another advantage: because of missing data, a long SPR or TBR on a species tree T might correspond to a much shorter SPR or TBR on T|Y. This saves computation even if the SPR or TBR changes the topology of the induced partition trees.

Here, we demonstrated the frequent occurrence of partial terraces in practice via NNI neighborhoods, showing that partial terraces are not only a theoretical concept but also have practical implications in phylogenomics. The predicted timesaving for the examined real alignments is only a rough estimate, since we treated the alignment lengths per partition as equal. If the length of the alignment corresponding to the shared partition trees is relatively large compared to the whole supermatrix, then one expects an even larger speedup.
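The effect of the equal-length assumption can be illustrated with a short sketch. It assumes, as a simplification that is not taken from the analysis above, that the saving is approximated by the share of partitions (or of alignment columns) whose induced partition tree is unchanged. The function name and the numbers are hypothetical.

```python
from typing import Dict, Optional


def estimated_saving(unchanged: Dict[str, bool],
                     lengths: Optional[Dict[str, int]] = None) -> float:
    """Share of the likelihood work that can be skipped for one rearrangement.

    unchanged[p] is True if the induced tree of partition p is not altered.
    Without lengths every partition counts equally; with lengths each
    partition is weighted by its number of alignment columns.
    """
    if lengths is None:
        lengths = {pid: 1 for pid in unchanged}
    total = sum(lengths[pid] for pid in unchanged)
    saved = sum(lengths[pid] for pid, same in unchanged.items() if same)
    return saved / total


# Hypothetical example with four partitions, two of them unchanged.
unchanged = {"p1": True, "p2": False, "p3": True, "p4": False}
lengths = {"p1": 3000, "p2": 500, "p3": 1000, "p4": 500}

equal_weight = estimated_saving(unchanged)              # 2 of 4 partitions
length_weight = estimated_saving(unchanged, lengths)    # 4000 of 5000 columns
print(equal_weight, length_weight)  # 0.5 0.8
```

When the unchanged partitions are the long ones, as in this made-up example, the length-weighted estimate exceeds the equal-weight estimate.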

Another important factor for timesaving is the actual implementation of search strategies in the particular software. We plan to implement efficient techniques to take full advantage of partial and full terraces in IQ-TREE. A more thorough analysis of such techniques will be presented elsewhere.

Supplementary Material

Supplemental data
Supp_Figs1-4.pdf (194.3KB, pdf)

Acknowledgments

This work was supported by the Austrian Science Fund (FWF, grant number I-1824-B22) to O.C.

Author Disclosure Statement

No competing financial interests exist.

References

  1. De Queiroz A., Donoghue M.J., and Kim J. 1995. Separate versus combined analysis of phylogenetic evidence. Annu. Rev. Ecol. Syst. 26, 657–681.
  2. De Queiroz A., and Gatesy J. 2007. The supermatrix approach to systematics. Trends Ecol. Evol. 22, 34–41.
  3. Felsenstein J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
  4. Guindon S., Dufayard J.F., Lefort V., et al. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321.
  5. Harding E.F. 1971. The probabilities of rooted tree shapes generated by random bifurcation. Adv. Appl. Probability. 3, 44–77.
  6. Hedtke S.M., Patiny S., and Danforth B.N. 2013. The bee tree of life: A supermatrix approach to apoid phylogeny and biogeography. BMC Evol. Biol. 13, 138.
  7. Hordijk W., and Gascuel O. 2005. Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood. Bioinformatics. 21, 4338–4347.
  8. Lanave C., Preparata G., Saccone C., et al. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20, 86–93.
  9. Nguyen L.T., Schmidt H.A., von Haeseler A., et al. 2015. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274.
  10. Nyakatura K., and Bininda-Emonds O.R.P. 2012. Updating the evolutionary history of Carnivora (Mammalia): A new species-level supertree complete with divergence time estimates. BMC Biol. 10, 12.
  11. Peters R.S., Meyer B., Krogmann L., et al. 2011. The taming of an impossible child: A standardized all-in approach to the phylogeny of Hymenoptera using public database sequences. BMC Biol. 9, 55.
  12. Pyron R.A., Burbrink F.T., Colli G.R., et al. 2011. The phylogeny of advanced snakes (Colubroidea), with discovery of a new subfamily and comparison of support methods for likelihood trees. Mol. Phylogenet. Evol. 58, 329–342.
  13. Pyron R.A., and Wiens J.J. 2011. A large-scale phylogeny of Amphibia including over 2800 species, and a revised classification of extant frogs, salamanders, and caecilians. Mol. Phylogenet. Evol. 61, 543–583.
  14. Robinson D.F., and Foulds L.R. 1981. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147.
  15. Sanderson M.J., McMahon M.M., Stamatakis A., et al. 2015. Impacts of terraces on phylogenetic inference. Syst. Biol. 64, 709–726.
  16. Sanderson M.J., McMahon M.M., and Steel M. 2011. Terraces in phylogenetic tree space. Science. 333, 448–450.
  17. Sanderson M.J., Purvis A., and Henze C. 1998. Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. 13, 105–109.
  18. Semple C., and Steel M.A. 2003. Phylogenetics. Oxford University Press, New York, NY.
  19. Springer M.S., Meredith R.W., Gatesy J., et al. 2012. Macroevolutionary dynamics and historical biogeography of primate diversification inferred from a species supermatrix. PLoS ONE 7, e49521.
  20. Stamatakis A., and Alachiotis N. 2010. Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data. Bioinformatics. 26, i132–i139.
  21. Stamatakis A., Ludwig T., and Meier H. 2005. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics. 21, 456–463.
  22. van der Linde K., Houle D., Spicer G.S., et al. 2010. A supermatrix-based molecular phylogeny of the family Drosophilidae. Genet. Res. 92, 25–38.
  23. Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39, 306–314.
  24. Yang Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42, 587–596.


Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.