Consequences of Common Topological Rearrangements for Partition Trees in Phylogenomic Inference

Olga Chernomor; Bui Quang Minh; Arndt von Haeseler

doi:10.1089/cmb.2015.0146

2015 Dec 1;22(12):1129–1142. doi: 10.1089/cmb.2015.0146

PMCID: PMC4663649 PMID: 26448206

Abstract

In phylogenomic analysis, collections of trees with identical score (maximum likelihood or parsimony score) may hamper tree search algorithms. Such collections are called phylogenetic terraces. For sparse supermatrices with a large proportion of missing data, the number of terraces and the number of trees on the terraces can be very large. If terraces are not taken into account, considerable computation time may be spent unnecessarily on evaluating many trees that in fact have identical scores. To save computation time during the tree search, it is worthwhile to identify such cases quickly. The score of a species tree is the sum of the scores of all its so-called induced partition trees. Therefore, if a topological rearrangement applied to a species tree does not change an induced partition tree, the score of that partition tree is unchanged. Here, we provide the conditions under which the three most widely used topological rearrangements (nearest neighbor interchange, subtree pruning and regrafting, and tree bisection and reconnection) change the topologies of induced partition trees. During the tree search, these conditions allow us to quickly identify whether we can save computation time on the evaluation of newly encountered trees. We also introduce the concept of partial terraces and demonstrate that they occur more frequently than the original “full” terraces. Hence, partial terraces are a more important source of time savings than full terraces. Therefore, taking into account the above conditions and the partial terrace concept will help to speed up the tree search in phylogenomic inference.

Key words: nearest neighbor interchange, partial terraces, phylogenetic terraces, subtree pruning and regrafting, tree bisection and reconnection

1. Introduction

In phylogenomics, one aims to reconstruct a phylogenetic species tree from multiple genes. One popular approach is to infer the trees from the concatenated gene alignment, the so-called supermatrix (Sanderson et al., 1998; De Queiroz and Gatesy, 2007). Here, if a gene sequence is not available for some taxon, it is represented by the sequence of unknown characters and is referred to as missing data. Several studies (van der Linde et al., 2010; Pyron and Wiens, 2011; Pyron et al., 2011; Nyakatura and Bininda-Emonds, 2012; Springer et al., 2012; Hedtke et al., 2013) use quite sparse supermatrices in their analysis and the percentage of missing data sometimes constitutes up to 95% (Peters et al., 2011).

Recently, it has been shown that missing data can hamper the tree search through the existence of phylogenetic terraces (Sanderson et al., 2011), collections of trees with exactly the same likelihood or parsimony score. Terraces occur in analyses of partitioned data, that is, when distinct blocks of a supermatrix are treated differently (e.g., when each gene corresponding to one block evolves under its own evolutionary model). Two trees are said to belong to one terrace if the collections of their induced partition trees are exactly the same. Here, the induced partition tree is obtained from the species tree by pruning the taxa that have no sequence for the corresponding partition block.

Since the number of trees on one terrace can be quite large (Sanderson et al., 2011), accounting for terraces in tree search algorithms can potentially save a lot of computation time. During the tree search, one explores the tree space by moving from one candidate tree to another by means of topological rearrangements. If the topological rearrangement does not change any of the induced partition trees, then the two trees belong to the same terrace and a recomputation of objective function (maximum likelihood or maximum parsimony) used in the tree search is not necessary in order to evaluate a new tree.

Here, we first specify the conditions under which the topological rearrangements applied to the species tree change the corresponding induced partition trees. Using these conditions, one can quickly identify whether it is necessary to recompute the objective function for a given partition or not as a consequence of one of the three widely used rearrangements: nearest neighbor interchange (NNI), subtree pruning and regrafting (SPR) and tree bisection and reconnection (TBR) (Felsenstein, 2004).

We further generalize the concept of terrace to partial terrace, which is even more useful in practical phylogenetic analysis. We analyze several published alignments by examining NNI neighborhoods of random trees and of trees encountered during the tree search using IQ-TREE (Nguyen et al., 2015). We show that for large numbers of taxa, partial terraces are mainly determined by the missing data and are less dependent on the actual tree topology analyzed. By taking partial terraces into account, it will be possible to speed up tree search algorithms even in the absence of terraces.

The outline of the article is the following. We first introduce the notations and then discuss the important features of NNI, SPR, and TBR. Next, we specify the conditions when these topological rearrangements do not change the topology of induced partition trees. We further elucidate why such conditions are helpful even in the absence of terraces and define the concept of partial terrace. We analyze several published alignments to point out that partial terraces do occur in practice. Finally, we discuss the additional practical advantages of using induced partition trees in the maximum likelihood framework.

2. Background

2.1. Basic definitions and notations

In this section we provide basic definitions and notations used throughout the article. For a complete overview, see chapters 2, 3, and 6 in Semple and Steel (2003).

Definition 2.1. Let X be a taxon set. A phylogenetic tree T of X is a leaf-labeled tree with a bijection map from X into the set of leaves of T.

In the following, we work only with bifurcating phylogenetic trees; that is, all internal nodes have exactly three adjacent edges.

Definition 2.2. A split, denoted by A|B, is a bipartition of X into two nonempty, nonoverlapping sets A and B, where A ∪ B = X.

Note that A|B and B|A are equivalent. Every edge of T is associated with a split. When cutting an edge e of T, we obtain two subtrees with leaf labels X₁ and X₂, and the split corresponding to e is then defined as X₁|X₂. We denote this by e = X₁|X₂.

We denote by Σ(T) a collection of all splits corresponding to edges of T.

The symmetric difference of two sets A and B, denoted AΔB, is given by (A\B) ∪ (B\A), or the union of taxa present in A but not B, and vice versa.

Definition 2.3. Let T₁ and T₂ be two leaf-labeled trees with the same label set X, and let Σ(T₁) and Σ(T₂) be the collections of splits of T₁ and T₂, respectively. Then the Robinson–Foulds (RF) distance (Robinson and Foulds, 1981) between T₁ and T₂ is equal to |Σ(T₁) Δ Σ(T₂)|.

If the RF distance between two trees is 0, then they have the same collection of splits, and by the splits-equivalence theorem (Semple and Steel, 2003; p. 43), the trees are equivalent.
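As an illustration of Definitions 2.2 and 2.3, the following minimal Python sketch collects the splits of a tree and computes the RF distance as the size of their symmetric difference. The nested-tuple tree encoding and the names `leaf_set`, `splits`, and `rf_distance` are assumptions made for this example only; they do not refer to any existing software.

```python
from typing import FrozenSet, Set


def leaf_set(tree) -> FrozenSet[str]:
    """Taxa below a node; a leaf is a string, an inner node a tuple of children."""
    if isinstance(tree, str):
        return frozenset([tree])
    taxa: FrozenSet[str] = frozenset()
    for child in tree:
        taxa |= leaf_set(child)
    return taxa


def splits(tree) -> Set[FrozenSet[FrozenSet[str]]]:
    """Collect Sigma(T): one split per edge, stored as an unordered pair {X1, X2}."""
    taxa = leaf_set(tree)
    found = set()

    def visit(node) -> None:
        below = leaf_set(node)
        above = taxa - below
        if above:  # the top-level tuple has no edge above it
            found.add(frozenset([below, above]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def rf_distance(t1, t2) -> int:
    """Robinson-Foulds distance: size of the symmetric difference of the split sets."""
    return len(splits(t1) ^ splits(t2))


# Two unrooted quartet trees on {a, b, c, d}, written with a trifurcation at the top.
t1 = (("a", "b"), "c", "d")   # interior split ab|cd
t2 = (("a", "c"), "b", "d")   # interior split ac|bd
print(rf_distance(t1, t2))    # 2
print(rf_distance(t1, t1))    # 0
```

Storing each split as an unordered pair makes A|B and B|A compare equal, which matches the equivalence noted above.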

Definition 2.4. Let Y be a subset of X. An induced subtree of T, denoted by T|Y, is a leaf-labeled tree with the following collection of splits:

Σ(T|Y) = {A ∩ Y|B ∩ Y : A|B ∈ Σ(T), A ∩ Y ≠ ∅ and B ∩ Y ≠ ∅}.

For a species tree T and a given partition with taxon set Y, a partition tree is an induced subtree T|Y.
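The construction of an induced partition tree can be sketched at the level of splits. The snippet below assumes the usual reading of an induced subtree, in which every split A|B of T is restricted to Y and kept only if both sides remain nonempty; the helper names `split` and `induced_splits` and the single-letter taxon names are hypothetical.

```python
def split(a, b):
    """Build the unordered split a|b from two iterables of taxa."""
    return frozenset([frozenset(a), frozenset(b)])


def induced_splits(tree_splits, y):
    """Restrict every split A|B to Y and keep it if both sides stay nonempty."""
    y = frozenset(y)
    restricted = set()
    for s in tree_splits:
        a, b = tuple(s)
        a_y, b_y = a & y, b & y
        if a_y and b_y:
            restricted.add(frozenset([a_y, b_y]))
    return restricted


# Splits of the five-taxon tree ((a,b),c,(d,e)); each character is one taxon.
sigma_t = {
    split("a", "bcde"), split("b", "acde"), split("c", "abde"),
    split("d", "abce"), split("e", "abcd"),
    split("ab", "cde"), split("abc", "de"),
}

# Taxon b has no sequence in this partition, so Y = {a, c, d, e}.
sigma_t_y = induced_splits(sigma_t, "acde")
print(len(sigma_t_y))  # 5: four trivial splits plus ac|de
```

Note that the splits a|bcde and ab|cde of T collapse to the same split a|cde on T|Y, which is why several edges of the species tree can map to one edge of a partition tree.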

2.2. Topological rearrangement operations

In this section we introduce the topological rearrangements on trees commonly used in phylogenetic inference.

The simplest possible operation that changes only one split on a tree is an NNI. It can only be applied to interior edges of the tree, since it requires the so-called quartet structure with an interior edge being the central edge of this structure (Fig. 1).

Let e be an interior edge of T and e₁, e₂, e₃, e₄ its four incident edges with A, B, C, D being the taxon sets leading from them, respectively (Fig. 1). An NNI on T around e is obtained by exchanging the subtrees below two nonincident edges from e₁, e₂, e₃, e₄. We denote a new tree by T_NNI.

For each interior edge e there are two possible NNIs obtained by exchanging a subtree below e₁ with a subtree below either e₃ or e₄ (note that this is equivalent to swapping the subtree below e₂ with either e₄ or e₃, respectively).

Let us assume that the NNI is applied to edge e by swapping e₁ and e₃. The splits corresponding to e₁, e₂, e₃, and e₄ stay unchanged:

e₁ = A|B ∪ C ∪ D, e₂ = B|A ∪ C ∪ D, e₃ = C|A ∪ B ∪ D, e₄ = D|A ∪ B ∪ C.

This also holds true for the edges belonging to subtrees below e₁, e₂, e₃, and e₄ (Fig. 1). Here, if e₁ = A|B ∪ C ∪ D, the subtree below e₁ is a subtree with a leaf set A and not the union of sets. Hence, the splits corresponding to e₁, e₂, e₃, e₄ and edges below them will be shared by T and T_NNI.

The central edge e in terms of splits will be changed by the NNI from A ∪ B|C ∪ D to e^NNI = A∪D|B ∪ C.

It follows from above that T and T_NNI are different only in one split; that is,

Σ(T) Δ Σ(T_NNI) = {A ∪ B|C ∪ D, A ∪ D|B ∪ C},

and the RF distance between T and T_NNI is 2.
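The effect of an NNI on the split set can be illustrated with a short Python sketch. It assumes the NNI swaps the subtrees below e₁ and e₃, so that the central split A ∪ B|C ∪ D is replaced by A ∪ D|B ∪ C; the function names and the example tree are hypothetical.

```python
def split(a, b):
    """Build the unordered split a|b from two iterables of taxa."""
    return frozenset([frozenset(a), frozenset(b)])


def nni_splits(sigma_t, a, b, c, d):
    """Return the split set after the NNI that replaces A∪B|C∪D by A∪D|B∪C."""
    a, b, c, d = (frozenset(s) for s in (a, b, c, d))
    old = split(a | b, c | d)
    new = split(a | d, b | c)
    if old not in sigma_t:
        raise ValueError("A∪B|C∪D is not a split of the tree")
    return (set(sigma_t) - {old}) | {new}


# Splits of the five-taxon tree ((a,b),c,(d,e)); each character is one taxon.
sigma_t = {
    split("a", "bcde"), split("b", "acde"), split("c", "abde"),
    split("d", "abce"), split("e", "abcd"),
    split("ab", "cde"), split("abc", "de"),
}

# NNI around the edge ab|cde with A = {a}, B = {b}, C = {c}, D = {d, e}.
sigma_nni = nni_splits(sigma_t, "a", "b", "c", "de")
print(len(sigma_t ^ sigma_nni))  # 2, the RF distance between T and T_NNI
```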

We now discuss SPR, a more general topological rearrangement that changes one or more splits of the tree.

An SPR on T is represented in Figure 2 (see also Hordijk and Gascuel, 2005). A new tree T_SPR is obtained from T by pruning the subtree below edge a and regrafting it onto edge b_n (we sometimes refer to such an SPR as an n-SPR). Note that n is at least 3 and, if n = 3, an SPR is equivalent to an NNI obtained by swapping subtrees belonging to edges a and b₂. Let A, B₁, …, B_n denote the corresponding taxon sets leading from a, b₁, …, b_n, respectively (Fig. 2).

FIG. 2. Visualization of SPR. A new tree T_SPR is obtained by pruning the subtree A below edge a and regrafting it onto edge b_n (dashed red subtree). After SPR is applied, edges b₁ and e₁ are joined and edge b_n is split into e_{n−1} and b_n. SPR, subtree pruning and regrafting.

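An SPR on T changes only the splits of the path edges, namely: for 1 ≤ x ≤ n − 2,

e_x = A ∪ B₁ ∪ … ∪ B_x|B_{x+1} ∪ … ∪ B_n

is changed to

e_x^SPR = B₁ ∪ … ∪ B_x|B_{x+1} ∪ … ∪ B_n ∪ A,

where e_x^SPR is an edge that corresponds to e_x on a new tree T_SPR (for x = 1 this is the split of b₁, in line with edges b₁ and e₁ being joined). Also, a new edge appears: e_{n−1} = B₁ ∪ … ∪ B_{n−1}|A ∪ B_n. The rest of the splits remain unchanged and are shared by both trees. Hence, for T and T_SPR the symmetric difference Σ(T) Δ Σ(T_SPR) consists of the following splits:

{A ∪ B₁ ∪ … ∪ B_x|B_{x+1} ∪ … ∪ B_n : 1 ≤ x ≤ n − 2} ∪ {B₁ ∪ … ∪ B_x|B_{x+1} ∪ … ∪ B_n ∪ A : 2 ≤ x ≤ n − 1}.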

The RF distance between T and T_SPR is equal to 2 (n − 2).

The last topological rearrangement we are going to discuss is the TBR. A TBR on T is shown in Figure 3, where a new tree T_TBR is obtained from T (Fig. 3, in black) by cutting edge e and reconnecting edges b_n and c_m with a new edge e^TBR (Fig. 3, red dashed line). Note that n or m must be greater than 2. W.l.o.g. assume that m ≤ n. If n = 3 and m = 2, then a TBR corresponds to an NNI around edge e₁ by swapping subtrees below e and b₂. If n > 3 and m = 2, then a TBR corresponds to an SPR.

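TBR changes only the splits corresponding to the path edges (e_i and z_j), but not the split of e. Namely,

e = B₁ ∪ … ∪ B_n|C₁ ∪ … ∪ C_m = e^TBR,

while for 1 ≤ i ≤ n − 2

e_i = C₁ ∪ … ∪ C_m ∪ B₁ ∪ … ∪ B_i|B_{i+1} ∪ … ∪ B_n

is changed to

e_i^TBR = B₁ ∪ … ∪ B_i|B_{i+1} ∪ … ∪ B_n ∪ C₁ ∪ … ∪ C_m

and for 1 ≤ j ≤ m − 2

z_j = B₁ ∪ … ∪ B_n ∪ C₁ ∪ … ∪ C_j|C_{j+1} ∪ … ∪ C_m

is changed to

z_j^TBR = C₁ ∪ … ∪ C_j|C_{j+1} ∪ … ∪ C_m ∪ B₁ ∪ … ∪ B_n.

Also two new edges appear (for m = 2 the second one coincides with the split of c₁ and is not new):

e_{n−1} = B₁ ∪ … ∪ B_{n−1}|B_n ∪ C₁ ∪ … ∪ C_m and z_{m−1} = C₁ ∪ … ∪ C_{m−1}|C_m ∪ B₁ ∪ … ∪ B_n.

The remaining splits stay unchanged. Hence, for T and T_TBR the symmetric difference Σ(T) Δ Σ(T_TBR) is a set consisting of the following splits

{C₁ ∪ … ∪ C_m ∪ B₁ ∪ … ∪ B_i|B_{i+1} ∪ … ∪ B_n : 1 ≤ i ≤ n − 2} ∪ {B₁ ∪ … ∪ B_i|B_{i+1} ∪ … ∪ B_n ∪ C₁ ∪ … ∪ C_m : 2 ≤ i ≤ n − 1} ∪ {B₁ ∪ … ∪ B_n ∪ C₁ ∪ … ∪ C_j|C_{j+1} ∪ … ∪ C_m : 1 ≤ j ≤ m − 2} ∪ {C₁ ∪ … ∪ C_j|C_{j+1} ∪ … ∪ C_m ∪ B₁ ∪ … ∪ B_n : 2 ≤ j ≤ m − 1}.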

Therefore, the RF distance between T and T_TBR is 2 (n + m − 4).

3. Consequences of Topological Rearrangements Applied to a Species Tree

In the following we discuss how the topological rearrangement of the species tree T influences the topology of the partition trees and start with the simplest operation, an NNI.

Proposition 1. Let e be an interior edge and e₁, e₂, e₃, e₄ the four edges adjacent to e with A, B, C, D being the taxon sets leading from the corresponding edges (Fig. 1). Let a new tree T_NNI be obtained from T via NNI. For a partition with a taxon set Y, the topologies of T|Y and T_NNI|Y are different iff Y has at least one representative taxon in each subset A, B, C, D.

Proof.

W.l.o.g. assume that T_NNI is obtained from T via swapping of subtrees below e₁ and e₃. Then Σ(T) Δ Σ(T_NNI) = {A ∪ B|C ∪ D, A ∪ D|B ∪ C} and as a consequence for the corresponding partition trees we have

Σ(T|Y) Δ Σ(T_NNI|Y) ⊆ {(A ∪ B) ∩ Y|(C ∪ D) ∩ Y, (A ∪ D) ∩ Y|(B ∪ C) ∩ Y}.

It is easy to show that if at least one set from A ∩ Y, B ∩ Y, C ∩ Y, D ∩ Y were empty, then both splits (A ∪ B) ∩ Y|(C ∪ D) ∩ Y and (A ∪ D) ∩ Y|(B ∪ C) ∩ Y coincide with splits shared by T|Y and T_NNI|Y (e.g., see Fig. 4). Hence, Σ(T|Y) Δ Σ(T_NNI|Y) = ∅ and the RF distance between these trees would be 0. Therefore, for T|Y and T_NNI|Y to have different topologies, all A ∩ Y, B ∩ Y, C ∩ Y, D ∩ Y must be nonempty, meaning that Y has to have at least one representative in each subset A, B, C, D. ■

FIG. 4. An example when an NNI on T does not change the topology of T|Y. Solid lines correspond to two induced partition trees before (T|Y) and after (T_NNI|Y) the NNI was applied to edge e on T by swapping the subtrees below e₁ and e₃ (Fig. 1). In this case, Y does not have a representative in A (i.e., A ∩ Y = ∅); therefore, (A ∪ B) ∩ Y|(C ∪ D) ∩ Y = B ∩ Y|(C ∪ D) ∩ Y and (A ∪ D) ∩ Y|(B ∪ C) ∩ Y = D ∩ Y|(B ∪ C) ∩ Y. Since the splits B ∩ Y|(C ∪ D) ∩ Y and D ∩ Y|(B ∪ C) ∩ Y are shared by T|Y and T_NNI|Y, Σ(T|Y) Δ Σ(T_NNI|Y) = ∅ and the RF distance between T|Y and T_NNI|Y is 0.

In simple words, if some intersections of A, B, C, D with Y are empty, then a partition tree does not have a corresponding quartet structure for the NNI to be applied to and edge e loses its centrality or interior feature (see, e.g., Fig. 4). When this happens, the topology of the partition tree T|Y is not affected by the NNI applied to e on the species tree T.
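The condition of Proposition 1 reduces to four set intersections, so it can be checked without building any tree. The following Python sketch is one possible way to express it; the function name and the example taxon names are assumptions for illustration.

```python
def nni_changes_partition_tree(a, b, c, d, y):
    """Proposition 1: T|Y changes iff Y meets each of the four subsets A, B, C, D."""
    y = set(y)
    return all(y & set(part) for part in (a, b, c, d))


# Quartet structure around edge e with A = {a}, B = {b}, C = {c}, D = {d, e}.
A, B, C, D = {"a"}, {"b"}, {"c"}, {"d", "e"}

# Partition without any taxon from B: the NNI leaves T|Y untouched.
print(nni_changes_partition_tree(A, B, C, D, {"a", "c", "d", "e"}))  # False

# Partition with a representative in every subset: T|Y changes.
print(nni_changes_partition_tree(A, B, C, D, {"a", "b", "c", "e"}))  # True
```

In a tree search, a `False` result means the score of that partition can be reused for T_NNI without recomputation.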

We next specify the condition when an SPR changes the topology of partition tree.

Proposition 2. Let tree T be in the form shown in Figure 2, and a new tree T_SPR is obtained with SPR by pruning subtree below edge a and regrafting it onto b_n.

Then for a partition with a taxon set Y the following is true:

(i) the topologies of T|Y and T_SPR|Y are different, if Y has at least one representative in A and in at least another three subsets from B₁, B₂, …, B_n;

(ii) this SPR will correspond to an SPR on T|Y obtained by pruning the subtree below the edge with split A ∩ Y|(B₁ ∪ … ∪ B_n) ∩ Y and regrafting it onto the edge with split B_k ∩ Y|(A ∪ ⋃_{i ∈ {1, …, n}\{k}} B_i) ∩ Y, where k = max_{1≤i≤n} {i | B_i ∩ Y ≠ ∅}.

Proof.

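(i) The symmetric difference Σ(T) Δ Σ(T_SPR) consists of the following splits

{A ∪ B₁ ∪ … ∪ B_x|B_{x+1} ∪ … ∪ B_n : 1 ≤ x ≤ n − 2} ∪ {B₁ ∪ … ∪ B_x|B_{x+1} ∪ … ∪ B_n ∪ A : 2 ≤ x ≤ n − 1}.

As a consequence for the induced partition trees T|Y and T_SPR|Y, the symmetric difference of Σ(T|Y) and Σ(T_SPR|Y) is contained in the set of restrictions of these splits to Y (those with both sides nonempty):

{(A ∪ B₁ ∪ … ∪ B_x) ∩ Y|(B_{x+1} ∪ … ∪ B_n) ∩ Y : 1 ≤ x ≤ n − 2} ∪ {(B₁ ∪ … ∪ B_x) ∩ Y|(B_{x+1} ∪ … ∪ B_n ∪ A) ∩ Y : 2 ≤ x ≤ n − 1}.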

It is easy to see that if A ∩ Y = ∅, then all these splits would be shared by both partition trees, that is, Σ(T|Y) Δ Σ(T_SPR|Y) = ∅ and the RF distance between T|Y and T_SPR|Y would be 0. Therefore, Y must have at least one representative in A.

For T|Y and T_SPR|Y to have different topologies, an SPR on T should correspond to at least an NNI on T|Y. Hence, T|Y must have a corresponding quartet structure and together with A at least another three subsets from B₁, B₂, …, B_n should have at least one representative in Y. W.l.o.g. assume that together with A also B_m, B_h, B_k (1 ≤ m < h < k ≤ n) have at least one representative in Y while ∀j ∈ {1, …, n}\{m, h, k}: B_j ∩ Y = ∅ (see, e.g., Fig. 5). Then

Σ(T|Y) Δ Σ(T_SPR|Y) = {(A ∪ B_m) ∩ Y|(B_h ∪ B_k) ∩ Y, (B_m ∪ B_h) ∩ Y|(A ∪ B_k) ∩ Y}.

FIG. 5. An example when an n-SPR on T is a 3-SPR (or NNI) on T|Y. There are two induced partition trees (solid lines): before (T|Y) and after (T_SPR|Y) an SPR was applied on T by pruning the subtree below edge a and regrafting it onto b_n (Fig. 2). The three dots denote all the subtrees between the corresponding pair of subtrees on the species trees T and T_SPR. Here, only A, B_m, B_h, and B_k have at least one representative in Y and ∀j ∈ {1, …, n}\{m, h, k}: B_j has no taxa in common with Y.

Thus, the RF distance between T|Y and T_SPR|Y is 2.

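(ii) Let I = {i₁, …, i_k} be the set of all indices such that B_i ∩ Y ≠ ∅ for i ∈ I, and let 1 ≤ i₁ < … < i_k ≤ n.

For edge a = A|B₁ ∪ … ∪ B_n its corresponding split on the partition tree T|Y is equal to

A ∩ Y|(B_{i₁} ∪ … ∪ B_{i_k}) ∩ Y.

Similarly, for the pendant edge leading to B_{i_k}, with split B_{i_k}|A ∪ ⋃_{i ≠ i_k} B_i, its corresponding split on T|Y is

B_{i_k} ∩ Y|(A ∪ B_{i₁} ∪ … ∪ B_{i_{k−1}}) ∩ Y,

and for the edge of T_SPR with split B₁ ∪ … ∪ B_{i_k − 1}|B_{i_k} ∪ … ∪ B_n ∪ A its corresponding split on the partition tree T_SPR|Y is

(B_{i₁} ∪ … ∪ B_{i_{k−1}}) ∩ Y|(B_{i_k} ∪ A) ∩ Y.

The above means that the edge on T|Y with split B_{i_k} ∩ Y|(A ∪ B_{i₁} ∪ … ∪ B_{i_{k−1}}) ∩ Y was divided by an edge with split (B_{i₁} ∪ … ∪ B_{i_{k−1}}) ∩ Y|(B_{i_k} ∪ A) ∩ Y into two edges (see also Fig. 5, where I = {i₁, i₂, i₃}). Therefore, regrafting onto edge b_n on T corresponds to regrafting onto the edge with split B_{i_k} ∩ Y|(A ∪ B_{i₁} ∪ … ∪ B_{i_{k−1}}) ∩ Y on the partition tree T|Y. And since 1 ≤ i₁ < … < i_k ≤ n, then i_k = max_{1≤i≤n} {i | B_i ∩ Y ≠ ∅}. ■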

In other words, Proposition 2 states that an SPR on T changes the topology of T|Y if the structure of T from Figure 2 corresponds to at least a quartet structure on T|Y (e.g., Fig. 5). In this case, n-SPR on T is a 3-SPR (or NNI) on T|Y.
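Proposition 2 can likewise be evaluated from set intersections alone. The sketch below returns both the answer to part (i) and the index k of part (ii); the function name, the return convention, and the example subsets are hypothetical choices for this illustration.

```python
def spr_on_partition_tree(a, bs, y):
    """Evaluate Proposition 2 for pruned set A, subsets B_1..B_n (in path order), and Y.

    Returns (changed, k): whether T|Y changes, and the largest index k with
    B_k ∩ Y nonempty (None if Y meets none of the B_i).
    """
    y = set(y)
    present = [i for i, b_i in enumerate(bs, start=1) if y & set(b_i)]
    k = max(present) if present else None
    changed = bool(y & set(a)) and len(present) >= 3
    return changed, k


A = {"a"}
Bs = [{"b1"}, {"b2"}, {"b3"}, {"b4"}, {"b5"}]   # a 5-SPR on the species tree

print(spr_on_partition_tree(A, Bs, {"a", "b1", "b3", "b4"}))  # (True, 4)
print(spr_on_partition_tree(A, Bs, {"a", "b2", "b5"}))        # (False, 5)
print(spr_on_partition_tree(A, Bs, {"b1", "b2", "b3"}))       # (False, 3)
```

In the first call the 5-SPR on T acts as an NNI on T|Y, with the pruned subtree regrafted next to B₄, the last subset that still has a representative in Y.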

We now discuss TBR and the topological change of a partition tree as a consequence of TBR on species tree.

Proposition 3. Let tree T be in the form shown in Figure 3 and a new tree T_TBR is obtained by cutting edge e and reconnecting b_n and c_m with a new edge.

Then for a partition with a taxon set Y the following is true:

(i) the topologies of T|Y and T_TBR|Y are different if either of the following conditions is satisfied:

• Y has at least one representative in at least one subset from B₁, B₂, … , B_n and in at least another three subsets from C₁, C₂, … ,C_m
• Y has at least one representative in at least one subset from C₁, C₂, … , C_m and in at least another three subsets from B₁, B₂, … , B_n

(ii) this TBR will correspond to a TBR on T|Y obtained by cutting the edge with split (B₁ ∪ … ∪ B_n) ∩ Y|(C₁ ∪ … ∪ C_m) ∩ Y and reconnecting the edges with splits B_k ∩ Y|(⋃_{i ∈ {1, …, n}\{k}} B_i ∪ C₁ ∪ … ∪ C_m) ∩ Y and C_h ∩ Y|(⋃_{j ∈ {1, …, m}\{h}} C_j ∪ B₁ ∪ … ∪ B_n) ∩ Y, where k = max_{1≤i≤n} {i | B_i ∩ Y ≠ ∅} and h = max_{1≤j≤m} {j | C_j ∩ Y ≠ ∅}.

Proof.

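(i) The symmetric difference Σ(T) Δ Σ(T_TBR) consists of the following splits:

{C₁ ∪ … ∪ C_m ∪ B₁ ∪ … ∪ B_i|B_{i+1} ∪ … ∪ B_n : 1 ≤ i ≤ n − 2} ∪ {B₁ ∪ … ∪ B_i|B_{i+1} ∪ … ∪ B_n ∪ C₁ ∪ … ∪ C_m : 2 ≤ i ≤ n − 1} ∪ {B₁ ∪ … ∪ B_n ∪ C₁ ∪ … ∪ C_j|C_{j+1} ∪ … ∪ C_m : 1 ≤ j ≤ m − 2} ∪ {C₁ ∪ … ∪ C_j|C_{j+1} ∪ … ∪ C_m ∪ B₁ ∪ … ∪ B_n : 2 ≤ j ≤ m − 1}.

As a consequence, the symmetric difference Σ(T|Y) Δ Σ(T_TBR|Y) is contained in the set of restrictions of these splits to Y, that is, of splits S₁ ∩ Y|S₂ ∩ Y with S₁|S₂ ∈ Σ(T) Δ Σ(T_TBR) and both sides nonempty.

It is easy to see that if ∀i ∈ {1, …, n}: B_i ∩ Y = ∅, then all these splits would be shared by both partition trees; that is, Σ(T|Y) Δ Σ(T_TBR|Y) = ∅ and the RF distance between T|Y and T_TBR|Y would be 0. Therefore, Y must have at least one representative in at least one subset from B₁, B₂, …, B_n. Similarly, Y must have at least one representative in at least one subset from C₁, C₂, …, C_m.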

W.l.o.g. assume that B_k ∩ Y ≠ ∅ and C_h ∩ Y ≠ ∅, where 1 ≤ k ≤ n and 1 ≤ h ≤ m.

Partition trees T|Y and T_TBR|Y will have different topologies if a TBR on T corresponds to at least an NNI on T|Y. Hence, the partition tree T|Y must have a corresponding quartet structure and together with B_k and C_h at least other two subsets from the remaining B_i and C_j should have at least one representative in Y.

W.l.o.g. assume that together with B_k and C_h also C_p, C_q (1 ≤ p < q < h ≤ m) have at least one representative in Y (Fig. 6, right panel). Then it is easy to show that

Σ(T|Y) Δ Σ(T_TBR|Y) = {(B_k ∪ C_p) ∩ Y|(C_q ∪ C_h) ∩ Y, (C_p ∪ C_q) ∩ Y|(C_h ∪ B_k) ∩ Y}

and RF distance between T|Y and T_TBR|Y is 2.

Similarly, one can show that if together with B_k and C_h also B_p, B_q (1 ≤ p < q < k ≤ n) have at least one representative in Y, then RF distance between T|Y and T_TBR|Y is also 2.

In contrast, if Y has at least one representative in B_k, C_h and also in B_p, C_q (1 ≤ p < k ≤ n and 1 ≤ q < h ≤ m), then Σ(T|Y) Δ Σ(T_TBR|Y) = ∅ and the RF distance is 0 (Fig. 6, left panel).

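(ii) Let I = {i₁, …, i_k} be the set of all indices such that B_i ∩ Y ≠ ∅ for i ∈ I, and let 1 ≤ i₁ < … < i_k ≤ n. Similarly, let J = {j₁, …, j_h} be the set of all indices such that C_j ∩ Y ≠ ∅ for j ∈ J, and let 1 ≤ j₁ < … < j_h ≤ m. Then for edge

e = B₁ ∪ … ∪ B_n|C₁ ∪ … ∪ C_m

the corresponding split on T|Y is

(B_{i₁} ∪ … ∪ B_{i_k}) ∩ Y|(C_{j₁} ∪ … ∪ C_{j_h}) ∩ Y.

For the pendant edge leading to B_{i_k}, with split

B_{i_k}|⋃_{i ≠ i_k} B_i ∪ C₁ ∪ … ∪ C_m,

its corresponding split on tree T|Y is

B_{i_k} ∩ Y|(B_{i₁} ∪ … ∪ B_{i_{k−1}} ∪ C_{j₁} ∪ … ∪ C_{j_h}) ∩ Y.

Similarly, for the pendant edge leading to C_{j_h}, with split

C_{j_h}|⋃_{j ≠ j_h} C_j ∪ B₁ ∪ … ∪ B_n,

its corresponding split on T|Y is

C_{j_h} ∩ Y|(C_{j₁} ∪ … ∪ C_{j_{h−1}} ∪ B_{i₁} ∪ … ∪ B_{i_k}) ∩ Y.

On T_TBR|Y these two edges are divided by edges with the splits (whenever both sides are nonempty)

(B_{i₁} ∪ … ∪ B_{i_{k−1}}) ∩ Y|(B_{i_k} ∪ C_{j₁} ∪ … ∪ C_{j_h}) ∩ Y

and

(C_{j₁} ∪ … ∪ C_{j_{h−1}}) ∩ Y|(C_{j_h} ∪ B_{i₁} ∪ … ∪ B_{i_k}) ∩ Y,

respectively, while the split (B_{i₁} ∪ … ∪ B_{i_k}) ∩ Y|(C_{j₁} ∪ … ∪ C_{j_h}) ∩ Y is retained by the new connecting edge. The above means that the edges on T|Y with corresponding splits

B_{i_k} ∩ Y|(B_{i₁} ∪ … ∪ B_{i_{k−1}} ∪ C_{j₁} ∪ … ∪ C_{j_h}) ∩ Y

and

C_{j_h} ∩ Y|(C_{j₁} ∪ … ∪ C_{j_{h−1}} ∪ B_{i₁} ∪ … ∪ B_{i_k}) ∩ Y

were reconnected on T_TBR|Y. Since 1 ≤ i₁ < … < i_k ≤ n and 1 ≤ j₁ < … < j_h ≤ m, then i_k = max_{1≤i≤n} {i | B_i ∩ Y ≠ ∅} and j_h = max_{1≤j≤m} {j | C_j ∩ Y ≠ ∅}. ■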
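The two conditions of Proposition 3(i), together with the "two plus two" case from the proof in which nothing changes, can be summarized by counting how many of the subsets B_i and C_j have a representative in Y. The Python sketch below does this and also reports the indices k and h of part (ii); the function name, the return convention, and the example subsets are assumptions for illustration.

```python
def tbr_on_partition_tree(bs, cs, y):
    """Evaluate Proposition 3 for subsets B_1..B_n, C_1..C_m (in path order) and Y.

    Returns (changed, k, h): whether T|Y changes, and the largest indices k, h
    with B_k ∩ Y and C_h ∩ Y nonempty (None where Y meets no subset).
    """
    y = set(y)
    b_present = [i for i, b_i in enumerate(bs, start=1) if y & set(b_i)]
    c_present = [j for j, c_j in enumerate(cs, start=1) if y & set(c_j)]
    p, q = len(b_present), len(c_present)
    # One side contributes at least one subset, the other at least three.
    changed = (p >= 1 and q >= 3) or (q >= 1 and p >= 3)
    k = max(b_present) if b_present else None
    h = max(c_present) if c_present else None
    return changed, k, h


Bs = [{"b1"}, {"b2"}, {"b3"}]
Cs = [{"c1"}, {"c2"}, {"c3"}]

# Two subsets on each side: the induced quartet is the same before and after.
print(tbr_on_partition_tree(Bs, Cs, {"b1", "b2", "c1", "c2"}))  # (False, 2, 2)

# One B subset and three C subsets: the TBR acts as an NNI on T|Y.
print(tbr_on_partition_tree(Bs, Cs, {"b2", "c1", "c2", "c3"}))  # (True, 2, 3)
```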

4. Partial Terraces

4.1. Definition of partial terraces

In this section we discuss partial terraces that generalize the terrace concept (Sanderson et al., 2011), which we call full terrace for clarity. When comparing two trees in a partitioned framework, we compare the sets of their induced partition trees. If the sets are identical, then the two trees belong to one full terrace. Sanderson et al. (2011) showed that the number of trees on one full terrace can be quite large. Large full terraces pose a problem in phylogenetic inference: they may cause the tree search to terminate prematurely, and even if an optimal tree has been found, this tree is by no means unique. To mitigate this problem, one can reduce the terrace size by, for example, choosing a different partition scheme (Sanderson et al., 2015) or excluding some taxa from the analysis.

Now, if two species trees T₁ and T₂ share only a subset of identical induced partition trees, then we say that they belong to the same partial terrace. The log-likelihoods and parsimony scores of identical partition trees T₁|Y_i and T₂|Y_i are the same. Obviously, partial terraces occur more frequently than full terraces (see below). Large partial terraces can still be problematic for tree search algorithms. On the other hand, partial terraces provide the potential to reduce computation time.
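To make the distinction between full and partial terraces concrete, the following Python sketch counts the fraction of partitions whose induced partition trees are identical for two species trees. Two induced trees are compared through their split sets, which is justified by the splits-equivalence theorem. The nested-tuple tree encoding and all function names are assumptions for this example.

```python
def leaf_set(tree):
    """Taxa below a node; a leaf is a string, an inner node a tuple of children."""
    if isinstance(tree, str):
        return frozenset([tree])
    taxa = frozenset()
    for child in tree:
        taxa |= leaf_set(child)
    return taxa


def induced_splits(tree, y):
    """Splits of the induced partition tree T|Y, computed from a nested-tuple tree."""
    taxa, y = leaf_set(tree), frozenset(y)
    found = set()

    def visit(node):
        below = leaf_set(node)
        side_a, side_b = below & y, (taxa - below) & y
        if side_a and side_b:
            found.add(frozenset([side_a, side_b]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def shared_fraction(t1, t2, partitions):
    """Fraction of partitions whose induced partition trees coincide in t1 and t2."""
    shared = sum(induced_splits(t1, y) == induced_splits(t2, y) for y in partitions)
    return shared / len(partitions)


t1 = (("a", "b"), "c", ("d", "e"))
t2 = (("a", "c"), "b", ("d", "e"))          # an NNI neighbor of t1
partitions = [{"a", "b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "b", "d", "e"}]

fraction = shared_fraction(t1, t2, partitions)
print(fraction)  # 2 of 3 partition trees are shared, so this is a partial terrace
```

A fraction of 1 corresponds to a full terrace, and any value strictly between 0 and 1 to a partial terrace in the sense defined above.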

4.2. Occurrence of partial terraces in real data

In this section we evaluate how often partial terraces occur in real alignments. By no means do we intend to make a full exploration of potential computing time that may be saved since the performance of the particular software will depend on the data structures and particular implementation used for the tree space exploration.

To elucidate the occurrence of partial terraces and full terraces, we analyzed seven recently published alignments (Table 1). Alignments have different numbers of taxa ranging from 69 to 404 taxa. The number of partitions (here, genes) varies from 11 to 79.

Table 1.

Alignments Used to Study the Occurrence of Partial Terraces During the Tree Search

Type and ID	No. of species	No. of genes	Missing data (%)	Source
DNA1	128	32	30	Stamatakis and Alachiotis (2010)
DNA2	237	74	72	Nyakatura and Bininda-Emonds (2012)
DNA3	372	79	66	Springer et al. (2012)
DNA4	404	11	60	Stamatakis and Alachiotis (2010)
AA1	69	31	35	De Queiroz et al. (1995)
AA2	70	35	34
AA3	72	51	35

For each alignment we performed a maximum likelihood tree search using IQ-TREE (Nguyen et al., 2015) under the edge-unlinked (EUL) partition model assuming a GTR+Γ (Lanave et al., 1984; Yang, 1994) model for all partitions. We collected all the intermediate trees encountered during the search. For each intermediate tree T, we explored all trees T_NNI in its NNI neighborhood. We examined partial terraces of each T_NNI and T by computing how many induced partition trees are shared between them.

Apart from intermediate trees collected during the tree search, we also analyzed NNI neighborhoods for 1000 random Yule–Harding (YH) trees (Harding, 1971) for each tested alignment.

We defined 12 bins based on the percentage of shared induced partition trees between T and T_NNI (Table 2) and counted how many T_NNI trees fall into each bin. Table 3 shows the mean percentage of T_NNI trees that fall into the corresponding bin for the intermediate trees. Figure 7 displays the boxplots for the first three alignments from Table 1 either for the IQ-TREE search trees (left column) or the random YH trees (right column) (see Supplementary Figs. S1–S4 for the remaining alignments; Supplementary Material is available online at www.liebertonline.com/cmb).

Table 2.

Partial Terrace Bins Based on the Percentage of the Shared Partition Trees Between T and T_NNI

Name	Percentage of shared partition trees out of the total number of partition trees
No partial terrace (PT)	= 0%, the topologies of all partition trees are pairwise different between T and T_NNI
PT1	(0%,10%]
PT2	(10%, 20%]
PT3	(20%, 30%]
…	…
PT9	(80%, 90%]
PT10	(90%, 100%)
Full terrace	= 100%, T and T_NNI belong to one terrace
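The binning of Table 2 can be written as a small function. The sketch below assumes that the rows elided by "…" (PT4 to PT8) continue the same 10%-wide pattern, and it works with integer counts of partition trees to avoid floating-point edge effects at the bin boundaries. The function name and labels are hypothetical.

```python
def partial_terrace_bin(shared: int, total: int) -> str:
    """Map the number of shared partition trees to a bin name from Table 2."""
    if total <= 0 or not 0 <= shared <= total:
        raise ValueError("need total > 0 and 0 <= shared <= total")
    if shared == 0:
        return "No PT"
    if shared == total:
        return "Full terrace"
    # ceil(10 * shared / total) in integer arithmetic: (0%,10%] -> 1, ..., (90%,100%) -> 10
    index = -(-10 * shared // total)
    return f"PT{index}"


print(partial_terrace_bin(0, 32))    # No PT
print(partial_terrace_bin(3, 10))    # PT3, since 30% lies in (20%, 30%]
print(partial_terrace_bin(19, 20))   # PT10, since 95% lies in (90%, 100%)
print(partial_terrace_bin(32, 32))   # Full terrace
```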

Table 3.

Mean Percentage of Trees from NNI Neighborhood of Intermediate Trees Falling into Corresponding Partial Terrace Bin

	No PT (%)	PT1 (%)	PT2 (%)	PT3 (%)	PT4 (%)	PT5 (%)	PT6 (%)	PT7 (%)	PT8 (%)	PT9 (%)	PT10 (%)	Full terrace (%)
DNA1	7.14	12.97	4.85	1.80	3.55	32.59	37.11	0	0	0	0	0
DNA2	0	0	0.02	0.63	1.82	5.07	11.31	8.69	18.19	10.77	41.75	1.75
DNA3	0	2.75	5.38	9.22	10.06	6.32	4.34	1.33	0.23	6.38	50.36	3.63
DNA4	0.35	0.26	1.88	4.06	5.20	6.56	8.77	11.34	16.48	23.37	17.68	4.05
AA1	12.11	10.64	7.47	8.10	6.35	15.42	11.28	8.20	10.50	7.18	2.76	0
AA2	8.73	11.90	6.77	9.10	11.22	9.08	16.27	10.95	11.63	2.92	1.44	0
AA3	12.25	11.62	4.07	7.15	3.04	15.47	15.43	10.40	7.55	7.92	4.85	0.26

Intermediate and random trees have similar percentages of T_NNI trees across different bins (Fig. 7 and Supplementary Figs. S1–S4). This suggests that the general picture of partial terraces is mainly determined by the spread of missing data in the supermatrix and is less dependent on the actual tree topology. Moreover, increasing the number of taxa tends to decrease the variance of T_NNI percentage within each bin (for both intermediate and random trees).

Figure 8 integrates the information from Tables 2 and 3 and provides rough estimates of potential computational savings if accounting for partial and full terraces. The green bars reflect the average percentage of identical induced partition trees when T_NNI is compared to T. For example, for DNA1 there is no full terrace, but we observe partial terraces that may lead to a reduction of about 38% (the percentage of green bars) in computation time.

FIG. 8. Visualization of NNI neighborhoods and potential computational savings. Each horizontal line reflects the NNI neighborhood for each test alignment, that is, 100% of T_NNI trees. These neighborhoods are divided into partial terrace bins (Table 2) and depicted here by horizontal segments. The length of each segment corresponds to the mean percentage of T_NNI trees falling into the bin (Table 3). Each segment is composed of a green and a red bar, corresponding to the fractions of partition trees that are shared and not shared between T and T_NNI, respectively. Basically, green bars indicate potential computational savings when accounting for partial and full terraces during the tree search.

There is a full terrace for DNA2, but it accounts for only 1.75% of the NNI neighborhood, whereas partial terraces constitute the remaining 98.25% and lead to a potential reduction of computations of about 80% (the percentage of green bars). In fact, since no T_NNI tree falls into the “no PT” bin, we can save some computation time for all the trees encountered during the tree search. A similar trend is observed for DNA3 and DNA4, with a predicted time saving of 71% for each alignment.
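The order of magnitude of these savings can be approximated from Table 3 alone. The Python sketch below weights the percentage of T_NNI trees in each bin by the midpoint of that bin (0 for "no PT", 5% for PT1, …, 95% for PT10, 100% for a full terrace). Using bin midpoints is an assumption of this sketch: the exact average shared fraction within each bin is not given in the tables, so the result only approximates the figures quoted in the text.

```python
# Midpoint of the shared-partition-tree fraction for each of the 12 bins:
# No PT, PT1 ... PT10, Full terrace (midpoints are an assumption of this sketch).
BIN_MIDPOINTS = [0.0] + [0.05 + 0.10 * i for i in range(10)] + [1.0]

# Rows copied from Table 3 (mean percentage of T_NNI trees per bin).
TABLE_3 = {
    "DNA1": [7.14, 12.97, 4.85, 1.80, 3.55, 32.59, 37.11, 0, 0, 0, 0, 0],
    "DNA2": [0, 0, 0.02, 0.63, 1.82, 5.07, 11.31, 8.69, 18.19, 10.77, 41.75, 1.75],
}


def estimated_saving(row):
    """Approximate percentage of partition-tree evaluations that could be skipped."""
    return sum(percent * midpoint for percent, midpoint in zip(row, BIN_MIDPOINTS))


for name, row in TABLE_3.items():
    print(name, round(estimated_saving(row), 1))  # roughly 38 for DNA1 and 79 for DNA2
```

Like the estimate in the text, this treats all partitions as equally expensive to evaluate.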

5. Advantages of Using Induced Partition Trees in Maximum Likelihood Inference

In maximum likelihood inference, after applying a topological rearrangement on T, one needs to optimize the edge lengths of a new tree T_NEW. Therefore, together with the topological changes of partition trees, it is important to consider how topological rearrangement on T influences edge length optimization.

In the following we discuss two partition models commonly used in likelihood inferences, EUL and edge-linked (EL), and the advantages of using induced partition trees for either model (Yang, 1996).

We start by considering the most general partition model, EUL. Given a species tree T, we first obtain the corresponding induced partition trees. Under the EUL model, the edge lengths of the partition trees are optimized separately. The edge lengths of T are then computed from the corresponding edge lengths inferred on the partition trees, for example, as the mean edge length.

Therefore, if the topological rearrangement on T does not change the topology of a partition tree T|Y, no edge length optimization is necessary and, as a result, the optimal partition tree likelihood remains unchanged after such a topological rearrangement on T. Let T_NEW be a tree obtained from T by some topological rearrangement.

Under the EUL partition model there is no need to optimize the edge lengths of partition trees shared between T and T_NEW. As a result, the log-likelihood of the corresponding partition trees is the same.
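One way to exploit this observation is to cache the optimized score of each partition tree under a key formed by the partition and its induced topology. The sketch below is a generic memoization pattern, not the implementation used in IQ-TREE; the class name, the `score_fn` callback, and the placeholder scoring function are all hypothetical.

```python
class PartitionScoreCache:
    """Reuse partition-tree scores when a rearrangement leaves T|Y unchanged (EUL model)."""

    def __init__(self, score_fn):
        self._score_fn = score_fn   # expensive: optimizes edge lengths and scores T|Y
        self._cache = {}
        self.evaluations = 0        # number of times score_fn was actually called

    def score(self, partition_id, induced_topology):
        key = (partition_id, induced_topology)
        if key not in self._cache:
            self._cache[key] = self._score_fn(partition_id, induced_topology)
            self.evaluations += 1
        return self._cache[key]


def split(a, b):
    """Build the unordered split a|b from two iterables of taxa."""
    return frozenset([frozenset(a), frozenset(b)])


# Placeholder score; a real search would call a likelihood or parsimony routine here.
cache = PartitionScoreCache(lambda pid, topology: -10.0 * len(topology))

topology_before = frozenset({split("ac", "de")})   # interior split of T|Y
topology_after = frozenset({split("ac", "de")})    # same topology after the rearrangement

first = cache.score(0, topology_before)
second = cache.score(0, topology_after)
print(cache.evaluations)  # 1: the second lookup reuses the stored score
```

The topology is represented as a frozen set of splits so that it can serve as a dictionary key.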

In contrast to the EUL model, the edges of T and of the partition trees are linked in the EL model. That means that there is only one set of edge lengths for T and the partition trees, with the possibility of rescaling the edge lengths of each partition tree by a partition-specific evolutionary rate. Therefore, the optimization of edge lengths is done on the species tree. Even if a topological rearrangement on T does not change the topology of a partition tree, it still affects the optimal partition tree likelihood via the optimization of edge lengths. This is also the reason why full terraces cannot occur under the EL model (Sanderson et al., 2015). Theoretically, one would need to optimize each edge on the species tree, which would definitely influence the partition tree edge lengths and also the likelihood. But in practice, to save computations, one only optimizes those edges in the vicinity of topological changes (Stamatakis et al., 2005; Guindon et al., 2010; Nguyen et al., 2015). For example, for an NNI, one reoptimizes only five edge lengths (e, e₁, e₂, e₃, e₄) around the swap. Under the EL model, this feature of practical optimization can be exploited when considering the induced partition trees.

Given a partition tree with taxon set Y and an edge e on T with the corresponding split A|B, if A ∩ Y = ∅ or B ∩ Y = ∅ then the optimization of e does not affect the likelihood of T|Y.

In this case, a split A|B does not have a corresponding split in Σ(T|Y), and therefore edge e is not linked to any edge on T|Y. This observation can be exploited to save computing time.
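The observation for the EL model translates into a simple filter: when edge e = A|B is reoptimized, only partitions with taxa on both sides of the split need their likelihood recomputed. A minimal Python sketch, with a hypothetical function name and example data:

```python
def partitions_affected_by_edge(a, b, partitions):
    """Indices of partitions Y with A ∩ Y and B ∩ Y both nonempty for edge e = A|B."""
    a, b = set(a), set(b)
    return [i for i, y in enumerate(partitions) if a & set(y) and b & set(y)]


# Edge e with split ab|cde on the species tree and three partitions.
partitions = [{"a", "c", "d"}, {"c", "d", "e"}, {"a", "b"}]
affected = partitions_affected_by_edge({"a", "b"}, {"c", "d", "e"}, partitions)
print(affected)  # [0]: only the first partition has taxa on both sides of e
```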

6. Discussion

We have shown that it is advantageous to identify and account for full and partial terraces during the tree search in phylogenomics. One main advantage is the saving of computation time. If two trees belong to the same full or partial terrace, then one needs to compute the objective function for the identical partition trees only once. The values of objective function will be the same for these partition trees. The larger the number of identical partition trees between species trees, the more computation time can be saved.

From the conditions discussed in the previous sections, the topological rearrangement that benefits the most from partial terraces is NNI. Intuitively, an NNI applied to the species tree leaves the topology of partition trees unchanged more often than an SPR or a TBR does. However, in tree searches one typically applies short SPRs (e.g., RAxML); that is, the number of edges between the pruning and the regrafting edges is much smaller than the number of taxa. The same is true for TBR. And since one also expects short SPRs and short TBRs to leave partition trees unchanged quite often for sparse supermatrices, partial terraces are also beneficial for these rearrangements.

Moreover, the use of induced partition trees has another advantage: as a result of missing data, a long SPR or TBR on a species tree T might correspond to a much shorter SPR or TBR on T|Y. This leads to computational savings even if the SPR or TBR changes the topology of the induced partition trees.

Here, we elucidated the frequent existence of partial terraces in practice via NNI neighborhoods, showing that partial terraces are not only a theoretical concept, but also have practical implications in phylogenomics. The predicted time saving for the examined real alignments is only a rough estimate, since we treated the alignment lengths per partition as equal. If the length of the alignment corresponding to the shared partition trees is relatively large compared to the whole supermatrix, then one expects an even greater speedup.

Another important factor for timesaving is the actual implementation of search strategies in the particular software. We plan to implement efficient techniques to take full advantage of partial and full terraces in IQ-TREE. A more thorough analysis of such techniques will be presented elsewhere.

Supplementary Material

Supplemental data

Supp_Figs1-4.pdf (194.3 KB, PDF)

Acknowledgments

This work was supported by the Austrian Science Fund (FWF, grant number I-1824-B22) to O.C.

Author Disclosure Statement

No competing financial interests exist.

References

De Queiroz A., Donoghue M.J., and Kim J. 1995. Separate versus combined analysis of phylogenetic evidence. Annu. Rev. Ecol. Syst. 26, 657–681.
De Queiroz A., and Gatesy J. 2007. The supermatrix approach to systematics. Trends Ecol. Evol. 22, 34–41.
Felsenstein J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
Guindon S., Dufayard J.F., Lefort V., et al. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321.
Harding E.F. 1971. The probabilities of rooted tree shapes generated by random bifurcation. Adv. Appl. Probability 3, 44–77.
Hedtke S.M., Patiny S., and Danforth B.N. 2013. The bee tree of life: A supermatrix approach to apoid phylogeny and biogeography. BMC Evol. Biol. 13, 138.
Hordijk W., and Gascuel O. 2005. Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood. Bioinformatics 21, 4338–4347.
Lanave C., Preparata G., Saccone C., et al. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20, 86–93.
Nguyen L.T., Schmidt H.A., von Haeseler A., et al. 2015. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274.
Nyakatura K., and Bininda-Emonds O.R.P. 2012. Updating the evolutionary history of Carnivora (Mammalia): A new species-level supertree complete with divergence time estimates. BMC Biol. 10, 12.
Peters R.S., Meyer B., Krogmann L., et al. 2011. The taming of an impossible child: A standardized all-in approach to the phylogeny of Hymenoptera using public database sequences. BMC Biol. 9, 55.
Pyron R.A., Burbrink F.T., Colli G.R., et al. 2011. The phylogeny of advanced snakes (Colubroidea), with discovery of a new subfamily and comparison of support methods for likelihood trees. Mol. Phylogenet. Evol. 58, 329–342.
Pyron R.A., and Wiens J.J. 2011. A large-scale phylogeny of Amphibia including over 2800 species, and a revised classification of extant frogs, salamanders, and caecilians. Mol. Phylogenet. Evol. 61, 543–583.
Robinson D.F., and Foulds L.R. 1981. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147.
Sanderson M.J., McMahon M.M., Stamatakis A., et al. 2015. Impacts of terraces on phylogenetic inference. Syst. Biol. 64, 709–726.
Sanderson M.J., McMahon M.M., and Steel M. 2011. Terraces in phylogenetic tree space. Science 333, 448–450.
Sanderson M.J., Purvis A., and Henze C. 1998. Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. 13, 105–109.
Semple C., and Steel M.A. 2003. Phylogenetics. Oxford University Press, New York, NY.
Springer M.S., Meredith R.W., Gatesy J., et al. 2012. Macroevolutionary dynamics and historical biogeography of primate diversification inferred from a species supermatrix. PLoS ONE 7, e49521.
Stamatakis A., and Alachiotis N. 2010. Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data. Bioinformatics 26, i132–i139.
Stamatakis A., Ludwig T., and Meier H. 2005. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456–463.
van der Linde K., Houle D., Spicer G.S., et al. 2010. A supermatrix-based molecular phylogeny of the family Drosophilidae. Genet. Res. 92, 25–38.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39, 306–314.
Yang Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42, 587–596.
