2021 Jan 25;82(1):8. doi: 10.1007/s00285-021-01567-5

Computing nearest neighbour interchange distances between ranked phylogenetic trees

Lena Collienne, Alex Gavryushkin
PMCID: PMC7835203  PMID: 33492606

Abstract

Many popular algorithms for searching the space of leaf-labelled (phylogenetic) trees are based on tree rearrangement operations. Under any such operation, the problem is reduced to searching a graph where vertices are trees and (undirected) edges are given by pairs of trees connected by one rearrangement operation (sometimes called a move). Most popular are the classical nearest neighbour interchange, subtree prune and regraft, and tree bisection and reconnection moves. The problem of computing distances, however, is NP-hard in each of these graphs, making tree inference and comparison algorithms challenging to design in practice. Although ranked phylogenetic trees are one of the central objects of interest in applications such as cancer research, immunology, and epidemiology, the computational complexity of the shortest path problem for these trees remained unsolved for decades. In this paper, we settle this problem for the ranked nearest neighbour interchange operation by establishing that the complexity depends on the weight difference between the two types of tree rearrangements (rank moves and edge moves), and varies from quadratic, which is the lowest possible complexity for this problem, to NP-hard, which is the highest. In particular, our result provides the first example of a phylogenetic tree rearrangement operation for which shortest paths, and hence the distance, can be computed efficiently. Specifically, our algorithm scales to trees with tens of thousands of leaves (and likely hundreds of thousands if implemented efficiently).

Mathematics Subject Classification: 68Q25, 92B05


One of the major problems in computational biology is the reconstruction of evolutionary histories, also known as phylogenetic trees, from sequence data such as RNA, DNA, or protein sequences. Of particular interest in various applications is the order of internal nodes in these trees, as these nodes represent evolutionary events and their ranking models the order in which these events happened in time. For example in species evolution, where internal nodes of trees correspond to speciation events, the ranking of these nodes represents the order of divergence events in time. Fossils can be used to rank and time divergence events in phylogenetic trees (Gavryushkina et al. 2014). Other research fields where ranked trees play an important role are viral epidemiology, where ranking gives the order of transmission events (Ypma et al. 2013), and language evolution (Bouckaert et al. 2018; Gray et al. 2009), where phylogenetic trees reveal how and when human populations expanded across different continents. Recently, phylogenetic trees have become a popular tool to study cancer evolution (Singer et al. 2018; Alves et al. 2019). In cancer phylogenies internal nodes can refer to emergence of metastatic clones and their ranking shows in which order metastases had been seeded in time (Lote et al. 2017).

Most commonly trees are inferred from sequences via maximum likelihood (Stamatakis 2006; Guindon et al. 2010), MCMC (Ronquist and Huelsenbeck 2003; Suchard et al. 2018; Bouckaert et al. 2019), distance-, or parsimony-based approaches (Tamura et al. 2011). A similarity measure between trees is required for the development of algorithms implementing these methods and evaluating the accuracy of reconstructed trees. Furthermore, summary or consensus tree methods (McMorris and Steel 1994; Bansal et al. 2010; Whidden et al. 2014) often rely on a tree metric. Most of the currently used distance measures for trees, however, do not take the order of divergence events into account—only the tree topology. Moreover, popular tree distances are either hard to compute or lack biological interpretability (Whidden and Matsen 2018).

Most tree inference methods rely on various tree rearrangement operations (Semple and Steel 2003), the most popular of which are nearest neighbour interchange (NNI), subtree prune and regraft (SPR), and tree bisection and reconnection (TBR). Under any such operation, the tree inference problem can be formulated as a graph search, where vertices are trees and edges are given by tree rearrangement operations. For search algorithms to be efficient, it is important to understand the geometry of these graphs. For example, basic geometric properties of the NNI graph have been successfully leveraged to speed up the maximum likelihood method (Nguyen et al. 2015). The most basic geometric characteristic that frequently arises in applications is the minimum number of rearrangements necessary to transform one tree into another (Semple and Steel 2003). The problem then amounts to computing the length of a shortest path between trees in the NNI, SPR, or TBR graph. This can also be seen as computing the distance between trees in the corresponding metric space.

Classical results in mathematical phylogenetics imply that these distances are NP-hard to compute for all three rearrangement operations NNI, SPR, and TBR (DasGupta et al. 2000; Bordewich and Semple 2005; Hickey et al. 2008; Allen and Steel 2001). Intuitively, the difference between them is how much change can be done to a tree by a single operation, with NNI being the most local type of rearrangement and TBR the most global one. Remarkably, it took over 25 years and a number of published erroneous attempts, as discussed in detail by DasGupta et al. (2000), to prove that computing distances is NP-hard in NNI (DasGupta et al. 2000). Similarly, incorrect proofs for SPR have been discussed in the literature (Hein et al. 1996; Allen and Steel 2001), before Bordewich and Semple (2005) proved the NP-hardness result for rooted trees and Hickey et al. (2008) utilised this proof to establish the result for unrooted trees. To facilitate practical applications, fixed parameter tractable algorithms (Downey and Fellows 2013) for computing the SPR distance have been developed over the years (Whidden et al. 2010; Bordewich and Semple 2005; Whidden and Matsen 2018). Computing the NNI distance is also known to be fixed parameter tractable (DasGupta et al. 1999). Although important, these algorithms remain impractical for large distances and are only applied to trees with a moderate number of leaves or those with small distances (Whidden and Matsen 2018).

Another popular tree distance measure that does not rely on a tree rearrangement method is the Robinson–Foulds distance (Robinson and Foulds 1981). In contrast to the tree rearrangement-based distances mentioned above, this distance can be computed efficiently. A downside of this approach however is a lack of biological interpretability. The Robinson–Foulds distance is not motivated by a biological process, unlike for example SPR, where the tree rearrangement operation can be used to model hybridisation and other horizontal events. This pattern is quite common—tree distance measures that are easy to compute lack biological interpretability, while those that are biologically meaningful are often hard to compute (Whidden and Matsen 2018).

In this paper, we consider a generalisation of the NNI operation to ranked trees introduced by Gavryushkin et al. (2018), which is called RNNI (for Ranked Nearest Neighbour Interchange). We show that the shortest path problem in RNNI is computable in $O(n^2)$ time, where $n$ is the number of tree leaves. This makes RNNI the first tree rearrangement operation under which shortest paths and distances between trees are polynomial-time computable. Our proof of this result (Theorem 1) is constructive: we provide an algorithm called FINDPATH that computes shortest paths in the RNNI graph in $O(n^2)$ time. Our algorithm is optimal as shortest paths often have length quadratic in the number of leaves $n$. The algorithm is practical as it takes seconds on a laptop to compute the distance between trees with thousands of leaves, while in the closely related NNI graph the tractable number of leaves is well below twenty (Li et al. 1996; Whidden and Matsen 2017). Furthermore, FINDPATH reveals the following property of the RNNI graph, which is desirable for tree distances from a biological point of view. If two trees share some information, more specifically a cluster, there is a shortest path in RNNI that preserves this information (the cluster). In other words, shortest paths in RNNI maintain clusters, an important property that is not true in NNI (Li et al. 1996). This implies in particular that trees that share an evolutionary hypothesis in the form of a common subtree are closer to each other than they are to a tree not sharing a subtree with them. For a cancer phylogeny this can be interpreted as follows: two trees supporting the emergence of one particular metastatic clone are closer to each other than to a phylogeny that does not support this hypothesis.

Because NNI can be seen as a special case of RNNI, we investigate whether there exists a threshold at which the complexity of the shortest path problem shifts from NP-hard to polynomial. Specifically, we introduce an edge weight parameter ρ in the RNNI graph and consider a parametrised graph RNNI(ρ). More precisely, the RNNI operations that change the ranking, but not the tree topology, weigh ρ, while moves that change the topology weigh one. We show that the shortest path problem is NP-hard in RNNI(0) and quadratic in RNNI(1), so the complexity changes with ρ. We hence propose to characterise the complexity classes of the problem RNNI(ρ) for values of $\rho \geq 0$.

The biological interpretation of this characterisation problem is as follows. In many large-scale applications two or more different methods are used to reconstruct an evolutionary process—one to model and reconstruct the branching process and another one to time or rank the evolutionary events (Lote et al. 2017). Often this results in different support probabilities for the inferred tree topologies and for the ranking of events. A comparison method for trees inferred this way has to have different penalties for conflicts in the tree structure and the ranking. This difference can be quantitatively modelled using our ρ parameter. For example, if the tree topology estimate is more certain than the ranking, ρ should be chosen to be less than one. An efficient algorithm to compare trees for such values of ρ is hence desirable.

Definitions and background results

Unless stated otherwise, by a tree in this paper we mean a ranked phylogenetic tree, which is a binary tree where leaves are uniquely labelled by elements of the set $\{a_1, \ldots, a_n\}$ for a fixed integer $n$, and all internal (non-leaf) nodes are uniquely ranked by elements of the set $\{1, \ldots, n-1\}$ so that each child has a strictly smaller rank than its parent. All leaves are assumed to have rank 0 but we only refer to the ranks of internal nodes throughout. In total there are $\frac{(n-1)!\,n!}{2^{n-1}}$ such trees on $n$ leaves (Gavryushkin et al. 2018). Two trees are considered to be identical if there exists an isomorphism between them which preserves edges, leaf labels, and node rankings. For example, trees in Fig. 1 are all different.
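To give a feel for how quickly this search space grows, the count of ranked trees can be evaluated directly. The following is a minimal sketch; the function name `num_ranked_trees` is our own and not part of any published implementation.

```python
from math import factorial


def num_ranked_trees(n: int) -> int:
    """Number of ranked trees on n leaves: (n-1)! * n! / 2^(n-1)."""
    if n < 2:
        raise ValueError("a ranked tree needs at least two leaves")
    # The count equals the product of binomial(k, 2) for k = 2..n,
    # so the integer division below is exact.
    return factorial(n - 1) * factorial(n) // 2 ** (n - 1)


for n in range(2, 8):
    print(n, num_ranked_trees(n))
```

Already for a handful of leaves the number of trees is far too large for exhaustive search, which is why the polynomial bound on path lengths discussed later matters.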

Fig. 1.

Trees in the RNNI graph with three NNI moves on the left and a rank move on the right

Because internal nodes of a tree T are ranked uniquely, we can address the node of rank $t \in \{1, \ldots, n-1\}$, and we write $(T)_t$ to denote this node. An interval $[(T)_t, (T)_{t+1}]$ is defined by two nodes of consecutive ranks. A cluster $C \subseteq \{a_1, \ldots, a_n\}$ in a tree T is a subset of leaves that contains all leaves descending from one internal node of T. We then say that this internal node induces the cluster C, and that the subtree rooted at this node is induced by C. Trees can uniquely be specified using the cluster representation, that is a list of all clusters induced by internal nodes of that tree ordered according to the ranks of internal nodes. For example, the cluster representation of tree T in Fig. 1 is $[\{a_1,a_2\}, \{a_1,a_2,a_3\}, \{a_4,a_5\}, \{a_1,a_2,a_3,a_4,a_5\}]$. For a set $S \subseteq \{a_1, \ldots, a_n\}$ and tree T we denote the most recent common ancestor of S in T, that is the node of the lowest rank in T that induces a cluster containing all elements of S, by $(S)_T$. Note that $(C)_T = (T)_t$ if the cluster C is induced by the node of rank t in T.
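The cluster representation maps naturally onto a list of sets. The sketch below encodes the example tree T from Fig. 1 and looks up the rank of a most recent common ancestor by scanning the clusters in rank order. The names `mrca_rank` and the string labels are assumptions made for illustration.

```python
def mrca_rank(tree, leaves):
    """Rank of the most recent common ancestor of `leaves` in `tree`.

    `tree` is a cluster representation: tree[t - 1] is the cluster induced
    by the internal node of rank t.
    """
    target = set(leaves)
    # The first cluster (in rank order) containing all leaves belongs to
    # the lowest-ranked node above them, i.e. their MRCA.
    for rank, cluster in enumerate(tree, start=1):
        if target <= set(cluster):
            return rank
    raise ValueError("some leaves do not occur in the tree")


# Cluster representation of the example tree T given in the text.
T = [
    {"a1", "a2"},
    {"a1", "a2", "a3"},
    {"a4", "a5"},
    {"a1", "a2", "a3", "a4", "a5"},
]

print(mrca_rank(T, {"a1", "a3"}))  # node inducing {a1, a2, a3}
print(mrca_rank(T, {"a3", "a4"}))  # the root
```

A linear scan is enough for illustration; an implementation aiming for the running times reported in the paper would need a more careful data structure.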

Our main object of study is the following class of graphs RNNI(ρ) indexed by a real-valued parameter $\rho \geq 0$. Vertices of the RNNI(ρ) graph are trees as defined above. Two trees are connected by an edge (also called an RNNI move) if one results from the other by performing one of the following two types of tree rearrangement operation (see Fig. 1):

  • (i)

    A rank move on a tree T exchanges the ranks of two internal nodes $(T)_t$ and $(T)_{t+1}$ with consecutive ranks, provided the two nodes are not connected by an edge in T.

  • (ii)

    Trees T and R are connected by an NNI move if there are edges e in T and f in R both connecting nodes of consecutive ranks in the corresponding trees, such that the (non-binary) trees obtained by shrinking e and f into internal nodes are identical.

The parameter $\rho \geq 0$ is the weight of a rank move; an NNI move weighs 1.

The weight of a path in RNNI(ρ) is the sum of the weights of all moves along the path. The distance between two trees in RNNI(ρ) is the weight of a path with the minimal weight, which we will call a shortest path. When ρ=1 we assume that the graph is unweighted.
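As an illustration of these definitions, the sketch below lists which moves are available on each interval of a tree in cluster representation and computes the weight of a path from its move types. It relies on the observation that two nodes of consecutive ranks are connected by an edge exactly when the lower cluster is a proper subset of the upper one. All function names are hypothetical.

```python
def interval_is_edge(tree, t):
    """True if the nodes of rank t and t+1 are connected by an edge."""
    return set(tree[t - 1]) < set(tree[t])


def available_moves(tree):
    """List (move_type, t) for every RNNI move on the tree."""
    moves = []
    for t in range(1, len(tree)):  # intervals [t, t+1] for t = 1..n-2
        if interval_is_edge(tree, t):
            # Two distinct NNI moves are possible on every edge interval.
            moves += [("nni", t), ("nni", t)]
        else:
            # Exactly one rank move is possible on a non-edge interval.
            moves.append(("rank", t))
    return moves


def path_weight(move_types, rho):
    """Weight of a path in RNNI(rho): rank moves weigh rho, NNI moves 1."""
    return sum(rho if m == "rank" else 1 for m in move_types)


T = [
    {"a1", "a2"},
    {"a1", "a2", "a3"},
    {"a4", "a5"},
    {"a1", "a2", "a3", "a4", "a5"},
]
print(available_moves(T))
print(path_weight(["nni", "rank", "nni"], rho=0.5))
```

With ρ = 1 the weight is simply the number of moves, and with ρ = 0 only NNI moves are counted.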

We consider the following class of problems parametrised by a real number $\rho \geq 0$.

[Problem box RNNI(ρ)-SP, an image in the original: the shortest path problem between two trees in RNNI(ρ).]

Since RNNI(ρ) is a connected graph, there always exists a solution to RNNI(ρ)-SP. Furthermore, the size of every solution to an instance of RNNI(ρ)-SP is bounded by a polynomial in $n$, despite the search space being super-exponential. This is because the diameter of the RNNI(1) graph is bounded from above (Gavryushkin et al. 2018) by $n^2 - 3n - 5/8$.

Our main goal is to prove that RNNI(1)-SP can be solved in polynomial time. We will see later in the paper that it follows from a classical result (DasGupta et al. 2000) that RNNI(0)-SP is NP-hard. To be consistent with notations used in the literature (Gavryushkin et al. 2018), we will denote the graph RNNI(1) by RNNI.

FINDPATH algorithm

In this section we introduce an algorithm called FINDPATH that computes paths between trees and is quadratic in the number of leaves.

An input of the FINDPATH algorithm is two trees T and R in their cluster representation. We denote the representation of R by $[C_1, \ldots, C_{n-1}]$. The algorithm considers the clusters $C_1, \ldots, C_{n-2}$ iteratively in their order and produces a sequence p of trees which becomes a shortest path from T to R after the algorithm terminates. During each iteration $k = 1, \ldots, n-2$ new trees are added to p if necessary, and we will refer to the last added tree as $T_1$. In iteration k, the rank of $(C_k)_{T_1}$ is decreased by RNNI moves until $C_k$ is induced by the node of rank k in $T_1$. In Proposition 1 we show that FINDPATH is a deterministic algorithm with running time quadratic in the number of leaves $n$. In particular, there always exists a unique move that decreases the rank of $(C_k)_{T_1}$ as described above.

Note that if two trees share a cluster, every tree on the path computed by FINDPATH contains this cluster as well. An implementation of this algorithm is available on GitHub (Collienne et al. 2019). Note that the version of FINDPATH implemented in (Collienne et al. 2019) outputs a shortest path as a list of trees. The algorithm that outputs the length of a shortest path can be implemented so that the wall clock running time on a generic laptop is under 30 s for trees with tens of thousands of leaves.

[Algorithm: FINDPATH pseudocode, an image in the original. The proofs below refer to its line numbers: the for loop over k (line 2), the while loop (line 3), the NNI move (line 5), and the rank move (line 7).]
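The description above can be turned into a short program. The following is a minimal sketch written from the prose description only; it is not the authors' GitHub implementation, and the names `children`, `mrca_rank` and `findpath` are our own. It recomputes the most recent common ancestor by a linear scan, so it does not attain the quadratic running time.

```python
def children(tree, t):
    """Clusters induced by the two children of the node of rank t (1-based)."""
    cluster = tree[t - 1]
    # The highest-ranked proper sub-cluster is one child; the remaining
    # leaves form the cluster of the other child.
    for lower in reversed(tree[: t - 1]):
        if lower < cluster:
            return lower, cluster - lower
    a, b = cluster  # no internal child: both children are leaves
    return frozenset([a]), frozenset([b])


def mrca_rank(tree, cluster):
    """Rank of the lowest-ranked node whose cluster contains `cluster`."""
    return next(r for r, c in enumerate(tree, start=1) if cluster <= c)


def findpath(T, R):
    """Return (moves, path) from T to R, both in cluster representation."""
    cur = [frozenset(c) for c in T]
    target = [frozenset(c) for c in R]
    moves, path = [], [list(cur)]
    for k in range(1, len(target)):  # k = 1..n-2
        ck = target[k - 1]
        while (r := mrca_rank(cur, ck)) > k:
            lower, upper = cur[r - 2], cur[r - 1]
            if lower < upper:
                # Edge interval: the NNI move that keeps the child of the
                # lower node meeting C_k and joins it with the sibling subtree.
                x, y = children(cur, r - 1)
                cur[r - 2] = (x if x & ck else y) | (upper - lower)
                moves.append(("nni", r - 1))
            else:
                # Non-edge interval: swap the ranks of the two nodes.
                cur[r - 2], cur[r - 1] = upper, lower
                moves.append(("rank", r - 1))
            path.append(list(cur))
    return moves, path


# Two caterpillar trees on n = 5 leaves (leaves are labelled 1..5 here).
T = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}]
R = [{1, 5}, {1, 5, 4}, {1, 5, 4, 3}, {1, 5, 4, 3, 2}]
moves, path = findpath(T, R)
print(len(moves))  # 6 for these trees, i.e. (n-1)(n-2)/2 with n = 5
```

Returning the moves alongside the trees mirrors the distinction the paper draws between the two output encodings: the list of moves is the compact one.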

Proposition 1

FINDPATH is a correct deterministic algorithm that runs in $O(n^2)$ time.

Proof

To show that FINDPATH is a deterministic algorithm (see the pseudocode above), we have to prove that the tree $T_2$ constructed in the while loop (line 3) of the algorithm always exists and is uniquely defined. If $T_2$ is obtained in line 7 from $T_1$ by a rank move, the tree exists and is unique because there always exists exactly one rank move on any particular interval that is not an edge. It remains to show that an NNI move that decreases the rank of $(C_k)_{T_1}$ always exists and is unique. To prove this we consider cases $k=1$ and $k>1$ separately.

Case k=1.

In this case $C_k$ consists of two leaves $\{x,y\}$. Since we assumed that the while condition is satisfied, the node $v = (\{x,y\})_{T_1}$ has rank $r > 1$. Consider the node u with rank $r-1$ in $T_1$. Assume without loss of generality that x is in the cluster induced by u, so y has to be outside this cluster. Consider the following three disjoint subtrees of $T_1$: the subtree $T_{11}$ induced by a child of u and containing x, the subtree $T_{12}$ induced by the other child of u, the subtree $T_{13}$ induced by a child of v and containing y. Now observe that out of the two NNI moves possible on the edge $[u,v]$ in $T_1$, only the one that swaps $T_{12}$ and $T_{13}$ decreases the rank of the most recent common ancestor of $\{x,y\}$. Hence $T_2$ exists and is unique in this case.

Case k>1.

In this case $C_k = C_i \cup C_j$ for $i,j<k$. The subtree of $T_1$ induced by $(C_i)_{T_1}$ is identical to the subtree of R induced by $(C_i)_R$, and the same is true for $(C_j)_{T_1}$ and $(C_j)_R$. Hence, we can reduce this case to $k=1$ by suppressing $C_i$ and $C_j$ in both $T_1$ and R to new leaves $c_i$ and $c_j$ (of rank zero) respectively. As in Case $k=1$, exactly one of two possible NNI moves decreases the rank of the most recent common ancestor of $c_i$, $c_j$ in $T_1$, so the same is true for the most recent common ancestor $(C_k)_{T_1}$, and $T_2$ is unambiguously defined.

Thus, FINDPATH is a deterministic algorithm.

To prove correctness, note that the algorithm starts by adding T to the output path, and every new tree added to the output path is an RNNI neighbour of the previously added one (see lines 5 and 7). To see that the output path terminates in R, observe that after k iterations of the for loop (line 2) of the algorithm, the first k clusters of $T_1$ and R must coincide, and so after $n-2$ iterations a path between T and R is constructed.

The worst-case time complexity of FINDPATH is quadratic in the number of leaves, as there can be at most $n-2$ executions of the for loop (line 2) and in every iteration of the for loop at most $n-2$ while loops (line 3) are executed. Here and throughout the paper we assume that the output of FINDPATH is encoded by a list of RNNI moves rather than an actual list of trees. This is because writing out a tree on $n$ leaves takes time linear in $n$, so with a list of trees as output the complexity of FINDPATH would become cubic.

FINDPATH computes shortest paths in optimal time

In this section we prove the main result of this paper, that RNNI(1)-SP is polynomial. Specifically we prove that paths returned by FINDPATH are always shortest. We also show that FINDPATH is an optimal algorithm, that is, no sub-quadratic algorithm can solve RNNI(1)-SP.

The main ingredient of our proof is to show that a local property (see (1) in the proof) of the FINDPATH algorithm is enough to establish that the output paths are shortest. The property can intuitively be understood as FINDPATH always choosing the best tree possible to go to. Importantly, this result can be used for an arbitrary vertex proposal algorithm in an arbitrary graph to establish that the algorithm always follows a shortest path between vertices in the graph, hence our proof technique is of general interest.

Theorem 1

The worst-case time complexity of the shortest path problem in the RNNI graph on trees with $n$ leaves is $O(n^2)$. Hence RNNI(1)-SP is polynomial time solvable.

Proof

We prove this theorem by showing that for every pair of trees T and R, the path computed by the FINDPATH algorithm is a shortest RNNI path. We denote this path by $\mathrm{FP}(T,R)$ and its length by $|\mathrm{FP}(T,R)|$. By $d(T,R)$ we denote the length of a shortest path between T and R, that is, the RNNI distance between trees. We hence want to show that $|\mathrm{FP}(T,R)| = d(T,R)$ for all trees.

Assume to the contrary that T and R are two trees with a minimum distance $d(T,R)$ such that $d(T,R) \neq |\mathrm{FP}(T,R)|$, that is, $d(T,R) < |\mathrm{FP}(T,R)|$. Let $T'$ be the first tree on a shortest RNNI path from T to R. Then $d(T',R) = d(T,R) - 1$, implying that the distance between $T'$ and R is strictly smaller than that between T and R. This implies that $|\mathrm{FP}(T',R)| = d(T',R) = d(T,R) - 1 < |\mathrm{FP}(T,R)| - 1$ and hence $|\mathrm{FP}(T',R)| < |\mathrm{FP}(T,R)| - 1$. We finish the proof by showing that no trees satisfy this inequality.

Specifically, we will show that

for all trees $T$, $R$, and $T'$ such that $T'$ is one RNNI move away from $T$: $|\mathrm{FP}(T',R)| \geq |\mathrm{FP}(T,R)| - 1$.    (1)

We will use Fig. 2 to demonstrate our argument.

Fig. 2.

Trees $T$, $T'$, and $R$ as in inequality (1). Paths $\mathrm{FP}(T,R) = [T, T_1, T_2, \ldots, R]$ and $\mathrm{FP}(T',R) = [T', T'_1, T'_2, \ldots, R]$ are indicated by arrows

Assume to the contrary that T and R are trees for which there exists $T'$ violating inequality (1). Out of all such pairs T, R choose one with the minimal $|\mathrm{FP}(T,R)|$. Denote $\mathrm{FP}(T,R) = [T, T_1, T_2, \ldots, R]$ and $\mathrm{FP}(T',R) = [T', T'_1, T'_2, \ldots, R]$, and let $[(T)_t, (T)_{t+1}]$ be the interval in T on which the RNNI move connecting T and $T'$ is performed. Let $C_k$ be the cluster of R such that the node $(C_k)_T$ is moved down by the first move on $\mathrm{FP}(T,R)$. If the rank of $(C_k)_T$ is not in $\{t, t+1\}$ then $(C_k)_T$ and $(C_k)_{T'}$ induce the same cluster, so FINDPATH would make the same rearrangement in both trees T and $T'$ in the first move along $\mathrm{FP}(T,R)$ and $\mathrm{FP}(T',R)$ resulting in trees $T_1$ and $T'_1$ which are RNNI neighbours, as in Fig. 2. In this case, paths $\mathrm{FP}(T_1,R)$ and $\mathrm{FP}(T'_1,R)$ violate inequality (1) but $\mathrm{FP}(T_1,R)$ is strictly shorter than $\mathrm{FP}(T,R)$, contradicting our minimality assumption. Hence, the first move on $\mathrm{FP}(T,R)$ has to involve an interval incident to at least one of the nodes $(T)_t$, $(T)_{t+1}$.

Moreover, because $C_k$ is the first cluster satisfying the while condition of FINDPATH applied to T and R, all clusters $C_j$ with $j<k$ have to be present in T. And since the first move on $\mathrm{FP}(T,R)$, which decreases the rank of $(C_k)_T$, involves nodes with ranks not higher than $t+2$, the most recent common ancestor of $C_k$ has rank not higher than $t+1$ after this move. Hence $k \leq t+1$. Furthermore, clusters $C_j$ for all $j \leq k-2$ have to be present in $T'$ as well as T, because all clusters induced by nodes of rank $t-1$ or lower coincide in these two trees. Cluster $C_{k-1}$, however, might not be induced by a node in $T'$ if $k-1=t$. Therefore, the first move on $\mathrm{FP}(T',R)$ can decrease the rank of the most recent common ancestor of either $C_{k-1}$ or $C_k$.

We will distinguish two cases depending on whether T and $T'$ are connected by an NNI or a rank move. For each of these we will further distinguish all possible moves between T and $T_1$. Note that in all figures illustrating possible moves on $\mathrm{FP}(T,R)$ and $\mathrm{FP}(T',R)$ below, the position of the tree root is irrelevant, so we have positioned roots to simplify our figures.

Case 1.

T and $T'$ are connected by an NNI move. So $[(T)_t, (T)_{t+1}]$ is an edge in T (see Fig. 3). Denote the clusters induced by the children of $(T)_t$ by A and B and the cluster induced by the child of $(T)_{t+1}$ that is not $(T)_t$ by C, and assume that the NNI move between T and $T'$ exchanges the subtrees induced by clusters B and C. Additionally, if $(T)_{t+2}$ is the parent of $(T)_{t+1}$ (Cases 1.2 and 1.3), we denote the cluster induced by the child of $(T)_{t+2}$ that is not $(T)_{t+1}$ by D (see Fig. 3).

We now consider all possible moves FINDPATH can perform to go from T to $T_1$ that involve a node of rank t or $t+1$, that is, we will consider three intervals in total.

  • 1.1

    RNNI move (either type) on interval $[(T)_t, (T)_{t+1}]$. This move has to be the NNI move that is different from the NNI move connecting T and $T'$. In this case, the cluster $B \cup C$ is built in $T_1$, as depicted in the bottom of Fig. 3. Hence the first cluster $C_k$ that satisfies the while condition of FINDPATH must contain elements from both B and C but not from A, and the rank of $(C_k)_R$ has to be at most t. But then FINDPATH applied to $T'$ and R has to decrease the rank of $(C_k)_{T'}$ in its first step implying that $T'_1 = T_1$, so $|\mathrm{FP}(T',R)| = |\mathrm{FP}(T,R)|$. This contradicts our assumption that $|\mathrm{FP}(T',R)| < |\mathrm{FP}(T,R)| - 1$.

  • 1.2
    NNI move on (edge) interval $[(T)_{t+1}, (T)_{t+2}]$ that swaps the subtrees induced by clusters C and D. This move is shown in Fig. 4a by an arrow from T to the leftmost tree in the middle row. In this case, the first cluster $C_k$ that satisfies the while condition of FINDPATH computing $\mathrm{FP}(T,R)$ must intersect D but not C. Additionally, $C_k$ must intersect A, or B, or both of them. Hence, we will consider each of these three cases individually, and demonstrate them in Fig. 4.
    • 1.2.1
      $C_k$ intersects A, B, and D but not C. In this case, since we assumed $[(T_1)_t, (T_1)_{t+1}]$ to be an edge in the tree, no move on $T_1$ can decrease the rank of $(C_k)_{T_1}$. It follows from the proof of Proposition 1 that this can happen only when the subtrees induced by $(C_k)_{T_1}$ and $(C_k)_R$ in the corresponding trees coincide. That is, the while condition of FINDPATH must be false after this first move for all $j \leq k$. This implies that $t = k-1$ and $C_{k-1} = A \cup B$. But since the rank of $(C_{k-1})_{T'}$ is $t+1 > k-1$, $C_{k-1}$ has to be the first cluster for which the while condition of FINDPATH applied to $T'$ and R is met. Hence the first move on $\mathrm{FP}(T',R)$ must decrease the rank of $(C_{k-1})_{T'}$ by building the cluster $A \cup B$, in which case $T'_1 = T$. This however contradicts $|\mathrm{FP}(T',R)| < |\mathrm{FP}(T,R)| - 1$.
    • 1.2.2
      $C_k$ intersects A and D but not B or C. Starting from T, FINDPATH exchanges first subtrees induced by clusters C and D and then by B and D. This results in trees $T_1$ and $T_2$ (see the path leading to the tree in the middle of the bottom row in Fig. 4a). This implies that the rank of $(C_{k-1})_R$ is lower than t, so the first cluster that satisfies the while condition of FINDPATH applied to $T'$ and R is $C_k$. Hence, starting from $T'$, FINDPATH exchanges first subtrees induced by B and D and then by C and D. This results in trees $T'_1$ and $T'_2$ (see the path leading to the tree in the middle of the bottom row in Fig. 4b). It follows that $T_2$ and $T'_2$ are connected by an RNNI move on the interval $[(T_2)_{t+1}, (T_2)_{t+2}]$ (indicated by dotted edges in the corresponding trees in Fig. 4). This together with the facts that $|\mathrm{FP}(T_2,R)| = |\mathrm{FP}(T,R)| - 2$ and $|\mathrm{FP}(T'_2,R)| = |\mathrm{FP}(T',R)| - 2$ contradicts the assumption that $\mathrm{FP}(T,R)$ is of minimal length violating inequality (1).
    • 1.2.3
      $C_k$ intersects B and D but not A or C. This case is analogous to the previous one. The two initial segments of $\mathrm{FP}(T,R)$ and $\mathrm{FP}(T',R)$ are the paths leading to the leftmost trees in the bottom row of Fig. 4a and b, respectively. Note that the rank swap leading from $T'_1$ to $T'_2$ is required because the rank of $(C_k)_R$ is at most t as implied by the move leading from $T_1$ to $T_2$. The corresponding trees $T_2$ and $T'_2$ are again RNNI neighbours.
  • 1.3
    NNI move on (edge) interval $[(T)_{t+1}, (T)_{t+2}]$ that builds a cluster $C \cup D$ in $T_1$. This move is shown in Fig. 4a by an arrow from T to the second leftmost tree in the middle row. In this case, $C_k$ intersects C and D but not A or B, and we have the following two possibilities to consider.
    • 1.3.1
      The ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide. In this case, the previous cluster $C_{k-1}$ of R has to be $A \cup B$. Since $A \cup B$ is not a cluster in $T'$, the first RNNI move on $\mathrm{FP}(T',R)$ builds the cluster $A \cup B$ by swapping the subtrees induced by clusters B and C. This move results in $T'_1 = T$, contradicting $|\mathrm{FP}(T',R)| < |\mathrm{FP}(T,R)| - 1$.
    • 1.3.2
      The rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$. In this case, FINDPATH decreases the rank of $(C_k)_{T_1}$ in the second step. This results in the path from T to the rightmost tree in Fig. 4a. Hence, $\mathrm{FP}(T',R)$ also has to begin with two moves that decrease the rank of $(C_k)_{T'}$ twice, resulting in the rightmost path in Fig. 4b. Similarly to case 1.2.2, we arrive at a contradiction that trees $T_2$, $T'_2$, and R violate inequality (1) and $|\mathrm{FP}(T_2,R)| < |\mathrm{FP}(T,R)|$.
  • 1.4

    Rank move on interval $[(T)_{t+1}, (T)_{t+2}]$. This case is analogous to case 1.3 (see Fig. 5). If the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide then $C_{k-1} = A \cup B$, and applying FINDPATH to $T'$, R we get $T'_1 = T$. If the rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$ then FINDPATH decreases the rank of $(C_k)_{T_1}$ in the second step. Recall that the interval between nodes of rank t and $t+1$ is an edge in both T and $T'$. Hence, the first two moves on $\mathrm{FP}(T',R)$ decrease the rank of $(C_k)_{T'}$ twice resulting in $T'_2$ which is an RNNI neighbour of $T_2$ as depicted in Fig. 5. As before, this contradicts our minimality assumption.

  • 1.5

    RNNI move (either type) on interval $[(T)_{t-1}, (T)_t]$. In this case $C_k \subseteq A \cup B$ and the rank of $(C_k)_R$ is at most $t-1$. This implies that $C_k$ is the first cluster to satisfy the while condition for $T'$ and the first move on $\mathrm{FP}(T',R)$ decreases the rank of $(C_k)_{T'}$ by exchanging the subtrees induced by B and C. This results in $T'_1 = T$.

Case 2.

T and $T'$ are connected by a rank move. We assume that the rank move is performed on the interval $[(T)_t, (T)_{t+1}]$. Denote the cluster induced by $(T)_t$ by A, the clusters induced by the children of $(T)_t$ by $A_1$ and $A_2$, the cluster induced by $(T)_{t+1}$ by B, and the clusters induced by the children of $(T)_{t+1}$ by $B_1$ and $B_2$ (see Fig. 6).

We again consider all possible moves FINDPATH can perform to go from T to $T_1$ that involve a node of rank t or $t+1$.

  • 2.1

    Rank move on $[(T)_t, (T)_{t+1}]$. This move results in $T_1 = T'$.

  • 2.2
    NNI move on (edge) interval $[(T)_{t+1}, (T)_{t+2}]$. The following two sub-cases are analogous to case 1.3.
    • 2.2.1
      $(T)_{t+2}$ is a parent of $(T)_t$. The first move on $\mathrm{FP}(T,R)$ builds a cluster $A \cup B_1$ or $A \cup B_2$, and we assume without loss of generality that it is the former, as in Fig. 6. This implies that $C_k$ intersects A and $B_1$ but not $B_2$. If the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide then the previous cluster $C_{k-1}$ of R has to be A. Therefore, the first move on $\mathrm{FP}(T',R)$ decreases the rank of $(A)_{T'}$, which results in $T'_1 = T$. If the rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$ then FINDPATH decreases the rank of $(C_k)_{T_1}$ in the second step. Due to the symmetry we can assume that $C_k \subseteq A_1 \cup B_1$, which implies that the move between $T_1$ and $T_2$ exchanges the subtrees induced by $A_2$ and $B_1$, as depicted on the left of Fig. 6. $C_k \subseteq A_1 \cup B_1$ implies that the first two moves on $\mathrm{FP}(T',R)$ result in a tree $T'_2$ that is an RNNI neighbour of $T_2$ (see Fig. 6). This is a contradiction to the minimality assumption on $|\mathrm{FP}(T,R)|$.
    • 2.2.2
      $(T)_{t+2}$ is not a parent of $(T)_t$. In this case, there exists a cluster C induced by the child of $(T)_{t+2}$ which is different from the one that induces B (see Fig. 7). We can assume without loss of generality that $C_k \subseteq C \cup B_1$ and the first move on $\mathrm{FP}(T,R)$ builds a new cluster $C \cup B_1$. If the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide then $C_{k-1} = A$, which implies that A is induced by the node of rank t in both T and R. So $T'_1 = T$. If the rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$ then FINDPATH decreases the rank of $(C_k)_{T_1}$ in the second step (see Fig. 7). The corresponding first moves on $\mathrm{FP}(T',R)$ are shown on the right in Fig. 7, and we again get that $T_2$ and $T'_2$ are RNNI neighbours.
  • 2.3

    Rank move on interval $[(T)_{t+1}, (T)_{t+2}]$. Again, depending on whether or not the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide, we arrive at the conclusion that either $T'_1 = T$ or $T_2$ and $T'_2$ are RNNI neighbours, similarly to case 1.4.

  • 2.4

    RNNI move (either type) on interval $[(T)_{t-1}, (T)_t]$. In this case $C_k \subseteq A$ and the first move on $\mathrm{FP}(T',R)$ must be a rank swap resulting in $T'_1 = T$.

Since all possible cases result in a contradiction, we conclude that inequality (1) is true for all trees, which completes the proof of the theorem.
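For very small trees the statement of the theorem can also be checked by exhaustive search. The sketch below is our own and not part of the paper: it enumerates all ranked trees on four leaves by breadth-first search in the RNNI graph and compares the breadth-first distance with the number of moves made by a compact re-implementation of FINDPATH. All names are hypothetical, and the check only illustrates the theorem for n = 4; it proves nothing.

```python
from collections import deque


def children(tree, t):
    """Clusters induced by the two children of the node of rank t."""
    cluster = tree[t - 1]
    for lower in reversed(tree[: t - 1]):
        if lower < cluster:  # highest-ranked proper sub-cluster is a child
            return lower, cluster - lower
    a, b = cluster  # both children are leaves
    return frozenset([a]), frozenset([b])


def neighbours(tree):
    """All RNNI neighbours of a tree given as a tuple of frozensets."""
    result = []
    for t in range(1, len(tree)):
        lower, upper = tree[t - 1], tree[t]
        if lower < upper:  # edge interval: two NNI moves
            x, y = children(tree, t)
            for new in (x | (upper - lower), y | (upper - lower)):
                result.append(tree[: t - 1] + (new,) + tree[t:])
        else:  # non-edge interval: one rank move
            result.append(tree[: t - 1] + (upper, lower) + tree[t + 1 :])
    return result


def findpath_length(T, R):
    """Number of moves FINDPATH makes from T to R."""
    cur, moves = list(T), 0
    for k in range(1, len(R)):
        ck = R[k - 1]
        r = next(i for i, c in enumerate(cur, start=1) if ck <= c)
        while r > k:  # each move lowers the MRCA rank by exactly one
            lower, upper = cur[r - 2], cur[r - 1]
            if lower < upper:
                x, y = children(cur, r - 1)
                cur[r - 2] = (x if x & ck else y) | (upper - lower)
            else:
                cur[r - 2], cur[r - 1] = upper, lower
            moves, r = moves + 1, r - 1
    return moves


def bfs_distances(source):
    """Breadth-first distances from `source` to every reachable tree."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        tree = queue.popleft()
        for nb in neighbours(tree):
            if nb not in dist:
                dist[nb] = dist[tree] + 1
                queue.append(nb)
    return dist


start = tuple(frozenset(s) for s in ("ab", "abc", "abcd"))
trees = list(bfs_distances(start))
mismatches = 0
for T in trees:
    dist = bfs_distances(T)
    mismatches += sum(findpath_length(T, R) != dist[R] for R in trees)
print(len(trees), "trees,", mismatches, "mismatches")
```

The same loop works for five or six leaves, but the number of trees grows so fast that brute force stops being useful almost immediately.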

Fig. 3.

NNI move between T and $T'$ on the edge $[(T)_t, (T)_{t+1}]$ indicated in bold, and the third RNNI neighbour resulting from a move on this edge

Fig. 4.

Comparison of paths $\mathrm{FP}(T,R)$ and $\mathrm{FP}(T',R)$ if T and $T'$ are connected by an NNI move on edge $[(T)_t, (T)_{t+1}]$ in T. The bottom row displays all possibilities for $T_2$ and $T'_2$, depending on the position of cluster $C_k$ that satisfies the while condition of FINDPATH: case $C_k$ intersects B and D is on the left, $C_k$ intersects A and D is in the middle, and $C_k$ intersects C and D is on the right

Fig. 5

Comparison of paths $FP(T,R)$ and $FP(T',R)$ if there is an NNI move between $T$ and $T'$ and a rank move on the interval above this edge follows on $FP(T,R)$

Fig. 6

Rank move between $T$ and $T'$ and possible initial segments of $FP(T,R)$ and $FP(T',R)$ when $[(T)_{t+1},(T)_{t+2}]$ is an edge. We use the notation $A = A_1 \cup A_2$ and $B = B_1 \cup B_2$

Fig. 7

Comparison of paths $FP(T,R)$ and $FP(T',R)$ if there is a rank move between $T$ and $T'$ and an NNI move on the edge below the corresponding (rank) interval follows on $FP(T,R)$

We finish this section by showing that no algorithm has strictly lower worst-case time complexity than FINDPATH. We again assume here that the output of an algorithm for solving RNNI(1)-SP is a list of RNNI moves. Requiring the output to be a list of trees would result in cubic complexity while maintaining the optimality of FINDPATH.

Corollary 1

The time complexity of the shortest path problem RNNI(1)-SP is $\Omega(n^2)$.

Proof

We prove this by establishing a lower bound on the output size of the problem, that is, on the length of a shortest path.

Consider two “caterpillar” trees $T=[\{a_1,a_2\},\{a_1,a_2,a_3\},\ldots,\{a_1,a_2,\ldots,a_n\}]$ and $R=[\{a_1,a_n\},\{a_1,a_n,a_{n-1}\},\ldots,\{a_1,a_n,\ldots,a_2\}]$. Applied to these trees, FINDPATH executes an NNI move in each of the $n-k-1$ while loops (line 3) in every iteration $k$ of the for loop (line 2). Hence the length of the output path of FINDPATH is $\sum_{k=1}^{n-2} k = \frac{(n-1)(n-2)}{2}$ and therefore quadratic in $n$. Theorem 1 then implies that this path is a shortest path. It follows that the worst-case size of the output to RNNI(1)-SP is quadratic.
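The quadratic count for the caterpillar trees can be checked with a small Python sketch. It is not the authors' implementation; it assumes a ranked tree is stored as a list of `frozenset` clusters ordered by rank (as in the bracket notation above), and the names `mrca_index`, `child_clusters`, `findpath_moves` and `caterpillars` are hypothetical helpers introduced only for this illustration. The loop follows the behaviour of FINDPATH as it is described in the proofs: for each cluster of R in rank order, lower its most recent common ancestor by an NNI move if the interval below it is an edge, and by a rank move otherwise.

```python
def mrca_index(tree, cluster):
    """Index (rank - 1) of the lowest-ranked node whose cluster contains `cluster`."""
    return next(i for i, c in enumerate(tree) if cluster <= c)


def child_clusters(tree, j):
    """Clusters induced by the two children of the node at index j."""
    node = tree[j]
    for i in range(j - 1, -1, -1):
        if tree[i] < node:  # highest-ranked internal node below j: a child
            return tree[i], node - tree[i]
    leaf = frozenset([next(iter(node))])  # no internal child: two leaves
    return leaf, node - leaf


def findpath_moves(T, R):
    """Return the list of move types and the final tree (assumes valid ranked trees)."""
    tree, moves = list(T), []
    for k in range(len(R) - 1):  # the root cluster never needs work
        target = R[k]
        while (i := mrca_index(tree, target)) > k:
            lower, upper = tree[i - 1], tree[i]
            if lower < upper:  # interval is an edge: NNI move
                sibling = upper - lower
                x, y = child_clusters(tree, i - 1)
                new = x | sibling if target <= x | sibling else y | sibling
                assert target <= new, "inputs are assumed to be valid ranked trees"
                tree[i - 1] = new
                moves.append("NNI")
            else:  # interval is not an edge: rank move
                tree[i - 1], tree[i] = upper, lower
                moves.append("rank")
    return moves, tree


def caterpillars(n):
    """The two caterpillar trees from the proof, with leaves 1..n standing for a_1..a_n."""
    T = [frozenset(range(1, j + 1)) for j in range(2, n + 1)]
    R = [frozenset({1, *range(n, n - j, -1)}) for j in range(1, n)]
    return T, R


T, R = caterpillars(6)
moves, end = findpath_moves(T, R)
print(len(moves), (6 - 1) * (6 - 2) // 2, end == R)
```

Under these assumptions every move on this instance is an NNI move, and the number of moves should agree with $(n-1)(n-2)/2$ for any $n \ge 3$.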

For what ρ is RNNI(ρ)-SP polynomial?

As we have seen in Sect. 2, the shortest path problem RNNI(1)-SP is solvable in polynomial time. In this section, we will show that a classical result in mathematical phylogenetics implies that RNNI(0)-SP is NP-hard. We will also discuss RNNI(ρ)-SP for other values of ρ.

Theorem 2

(DasGupta et al. 2000) RNNI(0)-SP is NP-hard.

Proof

Because two trees with the same tree topology but different rankings have distance 0 in RNNI(0), this graph corresponds to a pseudo-metric space. The length of the path required in an instance of RNNI(0)-SP is equal to the minimum number of NNI moves necessary to convert one tree into another tree, as rank moves weigh 0. Therefore, the distance in RNNI(0) equals the NNI distance between trees where the rankings of internal nodes are ignored and NNI moves are allowed on every edge. The corresponding shortest path problem is known to be NP-hard (DasGupta et al. 2000).
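To illustrate why RNNI(0) is only a pseudo-metric, the following minimal Python sketch compares two ranked trees while ignoring their rankings. It assumes trees are stored as lists of `frozenset` clusters in rank order; `same_topology` is a hypothetical helper, not part of any published code.

```python
def same_topology(T, R):
    """True if two ranked trees induce the same set of clusters, ignoring ranks."""
    return set(T) == set(R)


# Two different ranked trees that differ by a single rank move.
T1 = [frozenset("ab"), frozenset("cd"), frozenset("abcd")]
T2 = [frozenset("cd"), frozenset("ab"), frozenset("abcd")]

print(same_topology(T1, T2), T1 == T2)  # prints: True False
```

The two trees are distinct vertices of the graph, yet the only move separating them is a rank move of weight 0, so their RNNI(0) distance is 0.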

In the light of Theorems 1 and 2 the following problem is natural.

Problem 1

Characterise the complexity of RNNI(ρ)-SP in terms of ρ.

This problem is also of applied value. For example, trees might come from an inference method with higher certainty of their branching structure and lower certainty of their node order. A comparison method for such trees should penalise NNI changes more heavily than rank changes, which in our notation requires ρ<1.

In the rest of this section, we show that the FINDPATH algorithm substantially relies on the fact that the rank move and the NNI move have the same weight in the RNNI graph. This suggests that a non-trivial algorithmic insight is necessary to extend our polynomial complexity result to other values of ρ.

Proposition 2

FINDPATH does not compute shortest paths in RNNI(ρ) for ρ ≠ 1.

Proof

For ρ>1 a counterexample is given by the following trees (see Fig. 8)

$$T=[\{a_1,a_2\},\{a_1,a_2,a_3\},\{a_1,a_2,a_3,a_4\}] \quad\text{and}\quad R=[\{a_3,a_4\},\{a_2,a_3,a_4\},\{a_1,a_2,a_3,a_4\}].$$

Applied to these trees, FINDPATH proceeds from $T$ to $[\{a_1,a_2\},\{a_3,a_4\},\{a_1,a_2,a_3,a_4\}]$, then to $[\{a_3,a_4\},\{a_1,a_2\},\{a_1,a_2,a_3,a_4\}]$, and then to $R$. This path consists of two NNI moves with one rank move in between them and therefore has weight $2+\rho$. However, the path from $T$ to $[\{a_2,a_3\},\{a_1,a_2,a_3\},\{a_1,a_2,a_3,a_4\}]$ to $[\{a_2,a_3\},\{a_2,a_3,a_4\},\{a_1,a_2,a_3,a_4\}]$ to $R$ consists of three NNI moves, so it has weight $3 < 2+\rho$ and is hence shorter.

Fig. 8

Path computed by FINDPATH (top) and a shorter path (bottom) for ρ>1

For ρ<1 a counterexample is given by the following trees (see Fig. 9)

$$T=[\{a_1,a_2\},\{a_3,a_4\},\{a_1,a_2,a_3,a_4\}] \quad\text{and}\quad R=[\{a_1,a_3\},\{a_1,a_3,a_4\},\{a_1,a_2,a_3,a_4\}].$$

Applied to these trees, FINDPATH proceeds from $T$ to $[\{a_1,a_2\},\{a_1,a_2,a_3\},\{a_1,a_2,a_3,a_4\}]$, then to $[\{a_1,a_3\},\{a_1,a_2,a_3\},\{a_1,a_2,a_3,a_4\}]$, and then to $R$. This path consists of three NNI moves and therefore has weight 3. However, the path from $T$ to $[\{a_3,a_4\},\{a_1,a_2\},\{a_1,a_2,a_3,a_4\}]$ to $[\{a_3,a_4\},\{a_1,a_3,a_4\},\{a_1,a_2,a_3,a_4\}]$ to $R$ consists of one rank move followed by two NNI moves, so it has weight $\rho+2 < 3$ and is hence shorter.

Fig. 9

Path computed by FINDPATH (top) and a shorter path (bottom) for ρ<1
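The weight comparison in the two counterexamples can be replayed numerically. The Python sketch below assumes a path is given simply as a list of move types, with NNI moves weighing 1 and rank moves weighing ρ; `path_weight` and the variable names are hypothetical and serve only this illustration.

```python
def path_weight(moves, rho):
    """Weight of a path in RNNI(rho): NNI moves cost 1, rank moves cost rho."""
    return sum(rho if move == "rank" else 1.0 for move in moves)


# Move sequences taken from the proof of Proposition 2.
findpath_rho_gt_1 = ["NNI", "rank", "NNI"]   # FINDPATH path in the first example
shorter_rho_gt_1 = ["NNI", "NNI", "NNI"]     # alternative path in the first example
findpath_rho_lt_1 = ["NNI", "NNI", "NNI"]    # FINDPATH path in the second example
shorter_rho_lt_1 = ["rank", "NNI", "NNI"]    # alternative path in the second example

for rho in (0.5, 1.0, 2.0):
    print(rho,
          path_weight(findpath_rho_gt_1, rho), path_weight(shorter_rho_gt_1, rho),
          path_weight(findpath_rho_lt_1, rho), path_weight(shorter_rho_lt_1, rho))
```

For ρ = 1 all four paths have weight 3, which is consistent with FINDPATH being optimal exactly when the two move types weigh the same.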

Additional open problems

The idea utilised by DasGupta et al. (2000) to prove that computing distances in NNI is NP-hard stems from a result that shortest paths in NNI do not preserve clusters (Li et al. 1996), that is, sometimes a cluster shared by two trees T and R is shared by no other tree on any shortest path between T and R. This counter-intuitive property eventually led to the computational hardness result in NNI. Moreover, this property makes little sense biologically as trees clustering the same set of sequences into a subtree should be closer to each other than to a tree that does not have that subtree. Indeed, a shared cluster means that both trees support the hypothesis that this cluster has evolved along a subtree. In light of this biological argument, the NP-hardness result can be interpreted as RNNI(ρ)-SP being hard only when the graph RNNI(ρ) is biologically irrelevant. From this paper we know that RNNI(1)-SP can be solved in polynomial time by an algorithm that preserves clusters. This however does not mean that every shortest path in RNNI preserves clusters. The following question is hence natural.

  • (1)

    For which values of ρ does RNNI(ρ) have the cluster property? How do those compare to the values of ρ for which RNNI(ρ)-SP is efficient?
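The cluster property discussed above can be stated as a simple check on a path. The Python sketch below assumes a path is a list of ranked trees, each a list of `frozenset` clusters in rank order; `preserves_shared_clusters` is a hypothetical helper, and the detour is a deliberately non-shortest walk chosen only to show a failing case.

```python
def preserves_shared_clusters(path):
    """True if every cluster shared by the first and last tree occurs in every tree on the path."""
    shared = set(path[0]) & set(path[-1])
    return all(shared <= set(tree) for tree in path)


root = frozenset({1, 2, 3, 4})
T = [frozenset({1, 2}), frozenset({1, 2, 3}), root]
R = [frozenset({1, 2}), frozenset({3, 4}), root]
S = [frozenset({1, 3}), frozenset({1, 2, 3}), root]  # NNI neighbour of T without {1, 2}

direct = [T, R]        # a single NNI move that keeps the cluster {1, 2}
detour = [T, S, T, R]  # a longer walk that temporarily destroys {1, 2}

print(preserves_shared_clusters(direct), preserves_shared_clusters(detour))  # prints: True False
```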

Other natural questions that arise in the context of our results are the following.

  • (2)

    The questions we have considered for ranked NNI can be studied in other rearrangement-based graphs on leaf-labelled trees, such as the ranked SPR graph and the ranked TBR graph (Semple and Steel 2003). What is the complexity of the shortest path problem there?

  • (3)

    Can our results be used to establish whether the problem of computing geodesics between trees with real-valued node heights is polynomial-time solvable? This geodesic metric space is called t-space and an efficient algorithm for computing geodesics in t-space would be of importance for applications (Gavryushkin and Drummond 2016).

Footnotes

We thank Alexei Drummond, David Bryant, and Kieran Elmes for useful discussions about the weight difference between RNNI moves, complexity, and applied aspects of our results. Their comments improved our paper. We acknowledge support from the Royal Society Te Apārangi through a Rutherford Discovery Fellowship (RDF-UOO1702). This work was partially supported by Ministry of Business, Innovation, and Employment of New Zealand through an Endeavour Smart Ideas Grant (UOOX1912) and a Data Science Programmes Grant (UOAX1932).

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Lena Collienne, Email: lena.collienne@postgrad.otago.ac.nz.

Alex Gavryushkin, Email: alex@biods.org.

References

  1. Allen BL, Steel M. Subtree transfer operations and their induced metrics on evolutionary trees. Ann Comb. 2001;5(1):1–15. doi: 10.1007/s00026-001-8006-8.
  2. Alves JM, Prado-López S, Cameselle-Teijeiro JM, Posada D. Rapid evolution and biogeographic spread in a colorectal cancer. Nat Commun. 2019;10(1):1–7. doi: 10.1038/s41467-019-12926-8.
  3. Bansal MS, Burleigh JG, Eulenstein O, Fernández-Baca D. Robinson–Foulds supertrees. Algorithms Mol Biol. 2010;5(February):18. doi: 10.1186/1748-7188-5-18.
  4. Bordewich M, Semple C. On the computational complexity of the rooted subtree prune and regraft distance. Ann Comb. 2005;8(4):409–423. doi: 10.1007/s00026-004-0229-z.
  5. Bouckaert RR, Bowern C, Atkinson QD. The origin and expansion of Pama–Nyungan languages across Australia. Nat Ecol Evol. 2018;2(4):741–749. doi: 10.1038/s41559-018-0489-3.
  6. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchene S, Fourment M, Gavryushkina A, Heled J, Jones G, Kuhnert D, et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 2019;15(4):e1006650. doi: 10.1371/journal.pcbi.1006650.
  7. Collienne L, Elmes K, Berling L, Gavryushkin A (2019) RNNI code. https://github.com/bioDS/treeOclock
  8. DasGupta B, He X, Jiang T, Li M, Tromp J. On the linear-cost subtree-transfer distance between phylogenetic trees. Algorithmica. 1999;25(2):176–195. doi: 10.1007/PL00008273.
  9. DasGupta B, He X, Jiang T, Li M, Tromp J, Zhang L (2000) On computing the nearest neighbor interchange distance. In: Discrete mathematical problems with medical applications: DIMACS workshop discrete mathematical problems with medical applications, December 8–10, 1999, vol 55. DIMACS Center, p 19
  10. Downey RG, Fellows MR. Fundamentals of parameterized complexity. London: Springer; 2013.
  11. Gavryushkin A, Drummond AJ. The space of ultrametric phylogenetic trees. J Theor Biol. 2016;403(August):197–208. doi: 10.1016/j.jtbi.2016.05.001.
  12. Gavryushkina A, Welch D, Stadler T, Drummond AJ. Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. PLoS Comput Biol. 2014;10(12):e1003919. doi: 10.1371/journal.pcbi.1003919.
  13. Gavryushkin A, Whidden C, Matsen FA. The combinatorics of discrete time-trees: theory and open problems. J Math Biol. 2018;76(5):1101–1121. doi: 10.1007/s00285-017-1167-9.
  14. Gray RD, Drummond AJ, Greenhill SJ. Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science. 2009;323(5913):479–483. doi: 10.1126/science.1166858.
  15. Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59(3):307–321. doi: 10.1093/sysbio/syq010.
  16. Hein J, Jiang T, Wang L, Zhang K. On the complexity of comparing evolutionary trees. Discrete Appl Math. 1996;71(1):153–169. doi: 10.1016/S0166-218X(96)00062-5.
  17. Hickey G, Dehne F, Rau-Chaplin A, Blouin C. SPR distance computation for unrooted trees. Evol Bioinform Online. 2008;4:17–27. doi: 10.4137/EBO.S419.
  18. Li M, Tromp J, Zhang L (1996) Some notes on the nearest neighbour interchange distance. In: Cai JY, Wong CK (eds) Computing and combinatorics. COCOON 1996. Lecture Notes in Computer Science, vol 1090. Springer, Berlin, Heidelberg. doi: 10.1007/3-540-61332-3_168
  19. Lote H, Spiteri I, Ermini L, Vatsiou A, Roy A, McDonald A, Maka N, Balsitis M, Bose N, et al. Carbon dating cancer: defining the chronology of metastatic progression in colorectal cancer. Ann Oncol. 2017;28(6):1243–1249. doi: 10.1093/annonc/mdx074.
  20. McMorris FR, Steel MA (1994) The complexity of the median procedure for binary trees. In: Diday E, Lechevallier Y, Schader M, Bertrand P, Burtschy B (eds) New approaches in classification and data analysis. Studies in classification, data analysis, and knowledge organization. Springer, Berlin, Heidelberg. doi: 10.1007/978-3-642-51175-2_14
  21. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015;32(1):268–274. doi: 10.1093/molbev/msu300.
  22. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53(1):131–147. doi: 10.1016/0025-5564(81)90043-2.
  23. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19(12):1572–1574. doi: 10.1093/bioinformatics/btg180.
  24. Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press; 2003.
  25. Singer J, Kuipers J, Jahn K, Beerenwinkel N. Single-cell mutation identification via phylogenetic inference. Nat Commun. 2018;9(1):5144. doi: 10.1038/s41467-018-07627-7.
  26. Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22(21):2688–2690. doi: 10.1093/bioinformatics/btl446.
  27. Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 2018;4(1):vey016. doi: 10.1093/ve/vey016.
  28. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011;28(10):2731–2739. doi: 10.1093/molbev/msr121.
  29. Whidden C, Matsen FA. Ricci–Ollivier curvature of the rooted phylogenetic subtree–prune–regraft graph. Theoret Comput Sci. 2017;699:1–20. doi: 10.1016/j.tcs.2017.02.006.
  30. Whidden C, Matsen FA. Calculating the unrooted subtree prune-and-regraft distance. IEEE ACM Trans Comput Biol Bioinform. 2018;16:898–911. doi: 10.1109/TCBB.2018.2802911.
  31. Whidden C, Beiko RG, Zeh N (2010) Fast FPT algorithms for computing rooted agreement forests: theory and experiments. In: Experimental algorithms. Lecture notes in computer science, pp 141–153
  32. Whidden C, Zeh N, Beiko RG. Supertrees based on the subtree prune-and-regraft distance. Syst Biol. 2014;63(4):566–581. doi: 10.1093/sysbio/syu023.
  33. Ypma RJF, van Ballegooijen WM, Wallinga J. Relating phylogenetic trees to transmission trees of infectious disease outbreaks. Genetics. 2013;195(3):1055–1062. doi: 10.1534/genetics.113.154856.

Articles from Journal of Mathematical Biology are provided here courtesy of Springer
