Algorithms for Molecular Biology : AMB. 2021 Dec 6;16:23. doi: 10.1186/s13015-021-00202-8

A simpler linear-time algorithm for the common refinement of rooted phylogenetic trees on a common leaf set

David Schaller, Marc Hellmuth, Peter F Stadler
PMCID: PMC8647445  PMID: 34872590

Abstract

Background

The supertree problem, i.e., the task of finding a common refinement of a set of rooted trees, is an important topic in mathematical phylogenetics. The special case of a common leaf set L is known to be solvable in linear time. Existing approaches refine one input tree using information from the others and then test whether the results are isomorphic.

Results

An O(k|L|) algorithm, LinCR, for constructing the common refinement T of k input trees with a common leaf set L is proposed that explicitly computes the parent function of T in a bottom-up approach.

Conclusion

LinCR is simpler to implement than other asymptotically optimal algorithms for the problem and outperforms the alternatives in empirical comparisons.

Availability

An implementation of LinCR in Python is freely available at https://github.com/david-schaller/tralda.

Keywords: Mathematical phylogenetics, Rooted trees, Compatibility of rooted trees

Introduction

Given a collection of rooted phylogenetic trees $T_1, T_2, \dots, T_k$, the supertree problem in phylogenetics consists in determining whether there is a common tree $T$ that “displays” all input trees $T_i$, $1\le i\le k$, and if so, a supertree $T$ is to be constructed [1, 2]. In its most general form, the leaf sets $L(T_i)$, representing the taxonomic units (taxa), may differ, and the supertree $T$ has the leaf set $L(T)=\bigcup_{i=1}^{k} L(T_i)$. Writing $n:=|L(T)|$, $N:=\sum_{i=1}^{k}|L(T_i)|$, and $R:=\sum_{i=1}^{k}|L(T_i)|^2$, this problem is solved by the algorithm of Aho et al. [3], which is commonly called BUILD in the phylogenetic literature [4], in $O(Nn)$ time for binary trees and $O(Rn)$ time in general.

An $O(N^2)$ algorithm to compute all binary trees compatible with the input is described in [5]. Using sophisticated data structures, the effort for computing a single supertree was reduced to $O(\min(N\sqrt{n},\, N+n^2\log n))$ for binary trees and $O(R\log^2 R)$ for arbitrary input trees [6]. Recently, an $O(N\log^2 N)$ algorithm has become available for the compatibility problem for general trees [7]. The compatibility problem for nested taxa in addition assigns labels to inner vertices and can also be solved in $O(N\log^2 N)$ [8].

Here we consider the special case that the input trees share the same leaf set $L(T_1)=L(T_2)=\dots=L(T_k)=L(T)=L$, and thus $N=kn$ and $R=kn^2$. While the general supertree problem arises naturally when attempting to reconcile phylogenetic trees produced in independent studies, the special case appears in particular when incompletely resolved trees are produced with different methods. In recent work, we have shown that such trees can be inferred, e.g., as the least resolved trees from best match data [9, 10] and from information on horizontal gene transfer [11, 12]. Denoting by $H(T)$ the set of “clusters” in $T$, we recently showed that the latter type of data can be explained by a common evolutionary scenario if and only if (1) both the best match and the horizontal transfer data can be explained by least resolved trees $T_1$ and $T_2$, respectively, and (2) the union $H(T_1)\cup H(T_2)$ is again a hierarchy. In this context it is of practical interest whether the latter property can be tested efficiently, and whether the common refinement $T$ satisfying $H(T)=H(T_1)\cup H(T_2)$ [13] can be constructed efficiently in the positive case.

Several linear-time, i.e., $O(|L|)$ time, algorithms for the common refinement of two input trees $T_1$ and $T_2$ with a common leaf set have become available. The INSERT algorithm [14], which makes use of ideas from [15], inserts the clusters of $T_2$ into $T_1$ and vice versa and then uses a linear-time algorithm to check whether the two edited trees are isomorphic [16]. Assuming that the input trees are already known to be compatible, Merge_Trees [17, 18] can also be applied to insert the clusters of one tree into the other. For both of these methods, an overall linear-time algorithm for the common refinement of $k$ input trees is then obtained by iteratively computing the common refinement of the input tree $T_j$ and the common refinement of the first $j-1$ trees, resulting in a total effort of $O(k|L|)$.

Here we describe an alternative algorithm that constructs in a single step a candidate refinement T of all k input trees. This is achieved by computing the parent-function of the potential refinement T in a bottom-up fashion. As we shall see, the algorithm is easy to implement and does not require elaborate data structures. The existence of a common refinement is then verified by checking that the parent function defines a tree T and, if so, that T displays each of the input trees Tj. This test is also much simpler to implement than the isomorphism test for rooted trees [16].

Theory

Notation and preliminaries

Let $T$ be a rooted tree. We write $V(T)$ for its vertex set, $E(T)$ for its edge set, $L(T)\subseteq V(T)$ for its leaf set, $V^0(T):=V(T)\setminus L(T)$ for the set of inner vertices, and $\rho\in V^0(T)$ for its root. An edge $e=\{u,v\}\in E(T)$ is an inner edge if $u,v\in V^0(T)$. The ancestor partial order $\preceq_T$ on $V(T)$ is defined by $x\preceq_T y$ whenever $y$ lies along the unique path connecting $x$ and the root $\rho$. If $x\preceq_T y$ and $x\ne y$, we write $x\prec_T y$. For $v\in V(T)$, we set $\mathrm{child}_T(v):=\{u \mid \{v,u\}\in E(T),\ u\prec_T v\}$. If $u\in\mathrm{child}_T(v)$, then $v$ is the unique parent of $u$. In this case, we write $v=\mathrm{parent}_T(u)$. All trees $T$ considered in this contribution are phylogenetic, i.e., they satisfy $|\mathrm{child}_T(v)|\ge 2$ for all $v\in V^0(T)$.

We denote by $T(u)$ the subtree of $T$ rooted in $u$ and write $L(T(u))$ for its leaf set. The last common ancestor of a vertex set $W\subseteq V(T)$ is the unique $\preceq_T$-minimal vertex $\mathrm{lca}_T(W)\in V(T)$ satisfying $w\preceq_T \mathrm{lca}_T(W)$ for all $w\in W$. For brevity, we write $\mathrm{lca}_T(x,y):=\mathrm{lca}_T(\{x,y\})$. Furthermore, we will sometimes write $vu\in E(T)$ as a shorthand for “$\{u,v\}\in E(T)$ with $u\prec_T v$.”

A hierarchy on $L$ is a set system $H\subseteq 2^L$ such that (i) $L\in H$, (ii) $A\cap B\in\{A,B,\emptyset\}$ for all $A,B\in H$, and (iii) $\{x\}\in H$ for all $x\in L$. There is a well-known bijection between rooted phylogenetic trees $T$ with leaf set $L$ and hierarchies on $L$, see e.g. [4, Thm. 3.5.2]. It is given by $H(T):=\{L(T(u)) \mid u\in V(T)\}$; conversely, the tree $T_H$ corresponding to a hierarchy $H$ is the Hasse diagram w.r.t. set inclusion. Thus, if $v=\mathrm{lca}_T(A)$ for some $A\subseteq L(T)$, then $L(T(v))$ is the inclusion-minimal cluster in $H(T)$ that contains $A$, see e.g. [19]. We call the elements of $H(T)$ clusters and say that two clusters $C$ and $C'$ are compatible if $C\cap C'\in\{C,C',\emptyset\}$. Note that, by (ii), the clusters of the same tree are all pairwise compatible.
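The bijection between trees and hierarchies can be made concrete with a small sketch. It assumes, purely for illustration, that a tree is written as nested Python tuples with strings as leaves; the helper names `clusters` and `is_hierarchy` are hypothetical and not part of tralda. The pairwise test of condition (ii) is quadratic and only meant to mirror the definition.

```python
from itertools import combinations

def clusters(tree):
    """Return H(T) as a set of frozensets for a nested-tuple tree (leaves are strings)."""
    found = set()

    def visit(node):
        if isinstance(node, tuple):          # inner vertex: union of the child clusters
            leaves = frozenset().union(*(visit(child) for child in node))
        else:                                # leaf: singleton cluster
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    visit(tree)
    return found

def is_hierarchy(sets, leaf_set):
    """Check conditions (i)-(iii) of a hierarchy on leaf_set (quadratic pairwise test)."""
    leaf_set = frozenset(leaf_set)
    if leaf_set not in sets:                                  # (i) L is a cluster
        return False
    if any(frozenset([x]) not in sets for x in leaf_set):     # (iii) all singletons
        return False
    return all((a & b) in (a, b, frozenset())                 # (ii) nested or disjoint
               for a, b in combinations(sets, 2))
```

For the tree `("a", ("b", "c"))`, `clusters` yields the three singletons together with `{b, c}` and `{a, b, c}`; adding the set `{a, b}` breaks condition (ii) because it overlaps `{b, c}` without being nested.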

A (rooted) triple is a binary tree on three leaves. We say that a tree $T$ displays a triple $xy|z$ if $\mathrm{lca}_T(x,y)\prec_T \mathrm{lca}_T(x,z)=\mathrm{lca}_T(y,z)$, or equivalently, if there is a cluster $C\in H(T)$ such that $x,y\in C$ and $z\notin C$. The set of all triples that are displayed by $T$ is denoted by $r(T)$. A set $R$ of triples is consistent if there is a tree that displays all triples in $R$.

Let $T$ and $T'$ be phylogenetic trees with $L(T)=L(T')$. We say that $T$ is a refinement of $T'$ if $T'$ can be obtained from $T$ by contracting a subset of inner edges. Equivalently, $T$ is a refinement of $T'$ if and only if $H(T')\subseteq H(T)$. A tree $T$ displays a tree $T'$ if $L(T')\subseteq L(T)$ and $H(T')\subseteq\{C\cap L(T') \mid C\in H(T) \text{ and } C\cap L(T')\ne\emptyset\}$. In particular, therefore, $T$ displays a tree $T'$ with $L(T)=L(T')$ if and only if $H(T')\subseteq H(T)$, i.e., if and only if $T$ is a refinement of $T'$. The minimal common refinement of the trees $T_i$, $1\le i\le k$, is the tree $T$ such that $H(T)=\bigcup_{i=1}^{k}H(T_i)$, provided it exists.
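The cluster-based definition of "displays" translates almost literally into code. The following sketch again assumes nested tuples with string leaves as a stand-in tree representation; `clusters` and `displays` are hypothetical helper names, and the set-based test is chosen for readability, not efficiency.

```python
def clusters(tree):
    """Return H(T) as a set of frozensets for a nested-tuple tree (leaves are strings)."""
    found = set()

    def visit(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(visit(child) for child in node))
        else:
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    visit(tree)
    return found

def displays(tree, other):
    """True if `tree` displays `other` in the cluster sense defined above."""
    big, small = clusters(tree), clusters(other)
    leaves_other = max(small, key=len)       # the leaf set is the largest cluster
    if not leaves_other <= max(big, key=len):
        return False                         # L(T') must be a subset of L(T)
    # Restrict every cluster of `tree` to L(T') and drop empty intersections.
    restricted = {c & leaves_other for c in big if c & leaves_other}
    return small <= restricted
```

When both trees have the same leaf set, `displays(tree, other)` reduces to the refinement test $H(T')\subseteq H(T)$.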

Thm. 3.5.2 of [4] can be rephrased in the following form:

Lemma 1

Let $T_1, T_2, \dots, T_k$ be trees with common leaf set $L(T_i)=L$ such that $H:=\bigcup_{i=1}^{k}H(T_i)$ is a hierarchy. Then there is a unique tree $T$ such that $H(T)=H$. Furthermore, $T$ is the unique “least resolved” tree in the sense that contraction of any edge $e$ in $T$ yields a tree $T_e$ with $H(T_e)\subsetneq H(T)$.

Proof

By definition of $H$ and the bijection between phylogenetic trees and hierarchies, there is a unique tree $T$ such that $H=H(T)$. Consider an inner edge $e=uv$. By construction, there is at least one tree $T_v$ such that $C_v:=L(T(v))\in H(T_v)$. However, $H(T_e)=H(T)\setminus\{C_v\}$ and thus $T_e$ does not display $T_v$.

By Thm. 1 in [20], a tree $T$ is displayed by a tree $T'$ with $L(T)\subseteq L(T')$ if and only if $r(T)\subseteq r(T')$. As an immediate consequence, a common refinement of trees with a common leaf set $L$ exists if and only if the union $R$ of their triple sets is consistent. The latter condition can be checked using the BUILD algorithm which, in the positive case, returns a tree $\mathrm{BUILD}(R,L)$ that displays all triples in $R$.

Lemma 2

Suppose that $T$ is the unique least resolved common refinement of the trees $T_1, T_2, \dots, T_k$ with common leaf set $L(T_i)=L$, $1\le i\le k$, and let $R:=r(T_1)\cup r(T_2)\cup\dots\cup r(T_k)$. Then $T=\mathrm{BUILD}(R,L)$.

Proof

The tree $\hat{T}:=\mathrm{BUILD}(R,L)$ is a common refinement since, by the arguments above, it displays $T_1, T_2, \dots, T_k$. By Lemma 1, we therefore have $H(T)\subseteq H(\hat{T})$. Prop. 4.1 in [21] implies that $\hat{T}$ is least resolved w.r.t. $R$, i.e., every tree $\hat{T}'$ obtained from $\hat{T}$ by contraction of an edge no longer displays all input triples in $R$. By Thm. 6.4.1 in [4], $T_i$ is displayed by $\hat{T}'$ if and only if $\hat{T}'$ displays all triples of $T_i$. Since this is not true for all input trees $T_i$, $\hat{T}'$ does not display all input trees $T_i$, $1\le i\le k$. Together with $H(T)\subseteq H(\hat{T})$, this implies that $T=\hat{T}$.

We note that, given a set of triples $R$, “$T$ is a least resolved tree displaying $R$” does not imply that the vertex set $V(T)$ is minimal among all such trees. It is possible in general that there is a tree $T'$ displaying a given triple set $R$ with $|V(T')|<|V(\mathrm{BUILD}(R,L))|$. In this case, $\mathrm{BUILD}(R,L)$ does not display $T'$, see [22] for details. However, uniqueness of the least resolved tree, Lemma 1, rules out this scenario in our setting.

The algorithm BuildST [7] computes the supertree of a set $\mathcal{T}:=\{T_i \mid 1\le i\le k\}$ of rooted trees without first breaking down each tree to its triple set $r(T_i)$. Lemma 5 in [7] establishes that BuildST applied to a set of trees and BUILD applied to the triple set $R:=\bigcup_{i=1}^{k}r(T_i)$ produce the same output for all instances. If $R$ is consistent, BuildST computes the tree $\mathrm{BUILD}(R,L)$. If all input trees have the same leaf set $L$, BuildST in particular computes their common refinement. The performance analysis in [7] shows that BuildST runs in $O(k|L|\log^2(k|L|))$ time for this special case. Linear-time algorithms for the special case of a common leaf set therefore offer a further improvement over the best known general-purpose supertree algorithms.

A bottom-up linear time algorithm

The basic idea of our approach is to construct $T$ by means of a simple bottom-up approach that computes the parent function $\mathrm{parent}_T\colon V(T)\setminus\{\rho_T\}\to V(T)\setminus L$ of a candidate tree $T$ in a stepwise manner. This algorithm is based on three simple observations:

  • (i) If it exists, the common refinement $T$ of $T_1, T_2, \dots, T_k$ is uniquely defined by virtue of $H(T)=\bigcup_{i=1}^{k}H(T_i)$ (cf. Lemma 1). We will therefore identify all vertices $v_i\in V(T_i)$ with a vertex $v$ in the prospective tree $T$ whenever their clusters, i.e., the sets $L(T_i(v_i))$, are the same. In this case, we have $L(T(v)):=L(T_i(v_i))$. From here on, we simply say, by a slight abuse of notation, that $v$ is also a vertex of $T_i$ and write $v\in V(T_i)$.

  • (ii) Since $H(T)=\bigcup_{i=1}^{k}H(T_i)$, each vertex $v\in V(T)$ is also a vertex in at least one input tree $T_i$. Conversely, every vertex $v\in V(T_i)$, $i\in\{1,\dots,k\}$, is a vertex in $T$. Therefore, we have $V(T)=\bigcup_{i=1}^{k}V(T_i)$.

  • (iii) $T$ exists if and only if the sets $L(T(x))$ and $L(T(y))$ for all $x,y\in\bigcup_{i=1}^{k}V(T_i)$ are either comparable by set inclusion or disjoint, i.e., $L(T(x))\cap L(T(y))\in\{L(T(x)),L(T(y)),\emptyset\}$. Thus, $x\preceq_T y$ if and only if $L(T_i(x))=L(T(x))\subseteq L(T_j(y))=L(T(y))$ for the appropriate choices of $i,j\in\{1,\dots,k\}$.

Observation (iii) makes it possible to access the ancestor order $\preceq_T$ on $V(T)$ without knowing the common refinement $T$ explicitly. Many of the upcoming definitions are illustrated in Fig. 1.

Fig. 1.

The three trees $T_1$, $T_2$, and $T_3$ with common leaf set $L=\{a,b,c,d,e\}$ have the (unique) common refinement $T$. Here, $J(\rho)=\{1,2,3\}$ and thus $\bar{J}(\rho)=\emptyset$. The different symbols for vertices indicate which vertex $u$ in the $T_i$s corresponds to which vertex $u$ in $T$. Consider the highlighted vertex $v$. The corresponding vertices $p_i(v)$ are shown in the respective trees $T_i$. Here, $p_2(v)=v$ while the vertices $p_1(v)$ and $p_3(v)$ in $T_1$ and $T_3$ correspond to $\mathrm{parent}_T(v)$ and $\rho$, respectively. Consequently, $J(v)=\{2\}$ and $\bar{J}(v)=\{1,3\}$. We have $p_2(v)=v\prec_T\mathrm{parent}_T(v)=p_1(v)\prec_T p_3(v)$, according to Obs. 3. In this example, only the last case in Obs. 4 for $v$ is satisfied, namely $\mathrm{parent}_T(v)=p_1(v)$. Moreover, $A(v)=\{v\}\cup\{\mathrm{parent}_{T_2}(v)=\rho\}\cup\{p_1(v),p_3(v)\}=\{v,\rho,\mathrm{parent}_T(v)\}$.

We introduce, for each $v\in V(T)$, the index set $J(v):=\{i \mid L(T_i(v))=L(T(v))\}$ of the trees that contain vertex $v$. We have $J(v)\ne\emptyset$ for all $v\in V(T)$. For simplicity, we write $\bar{J}(v):=\{1,\dots,k\}\setminus J(v)$ for the indices of all other trees. Hence, $\bar{J}(v)=\emptyset$ if and only if $L(T(v))\in H(T_i)$ for all $i\in\{1,\dots,k\}$. In particular, therefore, $\bar{J}(v)=\emptyset$ whenever $v\in L$ or $v=\rho$.

Let us assume until further notice that a common refinement exists and let $T=(V,E)$ be the unique least resolved common refinement of $T_1, T_2, \dots, T_k$ on a common leaf set. Due to Lemma 1, $T$ is uniquely determined by the parent function $\mathrm{parent}_T$. The key ingredients in our construction are the following vertices in $T_i$:

$$p_i(v):=\mathrm{lca}_{T_i}\bigl(L(T(v))\bigr),\qquad i\in\{1,\dots,k\},\ v\in V(T) \tag{1}$$

By assumption, we have $L(T(v))\subseteq L(T_i)$ and thus $p_i(v)$ is well-defined. As an immediate consequence of the definition in Eq. (1), we have

Observation 3

For all $v\in V(T)$ and all $i\in\{1,\dots,k\}$ it holds that $p_i(v)=v$ iff $v\in V(T_i)$ iff $i\in J(v)$. If $i\notin J(v)$, then $v\prec_T p_i(v)$ and therefore $\mathrm{parent}_T(v)\preceq_T p_i(v)$.

Now assume that $\mathrm{parent}_T(v)$ exists in $T$, i.e., $v\ne\rho$. By Observation (ii), $v\in V(T)$ implies $v\in V(T_i)$ for some $i\in\{1,\dots,k\}$. In this case, $\mathrm{parent}_T(v)$ must be the unique $\preceq_{T_i}$-minimal vertex $u_i\in V(T_i)$ that satisfies $L(T(v))\subsetneq L(T_i(u_i))$ because $H(T_i)\subseteq H(T)$. In other words, $p_i(\mathrm{parent}_T(v))=u_i=\mathrm{parent}_{T_i}(v)$. Hence, we have

Observation 4

For all $v\in V\setminus\{\rho\}$ it holds that $\mathrm{parent}_T(v)=\mathrm{parent}_{T_i}(v)$ for some $i\in J(v)$ or $\mathrm{parent}_T(v)=p_j(v)$ for some $j\in\bar{J}(v)$.

Note that, in general, both cases in Obs. 4 may hold simultaneously. Consider the set of vertices $A(v):=\{v\}\cup\{\mathrm{parent}_{T_i}(v) \mid i\in J(v)\}\cup\{p_i(v) \mid i\in\bar{J}(v)\}$. By construction and Obs. 4, we have $v\preceq_T x$ for all $x\in A(v)$. Since all ancestors of a vertex in a tree are mutually comparable w.r.t. the ancestor order, we have

Observation 5

All $x,y\in A(v)$ are pairwise comparable w.r.t. $\preceq_T$.

Taken together, Observations 3-5 imply that the parent map of T can be expressed in the following form:

$$\mathrm{parent}_T(v)=\min\Bigl(\min_{i\in J(v)}\mathrm{parent}_{T_i}(v),\ \min_{i\in\bar{J}(v)}p_i(v)\Bigr) \tag{2}$$

where the minimum is taken w.r.t. the ancestor order $\preceq_T$ on $T$. Since the root $\rho_i$ of each $T_i$ coincides with the root $\rho$ of $T$, $v$ is the root of $T$ iff $\mathrm{parent}_{T_i}(v)=\emptyset$ is undefined for one and thus for all $i$. In this case, we set $\mathrm{parent}_T(v)=\emptyset$.

With this in hand, we show how to compute the maps $p_i$ for $u:=\mathrm{parent}_T(v)$ for all $i\in\{1,\dots,k\}$. To this end, we distinguish three cases. (1) If $u\in V(T_i)$, we have $p_i(u)=u$ by definition. (2) If $u\notin V(T_i)$, then we have to identify the $\preceq_T$-minimal vertex $w\in V(T_i)$ with $u\prec_T w$. If $v\in V(T_i)$, then $p_i(u)=w=\mathrm{parent}_{T_i}(v)$. In the remaining case, $i\in\bar{J}(v)$, we already know that $p_i(v)$ is the $\preceq_{T_i}$-minimal ancestor of $v$. Thus, we have either $p_i(v)=u=\mathrm{parent}_T(v)$, i.e., a sub-case of (1), or (3) $u\prec_T p_i(v)$ whenever $v\notin V(T_i)$ and $u\notin V(T_i)$. In this case, the definition of $p_i$ implies $p_i(u)=p_i(v)$. Summarizing the three cases yields the following recursion:

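$$p_i(u)=\begin{cases} u & \text{if } i\in J(u),\\ \mathrm{parent}_{T_i}(v) & \text{if } i\in J(v),\\ p_i(v) & \text{if } i\in\bar{J}(u) \text{ and } i\in\bar{J}(v). \end{cases} \tag{3}$$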

Note that, although the cases in Eq. (3) are not exclusive (since $J(v)\cap J(u)\ne\emptyset$ is possible), they are not in conflict. To see this, observe that if $i\in J(u)$ and $i\in J(v)$, then $u=\mathrm{parent}_{T_i}(v)$ as a consequence of the definition of $u$.

Initializing $i\in J(v)$ for all $i$ and all leaves $v$, we can compute $J(u)$ for $u=\mathrm{parent}_T(v)$ as a by-product of the minimum computation in Eq. (2) by simply keeping track of the equalities encountered, since both $\mathrm{parent}_{T_i}(v)$ and $p_i(v)$ are vertices in $T_i$. More precisely, each time a strictly $\prec_T$-smaller vertex $u$, i.e., a proper set inclusion, is encountered in Eq. (2), the current list of equalities is discarded and re-initialized as $\{i\}$, where $i$ is the index of the tree $T_i$ in which the new minimum $u$ was found. The indices of the trees $T_j$ with $u\in V(T_j)$ are then appended.

It remains to ensure that the vertices are processed in the correct order. To this end, we use a queue $Q$, which is initialized by enqueueing the leaf set. Upon dequeueing $v$, its parent $u$ and the values $p_i(u)$ are computed. Except for the leaves, every vertex $u\in V(T)$ appears as parent of some $v\in V(T)$. On the other hand, $u$ may appear multiple times as parent. Thus we enqueue $u$ in $Q$ only if the same vertex has not been enqueued already in a previous step. We emphasize that it is not sufficient to check whether $u\in Q$ since $u$ may have already been dequeued from $Q$ before re-appearing as a parent. We therefore keep track of all vertices that have ever been enqueued in a set $V$. To see that this is indeed necessary, consider a tree $T_i=(a,(b,c)v_1)v_2$ and an initial queue $Q=(a,b,c)$. Without the auxiliary set $V$, we obtain $Q=(b,c,v_2)$, $Q=(c,v_2,v_1)$, $Q=(v_2,v_1)$, $Q=(v_1)$, $Q=(v_2)$, etc., and thus $v_2$ is enqueued twice.
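The queue example above can be replayed in a few lines. The sketch below is only an illustration of the bookkeeping, assuming the parent function is already given as a dictionary; the function name `enqueue_order` and the flag `track_seen` are hypothetical.

```python
from collections import deque

def enqueue_order(parent, leaves, track_seen):
    """Return every vertex in the order it is enqueued, starting from the leaves.

    parent     -- dict child -> parent (the root has no entry)
    track_seen -- if True, use the auxiliary set of all vertices ever enqueued;
                  if False, only test membership in the current queue.
    """
    queue = deque(leaves)
    seen = set(leaves)
    order = list(leaves)
    while queue:
        v = queue.popleft()
        u = parent.get(v)
        if u is None:                 # v is the root
            continue
        already = (u in seen) if track_seen else (u in queue)
        if not already:
            queue.append(u)
            seen.add(u)
            order.append(u)
    return order

# The tree (a,(b,c)v1)v2 from the text, written as a parent map.
parent = {"a": "v2", "b": "v1", "c": "v1", "v1": "v2"}
print(enqueue_order(parent, ["a", "b", "c"], track_seen=False))
print(enqueue_order(parent, ["a", "b", "c"], track_seen=True))
```

Without the auxiliary set, `v2` appears twice in the returned order; with it, every vertex is enqueued exactly once.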

An implementation of this procedure also needs to keep track of the correspondence between vertices in $V(T)$ and the vertices of $V(T_i)$. To this end, we can associate with each $v\in V(T)$ a list of pointers to $v\in V(T_i)$ for $i\in J(v)$, and a pointer from $v\in V(T_i)$ back to $v\in V(T)$. For the leaves, these are assigned upon initialization. Afterwards, they are obtained for $u=\mathrm{parent}_T(v)$ as a by-product of computing $J(u)$, since the pointers have to be set exactly for the $i\in J(u)$. In particular, whenever the pointer for $u$ found in $T_i$ has already been set, we know that $u\in V$.

Summarizing the discussion so far, we have shown:

Proposition 6

Suppose the trees T1, T2, ..., Tk have a common refinement T. Then parentT(v) is correctly computed by the recursions Eq. (2) and Eq. (3).

Next we observe that it is not necessary to explicitly compute set inclusions. As an immediate consequence of Obs. 5 and the fact that $x\prec_T y$ implies $L(T(x))\subsetneq L(T(y))$ because all trees are phylogenetic by assumption, we obtain

Observation 7

For any two $x,y\in A(v)$, we have $x\prec_T y$ if and only if $|L(T(x))|<|L(T(y))|$.

Thus it suffices to evaluate the minimum in Eq. (2) w.r.t. the cardinalities $|L(T(v))|$. This can be achieved in $O(k)$ time provided the values $\ell_i(v):=|L(T_i(v))|$ are known for the input trees. Since the parent function $\mathrm{parent}_T$ unambiguously defines a tree $T$, we have

Corollary 8

Suppose the trees T1, T2, ..., Tk have a common refinement T. Then T can be computed in O(k|L|) time.

Proof

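For each input tree $T_i$, $\ell_i(v)$ can be computed as

$$\ell_i(v)=1 \ \text{ if } v\in L,\quad\text{and}\quad \ell_i(v)=\sum_{u\in\mathrm{child}_{T_i}(v)}\ell_i(u)\ \text{ otherwise.} \tag{4}$$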

Since the total number of terms appearing for the inner vertices of $T_i$ equals the number of edges of $T_i$, the total effort for $T_i$ is bounded by $O(|L|)$. The total number of vertices $u$ computed as $\mathrm{parent}_T(v)$ equals the number of edges of $T$, and thus is also bounded by $O(|L|)$. Since the tree $T$, as well as the $k$ trees $T_i$, have $O(|L|)$ vertices, we require $O(k|L|)$ pointers from the vertices in $T$ to their corresponding vertices in the $T_i$ and vice versa. By initializing the pointers for all $v\in V(T_i)$ as “not set”, it can be checked in constant time whether $u$ that was found in $T_i$ is already contained in the set $V$, since this is the case if and only if its pointer has already been set. Evaluation of Eq. (2) requires $O(k)$ comparisons, each of which can be performed in constant time by virtue of Obs. 7. The computation of $p_i(u)$ and $J(u)$ as well as the update of the correspondence table between vertices in $T$ and $T_i$, $1\le i\le k$, requires $O(k)$ operations for each $v\in V(T)$. Thus $T$ can be computed in $O(k|L|)$ time.
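The recursion in Eq. (4) amounts to a single postorder traversal per input tree. A minimal sketch, assuming a tree is stored as a dictionary mapping each inner vertex to its list of children (the name `leaf_counts` is hypothetical), uses an explicit stack so that deep trees do not hit Python's recursion limit:

```python
def leaf_counts(children, root):
    """Compute the number of leaves below every vertex, following Eq. (4).

    children -- dict vertex -> list of child vertices (leaves may be absent
                or map to an empty list)
    """
    counts = {}
    stack = [(root, False)]
    while stack:
        v, expanded = stack.pop()
        kids = children.get(v, [])
        if not kids:                       # leaf: count is 1
            counts[v] = 1
        elif expanded:                     # all children done: sum their counts
            counts[v] = sum(counts[u] for u in kids)
        else:                              # first visit: revisit v after its children
            stack.append((v, True))
            stack.extend((u, False) for u in kids)
    return counts
```

Every edge contributes exactly one term to one of the sums, which is the counting argument used in the proof above.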

[Algorithm 1 (LinCR): pseudocode provided as an image in the original publication; not reproduced here.]

So far, we have assumed that a common refinement exists. By a slight abuse of notation, we also use the function parentT if the refinement T does not exist. In this case, we define parentT on the union of the V(Ti) recursively by Eqs. (2) and (3). Alg. 1 summarizes the procedure based on the leaf set cardinalities for the general case. If no common refinement T exists, then either parentT does not specify a tree, or the tree T defined by parentT is not a common refinement of T1, T2, ..., Tk. The following result shows that we can always efficiently compute parentT and check whether it specifies a common refinement of the input trees.

Theorem 9

LinCR (Alg. 1) decides in $O(k|L|)$ time whether a common refinement of trees $T_1, T_2, \dots, T_k$ on the same leaf set $L$ exists and, in the affirmative case, returns the tree $T$ corresponding to $H(T)=H(T_1)\cup H(T_2)\cup\dots\cup H(T_k)$.

Proof

We construct $\mathrm{parent}_T$ in Lines 1–24 as described in the proof of Cor. 8. In particular, we determine $u:=\mathrm{parent}_T(v)$ by virtue of the smallest $\ell_i(u)$. Hence, we can process each enqueued vertex $v$ in $O(k)$. Moreover, if a common refinement $T$ exists, then Cor. 8 guarantees that we obtain this tree in Line 25.

A tree on $|L|$ leaves has at most $|L|-1$ inner vertices, with equality holding for binary trees. Therefore, the set $V$ of distinct vertices encountered in Alg. 1 can contain at most $2|L|-2$ vertices (note that by construction the root does not enter $V$). If this condition is violated, no common refinement exists and we can terminate with a negative answer (cf. Line 19). This ensures that $\mathrm{parent}_T$ is constructed in $O(k|L|)$ time. We continue by showing that, unless the algorithm exits in Line 16 or 19, $\mathrm{parent}_T$ in Line 25 always defines a tree $T$. To see this, consider the graph $G$ with vertex set $V\cup\{\rho\}$, where $\rho$ is the root vertex which is contained in each $T_i$, and an edge $\{u,v\}$ if and only if $\mathrm{parent}_T(v)=u$ or $\mathrm{parent}_T(u)=v$. Checking whether $\ell(v)<\ell_{\min}$ $(=\ell(u))$ in Line 15 ensures that $G$ does not contain cycles and that $\mathrm{parent}_T(v)=u$ and $\mathrm{parent}_T(u)=v$ is not possible. Moreover, every vertex $v\in V$ is enqueued to $Q$ and receives a parent $u$ such that $\ell(v)<\ell(u)$. Unless $u=\rho$, $u$ in turn receives a parent $u'$ with $\ell(u)<\ell(u')$. Since $V$ is finite and $v,u,u',\dots$ are pairwise distinct as a consequence of the cardinality condition, we conclude that eventually $\rho$ is reached, i.e., a path to $\rho$ exists for all $v\in V$. It follows that $G$ is connected, acyclic, and simple, and thus a tree (with root $\rho$).

It remains to check whether $T$ is phylogenetic and displays $T_i$ for all $i\in\{1,\dots,k\}$. Checking whether $T$ is phylogenetic in Line 26 can be done in $O(|L|)$ in a top-down traversal that exits as soon as it encounters a vertex with a single child. To check whether $T$ displays a tree $T_i$, we contract (in a copy of $T$) in a top-down traversal all edges $uv$ with $v\in\mathrm{child}_T(u)$ for which $v\notin V(T_i)$, i.e., for which $i\notin J(v)$. Since the root of $T$ and the leaves of $T$ are in $T_i$, this results in a rooted tree $T'_i$ with $V(T'_i)=V(T_i)$ if $T$ is indeed the common refinement of all trees. The contraction of an edge $uv$ can be performed in $O(|\mathrm{child}_T(v)|)$, hence in total time $O(|E(T_i)|)=O(|L|)$. Finally, we can check in $O(|L|)$ time whether the known correspondence between the vertices of $T'_i$ and $T_i$ is an isomorphism. To this end, it suffices to traverse $T'_i$ and to check that $\mathrm{child}_{T'_i}(v)=\mathrm{child}_{T_i}(v)$ for all $v\in V(T'_i)$ (cf. Lines 31–32) using the pointers of $v$ and all elements in $\mathrm{child}_{T'_i}(v)$ to the corresponding vertices in $T_i$. Note that, in general, the pointer from a vertex $v$ in $T'_i$ to a vertex in $T_i$ may not be set, in which case $v\notin V(T_i)$ and thus we can terminate with a negative answer. The total effort thus is bounded by $O(k|L|)$.

If $T$ on $L$ is a phylogenetic tree displaying all trees $T_1, T_2, \dots, T_k$, then it is a common refinement of these trees. Since every vertex $v\in V(T)$ is also contained in some $T_i$, i.e., $L(T(v))=L(T_i(v))$, we have $H(T)=H(T_1)\cup H(T_2)\cup\dots\cup H(T_k)$.
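To make the interplay of Eqs. (2) and (3) tangible, the following is a deliberately simplified sketch and not the authors' Alg. 1. It assumes nested-tuple input trees with string leaves that are phylogenetic and share one leaf set, and it identifies each vertex with its cluster (a `frozenset`). Because of this, it compares and hashes whole leaf sets and is therefore not linear-time; the pointer-based bookkeeping and the cardinality shortcut of Obs. 7 are replaced by direct set comparisons. The names `cluster_parents` and `lincr_sketch` are hypothetical.

```python
from collections import deque

def cluster_parents(tree):
    """Map every cluster of a nested-tuple tree to its parent cluster (root -> None)."""
    parent = {}

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        kids = [visit(child) for child in node]
        me = frozenset().union(*kids)
        for kid in kids:
            parent[kid] = me
        return me

    parent[visit(tree)] = None
    return parent

def lincr_sketch(trees):
    """Return the parent map (cluster -> parent cluster) of the common refinement, or None."""
    pars = [cluster_parents(t) for t in trees]
    k = len(trees)
    root = next(c for c, q in pars[0].items() if q is None)
    if any(root not in par for par in pars):
        raise ValueError("input trees must share the same leaf set")
    leaves = [frozenset([x]) for x in root]
    p = {leaf: [leaf] * k for leaf in leaves}     # p[v][i] plays the role of p_i(v)
    parent = {root: None}
    queue, seen = deque(leaves), set(leaves)
    while queue:
        v = queue.popleft()
        if v == root:
            continue
        # Candidate from tree i: parent in T_i if v is a vertex of T_i, else p_i(v).
        cands = [pars[i][v] if p[v][i] == v else p[v][i] for i in range(k)]
        u = min(cands, key=len)                   # Eq. (2), minimum by cardinality
        # All candidates must be ancestors of u; otherwise two clusters overlap.
        if len(u) <= len(v) or any(not u <= c for c in cands):
            return None
        parent[v] = u
        if u not in seen:                         # auxiliary set of enqueued vertices
            seen.add(u)
            p[u] = cands                          # Eq. (3): p_i(u) is the i-th candidate
            queue.append(u)
    # Verification: all input clusters are vertices of T, and the children of
    # every inner vertex partition its cluster.
    if any(c not in parent for par in pars for c in par):
        return None
    children = {}
    for v, u in parent.items():
        if u is not None:
            children.setdefault(u, []).append(v)
    for u, kids in children.items():
        if sum(map(len, kids)) != len(u) or frozenset().union(*kids) != u:
            return None
    return parent
```

Note that in this formulation the three cases of Eq. (3) collapse into one assignment: the candidate contributed by tree $T_i$ is exactly $p_i(u)$, whichever case applies.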

Computational results

We compare the running times for (a) BUILD [3], (b) BuildST [7], (c) Merge_Trees [18], (c’) Loose_Cons_Tree [18], and (d) LinCR (Alg. 1). To this end, we implemented all of these algorithms in Python as part of the tralda library. We note that BUILD operates on a set of triples extracted from the input trees rather than the trees themselves. We use the union of the minimum cardinality sets of representative triples of every $T_i$ appearing in the proof of Thm. 2.8 in [23]. Therefore, we have $|R|\in O(k|L|^2)$ [24, Thm. 6.4] and BUILD runs in $O(k|L|^3)$ time. In the case of Merge_Trees, we implemented a variant that starts with $T=T_1$ and then iteratively merges the clusters of the trees $T_i$, $2\le i\le k$, into $T$. Merge_Trees assumes that the input trees are compatible, which is guaranteed in our benchmarking data set. In practice, however, this condition may be violated, in which case the behavior of Merge_Trees is undefined. We therefore also implemented an $O(k|L|)$ algorithm for constructing the loose consensus tree for a set of trees $T_1, T_2, \dots, T_k$ on the same leaf set, Loose_Cons_Tree, following [18]. The loose consensus comprises all clusters that occur in at least one tree $T_i$, $1\le i\le k$, and that are compatible with all other clusters of the input trees (see [25–27] and the references therein). The loose consensus tree by definition coincides with the common refinement whenever the latter exists. Loose_Cons_Tree uses Merge_Trees as a subroutine but ensures compatibility in each step by first deleting incompatible clusters in one of the trees. This is implemented as the deletion of the corresponding inner vertex $v$ followed by reconnecting the children of $v$ to the parent of $v$. The input trees are compatible if and only if no deletion is necessary. The existence of a common refinement can therefore be checked by keeping track of the number of deletions. However, the subroutine that processes trees to remove incompatible clusters significantly adds to the running time of the Loose_Cons_Tree algorithm. The linear-time algorithms require $O(k|L|)$ space.

We simulate test instances as follows: First, a random tree $T$ is generated recursively by starting from a single vertex (which becomes the root) and stepwise attaching new leaves to a randomly chosen vertex $v$ until the desired number of leaves $|L|$ is reached. In each step, we add two children to $v$ if $v$ is currently a leaf, and only a single new leaf otherwise. This way, the number of leaves increases by exactly one in each step and the resulting tree $T$ is phylogenetic (but in general not binary). From $T$, we obtain $k\in\{2,8,32\}$ trees $T_1, T_2, \dots, T_k$ by random contraction of inner edges in (a copy of) $T$. Each edge is considered for contraction independently with a probability $p\in\{0.1,0.5,0.9\}$. Therefore, $T$ is a refinement of $T_i$ for all $1\le i\le k$, i.e., a common refinement exists by construction. However, in general we have $H(T)\ne\bigcup_{i=1}^{k}H(T_i)$, i.e., $T$ is not necessarily the minimal common refinement of the $T_i$. The trees $T_1, T_2, \dots, T_k$ constructed in this manner serve as input for all algorithms.
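The simulation procedure can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the benchmarking code from tralda: trees are dictionaries mapping integer vertices to child lists, the root is vertex `0`, the functions `random_tree` and `contract_random` are hypothetical names, and the recursive contraction is only suitable for trees of moderate depth.

```python
import random

def random_tree(n_leaves, rng):
    """Grow a random phylogenetic tree with n_leaves leaves (root is vertex 0)."""
    children = {0: []}
    next_id, n = 1, 1                        # the single start vertex counts as a leaf
    while n < n_leaves:
        v = rng.choice(list(children))       # uniformly chosen existing vertex
        new = 2 if not children[v] else 1    # two children for a leaf, else one new leaf
        for _ in range(new):
            children[v].append(next_id)
            children[next_id] = []
            next_id += 1
        n += 1                               # net gain is one leaf in both cases
    return children

def contract_random(children, p, rng, root=0):
    """Return a copy in which every inner edge is contracted independently with probability p."""
    result = {}

    def collect(v):
        kids = []
        for u in children[v]:
            if children[u] and rng.random() < p:
                kids.extend(collect(u))      # contract inner edge {v, u}
            else:
                result[u] = collect(u)       # keep u as a vertex of the new tree
                kids.append(u)
        return kids

    result[root] = collect(root)
    return result

rng = random.Random(42)
T = random_tree(10, rng)
inputs = [contract_random(T, 0.5, rng) for _ in range(3)]   # k = 3 input trees
```

Since edges leading to leaves are never contracted, all generated trees keep the leaf set of `T`, and `T` refines each of them by construction.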

The running time comparisons were performed using tralda on an off-the-shelf laptop (Intel® Core™ i7-4702MQ processor, 16 GB RAM, Ubuntu 20.04, Python 3.7). The time required to compute a least resolved common refinement of the input trees is included in the respective total running time shown in Figs. 2 and 3. The empirical performance data are consistent with the theoretical result that LinCR scales linearly in $k|L|$. In particular, the median running times scale linearly with $|L|$, as shown by the slopes of 1 in the log/log plot for the running times of LinCR in Fig. 3.

Fig. 2.

Running time comparison of the algorithms for the construction of a common refinement of $k$ input trees on leaf set $L$. The subplots of each row show boxplots for the running time for different numbers of leaves $|L|$ (indicated on the x-axis) and different values of $k\in\{2,8,32\}$ (indicated in the leftmost column of each subplot). In each row, a different probability $p\in\{0.1,0.5,0.9\}$ for edge contraction was used to produce the $k$ input trees. Per combination of the parameters $|L|$, $k$, and $p$, 100 instances were simulated to which all four algorithms were applied.

Fig. 3.

Running time comparison of the algorithms for the construction of a common refinement of $k$ input trees on leaf set $L$. Per combination of the parameters $|L|$ (indicated on the horizontal axis), $k$ (columns), and $p$ (rows), 100 instances were simulated and median values are shown for all algorithms. In each row, a different probability $p\in\{0.1,0.5,0.9\}$ for edge contraction was used to produce the $k$ input trees.

In accordance with the theoretical complexity of $O(k|L|\log^2(k|L|))$ for the common refinement problem, the performance curve of BuildST is almost parallel to that of LinCR; however, its computation cost is higher by almost two orders of magnitude. Our implementation of BuildST uses an algorithm for dynamic graph connectivity often referred to as the HDT data structure [28], as originally described in [7]. While we do not expect BuildST to become competitive with the other algorithms, we note that a recent experimental study showed that a simplified version of the HDT data structure (with a slightly worse asymptotic bound) outperforms the full version in practice [29]. For both LinCR and BuildST, the contraction probability $p$ appears to have little effect on the running time. In both cases, a larger value of $p$ (i.e., a lower average resolution of the input trees) leads to a moderate decrease of the running time.

In contrast, the resolution of the input trees has a large impact on the efficiency of BUILD. It also scales nearly linearly when the resolution of the individual input trees $T_i$ is comparably high (and even terminates faster than LinCR up until a few hundred leaves, cf. top-right panel), whereas its performance drops drastically with increasing $p$, i.e., for poorly resolved input trees. The reason for this is most likely the cardinality of a minimal triple set that represents the set of input trees. For binary trees, the cardinality of the triple set of $T_i$ equals the number of inner edges [23], i.e., there are $O(|L|)$ triples. For very poorly resolved trees, on the other hand, $O(|L|^2)$ triples are required [24], matching the differences of the slopes with $p$ observed for BUILD in Fig. 3.

As expected, the curves of the two O(k|L|) algorithms Merge_Trees and Loose_Cons_Tree are also almost parallel to that of LinCR in Fig. 3. For k=2, we can even observe that Merge_Trees is slightly faster than LinCR. However, the smaller number of necessary tree traversals in LinCR apparently becomes a noticeable advantage with an increasing number k of input trees. The additional tree processing steps in the more practically relevant Loose_Cons_Tree algorithm, furthermore, result in a longer running time compared to our new approach.

Concluding remarks

We developed a linear-time algorithm to compute the common refinement of trees on the same leaf set. In contrast to the “classical” supertree algorithms BUILD and BuildST, LinCR uses a bottom-up instead of a top-down strategy. This is similar to Loose_Cons_Tree and its subroutine Merge_Trees [18], which can also be used to obtain the common refinement of trees on the same leaf set in linear time. LinCR, however, requires fewer tree traversals and is, in our opinion, simpler to implement. In contrast to Merge_Trees, LinCR in particular does not rely on a data structure that enables linear-time tree preprocessing and constant-time last common ancestor queries for the nodes in the tree [30]. All algorithms were implemented in Python and are freely available for download from https://github.com/david-schaller/tralda as part of the tralda library. Empirical comparisons of running times show that LinCR consistently outperforms the linear-time alternatives. Only BUILD is faster for very small instances and moderate-size trees that are nearly binary.

Although it may be possible to improve Alg. 1 by a constant factor, it is asymptotically optimal, since the input size is $O(k|L|)$ for $k$ trees with $|L|$ leaves. Furthermore, trivial solutions can be obtained in some limiting cases. For instance, if $|V(T_i)|=2|L|-1$, then $T_i$ is binary, i.e., no further refinement is possible. In this case, we can immediately use $T=T_i$ as the only viable candidate and only check that $T_i$ displays all other $T_j$. However, we cannot entirely omit Lines 1–24 in this case since we require the sets $J(v)$ as well as the correspondence between the vertices in order to check whether $T$ displays every $T_i$.

It is worth noting that the idea behind LinCR does not generalize to more general supertree problems. The main reason is that the set inclusions employed to determine T do not carry over to the more general case because the inclusion order of $C_1, C_2 \in \mathcal{H}(T)$ cannot be determined from $C_1 \cap L(T_i)$ and $C_2 \cap L(T_j)$ for two trees with $L(T_i), L(T_j) \subsetneq L(T)$.
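A small constructed example may make this obstacle concrete. It is our own illustration, not taken from the paper: two input trees ((a,b),x) and ((a,b),y) are displayed by two different supertrees in which the clusters corresponding to the two input roots are nested in opposite order. Trees are represented directly by their cluster sets, and the helper names `fs` and `restrict` are hypothetical.

```python
def fs(labels):
    """Shorthand: frozenset of single-character leaf labels."""
    return frozenset(labels)

def restrict(cluster_set, leaves):
    """Clusters of the tree restricted to `leaves` (empty intersections dropped)."""
    return {c & leaves for c in cluster_set if c & leaves}

leaves_i, leaves_j = fs("abx"), fs("aby")    # L(Ti) and L(Tj)

# Supertree S1 = (((a,b),x),y) and supertree S2 = (((a,b),y),x)
S1 = {fs("a"), fs("b"), fs("x"), fs("y"), fs("ab"), fs("abx"), fs("abxy")}
S2 = {fs("a"), fs("b"), fs("x"), fs("y"), fs("ab"), fs("aby"), fs("abxy")}

# Both supertrees look identical when restricted to either input leaf set ...
same_on_i = restrict(S1, leaves_i) == restrict(S2, leaves_i)
same_on_j = restrict(S1, leaves_j) == restrict(S2, leaves_j)

# ... but the smallest clusters that restrict to the two input roots are
# nested in opposite order: in S1 {a,b,x} lies inside {a,b,x,y}, the smallest
# cluster restricting to {a,b,y}; in S2 {a,b,y} lies inside {a,b,x,y}, the
# smallest cluster restricting to {a,b,x}.
root_i_below_root_j_in_S1 = fs("abx") in S1 and fs("aby") not in S1
root_j_below_root_i_in_S2 = fs("aby") in S2 and fs("abx") not in S2
```

Since both supertrees induce the same restricted clusters, no comparison of the restricted sets alone can decide which of the two inclusion orders to use.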

Depending on the application, a negative answer to the existence of a common refinement may not be sufficient. One possibility is to resort to the loose consensus tree or possibly other notions of consensus trees, see e.g. [25, 31]. A natural alternative approach is to extract a maximum subset of consistent triples from $\bigcup_{i=1}^{k} r(T_i)$. This problem, however, is known to be NP-hard for arbitrary triple sets, see e.g. [32] and the references therein.
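To illustrate the triple sets involved, the sketch below enumerates r(T) for a small tree by brute force and reports leaf triples on which the union of several r(Ti) contains contradictory triples. It assumes the nested-tuple encoding used for illustration here (a leaf is a label, an inner vertex is a tuple of children), and all names are hypothetical. Note that the absence of such direct conflicts is only a necessary condition for consistency, and that the sketch does not attempt the NP-hard maximization.

```python
from itertools import combinations

def leaf_sets(tree, out):
    """Append the leaf set below every vertex of `tree` to `out`; return the root's."""
    if not isinstance(tree, tuple):          # a leaf label
        below = frozenset([tree])
    else:
        below = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(below)
    return below

def triples(tree):
    """r(T): all triples ab|c displayed by T, encoded as (frozenset({a, b}), c)."""
    all_clusters = []
    leaves = leaf_sets(tree, all_clusters)
    result = set()
    for cluster in all_clusters:
        # a and b share a cluster that excludes c, so T displays ab|c
        for a, b in combinations(cluster, 2):
            for c in leaves - cluster:
                result.add((frozenset((a, b)), c))
    return result

def direct_conflicts(triple_set):
    """Leaf triples {a, b, c} on which `triple_set` holds more than one triple."""
    seen = {}
    for pair, c in triple_set:
        seen.setdefault(pair | {c}, set()).add((pair, c))
    return {key for key, group in seen.items() if len(group) > 1}

t1 = (("a", "b"), "c")                       # displays ab|c
t2 = (("a", "c"), "b")                       # displays ac|b
conflicts = direct_conflicts(triples(t1) | triples(t2))
```

Here `conflicts` contains the single leaf triple {a, b, c}, reflecting that t1 and t2 have no common refinement.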

Authors' contributions

All authors contributed to deriving the mathematical results, the interpretation of results and the writing of the manuscript. DS implemented and benchmarked the algorithms. All authors read and approved the final manuscript.

Funding

Open Access funding enabled and organized by Projekt DEAL. This work was supported in part by the German Research Foundation (DFG), proj. no. STA850/49-1.

Availability of data and materials

Implementations of the algorithms used in this contribution are available at https://github.com/david-schaller/tralda as part of the tralda library.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

David Schaller, Email: sdavid@bioinf.uni-leipzig.de.

Marc Hellmuth, Email: marc.hellmuth@math.su.se.

Peter F. Stadler, Email: studla@bioinf.uni-leipzig.de

References

  • 1. Sanderson MJ, Purvis A, Henze C. Phylogenetic supertrees: assembling the trees of life. Trends Ecol Evol. 1998;13:105–109. doi: 10.1016/S0169-5347(97)01242-1.
  • 2. Semple C, Steel M. A supertree method for rooted trees. Discr Appl Math. 2000;105:147–158. doi: 10.1016/S0166-218X(00)00202-X.
  • 3. Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981;10:405–421. doi: 10.1137/0210030.
  • 4. Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press; 2003.
  • 5. Constantinescu M, Sankoff D. An efficient algorithm for supertrees. J Classif. 1995;12:101–112. doi: 10.1007/BF01202270.
  • 6. Henzinger MR, King V, Warnow T. Constructing a tree from homeomorphic subtrees, with applications to computational evolutionary biology. Algorithmica. 1999;24:1–13. doi: 10.1007/PL00009268.
  • 7. Deng Y, Fernández-Baca D. Fast compatibility testing for rooted phylogenetic trees. Algorithmica. 2018;80:2453–2477. doi: 10.1007/s00453-017-0330-4.
  • 8. Deng Y, Fernández-Baca D. An efficient algorithm for testing the compatibility of phylogenies with nested taxa. Algorithms Mol Biol. 2017;12:7. doi: 10.1186/s13015-017-0099-7.
  • 9. Geiß M, Chávez E, González Laffitte M, López Sánchez A, Stadler BMR, Valdivia DI, Hellmuth M, Hernández Rosales M, Stadler PF. Best match graphs. J Math Biol. 2019;78:2015–57. doi: 10.1007/s00285-019-01332-9.
  • 10. Schaller D, Geiß M, Chávez E, González Laffitte M, López Sánchez A, Stadler BMR, Valdivia DI, Hellmuth M, Hernández Rosales M, Stadler PF. Corrigendum to “Best Match Graphs”. J Math Biol. 2021;82:47. doi: 10.1007/s00285-021-01601-6.
  • 11. Geiß M, Anders J, Stadler PF, Wieseke N, Hellmuth M. Reconstructing gene trees from Fitch’s Xenology relation. J Math Biol. 2018;77:1459–1491. doi: 10.1007/s00285-018-1260-8.
  • 12. Hellmuth M, Seemann CR. Alternative characterizations of Fitch’s Xenology relation. J Math Biol. 2019;79:969–986. doi: 10.1007/s00285-019-01384-x.
  • 13. Hellmuth M, Michel M, Nøjgaard N, Schaller D, Stadler PF. Combining orthology and xenology data in a common phylogenetic tree. In: Stadler PF, Walter MEMT, Hernandez-Rosales M, Brigido MM, editors. Advances in bioinformatics and computational biology. Cham: Springer; 2021. pp. 53–64.
  • 14. Warnow TJ. Tree compatibility and inferring evolutionary history. J Algorithms. 1994;16:388–407. doi: 10.1006/jagm.1994.1018.
  • 15. Gusfield D. Efficient algorithms for inferring evolutionary trees. Networks. 1991;21:19–28. doi: 10.1002/net.3230210104.
  • 16. Aho AV, Hopcroft JE, Ullman JD. The design and analysis of computer algorithms. Boston: Addison-Wesley, Reading; 1974.
  • 17. Jansson J, Shen C, Sung W-K. Improved algorithms for constructing consensus trees. In: Khanna S, editor. Proceedings of the 2013 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Philadelphia, PA: Soc. Indust. Appl. Math.; 2013. pp. 1800–1813. doi: 10.1137/1.9781611973105.129.
  • 18. Jansson J, Shen C, Sung W-K. Improved algorithms for constructing consensus trees. J ACM. 2016;63:1–24. doi: 10.1145/2925985.
  • 19. Hellmuth M, Schaller D, Stadler PF. Compatibility of partitions with trees, hierarchies, and split systems. 2021. Submitted; arXiv:2104.14146.
  • 20. Bryant D, Steel M. Extension operations on sets of leaf-labeled trees. Adv Appl Math. 1995;16:425–453. doi: 10.1006/aama.1995.1020.
  • 21. Semple C. Reconstructing minimal rooted trees. Discr Appl Math. 2003;127:489–503. doi: 10.1016/S0166-218X(02)00250-0.
  • 22. Jansson J, Lemence RS, Lingas A. The complexity of inferring a minimally resolved phylogenetic supertree. SIAM J Comput. 2012;41:272–291. doi: 10.1137/100811489.
  • 23. Grünewald S, Steel M, Swenson MS. Closure operations in phylogenetics. Math Biosci. 2007;208:521–537. doi: 10.1016/j.mbs.2006.11.005.
  • 24. Seemann CR, Hellmuth M. The matroid structure of representative triple sets and triple-closure computation. Eur J Comb. 2018;70:384–407. doi: 10.1016/j.ejc.2018.02.013.
  • 25. Bremer K. Combinable component consensus. Cladistics. 1990;6(4):369–372. doi: 10.1111/j.1096-0031.1990.tb00551.x.
  • 26. Day WHE, McMorris FR. Axiomatic Consensus Theory in Group Choice and Bioinformatics. Providence, RI: Society for Industrial and Applied Mathematics; 2003. doi: 10.1137/1.9780898717501.
  • 27. Dong J, Fernández-Baca D, McMorris FR, Powers RC. An axiomatic study of majority-rule (+) and associated consensus functions on hierarchies. Discr Appl Math. 2011;159:2038–2044. doi: 10.1016/j.dam.2011.07.002.
  • 28. Holm J, de Lichtenberg K, Thorup M. Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. J ACM. 2001;48:723–760. doi: 10.1145/502090.502095.
  • 29. Fernández-Baca D, Liu L. Tree compatibility, incomplete directed perfect phylogeny, and dynamic graph connectivity: an experimental study. Algorithms. 2019;12(3):53. doi: 10.3390/a12030053.
  • 30. Bender MA, Farach-Colton M, Pemmasani G, Skiena S, Sumazin P. Lowest common ancestors in trees and directed acyclic graphs. J Algorithms. 2005;57(2):75–94. doi: 10.1016/j.jalgor.2005.08.001.
  • 31. Bryant D. A classification of consensus methods for phylogenetics. In: Janowitz MF, Lapointe F-J, McMorris FR, Mirkin B, Roberts FS, editors. Bioconsensus. Providence, RI: Amer. Math. Soc.; 2003. pp. 163–183.
  • 32. Byrka J, Guillemot S, Jansson J. New results on optimizing rooted triplets consistency. Discr Appl Math. 2010;158:1136–1147. doi: 10.1016/j.dam.2010.03.004.

Articles from Algorithms for Molecular Biology : AMB are provided here courtesy of BMC