. 2022 Mar 21;198(1):811–853. doi: 10.1007/s10107-022-01790-y

A duality based 2-approximation algorithm for maximum agreement forest

Neil Olver 1,2, Frans Schalekamp 3, Suzanne van der Ster 5, Leen Stougie 2,6,7, Anke van Zuylen 4
PMCID: PMC9945189  PMID: 36845754

Abstract

We give a 2-approximation algorithm for the Maximum Agreement Forest problem on two rooted binary trees. This NP-hard problem has been studied extensively in the past two decades, since it can be used to compute the rooted Subtree Prune-and-Regraft (rSPR) distance between two phylogenetic trees. Our algorithm is combinatorial and its running time is quadratic in the input size. To prove the approximation guarantee, we construct a feasible dual solution for a novel exponential-size linear programming formulation. In addition, we show this linear program has a smaller integrality gap than previously known formulations, and we give an equivalent compact formulation, showing that it can be solved in polynomial time.

Keywords: Maximum agreement forest, Phylogenetic tree, SPR distance, Subtree prune-and-regraft distance, Computational biology

Introduction

Evolutionary relationships are often modeled by a rooted tree, where the leaves represent a set of species, and internal nodes are (putative) common ancestors of the leaves below the internal node. Such phylogenetic trees date back to Darwin [11], who used them in his notebook to elucidate his thoughts on evolution. For an introduction to phylogenetic trees we refer to [12, 24].

The topology of phylogenetic trees can be based on different sources of data, e.g., morphological data, behavioral data, genetic data, etc., which can lead to different phylogenetic trees on the same set of species. Such partly incompatible trees may actually be unavoidable: there exist non-tree-like evolutionary processes that preclude the existence of a phylogenetic tree, so-called reticulation events, such as hybridization, recombination and horizontal gene transfer [17, 18]. Irrespective of the cause of the conflict, the natural question arises of how to quantify the dissimilarity between such trees. Especially in the context of reticulation, a particularly meaningful measure for comparing phylogenetic trees is the Subtree Prune-and-Regraft distance for rooted trees (rSPR-distance), which provides a lower bound on a certain type of these non-tree evolutionary events. The problem of finding the exact value of this measure for a set of species motivated the formulation of the Maximum Agreement Forest Problem (MAF) by Hein, Jiang, Wang and Zhang [16].

In the definition of MAF by Hein et al. we are given two rooted binary trees and a bijection from the leaves of each tree to a given set of labels L. The problem is to find a minimum set of edges to be deleted from the two trees, so that the rooted trees in the resulting two forests form isomorphic pairs. Here, and throughout the paper, two rooted trees are said to be isomorphic if (i) the labelled nodes of the two trees have the same subset of labels, say A, and (ii) the two trees give rise to the same tree if we take the minimal subtree spanning the nodes labelled by A and repeatedly identify a node with its child if it only has a single child.

Since the introduction by Hein et al. in [16], in which they also proved NP-hardness, MAF has been extensively studied, mostly in its version of two rooted binary input trees. After Allen and Steel [1] pointed out that the claim by Hein et al. that solving MAF on two rooted directed trees computes the rSPR-distance between the trees is incorrect, Bordewich and Semple [5] presented a subtle redefinition of MAF, whose optimal value does coincide with the rSPR-distance. In this redefinition, the set of labels is extended with a label ρ, which is assigned to the roots of the two input trees. As before, we want to find a minimum set of edges so that the trees in the resulting forests form isomorphic pairs; note that the fact that the roots of the input trees have labels means that now there must be an isomorphic pair of trees in the resulting forests containing the (original) roots. This has now become the standard definition of MAF, for which Bordewich and Semple [5] showed that NP-hardness still holds, and Rodrigues [20] showed that it is in fact APX-hard.

The problem has attracted a lot of attention, and indeed has become a canonical problem in the field of phylogenetic networks. Many variants of MAF have been studied, including versions where the input consists of more than two trees [6, 7], and where the input trees are unrooted [29, 30] or non-binary [22, 27]. We will concentrate on MAF in its classical form with two rooted binary input trees, and we will be concerned with the worst-case approximability of the problem. The literature includes many other approaches to the problem, including fixed-parameter tractable algorithms (e.g., [28, 30]) and integer linear programming [31, 32]. But the quest for better approximation algorithms has become central within the MAF literature.

The first approximation algorithm for the problem with a fully correct analysis was given by Bonet et al. [3] in 2006; they obtain an approximation factor of 5, with a running time that is linear in the number of leaves. (The algorithm follows closely the approach taken by Hein et al. [16] and Rodrigues et al. [21], who both claimed 3-approximation algorithms; but both papers turned out to have flaws in the analysis.) This was followed by a sequence of three papers, each obtaining a 3-approximation algorithm. The first, by Bordewich et al. [4], had a running time of O(n⁵), where n denotes the number of leaves; Rodrigues et al. [22] substantially improved the running time to O(n²). Finally, Whidden and Zeh [28] simplified the analysis and improved the running time to O(n), matching the running time of the previous 5-approximation.

These algorithms all take a similar approach, and make decisions that are in a certain sense based on “local” information. We focus here on the algorithm and analysis of Whidden and Zeh [28] (based on [22]), since it is the cleanest. The algorithm maintains a tree T1 and a forest T2; initially, these are precisely the two input trees. T1 and T2 always have the same leaf set, which shrinks as the algorithm progresses; a leaf is removed when the part of the algorithm’s solution involving that leaf has been determined. The algorithm proceeds by considering any pair of leaves a, b in T1 that are siblings (two nodes are siblings in a tree if they have the same parent). Consider their situation in T2. If they are also siblings in T2, then there is clearly no reason to separate a and b in a solution, and they can be contracted together in both T1 and T2 to yield a smaller instance. Otherwise, the algorithm deletes the edges directly above a and b in both T1 and T2, resulting in two “trivial” trees consisting of a single leaf each that can essentially be removed from the instance; and also makes one further cut in T2, which will be the edge directly above a sibling of either a or b in T2. The process of merging and deleting edges is then continued on the new instance, until eventually a valid solution is found. (Note that the algorithm might at first glance appear to create many trivial trees consisting of only a single leaf; however, single leaves later in the algorithm may represent larger collections of leaves that have been merged together in earlier iterations.) A fairly direct combinatorial charging argument is used to show that in each iteration of the algorithm (where the algorithm makes three cuts), at least one edge deleted in the optimal solution can be uniquely charged for this iteration.

The next improvement in approximation factor, to a 2.5-approximation (at the cost of an increased quadratic running time) came from Shi et al. [25]. Their approach, like the 3-approximation algorithm described above, starts by choosing a pair of leaves a, b that are siblings in the first tree. However, it pays more attention to the configuration of the second tree and the positioning of a and b within it when deciding what edges to cut. Since larger structures are considered, the analysis is substantially more involved. A further improvement to a factor of 7/3 was then obtained by Chen, Machida and Wang [10]; their algorithm also runs in quadratic time. Again, larger combinatorial structures play a role; further, it does not begin with an arbitrary pair of sibling leaves in the first tree, but chooses the pair more carefully.

The first 2-approximation algorithm was given by a subset of the authors of the current work [23] (independently and essentially concurrently with the 7/3-approximation algorithm of Chen et al. [10]). They do not explicitly discuss (or attempt to optimize) the running time of the algorithm, beyond showing that it is polynomial time. Subsequently, Chen, Harada and Wang [8] (see also [9]), building on the 7/3-approximation algorithm [10], gave a very different factor 2 approximation algorithm, with a cubic running time.

The 2-approximation algorithm presented in the current paper may be viewed as the full version of the algorithm in [23]. However, while the algorithm presented here is similar in spirit, it differs in many details, and the exposition is entirely new. Although the algorithm and analysis remain quite subtle, this version is significantly shorter and clearer. Moreover, we show how our algorithm can, with some care, be implemented in quadratic time ([23] discusses only a polynomial time bound). This improves over the cubic running time of Chen et al. [8].

Our 2-approximation algorithm differs from previous works in two key aspects.

  • Our algorithm takes a global approach; choices made by the algorithm may depend on large parts of the instance. This is in contrast to the “local” algorithms discussed above. The cubic 2-approximation by Chen et al. [8] also requires non-local substructures, suggesting this may be a crucial factor in achieving this approximation bound.

  • We introduce a novel integer linear programming formulation for the analysis. Our approximation guarantee is proved by constructing a feasible solution to the dual of this linear program, rather than arguing locally about the objective of the optimal solution. We thus bring a powerful tool from the theory of approximation algorithms to bear, one that has not been exploited in the study of MAF so far.

    We use the integer linear programming formulation, and in particular, its linear relaxation, only in our analysis. The algorithm itself is purely combinatorial. It is essentially a dual-fitting algorithm: the analysis explicitly constructs a dual solution with objective value at least half the cost of the primal solution returned by the algorithm.
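The dual-fitting argument reduces to a chain of inequalities. As a sketch, write ALG for the cost of the partition returned by the algorithm, D for the objective value of the constructed dual solution, LP* for the optimal value of the linear relaxation, and OPT for the optimal MAF cost. These symbols are introduced here for illustration only and are not the paper's notation; it is also assumed that the integer program models MAF exactly, so that its optimum equals OPT.

```latex
\[
  \tfrac{1}{2}\,\mathrm{ALG} \;\le\; D \;\le\; \mathrm{LP}^{*} \;\le\; \mathrm{OPT}
  \qquad\Longrightarrow\qquad
  \mathrm{ALG} \;\le\; 2\,\mathrm{OPT}.
\]
```

The first inequality is the property the analysis establishes for the constructed dual solution, the second is weak duality for a minimization LP, and the third holds because the LP is a relaxation of the integer program.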

Although we do not need to solve the linear programming (LP) relaxation, it is an interesting object of study, and it is natural to ask if it can indeed be efficiently optimized. This is not immediately clear, since the formulation has an exponential number of variables. Being able to solve the LP may, for example, be of future utility in obtaining better approximation guarantees using LP-rounding techniques. We show that the relaxation can be reformulated as a compact LP, with only a polynomial number of variables and constraints. This immediately implies that it can be optimized efficiently (in polynomial time). This may make the integer linear program amenable for use with commercial integer programming solvers. There is a previous formulation due to Wu [31], but our formulation is significantly stronger: the integrality gap of the relaxation of Wu is at least 3.2, whereas for ours we show it is at most 2, and in fact the worst example that we are aware of has integrality gap 1.25 (see the Appendix).

We have implemented and tested our algorithm, as well as the compact formulation [19]. The implementation has been designed so that it is easy to step through the algorithm and explore its behaviour on a given instance; the reader may find it helpful when examining the technical details of the algorithm.

Outline. We define the problem and introduce necessary notation in Sect. 2. Section 3 describes the algorithm, and proves that it produces a feasible solution to MAF. In Sect. 4, we introduce the linear program, and describe a feasible solution to its dual that can be maintained by the algorithm. We then show the objective value of this dual solution is always at least half the objective value of the MAF solution, which proves the approximation ratio of 2. In Sect. 5, we show a compact formulation of the (exponential sized) linear program used for the analysis. Section 6 gives some concluding remarks and directions for further research. Finally, in the appendices, we provide the details on how to implement our algorithm so that it runs in time quadratic in the size of the input, and we give an example that shows that a previously known integer linear program [31] is not as strong as the formulation introduced here.

Preliminaries

The input to the Maximum Agreement Forest problem (MAF) consists of two rooted binary trees T1 and T2. There is a bijection from the leaves of each tree to a given set of labels L.

Let V1 and V2 denote the node sets of T1 and T2 respectively, and let V = V1 ∪ V2. We will take a small liberty, and treat L as being a subset of V1 and a subset of V2. We call all nodes in V \ L internal nodes. We let L(u) denote the set of leaves that are descendants of a node u ∈ V.

We will use the following notational conventions: we use u and v to denote arbitrary nodes (including leaves); if the node we refer to is an internal node in V2, we will use u^ and v^; and we use the letters x, y and w to refer to leaves.

For A ⊆ L we use Vi[A] to denote the set of nodes in Ti that lie on a path between any two leaves in A for i ∈ {1,2}, and define V[A] := V1[A] ∪ V2[A].

Definition 1

We say that a set A ⊆ L covers a node u ∈ V if u ∈ V[A]. We say that A, A′ ⊆ L overlap if V[A] ∩ V[A′] ≠ ∅; we also say that A overlaps A′ in U, for U ⊆ V, if V[A] ∩ V[A′] ∩ U ≠ ∅. We say a partition P of L overlaps in U ⊆ V if there exist A, A′ ∈ P, A ≠ A′, such that A and A′ overlap in U.
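Definition 1 can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the paper: it assumes each tree is stored as a child-to-parent dictionary, that internal nodes of the two trees carry distinct names, and that Vi[{x}] = {x} for a single leaf x. The names `T1`, `T2`, `spanned_nodes`, `covers` and `overlap` are hypothetical.

```python
def path_to_root(parent, v):
    """Nodes from v up to the root of a tree given as a child -> parent map."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def spanned_nodes(parent, leaves):
    """Vi[A]: all nodes on a path between two leaves of A (empty for empty A)."""
    leaves = list(leaves)
    if not leaves:
        return set()
    # Ancestors (or the node itself) shared by every leaf of A.
    common = set(path_to_root(parent, leaves[0]))
    for x in leaves[1:]:
        common &= set(path_to_root(parent, x))
    nodes = set()
    for x in leaves:
        # Climb from x and stop at the first shared ancestor, which is lca(A).
        for v in path_to_root(parent, x):
            nodes.add(v)
            if v in common:
                break
    return nodes


def covers(parent1, parent2, A, u):
    """A covers u if u lies in V[A] = V1[A] ∪ V2[A]."""
    return u in spanned_nodes(parent1, A) | spanned_nodes(parent2, A)


def overlap(parent1, parent2, A, B, U=None):
    """A and B overlap (optionally: overlap in the node set U)."""
    shared = (spanned_nodes(parent1, A) & spanned_nodes(parent1, B)) | (
        spanned_nodes(parent2, A) & spanned_nodes(parent2, B)
    )
    if U is not None:
        shared &= set(U)
    return bool(shared)


# Toy instance (an assumption for illustration):
# T1 = ((a,b),(c,d)) and T2 = ((a,c),(b,d)).
T1 = {"a": "p", "b": "p", "c": "q", "d": "q", "p": "r", "q": "r", "r": None}
T2 = {"a": "s", "c": "s", "b": "t", "d": "t", "s": "z", "t": "z", "z": None}
V1, V2 = set(T1), set(T2)

print(overlap(T1, T2, {"a", "b"}, {"c", "d"}, U=V2))
print(overlap(T1, T2, {"a", "b"}, {"c", "d"}, U=V1))
```

In this toy instance {a,b} and {c,d} occupy disjoint subtrees of T1, but the paths joining them in T2 share internal nodes, so they overlap in V2 and not in V1.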

To give some intuition for the use of this definition, recall from the introduction that the goal of the MAF problem is to find a minimum set of edges to be deleted from the two input trees, so that the trees in the resulting two forests can be matched up into isomorphic pairs. One of the requirements for a pair of trees to be isomorphic is that they have the same set of labelled nodes. In other words, the trees in the two forests induce the same partition P of L, and the fact that the forests are formed by deleting edges from the input trees means that no two sets in P overlap.

Next, we will give a definition that allows us to precisely express the other requirement for a pair of trees to be isomorphic. For A ⊆ L, we let lcai(A) denote the lowest common ancestor of A in Ti. We will sometimes omit braces of explicit sets and write, e.g., lca1(x1,x2,x3) instead of lca1({x1,x2,x3}). For nodes u, v in the same tree, we use u ≺ v to indicate that u is a descendant of v and u ⪯ v if u is equal to v or a descendant of v.

Definition 2

A set L′ ⊆ L is compatible if for all x1, x2, x3 ∈ L′

lca1(x1,x2) ≺ lca1(x1,x2,x3) ⇔ lca2(x1,x2) ≺ lca2(x1,x2,x3).

We call a set of leaves incompatible if it is not a compatible set. Note that L′ ⊆ L is compatible precisely if the minimum subtree spanning L′ in T1 and the minimum subtree spanning L′ in T2 are isomorphic.
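The triple condition of Definition 2 translates directly into a brute-force check. The following Python sketch is an illustration under stated assumptions, not the paper's implementation: trees are child-to-parent dictionaries, and the function names and the toy trees are hypothetical. It runs in cubic time in the size of the leaf set and makes no attempt at efficiency.

```python
from itertools import permutations


def path_to_root(parent, v):
    """Nodes from v up to the root of a tree given as a child -> parent map."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    """Lowest common ancestor of a nonempty collection of nodes."""
    nodes = list(nodes)
    common = set(path_to_root(parent, nodes[0]))
    for x in nodes[1:]:
        common &= set(path_to_root(parent, x))
    # Walking up from any node, the first shared ancestor is the lca.
    return next(v for v in path_to_root(parent, nodes[0]) if v in common)


def is_compatible(parent1, parent2, leaves):
    """Check the triple condition of Definition 2 for every ordered triple."""
    for x1, x2, x3 in permutations(leaves, 3):
        # lca(x1,x2) is always equal to or below lca(x1,x2,x3), so
        # "strictly below" is the same as "different from".
        below1 = lca(parent1, [x1, x2]) != lca(parent1, [x1, x2, x3])
        below2 = lca(parent2, [x1, x2]) != lca(parent2, [x1, x2, x3])
        if below1 != below2:
            return False
    return True


def is_k_compatible(parent1, parent2, leaves, K):
    """K-compatibility: the restriction of the set to K is compatible."""
    return is_compatible(parent1, parent2, set(leaves) & set(K))


# Toy instance (an assumption for illustration):
# T1 = (((a,b),c),d) and T2 = (((a,c),b),d).
T1 = {"a": "p", "b": "p", "p": "q", "c": "q", "q": "r", "d": "r", "r": None}
T2 = {"a": "s", "c": "s", "s": "t", "b": "t", "t": "z", "d": "z", "z": None}

print(is_compatible(T1, T2, {"a", "b", "c"}))
print(is_compatible(T1, T2, {"a", "b", "d"}))
```

In the toy trees, a and b are siblings in T1 but a and c are siblings in T2, so {a,b,c} fails the condition while {a,b,d} satisfies it. Triples with repeated leaves are skipped because both sides of the equivalence then take the same truth value.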

A feasible solution to MAF is a partition P = {A1, A2, …, Ak} of L such that every component Ai is compatible, and Ai does not overlap Aj, for any i ≠ j. The cost of this solution is defined to be |P|-1. This cost corresponds to the number of edges that must be deleted from T1, as well as the same number from T2, so that in both of the resulting forests, each Ai ∈ P is the leaf set of a single tree.
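Feasibility and cost can be checked mechanically from the two definitions above. The sketch below is illustrative and rests on assumptions not made in the paper: child-to-parent dictionaries for the trees, distinct internal node names, hypothetical function names, and a brute-force strategy chosen for clarity rather than speed.

```python
from itertools import combinations, permutations


def path_to_root(parent, v):
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    nodes = list(nodes)
    common = set(path_to_root(parent, nodes[0]))
    for x in nodes[1:]:
        common &= set(path_to_root(parent, x))
    return next(v for v in path_to_root(parent, nodes[0]) if v in common)


def spanned_nodes(parent, leaves):
    """Vi[A]: nodes on the paths from each leaf of A up to lca(A)."""
    leaves = list(leaves)
    if not leaves:
        return set()
    top = lca(parent, leaves)
    nodes = set()
    for x in leaves:
        path = path_to_root(parent, x)
        nodes.update(path[: path.index(top) + 1])
    return nodes


def is_compatible(parent1, parent2, leaves):
    for x1, x2, x3 in permutations(leaves, 3):
        below1 = lca(parent1, [x1, x2]) != lca(parent1, [x1, x2, x3])
        below2 = lca(parent2, [x1, x2]) != lca(parent2, [x1, x2, x3])
        if below1 != below2:
            return False
    return True


def is_feasible(parent1, parent2, partition):
    """Every component compatible, and no two components overlap."""
    if not all(is_compatible(parent1, parent2, A) for A in partition):
        return False
    for A, B in combinations(partition, 2):
        for parent in (parent1, parent2):
            if spanned_nodes(parent, A) & spanned_nodes(parent, B):
                return False
    return True


def cost(partition):
    """Cost |P| - 1 of a solution."""
    return len(partition) - 1


# Toy instance (an assumption): T1 = (((a,b),c),d), T2 = (((a,c),b),d).
T1 = {"a": "p", "b": "p", "p": "q", "c": "q", "q": "r", "d": "r", "r": None}
T2 = {"a": "s", "c": "s", "s": "t", "b": "t", "t": "z", "d": "z", "z": None}

for P in ([{"a", "b", "c", "d"}], [{"a", "c"}, {"b", "d"}], [{"a", "b", "d"}, {"c"}]):
    print(P, is_feasible(T1, T2, P), cost(P))
```

On this instance the single-component partition fails compatibility, {{a,c},{b,d}} fails because its components overlap in T1, and {{a,b,d},{c}} is feasible with cost 1.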

Remark

In order for MAF to correspond to the rSPR distance, it is necessary to add an additional label ρ to L (see figure below), that is assigned to the roots of T1 and T2. This is the distinction between the original definition of MAF by Hein et al. [16] and the correction by Bordewich and Semple [5]. To maintain the property that only leaves have labels, we instead add a new root to T1 and T2, which has as its two children a leaf labelled ρ and the original root. We simply assume that this addition is already included in the input instance, after which there is no need to distinguish this additional leaf from the others.

[Figure not reproduced: T1 and T2 each extended with a new root whose two children are a leaf labelled ρ and the original root]

When we describe and analyze our algorithm, the following extended notion of compatibility is convenient.

Definition 3

Given K ⊆ L, we say a set L′ ⊆ L is K-compatible if L′ ∩ K is compatible. A partition P = {A1, A2, …, Ak} of L is K-compatible if Ai is K-compatible for all i = 1, 2, …, k.

The Red-Blue algorithm

The algorithm maintains a partition P of L, which at the end of the algorithm will correspond to a feasible solution to MAF. The algorithm will maintain the invariant that P does not overlap in V2. Observe that this is equivalent to defining P to be the leaf sets of the trees in a forest, obtained by deleting edges from T2. Initially P={L}.

Very informally, an iteration begins by coloring the leaves with three colors: red, blue, and white. The coloring is such that in T1, there is a node u that has the red and blue leaves as its descendants; the set B of blue leaves is the set of “left” descendants of u and the set R of red leaves is the set of “right” descendants of u. The remaining leaves W are white. Furthermore, it will be the case that the current partition is feasible for the problem restricted to R and for the problem restricted to B. The current iteration will work to make the partition feasible for the problem restricted to R ∪ B (in fact, it will be feasible for the problem restricted to R ∪ B ∪ {w} for all w ∈ W). Observe that a forest corresponding to a feasible solution to the full instance can have at most one tree that has leaves of multiple colors, because if there were two such trees then their leaf sets overlap on node u in T1. Also, a multicolored tree in a feasible solution must be such that there is a node u^ in T2 such that (i) no white leaf of the tree is a descendant of u^, and (ii) the blue and red leaves of the tree are left and right descendants of u^. We say the component is (R ∪ B)-compatible if (ii) holds. The iteration will refine the multicolored components of the partition into (all but one) unicolored components. The natural idea would be to do this by intersecting each (or all but one) component with each color, but then the resulting partition might overlap in V2; if not, we call the original partition splittable. So we first refine the partition such that it is splittable. In order to achieve the desired approximation guarantee, we need to be careful about the ordering of the steps we take to make the partition splittable, so that we can simultaneously maintain a feasible dual LP solution with an objective value that tracks the number of components; we do this by first making it (R ∪ B)-compatible (which works toward splittability as well). Once the partition is (R ∪ B)-compatible and splittable, we refine the partition by splitting all but at most one component into unicolored components. Finally, we look for a split that can be undone; the careful order in which the components are refined also serves to guarantee that such a merging of components is possible where needed to prove the approximation guarantee. We now give a precise definition, using the notation from the previous section.

As explained above, our algorithm works towards feasibility by iteratively refining P, focusing each iteration on a set of leaves L(u) for some u ∈ V1; u is a node such that the current partition is infeasible for L(u) in some (quite narrowly defined) way. At the end of the iteration the solution is feasible if we restrict our attention to L(u), and even if we consider L(u) ∪ {w} for any arbitrary w ∈ L \ L(u).

We use the following definition to specify which sets L(u) the algorithm considers.

Definition 4

Given an infeasible partition P that does not overlap in V2, we call u ∈ V1 a root of infeasibility if at least one of the following holds:

  (a) P is not L(u)-compatible;

  (b) P overlaps in V1[L(u)];

  (c) P is L(u)-compatible, and there exists a component A ∈ P such that A \ L(u) ≠ ∅ and (A ∩ L(u)) ∪ {w} is incompatible for all w ∈ A \ L(u).

While the first two conditions can be naturally interpreted as failures of feasibility within V1[L(u)], condition (c) is more subtle. It says that while A is L(u)-compatible, every leaf w ∈ A \ L(u) provides a certificate that A is in fact incompatible. A different view of this is that every leaf in A \ L(u) lies below lca2(A ∩ L(u)) in T2. We note that replacing condition (c) by requiring only the existence of at least one such leaf leads to an algorithm that appears to be “too greedy”; more precisely, the approximation guarantee we can prove in that case is worse than 2.
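The three conditions of Definition 4 can be tested one after another. The Python sketch below is a brute-force illustration under stated assumptions and is not the quadratic-time implementation described in the paper: trees are child-to-parent dictionaries, the helper names and the toy trees are hypothetical, and the function does not verify the preconditions that P is infeasible and does not overlap in V2. It returns the label of the first condition that holds, or None.

```python
from itertools import combinations, permutations


def path_to_root(parent, v):
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    nodes = list(nodes)
    common = set(path_to_root(parent, nodes[0]))
    for x in nodes[1:]:
        common &= set(path_to_root(parent, x))
    return next(v for v in path_to_root(parent, nodes[0]) if v in common)


def spanned_nodes(parent, leaves):
    leaves = list(leaves)
    if not leaves:
        return set()
    top = lca(parent, leaves)
    nodes = set()
    for x in leaves:
        path = path_to_root(parent, x)
        nodes.update(path[: path.index(top) + 1])
    return nodes


def is_compatible(parent1, parent2, leaves):
    for x1, x2, x3 in permutations(leaves, 3):
        below1 = lca(parent1, [x1, x2]) != lca(parent1, [x1, x2, x3])
        below2 = lca(parent2, [x1, x2]) != lca(parent2, [x1, x2, x3])
        if below1 != below2:
            return False
    return True


def infeasibility_condition(parent1, parent2, partition, u):
    """Return "a", "b" or "c" if u meets that condition of Definition 4, else None."""
    all_leaves = set().union(*partition)
    Lu = {x for x in all_leaves if u in path_to_root(parent1, x)}  # L(u) in T1
    # (a) some component is not L(u)-compatible.
    if any(not is_compatible(parent1, parent2, A & Lu) for A in partition):
        return "a"
    # (b) two components overlap inside V1[L(u)].
    region = spanned_nodes(parent1, Lu)
    for A, B in combinations(partition, 2):
        if spanned_nodes(parent1, A) & spanned_nodes(parent1, B) & region:
            return "b"
    # (c) some component has leaves outside L(u), and every such leaf
    #     makes its restriction to L(u) incompatible.
    for A in partition:
        outside = A - Lu
        if outside and all(
            not is_compatible(parent1, parent2, (A & Lu) | {w}) for w in outside
        ):
            return "c"
    return None


# Toy instance (an assumption): T1 = (((a,b),c),d), T2 = (((a,c),b),d).
T1 = {"a": "p", "b": "p", "p": "q", "c": "q", "q": "r", "d": "r", "r": None}
T2 = {"a": "s", "c": "s", "s": "t", "b": "t", "t": "z", "d": "z", "z": None}

print(infeasibility_condition(T1, T2, [{"a", "b", "c", "d"}], "q"))
print(infeasibility_condition(T1, T2, [{"a", "c"}, {"b", "d"}], "p"))
print(infeasibility_condition(T1, T2, [{"a", "b", "c"}, {"d"}], "p"))
```

With P = {L} in this toy instance, node p (the parent of a and b in T1) meets none of the conditions because the outside leaf d does not certify incompatibility, whereas its parent q meets (a); q is therefore a lowest root of infeasibility for that partition.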

Observe that if u ∈ V1 is a root of infeasibility, then any ancestor of u is a root of infeasibility as well. We will say an internal node u in tree Ti is the “lowest” node with property Γ if property Γ does not hold for any of u’s descendants in Ti. The algorithm will thus identify a lowest node u ∈ V1 that is a root of infeasibility.

We illustrate the three conditions of a root of infeasibility in Fig. 1. R1, R2, B1, B2, W1, W2 and W3 represent nonempty subtrees that appear in both T1 and T2; for the examples it suffices to think of these as a subtree consisting of a single leaf. We will adopt this viewpoint and, with a slight abuse of notation, we will refer to the labels of these leaves as R1, R2, B1, B2, W1, W2 and W3, respectively. If P = {L}, u satisfies (a). Note that u is indeed a lowest root of infeasibility, since {R1,R2,W3} and {B1,B2,W3} are compatible sets, so uℓ and ur do not satisfy (c) (nor (a) or (b)). If P = {{B1},{B2,W1},{R1,R2,W2,W3}}, node u satisfies (b). Again, u is a lowest root of infeasibility (clearly uℓ and ur do not satisfy (a) or (b); they also do not satisfy (c) since {B2,W1} is compatible, as is {R1,R2,W3}). Finally, if P = {{R1},{B1,B2,W1,R2,W2},{W3}}, node u satisfies (c). Observe that in this case u is again a lowest root of infeasibility. For u′ ∈ {uℓ, ur}, (a) and (b) are clearly not satisfied; neither is (c) because the only A ∈ P such that A \ L(u′) ≠ ∅ and A ∩ L(u′) ≠ ∅ is A = {B1,B2,W1,R2,W2}, but then (A ∩ L(u′)) ∪ {w} is not incompatible for w = W2 (and also not incompatible for w = W1 if u′ = ur).

Fig. 1. If P = {L}, then node u satisfies case (a) of Definition 4; if P = {{B1}, {B2,W1}, {R1,R2,W2,W3}}, it satisfies case (b); and if P = {{R1}, {B1,B2,W1,R2,W2}, {W3}}, it satisfies case (c)

Given a root of infeasibility u ∈ T1, we partition L into R ∪ B ∪ W, where R = L(ur) and B = L(uℓ) for the two children ur and uℓ of u. We will refer to this partition as a coloring of the leaves; we will refer to the leaves in R as red leaves, the leaves in B as blue leaves and the leaves in W as white leaves. We note that ur and uℓ are lca1(R) and lca1(B), respectively, and we use these interchangeably. We call a component of P tricolored if it has a nonempty intersection with R, B and W, and bicolored if it has a nonempty intersection with exactly two of the sets R, B, W. A component is called multicolored if it is either tricolored or bicolored, and unicolored otherwise.
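The coloring and the resulting classification of components are simple to compute. The sketch below is illustrative only: it assumes T1 is given as a dictionary mapping each internal node to its ordered pair of children, so that "left" and "right" are just the order in which the children are listed, and the function names and the toy tree are hypothetical.

```python
def leaves_below(children, v):
    """L(v): the leaves that are descendants of v (a leaf is its own L(v))."""
    if v not in children:
        return {v}
    left, right = children[v]
    return leaves_below(children, left) | leaves_below(children, right)


def coloring(children, root, u):
    """Return (R, B, W) for an internal node u of T1."""
    u_left, u_right = children[u]
    B = leaves_below(children, u_left)    # blue: leaves below the left child
    R = leaves_below(children, u_right)   # red: leaves below the right child
    W = leaves_below(children, root) - R - B  # white: everything else
    return R, B, W


def classify(component, R, B, W):
    """Name a nonempty component by the number of colors it intersects."""
    k = sum(1 for C in (R, B, W) if component & C)
    return {1: "unicolored", 2: "bicolored", 3: "tricolored"}[k]


# Toy tree (an assumption for illustration): T1 = (((b1,b2),(r1,r2)),w1).
T1_children = {
    "root": ("u", "w1"),
    "u": ("ul", "ur"),
    "ul": ("b1", "b2"),
    "ur": ("r1", "r2"),
}

R, B, W = coloring(T1_children, "root", "u")
for component in ({"b1", "r1", "w1"}, {"b2", "w1"}, {"r2"}):
    print(sorted(component), classify(component, R, B, W))
```

Both bicolored and tricolored components count as multicolored in the terminology above.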

Observation 1

Let u be a lowest root of infeasibility for P, and consider the coloring (R, B, W), where R = L(ur) and B = L(uℓ) for the two children ur and uℓ of u. Then the set of multicolored components of P consists of either at most two bicolored components or exactly one tricolored component.

Proof

If u is a lowest root of infeasibility, P does not overlap in V1[R] and V1[B], and so at most one component of P covers ur = lca1(R), and at most one covers uℓ = lca1(B). Since any multicolored component covers at least one of lca1(R) and lca1(B), there can be at most two multicolored components. Furthermore, because any tricolored component covers both lca1(R) and lca1(B), if there is a tricolored component there can be no other multicolored component.

We note that the above observation can be refined; it is possible to show that P contains either one tricolored component or exactly two bicolored components; see Lemma 12 in Sect. 4.3.

We now give the overall algorithm. In the description, but also in the descriptions of the various procedures that follow, the markers in front of certain lines will be used to refer to these lines in the analysis in Sect. 4.2.

[Algorithm listing not reproduced: the Red-Blue Algorithm]

The various procedures in the Red-Blue Algorithm will be described in detail in the subsequent subsections, along with lemmas regarding the properties they ensure. For now, we give a very high-level description.

An iteration of the main while-loop starts by finding a lowest root of infeasibility u, yielding a coloring (R, B, W) of the leaves; if there is no root of infeasibility, then the current partition is feasible, and the main loop terminates. The goal of the iteration, essentially, is to ensure that by the end of the iteration, u is no longer a root of infeasibility, while maintaining the invariant that the partition does not overlap in V2. Until the very end of the algorithm, the partition is only ever refined; since each iteration must modify the partition, the number of iterations is bounded by |L|. (Alternatively, our analysis shows that if u is chosen for some iteration of the algorithm, then from the end of the iteration until the very end of the algorithm, u will never again be a root of infeasibility.)

The process of refining the partition to make u no longer a root of infeasibility proceeds in two main stages. First, the procedure Make-(R ∪ B)-compatible refines the partition if necessary so that it is (R ∪ B)-compatible, i.e., so that condition (a) fails to hold. The procedures Split and Make-Splittable will together ensure that conditions (b) and (c) also both fail to hold, so that u is no longer a root of infeasibility at the end of the iteration. In particular, they ensure that the partition does not overlap in V1[L(u)], and that the final partition is (R ∪ B ∪ {w})-compatible for every w ∈ L (which is stronger than (c) not holding).

Finally, Find-Merge-Pairs and Merge-Components are needed for the approximation bound only. All the other steps in the algorithm only refine the current partition. In some particular cases, it is possible and necessary to undo some of these refinements. This is done in a careful way at the very end of the algorithm by Merge-Components, using information prepared by Find-Merge-Pairs. The reason that the merges are done at the end, rather than during the main loop, is primarily for analysis purposes.

In order to simplify the statement of the lemmas, we will make statements like “let P′ be the partition after ProcedureName(P,(R,B,W))”. This implicitly assumes that (R, B, W) was a coloring chosen in the beginning of the current iteration of the Red-Blue Algorithm (and thus, that lca1(R ∪ B) was a lowest root of infeasibility at that moment), and that P′ is the partition resulting from calling ProcedureName(P,(R,B,W)) in the current iteration.

Make-(R ∪ B)-compatible

If P is not (R ∪ B)-compatible, we start by refining P with the following procedure so that each of its components is (R ∪ B)-compatible.

[Procedure listing not reproduced: Make-(R ∪ B)-compatible]

An example is given in Fig. 2. We note that in general, the choice of u^ does not have to be unique, and that multiple refinements may be needed to make the partition (R ∪ B)-compatible.

Fig. 2. Illustration of Make-(R ∪ B)-compatible(P,(R,B,W)). Because P and P′ do not overlap in V2, we can represent these partitions as the leaf sets of trees in a forest obtained by deleting edges from T2. In this figure and the following figures the dashed edges represent deleted edges. In this example P = {L}. Then Make-(R ∪ B)-compatible(P,(R,B,W)) must choose u^ = lca2(R1,B1), and refines the partition to {{B1,R1},{B2,W1,R2,W2,W3}}, which is (R ∪ B)-compatible

As observed above, for any partition P that does not overlap in V2, there is a set of edges in T2 such that P consists of the leaf sets of the trees in the forest obtained after deleting these edges. Our refinement is equivalent to deleting the parent edge of u^, and hence the resulting partition does not overlap in V2 if the original partition did not overlap in V2.
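A single refinement step of this kind can be sketched as follows. The code is illustrative only: it assumes T2 is stored as a child-to-parent dictionary, the names `cut_above` and `leaves_below` are hypothetical, and how u^ is selected is left to the caller because the procedure listing is not reproduced here.

```python
def path_to_root(parent, v):
    """Nodes from v up to the root of a tree given as a child -> parent map."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def leaves_below(parent, leaves, u):
    """The members of `leaves` that are descendants of (or equal to) u."""
    return {x for x in leaves if u in path_to_root(parent, x)}


def cut_above(partition, A, parent2, u_hat):
    """Replace A by A ∩ L(u_hat) and A \\ L(u_hat).

    This mirrors deleting the parent edge of u_hat in T2.
    """
    inside = leaves_below(parent2, A, u_hat)
    outside = A - inside
    if not inside or not outside:
        raise ValueError("cutting above u_hat would not split the component")
    return [C for C in partition if C != A] + [inside, outside]


# Toy tree (an assumption for illustration): T2 = (((a,c),b),d).
T2 = {"a": "s", "c": "s", "s": "t", "b": "t", "t": "z", "d": "z", "z": None}

P = [{"a", "b", "c", "d"}]
refined = cut_above(P, P[0], T2, "s")
print(sorted(sorted(C) for C in refined))
```

Cutting above s separates the subtree containing a and c from the rest of T2, so the single component is replaced by {a,c} and {b,d}.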

Lemma 1

Let P′ be the partition after Make-(R ∪ B)-compatible(P,(R,B,W)). Then P′ is a refinement of P that does not overlap in V2 and is (R ∪ B)-compatible.

Proof

First, observe P is R-compatible and B-compatible, since u’s children are not roots of infeasibility. If P is (R ∪ B)-compatible then P is not modified by the procedure, and the lemma is vacuously true. Otherwise, the procedure refines P, and, as argued above, the resulting partition P′ does not overlap in V2 provided that P does not overlap in V2. The procedure ends when there are no sets in P′ that are not (R ∪ B)-compatible, so the only thing left to show is that this procedure halts. Because u^ was chosen to be the lowest internal node in V2[A] such that A ∩ L(u^) intersects both R and B, the children of u^, say u^r and u^ℓ, are such that A ∩ L(u^r) and A ∩ L(u^ℓ) can each intersect only one of R and B. Therefore A ∩ L(u^) is (R ∪ B)-compatible, where A was not, and thus the number of (R ∪ B)-compatible components in the partition increases, which can happen at most |L| times.

Observe that if P is (R ∪ B)-compatible, then any refinement of P is also (R ∪ B)-compatible, and hence we may assume that the partition at any later point in the current iteration of the Red-Blue Algorithm is (R ∪ B)-compatible.

Make-splittable

The goal of the next two procedures is to further refine the partition so that there is no overlap in V1[R ∪ B]. We will do this in two steps. The first of these procedures will make the partition “splittable”. To describe this informally, we view the components of the partition as the trees of the forest obtained by deleting edges from T2. We call a component A that intersects k colors splittable if there are k-1 edges that can be deleted from T2 to “split” the tree into k unicolored components. We can phrase this property succinctly using the notion of overlapping: if the sets A ∩ R, A ∩ B and A ∩ W do not overlap in V2, then there are disjoint trees in T2 that have each of these sets as leaf sets, and we can therefore split the tree associated with A in T2 into these three trees by deleting at most two edges.

Definition 5

Given a coloring (R, B, W) of L, a set A ⊆ L is splittable if A ∩ R, A ∩ B and A ∩ W do not overlap in V2. A partition is splittable if every component in the partition is splittable.
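Definition 5 amounts to a pairwise disjointness test on the node sets that the three color classes of A span in T2. The sketch below is illustrative and assumes a child-to-parent dictionary for T2; the function names and the two toy trees are hypothetical.

```python
from itertools import combinations


def path_to_root(parent, v):
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def spanned_nodes(parent, leaves):
    """V2[A]: all nodes on a path between two leaves of A (empty for empty A)."""
    leaves = list(leaves)
    if not leaves:
        return set()
    common = set(path_to_root(parent, leaves[0]))
    for x in leaves[1:]:
        common &= set(path_to_root(parent, x))
    nodes = set()
    for x in leaves:
        # Climb from x and stop at the first shared ancestor, which is lca(A).
        for v in path_to_root(parent, x):
            nodes.add(v)
            if v in common:
                break
    return nodes


def is_splittable(parent2, A, R, B, W):
    """A ∩ R, A ∩ B and A ∩ W must span pairwise disjoint node sets in T2."""
    parts = [spanned_nodes(parent2, set(A) & set(C)) for C in (R, B, W)]
    return all(not (X & Y) for X, Y in combinations(parts, 2))


R, B, W = {"r1"}, {"b1", "b2"}, {"w1", "w2"}

# Two toy versions of T2 (assumptions for illustration):
# colors interleaved: (((b1,w1),(b2,w2)),r1)
T2_mixed = {"b1": "s", "w1": "s", "b2": "t", "w2": "t",
            "s": "z", "t": "z", "z": "top", "r1": "top", "top": None}
# colors grouped:     (((b1,b2),(w1,w2)),r1)
T2_grouped = {"b1": "s", "b2": "s", "w1": "t", "w2": "t",
              "s": "z", "t": "z", "z": "top", "r1": "top", "top": None}

A = {"b1", "b2", "w1", "w2"}
print(is_splittable(T2_mixed, A, R, B, W))
print(is_splittable(T2_grouped, A, R, B, W))
```

In the interleaved tree the path between the two blue leaves and the path between the two white leaves share internal nodes, so A cannot be separated by color by deleting edges; in the grouped tree a single edge deletion suffices.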

[Procedure listing not reproduced: Make-Splittable]

As a first example of Make-Splittable, consider P = {{B1,R1},{B2,W1,R2,W2,W3}} that was the output of Make-(R ∪ B)-compatible depicted in Fig. 2. In this example P is already splittable. In Fig. 3 a more interesting example is given.

Fig. 3. Illustration of Make-Splittable(P,(R,B,W)). P = {{R1},{B1,B2,W1,R2,W2},{W3}}, and the set A = {B1,B2,W1,R2,W2} is not splittable. Make-Splittable(P) would choose u^ = lca2(B2,W1) and replace A by {B2,W1} and {B1,R2,W2}

Lemma 2

Make-Splittable is well-defined, in that a node u^ satisfying the properties required by the procedure can always be found.

Proof

If A is bicolored and not splittable, then there exists u^ ∈ V2[A] such that both A ∩ L(u^) and A \ L(u^) are bicolored: just take u^ to be a lowest node in V2[A ∩ C1] ∩ V2[A ∩ C2] for distinct C1, C2 ∈ {R,B,W}; such a node exists because A is not splittable, and the fact that u^ is in V2[A ∩ Ci] for i = 1,2 implies that A ∩ L(u^) and A \ L(u^) intersect Ci.

It remains to prove the lemma for the case that A is tricolored. For this to hold, we need that P is (R ∪ B)-compatible, which by Lemma 1 is indeed true when Make-Splittable is called. So suppose A is tricolored and not splittable. Note that V2[A ∩ R] and V2[A ∩ B] cannot intersect because A is (R ∪ B)-compatible. Assume without loss of generality that V2[A ∩ R] ∩ V2[A ∩ W] ≠ ∅, and let u^ be a lowest node in V2[A ∩ R] ∩ V2[A ∩ W]. Note that both A ∩ L(u^) and A \ L(u^) must intersect W and R, and that A ∩ L(u^) cannot intersect B, since then A would not be (R ∪ B)-compatible. So A ∩ L(u^) is bicolored, and A \ L(u^) is tricolored.

Lemma 3

Let $\mathcal{P}'$ be the partition after Make-Splittable$(\mathcal{P},(R,B,W))$. Then $\mathcal{P}'$ is a refinement of $\mathcal{P}$ that does not overlap in $V_2$ and in which every component is splittable.

Proof

By Lemma 2, and since each iteration increases the number of components in the partition, Make-Splittable must terminate, and by its definition, the final partition $\mathcal{P}'$ contains only splittable components. Clearly $\mathcal{P}'$ is a refinement of $\mathcal{P}$; it does not overlap in $V_2$ by the same arguments as used in the proof of Lemma 1.

Before continuing, we summarize the properties of the partition resulting after Make-Splittable that will be useful in the proof of the approximation guarantee in Sect. 4. To describe these, we need the notion of a top component.

Definition 6

If $A$ is a component in the partition at the beginning of an iteration, and $A$ is multicolored, then $A$ is a top component. If $A$ is a top component of the current partition, and $A$ gets subdivided into $A \setminus L(\hat u)$ and $A \cap L(\hat u)$ by Make-$(R \cup B)$-compatible or Make-Splittable, then $A \setminus L(\hat u)$ (but not $A \cap L(\hat u)$) is a top component of the resulting partition.

We note that by Observation 1, there are always either exactly one or two top components at the start of the iteration, and hence throughout (until the call to Split, after which the notion is no longer defined).

Lemma 4

Let $\mathcal{P}^{(0)}$ denote the partition at the start of a given iteration and $(R,B,W)$ the coloring of the leaves that is selected, let $\mathcal{P}^{(1)}$ denote the partition after Make-$(R \cup B)$-compatible$(\mathcal{P}^{(0)},(R,B,W))$, and let $\mathcal{P}^{(2)}$ denote the partition after Make-Splittable$(\mathcal{P}^{(1)},(R,B,W))$. Then the following properties hold:

  1. Only multicolored components are subdivided by the iteration, i.e., if $A \in \mathcal{P}^{(0)} \setminus \mathcal{P}^{(2)}$, then $A$ is multicolored.

  2. The number of tricolored components in $\mathcal{P}^{(2)}$ is the same as in $\mathcal{P}^{(1)}$.

  3. Any tricolored component in $\mathcal{P}^{(1)}$ or $\mathcal{P}^{(2)}$ that is not a top component contains no compatible tricolored triple.

  4. Any bicolored component $A$ in $\mathcal{P}^{(2)}$ that is not a top component satisfies that $\mathrm{lca}_2(A)$ is not covered by $A \cap C$ for any color $C \in \{R,B,W\}$. In other words, $L(\hat u_\ell) \cap A$ and $L(\hat u_r) \cap A$ are unicolored, where $\hat u_\ell$ and $\hat u_r$ are the children of $\mathrm{lca}_2(A)$.

  5. If $x_W$ is in component $A$ in $\mathcal{P}^{(0)}$, and $x_W$ is not a descendant of $\mathrm{lca}_2(A \cap (R \cup B))$ (and thus $x_W$ is a white leaf), then either $A \in \mathcal{P}^{(2)}$ or $x_W$ is in a top component in $\mathcal{P}^{(2)}$.

Proof

The fact that property 1 holds can be read from the description of Make-$(R \cup B)$-compatible and Make-Splittable. Property 2 follows from the description of Make-Splittable.

For property 3, we prove that when a non-top component is created from a top component, this non-top component cannot have compatible tricolored triples. This implies that no non-top component can have a compatible tricolored triple. First consider non-top components created by Make-$(R \cup B)$-compatible from a top component $A$. The fact that the node $\hat u$ picked in Make-$(R \cup B)$-compatible is always chosen as low as possible implies that when the non-top component $A' = A \cap L(\hat u)$ is created, it holds that $\mathrm{lca}_2(x_R,x_B)=\hat u$ for any $x_R \in A' \cap R$, $x_B \in A' \cap B$. Therefore, for any $x_W \in A' \cap W$, it must be the case that either $\mathrm{lca}_2(x_W,x_R) \prec \hat u$ or $\mathrm{lca}_2(x_W,x_B) \prec \hat u$. But then $\{x_R,x_B,x_W\}$ is incompatible, because $\mathrm{lca}_1(x_R,x_B) \prec \mathrm{lca}_1(x_R,x_B,x_W)$. So non-top components in $\mathcal{P}^{(1)}$ can indeed not have compatible tricolored triples. Non-top components created by Make-Splittable from a top component are bicolored by definition, so these cannot have compatible tricolored triples either. Therefore, property 3 holds.

A similar argument shows property 4. First, consider a non-top component $A$ created by Make-$(R \cup B)$-compatible. $A$ intersects $R$ and $B$, so if $A$ is bicolored, it contains no white leaves, so $\mathrm{lca}_2(A)$ is not covered by $A \cap W = \emptyset$. Now, because $\mathrm{lca}_2(A)$ is the node $\hat u$ picked in Make-$(R \cup B)$-compatible, which is as low as possible, $\mathrm{lca}_2(A)$ is covered by neither $A \cap R$ nor $A \cap B$. For a non-top component $A$ created by Make-Splittable, the fact that $\mathrm{lca}_2(A)$ is the node $\hat u$ picked in Make-Splittable, which is chosen as low as possible, again implies that $\mathrm{lca}_2(A)$ is not covered by $A \cap C$ for any color $C \in \{R,B,W\}$.

For property 5, if $A \notin \mathcal{P}^{(2)}$, consider a node $\hat u$ selected by Make-$(R \cup B)$-compatible or Make-Splittable that leads to a subdivision of $A$. It suffices to argue that $\hat u \preceq \mathrm{lca}_2(A \cap (R \cup B))$, because then the fact that $x_W$ is not a descendant of $\mathrm{lca}_2(A \cap (R \cup B))$ implies that $x_W$ always remains in a top component. For $\hat u$ selected by Make-$(R \cup B)$-compatible this fact holds because $\hat u$ is a lowest node such that $A \cap L(\hat u)$ intersects $R$ and $B$. For $\hat u$ selected by Make-Splittable this fact holds because $\hat u$ is a lowest node such that $A \cap L(\hat u)$ is bicolored, and $A \setminus L(\hat u)$ intersects the same colors as $A$.
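The incompatibility argument used for property 3 compares which pair of a triple has its lowest common ancestor strictly below that of the whole triple in each tree. The sketch below makes this comparison executable, assuming that a triple is compatible exactly when the same pair forms this "cherry" in $T_1$ and in $T_2$ (the definition itself appears earlier in the paper and is restated here as an assumption). The child-to-parent encoding, the function names and the three-leaf trees are hypothetical.

```python
def ancestors(parent, v):
    """Nodes from v up to the root, inclusive."""
    chain = [v]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def lca(parent, *leaves):
    """Lowest common ancestor of the given leaves."""
    chains = [ancestors(parent, x) for x in leaves]
    common = set(chains[0]).intersection(*map(set, chains[1:]))
    return next(v for v in chains[0] if v in common)


def cherry(parent, x, y, z):
    """Pair whose lca lies strictly below the lca of the triple (binary tree)."""
    top = lca(parent, x, y, z)
    for pair in ((x, y), (x, z), (y, z)):
        if lca(parent, *pair) != top:
            return frozenset(pair)
    return None  # cannot happen for three distinct leaves of a binary tree


def triple_compatible(parent1, parent2, x, y, z):
    """Assumed test: the triple has the same cherry in both trees."""
    return cherry(parent1, x, y, z) == cherry(parent2, x, y, z)


# Hypothetical T1 = ((r, b), w) and T2 = ((r, w), b), stored as child -> parent.
t1 = {"root": None, "u": "root", "r": "u", "b": "u", "w": "root"}
t2 = {"root": None, "u": "root", "r": "u", "w": "u", "b": "root"}

same_tree = triple_compatible(t1, t1, "r", "b", "w")
different = triple_compatible(t1, t2, "r", "b", "w")
```

Here the red and blue leaves form the cherry in the first tree, while the red and white leaves form it in the second, which is the situation the proof rules out for non-top components.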

Split

We now “split” the multicolored components of the partition: essentially, we further refine the partition by intersecting each multicolored component with $R$, $B$ and $W$. Thus a component intersecting $k$ colors will be split into $k$ unicolored components. The fact that the components of the partition were splittable ensures that the resulting partition does not overlap in $V_2$. We will, however, need to be slightly more careful in order to achieve the approximation guarantee; in particular, we will sometimes need to perform what we call a Special-Split.

[Pseudocode figure: procedure Split]
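The basic refinement step described above, without the Special-Split case, can be sketched as follows. This is only an illustration of intersecting each component with the color classes; the function name and the encoding of the coloring as a dictionary are assumptions. The input partition is the one from the top example of Fig. 4.

```python
def plain_split(partition, color):
    """Refine each component by intersecting it with the color classes."""
    refined = set()
    for component in partition:
        by_color = {}
        for leaf in component:
            by_color.setdefault(color[leaf], set()).add(leaf)
        refined.update(frozenset(part) for part in by_color.values())
    return refined


# Partition from the top example of Fig. 4; a leaf's color is its first letter.
partition = [{"R1"}, {"B2", "W1"}, {"B1", "R2", "W2"}, {"W3"}]
color = {leaf: leaf[0] for component in partition for leaf in component}

result = plain_split(partition, color)
```

As stated in the caption of Fig. 4, every leaf ends up as a singleton in this example. A component that intersects $k$ colors contributes exactly $k$ components to the result.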

Remark

Our analysis in Sect. 4 needs the Special-Split, Find-Merge-Pair and Merge-Components procedures only in one (of three) cases that will be described in Lemma 12. Without these procedures, it is trivial to see that the resulting partition is feasible, and we will see in Sect. 4 that the proof of the approximation ratio is quite simple in these cases. On first reading, the reader may thus choose to skip the description of these procedures, and also read Sect. 4 only up to the proof of Proposition 14.

We emphasize that the Special-Split procedure is only called if $A$ is tricolored, and there is at least one tricolored compatible triple in $A$. Hence, by property 3 of Lemma 4, Special-Split is only applied to tricolored top components.

[Pseudocode figure: procedure Special-Split]

We refer to Fig. 4 for examples of the split operations in the two cases.

Fig. 4

Two illustrations of Split$(\mathcal{P},(R,B,W))$. In the top example $\mathcal{P}=\{\{R_1\},\{B_2,W_1\},\{B_1,R_2,W_2\},\{W_3\}\}$ and Split$(\mathcal{P})$ would simply refine each set of $\mathcal{P}$ by intersecting it with the three color classes. The result is that every leaf is a singleton in $\mathcal{P}$. In the bottom example, $\mathcal{P}=\{\{B_1,R_1\},\{B_2,W_1,R_2,W_2,W_3\}\}$. The set $A=\{B_2,W_1,R_2,W_2,W_3\}$ is tricolored and contains the triple $\{B_2,R_2,W_3\}$ that is tricolored and compatible, but not every tricolored triple in $A$ is compatible, e.g., $\{B_2,R_2,W_2\}$ is not compatible. In this case, the Special-Split replaces $A$ by $\{\{B_2\},\{R_2\},\{W_1,W_2\},\{W_3\}\}$

We now describe the property that the partition produced by Split will have, which goes beyond merely being $(R \cup B)$-compatible and non-overlapping in $V_2$ and $V_1[R \cup B]$.

Definition 7

Let $K \subseteq \mathcal{L}$. A partition $\mathcal{P}$ is $K$-feasible if for all $w \in \mathcal{L}$, $\mathcal{P}$ is $K \cup \{w\}$-compatible, and no two components in $\mathcal{P}$ overlap in $V_2 \cup V_1[K]$.

We will simply say $\mathcal{P}$ is feasible if it is $\mathcal{L}$-feasible, which we note does indeed coincide with the definition of a feasible solution to MAF. We make two additional remarks about the notion of $K$-feasibility:

  • This stronger compatibility notion will be used in Lemma 7 to show that if $\mathcal{P}$ is $(R \cup B)$-feasible, then future iterations of the Red-Blue Algorithm will not further subdivide (the restriction of the partition to) $R \cup B$. This is not necessarily true if $\mathcal{P}$ is only $(R \cup B)$-compatible and does not overlap in $V_2 \cup V_1[R \cup B]$. See Fig. 5 for an example.

  • If $u \in V_1$ is a root of infeasibility for $\mathcal{P}$, then $\mathcal{P}$ is not $L(u)$-feasible. The converse is not true, however: if $\mathcal{P}$ contains a single component containing $L(u)$ which is $L(u)$-compatible, but this component contains both $w \in \mathcal{L} \setminus L(u)$ such that $L(u) \cup \{w\}$ is compatible, and $w' \in \mathcal{L} \setminus L(u)$ such that $L(u) \cup \{w'\}$ is not compatible, then $\mathcal{P}$ is not $L(u)$-feasible, but $u$ is not a root of infeasibility. See Fig. 6 for an example. The stronger notion of $u$ being a root of infeasibility, as opposed to $\mathcal{P}$ not being $L(u)$-feasible, is needed when we prove the approximation guarantee in Sect. 4.

Fig. 5

An example where $\mathcal{P}$ is $(R \cup B)$-compatible and does not overlap in $V_2 \cup V_1[R \cup B]$, but is not $(R \cup B)$-feasible. In this example, $\mathcal{P}=\{\mathcal{L}\}$, which clearly does not overlap in any node. If we stop the current iteration with $\mathcal{P}$, then $\mathrm{lca}_1(\{B_1,B_2\})$ and $\mathrm{lca}_1(\{W_1,W_2\})$ are lowest roots of infeasibility; no matter which one is chosen, the next iteration would further subdivide the partition restricted to $R \cup B$. Because we want to ensure this does not happen, the current iteration of the Red-Blue Algorithm will further subdivide the partition induced on $R \cup B$: it will create components $\{B_1,W_1\},\{B_2,W_2,R_1\}$ in Make-Splittable and split everything into singleton components in Split

Fig. 6

An example where $\mathcal{P}$ is not $L(u)$-feasible, but $u$ is not a root of infeasibility. (To emphasize that $u$ is not a root of infeasibility, the leaves are labelled with $x_1$, $x_2$, $x_3$, $w_1$ and $w_2$, in contrast to earlier figures.) In this example, $\mathcal{P}=\{\mathcal{L}\}$, which does not overlap in any node, and $\mathcal{P}$ is $L(u)$-compatible because the triple $L(u)$ is compatible. But $\mathcal{P}$ is not $L(u)$-feasible because $L(u) \cup \{w_1\}$ is not compatible. On the other hand, $u$ is not a root of infeasibility because $L(u) \cup \{w_2\}$ is compatible

Before we prove that the outcome of Split is $(R \cup B)$-feasible, we prove the following technical lemma that gives sufficient conditions for a partition to not overlap in $V_1[R \cup B]$.

Lemma 5

Let P be the partition and (RBW) be the coloring at the start of an iteration. Let P be a refinement of P that does not overlap in V2 and that is (RB)-compatible . Then P does not overlap in V1[RB] if the following two conditions are met:

  • (i)

    P has at most one multicolored component;

  • (ii)

    for the multicolored component AP (if it exists), either lca2(RB)lca2(A) or any node v^ with lca2(A)v^lca2(RB) is covered only by components in P that are subsets of W, or that are also components of P.

Proof

Suppose the conditions of the lemma hold for P. First, observe that P having at most one multicolored component implies that P contains at most one component covering lca1(RB). Hence, if we suppose for a contradiction A,AP exist that overlap in V1[RB], then they must overlap in V1[R] or V1[B]. Without loss of generality, assume that A,AP overlap in V1[R]. Since they do not overlap in lca1[RB], we may assume also without loss of generality that AR and ARBW.

Since lca1(RB) was chosen as a lowest root of infeasibility, lca1(R) was not a root of infeasibility for P. This implies that no two components of P overlap in V1[R], so it must be the case that A and A were both part of a single component in P and were split. Also, P must have been R-compatible, so (AA)R is a compatible set. We will show that these facts imply that if A and A overlap in V1[R], then they must overlap in V2[R], thus contradicting that P does not overlap in V2.

Let v be a lowest node in V1[R] such that AL(v) and AL(v) where we note that v exists since A,A overlap in some node in V1[R]. Observe that a child of v cannot be in both V1[A] and V1[A], as this contradicts the choice of v, and v itself is in V1[A] and V1[A] only if A and A also contain leaves in L\L(v). Let x,x be in AL(v) and AL(v) respectively, and choose y,y in A\L(v) and A\L(v). Note that x,yR because AR, and xR because x is a descendant of vV1[R], and the coloring guarantees that all descendants of nodes in V1[R] are red.

First, assume both A and A are unicolored (that is, both red). Then also yR, so {x,x,y,y}R is a compatible set. Note that lca1(x,x)=vlca1(x,x,y) and similarly lca1(x,x)lca1(x,x,y). Since {x,x,y,y} is compatible, we must also have lca2(x,x)lca2(x,x,y) and lca2(x,x)lca2(x,x,y). But then lca2(x,x) is on the path from x to y as well as on the path from x to y. Hence, A and A overlap in lca2(x,x)V2, contradicting that P does not overlap in V2.

Now, suppose that while A is unicolored, A is multicolored . Since {x,x,y}R is compatible, lca2(x,x)lca2(x,x,y), so the fact that x,yA implies that lca2(x,x)V2[A]. Now, it must be the case that lca2(A)lca2(x,x), because xA and otherwise A and A overlap in lca2(x,x), contradicting that P does not overlap in V2. The fact that x,xR implies that lca2(x,x)lca2(RB). So lca2(A)lca2(x,x)lca2(RB), and by property (ii), it must thus be the case that lca2(x,x) is covered only by components in P that are subsets of W or that are also components of P. But this is a contradiction because lca2(x,x) is covered by AP\P.

The next lemma states that the partition resulting after Split is $(R \cup B)$-feasible.

Lemma 6

Let $\mathcal{P}'$ be the partition after Split$(\mathcal{P},(R,B,W))$. Then $\mathcal{P}'$ is a refinement of $\mathcal{P}$ that is $(R \cup B)$-feasible.

Proof

It is easy to see that every component is $(R \cup B \cup \{w\})$-compatible for all $w \in \mathcal{L}$: each component is either unicolored (and thus $(R \cup B \cup \{w\})$-compatible by the fact that the partition is $R$-compatible and $B$-compatible), or it is the result of a Special-Split on a component in which all tricolored triples are compatible, and hence, since all triples in $R \cup B$ are compatible by the fact the component is $(R \cup B)$-compatible, it was already $(R \cup B \cup \{w\})$-compatible for all $w \in \mathcal{L}$ before the Special-Split.

To see that $\mathcal{P}'$ does not overlap in $V_2$, note that the fact that $\mathcal{P}$ does not overlap in $V_2$ and is splittable (by Lemma 3) implies that $A \cap R$, $A \cap B$, $A \cap W$ do not overlap in $V_2$ for any $A \in \mathcal{P}$. If $A$ is split by a Special-Split into $A \cap R$ and $A \cap (B \cup W)$, then $A$ is $(R \cup B \cup \{w\})$-compatible for all $w \in \mathcal{L}$ (again, because $A$ has no incompatible tricolored triples and $A$ is $(R \cup B)$-compatible). This implies that there is a node $\hat u_r \in V_2$ such that $A \cap L(\hat u_r) = A \cap R$; hence, $A \cap R$ and $A \setminus R$ do not overlap in $V_2$.

It remains to show that no two components in P overlap in V1[RB]. We check the sufficient conditions in Lemma 5. The only possible multicolored components of P are bicolored components created by Special-Split on a component in P that is tricolored and in which every tricolored triple is compatible . By property 3 of Lemma 4, the only tricolored components that have a compatible tricolored triple are top components. By Observation 1, the partition at the start of the iteration had at most one tricolored component, and thus there can be at most one top component, say A, that is tricolored in P. Since A is (RB{w})-compatible for all wL, there is a node u^rV2 such that AL(u^r)=AR. Split subdivides A into AR and A\R, where A\R is the unique multicolored component in P. Let A=A\R, and suppose that there exists a component AP that covers a node v^ on the path from lca2(A) to lca2(L). Then lca2(A) must be on this path, too, so lca2(A)lca2(A). Observe that A cannot be A\A=AL(u^r). Also, since A was the unique top component in P, no component created in the current iteration has a lowest common ancestor above lca2(A). So A must have been a component in the partition at the start of the iteration, and by Lemma 5 we conclude that P does not overlap in V1[RB].

Find-merge-pair and merge-components

The astute reader may have noted that the Red-Blue Algorithm sometimes increases the number of components by more than necessary to be $(R \cup B)$-feasible. One example of this is given in Fig. 5. More generally, it follows from the arguments in the proof of Lemma 6 that if there is a tricolored component in which every tricolored triple is compatible, then not further subdividing this component would also leave a partition that is $(R \cup B)$-feasible. Find-Merge-Pair and Merge-Components aim to merge two components of the partition produced at the end of Split, so that the partition with the merged components is still $(R \cup B)$-feasible. Find-Merge-Pair thus looks for a pair of components that can be merged, by scanning the components of the current partition, and finding two leaves in $R \cup B$ that are in different sets of the partition now, but that were in the same component at the start of the current iteration. We note that a pair of components may also be found when no Special-Split is done on a tricolored component in which every tricolored triple is compatible; in other words, Find-Merge-Pair and Merge-Components can do more than simply reversing those splits on tricolored components in which every tricolored triple is compatible. In the proof of the approximation guarantee (in particular, in Proposition 15), we will show the existence of very specific components that can be merged. However, merging any pair of components created in the current iteration leads to the same approximation guarantee.

[Pseudocode figure: procedure Find-Merge-Pair]

Although we could simply merge the components containing $x_1$ and $x_2$ for the pair found by Find-Merge-Pair, we will not do so until the very end of the algorithm. The reason we keep such “superfluous” splits is that they increase the objective value of the dual solution we use to prove the approximation guarantee of 2 (see Sect. 4). We “reverse” these superfluous splits (i.e., we will merge components) at the end of the algorithm; this is reminiscent of a “reverse delete” in approximation algorithms for network design [13]. The reason to delay these merges is thus to simplify the description of the dual solution in the analysis only.

[Pseudocode figure not reproduced]

The proof that we will be able to merge the components containing the pair of leaves identified by Find-Merge-Pair at the end of the algorithm will rely on the fact that (i) because the partition is $(R \cup B \cup \{w\})$-compatible for any $w \in \mathcal{L}$, merging the components containing the identified leaves $x_1,x_2 \in R \cup B$ cannot increase the number of incompatible triples contained in a component, and (ii) because the partition is $(R \cup B)$-feasible, future iterations of the algorithm will not further refine the partition induced on $R \cup B$. This is the reason why we do not allow Find-Merge-Pair to choose leaves in $W$ (and only choosing leaves in $R \cup B$ is sufficient to prove the claimed approximation guarantee).

Lemma 7

Let $(R,B,W)$ be the coloring during some iteration of the Red-Blue Algorithm, and let $\mathcal{P}$ be the partition at the end of the iteration. Then the algorithm does not refine the partition restricted to $R \cup B$ in later iterations: for any $x,x' \in R \cup B$ that are in the same component of $\mathcal{P}$, $x$ and $x'$ are in the same component in any partition at any later point of the algorithm’s execution.

Proof

Suppose for a contradiction that a later iteration with coloring $(R',B',W')$ separates two leaves $x,x' \in R \cup B$ in the same component of $\mathcal{P}$. Let $A$ be the component containing $x$ and $x'$ at the start of this iteration. Since $\mathcal{P}$ is $(R \cup B)$-feasible, no $v \in V_1[R \cup B]$ is a root of infeasibility, and hence all leaves in $R \cup B$, and in particular $x$ and $x'$, must have the same color in the coloring $(R',B',W')$. Notice that by the definition of Split, $x$ and $x'$ cannot be separated during Split. Hence, they must be separated during Make-$(R' \cup B')$-compatible or Make-Splittable. In both cases there must exist some $\hat u \in V_2$ such that $A \cap L(\hat u)$ is multicolored with respect to the coloring $(R',B',W')$, and $A \cap L(\hat u)$ contains precisely one of $x,x'$. By relabeling if needed, assume that $x \in A \cap L(\hat u)$ and $x' \in A \setminus L(\hat u)$. Let $w \in A \cap L(\hat u)$ be any leaf with a color (in the coloring $(R',B',W')$) different from $x$, and note that

$$\mathrm{lca}_2(x,w) \preceq \hat u \prec \mathrm{lca}_2(x,x',w). \tag{1}$$

Because all leaves in $R \cup B$ have the same color in $(R',B',W')$, and because $w$ has a different color than $x$ in $(R',B',W')$, we know that $\mathrm{lca}_1(x,x') \prec \mathrm{lca}_1(x,x',w)$. But, since $\mathcal{P}$ is $(R \cup B \cup \{w\})$-compatible, this implies that if $w$ is in the same component as $x$ and $x'$ in (a refinement of) $\mathcal{P}$, then $\mathrm{lca}_2(x,x') \prec \mathrm{lca}_2(x,x',w)$, contradicting (1), because only one of $\mathrm{lca}_2(x,x')$ and $\mathrm{lca}_2(x,w)$ can be strictly below $\mathrm{lca}_2(x,x',w)$.

Correctness of the algorithm

Theorem 8

The Red-Blue Algorithm returns a feasible solution to MAF.

Proof

In each iteration through the main loop of the algorithm, the partition is strictly refined. Thus there are fewer than $|\mathcal{L}|$ iterations. When the main loop terminates, $\mathrm{lca}_1(\mathcal{L})$ is not a root of infeasibility, and so the partition at this stage is feasible. It remains to prove that merging components using Merge-Components maintains the feasibility of the partition.

We prove this by induction on k, the number of pairs in pairslist. If k=0, Merge-Components does nothing, and so the returned partition is indeed feasible.

So suppose k>0. Observe that the result of Merge-Components applied to a partition P is the unique finest coarsening of P in which every pair of nodes in pairslist is in the same component, and hence does not depend on the order in which the pairs in pairslist are considered. We may thus assume without loss of generality that they are considered in the reverse order in which they were added to pairslist.
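The observation that Merge-Components produces the unique finest coarsening in which every pair of pairslist shares a component, independently of the processing order, can be illustrated with a union-find structure. This is a minimal sketch rather than the paper's procedure (which is given as pseudocode in a figure); the function name and the sample partition are hypothetical.

```python
def merge_components(partition, pairslist):
    """Finest coarsening of `partition` joining the two leaves of every pair."""
    parent = {leaf: leaf for component in partition for leaf in component}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for component in partition:  # leaves of one component start out joined
        members = list(component)
        for leaf in members[1:]:
            union(members[0], leaf)
    for x1, x2 in pairslist:  # then join the components of each pair
        union(x1, x2)

    groups = {}
    for leaf in parent:
        groups.setdefault(find(leaf), set()).add(leaf)
    return {frozenset(group) for group in groups.values()}


partition = [{"a"}, {"b"}, {"c", "d"}, {"e"}]
pairs = [("a", "b"), ("b", "c")]

merged = merge_components(partition, pairs)
merged_reversed = merge_components(partition, list(reversed(pairs)))
```

Both calls return the same partition, which is why the proof may assume the pairs are processed in the reverse of the order in which they were added.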

Let $\tilde{\mathcal{P}}$ be the partition obtained during Merge-Components after the components have been merged for all pairs on pairslist, except the pair $(x_1,x_2)$ that was added to pairslist first. Let $\mathcal{P}'$ be the partition at the moment when $(x_1,x_2)$ was added to pairslist during the main loop of the algorithm, i.e., the partition at the end of Split in the iteration where $(x_1,x_2)$ was added to pairslist; let $R$, $B$, $W$ be the three color sets of that iteration. In all subsequent iterations $\mathcal{P}'$ was further refined, and any of the pairs aside from $(x_1,x_2)$ added to pairslist consists of two leaves that were in the same component in the partition at the start of the iteration in which they were added to pairslist, and hence in the same component of $\mathcal{P}'$. Thus, $\tilde{\mathcal{P}}$ is a refinement of $\mathcal{P}'$ and $\tilde{\mathcal{P}}$ is a coarsening of the partition at the end of the last iteration. Thus by Lemma 7, $\mathcal{P}'$ and $\tilde{\mathcal{P}}$ induce the same partition of $R \cup B$. Moreover, by the induction hypothesis, every component of $\tilde{\mathcal{P}}$ is compatible.

Let A1,A2 be the components in P containing x1,x2 respectively. By the choice of x1,x2, (A1A2) is RB{w}-compatible for every wL, and A1A2 does not overlap any component of P\{A1,A2} in V2V1[RB].

If A1,A2 are unicolored , they both contain leaves in RB only, because x1,x2RB by definition of Find-Merge-Pair. As argued above, P contains components A1 and A2 as well. Furthermore, in this case, the set A1A2 is a subset of RB and thus RB{w}-compatibility for all wL implies the set is compatible . Since V1[A1A2]V1[RB], A1A2 cannot overlap any set AP\{A1,A2}; this implies it also does not overlap any set AP\{A1,A2}, since P is a refinement of P.

If A1 and A2 are not both unicolored , observe that only one of A1,A2 is bicolored and contains leaves in BW, because P does not overlap in V1[RB] so it can only have one multicolored component, and the only type of multicolored components after Split, are subsets of BW. Suppose without loss of generality that A1 is unicolored and A2 contains leaves in BW. As mentioned before, by Lemma 7, P and P have the same components restricted to RB, whence P contains component A1 and a component A2A2, where A2(RB)=A2(RB).

We need to show that A1A2 is compatible and does not overlap any component in P\{A1,A2}. For the latter, suppose in order to derive a contradiction that A1A2 overlaps AP\{A1,A2}. Observe that the only nodes in V[A1A2] that are not in V[A1]V[A2] are in V2V1[RB], so the overlap must be on a node vV2V1[RB]. Since P is a refinement of P, there must exist AP such that AA, and thus A1A2 overlaps A in v as well. But then A1A2 also overlaps A in v contradicting that P\{A1,A2}{A1A2} is (RB)-feasible.

To show that A1A2 is compatible , note that A2 is a component of P, and thus, by the induction hypothesis, A2 is compatible . By the choice of A1,A1, we know A1A2A1A2 is (RB{w})-compatible for all wL. So to show that A1A2 is compatible , it suffices to consider x,w,wA1A2 with xA1 and w,wA2W. Fix any xBA2B, and note that xRB. Therefore, lcai(xB,w)=lcai(x,xB,w)=lcai(x,w) for i=1,2, since lcai(x,xB)lcai(x,xB,w) is implied by A1A2 being RB{w}-compatible. So {x,w,w} is compatible exactly when {xB,w,w} is compatible . Because, as we noted, A2 is compatible , we conclude that A1A2 is compatible .

Proof of the approximation guarantee

We showed in the previous section that the Red-Blue Algorithm returns a feasible solution P. In order to prove that our algorithm achieves an approximation guarantee of 2, we will use linear programming duality.

The linear programming relaxation

Let $\mathcal{C}$ be the set of all compatible subsets of $\mathcal{L}$. Introduce a variable $x_L$ for every compatible set $L \in \mathcal{C}$, where in an integral solution, $x_L=1$ indicates that $L$ forms part of the solution to MAF. The constraints ensure that in an integral solution, $\{L : x_L=1\}$ is a partition, and that $V[L] \cap V[L'] = \emptyset$ for two distinct sets $L,L'$ with $x_L=x_{L'}=1$. The objective encodes the size of the partition minus 1.

[Displayed linear program (LP) not reproduced]

The equality constraint on the leaves can be replaced by the inequalities $\sum_{L : v \in L} x_L \geq 1$ for all $v \in \mathcal{L}$. Indeed, given a solution $\tilde x$ for which the constraint for some leaf $v$ is not tight, we can simply choose some set $L$ containing $v$ with $\tilde x_L > 0$, and decrease $\tilde x_L$ while (if $|L|>1$) increasing $\tilde x_{L \setminus \{v\}}$. This cannot increase the cost of the solution, and clearly maintains feasibility. By repeating this process, we obtain a solution to (LP) of cost no larger than the cost of the original $\tilde x$.

In fact, it will be convenient for our analysis to expand the first set of constraints (in their inequality rather than equality form) to contain a constraint for every (not necessarily compatible) set of leaves $A$, stating that every such set must be intersected by at least one component in the chosen MAF solution. All constraints of this expanded set are clearly implied by the constraints for $A$ a singleton, which are exactly the first set of constraints in (LP).

[Displayed expanded linear program not reproduced]

This expanded formulation provides us with a more expressive dual:

[Displayed dual linear program not reproduced]

We will refer to the left-hand side of the first family of constraints, i.e., $\sum_{v \in V[L] \setminus L} y_v + \sum_{A : A \cap L \neq \emptyset} z_A$, as the load on set $L$, and denote it by $\mathrm{load}_{(y,z)}(L)$. By weak duality, the objective value of any feasible dual solution provides a lower bound on the objective value of any feasible solution to (LP), and hence also on the optimal value of MAF. Hence, in order to prove that an agreement forest that has $|\mathcal{P}|$ components is a 2-approximation, it suffices to find a feasible dual solution with objective value $\frac{1}{2}(|\mathcal{P}|-1)$, i.e., for every new component created by the algorithm, the dual objective value should increase by $\frac{1}{2}$ (on average).
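For orientation, the dual can be written out from the surrounding text. The following is a reconstruction under stated assumptions, not a copy of the paper's display: it assumes the load constraint has right-hand side 1 (the objective coefficient of each $x_L$), that $y \leq 0$ and $z \geq 0$ (consistent with $y$ only ever being decreased from 0 and $z$ taking values in $\{0,1\}$), and that the objective is the quantity $D$ used later in this section.

```latex
\begin{align*}
\max \quad & \sum_{v \in V \setminus \mathcal{L}} y_v \;+\; \sum_{\emptyset \neq A \subseteq \mathcal{L}} z_A \;-\; 1 \\
\text{s.t.} \quad & \sum_{v \in V[L] \setminus L} y_v \;+\; \sum_{A \,:\, A \cap L \neq \emptyset} z_A \;\leq\; 1
  && \text{for all } L \in \mathcal{C}, \\
& y_v \leq 0 && \text{for all } v \in V \setminus \mathcal{L}, \\
& z_A \geq 0 && \text{for all } \emptyset \neq A \subseteq \mathcal{L}.
\end{align*}
```

Under this reading, a solution with $z_A = 1$ exactly for the components $A$ of a partition $\mathcal{P}$ has objective value $\sum_{v} y_v + |\mathcal{P}| - 1$.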

The dual solution

The dual solution maintained is as follows. Throughout the main loop of the algorithm, $z_A=1$ if and only if $A$ is a component in $\mathcal{P}$. In the last part of the algorithm, when we merge components according to pairslist, we do not update the dual solution; these operations affect the primal solution (i.e., $\mathcal{P}$) only.

Initially, $y_v=0$ for all $v \in (V_1 \cup V_2) \setminus \mathcal{L}$. At the start of each iteration, we decrease $y_u$ by 1, where $u=\mathrm{lca}_1(R \cup B)$. Whenever in the algorithm we choose a component $A$ and a node $\hat u \in V_2[A]$, and separate the component $A$ into $A \cap L(\hat u)$ and $A \setminus L(\hat u)$, we decrease $y_{\hat u}$ by 1. To be precise, this happens in Make-$(R \cup B)$-compatible, Make-Splittable and in one case in Special-Split (where we actually further refine $A \cap L(\hat u)$). The lines where such nodes are chosen are marked in the description of the algorithm and the procedures it contains.
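The bookkeeping just described can be mirrored in a few lines. This is a sketch under assumptions, not the paper's implementation: the class and method names are hypothetical, node labels are assumed to be distinct across $T_1$ and $T_2$, and the objective is taken to be $\sum_v y_v + |\mathcal{P}| - 1$, the quantity $D$ used in the analysis. Refinements made by Split, which add components without changing $y$, are not modelled.

```python
from collections import defaultdict


class DualState:
    """Tracks the components (z_A = 1 exactly for these) and the y values."""

    def __init__(self, leaves):
        self.components = {frozenset(leaves)}  # initially one component
        self.y = defaultdict(int)              # y_v = 0 for every node

    def start_iteration(self, u):
        """Decrease y_u by 1 for u = lca_1(R ∪ B)."""
        self.y[u] -= 1

    def separate(self, component, leaves_below_u_hat, u_hat):
        """Replace A by A ∩ L(u_hat) and A \\ L(u_hat), and decrease y at u_hat."""
        component = frozenset(component)
        inside = component & frozenset(leaves_below_u_hat)
        outside = component - inside
        if not inside or not outside:
            raise ValueError("u_hat does not separate the component")
        self.components.remove(component)
        self.components.update({inside, outside})
        self.y[u_hat] -= 1

    def objective(self):
        return sum(self.y.values()) + len(self.components) - 1


state = DualState({"a", "b", "c", "d"})
d_initial = state.objective()
state.start_iteration("u")
d_after_start = state.objective()
state.separate({"a", "b", "c", "d"}, {"a", "b"}, "u_hat")
d_after_separate = state.objective()
```

A separation step leaves the objective unchanged, since the extra component is offset by the decrease of $y_{\hat u}$; the objective can only grow through Split.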

Lemma 9

The dual solution maintained by the algorithm is feasible.

Proof

We prove the lemma by induction on the number of iterations. Initially, $z_A=0$ for all $A \subsetneq \mathcal{L}$ and $z_{\mathcal{L}}=1$, and hence every compatible set $L$ has a load of 1.

At the start of an iteration, we decrease $y_{\mathrm{lca}_1(R \cup B)}$ by 1, thus decreasing the load by 1 on any multicolored compatible set $L$. We show that the remainder of the iteration increases the load by at most 1 on a multicolored compatible set and that it does not increase the load on any unicolored compatible set.

First, observe that Make-$(R \cup B)$-compatible and Make-Splittable do not increase the load on any set: separating $A$ into $A \cap L(\hat u)$ and $A \setminus L(\hat u)$ increases the $z$-part of the load on sets $L$ that intersect both $A \cap L(\hat u)$ and $A \setminus L(\hat u)$, since $z_A$ gets decreased from 1 to 0, and $z_{A \cap L(\hat u)}$ and $z_{A \setminus L(\hat u)}$ increase from 0 to 1. However, in this case $\hat u \in V[L]$, and thus decreasing $y_{\hat u}$ by 1 ensures that the load on $L$ does not increase.
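As a worked check of this accounting, consider a compatible set $L$ that intersects both $A \cap L(\hat u)$ and $A \setminus L(\hat u)$, and assume $\hat u$ is an internal node, so that $\hat u \in V[L] \setminus L$. The symbol $\Delta$ for the change caused by the separation step is notation introduced here for illustration.

```latex
\begin{align*}
\Delta\,\mathrm{load}_{(y,z)}(L)
  &= \Delta z_{A \cap L(\hat u)} + \Delta z_{A \setminus L(\hat u)} + \Delta z_A + \Delta y_{\hat u} \\
  &= 1 + 1 - 1 - 1 \;=\; 0 .
\end{align*}
```

If $L$ intersects only one of the two parts, the $z$-terms contribute $1 - 1 = 0$ and the change in $y_{\hat u}$ is at most $0$, so the load on $L$ again does not increase.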

To analyze the effect of Split, we use the following two claims.

Claim 10

In the procedure Split$(\mathcal{P},(R,B,W))$ the load on any compatible set $L$ is increased by at most the number of components $A \in \mathcal{P}$ such that $L \cap A$ is multicolored.

Proof. If the load on $L$ is increased because Split splits a bicolored component $A$ into two unicolored components, then $L$ must intersect both new components, so $L \cap A$ is bicolored (and thus multicolored) and the load on $L$ is increased by 1.

Consider the case where the load on $L$ is increased because a tricolored component $A$ is split into $A \cap R$, $A \cap B$ and $A \cap W$. This split happens when all tricolored triples in $A$ are incompatible. Therefore $L \cap A$ cannot be tricolored. Since the load on $L$ increased by splitting $A$, we conclude that $L \cap A$ must be bicolored and the load on $L$ is increased by 1.

Finally, suppose the load on $L$ is increased because Special-Split$(A,\mathcal{P},(R,B,W))$ is executed for a component $A$. We consider the two cases of Special-Split. In the first case, $A$ is split into two components, one of which contains all red leaves in $A$. The load on a set $L$ thus increases by 1 if $L \cap A$ is multicolored and $L \cap A \cap R \neq \emptyset$, and by 0 otherwise. In the second case, $A$ is split into four components; we think of this as first splitting $A$ into $A \cap L(\hat u)$ and $A \setminus L(\hat u)$, and then splitting $A \cap L(\hat u)$ by intersecting with $R$, $B$ and $W$. Since $y_{\hat u}$ is decreased by 1, splitting $A$ into $A \cap L(\hat u)$ and $A \setminus L(\hat u)$ does not increase the load on any set $L$. Splitting $A \cap L(\hat u)$ by intersecting with $R$, $B$, $W$ increases the load on $L$ by 1 if $L \cap A \cap L(\hat u)$ is bicolored and by 2 if it is tricolored. We show below that $L \cap A \cap L(\hat u)$ cannot be tricolored, which implies that the load on $L$ increases by at most 1 if $A \cap L$ is multicolored, thus proving the claim. Suppose $L \cap A \cap L(\hat u)$ contains a triple $x_B \in B$, $x_R \in R$, $x_W \in W$. The fact that $A$ is $(R \cup B)$-compatible implies that $\mathrm{lca}_2(x_B,x_R)=\mathrm{lca}_2(A \cap (R \cup B))=\hat u$. Since $x_W \in L(\hat u)$, we thus have either $\mathrm{lca}_2(x_B,x_W) \prec \hat u = \mathrm{lca}_2(x_B,x_R)$ or $\mathrm{lca}_2(x_R,x_W) \prec \hat u = \mathrm{lca}_2(x_B,x_R)$. In either case, $\{x_B,x_R,x_W\}$ is incompatible, contradicting that $L$ is compatible.

Claim 11

If $L$ is compatible, and $A$ and $A'$ do not overlap in $V_2$, then $L \cap A$ and $L \cap A'$ cannot both be multicolored.

Proof. Assume that $|A| \geq 2$, $|A'| \geq 2$ (otherwise, the claim is vacuously true). Since $V_2[A]$ and $V_2[A']$ are disjoint, we may assume without loss of generality that $\mathrm{lca}_2(x,y) \prec \mathrm{lca}_2(x,y,x')$ for all $x,y \in A$ and $x' \in A'$. Hence, if $L \cap A$ and $L \cap A'$ are both multicolored sets, then there exist $x,y,x',y' \in L$ where $x,y$ have different colors, $x',y'$ have different colors, $\mathrm{lca}_2(x,y) \prec \mathrm{lca}_2(x,y,x')$, and $\mathrm{lca}_2(x,y) \prec \mathrm{lca}_2(x,y,y')$. We claim this implies $\{x,y,x',y'\}$ is incompatible, a contradiction since $x,y,x',y' \in L$ and $L$ is compatible.

Clearly one of $x,y$ has the same color as one of $x',y'$. Suppose without loss of generality that $x,x'$ have the same color. If $x$ and $x'$ are both red, $y$ is either blue or white. $x$ and $x'$ being red implies $\mathrm{lca}_1(x,x') \prec \mathrm{lca}_1(x,y,x')$, which, since $\mathrm{lca}_2(x,y) \prec \mathrm{lca}_2(x,y,x')$, shows that $\{x,x',y\}$ is an incompatible triple. The case when $x$ and $x'$ are blue is analogous. If $x$ and $x'$ are both white, then $y$ and $y'$ are in $R \cup B$. This implies $\mathrm{lca}_1(y,y') \prec \mathrm{lca}_1(x,y,y')$, and so, since $\mathrm{lca}_2(x,y) \prec \mathrm{lca}_2(x,y,y')$, this implies $\{x,y,y'\}$ is an incompatible triple.

It follows immediately from the two claims that Split increases the load by at most 1 on any multicolored compatible set and that it does not increase the load on any unicolored set, which completes the proof of the lemma.

The primal and dual objective values

Let $\mathcal{P}$, pairslist be the partition and pairslist at the end of an iteration, and let $D=\sum_{v \in V \setminus \mathcal{L}} y_v + |\mathcal{P}| - 1$ be the objective value of the dual solution at this time. In this section, we show that every iteration of our algorithm maintains the invariant that

$$2D \geq |\mathcal{P}| - 1 - |\text{pairslist}|. \tag{2}$$

Observe that the approximation guarantee immediately follows from this inequality, since the objective value of the algorithm’s solution is $|\mathcal{P}| - 1 - |\text{pairslist}|$ (where $\mathcal{P}$, pairslist are the partition and pairslist at the end of the final iteration), and by weak duality $D$ gives a lower bound on the optimal value of the MAF instance.
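Spelled out, the argument is the following chain of inequalities, where $\mathrm{OPT}_{\mathrm{LP}}$ and $\mathrm{OPT}_{\mathrm{MAF}}$ are shorthand introduced here for the optimal values of the linear programming relaxation and of the MAF instance:

```latex
\[
\underbrace{|\mathcal{P}| - 1 - |\text{pairslist}|}_{\text{value of the algorithm's solution}}
\;\leq\; 2D
\;\leq\; 2\,\mathrm{OPT}_{\mathrm{LP}}
\;\leq\; 2\,\mathrm{OPT}_{\mathrm{MAF}} .
\]
```

The first inequality is the invariant (2), the second is weak duality applied to the feasible dual solution of Lemma 9, and the third holds because every feasible MAF solution gives a feasible solution to the relaxation.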

To prove that the algorithm maintains the invariant, we will show that a given iteration increases the left-hand side of (2) by at least as much as the right-hand side. We let ΔD be the change in the dual objective during the iteration and ΔP be the increase in the number of components minus the number of pairs added to pairslist (either 0 or 1) during the current iteration.

Since at the start of the algorithm, the partition consists of exactly one component, and $y_v=0$ for all $v\in V\setminus\mathcal{L}$, (2) holds before the first iteration. So to show (2), it suffices to show that

$$2\Delta D\ \ge\ \Delta P \tag{3}$$

for any iteration.
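For reference, the way the per-iteration inequality yields the approximation guarantee can be written out as a short chain. The iteration index $i$, the final index $k$, and the names $\mathrm{ALG}$ and $\mathrm{OPT}$ are notation introduced only for this sketch; it also assumes that pairslist is empty before the first iteration, so that both sides of the invariant start at 0.

```latex
\begin{align*}
2D_k \;=\; 2\sum_{i=1}^{k}\Delta D_i
      \;&\ge\; \sum_{i=1}^{k}\Delta P_i
      \;=\; |\mathcal{P}_k| - 1 - |\text{pairslist}_k|, \\
\mathrm{ALG} \;=\; |\mathcal{P}_k| - 1 - |\text{pairslist}_k|
      \;&\le\; 2D_k \;\le\; 2\,\mathrm{OPT}.
\end{align*}
```

The first line sums the per-iteration inequality over all iterations; the last step of the second line is weak duality.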

In what follows, we use the following to refer to the state of the partition at various points in the current iteration: $\mathcal{P}^{(0)}$ at the start; $\mathcal{P}^{(1)}$ after Make-$(R\cup B)$-compatible; $\mathcal{P}^{(2)}$ after Make-Splittable; and $\mathcal{P}^{(3)}$ after Split.

We begin by showing that the coloring $(R,B,W)$ and the partition $\mathcal{P}^{(0)}$ satisfy the conditions of one of three cases.

Lemma 12

Given an infeasible partition $\mathcal{P}^{(0)}$ that does not overlap in $V_2$, let $u\in V_1$ be a lowest root of infeasibility, and let $u_\ell$ and $u_r$ be $u$’s children in $T_1$. Let $R=\mathcal{L}(u_r)$, $B=\mathcal{L}(u_\ell)$, and $W=\mathcal{L}\setminus(R\cup B)$. Then $\mathcal{P}^{(0)}$ is $R$-compatible and $B$-compatible and satisfies exactly one of the following three additional properties:

Case 1

$\mathcal{P}^{(0)}$ has exactly one multicolored component, say $A_0$, where $A_0$ is tricolored, not $(R\cup B)$-compatible, and there exists $x_W\in(W\cap A_0)\setminus\mathcal{L}(\mathrm{lca}_2(A_0\cap(R\cup B)))$, i.e., $A_0$ contains a compatible tricolored triple.

Case 2

$\mathcal{P}^{(0)}$ has exactly two multicolored components, say $A_B,A_R$, where $A_B\cap R=\emptyset$ and $A_R\cap B=\emptyset$.

Case 3

$\mathcal{P}^{(0)}$ has exactly one multicolored component, say $A_0$, where $A_0$ is tricolored, $(R\cup B)$-compatible and $A_0$ contains no compatible tricolored triple.

We will see in the proof below that Cases 1, 2 and 3 correspond to a lowest root of infeasibility satisfying (a), (b) and (c) respectively in Definition 4. We refer the reader to Fig. 1 for an illustration of the three cases.

Proof

Observe that if $\mathcal{P}^{(0)}$ is infeasible, then the root of $T_1$, i.e., $\mathrm{lca}_1(\mathcal{L})$, is a root of infeasibility, and that no $v\in\mathcal{L}$ is a root of infeasibility. Hence, $u$ is well-defined and $R$ and $B$ are non-empty. Note that $\mathcal{P}^{(0)}$ is $R$-compatible and $B$-compatible, since $u$’s children are not roots of infeasibility.

We will show that if u satisfies condition (a) in the definition of a root of infeasibility, then the conditions of Case 1 are satisfied, if (b) holds, the conditions of Case 2 are satisfied, and if (c) holds, then the conditions of Case 3 are satisfied.

We start with (b): $\mathcal{P}^{(0)}$ overlaps in $V_1[\mathcal{L}(u)]$. Observe that, because $u$ is a lowest root of infeasibility, the only node in $V_1[\mathcal{L}(u)]$ on which $\mathcal{P}^{(0)}$ overlaps is $u$, and thus there must be at least two multicolored components if (b) holds. If there are two multicolored components, both containing, say, red leaves, then they overlap in $u_r=\mathrm{lca}_1(R)$, which implies $u_r$ is a root of infeasibility, contradicting the choice of $u$. Similarly, there is at most one multicolored component containing blue leaves. Hence, the conditions of Case 2 are satisfied.

If (b) does not hold, i.e., the partition does not overlap in $V_1[\mathcal{L}(u)]$, then there is at most one multicolored component; the conditions in (a) and (c) both imply there is at least one. Thus there is exactly one multicolored component, which we will call $A_0$. We let $R_0=R\cap A_0$, $B_0=B\cap A_0$ and $\hat u=\mathrm{lca}_2(R_0\cup B_0)=\mathrm{lca}_2(A_0\cap(R\cup B))$ (where we stress that $\hat u$ is a node in $V_2$, whereas $u$ is a node in $V_1$).

If (a) holds, then $A_0$ is not $(R\cup B)$-compatible, and thus $R_0,B_0\neq\emptyset$. To derive a contradiction, suppose that Case 1 is not implied, i.e., there does not exist $x_W\in(W\cap A_0)\setminus\mathcal{L}(\mathrm{lca}_2(A_0\cap(R\cup B)))$, i.e., $A_0\subseteq\mathcal{L}(\hat u)$. Observe that, because $A_0$ is not $(R\cup B)$-compatible, $\mathrm{lca}_2(R_0)=\hat u$ or $\mathrm{lca}_2(B_0)=\hat u$. Suppose the former holds without loss of generality. But then $\mathrm{lca}_1(R_0)$ is a root of infeasibility satisfying (c), because $A_0\setminus R_0\neq\emptyset$, and for all $w\in A_0\setminus R_0$, $R_0\cup\{w\}$ is incompatible, by the fact that $w\in\mathcal{L}(\mathrm{lca}_2(R_0))$ and $w\notin\mathcal{L}(\mathrm{lca}_1(R_0))$. But $\mathrm{lca}_1(R_0)$ is a descendant of $u$, thus contradicting the choice of $u$.

Suppose now (c) holds, i.e., $\mathcal{P}^{(0)}$ is $(R\cup B)$-compatible, and in particular $A_0$ is $(R\cup B)$-compatible. Because $A_0$ is multicolored, we can assume without loss of generality that $R_0\neq\emptyset$. If $B_0=\emptyset$, then (c) holds for $\mathrm{lca}_1(R_0)$, which is a descendant of $u$, thus contradicting the choice of $u$. Since $A_0\setminus(R_0\cup B_0)\neq\emptyset$ by condition (c), we conclude $A_0$ is tricolored. It remains to show every tricolored triple is incompatible. Suppose for a contradiction that $\{x,y,w\}\subseteq A_0$ is a tricolored triple that is compatible. Let $w$ be the white leaf in the triple; then compatibility requires that $\mathrm{lca}_2(x,y)\prec\mathrm{lca}_2(x,y,w)$. On the other hand, the fact that $A_0$ is $(R\cup B)$-compatible implies that $\mathrm{lca}_2(x,y)=\mathrm{lca}_2(R_0\cup B_0)$. But then any tricolored triple in $A_0$ containing $w$ is compatible, so that $(A_0\cap(R\cup B))\cup\{w\}$ is compatible, contradicting that condition (c) holds.

Recall that the coloring is defined only at the start of the iteration. The lemma ensures that the partitions during the iteration always have either one (in Cases 1 and 3) or two (in Case 2) top components. Furthermore, we can use the lemma to show that the components created by Make-$(R\cup B)$-compatible and Make-Splittable are multicolored.

Lemma 13

Only multicolored components are created by Make-$(R\cup B)$-compatible and Make-Splittable, i.e., if $A\in\mathcal{P}^{(2)}\setminus\mathcal{P}^{(0)}$, then $A$ is multicolored.

Proof

It follows immediately from the description of Make-Splittable that components created by this procedure are multicolored. Observe that Make-$(R\cup B)$-compatible is used only if $\mathcal{P}^{(0)}$ is not $(R\cup B)$-compatible. By Lemma 12, this implies $\mathcal{P}^{(0)}$ must have exactly one multicolored component $A_0$, which contains a white leaf $x_W$ that is not a descendant of $\mathrm{lca}_2(A_0\cap(R\cup B))$. Make-$(R\cup B)$-compatible (possibly repeatedly) subdivides the top component $A_0$ into $A_0\cap\mathcal{L}(\hat u)$ and $A_0\setminus\mathcal{L}(\hat u)$. From the description of the procedure, it is clear that the newly created non-top component $A_0\cap\mathcal{L}(\hat u)$ intersects both $R$ and $B$, and the new top component $A_0\setminus\mathcal{L}(\hat u)$ must have a leaf in $R\cup B$, because otherwise $A_0$ is already $(R\cup B)$-compatible. So $\hat u\prec\mathrm{lca}_2(A_0\cap(R\cup B))$, and $x_W$ must also be in the new top component $A_0\setminus\mathcal{L}(\hat u)$, thus ensuring that the top component remains multicolored.

For Cases 2 and 3, the analysis is quite simple.

Proposition 14

Let the initial partition $\mathcal{P}^{(0)}$ and coloring $(R,B,W)$ satisfy the conditions of Cases 2 or 3 in Lemma 12. Then $2\Delta D\ge\Delta P$.

Proof

We first make two observations that apply in Cases 2 and 3: (i) $\mathcal{P}^{(0)}$ is already $(R\cup B)$-compatible, so $\mathcal{P}^{(1)}=\mathcal{P}^{(0)}$, and (ii) Split$(\mathcal{P}^{(2)},(R,B,W))$ will not perform any Special-Split, because no component of a refinement of $\mathcal{P}^{(0)}$ can have a tricolored triple that is compatible (since we are in Case 2 or 3). From these two observations we derive that

$$|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|=|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|+2. \tag{4}$$

To see this, note that, since no Special-Split is performed, $|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|$ is equal to the number of bicolored components in $\mathcal{P}^{(2)}$ plus twice the number of tricolored components in $\mathcal{P}^{(2)}$. By Lemma 13, $\mathcal{P}^{(2)}$ has $|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|$ more multicolored components than $\mathcal{P}^{(0)}$, and, since $\mathcal{P}^{(1)}=\mathcal{P}^{(0)}$, property 2 of Lemma 4 implies that $\mathcal{P}^{(2)}$ has the same number of tricolored components as $\mathcal{P}^{(0)}$. So in Case 2, $\mathcal{P}^{(2)}$ has $|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|+2$ bicolored components and zero tricolored components, and in Case 3, $\mathcal{P}^{(2)}$ has $|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|$ bicolored components plus one tricolored component, and indeed (4) holds.

In addition, we note that

$$\Delta D=|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|-1. \tag{5}$$

To see this, note that at the start of the iteration, the dual objective value is reduced by 1 when $y_u$ is decreased by 1 for $u=\mathrm{lca}_1(R\cup B)$. Make-Splittable does not change the dual objective value, because, even though $|\mathcal{P}|$ increases by 1 every time the number of components increases by 1, $\sum_v y_v$ decreases by 1 as well. Finally, since Split will not perform any Special-Split, the increase in the dual objective value due to Split is equal to the increase in the number of components due to Split, which is $|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|$.

Note that the size of pairslist may increase but will never decrease, and thus

$$\Delta P\ \le\ |\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|+|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|\ =\ 2\bigl(|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|\bigr)-2\ \ \text{by (4)}\ =\ 2\Delta D\ \ \text{by (5)}.$$
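The bookkeeping in this proof can be checked mechanically. The sketch below simply encodes equations (4) and (5) and the definition of ΔP; the function and parameter names are illustrative and do not appear in the paper.

```python
def case_2_or_3(splittable_cuts, pairs_added):
    """Return (delta_D, delta_P) for an iteration in Case 2 or Case 3.

    splittable_cuts: |P(2)| - |P(0)|, the components added by Make-Splittable.
    pairs_added:     pairs appended to pairslist in this iteration (0 or 1).
    """
    split_increase = splittable_cuts + 2          # equation (4)
    delta_d = split_increase - 1                  # equation (5)
    # Delta P: total increase in components minus the pairs added.
    delta_p = split_increase + splittable_cuts - pairs_added
    return delta_d, delta_p
```

Under these formulas the slack `2 * delta_d - delta_p` equals `pairs_added`, so the inequality is tight exactly when no pair is added in the iteration.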

We now prove a similar proposition for Case 1, the proof of which is more involved.

Proposition 15

Suppose the initial partition $\mathcal{P}^{(0)}$ and coloring $(R,B,W)$ satisfy the conditions of Case 1 in Lemma 12. Then $2\Delta D\ge\Delta P$.

Proof

In Case 1, we start with $\mathcal{P}^{(0)}$ containing one tricolored component $A_0$, which is not $(R\cup B)$-compatible. $A_0$ is the only component that will be subdivided in the current iteration (by property 1 of Lemma 4). Note that $\mathcal{P}^{(1)}$ and $\mathcal{P}^{(2)}$ therefore have exactly one top component.

Let $x_W$ be a white leaf in $A_0$ that is not a descendant of $\mathrm{lca}_2(A_0\cap(R\cup B))$, which exists by the definition of Case 1. By property 5 in Lemma 4, $x_W$ is contained in the top component of $\mathcal{P}^{(2)}$, and by Lemma 13 this component is multicolored. Therefore, the top component of $\mathcal{P}^{(2)}$ is either bicolored, or it is tricolored and a Special-Split is performed on the top component.

Let $\chi$ be an indicator variable that is 1 if the top component in $\mathcal{P}^{(2)}$ is tricolored and has a tricolored triple that is incompatible; in other words, $\chi=1$ if Special-Split subdivides the top component into four components. If $\chi=0$, then either the top component is bicolored or it is tricolored and all its triples are compatible; in other words, $\chi=0$ if the top component is subdivided into two components by Split (possibly via Special-Split). Thus splitting the top component increases the number of components by $1+2\chi$.

Now, let t be the number of tricolored components in P(2) that are not top components. We claim that

$$|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|=|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|+1+2\chi+t. \tag{6}$$

To show this, we need to argue that the increase in the number of components due to splitting the multicolored non-top components is $|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|+t$. Since $\mathcal{P}^{(0)}$ has one multicolored component, Lemma 13 implies that $\mathcal{P}^{(2)}$ has $|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|+1$ multicolored components. Precisely one of these is a top component, so $\mathcal{P}^{(2)}$ has $|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|$ multicolored non-top components. By property 3, each of the tricolored components that is not a top component does not require a Special-Split and is thus subdivided into three components by Split. Hence, splitting the components that are not top components increases the number of components by $|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|+t$.

Next, we analyze the increase in the dual objective. We claim that

$$\Delta D=|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|-1-\chi. \tag{7}$$

To see this, note that the dual objective is decreased by 1 when we decrease $y_{\mathrm{lca}_1(R\cup B)}$ by 1 at the start of the iteration. As argued in the proof of the previous proposition, the dual objective is not affected by Make-Splittable. The same argument used there implies that the same holds for Make-$(R\cup B)$-compatible. Finally, if $\chi=0$, the increase in the dual objective due to Split is equal to the increase in the number of components, $|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|$. If $\chi=1$, the same holds, but Special-Split on the top component also decreases $y_{\hat u_0}$ by 1.

So we get that

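$$|\mathcal{P}^{(3)}|-|\mathcal{P}^{(0)}|\ =\ |\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|+|\mathcal{P}^{(2)}|-|\mathcal{P}^{(0)}|\ =\ 2\bigl(|\mathcal{P}^{(3)}|-|\mathcal{P}^{(2)}|\bigr)-1-2\chi-t\ \ \text{by (6)}\ =\ 2\Delta D+1-t\ \ \text{by (7)}.$$

Hence, if $t\ge 1$ we have $\Delta P\le 2\Delta D$ as required. So the rest of the proof, which requires quite some extra technicalities, deals with the situation of Case 1 and $t=0$. Recall that $\Delta P$ is equal to $|\mathcal{P}^{(3)}|-|\mathcal{P}^{(0)}|$ minus the number of pairs added to pairslist in the current iteration; hence, to conclude that $\Delta P\le 2\Delta D$ if $t=0$, we need to show a pair is added to pairslist by Find-Merge-Pair.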
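The role of $t$ and of the merged pair becomes explicit when equations (6) and (7) are evaluated for concrete values. The following sketch only encodes those two equations and the definition of ΔP; the function and parameter names are illustrative and not taken from the paper.

```python
def case_1(cuts, chi, t, pairs_added):
    """Return (delta_D, delta_P) for an iteration in Case 1.

    cuts:        |P(2)| - |P(0)|, the components added before Split.
    chi:         1 if Special-Split cuts the top component into four parts, else 0.
    t:           number of tricolored non-top components in P(2).
    pairs_added: pairs appended to pairslist in this iteration (0 or 1).
    """
    split_increase = cuts + 1 + 2 * chi + t       # equation (6)
    delta_d = split_increase - 1 - chi            # equation (7)
    delta_p = split_increase + cuts - pairs_added
    return delta_d, delta_p

def slack(cuts, chi, t, pairs_added):
    """The quantity 2 * delta_D - delta_P, which must be nonnegative."""
    delta_d, delta_p = case_1(cuts, chi, t, pairs_added)
    return 2 * delta_d - delta_p
```

With these formulas the slack works out to `t - 1 + pairs_added`, independent of `cuts` and `chi`. It is negative only when `t == 0` and no pair is added, which is why the remainder of the proof has to exhibit a merge pair in that situation.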

We will say that a component $A\in\mathcal{P}^{(3)}$ is able to reach $\hat u$ if $\hat u\in V_2[A]$, or if $\mathrm{lca}_2(A)\prec\hat u$ and all intermediate nodes on the path from $\mathrm{lca}_2(A)$ to $\hat u$ are not covered by any component in $\mathcal{P}^{(3)}$. The following lemma (which is actually valid in general, and not only for Case 1) enumerates precisely the situations when a merge is possible.

Lemma 16

Let $A_0\in\mathcal{P}^{(0)}$, and let $Q$ denote the set of components in $\mathcal{P}^{(3)}$ that are subsets of $A_0$. Then there exists a pair of elements in $A_0$ that can be added to pairslist if and only if at least one of the following is true:

  (a) $Q$ contains a bicolored component.

  (b) There is a node $\hat u\in V_2$ that can be reached by two red components or two blue components in $Q$.

  (c) There is a node $\hat u\in V_2$ that can be reached by a red and a blue component in $Q$, but is not covered by these components. Furthermore, the node $\hat u$ must satisfy that the nodes on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$ are not covered by any red or blue component in $Q$.

Proof

Since any two multicolored components overlap in $\mathrm{lca}_1(R\cup B)$ and $\mathcal{P}^{(2)}$ does not overlap in $\mathrm{lca}_1(R\cup B)$ by Lemma 6, there is at most one tricolored component in $\mathcal{P}^{(2)}$. By the definitions of Split and Special-Split, $\mathcal{P}^{(3)}$ therefore has at most one multicolored component, which has blue and white leaves and is created by applying Special-Split to the tricolored component in $\mathcal{P}^{(2)}$. If this blue-white component exists in $Q$, we denote it by $A$.

  (a) If $Q$ contains a bicolored component $A$, let $A\cup A'$ be the tricolored component from which Special-Split formed a red component $A'$ and the bicolored component $A$. We show that we can merge $A$ and $A'$, which boils down to undoing the Special-Split operation, to obtain a new partition that is $(R\cup B)$-feasible. Since $A\cup A'$ was not overlapping with any other component in $V_2$, undoing the Special-Split yields a component that does not overlap any other component of the partition in $V_2$. For every $w\in W$, $A\cup A'$ is $(R\cup B\cup\{w\})$-compatible since $A\cup A'$ is $(R\cup B)$-compatible and, by the conditions of the Special-Split operation, every tricolored triple in $A\cup A'$ is compatible. Since $A\cup A'$ was the unique top component in $\mathcal{P}^{(2)}$, any component of $\mathcal{P}^{(2)}$ (and hence of its refinement $\mathcal{P}^{(3)}$) overlapping a node $\hat v$ such that $\mathrm{lca}_2(A\cup A')\prec\hat v\preceq\mathrm{lca}_2(R\cup B)$ must be a component in $\mathcal{P}^{(0)}$. Therefore, by Lemma 5, the new partition does not overlap in $V_1[R\cup B]$.

  (b) If $Q$ does not contain a bicolored component $A$, suppose $A',A''\in Q$ are distinct red components in $Q$ so that $A'$ and $A''$ can both reach the same node $\hat u$ in $V_2$. Then merging $A'$ and $A''$ gives a new partition that does not overlap in $V_2$, and which has no multicolored components. Since $A_0\cap R$ is compatible, so is $A'\cup A''$. By Lemma 5 and the fact that the new partition does not have any multicolored components, it does not overlap in $V_1[R\cup B]$. Hence, merging $A'$ and $A''$ gives a new partition that is $(R\cup B)$-feasible.

    The same applies if $A'$ and $A''$ are both blue components in $Q$.

  (c) If $Q$ does not contain a bicolored component $A$, suppose there exist $A',A''\in Q$ with $A'$ red and $A''$ blue such that (i) there exists $\hat u\in V_2\setminus(V_2[A']\cup V_2[A''])$ that can be reached by both $A'$ and $A''$; and (ii) the nodes on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$ are not in $V_2[A]$ for any red or blue component $A$ in $Q$. Observe that (ii) implies that any component $A$ such that $V_2[A]$ contains nodes on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$ must be a subset of $W$: $A$ must be in $Q$ if $V_2[A]$ contains a node on this path, and by the case assumption, $Q$ contains no multicolored component.

    Merging $A'$ and $A''$ gives a new partition that does not overlap in $V_2$ and the new component $A'\cup A''$ is $(R\cup B)$-compatible by (i). Thus the new partition is $(R\cup B)$-compatible, and since it has no components with white leaves as well as leaves in $R\cup B$, it is vacuously also $(R\cup B\cup\{w\})$-compatible for any $w\in\mathcal{L}$. $A'\cup A''$ is the unique bicolored component in this new partition, thus satisfying condition (i) of Lemma 5. Moreover, it satisfies that any node on the path from $\hat u=\mathrm{lca}_2(A'\cup A'')$ to $\mathrm{lca}_2(A_0)$ is not covered by a component that is not white. By Lemma 12, $A_0$ must have been the unique multicolored component in $\mathcal{P}^{(0)}$, and thus the components of the partition that overlap a node on the path from $\mathrm{lca}_2(A_0)$ to $\mathrm{lca}_2(R\cup B)$ were not changed in the current iteration. Therefore, also condition (ii) of Lemma 5 is satisfied, and the lemma implies that the new partition does not overlap in $V_1[R\cup B]$. Hence, merging $A'$ and $A''$ gives a new partition that is $(R\cup B)$-feasible.

We note that the above three cases encompass all possible merge opportunities within $Q$. If two components cannot reach the same node $\hat u\in V_2$, then merging them gives a partition that overlaps in $V_2$. If red and blue components $A'$ and $A''$ can only reach nodes in $V_2$ that are covered by either $A'$ or $A''$, then $A'\cup A''$ is not $(R\cup B)$-compatible. And if a red and a blue component $A'$ and $A''$ can reach a node $\hat u\in V_2$ that is not in $V_2[A']\cup V_2[A'']$, but some node on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$ is covered by a component $A\in Q$ that is red or blue, then $A'\cup A''$ will overlap $A$ in $V_1[R]$ or $V_1[B]$. To see this, assume $A$ is red (the blue case is analogous) and let $\hat v$ be the node in $V_2[A]$ closest to $\hat u$ on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$. Then $\hat v=\mathrm{lca}_2(A'\cup(A\cap\mathcal{L}(\hat v)))\prec\mathrm{lca}_2(A)$, and since $A'\cup A$ is compatible in $R$, we should also have $\mathrm{lca}_1(A'\cup(A\cap\mathcal{L}(\hat v)))\prec\mathrm{lca}_1(A)$. Thus the merged component $A'\cup A''$ and $A$ overlap on a node on the path from $\mathrm{lca}_1(A')$ to $\mathrm{lca}_1(R\cup B)$.

We are now ready to complete the proof of Proposition 15, by showing that in Case 1 and if $t=0$ (i.e., if $\mathcal{P}^{(2)}$ has no tricolored components that are not top components), then at least one of (a), (b) and (c) in Lemma 16 holds for $\mathcal{P}^{(3)}$. By the conditions of Case 1, the unique tricolored component $A_0$ in $\mathcal{P}^{(0)}$ is not $(R\cup B)$-compatible, and there exists $x_W\in(W\cap A_0)\setminus\mathcal{L}(\mathrm{lca}_2(A_0\cap(R\cup B)))$.

If (a) holds, we are done, so suppose (a) does not hold, i.e., $\mathcal{P}^{(3)}$ has only unicolored components. We first make some observations which we later use to conclude that (b) or (c) must hold. Let $\hat u$ be the last node chosen in Make-$(R\cup B)$-compatible to subdivide the top component. Because $A_0$ is not $(R\cup B)$-compatible, at least one iteration of Make-$(R\cup B)$-compatible has to be executed on the component, so the existence of $\hat u$ follows. Let $A\subseteq A_0$ be the top component that is subdivided into $A\cap\mathcal{L}(\hat u)$ and $A\setminus\mathcal{L}(\hat u)$ at this point (after which the current partition is $\mathcal{P}^{(1)}$). We observe some properties of the two new components:

  • Letting $\hat u_\ell$ and $\hat u_r$ be the children of $\hat u$, then $A\cap\mathcal{L}(\hat u_\ell)\subseteq B$ and $A\cap\mathcal{L}(\hat u_r)\subseteq R$. To see this, note that by definition of Make-$(R\cup B)$-compatible, $A\cap\mathcal{L}(\hat u_\ell)$ and $A\cap\mathcal{L}(\hat u_r)$ each have a non-empty intersection with exactly one of $R$ and $B$, and they cannot intersect $W$ because otherwise $A\cap\mathcal{L}(\hat u)$ is a tricolored non-top component of $\mathcal{P}^{(1)}$, and then $\mathcal{P}^{(2)}$ would also have a tricolored non-top component by the definition of Make-Splittable, contradicting that $t=0$.

  • Because $A\subseteq A_0$ was the top component at the moment Make-$(R\cup B)$-compatible subdivided $A$ into $A\cap\mathcal{L}(\hat u)$ and $A\setminus\mathcal{L}(\hat u)$, $A\setminus\mathcal{L}(\hat u)$ is the top component in $\mathcal{P}^{(1)}$. By Lemma 13, $A\setminus\mathcal{L}(\hat u)$ intersects $R\cup B$. By the conditions of Case 1 and property 5 in Lemma 4, it contains a node $x_W$ that is not a descendant of $\mathrm{lca}_2(A_0\cap(R\cup B))$. Finally, $A\setminus\mathcal{L}(\hat u)$ is the only component in $\mathcal{P}^{(1)}$ that can cover a node on the path in $T_2$ from $\hat u$ to $\mathrm{lca}_2(A_0)$ (by the fact that $A_0$ was the unique top component in $\mathcal{P}^{(0)}$ and $A\setminus\mathcal{L}(\hat u)$ is the unique top component in $\mathcal{P}^{(1)}$).

Using the above, we now show we can find two components in $\mathcal{P}^{(3)}$ that are subsets of $A$ and that satisfy condition (b) or (c) of Lemma 16. As argued above, $A\cap\mathcal{L}(\hat u_\ell)\subseteq B$ and $A\cap\mathcal{L}(\hat u_r)\subseteq R$, so $A\cap\mathcal{L}(\hat u)$ is splittable; Split will subdivide $A\cap\mathcal{L}(\hat u)$ into a blue component $A\cap\mathcal{L}(\hat u_\ell)$ and a red component $A\cap\mathcal{L}(\hat u_r)$. There are a few cases to consider (illustrated in Fig. 7).

  1. If there is no node on the path in $T_2$ from $\hat u$ to $\mathrm{lca}_2(A_0)$ that is covered by a red or blue component in $\mathcal{P}^{(3)}$, then we are done because $\hat u$ with $A\cap\mathcal{L}(\hat u_\ell)$ and $A\cap\mathcal{L}(\hat u_r)$ satisfy (c).

  2. If there are nodes on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$ that are covered by red or blue components, let $\hat v$ be the node closest to $\hat u$ for which this is the case. Suppose without loss of generality that $\hat v\in V_2[R']$ for some red component $R'$. We will show that there is another red component that can reach $\hat v$, so these two components and $\hat v$ satisfy condition (b).
    (a) If the nodes between $\hat u$ and $\hat v$ are not covered by any components in $\mathcal{P}^{(3)}$, then the red component $A\cap\mathcal{L}(\hat u_r)$ can reach $\hat v$.
    (b) Otherwise, let $\hat w$ be the node closest to $\hat v$ on the path from $\hat u$ to $\hat v$ that is covered by a component in $\mathcal{P}^{(3)}$. By definition of $\hat v$, this component is white, say $W'\in\mathcal{P}^{(3)}$. We claim (and prove below) that in Make-Splittable a node $\hat u'$ must have been chosen that created a non-top component $A'=W'\cup R''$, which was subsequently split into the white component $W'$ and a red component $R''$ to obtain $\mathcal{P}^{(3)}$. By property 4 in Lemma 4, $\hat u'$ is not covered by $R''$ nor $W'$, so $R''$ and $W'$ are the leaves in $A'\cap\mathcal{L}(\hat u'_r)$ and $A'\cap\mathcal{L}(\hat u'_\ell)$, with $\hat u'_r,\hat u'_\ell$ being the two children of $\hat u'$. Hence, $R''$ can reach $\hat u'$. On the other hand, $\hat w$ is covered by $W'$, and thus $\hat w\prec\hat u'\prec\hat v$. By definition of $\hat w$, the nodes on the path from $\hat u'$ to $\hat v$ are not covered by any component in $\mathcal{P}^{(3)}$, and thus $R''$ can also reach $\hat v$.

It remains to prove that in 2(b), Make-Splittable selected a node $\hat u'$ that created a non-top component $W'\cup R''\subseteq\mathcal{L}(\hat u')$, which was subsequently split into $W',R''$ to obtain $\mathcal{P}^{(3)}$. First, observe that $W'$ and $R'$ were part of the top component in $\mathcal{P}^{(1)}$ (because they cover nodes on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$). They cannot both have been part of the top component of $\mathcal{P}^{(2)}$, because by the conditions of Case 1 and property 5 of Lemma 4, the top component of $\mathcal{P}^{(2)}$ contains a white leaf $x_W$ that is not a descendant of $\mathrm{lca}_2(A_0\cap(R\cup B))$, and thus $W'\cup\{x_W\}$ covers all nodes on the path from $\hat w$ to $\mathrm{lca}_2(A_0\cap(R\cup B))$, which includes $\hat v$. So $R'$ and $W'\cup\{x_W\}$ overlap and cannot be in the same component of the splittable partition $\mathcal{P}^{(2)}$. Thus, Make-Splittable must have selected some $\hat u'$ when $W'$ and $R'$ became part of different components. Note that $W'$ became part of a non-top component. It remains to show this component contains no blue leaves. Note that otherwise such a blue leaf $x_B$, a red leaf $x_R\in R'\cap\mathcal{L}(\hat v)$ and a red leaf $y_R\in R'\setminus\mathcal{L}(\hat v)$ (which exists because $\hat v$ is the lowest node on the path from $\hat u$ to $\mathrm{lca}_2(A_0)$ covered by $R'$) would all belong to the component $A\setminus\mathcal{L}(\hat u)\in\mathcal{P}^{(1)}$, but since $\mathrm{lca}_2(x_B,x_R)=\hat v\prec\mathrm{lca}_2(x_B,x_R,y_R)$, this triple would be incompatible, contradicting that $\mathcal{P}^{(1)}$ is $(R\cup B)$-compatible.

Fig. 7.

Illustration of the last part of the proof of Proposition 15. The sets $W'$ and $R'$ described in the proof are implicitly shown in the figure: $W'=W_1\cup W_2$ and $R'=R_1\cup R_2$

Theorem 17

The Red-Blue Algorithm is a 2-approximation for the maximum agreement forest (MAF) problem.

Proof

By Theorem 8, the Red-Blue Algorithm returns a feasible solution to MAF. We showed how to construct a feasible solution for the dual linear program (D’); by Propositions 14 and 15, the objective value of the solution to MAF returned by the Red-Blue Algorithm is at most twice the objective value of this dual solution. The approximation guarantee follows by linear programming duality.

A compact formulation of the LP

Here we give a compact formulation for (LP). This shows that it can be optimized efficiently. While this is not needed in our algorithm, it is possible that an LP-rounding based algorithm could achieve a better approximation guarantee, in which case this formulation will be of use. Moreover, the compact linear program explicitly encodes the structure of compatible sets in a way that (LP) does not; we believe this may provide additional structural insights in the future.

We remark that (LP) can also be shown to be polynomially solvable by providing a separation oracle for the dual. The dual of (LP) is similar to (D′), the dual of (LP′), except that $z$ is indexed only by singletons and not arbitrary subsets of $\mathcal{L}$. This dual has a polynomial number of variables, but an exponential number of constraints. By the equivalence of separation and optimization, it suffices to provide a separation oracle for this dual. In particular, it suffices to solve the problem of finding a most violated constraint amongst

$$\sum_{v\in V[L]\setminus\mathcal{L}}y_v+\sum_{v\in L}z_v\ \le\ 1\qquad\forall L\in\mathcal{C},$$

for some given $y$ and $z$. If we relabel $z_v$ to $y_v$, making $y$ a vector indexed by $V$, we can restate this as follows. Given some (positive or negative) weights $y$ on the nodes of $V$, find a compatible subset $L$ which maximizes $\sum_{v\in V[L]}y_v$. This is a weighted variant of the maximum agreement subtree problem; in other words, the maximum agreement subtree problem is the problem where $y_v=1$ for all $v\in\mathcal{L}$. Similar to the usual (unweighted) version [26], this can be solved in polynomial time via dynamic programming.
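To make the separation problem concrete, here is a brute-force sketch for very small instances. It is exponential in the number of leaves and is not the dynamic program referred to above. It rests on several assumptions made only for illustration: trees are nested tuples with integer leaf labels; a set is compatible when every triple has the same closest pair in both trees; $V[L]$ consists of the leaves in $L$ together with the internal nodes of either tree that lie on a path between two leaves of $L$; internal nodes are keyed as `(tree_id, root_path)`; and missing weights default to 0.

```python
from itertools import combinations

def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path (a tuple of 0/1 turns)."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = leaf_paths(tree[0], prefix + (0,))
    paths.update(leaf_paths(tree[1], prefix + (1,)))
    return paths

def shared(p, q):
    """Length of the common prefix of two root paths (the depth of the lca)."""
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return n

def closest_pair(paths, triple):
    """The pair of the triple with the lowest lca in this tree."""
    return max(combinations(triple, 2),
               key=lambda ab: shared(paths[ab[0]], paths[ab[1]]))

def compatible(all_paths, leaves):
    """True if every triple has the same closest pair in both trees."""
    return all(closest_pair(all_paths[0], t) == closest_pair(all_paths[1], t)
               for t in combinations(leaves, 3))

def spanned_internal(paths, leaves, tree_id):
    """Internal nodes of one tree on a path between two leaves of `leaves`."""
    nodes = set()
    for a, b in combinations(leaves, 2):
        d = shared(paths[a], paths[b])
        for p in (paths[a], paths[b]):
            # Prefixes of length d .. len(p)-1: from the lca down to the leaf's parent.
            nodes.update((tree_id, p[:k]) for k in range(d, len(p)))
    return nodes

def most_violated(tree1, tree2, y):
    """Brute force over all leaf subsets: (best weight of V[L], best compatible L)."""
    all_paths = [leaf_paths(tree1), leaf_paths(tree2)]
    labels = sorted(all_paths[0])
    best = (float("-inf"), ())
    for size in range(1, len(labels) + 1):
        for leaves in combinations(labels, size):
            if not compatible(all_paths, leaves):
                continue
            nodes = set(leaves)
            for tree_id in (0, 1):
                nodes |= spanned_internal(all_paths[tree_id], leaves, tree_id)
            weight = sum(y.get(v, 0) for v in nodes)
            best = max(best, (weight, leaves))
    return best
```

A constraint is violated when the returned weight exceeds 1. With weight 1 on every leaf and 0 elsewhere, the returned weight is the size of a largest compatible set, matching the remark that the unweighted problem is maximum agreement subtree.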

Assume for convenience that $\mathcal{L}=\{1,2,\ldots,n\}$. We will deviate from the notational conventions in the previous sections, and use $i$ and $j$ to denote leaves, and $t\in\{1,2\}$ to index the two input trees.

Let $Z$ denote the set of all pairs $(i_1,i_2)\in\mathcal{L}^2$ for which $i_1\le i_2$. Consider a compatible set $L\subseteq\mathcal{L}$. For $t\in\{1,2\}$, we will use $T_t[L]$ to denote the subtree of $T_t$ on $V_t[L]$. Compatibility implies that $T_1[L]$ and $T_2[L]$ are isomorphic. What we will now do is represent the structure of these isomorphic trees by an out-arborescence $F(L)$, where the nodes of the arborescence are elements of $Z$, and more precisely, are a subset of $\{(i,j):i,j\in L,\ i\le j\}$. We do this as follows.

  • If $L$ contains only a single element $i$, then $F(L)$ is the arborescence consisting of the single vertex $(i,i)$.

  • Otherwise, let $L_1$ and $L_2$ be the partition of $L$ into the leaves below the two children of the root of $T_1[L]$. Let $i_1$ be the smallest element of $L_1$ and $i_2$ the smallest element of $L_2$; we assume $i_1<i_2$ (otherwise, swap $L_1$ and $L_2$). The root of $F(L)$ will be chosen as $r:=(i_1,i_2)$. Now recursively apply this procedure to $L_1$ and $L_2$, yielding arborescences $F(L_1)$ and $F(L_2)$; let $r_1$ and $r_2$ denote their respective roots. Note that $F(L_1)$ and $F(L_2)$ are necessarily disjoint, since $L_1$ and $L_2$ are disjoint. Then $F(L)$ is defined to be the union of $F(L_1)$ and $F(L_2)$, along with the arcs $(r,r_1)$ and $(r,r_2)$.

    Observe that the pair $r_1$ is of the form $(i_1,i_1')$ for some $i_1'$, since $i_1$ remains the smallest element of $L_1$, whereas $r_2$ is of the form $(i_2,i_2')$ for some $i_2'$. We call $(r,r_1)$ the left arc and $(r,r_2)$ the right arc leaving $r$.

So put differently, this procedure takes the tree $T_1[L]$ (or $T_2[L]$; it makes no difference), contracts all nodes with only a single child, orients all edges away from the root, and then assigns a label to each node. This label consists of a pair of leaves in $L$, chosen minimally amongst the leaves in each of the two subtrees below the node (aside from leaves, which are labelled by repeating the leaf twice). We also note that if $L$ is not a compatible set, then we could still apply this procedure, but it would return different results when applied to $T_2[L]$ instead of $T_1[L]$.
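The labelling procedure just described is short enough to sketch directly. The code below assumes trees are given as nested tuples with integer leaf labels, a representation chosen only for this illustration; the function names and example trees are hypothetical.

```python
def restrict(tree, keep):
    """Restrict a nested-tuple binary tree to the leaves in `keep`,
    contracting nodes that are left with a single child."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    left, right = restrict(tree[0], keep), restrict(tree[1], keep)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)

def label_tree(tree):
    """Return (root_label, arcs) for a non-empty restricted tree.

    Every label is a pair (i, j) with i <= j; a leaf i is labelled (i, i).
    The first entry of a label is the smallest leaf below that node.
    """
    if not isinstance(tree, tuple):
        return (tree, tree), set()
    (r1, arcs1), (r2, arcs2) = label_tree(tree[0]), label_tree(tree[1])
    if r1[0] > r2[0]:
        # Swap so that r1 is the side containing the smaller minimum leaf.
        r1, r2 = r2, r1
    root = (r1[0], r2[0])
    # (root, r1) is the left arc and (root, r2) the right arc leaving root.
    return root, arcs1 | arcs2 | {(root, r1), (root, r2)}

def arborescence_of(tree, leaves):
    """F(L) computed from one tree, for a non-empty leaf set L."""
    return label_tree(restrict(tree, set(leaves)))

# Hypothetical example trees on the leaves 1..4.
T1 = ((1, 2), (3, 4))
T2 = (((1, 2), 3), 4)
```

For the compatible set {1, 2, 3} both trees give the same arborescence with root (1, 3), while for the incompatible set {1, 3, 4} the roots already differ: (1, 3) from `T1` and (1, 4) from `T2`.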

With this representation of a compatible set in mind, we now construct a certain directed graph $D$ on the vertex set $Z$. It essentially contains all possible arcs that could appear in an arborescence constructed from a compatible set. We will use $U_1$ for arcs that can appear as left arcs, and $U_2$ for arcs that can appear as right arcs: the arc set of $D$ is $U_1\cup U_2$. With a slight abuse of notation, define $\mathrm{lca}_t(r)=\mathrm{lca}_t(i_1,i_2)$ for any $r=(i_1,i_2)\in Z$; we can think of $\mathrm{lca}_t(r)$ as being the node in $T_t$ that the pair $r$ identifies. Given two nodes $r=(i_1,i_2)$ and $s=(j_1,j_2)$ in $Z$:

  • $(r,s)\in U_1$ if $\mathrm{lca}_t(s)\prec\mathrm{lca}_t(r)$ for all $t\in\{1,2\}$ and $i_1=j_1$;

  • $(r,s)\in U_2$ if $\mathrm{lca}_t(s)\prec\mathrm{lca}_t(r)$ for all $t\in\{1,2\}$ and $i_2=j_1$.
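As a sanity check on these two rules, the arc sets can be enumerated for a small instance. The sketch below assumes nested-tuple trees with integer leaf labels and reads the relation between $\mathrm{lca}_t(s)$ and $\mathrm{lca}_t(r)$ as "proper descendant"; both the representation and the function names are assumptions made for illustration.

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path (a tuple of 0/1 turns)."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = leaf_paths(tree[0], prefix + (0,))
    paths.update(leaf_paths(tree[1], prefix + (1,)))
    return paths

def lca_path(paths, pair):
    """Root path of lca(i, j): the common prefix of the two leaves' root paths."""
    p, q = paths[pair[0]], paths[pair[1]]
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]

def arc_sets(tree1, tree2):
    """Return (Z, U1, U2) for two trees on the same leaf set."""
    trees = [leaf_paths(tree1), leaf_paths(tree2)]
    labels = sorted(trees[0])
    Z = [(i, j) for i in labels for j in labels if i <= j]

    def strictly_below(s, r):
        """lca_t(s) is a proper descendant of lca_t(r) in both trees."""
        for paths in trees:
            ps, pr = lca_path(paths, s), lca_path(paths, r)
            if len(ps) <= len(pr) or ps[:len(pr)] != pr:
                return False
        return True

    U1 = {(r, s) for r in Z for s in Z if r[0] == s[0] and strictly_below(s, r)}
    U2 = {(r, s) for r in Z for s in Z if r[1] == s[0] and strictly_below(s, r)}
    return Z, U1, U2
```

Because every arc goes to a pair whose lca is strictly lower in both trees, the resulting graph has no directed cycles, and pairs of the form $(i,i)$ have no outgoing arcs.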

For any $L\subseteq\mathcal{L}$, define $Z_L=\{(i,i):i\in L\}$; this is the set of pairs in $Z$ that appear as labels for the leaves $L$. Let $\mathcal{F}$ denote the set of out-arborescences in $D$ with leaf set contained in $Z_{\mathcal{L}}$ and where each internal node has one outgoing arc in $U_1$ and one outgoing arc in $U_2$. Then the above discussion implies that $L\in\mathcal{C}$ if and only if there is an $F(L)\in\mathcal{F}$ with leaf set $Z_L$. Let $\chi_F\in\{0,1\}^{U_1\cup U_2}$ be the characteristic vector of the arc set of $F$, for any $F\in\mathcal{F}$. Let $C_{\mathcal{F}}$ denote the cone generated by $\{\chi_F:F\in\mathcal{F}\}$, i.e., $y\in C_{\mathcal{F}}$ if and only if there exists $x\in\mathbb{R}^{\mathcal{C}}$ with $x\ge 0$ such that $y=\sum_{L\in\mathcal{C}:|L|\ge 2}x_L\chi_{F(L)}$.

We begin by giving a description of $C_{\mathcal{F}}$. For $r\in Z$, let $\delta^+(r)$ denote the arcs in $D$ leaving $r$, and $\delta^-(r)$ the arcs entering $r$. For $S\subseteq U_1\cup U_2$, let $y(S)=\sum_{a\in S}y_a$.

Lemma 18

 

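$$C_{\mathcal{F}}=\Bigl\{y\in\mathbb{R}_+^{U_1\cup U_2}:\ \ y(\delta^+(r)\cap U_1)=y(\delta^+(r)\cap U_2)\ \ \forall r\in Z\setminus Z_{\mathcal{L}},\qquad y(\delta^+(r)\cap U_1)\ge y(\delta^-(r))\ \ \forall r\in Z\setminus Z_{\mathcal{L}}\Bigr\}.$$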

Proof

Let $Y$ denote the cone described by the right hand side of the claimed equality. First, we observe that $Y\supseteq C_{\mathcal{F}}$. Consider any $F\in\mathcal{F}$. Then for any $r\in F\cap(Z\setminus Z_{\mathcal{L}})$, $F$ has at most one arc entering $r$ (exactly one unless $r$ is the root of $F$), precisely one arc leaving $r$ that is in $U_1$, and one arc leaving $r$ that is in $U_2$. Hence $\chi_F\in Y$, and therefore any conic combination of the $\chi_F$’s is also in $Y$.

It remains to show that $Y\subseteq C_{\mathcal{F}}$. Suppose $y\in Y$; we prove that $y\in C_{\mathcal{F}}$, proceeding by induction on the number of nonzero elements of $y$. The claim trivially holds if $y=0$, since $C_{\mathcal{F}}$ is a cone. So suppose $y\neq 0$.

We first claim that for any $r\in Z$ for which either $r\in Z_{\mathcal{L}}$ or $y(\delta^+(r))>0$, there exists an arborescence $F\in\mathcal{F}$ rooted at $r$ and contained in the support of $y$, by which we mean the set of arcs in $D$ for which $y$ is nonzero. To prove this, we can proceed by induction on $|\mathrm{lca}_1(r)|$. The claim is trivial if $|\mathrm{lca}_1(r)|=1$, since then $r\in Z_{\mathcal{L}}$ and we take an arborescence consisting only of the node $r$. Otherwise, choose any $(r,r_1)\in U_1\cap\delta^+(r)$ and $(r,r_2)\in U_2\cap\delta^+(r)$ that are both in the support of $y$. Notice that one of them must exist because $y(\delta^+(r))>0$, and then the other must exist as well, because of the equality in the definition of $C_{\mathcal{F}}$. As a result, $y(\delta^-(r_1))>0$, and so the second constraint in the definition of $Y$ implies that either $r_1\in Z_{\mathcal{L}}$ or $y(\delta^+(r_1))>0$; the same holds for $r_2$. Hence by induction, we obtain arborescences $F_1$ and $F_2$ in the support of $y$ rooted at $r_1$ and $r_2$ respectively. We have already noted that there is no node that both $r_1$ and $r_2$ can reach; thus $F_1$ and $F_2$ are disjoint. We obtain $F$ by combining $F_1$, $F_2$ and the arcs from $r$.

Now choose $r=(i_1,i_2)\in Z$ such that $y(\delta^-(r))=0$ but $y(\delta^+(r))>0$ (such an $r$ clearly exists, since $D$ is acyclic and $y\neq 0$). By the above, we can find an arborescence $F\in\mathcal{F}$ rooted at $r$ and contained in the support of $y$. Now set $y'=y-\epsilon\chi_F$, where $\epsilon$ is chosen maximally so that $y'\ge 0$. For every node $s$ contained in $F$ that is distinct from $r$ and not in $Z_{\mathcal{L}}$, $\chi_F(\delta^-(s))=\chi_F(\delta^+(s)\cap U_1)=\chi_F(\delta^+(s)\cap U_2)=1$. Further, $\chi_F(\delta^-(r))=0<\chi_F(\delta^+(r)\cap U_1)=\chi_F(\delta^+(r)\cap U_2)$. It follows that $y'\in Y$. The choice of $\epsilon$ ensures that $y'$ has strictly smaller support than $y$, and so by induction, we deduce that $y'\in C_{\mathcal{F}}$. Hence $y=y'+\epsilon\chi_F$ is too.
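The description of the cone in Lemma 18 is easy to test numerically on a toy instance. The sketch below reads the two constraint families as: at every pair $r$ that is not of the form $(i,i)$, the flow leaving on $U_1$ equals the flow leaving on $U_2$, and is at least the flow entering $r$ (the direction of the inequality is taken from how the constraint is used in the proof). The function name, the tolerance, and the hand-built arc sets for two copies of the tree ((1, 2), 3) are assumptions made for illustration.

```python
def satisfies_lemma18(Z, U1, U2, y, tol=1e-9):
    """Check nonnegativity and both constraint families for a vector y,
    given as a dict from arcs to values (missing arcs count as 0)."""
    if any(value < -tol for value in y.values()):
        return False
    for r in Z:
        if r[0] == r[1]:
            # r labels a leaf, so neither constraint applies to it.
            continue
        out1 = sum(y.get(a, 0) for a in U1 if a[0] == r)
        out2 = sum(y.get(a, 0) for a in U2 if a[0] == r)
        inflow = sum(y.get(a, 0) for a in U1 | U2 if a[1] == r)
        if abs(out1 - out2) > tol or out1 < inflow - tol:
            return False
    return True

# Hand-built example: both input trees equal ((1, 2), 3).
Z = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
U1 = {((1, 2), (1, 1)), ((1, 3), (1, 1)), ((1, 3), (1, 2)), ((2, 3), (2, 2))}
U2 = {((1, 2), (2, 2)), ((1, 3), (3, 3)), ((2, 3), (3, 3))}

# Characteristic vector of F({1, 2, 3}): root (1, 3) with internal node (1, 2).
chi_F = {((1, 3), (1, 2)): 1, ((1, 3), (3, 3)): 1,
         ((1, 2), (1, 1)): 1, ((1, 2), (2, 2)): 1}
```

The characteristic vector `chi_F` and any nonnegative multiple of it pass the check, whereas a vector supported on the single arc from (1, 3) to (1, 2) fails both constraint families.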

Using Lemma 18, we now describe our compact formulation, which we refer to below as the compact LP. For $t\in\{1,2\}$ and $v\in V_t$, let $\mathrm{lca}_t^{-1}(v)=\{r\in Z:\mathrm{lca}_t(r)=v\}$, i.e., the set of all pairs of leaves with one leaf in $v$’s left subtree, and the other in its right subtree.

[Display of the compact linear program: given as an image in the original and not reproduced here.]

Lemma 19

The compact LP is equivalent to (LP).

Proof

We begin by showing the “easy” direction, that a feasible solution $x$ to (LP) can be converted to a feasible solution $(y,\bar x)$ to the compact LP with the same objective value. Set $\bar x_i=x_{\{i\}}$ for all $i\in\mathcal{L}$, and $y=\sum_{L\in\mathcal{C}:|L|\ge 2}x_L\chi_{F(L)}$. By the “easy” direction of Lemma 18 (and the equality $\sum_{L\in\mathcal{C}:i\in L}x_L=1$ for all $i\in\mathcal{L}$), we can deduce that $(y,\bar x)$ is feasible for the compact LP. Further, the objective values match: for any $L\in\mathcal{C}$ with $|L|\ge 2$, the contribution of the term $x_L\chi_{F(L)}$ in $y$ to the objective value is exactly $x_L$ (only the root of $F(L)$ contributes), and the term $\sum_{i\in\mathcal{L}}\bar x_i$ captures the fractional value of singleton components.

The “hard” direction, that a feasible solution $(y,\bar x)$ to the compact LP can be converted to a feasible solution $x$ to (LP) of the same objective value, follows in exactly the same way, but using the “hard” direction of Lemma 18. The constraints (LP-2) and (LP-3) ensure that $y\in C_{\mathcal{F}}$, and thus we can expand $y=\sum_{L\in\mathcal{C}:|L|\ge 2}x_L\chi_{F(L)}$ for some $x\ge 0$. Extending this $x$ to singleton sets by defining $x_{\{i\}}=\bar x_i$ for all $i\in\mathcal{L}$ yields the desired solution to (LP).

Conclusion

We have described a factor-2 approximation algorithm for the MAF problem with a quadratic running time. Unlike previous algorithms for the problem, we crucially exploit the power of linear programming duality in our analysis. A number of clear directions remain for future work.

Most obvious is the question of whether the approximation factor can be further improved. The approximation ratio of our algorithm implies an upper bound of 2 on the integrality gap of our linear program. However, the largest lower bound on the integrality gap of our linear program that we are aware of is 5/4; Fig. 8 in Appendix 2 shows one of many examples achieving this bound. Despite extensive computational experiments on instances with a small number of leaves, we have not been able to find any examples with an integrality gap larger than 5/4. It is thus possible that our formulation could be used as the basis for an improved algorithm (though we would expect such an algorithm to be quite different from the algorithm presented here).

Fig. 8.


Example with integrality gap of $5/4$ for the new LP introduced in Section 4. An optimal solution to the LP-relaxation is indicated by the colors: the compatible sets corresponding to each of the components $\{1,2,3\}$, $\{1,5,8\}$ and $\{4,6,7\}$ (indicated by the colors red, blue and green respectively) have an $x$-value of $\tfrac12$, as well as every singleton leaf set, except leaf 1; all other $x$-values are 0. The objective value of this solution is $\tfrac12(3+7)-1=4$. An optimal solution to the ILP has 6 components, i.e., an objective value of 5. (One such optimal solution has the component $\{1,2,3\}$ along with singleton components)
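The arithmetic in this caption can be verified with exact fractions. The sketch below takes the fractional solution exactly as stated in the caption and checks that every leaf is covered with total weight 1, then recomputes the LP value and the resulting gap. It does not check the constraints on internal nodes, since those depend on the two trees shown in the figure; the variable names are illustrative.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Fractional solution from the caption: three triples and the singletons 2..8.
x = {frozenset({1, 2, 3}): half,
     frozenset({1, 5, 8}): half,
     frozenset({4, 6, 7}): half}
x.update({frozenset({i}): half for i in range(2, 9)})

# Total x-value of the sets containing each leaf 1..8.
coverage = {i: sum(v for s, v in x.items() if i in s) for i in range(1, 9)}

lp_value = sum(x.values()) - 1    # (1/2) * (3 + 7) - 1
ilp_value = 5                     # 6 components, as stated in the caption
gap = Fraction(ilp_value) / lp_value
```

Leaf 1 is the only leaf without a singleton set because it already lies in two of the three triples.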

One natural idea would be to apply another powerful and successful technique in the theory of approximation algorithms, namely LP rounding. We have shown that the LP relaxation can be efficiently optimized via an equivalent compact formulation. It should, however, be noted that such an approach will be much slower than the purely combinatorial algorithm presented here, where the ILP formulation is used only in the analysis. It may thus not be the most promising approach for an algorithm of practical relevance.

Our ILP formulation may also be useful for exactly solving the MAF problem. Although it is NP-hard, ILP solvers are very successful in practice. Since our formulation appears to be quite strong, it may work better in practice than simpler formulations, such as the one of Wu.

Aside from improving the approximation factor, another natural avenue to pursue is improving the running time. A factor-2 approximation algorithm with a linear or near-linear running time, to match what has been achieved with a factor-3 approximation, would clearly be very desirable. It does not seem straightforward to improve the running time of our current algorithm; quite substantial changes would likely be needed.

Another very natural direction is to consider other variants of the MAF problem. Examples include the variant with more than two trees, for which the current best approximation factor is 3 [7], and the generalization to non-binary trees. It is straightforward to extend our formulation to both of these settings.

Finally, it must be admitted that our algorithm, and especially its analysis, is far from simple. A truly simple algorithm for MAF with an approximation factor of 2, if one can be found, will certainly require understanding its structure even more deeply.

Acknowledgements

We acknowledge the support of the Tinbergen Institute and the Hausdorff Research Institute for Mathematics, where portions of this research were pursued. We thank the anonymous referees for their very careful readings and detailed feedback on improving the presentation.

N.O. was supported in part by NWO Veni grant 639.071.307 and NWO Vidi grant 016.Vidi.189.087. F.S. was supported in part by NSF grants CCF-1526067 and CCF-1522054. L.S. was supported by NWO Gravitation Programme Networks 024.002.003. A.v.Z. was supported in part by grant #359525 from the Simons Foundation.

Appendix A The running time

It is quite clear from the definition of the Red-Blue Algorithm that it runs in polynomial time. In this section we show that it can be implemented to run in $O(n^2)$ time, where $n$ denotes the number of leaves. (We work in the random access machine model of computation, and assume a word size of $\Omega(\log n)$.)

We note that our presentation is focused on showing the bound on the running time as straightforwardly as possible, and there are some places where a more careful implementation is more efficient. However, we have not been able to find an implementation with an overall running time of $o(n^2)$.

We assume $\mathcal{P}$ is a given partition (not overlapping in $V_2$) that is stored such that we can query the size of any component in constant time, and that for each node $\hat{u} \in V_2$ we can query $A_{\hat{u}}$, the component in $\mathcal{P}$ that covers $\hat{u}$ (which is equal to $\emptyset$ if $\mathcal{P}$ does not cover $\hat{u}$), and $s(\hat{u}) = |L(\hat{u}) \cap A_{\hat{u}}|$. Note that we can determine this information by a bottom-up pass of $T_2$ in $O(n)$ time. We recompute it whenever we refine $\mathcal{P}$; since there can be at most $n-1$ refinement operations, the total time to maintain this information is $O(n^2)$.
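The bottom-up pass can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: a component is taken to cover a node when the node lies in the minimal subtree spanning the component, so a child's component continues upward exactly when fewer than all of its leaves lie below the child (the same rule the appendix applies to $T_1$). The tree encoding (`children` dictionary) and the name `cover_labels` are hypothetical.

```python
from collections import Counter

def cover_labels(children, root, comp_of_leaf):
    """Return (cover, s): the covering component (or None) and s(v) per node.

    children:     {internal node: [child, child]}; leaves do not appear as keys
    comp_of_leaf: {leaf: component id}
    """
    size = Counter(comp_of_leaf.values())
    # Pre-order list; reversing it visits every node after its descendants.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, ()))
    cover, s = {}, {}
    for v in reversed(order):
        if v not in children:                      # leaf
            cover[v], s[v] = comp_of_leaf[v], 1
            continue
        # A child's component reaches v only if some of its leaves lie elsewhere.
        open_kids = [c for c in children[v]
                     if cover[c] is not None and s[c] < size[cover[c]]]
        comps = {cover[c] for c in open_kids}
        if len(comps) > 1:
            raise ValueError(f"partition overlaps at node {v!r}")
        cover[v] = comps.pop() if comps else None
        s[v] = sum(s[c] for c in open_kids)
    return cover, s

# Toy tree ((a,b),(c,d)) with components X = {a, c}, Y = {b}, Z = {d}.
children = {"r": ["p", "q"], "p": ["a", "b"], "q": ["c", "d"]}
cover, s = cover_labels(children, "r", {"a": "X", "b": "Y", "c": "X", "d": "Z"})
```

Each node is handled in constant time given its two children, which gives the $O(n)$ bound for a binary tree.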

By Harel [14] (see also [2, 15]), we may furthermore assume that computing $\mathrm{lca}_i(u,v)$ for given nodes $u, v \in V_i$ takes constant time (after linear preprocessing time). It immediately follows that we can also determine whether or not $u \preceq v$ in tree $T_i$ in constant time.
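To make the reduction from the ancestor test to an lca query concrete, here is a small sketch. It swaps in a simpler technique than the one cited: an Euler tour with a sparse table, which answers queries in constant time but needs $O(n \log n)$ preprocessing rather than linear. The function names and the dictionary encoding of the tree are assumptions made for illustration.

```python
def build_lca(children, root):
    """Return lca(u, v) for a rooted tree given as {node: [child, ...]}."""
    euler, depth, first = [root], [0], {root: 0}
    stack = [(root, 0, iter(children.get(root, ())))]
    while stack:
        v, d, it = stack[-1]
        child = next(it, None)
        if child is None:                  # subtree of v finished
            stack.pop()
            if stack:                      # step back up to the parent
                euler.append(stack[-1][0])
                depth.append(stack[-1][1])
        else:
            first[child] = len(euler)
            euler.append(child)
            depth.append(d + 1)
            stack.append((child, d + 1, iter(children.get(child, ()))))

    # table[j][i]: position of the shallowest tour entry in [i, i + 2^j).
    m = len(euler)
    table = [list(range(m))]
    j = 1
    while (1 << j) <= m:
        prev, half = table[-1], 1 << (j - 1)
        table.append([
            prev[i] if depth[prev[i]] <= depth[prev[i + half]] else prev[i + half]
            for i in range(m - (1 << j) + 1)
        ])
        j += 1

    def lca(u, v):
        lo, hi = sorted((first[u], first[v]))
        j = (hi - lo + 1).bit_length() - 1
        a, b = table[j][lo], table[j][hi - (1 << j) + 1]
        return euler[a if depth[a] <= depth[b] else b]

    return lca

def is_descendant(lca, u, v):
    """True if u equals v or lies below v, i.e. lca(u, v) == v."""
    return lca(u, v) == v

tree = {"r": ["p", "q"], "p": ["a", "b"], "q": ["c", "d"]}
lca = build_lca(tree, "r")
```

The descendant test uses a single lca query, so it inherits the constant query time.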

We will show that the time between subsequent refinements of $\mathcal{P}$ is $O(n)$. This bounds the time of the main loop of the algorithm by $O(n^2)$. The only remaining part of the algorithm is the Merge-Components step, which performs at most $n-1$ merges, each of which can clearly be done in $O(n)$ time.

Finding a lowest root of infeasibility

We make a single pass through $T_1$, in bottom-up order (starting from the leaves), until we find a root of infeasibility. We spend constant time per node, which shows that the time to find a lowest root of infeasibility is $O(n)$.

For each node $u \in V_1$ that we have already considered, $A_u$ references the component $A \in \mathcal{P}$ which covers $u$, with $A_u = \emptyset$ if there is no such component. (If there are multiple such components, $u$ is a root of infeasibility.) Furthermore, $\hat{p}_u$ is equal to $\mathrm{lca}_2(A_u \cap L(u))$, and $s(u)$ is the size of $A_u \cap L(u)$. Observe that for any $x \in \mathcal{L}$, we know $A_x$, the component containing $x$, and $s(x) = 1$ and $\hat{p}_x = x$.

Given a non-leaf node $u \in V_1$ with children $u_1$ and $u_2$ that have already been considered, we can determine whether $u$ is a root of infeasibility and, if not, determine $A_u$, $\hat{p}_u$ and $s(u)$, in constant time. If either (or both) of $A_{u_1}$ and $A_{u_2}$ do not cover $u$ (which can be determined by checking whether $A_{u_i} = \emptyset$ or $s(u_i) = |A_{u_i}|$), set all the values according to which child (if any) does cover $u$, and end the consideration of node $u$. So assume from now on that both do cover $u$.

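If $A_{u_1} \neq A_{u_2}$, then $u$ satisfies the second condition of a root of infeasibility, and we are done. Otherwise, $A_u = A_{u_1} = A_{u_2}$. Set $\hat{p}_u = \mathrm{lca}_2(\hat{p}_{u_1}, \hat{p}_{u_2})$ and $s(u) = s(u_1) + s(u_2)$. If $\hat{p}_{u_1} \nprec \hat{p}_u$ or $\hat{p}_{u_2} \nprec \hat{p}_u$ (that is, one of them coincides with $\hat{p}_u$), then $L(u)$ is incompatible, and $u$ satisfies the first condition of a root of infeasibility. If $s(\hat{p}_u) = |A_u|$ and $s(u) < |A_u|$, then $u$ satisfies the third condition for being a root of infeasibility: since $s(u) < |A_u|$, we know $A_u \setminus L(u) \neq \emptyset$. For any $w \in A_u \setminus L(u)$, $\mathrm{lca}_1(A_u \cap L(u)) \preceq u \prec \mathrm{lca}_1((A_u \cap L(u)) \cup \{w\})$, while $s(\hat{p}_u) = |A_u|$ implies that $\mathrm{lca}_2(A_u \cap L(u)) = \mathrm{lca}_2(A_u)$. So $(A_u \cap L(u)) \cup \{w\}$ is incompatible for any $w \in A_u \setminus L(u)$. Otherwise, $u$ is not a root of infeasibility, and we finish our consideration of $u$.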

Once we have determined the coloring into $R$, $B$ and $W$, we compute $|A \cap C|$ for each component $A \in \mathcal{P}$ and $C \in \{R,B,W\}$. We also compute three additional labels for each node $\hat{u} \in V_2$: $s_C(\hat{u}) = |A_{\hat{u}} \cap L(\hat{u}) \cap C|$ for $C \in \{R,B,W\}$. This information can be determined by a bottom-up traversal of $T_2$ in $O(n)$ time. We assume this information is updated whenever the partition is refined.
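The per-color labels can be accumulated in the same kind of bottom-up traversal. The sketch below assumes the covering component of every node is already known (the `cover` dictionary, with `None` for uncovered nodes) and that each leaf carries one of the colors `"R"`, `"B"`, `"W"`. The function name and data layout are hypothetical.

```python
def color_counts(children, root, cover, color):
    """Return {v: {"R": ., "B": ., "W": .}} with the label s_C(v) for each node v.

    children: {internal node: [child, child]}; leaves do not appear as keys
    cover:    {node: covering component id, or None}
    color:    {leaf: "R" | "B" | "W"}
    """
    # Pre-order list; reversing it visits every node after its descendants.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, ()))
    counts = {}
    for v in reversed(order):
        c = dict.fromkeys("RBW", 0)
        if v not in children:                  # a leaf counts itself
            c[color[v]] = 1
        elif cover[v] is not None:
            # Only children covered by the same component contribute leaves.
            for child in children[v]:
                if cover[child] == cover[v]:
                    for k in "RBW":
                        c[k] += counts[child][k]
        counts[v] = c
    return counts

# Toy tree ((a,b),(c,d)) with components X = {a, c}, Y = {b}, Z = {d}.
children = {"r": ["p", "q"], "p": ["a", "b"], "q": ["c", "d"]}
cover = {"a": "X", "b": "Y", "c": "X", "d": "Z", "p": "X", "q": "X", "r": "X"}
color = {"a": "R", "b": "W", "c": "B", "d": "W"}
counts = color_counts(children, "r", cover, color)
```

With these labels, the checks described in the following subsections reduce to constant-time comparisons per node.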

Make-$(R \cup B)$-compatible

Consider the nodes of $T_2$ in bottom-up order, until we find a node $\hat{u}$ such that both $s_B(\hat{u}) > 1$ and $s_R(\hat{u}) > 1$. Since $\hat{u}$ is a lowest such node, if $|A_{\hat{u}} \cap R| = s_R(\hat{u})$ and $|A_{\hat{u}} \cap B| = s_B(\hat{u})$, then $A_{\hat{u}}$ is $(R \cup B)$-compatible; otherwise $\hat{u}$ is precisely as indicated in Make-$(R \cup B)$-compatible.

Make-splittable

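We again consider the nodes in $T_2$ in bottom-up order. For any node $\hat{u}$ with $A_{\hat{u}} \neq \emptyset$, using $s_C(\hat{u})$ for $C \in \{R,B,W\}$, we can check in $O(1)$ time whether $A_{\hat{u}} \cap L(\hat{u})$ is bicolored, and whether for every $C \in \{R,B,W\}$ with $A_{\hat{u}} \cap C \neq \emptyset$ we have $s_C(\hat{u}) < |A_{\hat{u}} \cap C|$ (and hence $(A_{\hat{u}} \setminus L(\hat{u})) \cap C \neq \emptyset$).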

Split

Note that a regular split of a component $A$ can be done in $O(n)$ time, by simply checking the color of each leaf in $A$ and partitioning $A$ accordingly. We now show how to check whether $A$ needs a Special-Split (and if so, which of the two possible refinements is applied) by considering the nodes in $V_2[A]$ in bottom-up order.

If $A$ is tricolored, then the fact that $A$ is $(R \cup B)$-compatible and splittable implies that there exist $\hat{u}_R$ and $\hat{u}_B$ that are covered by $A$ and for which $A \cap L(\hat{u}_R) = A \cap R$ and $A \cap L(\hat{u}_B) = A \cap B$. A bottom-up traversal of $V_2$ will find $\hat{u}_R$ and $\hat{u}_B$ (they are the first nodes $\hat{v}$ encountered such that $A_{\hat{v}} = A$ and $s_C(\hat{v}) = |A \cap C|$ for $C = R$ and $C = B$ respectively).

Given $\hat{u}_R$ and $\hat{u}_B$, we check whether a Special-Split is required by considering $\hat{u} = \mathrm{lca}_2(A \cap (R \cup B)) = \mathrm{lca}_2(\hat{u}_R, \hat{u}_B)$; a Special-Split is required exactly if $s(\hat{u}) < |A|$, since in that case any $x_W \in A \setminus L(\hat{u})$ forms a compatible triple with any $x_R \in A \cap R$ and $x_B \in A \cap B$. If, in addition, $s_W(\hat{u}) = 0$, we know that every tricolored triple in $A$ is compatible.

Find-merge-pair

We need to determine, in $O(n)$ time, whether there exist two components (both intersecting $R \cup B$) that can be merged. If such components are found, then we can take a non-white leaf in each component and add this pair to pairslist. Recall Lemma 16, which enumerates all possible situations where a potential merge may exist.

  1. $\mathcal{P}$ has a bicolored component. This component must have been created by an application of Special-Split, splitting some component $A' \cup A'' \in \mathcal{P}^{(2)}$ into $A'$ and $A''$. As discussed in the proof of Lemma 16, simply undoing this split is a valid merge. Since Special-Split is invoked at most once per iteration, we can simply add a pair to pairslist during Special-Split.

  2. There is a node $\hat{u} \in V_2$ that can be reached by two red or two blue components that were part of the same component at the start of the current iteration. By Lemma 12 (and property 1 of Lemma 4), any two non-white components that were created in the current iteration must have been part of the same component at the start of the iteration. We may assume that we can check for each component in constant time whether it was created in the current iteration.

    We work bottom-up in $T_2$ and, for every $\hat{u} \in V_2$, set $\mathcal{B}_{\hat{u}}$ to be the set of red and blue components that were created in the current iteration and that can reach $\hat{u}$. If $\mathcal{B}_{\hat{u}}$ contains two components of the same color, these two components can be merged, and we terminate.

    Note that if $\mathcal{B}_{\hat{u}}$ ever contains three components, we will have found a merge and the algorithm will terminate. This ensures that we can compute $\mathcal{B}_{\hat{u}}$ in constant time, given the values of $A_{\hat{u}}$ and of $\mathcal{B}_{\hat{u}_1}$ and $\mathcal{B}_{\hat{u}_2}$ for the children $\hat{u}_1, \hat{u}_2$ of $\hat{u}$.

  3. There is a node $\hat{u} \in V_2$ that can be reached by a red and a blue component that were part of the same component $A_0$ at the start of the current iteration, but that is not covered by these components. Furthermore, the node $\hat{u}$ must satisfy that the nodes on the path from $\hat{u}$ to $\mathrm{lca}_2(A_0)$ are not covered by any red or blue component. Note that by Lemma 12, $A_0$ is the only component that was modified in the current iteration, so for this second condition we can simply check that no node on the path from $\hat{u}$ to the root of $V_2$ is covered by a red or blue component that was created in the current iteration.

    If we did not find two components of the same color that can be merged, we have found $\mathcal{B}_{\hat{u}}$ for every $\hat{u} \in V_2$, where $|\mathcal{B}_{\hat{u}}| \le 2$. We now work top-down in $T_2$. If we encounter a node $\hat{u}$ that is covered by a red or blue component that was created in the current iteration, we stop and do not consider the descendants of $\hat{u}$ (since for any such descendant, $\hat{u}$ is on its path to $\mathrm{lca}_2(A_0)$). If we encounter a node $\hat{u}$ such that $|\mathcal{B}_{\hat{u}}| = 2$ and $\hat{u}$ is not covered by any component, the two components in $\mathcal{B}_{\hat{u}}$ can be merged, and we may terminate.

Appendix B Integrality gap lower bounds

We show a lower bound of $16/5$ on the integrality gap of the integer linear programming formulation of Wu [31]. Recall that a solution to MAF can be viewed as the leaf sets of the trees in a forest obtained by deleting edges from the input trees. The formulation has binary variables $x_e$ for every edge $e \in T_1$, indicating whether $e$ is deleted from $T_1$. We use $P_1(i,j)$ to denote the set of edges in $T_1$ on the path between leaves $i$ and $j$, and $P_2(i,j)$ to denote the set of edges in $T_2$ on the path between leaves $i$ and $j$. Wu’s linear program [31] is given by:

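$$
\begin{array}{lll}
\text{minimize} & \sum_{e} x_e & \\
\text{s.t.} & \sum_{e \in P_1(i,j) \cup P_1(i,k) \cup P_1(j,k)} x_e \ge 1 & \text{for all incompatible triples } i, j, k, \\
& \sum_{e \in P_1(i,j)} x_e + \sum_{e \in P_1(k,\ell)} x_e \ge 1 & \text{for all pairs } (i,j) \text{ and } (k,\ell) \text{ for which } P_1(i,j) \cap P_1(k,\ell) = \emptyset \text{ and } P_2(i,j) \cap P_2(k,\ell) \neq \emptyset.
\end{array}
$$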

The first family of constraints ensures that, for each incompatible triple $i$, $j$ and $k$, at least one edge on the paths between $i$ and $j$, $i$ and $k$, and $j$ and $k$ is deleted. The second family of constraints ensures that at least one edge is deleted for every pair of paths, between $i$ and $j$ and between $k$ and $\ell$, that are disjoint in $T_1$ but for which the corresponding paths in $T_2$ are not disjoint.

Lemma 20

The integrality gap of the linear program of Wu [31] is at least $16/5$.

Proof

Let $n = 2^k$ for some even $k$. We label each internal node in both $T_1$ and $T_2$ with a binary string: the roots get the empty string as label, and given an internal node $u$, its left child gets $u$’s label with a “0” appended, and its right child gets $u$’s label with a “1” appended. In $T_1$, the leaves are labelled in the same way as the internal nodes, with a binary string of length $k$. In $T_2$, the binary string is reversed to give the label of the leaf. For example, the leftmost leaf (of both trees) has label 000, and the leaf to the right of it has label 001 in $T_1$, and 100 in $T_2$.

Consider the internal nodes whose labels are strings of length strictly less than $k/2$; there are exactly $2^{k/2} - 1 = \sqrt{n} - 1$ such nodes in each tree. We claim that any component $A$ must cover at least $|A| - 1$ of these internal nodes. To see this, consider the set of internal nodes in $V_t$ that have two children in the subtree of $T_t$ induced by $A$, for $t = 1, 2$. We call such nodes bifurcating. Observe that there are $2(|A| - 1)$ such nodes. Furthermore, since $A$ is compatible, there is a one-to-one mapping $f$ from the bifurcating nodes in $V_1$ to the bifurcating nodes in $V_2$, where $L(u) \cap A = L(f(u)) \cap A$. Now, the label of a bifurcating node $u \in V_1$ is the maximum-length prefix that the binary strings of the leaves in $L(u) \cap A$ have in common, and the label of $f(u)$ is the reverse of the maximum-length suffix that the leaves in $L(u) \cap A$ have in common. Hence, at least one of the labels of $u$ and $f(u)$ has length less than $k/2$.
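The last step rests on a simple fact about binary strings: two distinct strings of length $k$ differ in some position, so their common prefix and common suffix together have length at most $k-1$, and the shorter of the two has length less than $k/2$. The brute-force check below illustrates this for pairs of leaves with $k = 4$; it is only a sanity check of that fact, not of the full argument for arbitrary components, and the helper names are hypothetical.

```python
from itertools import combinations, product

def common_prefix_len(s, t):
    """Length of the longest common prefix of strings s and t."""
    n = 0
    for a, b in zip(s, t):
        if a != b:
            break
        n += 1
    return n

def shortest_lca_label(s, t):
    """Shorter of the two lca label lengths for leaves s and t.

    In T1 the lca label is the common prefix of s and t; in T2 it is the
    reverse of their common suffix.
    """
    prefix = common_prefix_len(s, t)
    suffix = common_prefix_len(s[::-1], t[::-1])
    return min(prefix, suffix)

k = 4
leaves = ["".join(bits) for bits in product("01", repeat=k)]
worst = max(shortest_lca_label(s, t) for s, t in combinations(leaves, 2))
```

For a larger set of leaves the common prefix and suffix can only get shorter, so the bound for pairs carries over.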

The fact that any component $A$ must cover at least $|A| - 1$ of the $2\sqrt{n} - 2$ internal nodes with labels of length less than $k/2$ implies that any partition that does not overlap must have at least $n - 2\sqrt{n} + 2$ components. Thus the optimal value of the integer program is at least $n - 2\sqrt{n} + 1$.

On the other hand, the LP relaxation of the integer program has a feasible solution with objective value $\frac{5}{16} n$: set a value of $\frac{1}{4}$ on the edges to each leaf in the tree (i.e., from an internal node with a label of length $k-1$ to a node with a label of length $k$), and a value of $\frac{1}{8}$ on all edges between nodes with labels of length $k-2$ and nodes with labels of length $k-1$. This implies a lower bound of $\lim_{n \to \infty} \frac{n - 2\sqrt{n} + 1}{\frac{5}{16} n} = \frac{16}{5}$ on the integrality gap.
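For readers who want the arithmetic spelled out, the following restates the two bounds and the limit. It only uses the counts stated in the proof, together with the observation that $T_1$ has $n$ leaf edges and $n/2$ edges entering nodes with labels of length $k-1$.

```latex
\begin{align*}
\text{LP value} &\le \underbrace{n \cdot \tfrac{1}{4}}_{\text{leaf edges}}
  + \underbrace{\tfrac{n}{2} \cdot \tfrac{1}{8}}_{\text{edges one level up}}
  = \tfrac{5}{16}\, n, \\
\text{ILP value} &\ge \bigl(n - 2\sqrt{n} + 2\bigr) - 1 = n - 2\sqrt{n} + 1, \\
\frac{\text{ILP value}}{\text{LP value}} &\ge
  \frac{n - 2\sqrt{n} + 1}{\tfrac{5}{16}\, n}
  = \frac{16}{5}\left(1 - \frac{2}{\sqrt{n}} + \frac{1}{n}\right)
  \xrightarrow[n \to \infty]{} \frac{16}{5}.
\end{align*}
```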

As remarked in the introduction, the largest integrality gap for our formulation that we are aware of is 5/4. The instance is described in Fig. 8.

Footnotes

This paper is based on the (substantially different) extended abstract [23].

References

  • 1. Allen BL, Steel M. Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. 2001;5(1):1–15. doi: 10.1007/s00026-001-8006-8.
  • 2. Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN), pp. 88–94 (2000)
  • 3. Bonet ML, John KS, Mahindru R, Amenta N. Approximating subtree distances between phylogenies. J. Comput. Biol. 2006;13(8):1419–1434. doi: 10.1089/cmb.2006.13.1419.
  • 4. Bordewich M, McCartin C, Semple C. A 3-approximation algorithm for the subtree distance between phylogenies. J. Discret. Algorithms. 2008;6(3):458–471. doi: 10.1016/j.jda.2007.10.002.
  • 5. Bordewich M, Semple C. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Comb. 2004;8(4):409–423. doi: 10.1007/s00026-004-0229-z.
  • 6. Chataigner F. Approximating the maximum agreement forest on k trees. Inf. Process. Lett. 2005;93(5):239–244. doi: 10.1016/j.ipl.2004.11.004.
  • 7. Chen J, Shi F, Wang J. Approximating maximum agreement forest on multiple binary trees. Algorithmica. 2016;76(4):867–889. doi: 10.1007/s00453-015-0087-6.
  • 8. Chen Z-Z, Harada Y, Wang L. A new 2-approximation algorithm for rSPR distance. In: Cai Z, Daescu O, Li M, editors. Bioinformatics Research and Applications. Cham: Springer International Publishing; 2017. pp. 128–139.
  • 9. Chen, Z.-Z., Machida, E., Wang, L.: A cubic-time 2-approximation algorithm for rSPR distance. arXiv preprint arXiv:1609.04029 (2016)
  • 10. Chen, Z.Z., Machida, E., Wang, L.: An improved approximation algorithm for rSPR distance. In: International Computing and Combinatorics Conference, pp. 468–479. Springer (2016)
  • 11. Darwin, C.: Notebook B: Transmutation of species (1837?-1838). In: John van Wyhe: The Complete Work of Charles Darwin Online (2002). http://darwin-online.org.uk/
  • 12. Gascuel O, editor. Mathematics of Evolution and Phylogeny. Oxford: Oxford University Press Inc.; 2005.
  • 13. Goemans MX, Williamson DP. The primal-dual method for approximation algorithms and its application to network design problems. In: Hochbaum DS, editor. Approximation Algorithms for NP-hard Problems. Boston: PWS Publishing Co.; 1997. pp. 144–191.
  • 14. Harel, D.: A linear time algorithm for the lowest common ancestors problem. In: Proceedings of the 21st Annual Symposium on Foundations of Computer Science (FOCS), pp. 308–319 (1980)
  • 15. Harel D, Tarjan RE. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 1984;13(2):338–355. doi: 10.1137/0213024.
  • 16. Hein J, Jiang T, Wang L, Zhang K. On the complexity of comparing evolutionary trees. Discret. Appl. Math. 1996;71(1–3):153–169.
  • 17. Huson D, Rupp R, Scornavacca C. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge: Cambridge University Press; 2010.
  • 18. Nakhleh L. Evolutionary phylogenetic networks: models and issues. In: Heath L, Ramakrishnan N, editors. The Problem Solving Handbook for Computational Biology and Bioinformatics. New York: Springer; 2009.
  • 19. Olver, N., Schalekamp, F., Stougie, L., van Zuylen, A.: Implementation of the MAF algorithm and compact formulation. Available at http://nolver.net/maf and http://fransschalekamp.com/MAF (2018)
  • 20. Rodrigues, E.M.: Algoritmos para Comparação de Árvores Filogenéticas e o Problema dos Pontos de Recombinação. PhD thesis, University of São Paulo, Brazil (2003). Chapter 7, available at http://www.ime.usp.br/~estela/studies/tese-traducao-cp7.ps.gz
  • 21. Rodrigues, E.M., Sagot, M.-F., Wakabayashi, Y.: Some approximation results for the maximum agreement forest problem. In: Proceedings of APPROX-RANDOM, Lecture Notes in Computer Science, pp. 159–169. Springer (2001)
  • 22. Rodrigues EM, Sagot M-F, Wakabayashi Y. The maximum agreement forest problem: approximation algorithms and computational experiments. Theor. Comput. Sci. 2007;374(1–3):91–110. doi: 10.1016/j.tcs.2006.12.011.
  • 23. Schalekamp, F., van Zuylen, A., van der Ster, S.: A duality based 2-approximation algorithm for maximum agreement forest. In: Proceedings of the 43rd International Colloquium on Automata, Languages, and Programming (ICALP), Vol. 55 of LIPIcs, pp. 70:1–70:14. Leibniz-Zentrum für Informatik (2016)
  • 24. Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press; 2003.
  • 25. Shi F, Feng Q, You J, Wang J. Improved approximation algorithm for maximum agreement forest of two rooted binary phylogenetic trees. J. Comb. Optim. 2015;32(1):111–143. doi: 10.1007/s10878-015-9921-7.
  • 26. Steel M, Warnow T. Kaikoura tree theorems: Computing the maximum agreement subtree. Inf. Process. Lett. 1993;48(2):77–82. doi: 10.1016/0020-0190(93)90181-8.
  • 27. van Iersel L, Kelk S, Lekic N, Stougie L. Approximation algorithms for nonbinary agreement forests. SIAM J. Discret. Math. 2014;28(1):49–66. doi: 10.1137/120903567.
  • 28. Whidden C, Beiko RG, Zeh N. Fixed-parameter algorithms for maximum agreement forests. SIAM J. Comput. 2013;42(4):1431–1466. doi: 10.1137/110845045.
  • 29. Whidden C, Matsen FA. Calculating the unrooted subtree prune-and-regraft distance. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019;16(3):898–911. doi: 10.1109/TCBB.2018.2802911.
  • 30. Whidden, C., Zeh, N.: A unifying view on approximation and FPT of agreement forests. In: Algorithms in Bioinformatics. Lecture Notes in Computer Science, Vol. 5724, pp. 390–402. Springer, Berlin Heidelberg (2009)
  • 31. Wu Y. A practical method for exact computation of subtree prune and regraft distance. Bioinformatics. 2009;25(2):190–196. doi: 10.1093/bioinformatics/btn606.
  • 32. Wu, Y., Wang, J.: Fast computation of the exact hybridization number of two phylogenetic trees. In: Bioinformatics Research and Applications. Lecture Notes in Computer Science, Vol. 6053, pp. 203–214. Springer, Berlin Heidelberg (2010)

Articles from Mathematical Programming are provided here courtesy of Springer
