Abstract
We give a 2-approximation algorithm for the Maximum Agreement Forest problem on two rooted binary trees. This NP-hard problem has been studied extensively in the past two decades, since it can be used to compute the rooted Subtree Prune-and-Regraft (rSPR) distance between two phylogenetic trees. Our algorithm is combinatorial and its running time is quadratic in the input size. To prove the approximation guarantee, we construct a feasible dual solution for a novel exponential-size linear programming formulation. In addition, we show this linear program has a smaller integrality gap than previously known formulations, and we give an equivalent compact formulation, showing that it can be solved in polynomial time.
Keywords: Maximum agreement forest, Phylogenetic tree, SPR distance, Subtree prune-and-regraft distance, Computational biology
Introduction
Evolutionary relationships are often modeled by a rooted tree, where the leaves represent a set of species, and internal nodes are (putative) common ancestors of the leaves below the internal node. Such phylogenetic trees date back to Darwin [11], who used them in his notebook to elucidate his thoughts on evolution. For an introduction to phylogenetic trees we refer to [12, 24].
The topology of phylogenetic trees can be based on different sources of data, e.g., morphological data, behavioral data or genetic data, which can lead to different phylogenetic trees on the same set of species. Such partly incompatible trees may actually be unavoidable: there exist non-tree-like evolutionary processes that preclude the existence of a phylogenetic tree, so-called reticulation events, such as hybridization, recombination and horizontal gene transfer [17, 18]. Irrespective of the cause of the conflict, the natural question arises of how to quantify the dissimilarity between such trees. Especially in the context of reticulation, a particularly meaningful measure for comparing phylogenetic trees is the Subtree Prune-and-Regraft distance for rooted trees (rSPR-distance), which provides a lower bound on the number of a certain type of these non-tree evolutionary events. The problem of finding the exact value of this measure for a set of species motivated the formulation of the Maximum Agreement Forest Problem (MAF) by Hein, Jiang, Wang and Zhang [16].
In the definition of MAF by Hein et al. we are given two rooted binary trees and a bijection from the leaves of each tree to a given set of labels . The problem is to find a minimum set of edges to be deleted from the two trees, so that the rooted trees in the resulting two forests form isomorphic pairs. Here, and throughout the paper, two rooted trees are said to be isomorphic if (i) the labelled nodes of the two trees have the same subset of labels, say A, and (ii) the two trees give rise to the same tree if we take the minimal subtree spanning the nodes labelled by A and repeatedly identify a node with its child if it only has a single child.
Since the introduction by Hein et al. in [16], in which they also proved NP-hardness, MAF has been extensively studied, mostly in its version of two rooted binary input trees. After Allen and Steel [1] pointed out that the claim by Hein et al. that solving MAF on two rooted directed trees computes the rSPR-distance between the trees is incorrect, Bordewich and Semple [5] presented a subtle redefinition of MAF, whose optimal value does coincide with the rSPR-distance. In this redefinition, the set of labels is extended with a label , which is assigned to the roots of the two input trees. As before, we want to find a minimum set of edges so that the trees in the resulting forests form isomorphic pairs; note that the fact that the roots of the input trees have labels means that now there must be an isomorphic pair of trees in the resulting forests containing the (original) roots. This has now become the standard definition of MAF, for which Bordewich and Semple [5] showed that NP-hardness still holds, and Rodrigues [20] showed that it is in fact APX-hard.
The problem has attracted a lot of attention, and indeed has become a canonical problem in the field of phylogenetic networks. Many variants of MAF have been studied, including versions where the input consists of more than two trees [6, 7], and where the input trees are unrooted [29, 30] or non-binary [22, 27]. We will concentrate on MAF in its classical form with two rooted binary input trees, and we will be concerned with the worst-case approximability of the problem. The literature includes many other approaches to the problem, including fixed-parameter tractable algorithms (e.g., [28, 30]) and integer linear programming [31, 32]. But the quest for better approximation algorithms has become central within the MAF literature.
The first approximation algorithm for the problem with a fully correct analysis was given by Bonet et al. [3] in 2006; they obtain an approximation factor of 5, with a running time that is linear in the number of leaves. (The algorithm follows closely the approach taken by Hein et al. [16] and Rodrigues et al. [21], who both claimed 3-approximation algorithms; but both papers turned out to have flaws in the analysis.) This was followed by a sequence of three papers, each obtaining a 3-approximation algorithm. The first, by Bordewich et al. [4], had a running time of , where n denotes the number of leaves; Rodrigues et al. [22] substantially improved the running time to . Finally, Whidden and Zeh [28] simplified the analysis and improved the running time to O(n), matching the running time of the previous 5-approximation.
These algorithms all take a similar approach, and make decisions that are in a certain sense based on “local” information. We focus here on the algorithm and analysis of Whidden and Zeh [28] (based on [22]), since it is the cleanest. The algorithm maintains a tree and a forest ; initially, these are precisely the two input trees. and always have the same leaf set, which shrinks as the algorithm progresses; a leaf is removed when the part of the algorithm’s solution involving that leaf has been determined. The algorithm proceeds by considering any pair of leaves a, b in that are siblings (two nodes are siblings in a tree if they have the same parent). Consider their situation in . If they are also siblings in , then there is clearly no reason to separate a and b in a solution, and they can be contracted together in both and to yield a smaller instance. Otherwise, the algorithm deletes the edges directly above a and b in both and , resulting in two “trivial” trees consisting of a single leaf each that can essentially be removed from the instance; and also makes one further cut in , which will be the edge directly above a sibling of either a or b in . The process of merging and deleting edges is then continued on the new instance, until eventually a valid solution is found. (Note that the algorithm might at first glance appear to create many trivial trees consisting of only a single leaf; however, single leaves later in the algorithm may represent larger collections of leaves that have been merged together in earlier iterations.) A fairly direct combinatorial charging argument is used to show that in each iteration of the algorithm (where the algorithm makes three cuts), at least one edge deleted in the optimal solution can be uniquely charged for this iteration.
The next improvement in approximation factor, to 2.5 (at the cost of a running time that increases to quadratic), came from Shi et al. [25]. Their approach, like the 3-approximation algorithm described above, starts by choosing a pair of leaves a, b that are siblings in the first tree. However, it pays more attention to the configuration of the second tree and the positioning of a and b within it when deciding which edges to cut. Since larger structures are considered, the analysis is substantially more involved. A further improvement to a factor of 7/3 was then obtained by Chen, Machida and Wang [10]; their algorithm also runs in quadratic time. Again, larger combinatorial structures play a role; further, it does not begin with an arbitrary pair of sibling leaves in the first tree, but chooses the pair more carefully.
The first 2-approximation algorithm was given by a subset of the authors of the current work [23] (independently and essentially concurrently with the 7/3-approximation algorithm of Chen et al. [10]). They do not explicitly discuss (or attempt to optimize) the running time of the algorithm, beyond showing that it is polynomial time. Subsequently, Chen, Harada and Wang [8] (see also [9]), building on the 7/3-approximation algorithm [10], gave a very different factor 2 approximation algorithm, with a cubic running time.
The 2-approximation algorithm presented in the current paper may be viewed as the full version of the algorithm in [23]. However, while the algorithm presented here is similar in spirit, it differs in many details, and the exposition is entirely new. Although the algorithm and analysis remain quite subtle, this version is significantly shorter and clearer. Moreover, we show how our algorithm can, with some care, be implemented in quadratic time ([23] discusses only a polynomial time bound). This improves over the cubic running time of Chen et al. [8].
Our 2-approximation algorithm differs from previous works in two key aspects.
Our algorithm takes a global approach; choices made by the algorithm may depend on large parts of the instance. This is in contrast to the “local” algorithms discussed above. The cubic 2-approximation by Chen et al. [8] also requires non-local substructures, suggesting this may be a crucial factor in achieving this approximation bound.
We introduce a novel integer linear programming formulation for the analysis. Our approximation guarantee is proved by constructing a feasible solution to the dual of this linear program, rather than arguing locally about the objective of the optimal solution. We thus bring a powerful tool from the theory of approximation algorithms to bear, one that has not been exploited in the study of MAF so far.
We use the integer linear programming formulation, and in particular, its linear relaxation, only in our analysis. The algorithm itself is purely combinatorial. It is essentially a dual-fitting algorithm: the analysis explicitly constructs a dual solution with objective value at least half the cost of the primal solution returned by the algorithm.
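The dual-fitting argument follows the standard weak-duality chain. As a sketch, write ALG for the cost of the solution returned by the algorithm, D for the objective value of the dual solution constructed in the analysis, LP* for the optimal value of the linear relaxation, and OPT for the optimal MAF cost; these symbols are introduced here for illustration only and are not the paper's notation.

```latex
\mathrm{ALG} \;\le\; 2D \;\le\; 2\,\mathrm{LP}^{*} \;\le\; 2\,\mathrm{OPT}
```

The first inequality is the property of the constructed dual solution stated above, the second is weak duality for a minimization problem, and the third holds because the linear program is a relaxation of the integer program.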
Although we do not need to solve the linear programming (LP) relaxation, it is an interesting object of study, and it is natural to ask if it can indeed be efficiently optimized. This is not immediately clear, since the formulation has an exponential number of variables. Being able to solve the LP may, for example, be of future utility in obtaining better approximation guarantees using LP-rounding techniques. We show that the relaxation can be reformulated as a compact LP, with only a polynomial number of variables and constraints. This immediately implies that it can be optimized efficiently (in polynomial time). This may make the integer linear program amenable to use with commercial integer programming solvers. There is a previous formulation due to Wu [31], but our formulation is significantly stronger: the integrality gap of the relaxation of Wu is at least 3.2, whereas for ours we show it is at most 2, and in fact the worst example that we are aware of has integrality gap 1.25 (see the Appendix).
We have implemented and tested our algorithm, as well as the compact formulation [19]. The implementation has been designed so that it is easy to step through the algorithm and explore its behaviour on a given instance; the reader may find it helpful when examining the technical details of the algorithm.
Outline. We define the problem and introduce necessary notation in Sect. 2. Section 3 describes the algorithm, and proves that it produces a feasible solution to MAF. In Sect. 4, we introduce the linear program, and describe a feasible solution to its dual that can be maintained by the algorithm. We then show the objective value of this dual solution is always at least half the objective value of the MAF solution, which proves the approximation ratio of 2. In Sect. 5, we show a compact formulation of the (exponential-size) linear program used for the analysis. Section 6 gives some concluding remarks and directions for further research. Finally, in the appendices, we provide the details on how to implement our algorithm so that it runs in time quadratic in the size of the input, and we give an example that shows that a previously known integer linear program [31] is not as strong as the formulation introduced here.
Preliminaries
The input to the Maximum Agreement Forest problem (MAF) consists of two rooted binary trees and . There is a bijection from the leaves of each tree to a given set of labels .
Let and denote the node sets of and respectively, and let . We will take a small liberty, and treat as being a subset of and a subset of . We call all nodes in internal nodes. We let denote the set of leaves that are descendants of a node .
We will use the following notational conventions: we use u and v to denote arbitrary nodes (including leaves); if the node we refer to is an internal node in , we will use and ; and we use the letters x, y and w to refer to leaves.
For we use to denote the set of nodes in that lie on a path between any two leaves in A for , and define .
Definition 1
We say that a set covers a node if . We say that overlap if ; we also say that A overlaps in U, for , if . We say a partition of overlaps in if there exist , , such that A and overlap in U.
To give some intuition for the use of this definition, recall from the introduction that the goal of the MAF problem is to find a minimum set of edges to be deleted from the two input trees, so that the trees in the resulting two forests can be matched up into isomorphic pairs. One of the requirements for a pair of trees to be isomorphic is that they have the same set of labelled nodes. In other words, the trees in the two forests induce the same partition of , and the fact that the forests are formed by deleting edges from the input trees means that no two sets in overlap.
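The overlap test of Definition 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the tree encoding (a `parent` dictionary mapping each node to its parent, with the root mapped to `None`) and the names `ancestors`, `spanning_nodes` and `overlaps` are assumptions made here, not part of the paper. For a component consisting of a single leaf, the sketch takes the spanned node set to be that leaf alone.

```python
from itertools import combinations


def ancestors(parent, v):
    """Return the path from v up to the root, both inclusive."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def spanning_nodes(parent, leaves):
    """Nodes on a path between two leaves of the nonempty set `leaves`."""
    paths = [ancestors(parent, x) for x in leaves]
    # Nodes that are ancestors of every leaf; the lowest one is the
    # lowest common ancestor, i.e. the first such node on any leaf-to-root path.
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    lca = next(v for v in paths[0] if v in common)
    nodes = set()
    for p in paths:
        nodes.update(p[: p.index(lca) + 1])
    return nodes


def overlaps(parent, partition):
    """True if two components of `partition` span a common node."""
    spans = [spanning_nodes(parent, comp) for comp in partition]
    return any(a & b for a, b in combinations(spans, 2))


# Hypothetical example tree ((a,b),(c,d)) with internal nodes p, q and root r.
parent = {"a": "p", "b": "p", "c": "q", "d": "q", "p": "r", "q": "r", "r": None}
no_overlap = overlaps(parent, [{"a", "b"}, {"c", "d"}])
has_overlap = overlaps(parent, [{"a", "c"}, {"b", "d"}])
```

In the example, the partition {a, b}, {c, d} can be obtained by deleting edges of the tree, whereas {a, c}, {b, d} cannot, since both components need the root.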
Next, we will give a definition that allows us to precisely express the other requirement for a pair of trees to be isomorphic. For , we let denote the lowest common ancestor of A in . We will sometimes omit braces of explicit sets and write, e.g., instead of . For nodes u, v in the same tree, we use to indicate that u is a descendant of v and if u is equal to v or a descendant of v.
Definition 2
A set is compatible if for all
We call a set of leaves incompatible if it is not a compatible set. Note that is compatible precisely if the minimum subtree spanning L in and the minimum subtree spanning L in are isomorphic.
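The characterization in the note above (a set is compatible precisely if the two minimal spanning subtrees are isomorphic) suggests a direct check. The following Python sketch is an assumption-laden illustration, not the paper's method: trees are encoded as nested tuples with string leaves, and the names `restrict` and `compatible` are hypothetical. It restricts each tree to the given labels, suppresses nodes with a single child, and compares the results as unordered trees.

```python
def restrict(tree, labels):
    """Canonical form of the minimal subtree of `tree` spanning `labels`,
    with single-child nodes suppressed; None if no label occurs in `tree`."""
    if isinstance(tree, str):
        return tree if tree in labels else None
    kept = [r for r in (restrict(child, labels) for child in tree) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # identify a node with its only remaining child
    return frozenset(kept)  # child order is irrelevant in a rooted tree


def compatible(t1, t2, labels):
    """True if the restrictions of t1 and t2 to `labels` are isomorphic."""
    return restrict(t1, labels) == restrict(t2, labels)


# Hypothetical example: two trees on the labels a, b, c, d.
t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
pair_ok = compatible(t1, t2, {"a", "b"})
triple_ok = compatible(t1, t2, {"a", "b", "c"})
```

Any pair of leaves is compatible, while the triple {a, b, c} is not: the first tree groups a with b, the second groups a with c.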
A feasible solution to MAF is a partition of such that every component is compatible, and does not overlap , for any . The cost of this solution is defined to be . This cost corresponds to the number of edges that must be deleted from , as well as the same number from , so that in both of the resulting forests, each is the leaf set of a single tree.
Remark
In order for MAF to correspond to the rSPR distance, it is necessary to add an additional label to (see figure below), that is assigned to the roots of and . This is the distinction between the original definition of MAF by Hein et al. [16] and the correction by Bordewich and Semple [5]. To maintain the property that only leaves have labels, we instead add a new root to and , which has as its two children a leaf labelled and the original root. We simply assume that this addition is already included in the input instance, after which there is no need to distinguish this additional leaf from the others.

When we describe and analyze our algorithm, the following extended notion of compatibility is convenient.
Definition 3
Given , we say a set is K-compatible if is compatible. A partition of is K-compatible if is K-compatible for all .
The Red-Blue algorithm
The algorithm maintains a partition of , which at the end of the algorithm will correspond to a feasible solution to MAF. The algorithm will maintain the invariant that does not overlap in . Observe that this is equivalent to defining to be the leaf sets of the trees in a forest, obtained by deleting edges from . Initially .
Very informally, an iteration begins by coloring the leaves with three colors, red, blue, and white. The coloring is such that in , there is a node u that has the red and blue leaves as its descendants; the set B of blue leaves is the set of “left” descendants of u and the set R of red leaves is the set of “right” descendants of u. The remaining leaves W are white. Furthermore, it will be the case that the current partition is feasible for the problem restricted to R and for the problem restricted to B. The current iteration will work to make the partition feasible for the problem restricted to (in fact, it will be feasible for the problem restricted to for all ). Observe that a forest corresponding to a feasible solution to the full instance can have at most one tree that has leaves of multiple colors, because if there were two such trees then their leaf sets overlap on node u in . Also, a multicolored tree in a feasible solution must be such that there is a node in such that (i) no white leaf of the tree is a descendant of , and (ii) the blue and red leaves of the tree are left and right descendants of . We say the component is -compatible if (ii) holds. The iteration will refine the multicolored components of the partition into (all but one) unicolored components. The natural idea would be to do this by intersecting each (or all but one) component with each color, but then the resulting partition might overlap in ; if not, we call the original partition splittable. So we first refine the partition such that it is splittable. In order to achieve the desired approximation guarantee, we need to be careful about the ordering of the steps we take to make the partition splittable , so that we can simultaneously maintain a feasible dual LP solution with an objective value that tracks the number of components; we do this by first making it -compatible (which works toward splittability as well). 
Once the partition is -compatible and splittable, we refine the partition by splitting all but at most one component into unicolored components. Finally, we look for a split that can be undone; the careful order in which the components are refined also serves to guarantee that such a merging of components is possible where needed to prove the approximation guarantee. We now give a precise definition, using the notation from the previous section.
As explained above, our algorithm works towards feasibility by iteratively refining , focusing each iteration on a set of leaves for some ; u is a node such that the current partition is infeasible for in some (quite narrowly defined) way. At the end of the iteration the solution is feasible if we restrict our attention to , and even if we consider for any arbitrary .
We use the following definition to specify which sets the algorithm considers.
Definition 4
Given an infeasible partition that does not overlap in , we call a root of infeasibility if at least one of the following holds:
(a) is not -compatible;
(b) overlaps in ;
(c) is -compatible, and there exists a component such that and is incompatible for all .
While the first two conditions can be naturally interpreted as failures of feasibility within , condition (c) is more subtle. It says that while A is -compatible, every leaf provides a certificate that A is in fact incompatible. A different view of this is that every leaf in lies below in . We note that replacing condition (c) by requiring only the existence of at least one such leaf leads to an algorithm that appears to be “too greedy”; more precisely, the approximation guarantee we can prove in that case is worse than 2.
Observe that if is a root of infeasibility , then any ancestor of u is a root of infeasibility as well. We will say an internal node u in tree is the “lowest” node with property if property does not hold for any of u’s descendants in . The algorithm will thus identify a lowest node that is a root of infeasibility.
We illustrate the three conditions of a root of infeasibility in Fig. 1. , , , , , and represent nonempty subtrees that appear in both and — for the examples it suffices to think of these as a subtree consisting of a single leaf. We will adopt this viewpoint and, with a slight abuse of notation, we will refer to the labels of these leaves as , , , , , and , respectively. If , u satisfies (a). Note that u is indeed a lowest root of infeasibility, since and are compatible sets, so and do not satisfy (c) (nor (a) or (b)). If , node u satisfies (b). Again, u is a lowest root of infeasibility (clearly and do not satisfy (a) or (b); they also do not satisfy (c) since is compatible, as is ). Finally, if , node u satisfies (c). Observe that in this case u is again a lowest root of infeasibility. For , (a) and (b) are clearly not satisfied; neither is (c) because the only such that is , but then is not incompatible for (and also not incompatible for if ).
Fig. 1.
If , then node u satisfies case (a) of Definition 4; if , , , it satisfies case (b) and if , it satisfies (c)
Given a root of infeasibility , we partition into R, B, W, where and for the two children and of u. We will refer to this partition as a coloring of the leaves; we will refer to the leaves in R as red leaves, the leaves in B as blue leaves and the leaves in W as white leaves. We note that and are and , respectively, and we use these interchangeably. We call a component of tricolored if it has a nonempty intersection with R, B and W, and bicolored if it has a nonempty intersection with exactly two of the sets R, B, W. A component is called multicolored if it is either tricolored or bicolored, and unicolored otherwise.
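The coloring and the classification of components can be sketched as follows. This is a minimal illustration under stated assumptions: trees are nested tuples with string leaves, the names `leaves`, `coloring` and `classify` are hypothetical, and the choice of blue for the first child and red for the second follows the informal "left"/"right" description given earlier in this section.

```python
def leaves(tree):
    """Set of leaf labels below `tree` (a string leaf or a tuple of subtrees)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def coloring(all_labels, u):
    """Color the leaves relative to an internal node u = (left, right).

    Assumption: blue = leaves below the first child, red = leaves below
    the second child, white = all remaining labels."""
    left, right = u
    blue, red = leaves(left), leaves(right)
    white = set(all_labels) - blue - red
    return red, blue, white


def classify(component, red, blue, white):
    """Name a nonempty component by the number of colors it intersects."""
    k = sum(1 for color in (red, blue, white) if component & color)
    return {1: "unicolored", 2: "bicolored", 3: "tricolored"}[k]


# Hypothetical example: first tree (((a,b),(c,d)),e), with u the node above a, b, c, d.
t1 = ((("a", "b"), ("c", "d")), "e")
R, B, W = coloring(leaves(t1), t1[0])
kinds = [classify(c, R, B, W) for c in ({"a", "b"}, {"a", "c"}, {"a", "c", "e"})]
```

A component is multicolored exactly when `classify` returns "bicolored" or "tricolored".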
Observation 1
Let u be a lowest root of infeasibility for , and consider the coloring R, B, W, where and for the two children and of u. Then the set of multicolored components of consists of either at most two bicolored components or exactly one tricolored component.
Proof
If u is a lowest root of infeasibility, does not overlap in and , and so at most one component of covers , and at most one covers . Since any multicolored component covers at least one of and , there can be at most two multicolored components. Furthermore, because any tricolored component covers both and , if there is a tricolored component there can be no other multicolored component.
We note that the above observation can be refined; it is possible to show that contains either one tricolored component or exactly two bicolored components; see Lemma 12 in Sect. 4.3.
We now give the overall algorithm. In the description, but also in the descriptions of the various procedures that follow, the in front of certain lines will be used to refer to these lines in the analysis in Sect. 4.2.
The various procedures in the Red-Blue Algorithm will be described in detail in the subsequent subsections, along with lemmas regarding the properties they ensure. For now, we give a very high-level description.
An iteration of the main while-loop starts by finding a lowest root of infeasibility u, yielding a coloring (R, B, W) of the vertices; if there is no root of infeasibility, then the current partition is feasible, and the main loop terminates. The goal of the iteration, essentially, is to ensure that by the end of the iteration, u is no longer a root of infeasibility, while maintaining the invariant that the partition does not overlap on . Until the very end of the algorithm, the partition is only ever refined; since each iteration must modify the partition, the number of iterations is bounded by . (Alternatively, our analysis shows that if u is chosen for some iteration of the algorithm, then from the end of the iteration until the very end of the algorithm, u will never again be a root of infeasibility.)
The process of refining the partition to make u no longer a root of infeasibility proceeds in two main stages. First, the procedure Make--compatible refines the partition if necessary so that it is -compatible, i.e., so that condition (a) fails to hold. The procedures Split and Make-Splittable will together ensure that conditions (b) and (c) also both fail to hold, so that u is no longer a root of infeasibility at the end of the iteration. In particular, they ensure that the partition does not overlap in , and that the final partition is -compatible for every (which is stronger than (c) not holding).
Finally, Find-Merge-Pairs and Merge-Components are needed for the approximation bound only. All the other steps in the algorithm only refine the current partition. In some particular cases, it is possible and necessary to undo some of these refinements. This is done in a careful way at the very end of the algorithm by Merge-Components, using information prepared by Find-Merge-Pairs. The reason that the merges are done at the end, rather than during the main loop, is primarily for analysis purposes.
In order to simplify the statement of the lemmas, we will make statements like “let be the partition after ProcedureName”. This implicitly assumes that (R, B, W) was a coloring chosen in the beginning of the current iteration of the Red-Blue Algorithm (and thus, that was a lowest root of infeasibility at that moment), and that is the partition resulting from calling ProcedureName in the current iteration.
Make--compatible
If is not -compatible, we start by refining with the following procedure so that each of its components is -compatible.
An example is given in Fig. 2. We note that in general, the choice of does not have to be unique, and that multiple refinements may be needed to make the partition -compatible.
Fig. 2.
Illustration of Make- -compatible(). Because and do not overlap in , we can represent these partitions as the leaf sets of trees in a forest obtained by deleting edges from . In this figure and the following figures the dashed edges represent deleted edges. In this example . Then Make- -compatible() must choose , and refines the partition to , which is -compatible
As observed above, for any partition that does not overlap in , there is a set of edges in such that consists of the leaf sets of the trees in the forest obtained after deleting these edges. Our refinement is equivalent to deleting the parent edge of , and hence the resulting partition does not overlap in if the original partition did not overlap in .
Lemma 1
Let be the partition after Make- -compatible. Then is a refinement of that does not overlap in and is -compatible .
Proof
First, observe that is R-compatible and B-compatible, since u’s children are not roots of infeasibility. If is -compatible then is not modified by the procedure, and the lemma is trivially true. Otherwise, the procedure refines , and, as argued above, the resulting partition does not overlap in provided that does not overlap in . The procedure ends when there are no sets in that are not -compatible, so the only thing left to show is that this procedure halts. Because was chosen to be the lowest internal node in such that intersects both R and B, the children of , say and , are such that and can each intersect only one of R and B. Therefore is -compatible, where A was not, and thus the number of -compatible components in increases, which can happen at most times.
Observe that if is -compatible, then any refinement of is also -compatible, and hence we may assume that the partition at any later point in the current iteration of the Red-Blue Algorithm is -compatible.
Make-splittable
The goal of the next two procedures is to further refine the partition so that there is no overlap in . We will do this in two steps. The first of these procedures will make the partition “splittable ”. To describe this informally, we view the components of the partition as the trees of the forest obtained by deleting edges from . We call a component A that intersects k colors splittable , if there are edges that can be deleted from to “split” the tree into k unicolored components. We can phrase this property succinctly using the notion of overlapping: if the sets , and do not overlap in , then there are disjoint trees in that have each of these sets as leaf sets, and we can therefore split the tree associated with A in into these three trees by deleting at most two edges.
Definition 5
Given a coloring (R, B, W) of , a set is splittable if , and do not overlap in . A partition is splittable if every component in the partition is splittable.
As a first example of Make-Splittable, consider that was the output of Make- -compatible depicted in Fig. 2. In this example is already splittable. In Fig. 3 a more interesting example is given.
Fig. 3.
Illustration of Make-Splittable(). , and the set is not splittable . Make-Splittable() would choose and replace A by and
Lemma 2
Make-Splittable is well-defined, in that a node satisfying the desired properties in line can always be found.
Proof
If A is bicolored and not splittable, then there exists such that both and are bicolored: just take to be a lowest node in for distinct ; such a node exists because A is not splittable, and the fact that is in for implies that and intersect .
It remains to prove the lemma for the case that A is tricolored. For this to hold, we need that is -compatible, which by Lemma 1 is indeed true when Make-Splittable is called. So suppose A is tricolored and not splittable. Note that and cannot intersect because A is -compatible. Assume without loss of generality that , and let be a lowest node in . Note that both and must intersect W and R, and that cannot intersect B, since then A would not be -compatible. So is bicolored, and is tricolored.
Lemma 3
Let be the partition after Make-Splittable. Then is a refinement of that does not overlap in and in which every component is splittable .
Proof
By Lemma 2, and since each iteration increases the number of components in , Make-Splittable must terminate, and by its definition, the final partition contains only splittable components. Clearly is a refinement of ; it does not overlap in by the same arguments as used in the proof of Lemma 1.
Before continuing, we summarize the properties of the partition resulting after Make-Splittable that will be useful in the proof of the approximation guarantee in Sect. 4. To describe these, we need the notion of a top component.
Definition 6
If A is a component in the partition at the beginning of an iteration, and A is multicolored , then A is a top component. If A is a top component of the current partition, and A gets subdivided into and by Make- -compatible or Make-Splittable, then (but not ) is a top component of the resulting partition.
We note that by Observation 1, there are always either exactly one or two top components at the start of the iteration, and hence throughout (until the call to Split, after which the notion is no longer defined).
Lemma 4
Let denote the partition at the start of a given iteration, and (R, B, W) the coloring of the leaves that is selected, let denote the partition after Make- -compatible(), and let denote the partition after Make-Splittable(). Then the following properties hold:
1. Only multicolored components are subdivided by the iteration, i.e., if , then A is multicolored.
2. The number of tricolored components in is the same as in .
3. Any tricolored component in or that is not a top component contains no compatible tricolored triple.
4. Any bicolored component A in that is not a top component satisfies that is not covered by for any color . In other words, and are unicolored where and are the children of .
5. If is in component A in , and is not a descendant of (and thus is a white leaf), then either or is in a top component in .
Proof
The fact that property 1 holds can be read from the description of Make- -compatible and Make-Splittable. Property 2 follows from the description of Make-Splittable.
For property 3, we prove that when a non-top component is created from a top component, this non-top component cannot have compatible tricolored triples. This implies that no non-top component can have a compatible tricolored triple. First consider non-top components created by Make- -compatible from a top component A. The fact that node picked in Make- -compatible is always chosen as low as possible implies that when the non-top component is created, it holds that for any . Therefore, for any , it must be the case that either or . But then is incompatible , because . So non-top components in can indeed not have compatible tricolored triples. Non-top components created by Make-Splittable from a top component are bicolored by definition, so these cannot have compatible tricolored triples either. Therefore, property 3 holds.
A similar argument shows property 4. First, consider a non-top component A created by Make- -compatible. A intersects R and B, so if A is bicolored , it contains no white leaves, so is not covered by . Now, because is the node picked in Make- -compatible, which is as low as possible, is not covered by nor . For a non-top component A created by Make-Splittable, the fact that is the node picked in Make-Splittable which is chosen as low as possible again implies that is not covered by for any color .
For property 5, if , consider a node selected by Make- -compatible or Make-Splittable that leads to a subdivision of A. It suffices to argue that , because then the fact that is not a descendant of implies that always remains in a top component. For selected by Make- -compatible this fact holds because is a lowest node such that intersects R and B. For selected by Make-Splittable this fact holds because is a lowest node such that is bicolored , and intersects the same colors as A.
Split
We now “split” the multicolored components of the partition: essentially, we further refine the partition by intersecting each multicolored component with R, B and W. Thus a component intersecting k colors will be split into k unicolored components. The fact that the components of the partition were splittable ensures that the resulting partition does not overlap in . We will, however, need to be slightly more careful in order to achieve the approximation guarantee; in particular, we will sometimes need to perform what we call a Special-Split.
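The basic refinement step described above can be sketched in a few lines. This is only an illustration under stated assumptions: components and color classes are modelled as Python frozensets of hypothetical leaf names, every leaf is assumed to belong to exactly one color class, and the Special-Split exception is deliberately left out.

```python
from typing import FrozenSet, List, Sequence

Component = FrozenSet[str]


def colors_met(component: Component, color_classes: Sequence[Component]) -> int:
    """Number of color classes that the component intersects."""
    return sum(1 for color in color_classes if component & color)


def split_by_color(partition: Sequence[Component],
                   color_classes: Sequence[Component]) -> List[Component]:
    """Replace each component by its non-empty intersections with the color classes.

    A unicolored component is returned unchanged; a component meeting k
    classes yields k unicolored pieces. Special-Split is not modelled.
    """
    refined: List[Component] = []
    for component in partition:
        for color in color_classes:
            piece = component & color
            if piece:
                refined.append(piece)
    return refined


# Toy data (hypothetical leaf names, not taken from the paper's figures).
R = frozenset({"a", "b"})
B = frozenset({"c"})
W = frozenset({"d", "e"})
partition = [frozenset({"a", "c", "d"}), frozenset({"b", "e"})]
refined = split_by_color(partition, [R, B, W])
```

In this toy instance the tricolored component yields three pieces and the bicolored one yields two, matching the count of k unicolored components per component intersecting k colors.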
Remark
Our analysis in Sect. 4 needs the Special-Split, Find-Merge-Pair and Merge-Components procedures only in one (of three) cases that will be described in Lemma 12. Without these procedures, it is trivial to see that the resulting partition is feasible, and we will see in Sect. 4 that the proof of the approximation ratio is quite simple in these cases. On first reading, the reader may thus choose to skip the description of these procedures, and also read Sect. 4 only up to the proof of Proposition 14.
We emphasize that the Special-Split procedure is called only if A is tricolored and contains at least one compatible tricolored triple. Hence, by property 3 of Lemma 4, Special-Split is applied only to tricolored top components.
We refer to Fig. 4 for examples of the split operations in the two cases.
Fig. 4.
Two illustrations of Split(). In the top example and Split() would simply refine each set of by intersecting it with the three color classes. The result is that every leaf is a singleton in . In the bottom example, . The set is tricolored and contains triple that is tricolored and compatible , but not every tricolored triple in A is compatible , e.g., is not compatible . In this case, the Special-Split replaces A by
We now describe the property that the partition produced by Split will have, which goes beyond merely being -compatible and non-overlapping in and .
Definition 7
Let . A partition is K-feasible if for all , is -compatible, and no two components in overlap in .
We will simply say is feasible if it is -feasible, which we note does indeed coincide with the definition of a feasible solution to MAF. We make two additional remarks about the notion of K-feasibility:
This stronger compatibility notion will be used in Lemma 7 to show that if is -feasible, then future iterations of the Red-Blue Algorithm will not further subdivide (the restriction of the partition to) . This is not necessarily true if is only -compatible and does not overlap in . See Fig. 5 for an example.
If is a root of infeasibility for , then is not -feasible. The converse is not true, however: if contains a single component containing which is -compatible, but this component contains both such that is compatible, and such that is not compatible, then is not -feasible, but u is not a root of infeasibility. See Fig. 6 for an example. The stronger notion of a u being a root of infeasibility versus not being -feasible is needed when we prove the approximation guarantee in Sect. 4.
Fig. 5.
An example where is -compatible and does not overlap in , but that is not )-feasible. In this example, , which clearly does not overlap in any node. If we stop the current iteration with , then and are lowest roots of infeasibility; no matter which one is chosen, the next iteration would further subdivide the partition restricted to . Because we want to ensure this does not happen, the current iteration of the Red-Blue Algorithm will further subdivide the partition induced on : it will create components in Make-Splittable and split everything into singleton components in Split
Fig. 6.
An example where is not -feasible, but u is not a root of infeasibility. (To emphasize that u is not a root of infeasibility, the leaves are labelled with , , , and , in contrast to earlier figures.) In this example, , which does not overlap in any node, and is –compatible because the triple is compatible. But is not -feasible because is not compatible . On the other hand, u is not a root of infeasibility because is compatible
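The remarks and figure captions above repeatedly test whether a triple of leaves is compatible. The sketch below shows one way such a test could be coded. It assumes the standard notion that a triple is compatible when both rooted binary trees induce the same rooted topology on it (the same pair of leaves has the strictly lower common ancestor in both trees). The parent-map representation, the function names and the toy trees are all hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def ancestors(parent: Dict[str, str], node: str) -> List[str]:
    """Path from node up to the root, both inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(parent: Dict[str, str], x: str, y: str) -> str:
    """Lowest common ancestor of x and y."""
    seen = set(ancestors(parent, x))
    for node in ancestors(parent, y):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def closest_pair(parent: Dict[str, str], x: str, y: str, z: str) -> FrozenSet[str]:
    """The pair of the triple whose LCA is deepest (unique in a binary tree)."""
    pairs = [(x, y), (x, z), (y, z)]
    depth = lambda p: len(ancestors(parent, lca(parent, *p)))
    return frozenset(max(pairs, key=depth))


def triple_compatible(t1: Dict[str, str], t2: Dict[str, str],
                      x: str, y: str, z: str) -> bool:
    """Assumed definition: both trees agree on which pair is closest."""
    return closest_pair(t1, x, y, z) == closest_pair(t2, x, y, z)


def set_compatible(t1: Dict[str, str], t2: Dict[str, str],
                   leaves: Iterable[str]) -> bool:
    """A leaf set is treated as compatible when all of its triples are."""
    return all(triple_compatible(t1, t2, *t) for t in combinations(list(leaves), 3))


# Toy trees as child -> parent maps: tree1 = ((a,b),c), tree2 = ((a,c),b).
tree1 = {"a": "p", "b": "p", "p": "r", "c": "r"}
tree2 = {"a": "q", "c": "q", "q": "r", "b": "r"}
```

Sets with fewer than three leaves contain no triple and are therefore reported as compatible by this sketch.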
Before we prove that the outcome of Split is -feasible, we prove the following technical lemma that gives sufficient conditions for a partition to not overlap in .
Lemma 5
Let be the partition and (R, B, W) be the coloring at the start of an iteration. Let be a refinement of that does not overlap in and that is -compatible . Then does not overlap in if the following two conditions are met:
- (i) has at most one multicolored component;
- (ii) for the multicolored component (if it exists), either or any node with is covered only by components in that are subsets of W, or that are also components of .
Proof
Suppose the conditions of the lemma hold for . First, observe that having at most one multicolored component implies that contains at most one component covering . Hence, if we suppose for a contradiction exist that overlap in , then they must overlap in or . Without loss of generality, assume that overlap in . Since they do not overlap in , we may assume also without loss of generality that and .
Since was chosen as a lowest root of infeasibility, was not a root of infeasibility for . This implies that no two components of overlap in , so it must be the case that and were both part of a single component in and were split. Also, must have been R-compatible, so is a compatible set. We will show that these facts imply that if and overlap in , then they must overlap in , thus contradicting that does not overlap in .
Let v be a lowest node in such that and where we note that v exists since overlap in some node in . Observe that a child of v cannot be in both and , as this contradicts the choice of v, and v itself is in and only if and also contain leaves in . Let be in and respectively, and choose in and . Note that because , and because is a descendant of , and the coloring guarantees that all descendants of nodes in are red.
First, assume both and are unicolored (that is, both red). Then also , so is a compatible set. Note that and similarly . Since is compatible, we must also have and . But then is on the path from to as well as on the path from to . Hence, and overlap in , contradicting that does not overlap in .
Now, suppose that while is unicolored, is multicolored . Since is compatible, , so the fact that implies that . Now, it must be the case that , because and otherwise and overlap in , contradicting that does not overlap in . The fact that implies that . So , and by property (ii), it must thus be the case that is covered only by components in that are subsets of W or that are also components of . But this is a contradiction because is covered by .
The next lemma states that the partition resulting after Split is -feasible.
Lemma 6
Let be the partition after Split. Then is a refinement of that is -feasible.
Proof
It is easy to see that every component is -compatible for all : each component is either unicolored (and thus -compatible by the fact that the partition is R-compatible and B-compatible), or it is the result of a Special-Split on a component in which all tricolored triples are compatible, and hence, since all triples in are compatible by the fact the component is -compatible , it was already -compatible for all before the Special-Split.
To see that does not overlap in , note that the fact that does not overlap in and is splittable (by Lemma 3) implies that do not overlap in for any . If A is split by a Special-Split into and , then A is -compatible for all (again, because A has no incompatible tricolored triples and A is -compatible ). This implies that there is a node such that ; hence, and do not overlap in .
It remains to show that no two components in overlap in . We check the sufficient conditions in Lemma 5. The only possible multicolored components of are bicolored components created by Special-Split on a component in that is tricolored and in which every tricolored triple is compatible . By property 3 of Lemma 4, the only tricolored components that have a compatible tricolored triple are top components. By Observation 1, the partition at the start of the iteration had at most one tricolored component, and thus there can be at most one top component, say A, that is tricolored in . Since A is ()-compatible for all , there is a node such that . Split subdivides A into and , where is the unique multicolored component in . Let , and suppose that there exists a component that covers a node on the path from to . Then must be on this path, too, so . Observe that cannot be . Also, since A was the unique top component in , no component created in the current iteration has a lowest common ancestor above . So must have been a component in the partition at the start of the iteration, and by Lemma 5 we conclude that does not overlap in .
Find-Merge-Pair and Merge-Components
The astute reader may have noted that the Red-Blue Algorithm sometimes increases the number of components by more than necessary to be -feasible. One example of this is given in Fig. 5. More generally, it follows from the arguments in the proof of Lemma 6 that if there is a tricolored component in which every tricolored triple is compatible , then not further subdividing this component would also leave a partition that is -feasible. Find-Merge-Pair and Merge-Components aim to merge two components of the partition produced at the end of Split, so that the partition with the merged components is still -feasible. Find-Merge-Pair thus looks for a pair of components that can be merged, by scanning the components of the current partition, and finding two leaves in that are in different sets of the partition now, but that were in the same component at the start of the current iteration. We note that a pair of components may also be found when no Special-Split is done on a tricolored component in which every tricolored triple is compatible; in other words, Find-Merge-Pair and Merge-Components can do more than simply reversing those splits on tricolored components in which every tricolored triple is compatible . In the proof of the approximation guarantee (in particular, in Proposition 15), we will show the existence of very specific components that can be merged. However, merging any pair of components created in the current iteration leads to the same approximation guarantee.
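The scanning step described in the paragraph above can be sketched as follows. This is an illustration only: it covers just the search for two leaves that shared a component at the start of the iteration and are now separated, and it assumes that the eligible leaves are the red and blue ones (the text below excludes leaves in W). The function name, its signature and the toy data are hypothetical, and any further condition in the paper's pseudocode is not modelled.

```python
from typing import FrozenSet, Optional, Sequence, Set, Tuple

Component = FrozenSet[str]


def find_merge_pair(start_partition: Sequence[Component],
                    current_partition: Sequence[Component],
                    eligible: Set[str]) -> Optional[Tuple[str, str]]:
    """Return two eligible leaves that were together at the start of the
    iteration but lie in different components now, or None if no such pair exists."""
    component_of = {leaf: index
                    for index, component in enumerate(current_partition)
                    for leaf in component}
    for old_component in start_partition:
        # Sorting only makes the returned pair deterministic.
        candidates = sorted(leaf for leaf in old_component if leaf in eligible)
        for leaf in candidates[1:]:
            if component_of[leaf] != component_of[candidates[0]]:
                return candidates[0], leaf
    return None


# Toy data: one component at the start, three components after Split.
start = [frozenset({"a", "b", "c", "d"})]
current = [frozenset({"a"}), frozenset({"b"}), frozenset({"c", "d"})]
pair = find_merge_pair(start, current, eligible={"a", "b"})
no_pair = find_merge_pair(start, current, eligible={"c", "d"})
```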
Although we could simply merge the components containing and for the pair found by Find-Merge-Pair, we will not do so until the very end of the algorithm. We keep such “superfluous” splits because they increase the objective value of the dual solution we use to prove the approximation guarantee of 2 (see Sect. 4). We “reverse” these superfluous splits (i.e., we merge components) at the end of the algorithm; this is reminiscent of a “reverse delete” in approximation algorithms for network design [13]. The merges are thus delayed only to simplify the description of the dual solution in the analysis.
The proof that we will be able to merge the components containing the pair of leaves identified by Find-Merge-Pair at the end of the algorithm will rely on the fact that (i) because the partition is ()-compatible for any , merging the components containing the identified leaves cannot increase the number of incompatible triples contained in a component, and (ii) because the partition is -feasible, future iterations of the algorithm will not further refine the partition induced on . This is the reason why we do not allow Find-Merge-Pair to choose leaves in W (and only choosing leaves in is sufficient to prove the claimed approximation guarantee).
Lemma 7
Let (R, B, W) be the coloring during some iteration of the Red-Blue Algorithm, and let be the partition at the end of the iteration. Then the algorithm does not refine the partitioning restricted to in later iterations: for any that are in the same component of , x and are in the same component in any partition at any later point of the algorithm’s execution.
Proof
Suppose for a contradiction that a later iteration with coloring separates two leaves in the same component of . Let A be the component containing x and at the start of this iteration. Since is -feasible, no is a root of infeasibility, and hence all leaves in , and in particular x and , must have the same color in the coloring . Notice that by the definition of Split, x and cannot be separated during Split. Hence, they must be separated during Make--compatible or Make-Splittable. In both cases there must exist some such that is multicolored with respect to the coloring , and contains precisely one of . By relabeling if needed, assume that and . Let be any leaf with a color (in the coloring ) different from x, and note that
(1)
Because all leaves in , have the same color in , and because w has a different color than x in , we know that . But, since is ()-compatible, this implies that if w is in the same component as x and in (a refinement of) , then , contradicting (1), because only one of and can be strictly below .
Correctness of the algorithm
Theorem 8
The Red-Blue Algorithm returns a feasible solution to MAF.
Proof
In each iteration through the main loop of the algorithm, the partition is strictly refined. Thus there are fewer than iterations. When the main loop terminates, is not a root of infeasibility, and so the partition at this stage is feasible. It remains to prove that merging components using Merge-Components maintains the feasibility of the partition.
We prove this by induction on k, the number of pairs in pairslist. If , Merge-Components does nothing, and so the returned partition is indeed feasible.
So suppose . Observe that the result of Merge-Components applied to a partition is the unique finest coarsening of in which every pair of nodes in pairslist is in the same component, and hence does not depend on the order in which the pairs in pairslist are considered. We may thus assume without loss of generality that they are considered in the reverse order in which they were added to pairslist.
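The observation that the result is the finest coarsening in which every listed pair shares a component, independently of the processing order, is what a union-find structure computes. The sketch below is one possible implementation under that reading, not the paper's pseudocode; the function name, the integer leaf labels and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Sequence, Set, Tuple


def merge_components(partition: Sequence[FrozenSet[int]],
                     pairslist: Iterable[Tuple[int, int]]) -> Set[FrozenSet[int]]:
    """Finest coarsening of `partition` in which each pair shares a component."""
    parent: Dict[int, int] = {leaf: leaf for comp in partition for leaf in comp}

    def find(x: int) -> int:
        # Follow parent pointers to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Leaves of one component start out in the same class.
    for comp in partition:
        leaves = list(comp)
        for leaf in leaves[1:]:
            parent[find(leaf)] = find(leaves[0])

    # Each pair forces its two classes to be united.
    for x, y in pairslist:
        parent[find(x)] = find(y)

    classes: Dict[int, Set[int]] = {}
    for leaf in parent:
        classes.setdefault(find(leaf), set()).add(leaf)
    return {frozenset(c) for c in classes.values()}


# Toy data: four components and two pairs chaining three of them together.
partition = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5}), frozenset({6})]
pairs = [(2, 3), (5, 3)]
merged = merge_components(partition, pairs)
merged_reversed = merge_components(partition, list(reversed(pairs)))
```

Processing the pairs in either order gives the same partition here, which is the order independence used in the induction.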
Let be the partition obtained during Merge-Components after the components have been merged for all pairs on pairslist, except the pair that was added to pairslist first. Let be the partition at the moment when was added to pairslist during the main loop of the algorithm, i.e., the partition at the end of Split in the iteration where was added to pairslist; let R, B, W be the three color sets of that iteration. In all subsequent iterations was further refined, and any of the pairs aside from added to pairslist consists of two leaves that were in the same component in the partition at the start of the iteration in which they were added to pairslist, and hence in the same component of . Thus, is a refinement of and is a coarsening of the partition at the end of the last iteration. Thus by Lemma 7, and induce the same partition of . Moreover, by the induction hypothesis, every component of is compatible.
Let be the components in containing respectively. By the choice of , is -compatible for every , and does not overlap any component of in .
If are unicolored , they both contain leaves in only, because by definition of Find-Merge-Pair. As argued above, contains components and as well. Furthermore, in this case, the set is a subset of and thus -compatibility for all implies the set is compatible . Since , cannot overlap any set ; this implies it also does not overlap any set , since is a refinement of .
If and are not both unicolored, observe that only one of is bicolored and contains leaves in , because does not overlap in so it can only have one multicolored component, and the only multicolored components after Split are subsets of . Suppose without loss of generality that is unicolored and contains leaves in . As mentioned before, by Lemma 7, and have the same components restricted to , whence contains component and a component , where .
We need to show that is compatible and does not overlap any component in . For the latter, suppose in order to derive a contradiction that overlaps . Observe that the only nodes in that are not in are in , so the overlap must be on a node . Since is a refinement of , there must exist such that , and thus overlaps A in v as well. But then also overlaps A in v contradicting that is -feasible.
To show that is compatible , note that is a component of , and thus, by the induction hypothesis, is compatible . By the choice of , we know is ()-compatible for all . So to show that is compatible , it suffices to consider with and . Fix any , and note that . Therefore, for , since is implied by being -compatible. So is compatible exactly when is compatible . Because, as we noted, is compatible , we conclude that is compatible .
Proof of the approximation guarantee
We showed in the previous section that the Red-Blue Algorithm returns a feasible solution . In order to prove that our algorithm achieves an approximation guarantee of 2, we will use linear programming duality.
The linear programming relaxation
Let be the set of all compatible subsets of . Introduce a variable for every compatible set , where in an integral solution, indicates that L forms part of the solution to MAF. The constraints ensure that in an integral solution, is a partition, and that for two distinct sets with . The objective encodes the size of the partition minus 1. 
The equality constraint on the leaves can be replaced by the inequalities for all . For given a solution for which the constraint for some leaf v is not tight, we can simply choose some set L containing v with , and decrease while (if ) increasing . This cannot increase the cost of the solution, and clearly maintains feasibility. By repeating this process, we obtain a solution to (LP) of cost no larger than the cost of the original .
In fact, it will be convenient for our analysis to expand the first set of constraints (in their inequality rather than equality form) to contain a constraint for every (not necessarily compatible) set of leaves A, stating that every such set must be intersected by at least one component in the chosen MAF solution. All these constraints of this expanded set are clearly implied by the constraints for A a singleton, which are exactly the first set of constraints in (LP).
This expanded formulation provides us with a more expressive dual:
We will refer to the left-hand side of the first family of constraints, i.e., , as the load on set L, and denote it by . By weak duality, we have that the objective value of any feasible dual solution provides a lower bound on the objective value of any feasible solution to (LP), and hence also on the optimal value of any feasible solution to MAF. Hence, in order to prove that an agreement forest that has components is a 2-approximation, it suffices to find a feasible dual solution with objective value , i.e., for every new component created by the algorithm, the dual objective value should increase by (on average).
The dual solution
The dual solution maintained is as follows. Throughout the main loop of the algorithm, if and only if A is a component in . In the last part of the algorithm, when we merge components according to pairslist , we do not update the dual solution; these operations affect the primal solution (i.e., ) only.
Initially, for all . At the start of each iteration, we decrease by 1, where . Whenever in the algorithm we choose a component A and a node , and separate the component A into and , we decrease by 1. To be precise, this happens in Make- -compatible, Make-Splittable and in one case in Special-Split (where we actually further refine ). The lines where such nodes are chosen are indicated by in the description of the algorithm and the procedures it contains.
Lemma 9
The dual solution maintained by the algorithm is feasible.
Proof
We prove the lemma by induction on the number of iterations. Initially, for all and and hence every compatible set L has a load of 1.
At the start of an iteration, we decrease by 1, thus decreasing the load by 1 on any multicolored compatible set L. We show that the remainder of the iteration increases the load by at most 1 on a multicolored compatible set and that it does not increase the load on any unicolored compatible set.
First, observe that Make- -compatible and Make-Splittable do not increase the load on any set: Separating A into and increases the load on sets L that intersect both and , since gets decreased from 1 to 0, and and increase from 0 to 1. However, in this case , and thus decreasing by 1 ensures that the load on L does not increase.
To analyze the effect of Split, we use the following two claims.
Claim 10
In the procedure Split the load on any compatible set L is increased by at most the number of components such that is multicolored .
Proof. If the load on L is increased because Split splits a bicolored component A into two unicolored components, then L must intersect both new components, so is bicolored (and thus multicolored ) and the load on L is increased by 1.
Consider the case where the load on L is increased because a tricolored component A is split into , and . This split happens when all tricolored triples in A are incompatible . Therefore cannot be tricolored . Since the load on L increased by splitting A, we conclude that must be bicolored and the load on L is increased by 1.
Finally, suppose the load on L is increased because Special-Split() is executed for a component A. We consider the two cases of Special-Split. In the first case, A is split into two components, one of which contains all red leaves in A. The load on a set L thus increases by 1 if is multicolored and and by 0 otherwise. In the second case, A is split into four components; we think of this as first splitting A into and , and then splitting by intersecting with R, B and W. Since is decreased by 1, splitting A into and does not affect the load on any set L. Splitting by intersecting with R, B, W increases the load on L by 1 if is bicolored and by 2 if it is tricolored . We show below that cannot be tricolored , which implies that the load on L increases by at most 1 if is multicolored , thus proving the claim. Suppose contains a triple . The fact that A is -compatible implies that . Since , we thus have either or . In either case, is incompatible, contradicting that L is compatible.
Claim 11
If L is compatible, and A and do not overlap in , then and cannot both be multicolored .
Proof. Assume that (otherwise, the claim is vacuously true). Since and are disjoint, we may assume without loss of generality that for all and . Hence, if and are both multicolored sets, then there exist where x, y have different colors, have different colors, , and . We claim this implies is incompatible, a contradiction since and L is compatible.
Clearly one of x, y has the same color as one of . Suppose without loss of generality that have the same color. If x and are both red, y is either blue or white. x and being red implies , which, since , shows that is an incompatible triple. The case when x and are blue is analogous. If x and are both white, then y and are in . This implies , and so, since , this implies is an incompatible triple.
It follows immediately from the two claims that Split increases the load by at most 1 on any multicolored compatible set and that it does not increase the load on any unicolored set, which completes the proof of the lemma.
The primal and dual objective values
Let , pairslist be the partition and pairslist at the end of an iteration, and let be the objective value of the dual solution at this time. In this section, we show that every iteration of our algorithm maintains the invariant that
(2)
Observe that the approximation guarantee immediately follows from this inequality, since the objective value of the algorithm’s solution is (where , pairslist are the partition and pairslist at the end of the final iteration), and by weak duality D gives a lower bound on the optimal value of the MAF instance.
To prove that the algorithm maintains the invariant, we will show that a given iteration increases the left-hand side of (2) by at least as much as the right-hand side. We let be the change in the dual objective during the iteration and be the increase in the number of components minus the number of pairs added to pairslist (either 0 or 1) during the current iteration.
Since at the start of the algorithm, the partition consists of exactly one component, and for all , (2) holds before the first iteration. So to show (2), it suffices to show that
(3)
for any iteration.
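For readers following along without the displays, one reading of (2) and (3) that is consistent with the surrounding text is sketched below. The symbols are assumptions made here: \mathcal{P} stands for the current partition, D for the dual objective value, \Delta D and \Delta P for the per-iteration changes defined above, and OPT for the optimal value of the MAF instance. The factor 2 is inferred from the claimed guarantee rather than copied from the source.

```latex
% Assumed form of invariant (2) and of the per-iteration inequality (3).
\begin{align*}
2D &\ge |\mathcal{P}| - 1 - |\mathrm{pairslist}| && \text{(2)}\\
2\,\Delta D &\ge \Delta P && \text{(3)}
\end{align*}
% With weak duality, D \le \mathrm{OPT}, the final partition \mathcal{P}' and
% final pairslist' then satisfy
\[
|\mathcal{P}'| - 1 - |\mathrm{pairslist}'| \;\le\; 2D \;\le\; 2\,\mathrm{OPT}.
\]
```

Under this reading, summing (3) over all iterations, starting from a state in which both sides of (2) are zero, yields (2) at termination.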
In what follows, we use the following to refer to the state of the partition at various points in the current iteration: at the start; after Make- -compatible; after Make-Splittable; and after Split.
We begin by showing that the coloring (R, B, W) and the partition satisfy the conditions of one of three cases.
Lemma 12
Given an infeasible partition that does not overlap in , let be a lowest root of infeasibility, and let and be u’s children in . Let , and . Then is R-compatible and B-compatible and satisfies exactly one of the following three additional properties:
- Case 1: has exactly one multicolored component, say , where is tricolored, not -compatible, and there exists , i.e., contains a compatible tricolored triple.
- Case 2: has exactly two multicolored components, say , where and .
- Case 3: has exactly one multicolored component, say , where is tricolored, -compatible and contains no compatible tricolored triple.
We will see in the proof below that Cases 1, 2 and 3 correspond to a lowest root of infeasibility satisfying (a), (b) and (c) respectively in Definition 4. We refer the reader to Fig. 1 for an illustration of the three cases.
Proof
Observe that if is infeasible, then the root of , i.e., is a root of infeasibility, and that no is a root of infeasibility. Hence, u is well-defined and R and B are non-empty. Note that is R-compatible and B-compatible, since u’s children are not roots of infeasibility.
We will show that if u satisfies condition (a) in the definition of a root of infeasibility, then the conditions of Case 1 are satisfied, if (b) holds, the conditions of Case 2 are satisfied, and if (c) holds, then the conditions of Case 3 are satisfied.
We start with (b): overlaps in . Observe that, because u is a lowest root of infeasibility, the only node in on which overlaps is u, and thus there must be at least two multicolored components if (b) holds. If there are two multicolored components, both containing, say, red leaves, then they overlap in , which implies is a root of infeasibility, contradicting the choice of u. Similarly, there is at most one multicolored component containing blue leaves. Hence, the conditions of Case 2 are satisfied.
If (b) does not hold, i.e., the partition does not overlap in , then there is at most one multicolored component; the conditions in (a) and (c) both imply there is at least one. Thus there is exactly one multicolored component, which we will call . We let and (where we stress that is a node in , whereas u is a node in ).
If (a) holds, then is not -compatible , and thus . To derive a contradiction, suppose that Case 1 is not implied, i.e., there does not exist , i.e., . Observe that, because is not -compatible , or . Suppose the former holds without loss of generality. But then is a root of infeasibility satisfying (c), because , and for all , is incompatible, by the fact that , and . But is a descendant of u, thus contradicting the choice of u.
Suppose now (c) holds, i.e., is -compatible , and in particular is -compatible . Because is multicolored , we can assume without loss of generality that . If , then (c) holds for , which is a descendant of u, thus contradicting the choice of u. Since by condition (c), we conclude is tricolored . It remains to show every tricolored triple is incompatible. Suppose for a contradiction that is a tricolored triple that is compatible. Let w be the white leaf in the triple, then compatibility requires that . On the other hand, the fact that is -compatible implies that . But then any tricolored triple in containing w is compatible, so that is compatible, contradicting that condition (c) holds.
Recall that the coloring is defined only at the start of the iteration. The lemma ensures that the partitions during the iteration always have either one (in Cases 1 and 3) or two (in Case 2) top components. Furthermore, we can use the lemma to show that the components created by Make- -compatible and Make-Splittable are multicolored.
Lemma 13
Only multicolored components are created by Make- -compatible and Make-Splittable, i.e., if , then A is multicolored.
Proof
It follows immediately from the description of Make-Splittable that components created by this procedure are multicolored. Observe that Make- -compatible is used only if is not -compatible. By Lemma 12, this implies must have exactly one multicolored component , which contains a white leaf that is not a descendant of . Make- -compatible (possibly repeatedly) subdivides the top component into and . From the description of the procedure, it is clear that the newly created non-top component intersects both R and B, and the new top component must have a leaf in , because otherwise is already -compatible. So , and must also be in the new top component , thus ensuring that the top component remains multicolored.
For Cases 2 and 3, the analysis is quite simple.
Proposition 14
Let the initial partition and coloring (R, B, W) satisfy the conditions of Cases 2 or 3 in Lemma 12. Then .
Proof
We first make two observations that apply in Cases 2 and 3: (i) is already -compatible , so , and (ii) Split() will not perform any Special-Split, because no component of a refinement of can have a tricolored triple that is compatible (since we are in Case 2 or 3). From these two observations we derive that
(4)
To see this, note that, since no Special-Split is performed, is equal to the number of bicolored components in plus twice the number of tricolored components in . By Lemma 13 has more multicolored components than , and, since , property 2 of Lemma 4 implies that has the same number of tricolored components as . So in Case 2, has bicolored components and zero tricolored components, and in Case 3, has bicolored components plus one tricolored component, and indeed (4) holds.
In addition, we note that
| 5 |
To see this, note that at the start of the iteration, the dual objective value is reduced by 1 when is decreased by 1 for . Make-Splittable does not change the dual objective value, because, even though increases by 1 every time the number of components increases by 1, decreases by 1 as well. Finally, since Split will not perform any Special-Split, the increase in the dual objective value due to Split is equal to the increase in the number of components due to Split, which is .
Note that the size of pairslist may increase but will never decrease, and thus
We now prove a similar proposition for Case 1, the proof of which is more involved.
Proposition 15
Suppose the initial partition and coloring (R, B, W) satisfy the conditions of Case 1 in Lemma 12. Then .
Proof
In Case 1, we start with containing one tricolored component , which is not -compatible. is the only component that will be subdivided in the current iteration (by property 1 of Lemma 4). Note that and therefore have exactly one top component.
Let be a white leaf in that is not a descendant of , which exists by the definition of Case 1. By property 5 in Lemma 4, is contained in the top component of , and by Lemma 13 this component is multicolored. Therefore, the top component of is either bicolored, or it is tricolored and a Special-Split is performed on the top component.
Let be an indicator variable that is 1 if the top component in is tricolored and has a tricolored triple that is incompatible; in other words, if Special-Split subdivides the top component into four components. If , then either the top component is bicolored or it is tricolored and all its triples are compatible; in other words, the top component is subdivided into two components by Split (possibly via Special-Split). Thus splitting the top component increases the number of components by .
Now, let t be the number of tricolored components in that are not top components. We claim that
| 6 |
To show this, we need to argue that the increase in the number of components due to splitting the multicolored non-top components is . Since has one multicolored component, Lemma 13 implies that has multicolored components. Precisely one of these is a top component, so has multicolored non-top components. By property 3, each of the tricolored components that are not top components does not require a Special-Split and is thus subdivided into three components by Split. Hence, splitting the components that are not top components increases the number of components by .
Next, we analyze the increase in the dual objective. We claim that
| 7 |
To see this, note that the dual objective is decreased by 1 when we decrease by 1 at the start of the iteration. As argued in the proof of the previous proposition, the dual objective is not affected by Make-Splittable. The same argument used there implies that the same holds for Make- -compatible. Finally, if , the increase in the dual objective due to Split is equal to the increase in the number of components, . If , the same holds, but Special-Split on the top component also decreases by 1.
So we get that
Hence, if we have as required. So the rest of the proof, which requires quite some extra technicalities, deals with the situation of Case 1 and . Recall that is equal to minus the number of pairs added to pairslist in the current iteration; hence, to conclude that if , we need to show a pair is added to by Find-Merge-Pair.
We will say that a component is able to reach if or if and all intermediate nodes on the path from to are not covered by any component in . The following lemma (which is actually valid in general, and not only for Case 1) enumerates precisely the situations when a merge is possible.
Lemma 16
Let , and let denote the set of components in that are subsets of . Then there exists a pair of elements in that can be added to pairslist if and only if at least one of the following is true:
contains a bicolored component.
There is a node that can be reached by two red components or two blue components in .
There is a node that can be reached by a red and a blue component in , but is not covered by these components. Furthermore, the node must satisfy that the nodes on the path from to are not covered by any red or blue component in .
Proof
Since any two multicolored components overlap in and does not overlap in by Lemma 6, there is at most one tricolored component in . By the definitions of Split and Special-Split, therefore has at most one multicolored component, which has blue and white leaves and is created by applying Special-Split to the tricolored component in . If this blue-white component exists in , we denote it by .
If contains a bicolored component , let be the tricolored component from which Special-Split formed a red component A and the bicolored component . We show that we can merge A and , which boils down to undoing the Special-Split operation, to obtain a new partition that is -feasible. Since was not overlapping with any other component in , undoing the Special-Split yields a component that does not overlap any other component of the partition in . For every , is ()-compatible since is -compatible and, by the conditions of the Special-Split operation, every tricolored triple in is compatible. Since was the unique top component in , any component of (and hence of its refinement ) overlapping a node such that must be a component in . Therefore, by Lemma 5, the new partition does not overlap in .
If does not contain a bicolored component , suppose are distinct red components in so that A and can both reach the same node in . Then merging A and gives a new partition that does not overlap in , and which has no multicolored components. Since is compatible, so is . By Lemma 5 and the fact that the new partition does not have any multicolored components, it does not overlap in . Hence, merging A and gives a new partition that is -feasible.
The same applies if A and are both blue components in .
If does not contain a bicolored component , suppose there exist with A red and blue such that (i) there exists that can be reached by both A and ; and (ii) the nodes on the path from to are not in for any red or blue component in . Observe that (ii) implies that any component such that contains nodes on the path from to must be a subset of W: must be in if contains a node on this path, and by the case assumption, contains no multicolored component.
Merging A and gives a new partition that does not overlap in and the new component is -compatible by (i). Thus the new partition is -compatible , and since it has no components with white leaves as well as leaves in , it is vacuously also ()-compatible for any . is the unique bicolored component in this new partition, thus satisfying condition (i) of Lemma 5. Moreover, it satisfies that any node on the path from to is not covered by a component that is not white. By Lemma 12, must have been the unique multicolored component in , and thus the components of the partition that overlap a node on the path from to were not changed in the current iteration. Therefore, also condition (ii) of Lemma 5 is satisfied, and the lemma implies that the new partition does not overlap in . Hence, merging A and gives a new partition that is -feasible.
We note that the above three cases encompass all possible merge opportunities within . If two components cannot reach the same node , then merging them gives a partition that overlaps in . If red and blue components A and can only reach nodes in that are covered by either A or , then is not -compatible. And if a red and blue component A and can reach a node that is not in , but some node on the path from to is covered by a component that is red or blue, then will overlap in or . To see this, assume is red (the blue case is analogous) and let be the node in closest to on the path from to . Then , and since are compatible in R, we should also have . Thus and A overlap on a node on the path from to .
We are now ready to complete the proof of Proposition 15, by showing that in Case 1 and if (i.e., if has no tricolored components that are not top components), then at least one of (a), (b) and (c) in Lemma 16 holds for . By the conditions of Case 1, the unique tricolored component in is not -compatible, and there exists .
If (a) holds, we are done, so suppose (a) does not hold, i.e., has only unicolored components. We first make some observations which we later use to conclude that (b) or (c) must hold. Let be the last node chosen in Make- -compatible to subdivide the top component. Because is not -compatible, at least one iteration of Make- -compatible has to be executed on the component, so the existence of follows. Let be the top component that is subdivided into and at this point (after which the current partition is ). We observe some properties of the two new components:
Letting and be the children of , then and . To see this, note that by definition of Make- -compatible, and each have a non-empty intersection with exactly one of R and B, and they cannot intersect W because otherwise is a tricolored non-top component of , and then would also have a tricolored non-top component by the definition of Make-Splittable, contradicting that .
Because was the top component at the moment Make- -compatible subdivided A into and , is the top component in . By Lemma 13, intersects . By the conditions of Case 1 and property 5 in Lemma 4, it contains a node that is not a descendant of . Finally, is the only component in that can cover a node on the path in from to (by the fact that was the unique top component in and is the unique top component in ).
Using the above, we now show we can find two components in that are subsets of A and that satisfy condition (b) or (c) of Lemma 16. As argued above, and , so is splittable; Split will subdivide into a blue component and a red component . There are a few cases to consider (illustrated in Fig. 7).
If there is no node on the path in from to that is covered by a red or blue component in , then we are done because with and satisfy (c).
- If there are nodes on the path from to that are covered by red or blue components, let be the node closest to for which this is the case. Suppose without loss of generality that for some red component . We will show that there is another red component that can reach , so these two components and satisfy condition (b).
- If the nodes between and are not covered by any components in , then the red component can reach .
- Otherwise, let be the node closest to on the path from to that is covered by a component in . By definition of , this component is white, say . We claim (and prove below) that in Make-Splittable a node must have been chosen that created a non-top component , which was subsequently split into the white component and red component to obtain . By property 4 in Lemma 4, is not covered by nor , so and are the leaves in and , with being the two children of . Hence, can reach . On the other hand, is covered by , and thus . By definition of , the nodes on the path from to are not covered by any component in , and thus can also reach .
It remains to prove that in 2(b), Make-Splittable selected a node that created a non-top component , which was subsequently split into to obtain . First, observe that and were part of the top component in (because they cover nodes on the path from to ). They cannot both have been part of the top component of , because by the conditions of Case 1 and property 5 of Lemma 4, the top component of contains a white leaf that is not a descendant of , and thus covers all nodes on the path from to , which includes . So and overlap and cannot be in the same component of the splittable partition . Thus, Make-Splittable must have selected some when and became part of different components. Note that became part of a non-top component. It remains to show this component contains no blue leaves. Note that otherwise such a blue leaf , and a red leaf and a red leaf (which exists because is the lowest node on the path from to covered by ) would all belong to the component , but since , this triple would be incompatible, contradicting that is -compatible.
Fig. 7.

Illustration of the last part of the proof of Proposition 15. The sets and described in the proof are implicitly shown in the figure: and
Theorem 17
The Red-Blue Algorithm is a 2-approximation for the maximum agreement forest (MAF) problem.
Proof
By Theorem 8, the Red-Blue Algorithm returns a feasible solution to MAF. We showed how to construct a feasible solution for the dual linear program (D’); by Propositions 14 and 15, the objective value of the solution to MAF returned by the Red-Blue Algorithm is at most twice the objective value of this dual solution. The approximation guarantee follows by linear programming duality.
A compact formulation of the LP
Here we give a compact formulation for (LP). This shows that it can be optimized efficiently. While this is not needed in our algorithm, it is possible that an LP-rounding based algorithm could achieve a better approximation guarantee, in which case this formulation will be of use. Moreover, the compact linear program explicitly encodes the structure of compatible sets in a way that (LP) does not; we believe this may provide additional structural insights in the future.
We remark that (LP) can also be shown to be polynomially solvable by providing a separation oracle for the dual. The dual of (LP) is similar to (D), the dual of (LP), except that z is indexed only by singletons and not arbitrary subsets of . This dual has a polynomial number of variables, but an exponential number of constraints. By the equivalence of separation and optimization, it suffices to provide a separation oracle for this dual. In particular, it suffices to solve the problem of finding a most violated constraint amongst
for some given y and z. If we relabel to , making y a vector indexed by V, we can restate this as follows. Given some (positive or negative) weights y on the nodes of V, find a compatible subset L which maximizes . This is a weighted variant of the maximum agreement subtree problem; in other words, the maximum agreement subtree problem is the problem where for all . Similar to the usual (unweighted) version [26], this can be solved in polynomial time via dynamic programming.
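As an illustration of the dynamic program mentioned above, the following is a minimal sketch for two rooted binary trees, assuming the trees are given as nested pairs and the weights as a dictionary. The function name `weighted_mast`, the tree representation, and the requirement that the chosen set be non-empty are assumptions made for this example; the recurrence is the classical one for rooted binary trees and is not necessarily the variant intended in [26].

```python
from functools import lru_cache

NEG_INF = float("-inf")


def weighted_mast(t1, t2, weight):
    """Maximum total weight of a non-empty compatible leaf set of t1 and t2.

    Trees are nested pairs such as ((1, 2), (3, 4)); leaves are hashable,
    non-tuple labels, and both trees are assumed to have the same leaf set.
    `weight` maps each leaf to a (possibly negative) number.
    """

    def is_leaf(t):
        return not isinstance(t, tuple)

    @lru_cache(maxsize=None)
    def leaves(t):
        return frozenset([t]) if is_leaf(t) else leaves(t[0]) | leaves(t[1])

    @lru_cache(maxsize=None)
    def best(u, v):
        # Best weight of a non-empty compatible set whose leaves lie below
        # both u (in the first tree) and v (in the second tree).
        if is_leaf(u):
            return weight[u] if u in leaves(v) else NEG_INF
        if is_leaf(v):
            return weight[v] if v in leaves(u) else NEG_INF
        (u1, u2), (v1, v2) = u, v
        return max(
            # The set lies entirely below one child of u or of v.
            best(u1, v), best(u2, v), best(u, v1), best(u, v2),
            # The set is split at both roots; the two sides must match up.
            best(u1, v1) + best(u2, v2),
            best(u1, v2) + best(u2, v1),
        )

    return best(t1, t2)
```

For example, on the trees `((1, 2), (3, 4))` and `((1, 3), (2, 4))` with unit weights the function returns 2, since no three of the four leaves form a compatible set.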
Assume for convenience that . We will deviate from the notational conventions in the previous sections, and use i and j to denote leaves, and to index the two input trees.
Let denote the set of all pairs for which . Consider a compatible set . For , we will use to denote the subtree of on . Compatibility implies that and are isomorphic. What we will now do is represent the structure of these isomorphic trees by an out-arborescence F(L), where the nodes of the arborescence are elements of Z, and more precisely, are a subset of . We do this as follows.
If L contains only a single element i, then F(L) is the arborescence consisting of the single vertex (i, i).
Otherwise, let and be the partition of L into the leaves below the two children of the root of . Let be the smallest element of and the smallest element of ; we assume (otherwise, swap and ). The root of F(L) will be chosen as . Now recursively apply this procedure to and , yielding arborescences and ; let and denote their respective roots. Note that and are necessarily disjoint, since and are disjoint. Then F(L) is defined to be the union of and , along with the arcs and .
Observe that the pair is of the form for some , since remains the smallest element of , whereas is of the form for some . We call the left arc and the right arc leaving r.
So put differently, this procedure takes the tree (or ; it makes no difference), contracts all nodes with only a single child, orients all edges away from the root, and then assigns a label to each node. This label consists of a pair of leaves in L, chosen minimally amongst the leaves in each of the two subtrees below the node (aside from leaves, which are labelled by repeating the leaf twice). We also note that if L is not a compatible set, then we could still apply this procedure, but it would return different results when applied to instead of .
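The labelling procedure just described can be sketched in a few lines. This is only an illustration under assumed conventions: trees are nested pairs with integer leaves, and the helper names `restrict`, `min_leaf` and `build_F` are hypothetical. The sketch restricts a tree to L, contracts nodes with a single child, and labels each remaining node by the pair of minimal leaves of its two subtrees.

```python
def restrict(tree, L):
    """Subtree of `tree` induced by leaf set L, with unary nodes contracted.

    Returns None if no leaf of L lies below `tree`.
    """
    if not isinstance(tree, tuple):
        return tree if tree in L else None
    left, right = restrict(tree[0], L), restrict(tree[1], L)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def min_leaf(t):
    """Smallest leaf label below t."""
    return t if not isinstance(t, tuple) else min(min_leaf(t[0]), min_leaf(t[1]))


def build_F(tree, L):
    """Return (root label, list of arcs) of the labelled arborescence F(L).

    Each arc is a triple (tail label, head label, "left" or "right").
    """
    arcs = []

    def label(t):
        if not isinstance(t, tuple):
            return (t, t)  # leaves are labelled by repeating the leaf
        a, b = t
        if min_leaf(a) > min_leaf(b):
            a, b = b, a  # put the side with the smaller minimum first
        r = (min_leaf(a), min_leaf(b))
        arcs.append((r, label(a), "left"))
        arcs.append((r, label(b), "right"))
        return r

    root = label(restrict(tree, set(L)))
    return root, arcs
```

For a compatible set the result does not depend on which of the two input trees is passed in, which gives a quick sanity check: calling `build_F` on both trees with the same L should produce the same root and the same set of arcs.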
With this representation of a compatible set in mind, we now construct a certain directed graph D on the vertex set . It essentially contains all possible arcs that could appear in an arborescence constructed from a compatible set. We will use for arcs that can appear as left arcs, and for arcs that can appear as right arcs: the arc set of D is . With a slight abuse of notation, define for any ; we can think of as being the node in that the pair r identifies. Given two nodes and in :
if for all and ;
if for all and .
For any , define ; this is the set of pairs in that appear as labels for the leaves L. Let denote the set of out-arborescences in D with leaf set contained in and where each internal node has one outgoing arc in and one outgoing arc in . Then the above discussion implies that if and only if there is an with leaf set . Let be the characteristic vector of the arc set of F, for any . Let denote the cone generated by , i.e., if and only if there exists with such that .
We begin by giving a description of . For , let denote the arcs in D leaving r, and the arcs entering r. For , let .
Lemma 18
Proof
Let Y denote the cone described by the right hand side of the claimed equality. First, we observe that . Consider any . Then for any , F has precisely one arc entering r, precisely one arc leaving r that is in , and one arc leaving r that is in . Hence , and therefore any conic combination of the ’s is also in Y.
It remains to show that . Suppose ; we prove that , proceeding by induction on the number of nonzero elements of y. The claim trivially holds if , since is a cone. So suppose .
We first claim that for any for which either or , there exists an arborescence rooted at r and contained in the support of y, by which we mean the set of arcs in D for which y is nonzero. To prove this, we can proceed by induction on . The claim is trivial if , since then and we take an arborescence consisting only of the node r. Otherwise, choose any and that are both in the support of y. Notice that one of them must exist because , and then the other must exist as well, because of the equality in the definition of . As a result, , and so the second constraint in the definition of Y implies that either or ; the same holds for . Hence by induction, we obtain arborescences and in the support of y rooted at and respectively. We have already noted that there is no node that both and can reach; thus and are disjoint. We obtain F by combining , and the arcs from r.
Now choose such that but (such an r clearly exists, since D is acyclic and ). By the above, we can find an arborescence rooted at r and contained in the support of y. Now set , where is chosen maximally so that . For every node s contained in F that is distinct from r and not in , . Further, . It follows that . The choice of ensures that has strictly smaller support than y, and so by induction, we deduce that . Hence is too.
Using Lemma 18, we now describe our compact formulation that we baptize . For and , let , i.e., the set of all pairs of leaves with one leaf in v’s left subtree, and the other in its right subtree. 
Lemma 19
() is equivalent to (LP).
Proof
We begin by showing the “easy” direction, that a feasible solution x to (LP) can be converted to a feasible solution to () with the same objective value. Set for all , and . By the “easy” direction of Lemma 18 (and the equality for all ), we can deduce that is feasible to (). Further, the objective values match: for any with , the contribution of the term in y to the objective value is exactly (only the root of F(L) contributes), and the term captures the fractional value of singleton components.
The “hard” direction, that a feasible solution to () can be converted to a feasible solution x to (LP) of the same objective value, follows in exactly the same way, but using the “hard” direction of Lemma 18. The constraints (-2) and (-3) ensure that , and thus we can expand for some . Extending this x to singleton sets by defining for all yields the desired solution to (LP).
Conclusion
We have described a factor-2 approximation algorithm for the MAF problem with a quadratic running time. Unlike previous algorithms for the problem, we crucially exploit the power of linear programming duality in our analysis. A number of clear directions remain for future work.
Most obviously, there is the question of whether the approximation factor can be further improved. The approximation ratio of our algorithm implies an upper bound of 2 on the integrality gap of our linear program. However, the largest lower bound on the integrality gap of our linear program that we are aware of is 5/4; Fig. 8 in Appendix B shows one of many examples achieving this bound. Despite extensive computational experiments on instances with a small number of leaves, we have not been able to find any examples with an integrality gap larger than 5/4. It is thus possible that our formulation could be used as the basis for an improved algorithm (though we would expect such an algorithm to be quite different from the algorithm presented here).
Fig. 8.
Example with integrality gap of 5/4 for the new LP introduced in Section 4. An optimal solution to the LP-relaxation is indicated by the colors: the compatible sets corresponding to each of the components and (indicated by the colors red, blue and green respectively) have an x-value of , as well as every singleton leaf set, except leaf 1; all other x-values are 0. The objective value of this solution is . An optimal solution to the ILP has 6 components, i.e., an objective value of 5. (One such optimal solution has the component along with singleton components.)
One natural idea would be to apply another powerful and successful technique in the theory of approximation algorithms, namely LP rounding. We have shown that the LP relaxation can be efficiently optimized via an equivalent compact formulation. It should, however, be noted that such an approach will be much slower than the purely combinatorial algorithm presented here, where the ILP formulation is used only in the analysis. It may thus not be the most promising approach for an algorithm of practical relevance.
Our ILP formulation may also be useful for exactly solving the MAF problem. Although it is NP-hard, ILP solvers are very successful in practice. Since our formulation appears to be quite strong, it may work better in practice than simpler formulations, such as the one of Wu.
Aside from improving the approximation factor, another natural avenue to pursue is improving the running time. A factor-2 approximation algorithm with a linear or near-linear running time, to match what has been achieved with a factor-3 approximation, would clearly be very desirable. It does not seem straightforward to improve the running time of our current algorithm; quite substantial changes would likely be needed.
Another very natural direction is to consider other variants of the MAF problem. For instance, one could consider the variant with more than two trees, for which the current best approximation factor is 3 [7], or the generalization to non-binary trees. It is straightforward to extend our formulation to both of these settings.
Finally, it must be admitted that our algorithm, and especially its analysis, is far from simple. A truly simple algorithm for MAF with an approximation factor of 2, if one can be found, will certainly require understanding its structure even more deeply.
Acknowledgements
We acknowledge the support of the Tinbergen Institute and the Hausdorff Research Institute for Mathematics, where portions of this research were pursued. We thank the anonymous referees for their very careful readings and detailed feedback on improving the presentation.
N.O. was supported in part by NWO Veni grant 639.071.307 and NWO Vidi grant 016.Vidi.189.087. F.S. was supported in part by NSF grants CCF-1526067 and CCF-1522054. L.S. was supported by NWO Gravitation Programme Networks 024.002.003. A.v.Z. was supported in part by grant #359525 from the Simons Foundation.
Appendix A The running time
It is quite clear from the definition of the Red-Blue Algorithm that it runs in polynomial time. In this section we show that it can be implemented to run in O(n^2) time, where n denotes the number of leaves. (We work in the random access machine model of computation, and assume a word size of .)
We note that our presentation is focused on showing the bound on the running time as straightforwardly as possible, and there are some places where a more careful implementation is more efficient. However, we have not been able to find an implementation with an overall running time of .
We assume is a given partition (not overlapping in ), that is stored such that we can query the size of any component in constant time, and that for each node , we can query , the component in that covers (which will be equal to if does not cover ), and . Note that we can determine this information by a bottom-up pass of in O(n) time. We will recompute it whenever we refine ; since there can only be at most refinement operations, the total time to maintain this information is .
By Harel [14] (see also [2, 15]), we furthermore may assume that the computation of for given nodes takes constant time (after a linear preprocessing time). It immediately follows from this that we can determine whether or not in tree in constant time as well.
We will show that the time between subsequent refinements of is O(n). This bounds the time of the main loop of the algorithm by . The only remaining part of the algorithm is the Merge-Components step, which will perform at most merges, each of which can clearly be done in O(n) time.
Finding a lowest root of infeasibility
We make a single pass through , in bottom-up order (starting from the leaves), until we find a root of infeasibility . We will spend constant time per node, thus showing that the time to find a lowest root of infeasibility is O(n).
For each node that we have already considered, references the component which covers u, with if there are no such components. (If there are multiple such components, u is a root of infeasibility.) Furthermore, is equal to , and is the size of . Observe that for any , we know , the component containing x, and and .
Given a non-leaf node , with children and that have already been considered, we can determine whether u is a root of infeasibility, and, if not, determine and , in constant time: If either (or both) of and do not cover u (which can be determined by checking if or ), set all the values according to which child (if any) does cover u, and end the consideration of node u. So assume from now on that both do cover u.
If , then u satisfies the second condition of a root of infeasibility, and we are done. Otherwise, . Set and . If or , then is incompatible, and u satisfies the first condition of a root of infeasibility. If and then u satisfies the third condition for being a root of infeasibility: by , we know . For any , , while by we know that . So is incompatible for any . Otherwise, u is not a root of infeasibility, and we finish our consideration of u.
Once we have determined the coloring (R, B, W), we compute for each component and . We also compute three additional labels for each node : for . This information can be determined by a bottom-up traversal of in O(n) time. We assume this information is updated whenever the partition is refined.
Make--compatible
Consider the nodes of in bottom-up order, until we find a node such that both and . Since is a lowest such node, if and , then is -compatible; otherwise is precisely as indicated in Make- -compatible.
Make-splittable
We again consider the nodes in in bottom-up order. For any node with , using for , we can check in O(1) time whether is bicolored, and that for any with , that (and hence ).
Split
Note that a regular split of a component A can be done in O(n) time, by simply checking the color of each leaf in A and partitioning A accordingly. We now show how to check if A needs a Special-Split (and if so which of the two possible refinements is applied) by considering the nodes in in bottom-up order.
If A is tricolored, then the fact that A is -compatible and splittable implies that there exist and that are covered by A and for which and . Using a bottom-up traversal of will find and (they are the first nodes encountered such that and for and B respectively).
Given and , we check if a Special-Split is required, by considering ; a Special-Split is required exactly if , since in that case any forms a compatible triple with any . If, in addition, , we know that every tricolored triple in A is compatible.
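The regular split mentioned at the start of this subsection amounts to a single pass over the leaves of A. A minimal sketch, assuming a component is given as a collection of leaves and the coloring as a dictionary from leaves to colour names (both representations, and the name `regular_split`, are assumptions of this example):

```python
from collections import defaultdict


def regular_split(component, color):
    """Partition a component into its colour classes in one pass.

    `component` is an iterable of leaves; `color` maps each leaf to its
    colour (for instance "R", "B" or "W"). Runs in time linear in the
    size of the component.
    """
    parts = defaultdict(set)
    for leaf in component:
        parts[color[leaf]].add(leaf)
    return list(parts.values())
```

The sketch deliberately leaves out the Special-Split test, which needs the tree structure and is handled separately as described above.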
Find-merge-pair
We need to determine if there exist two components (both intersecting ) that can be merged, in time O(n). If such components are found, then we can take a non-white leaf in each component and add this pair to pairslist. Recall Lemma 16, which enumerates all possible situations where a potential merge may exist.
has a bicolored component. This component must have been created by an application of Special-Split, splitting some component into A and . As discussed in the proof of Lemma 16, simply undoing this split is a valid merge. Since Special-Split is invoked at most once per iteration, we can simply add a pair to pairslist during Special-Split.
There is a node that can be reached by two red or two blue components that were part of the same component at the start of the current iteration. By Lemma 12 (and property 1 of Lemma 4), any two components that are not white that were created in the current iteration must have been part of the same component at the start of the iteration. We may assume that we can check for each component in constant time whether it was created in the current iteration.
We work bottom-up in , and set to be the set of red and blue components that were created in the current iteration, and that can reach for every . If contains two components of the same color, these two components can be merged, and we terminate.
Note that if ever contains three components, we will have found a merge and the algorithm will terminate. This ensures that we can compute this for in constant time, given the values of and and for the children of .
There is a node that can be reached by a red and a blue component that were part of the same component at the start of the current iteration, but is not covered by these components. Furthermore, the node must satisfy that the nodes on the path from to are not covered by any red or blue component. Note that by Lemma 12, is the only component that was modified in the current iteration, so for this second condition we can simply check that no node on the path from to the root of is covered by a red or blue component that was created in the current iteration.
If we did not find two components of the same color that can be merged, we have found for every , where . We now work top-down in . If we encounter a node that is covered by a red or blue component that was created in the current iteration, we stop and do not consider the descendants of (since for any such descendant, is on its path to ). If we encounter a node such that and is not covered by any component, the two components in can be merged, and we may terminate.
Appendix B Integrality gap lower bounds
We show a lower bound on the integrality gap of for the integer linear program formulation of Wu [31]. Recall that a solution to MAF can be viewed as the leaf sets of the trees in a forest, obtained by deleting edges from the input trees. The formulation has binary variables for every edge , indicating whether e is deleted from . We use to denote the set of edges in on the path between leaves i and j, and to denote the set of edges in on the path between leaves i and j. Wu’s linear program [31] is given by:
The first family of constraints ensures that, for each inconsistent triple i, j and k, at least one edge is deleted on the paths between i and j, i and k, and j and k. The second family of constraints ensures that at least one edge is deleted for every pair of paths, one between i and j and one between k and ℓ, that are disjoint in , but for which the corresponding paths in are not disjoint.
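The notion of an inconsistent triple used by the first family of constraints can be illustrated with a minimal sketch. It assumes the standard definition (three leaves are inconsistent if the pair that is joined first, i.e., the pair with the deepest lowest common ancestor, differs between the two trees) and a hypothetical nested-tuple encoding of rooted binary trees; the function names are illustrative and are not taken from [31] or from the implementation [19].

```python
from itertools import combinations


def leaf_paths(tree, prefix=()):
    """Map each leaf to its root-to-leaf path of child indices (0 = left, 1 = right).

    Assumed encoding: a leaf is any non-tuple value, an internal node is a
    2-tuple (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths


def lca_depth(path_a, path_b):
    """Depth of the lowest common ancestor = length of the common path prefix."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def closest_pair(paths, triple):
    """Return the pair of the triple whose lowest common ancestor is deepest.

    In a rooted binary tree this pair is unique for any three distinct leaves.
    """
    return max(
        combinations(triple, 2),
        key=lambda pair: lca_depth(paths[pair[0]], paths[pair[1]]),
    )


def inconsistent_triples(tree_1, tree_2):
    """List all leaf triples whose rooted topology differs between the two trees."""
    paths_1, paths_2 = leaf_paths(tree_1), leaf_paths(tree_2)
    assert set(paths_1) == set(paths_2), "both trees must have the same leaf set"
    return [
        triple
        for triple in combinations(sorted(paths_1), 3)
        if closest_pair(paths_1, triple) != closest_pair(paths_2, triple)
    ]


if __name__ == "__main__":
    first = ((1, 2), (3, 4))
    second = ((1, 3), (2, 4))
    # Every triple disagrees on this pair of trees.
    print(inconsistent_triples(first, second))
```

This brute-force enumeration takes cubic time in the number of leaves and is meant only to make the definition concrete; each triple it returns would contribute one constraint of the first family.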
Lemma 20
The integrality gap of the linear program of Wu [31] is at least .
Proof
Let for some k even. We label each internal node in both and with a binary string: the roots get the empty string as label, and given an internal node u, its left child gets u’s label with a “0” appended, and its right child gets u’s label with a “1” appended. In , the leaves are labelled in the same way as the internal nodes, with a binary string of length k. In , the binary string is reversed to give the label of the leaf. For example, the leftmost leaf (of both trees) has label $0\cdots 0$, and the leaf to the right of it has label $0\cdots 01$ in , and $10\cdots 0$ in .
Consider the internal nodes whose labels are strings of length strictly less than k/2; there are exactly $2^{k/2}-1$ such nodes in each tree. We claim that any component A must cover at least of these internal nodes. To see this, consider the set of internal nodes in that have two children in the subtree of induced by A for . We will call such nodes bifurcating. Observe that there are $|A|-1$ such nodes. Furthermore, since A is compatible , there is a 1-1 mapping f from the bifurcating nodes in to the bifurcating nodes in , where, . Now, the label for a bifurcating node is the maximum length prefix that the binary strings for the leaves in have in common, and the label for f(u) is the reverse of the maximum length suffix the leaves in have in common. Since u is bifurcating, at least two distinct leaves lie below it, so the common prefix and the common suffix together have length less than k. Hence, at least one of u and f(u)’s labels has length less than k/2.
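The prefix/suffix argument above can be checked by brute force for small k. The sketch below assumes that both trees are complete binary trees on the $2^k$ binary strings of length k, as the labelling implies; the function names are hypothetical. It verifies that every set of two or three distinct leaves has a common prefix and a common suffix of total length less than k, so that the shorter of the two has length less than k/2, and it counts the internal nodes with labels of length less than k/2.

```python
from itertools import combinations


def common_prefix_length(strings):
    """Length of the longest prefix shared by all strings (all of equal length)."""
    length = 0
    for position, symbol in enumerate(strings[0]):
        if any(s[position] != symbol for s in strings):
            break
        length += 1
    return length


def check_construction(k):
    """Brute-force check of the label-length claim for an even k.

    Returns the number of internal nodes per tree whose label has length
    strictly less than k/2.
    """
    assert k % 2 == 0 and k >= 2
    leaves = [format(value, f"0{k}b") for value in range(2 ** k)]

    # Internal nodes with short labels: all binary strings of length < k/2.
    short_nodes = sum(2 ** length for length in range(k // 2))
    assert short_nodes == 2 ** (k // 2) - 1

    for size in (2, 3):
        for group in combinations(leaves, size):
            # Label length of the lowest common ancestor in the first tree.
            prefix = common_prefix_length(group)
            # Label length of the lowest common ancestor in the second tree,
            # where each leaf sits at the position given by its reversed label.
            suffix = common_prefix_length([label[::-1] for label in group])
            assert prefix + suffix < k
            assert min(prefix, suffix) < k / 2
    return short_nodes


if __name__ == "__main__":
    print(check_construction(4))
```

Checking pairs alone would suffice, since adding leaves to a set can only shorten its common prefix and suffix; triples are included only as a sanity check.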
The fact that any component A must cover at least of the internal nodes with labels of length less than k/2 implies that any partition that does not overlap must have at least components. Thus the optimal value of the integer program is at least .
On the other hand, the LP relaxation of the integer program has a feasible solution with objective value : set a value of on the edges to each leaf in the tree (i.e., from an internal node with a label of length $k-1$ to a node with a label of length k), and a value of on all edges between nodes with labels of length to nodes with labels of length . This implies a lower bound of on the integrality gap.
As remarked in the introduction, the largest integrality gap for our formulation that we are aware of is 5/4. The instance is described in Fig. 8.
Footnotes
This paper is based on the (substantially different) extended abstract [23].
References
- 1. Allen BL, Steel M. Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. 2001;5(1):1–15. doi: 10.1007/s00026-001-8006-8.
- 2. Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN), pp. 88–94 (2000)
- 3. Bonet ML, St. John K, Mahindru R, Amenta N. Approximating subtree distances between phylogenies. J. Comput. Biol. 2006;13(8):1419–1434. doi: 10.1089/cmb.2006.13.1419.
- 4. Bordewich M, McCartin C, Semple C. A 3-approximation algorithm for the subtree distance between phylogenies. J. Discret. Algorithms. 2008;6(3):458–471. doi: 10.1016/j.jda.2007.10.002.
- 5. Bordewich M, Semple C. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Comb. 2004;8(4):409–423. doi: 10.1007/s00026-004-0229-z.
- 6. Chataigner F. Approximating the maximum agreement forest on trees. Inf. Process. Lett. 2005;93(5):239–244. doi: 10.1016/j.ipl.2004.11.004.
- 7. Chen J, Shi F, Wang J. Approximating maximum agreement forest on multiple binary trees. Algorithmica. 2016;76(4):867–889. doi: 10.1007/s00453-015-0087-6.
- 8. Chen Z-Z, Harada Y, Wang L. A new 2-approximation algorithm for rSPR distance. In: Cai Z, Daescu O, Li M, editors. Bioinformatics Research and Applications. Cham: Springer International Publishing; 2017. pp. 128–139.
- 9. Chen, Z.-Z., Machida, E., Wang, L.: A cubic-time 2-approximation algorithm for rSPR distance. arXiv preprint arXiv:1609.04029 (2016)
- 10. Chen, Z.-Z., Machida, E., Wang, L.: An improved approximation algorithm for rSPR distance. In: International Computing and Combinatorics Conference, pp. 468–479. Springer (2016)
- 11. Darwin, C.: Notebook B: Transmutation of species (1837?-1838). In: John van Wyhe: The Complete Work of Charles Darwin Online (2002). http://darwin-online.org.uk/
- 12. Gascuel O, editor. Mathematics of Evolution and Phylogeny. Oxford: Oxford University Press Inc.; 2005.
- 13. Goemans MX, Williamson DP. The primal-dual method for approximation algorithms and its application to network design problems. In: Hochbaum DS, editor. Approximation Algorithms for NP-hard Problems. Boston: PWS Publishing Co.; 1997. pp. 144–191.
- 14. Harel, D.: A linear time algorithm for the lowest common ancestors problem. In: Proceedings of the 21st Annual Symposium on Foundations of Computer Science (FOCS), pp. 308–319 (1980)
- 15. Harel D, Tarjan RE. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 1984;13(2):338–355. doi: 10.1137/0213024.
- 16. Hein J, Jiang T, Wang L, Zhang K. On the complexity of comparing evolutionary trees. Discret. Appl. Math. 1996;71(1–3):153–169.
- 17. Huson D, Rupp R, Scornavacca C. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge: Cambridge University Press; 2010.
- 18. Nakhleh L. Evolutionary phylogenetic networks: models and issues. In: Heath L, Ramakrishnan N, editors. The Problem Solving Handbook for Computational Biology and Bioinformatics. New York: Springer; 2009.
- 19. Olver, N., Schalekamp, F., Stougie, L., van Zuylen, A.: Implementation of the MAF algorithm and compact formulation. Available at http://nolver.net/maf and http://fransschalekamp.com/MAF (2018)
- 20. Rodrigues, E.M.: Algoritmos para Comparação de Árvores Filogenéticas e o Problema dos Pontos de Recombinação. PhD thesis, University of São Paulo, Brazil (2003). Chapter 7, available at http://www.ime.usp.br/~estela/studies/tese-traducao-cp7.ps.gz
- 21. Rodrigues, E.M., Sagot, M.-F., Wakabayashi, Y.: Some approximation results for the maximum agreement forest problem. In: Proceedings of APPROX-RANDOM, Lecture Notes in Computer Science, pp. 159–169. Springer (2001)
- 22. Rodrigues EM, Sagot M-F, Wakabayashi Y. The maximum agreement forest problem: approximation algorithms and computational experiments. Theor. Comput. Sci. 2007;374(1–3):91–110. doi: 10.1016/j.tcs.2006.12.011.
- 23. Schalekamp, F., van Zuylen, A., van der Ster, S.: A duality based 2-approximation algorithm for maximum agreement forest. In: Proceedings of the 43rd International Colloquium on Automata, Languages, and Programming (ICALP), Vol. 55 of LIPIcs, pp. 70:1–70:14. Leibniz-Zentrum für Informatik (2016)
- 24. Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press; 2003.
- 25. Shi F, Feng Q, You J, Wang J. Improved approximation algorithm for maximum agreement forest of two rooted binary phylogenetic trees. J. Comb. Optim. 2015;32(1):111–143. doi: 10.1007/s10878-015-9921-7.
- 26. Steel M, Warnow T. Kaikoura tree theorems: Computing the maximum agreement subtree. Inf. Process. Lett. 1993;48(2):77–82. doi: 10.1016/0020-0190(93)90181-8.
- 27. van Iersel L, Kelk S, Lekic N, Stougie L. Approximation algorithms for nonbinary agreement forests. SIAM J. Discret. Math. 2014;28(1):49–66. doi: 10.1137/120903567.
- 28. Whidden C, Beiko RG, Zeh N. Fixed-parameter algorithms for maximum agreement forests. SIAM J. Comput. 2013;42(4):1431–1466. doi: 10.1137/110845045.
- 29. Whidden C, Matsen FA. Calculating the unrooted subtree prune-and-regraft distance. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019;16(3):898–911. doi: 10.1109/TCBB.2018.2802911.
- 30. Whidden, C., Zeh, N.: A unifying view on approximation and FPT of agreement forests. In: Algorithms in Bioinformatics. Lecture Notes in Computer Science, Vol. 5724, pp. 390–402. Springer, Berlin Heidelberg (2009)
- 31. Wu Y. A practical method for exact computation of subtree prune and regraft distance. Bioinformatics. 2009;25(2):190–196. doi: 10.1093/bioinformatics/btn606.
- 32. Wu, Y., Wang, J.: Fast computation of the exact hybridization number of two phylogenetic trees. In: Bioinformatics Research and Applications. Lecture Notes in Computer Science, Vol. 6053, pp. 203–214. Springer, Berlin Heidelberg (2010)








