Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees

Yufeng Wu

doi:10.1093/bioinformatics/btq198

Bioinformatics. 2010 Jun 1;26(12):i140–i148. doi: 10.1093/bioinformatics/btq198

PMCID: PMC2881383 PMID: 20529899

Abstract

Motivation: Reticulate network is a model for displaying and quantifying the effects of complex reticulate processes on the evolutionary history of species undergoing reticulate evolution. A central computational problem on reticulate networks is: given a set of phylogenetic trees (each for some region of the genomes), reconstruct the most parsimonious reticulate network (called the minimum reticulate network) that combines the topological information contained in the given trees. This problem is well-known to be NP-hard. Thus, existing approaches for this problem either work with only two input trees or make simplifying topological assumptions.

Results: We present novel results on the minimum reticulate network problem. Unlike existing approaches, we address the fully general problem: there is no restriction on the number of trees that are input, and there is no restriction on the form of the allowed reticulate network. We present lower and upper bounds on the minimum number of reticulation events in the minimum reticulate network (and infer an approximately parsimonious reticulate network). A program called PIRN implements these methods, which also outputs a graphical representation of the inferred network. Empirical results on simulated and biological data show that our methods are practical for a wide range of data. More importantly, the lower and upper bounds match for many datasets (especially when the number of trees is small or reticulation level is low), and this allows us to solve the minimum reticulate network problem exactly for these datasets.

Availability: A software tool, PIRN, is available for download from the web page: http://www.engr.uconn.edu/~ywu.

Contact: ywu@engr.uconn.edu

Supplementary information: Supplementary data is available at Bioinformatics online.

1 INTRODUCTION

Reticulate evolution, a form of evolution with hybridization and genetic exchanges between two species, is common in many organisms: bacteria, plants, fish, amphibians and many others. For better understanding of reticulate evolution, several reticulate evolutionary models have been proposed and actively studied to address various reticulate processes, such as hybrid speciation, lateral gene transfer and recombination. Since most of these models are in the forms of networks, we call them reticulate networks¹. We refer the readers to (Huson, 2007; Huson and Bryant, 2006; Nakhleh, 2009; Semple, 2007) for surveys of different reticulate network models.

The key computational problem related to these models is the inference of reticulate networks. Depending on the types of biological processes involved, data for network inference may be in different forms, such as phylogenetic trees for some short genomic regions (called genes in this article) or aligned DNA sequences. In this article, we focus on inferring reticulate networks from a set of correlated phylogenetic trees. Here is the biological motivation for our problem. Suppose multiple phylogenetic trees (called gene trees in this article) are reconstructed, each from some gene for these species. Due to reticulate evolution, different genomic regions (say genes) may be inherited from different ancestral genomes and their evolutionary histories may not be the same (but are still related). Thus, these trees are correlated but not identical. No single phylogenetic tree can faithfully model the evolution of the species, and a more complex network model (i.e. reticulate network as studied in e.g. Baroni et al., 2004; Huson, 2007; Huson et al., 2005; Nakhleh et al., 2004; Semple, 2007) is needed.

Imagine we are given a set of ‘true’ gene trees and a ‘true’ reticulate network that models the evolutionary history of these genes. The network can be considered as a compact representation of these gene trees in the sense that one should be able to ‘trace’ a gene tree within the network. We say such a gene tree is displayed in the network. This motivates a natural problem, which is called ‘the holy grail of reticulate evolution’ in (Nakhleh, 2009): given a set of gene trees, reconstruct a reticulate network that displays every given gene tree. Such an inferred network reveals important correlation of evolutionary history of multiple genes. Since there exist many such networks, a common formulation is to find the one with the fewest reticulation events. Such a network is called the minimum reticulate network. The central computational problem on reticulate networks, the minimum reticulate network problem, is: given a set of gene trees, reconstruct the minimum reticulate network that displays these gene trees. This formulation may be reasonable when reticulation is believed to be rare.

In general, this problem is computationally challenging: even the case with only two gene trees is known to be NP-complete (Bordewich and Semple, 2004, 2007; Hein et al., 1996). There are several existing approaches for reconstructing the exact minimum reticulate networks when there are only two gene trees (Bordewich et al., 2007; Linz and Semple, 2009; Wu, 2009; Wu and Wang, 2010). Clearly, restricting to just two gene trees is a big limitation: more gene trees will be more informative to phylogenetic inference, and DNA sequences of many genes are available. Alternatively, there are also a number of approaches making simplifications to the reticulate network model, e.g. by imposing additional topological constraints on reticulate networks (Gusfield, 2005; Huson and Klopper, 2007; Huson et al., 2009; Nakhleh et al., 2005) or working with small-scale tree topological features (Huson et al., 2005; Huson and Klopper, 2007; van Iersel et al., 2008). Such simplification often leads to significantly faster approaches. However, it is sometimes unclear how biologically meaningful these added topological constraints are. Even in the case where additional simplifications are reasonable, one may still want to compare with the unconstrained minimum reticulate networks.

Contributions: In this article, we present new approaches for the minimum reticulate network problem with three or more gene trees for unconstrained, general, reticulate networks (e.g. without needing to assume that the network has some restricted form, such as being a galled-tree or galled network). Thus our work is more general than some previous approaches (Huson and Klopper, 2007; Huson et al., 2005; Huson et al., 2009). In particular, we develop a lower bound (the RH bound) and an upper bound (the SIT bound) for the minimum reticulate network problem with multiple gene trees. We show the correctness of the bounds. We give a closed-form formula for the RH bound for the case of three gene trees. We also show how to compute these bounds efficiently in practice using integer linear programming (ILP). Practical results on simulated and real biological data show that the bounds can be computed for a wide range of data. Moreover, the lower and upper bounds are often close, especially when the number of trees is small or reticulation level is relatively low. In fact, for many simulated datasets of this type, the lower and upper bounds often match, which means our methods can reconstruct the exact minimum reticulate networks for these datasets. We also show the RH bound clearly outperforms a simple bound.

2 DEFINITIONS AND BACKGROUND

Throughout this article, we assume trees are rooted. A phylogenetic tree is rooted and leaf-labeled by a set of species (called taxa). A leaf of a phylogenetic tree corresponds to an extant species. An internal vertex corresponds to a speciation event. In-degrees of all vertices (also called nodes), except the root, in a tree are one, while out-degrees are zero for leaves and at least two for internal nodes. A binary phylogenetic tree requires out-degrees of internal nodes to be two. A non-binary phylogenetic tree contains nodes with out-degree of three or more. Many existing phylogenetic methods assume binary phylogenetic trees, although sometimes only non-binary trees can be reconstructed in practice.

Our definition of reticulate networks is similar to that in (Huson, 2007; Huson et al., 2005; Semple, 2007) (Hallett and Lagergren, 2001; Nakhleh, 2009; Nakhleh et al., 2004). A reticulate network (sometimes simply network) is a directed acyclic graph with vertex set V and edge set E, where some nodes in V are labeled by taxa. V can be partitioned into V_T (called tree nodes) and V_R (called reticulation nodes). E can be partitioned into E_T (called tree edges) and E_R (called reticulation edges). Moreover,

No node with total (in plus out) degree of two is allowed. Except the root, each node must have at least one incoming edge.
V_R contains nodes whose in-degrees are two or more. V_T contains nodes whose in-degrees are one.
E_R contains edges that go into some reticulation nodes. E_T contains edges that go into some tree nodes.
A node is labeled by some taxon iff its out-degree is zero. This helps to ensure labeled nodes correspond to extant species and remove some redundancy in the network.

In addition, we have one more restriction:

R₁ For a reticulate network 𝒩, when only one of the incoming edges of each reticulation node is kept and the rest are deleted, we always derive a tree T′.

We first consider the derived tree T′ (that is embedded in 𝒩) as in restriction R₁. When we recursively remove non-labeled leaves and contract edges to remove degree-two nodes of T′ (called cleanup), we obtain a phylogenetic tree T (for the same set of species as in 𝒩). Now suppose we are given a phylogenetic tree T. We say T is displayed in 𝒩 when we can obtain an induced tree T′ from 𝒩 by properly choosing a single edge to keep at each reticulation node so that T′ is topologically equivalent to T after cleanup. We denote the induced T′ (if it exists) as T_𝒩. See Figure 1 for an illustration.
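The notion of a displayed tree can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the PIRN implementation: the network is a hypothetical dictionary mapping each node to its list of parents, the node names are invented, and each displayed tree is reported as its set of clusters (leaf sets below each node), which identifies a rooted phylogenetic tree. Working with clusters performs the cleanup implicitly, because unlabeled dead ends contribute an empty cluster and degree-two nodes repeat the cluster of their child.

```python
from itertools import product


def clusters_of(node, children, taxa, out):
    """Collect the non-empty leaf sets (clusters) below `node` into `out`."""
    if node in taxa:
        cluster = frozenset([node])
    else:
        cluster = frozenset().union(
            *(clusters_of(c, children, taxa, out) for c in children[node])
        )
    if cluster:  # unlabeled dead ends give an empty cluster and are dropped
        out.add(cluster)
    return cluster


def displayed_trees(parents):
    """Return every displayed tree as a frozenset of clusters.

    `parents` maps each node to the list of its parents (root -> []).
    Exhaustive enumeration: exponential in the number of reticulation nodes.
    """
    root = next(v for v, ps in parents.items() if not ps)
    internal = {p for ps in parents.values() for p in ps}
    taxa = set(parents) - internal  # out-degree zero means labeled by a taxon
    ret_nodes = [v for v, ps in parents.items() if len(ps) >= 2]
    trees = set()
    # Keep exactly one incoming edge at each reticulation node.
    for choice in product(*(parents[v] for v in ret_nodes)):
        kept = dict(zip(ret_nodes, choice))
        children = {v: [] for v in parents}
        for v, ps in parents.items():
            if ps:
                children[kept.get(v, ps[0])].append(v)
        out = set()
        clusters_of(root, children, taxa, out)
        trees.add(frozenset(out))
    return trees


# Hypothetical network: reticulation node h (parents p and q) above taxon b.
example = {"r": [], "p": ["r"], "q": ["r"], "h": ["p", "q"],
           "a": ["p"], "b": ["h"], "c": ["q"]}
trees = displayed_trees(example)
```

In this toy network, keeping the edge from p yields the tree ((a,b),c) and keeping the edge from q yields (a,(b,c)), so both trees are displayed.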

Fig. 1. — An illustration of a reticulate network with three reticulation events for three trees. Each tree is displayed in the network: the tree can be obtained by keeping one incoming edge at each reticulation node.

Note restriction R₁ implies the network is acyclic. Biologically, reticulate networks often forbid cycles. This is because many reticulation events need to be properly time-ordered. Thus, we focus on acyclic reticulate networks in this paper. That is, when we refer to a reticulate network, we mean an acyclic reticulate network (unless otherwise stated).

There are subtle issues related to networks with nodes whose out-degrees are more than two (called non-binary nodes). See the Supplementary Materials for more discussion. Note that we do not require that in-degrees of reticulation nodes are precisely two, as was imposed in (Huson et al., 2005). We also assume the root of each input tree T_i is attached to an outgroup species o. The root of a reticulate network for these trees is also attached to o.

We define the reticulation number of a reticulation node as its in-degree minus one. For a reticulate network 𝒩, we define the reticulation number (denoted as R_𝒩) as the summation of the reticulation number of each reticulation node in the network. Sometimes R_𝒩 is also called the number of reticulation events in 𝒩. For the reticulate network in Figure 1, the reticulation node 2 has three entering edges, and the other reticulation node 1 has two entering edges. Thus, R_𝒩=(3−1)+(2−1)=3. Our definition of reticulation number is similar to that of the hybridization number in (Bordewich et al., 2007; Semple, 2007).
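As a quick check of this definition, the following sketch computes R_𝒩 from in-degrees. The parent-list dictionary and the node names are assumptions made for illustration; only the two in-degrees (three and two) follow the description of Figure 1.

```python
def reticulation_number(parents):
    """Sum of (in-degree - 1) over all nodes with in-degree of two or more.

    `parents` maps each node to the list of its parents.
    """
    return sum(len(ps) - 1 for ps in parents.values() if len(ps) >= 2)


# Hypothetical names; "ret2" has three entering edges and "ret1" has two.
example = {
    "ret2": ["x", "y", "z"],
    "ret1": ["u", "w"],
    "leaf": ["ret1"],  # a tree node (in-degree one) contributes nothing
}
r_n = reticulation_number(example)  # (3 - 1) + (2 - 1)
```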

Suppose we are given a set of K gene trees T₁, T₂, …, T_K (for the same set of species). The minimum reticulate network 𝒩_min for T₁, T₂,…, T_K is a reticulate network 𝒩 that displays each T_i and R_𝒩 is minimized among all possible 𝒩. We call R_{𝒩_min} the reticulation number of T₁,…, T_K, which is denoted as R(T₁, T₂,…, T_K). For the special case of K=2, we call D_{T_i,T_j}=R(T_i, T_j) the reticulation distance between two trees T_i and T_j. Now we formulate the central problem in this article.

The general minimum reticulate network (GMRN) problem: Given a set of phylogenetic trees T={T₁,…, T_K}, reconstruct the minimum reticulate network 𝒩_min for T₁, T₂,…, T_K. This formulation is based on parsimony, and may be justified when reticulation is relatively rare in the evolutionary history.

One should note that the GMRN problem can be further specified by the type of input trees. There are two types of input phylogenetic trees: binary or non-binary. For a non-binary tree T, we say T is displayed in 𝒩 if some refinement of T (i.e. splitting the non-binary nodes in T in some way to make T binary) is displayed in 𝒩. When input trees are binary, network reconstruction may be easier. For simplicity, in the following, the input trees are assumed to be binary, unless otherwise stated. We remark that some of our results are applicable to non-binary input trees: the RH bound in Section 3 clearly works for non-binary trees too, and the high-level approach of the SIT bound may also be applicable to non-binary trees.

Previous work on the GMRN problem: There is an exact method for the K=2 case of the minimum reticulate network problem (Bordewich et al., 2007), although this special case is known to be NP-complete (Bordewich and Semple, 2007). It is useful to note that when we allow cycles in the network, the minimum reticulate network problem is equivalent to the rooted subtree prune and regraft (rSPR) distance problem. The rSPR distance problem, another NP-complete problem (Bordewich and Semple, 2004; Hein et al., 1996), is well known to be closely related to reticulate evolution. Previously, we showed that the rSPR distance can often be practically computed for many moderately sized trees (Wu, 2009). We give more background to the rSPR distance problem in the Supplementary Material. It was shown in (Baroni et al., 2005) that the reticulation number (called hybridization number in (Baroni et al., 2005)) for trees T₁ and T₂ is closely related to the rSPR distance between T₁ and T₂, although the two values are not always equal. The main difference between the rSPR distance and the reticulation number is that the latter forbids cycles and thus can be more realistic biologically. Recently, we have extended our previous approach in (Wu, 2009) to allow computing the pairwise reticulation distance between two rooted binary trees (Wu and Wang, 2010). Although the worst case running time of the practical methods in (Bordewich et al., 2007; Wu, 2009; Wu and Wang, 2010) is exponential, these methods may work reasonably well in practice. As shown in (Bordewich et al., 2007; Wu, 2009; Wu and Wang, 2010), exact reticulation number (with or without cycles) can be computed for two quite different trees with 20 or more leaves. Thus, although intractable theoretically, the two-tree minimum reticulate network problem can be solved in practice if the size of two trees is moderate or the two trees are not very different topologically.

It becomes more computationally challenging when there are three or more gene trees. There are currently no known practical methods for either computing the reticulation number R(T₁,…, T_K) or reconstructing 𝒩_min for trees T₁,…, T_K when K ≥ 3. Often approximation is made. A common approach is to impose structural constraints to limit the complexity of the network (Gusfield, 2005; Huson and Klopper, 2007; Huson et al., 2009; Nakhleh et al., 2005). Although these approaches are theoretically interesting and have been shown to work for some biological data, it is still very desirable to explore the reconstruction of reticulated networks displaying multiple complete gene trees without additional structural constraints.

3 A LOWER BOUND

We now focus on developing a lower bound on R(T₁,…, T_K). The lower bound helps to better quantify the range of R(T₁,…, T_K).

Recall that several exact methods (Bordewich et al., 2007; Wu and Wang, 2010) exist for computing the pairwise reticulation distance D_{T_i,T_j} for two trees T_i, T_j, which are practical for many pairs of trees of moderate sizes. Now suppose that we compute D_{T_i,T_j} for each pair of trees T_i and T_j, using the methods (Bordewich et al., 2007; Wu and Wang, 2010). We store these pairwise distances in a matrix D, where D[i, j]=D_{T_i,T_j}.² Admittedly, computing D[i, j] for all T_i and T_j can be slow when K and/or the size of trees are large (unless T_i and T_j are very similar). One should note that the GMRN problem is much more complex, and thus, calculation of D[i, j] is justifiable computationally. This leads to the following question: can we use the pairwise reticulation distances D to estimate R(T₁,…, T_K) when K ≥ 3?

Clearly, the largest value D[i₀, j₀] in D is necessarily a lower bound of R(T₁,…, T_K) when K ≥ 3: a reticulate network displaying all trees certainly also displays trees T_i₀ and T_j₀ and thus is a reticulate network for T_i₀ and T_j₀. We now show a stronger lower bound (called RH bound) based on D values.

Here is the high-level idea. The pairwise distance D_{T_i,T_j} specifies how similar trees T_i and T_j are: the larger D_{T_i,T_j} is, the more different T_i and T_j are. Recall that if tree T_i is displayed in a network 𝒩, we should be able to derive T_i by keeping only one incoming edge at each reticulation node and performing cleanup. The choice (called display choice for T_i) of keeping which incoming edge at each reticulation node for a tree T_i may not be unique. However, clearly if one makes the same display choices for T_i and T_j when displaying T_i and T_j in 𝒩, then T_i and T_j will be identical. More generally, the more similar the display choices for trees T_i and T_j, the closer T_i and T_j will be. Thus, to allow an 𝒩 for trees with pairwise distances D[i, j]=D_{T_i,T_j}, we need to make display choices for the trees different enough: if the display choices for T_i and T_j are too similar, it will lead to contradiction when D[i, j] suggests T_i and T_j are more different. In the following, a rigorous analysis based on this idea allows us to decide whether an 𝒩 with a specific number (say r) of reticulation events is feasible.

To fix ideas, we first consider the situation where each reticulation node in 𝒩 has in-degree of two. We will remove this assumption in a moment. Suppose that 𝒩 has r reticulation nodes (each with two incoming edges). For each reticulation node, we arbitrarily call one incoming edge the left edge and the other the right edge. We encode the left edge as 0 and the right edge as 1, and call these two edges 0-edge and 1-edge. Recall that to display a tree, we need to keep exactly one of these two edges. Since a tree T_i is displayed in 𝒩, we create a binary vector v_i[1 … r] to represent which incoming edge T_i keeps at each reticulation node V_j in 𝒩. Here, v_i[j] is 0 if T_i keeps the 0-edge at reticulation node V_j, and 1 if T_i keeps the 1-edge at V_j. We call v_i the display vector for T_i. For example, in Figure 1, consider the reticulation node labeled as 1, and we assign the left/right edges as shown. T₁ and T₃ keep the left edge at this node, while T₂ keeps the right edge. Thus, v₁ and v₃ have value 0, and v₂ has value 1 at the node.

For a given T_i and a network 𝒩, v_i can always be constructed (at least conceptually) based on how T_i is displayed in 𝒩. Note that if there are multiple choices to display T_i, we simply pick an arbitrary one and this does not affect our solution. We define D_h[v_i, v_j] as the Hamming distance between two display vectors v_i and v_j. Here, v_i and v_j (and thus D_h[v_i, v_j]) depend on 𝒩. To simplify notations, we do not explicitly include 𝒩 in their definitions. Lemma 3.1 is crucial to our lower bound.

Lemma 3.1. —

For any two trees T_i and T_j displayed in a reticulate network 𝒩, D_h[v_i, v_j] ≥ D[i, j].

Proof. —

For contradiction, assume D_h[v_i, v_j]<D[i, j]. Thus T_i and T_j make different choices at fewer than D[i, j] reticulation nodes of 𝒩. Imagine we remove from 𝒩 every incoming edge at a reticulation node that is kept by neither T_i nor T_j. This produces a network with fewer than D[i, j] reticulation nodes, because every reticulation node where v_i and v_j match (and thus T_i and T_j keep the same incoming edge) now has only one incoming edge and is no longer a reticulation node in the reduced network. The reduced network still displays both T_i and T_j, which contradicts the fact that D[i, j] is the reticulation distance between T_i and T_j. ▪

Lemma 3.1 implies that if a network 𝒩 with r reticulation events exists, then we should be able to find binary vectors v_i (of length r) for each tree T_i such that D_h[v_i, v_j] ≥ D[i, j] for any two such vectors v_i and v_j. On the other hand, if such vectors do not exist, we know that at least r+1 reticulation events are needed (and the value r+1 is a lower bound on R(T₁,…, T_K)). We can illustrate this formulation more intuitively using a binary hypercube. On a hypercube with r binary bits per node, we want to know whether we can pick K points v₁ … v_K that are far enough apart that the Hamming distance between v_i and v_j is at least D[i, j] for each i and j. One should note this is not always feasible due to the limited size of the hypercube. Formally,

The binary hypercube point placement problem: Can we choose K nodes v₁,…, v_K from an r-dimensional binary hypercube so that D_h[v_i, v_j] ≥ D[i, j] for each pair of v_i and v_j?

A lower bound on R(T₁,…, T_K) based on the Hypercube Point Placement problem is to find (possibly in a binary search style) the smallest integer r such that the hypercube point placement problem has a solution. Such r is necessarily a lower bound on R(T₁,…, T_K). We call this lower bound reticulation on hypercube bound (or RH bound).
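For very small instances, the RH bound can be illustrated by exhaustive search instead of a dedicated solver. The sketch below is an assumption-laden illustration, not the method used in the paper: it fixes v₁ to the all-0 vector, tries every remaining placement by backtracking, and increases r linearly (rather than by binary search) starting from the largest entry of D. The function names are hypothetical, and the running time is exponential in K and r.

```python
from itertools import product


def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def placement_exists(D, r):
    """Can K points be placed on the r-cube with D_h[v_i, v_j] >= D[i][j]?"""
    K = len(D)

    def extend(chosen):
        i = len(chosen)
        if i == K:
            return True
        for cand in product((0, 1), repeat=r):
            if all(hamming(cand, chosen[j]) >= D[i][j] for j in range(i)):
                if extend(chosen + [cand]):
                    return True
        return False

    # By symmetry of the hypercube, the first point can be the all-0 vector.
    return extend([(0,) * r])


def rh_bound(D):
    """Smallest r for which the point placement problem is feasible."""
    r = max(max(row) for row in D)  # the largest pairwise distance
    while not placement_exists(D, r):
        r += 1
    return r


# Hypothetical pairwise reticulation distances for three trees.
D = [[0, 3, 2],
     [3, 0, 2],
     [2, 2, 0]]
bound = rh_bound(D)
```

For this matrix no placement exists on the 3-dimensional hypercube, but the vectors 0000, 1110 and 0011 work for r=4, so the sketch returns 4, one more than the largest pairwise distance.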

We do not know a polynomial-time algorithm for the binary hypercube point placement problem with more than three trees. When K=3, however, the RH bound has a simple analytical form (see Section 3.2). To develop a practical method for the general case, we use integer linear programming (ILP) to solve this problem. We create a binary variable V_i,k to represent the coordinates for point v_i. That is, the coordinates of v_i on the hypercube are specified by V_i,1 … V_i,r. Without loss of generality, we set V_1,k=0 for all 1≤k≤r. We create a binary variable M_i,j,k for each v_i, v_j and position k (1≤k≤r) to indicate whether two vectors v_i and v_j match at position k. M_i,j,k=1 if V_i,k=V_j,k, and 0 otherwise. Now, we have the following formulation.

Optimization goal: none (since this is a feasibility problem)
Subject to

(1) M_{i,j,k} + V_{i,k} + V_{j,k} ≥ 1, for each v_i, v_j, where i<j and 1≤k≤r.
(2) M_{i,j,k} − V_{i,k} − V_{j,k} ≥ −1, for each v_i, v_j, where i<j and 1≤k≤r.
(3) ∑_{k=1}^{r} M_{i,j,k} ≤ r − D[i, j], for each v_i, v_j, where i<j.

For each 1≤i≤K and 1≤k≤r, there is a binary variable V_{i, k}.
For each 1≤i<j≤K, and 1≤k≤r, there is a binary variable M_i,j,k.

Constraint 1 says if both V_i,k and V_j,k are 0, M_i,j,k is 1 (i.e. matched). Similarly, constraint 2 says if both V_i,k and V_j,k are 1, M_i,j,k is 1 (i.e. matched). Constraint 3 imposes the pairwise Hamming distance requirement. Our experience shows that the ILP is practical to solve for all datasets we simulated (see Section 5).
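To make the three constraints concrete, the sketch below checks a candidate assignment against them without calling any ILP solver. The helper names and the dictionary layout of M are assumptions for illustration. `tightest_M` sets M_{i,j,k} to 1 exactly when the two coordinates match, which is the smallest value the first two constraints allow and therefore the most favorable choice for the third.

```python
def tightest_M(V):
    """M[(i, j, k)] = 1 if V[i][k] == V[j][k], else 0, for all i < j."""
    K, r = len(V), len(V[0])
    return {(i, j, k): int(V[i][k] == V[j][k])
            for i in range(K) for j in range(i + 1, K) for k in range(r)}


def satisfies_constraints(V, M, D):
    """Check the three ILP constraints for 0/1 assignments V and M."""
    K, r = len(V), len(V[0])
    for i in range(K):
        for j in range(i + 1, K):
            for k in range(r):
                # Constraint 1: both coordinates 0 forces a match.
                if M[i, j, k] + V[i][k] + V[j][k] < 1:
                    return False
                # Constraint 2: both coordinates 1 forces a match.
                if M[i, j, k] - V[i][k] - V[j][k] < -1:
                    return False
            # Constraint 3: at most r - D[i][j] matching positions.
            if sum(M[i, j, k] for k in range(r)) > r - D[i][j]:
                return False
    return True


# Hypothetical distances and two candidate placements (v_1 fixed to all-0).
D = [[0, 3, 2], [3, 0, 2], [2, 2, 0]]
good = [(0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1)]
bad = [(0, 0, 0), (1, 1, 1), (0, 1, 1)]
ok_good = satisfies_constraints(good, tightest_M(good), D)
ok_bad = satisfies_constraints(bad, tightest_M(bad), D)
```

The first assignment is feasible with r=4, while the second violates constraint 3 for the pair (v₂, v₃), which match at two of three positions although D requires a Hamming distance of at least two.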

3.1 Networks with in-degree of three or more

We now resolve the remaining issue where some reticulation nodes have in-degree of three or more. In this section, we call a reticulation node ‘refined’ if its in-degree is two, and ‘unrefined’ if its in-degree is at least three.

Here, we can no longer represent a reticulation node as a binary value, as done previously. So we extend our definitions of display vectors v_i to allow v_i to be non-binary. That is, if there are d incoming edges at a reticulation node, we allow v_i to be from 0 to d−1, where the value indicates which one of the d branches T_i keeps at this node. The incoming edges are numbered starting from zero on the left and to the right with increment of one. We still let D_h[v_i, v_j] be the Hamming distance between vectors v_i and v_j. In this general case, Lemma 3.1 still holds for non-binary vectors v_i and v_j. To see this, we prune any incoming edge at reticulation nodes if it is not chosen by T_i and T_j. Then each remaining reticulation node has only two incoming edges (since we only have two trees). Thus, there are D_h[v_i, v_j] reticulation events in this reduced network, and the rest of proof for Lemma 3.1 follows.

We now show that it is not necessary to consider unrefined reticulation nodes in the sense that if a network 𝒩 with unrefined reticulation nodes satisfies pairwise distances D, then there exists another network 𝒩′ that has only refined reticulation nodes and gives the binary vectors v_i satisfying the pairwise distance constraints of D. That is, if we cannot find a network with only refined reticulation nodes, we also cannot find a network with unrefined reticulation nodes and the same reticulation number.

To see this property, we consider a network 𝒩 with one reticulation node q with d ≥ 3 incoming edges. Then we transform 𝒩 to 𝒩′ by replacing q with q₁,…, q_{d−1}, where each q_i is a reticulation node with in-degree of two. Note that we do not have to ensure 𝒩 and 𝒩′ are equivalent: we only need to show 𝒩′ gives a solution to the Binary Hypercube Point Placement problem. Clearly, 𝒩 and 𝒩′ have the same reticulation number (although vectors for 𝒩′ are longer). Now suppose tree T_i keeps edge j at q (where 0≤j≤d−1). We then keep edge 1 at q_j in 𝒩′ if j ≥ 1, and keep edge 0 at all other q_{j′} (where j′≠j); if j=0, we keep edge 0 at every q_{j′}. In other words, we create a mapping of the display vectors v_i from 𝒩 to 𝒩′ for each T_i. Note that such mapping ensures that if two trees keep the same edge at q, they will keep the same edges at q₁,…, q_{d−1} in 𝒩′; otherwise, they will keep at least one different incoming edge at q₁,…, q_{d−1} in 𝒩′. In either case, if the pairwise distance constraints are satisfied in 𝒩, they are also satisfied in 𝒩′. So, if we cannot find display vectors for networks with refined reticulation nodes only, we also cannot find display vectors for networks allowing unrefined reticulation nodes. In other words, the RH bound holds for networks with unrefined nodes.
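The mapping of display choices from an unrefined node q to the refined nodes can be sketched as follows. The function name is hypothetical; the rule itself follows the text: choice 0 maps to all zeros, and choice j ≥ 1 maps to a single 1 at the position of q_j.

```python
def refine_choice(j, d):
    """Map the choice of edge j (0 <= j <= d-1) at a node with in-degree d
    to d-1 binary choices, one per refined node q_1, ..., q_{d-1}."""
    bits = [0] * (d - 1)
    if j >= 1:
        bits[j - 1] = 1  # keep edge 1 at q_j, edge 0 everywhere else
    return bits


def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


d = 4  # hypothetical in-degree of the unrefined node
codes = [refine_choice(j, d) for j in range(d)]
```

Two trees that keep the same edge at q receive identical binary choices, and two trees that keep different edges differ in at least one position, so a Hamming distance contribution of one at q is never lost.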

Remark. The RH lower bound is still applicable when the input trees are non-binary, as long as the pairwise reticulation distances are obtained for the non-binary trees. These are easy to verify and we omit the details due to the lack of space.

Remark. A commonly used concept in reticulate networks is the so-called maximum agreement forest (MAF). A brief description of MAF is given in the Supplementary Material. Also see e.g. (Semple, 2007) for more details. It is easy to see that the size of a MAF of multiple trees is a lower bound on R(T₁,…, T_K). However, experience shows that the RH bound is often higher than the MAF bound (see Section 5).

3.2 Special case of three trees

The special case of K=3 allows us to study the RH bound in an analytical way. We let d₁, d₂ and d₃ be the pairwise reticulation distances of the three trees, where d₁ ≥ d₂ ≥ d₃. Proposition 3.2 shows the RH bound for three trees in an analytical form.

Proposition 3.2. —

The RH lower bound for three trees T₁, T₂ and T₃ is equal to ⌈(d₁+d₂+d₃)/2⌉ if d₂+d₃>d₁, and equal to d₁ if d₂+d₃≤d₁.

Proof. —

We first consider the case d₂+d₃>d₁. Clearly, the RH bound is at least d₁, which is the minimum size of the hypercube. Now we investigate whether there exists a reticulate network with d₁+e reticulation nodes for these three trees. Without loss of generality, let T₁ be the input tree where d₁=D_T₁,T₂ and d₂=D_T₁,T₃, and the display vector v₁ (for T₁) is fixed to be all-0. Then, the display vector v₂ for T₂ must have at least d₁ positions with value 1 (and thus v₂ has no more than e positions with value 0). Similarly, the display vector v₃ must have at least d₂ positions with value 1 (and thus v₃ has no more than d₁+e−d₂ positions with value 0). Note that D_h[v₂, v₃] ≥ d₃. We claim that D_h[v₂, v₃]≤d₁+2e−d₂. This is because the Hamming distance between v₂ and v₃ counts the positions where v₂ has value 0 and v₃ has value 1 (or vice versa). Since the number of 0s in v₂ is no more than e, there are at most e positions where v₂ has 0 and v₃ has 1. Similarly, there are at most d₁+e−d₂ positions where v₂ has 1 and v₃ has 0. Thus, D_h[v₂, v₃]≤e+d₁+e−d₂=d₁+2e − d₂. Also note that we can always construct v₂ and v₃ so that D_h[v₂, v₃]=d₁+2e−d₂. See Figure 2 for an illustration.

Therefore, if d₁+2e−d₂ < d₃, we cannot find three vectors v₁, v₂ and v₃ satisfying the pairwise distances D and thus R(T₁, T₂, T₃) ≥ d₁+e+1 in this case. The largest such e is equal to ⌈(d₂+d₃−d₁)/2⌉−1 (which is non-negative since d₂+d₃ > d₁). The RH bound is then d₁+⌈(d₂+d₃−d₁)/2⌉ = ⌈(d₁+d₂+d₃)/2⌉.

The case when d₂+d₃≤d₁ is simple. We create three vectors of d₁ bits: v₁ is an all-0 vector, v₂ is an all-1 vector and v₃ contains d₂ 1s. It is easy to verify these three vectors satisfy all three pairwise distance constraints. ▪

Fig. 2. — Vectors v₁, v₂ and v₃ (listed from top to bottom) that maximize the Hamming distance between v₂ and v₃. v₁ is all-0, while the suffix of v₂ and prefix of v₃ are zeros.

In practice, it is very likely that d₂+d₃ > d₁. In this case, ⌈(d₁+d₂+d₃)/2⌉ > d₁, where d₁ is a trivial lower bound.
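The three-tree case can be evaluated directly. The sketch below follows the counting argument in the proof: a hypercube of dimension d₁+e is feasible exactly when d₁+2e−d₂ ≥ d₃, so the smallest feasible dimension is d₁ plus the smallest such non-negative integer e. The function name is an assumption, and the distances may be passed in any order.

```python
def rh_bound_three(a, b, c):
    """RH bound for three trees from their pairwise reticulation distances."""
    d1, d2, d3 = sorted((a, b, c), reverse=True)  # d1 >= d2 >= d3
    if d2 + d3 > d1:
        # Smallest e with d1 + 2e - d2 >= d3, added to d1; this equals
        # the ceiling of (d1 + d2 + d3) / 2 in integer arithmetic.
        return (d1 + d2 + d3 + 1) // 2
    return d1


examples = {
    (3, 2, 2): rh_bound_three(3, 2, 2),
    (2, 2, 2): rh_bound_three(2, 2, 2),
    (4, 1, 1): rh_bound_three(4, 1, 1),
}
```

For distances (3, 2, 2) and (2, 2, 2) the bound exceeds the largest pairwise distance by one, while for (4, 1, 1) it coincides with the trivial bound d₁.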

4 AN UPPER BOUND

We now present an upper bound on R(T₁,…, T_K). The combination of the RH lower bound and the upper bound quantifies the range of R(T₁,…, T_K). In the best scenario, if the upper bound matches the RH bound, these bounds would actually determine the exact value of R(T₁,…, T_K) (and also reconstruct 𝒩_min). At a high level, the upper bound performs stepwise insertion of trees into a reticulate network (and thus is called the SIT bound). The SIT bound is very accurate and also computable in practice for many datasets.

The basic idea of the SIT bound is to reconstruct a reticulate network 𝒩 in a step-by-step way: ‘insert’ the given gene trees one by one into 𝒩 in some fixed order. When we say a tree T is inserted into 𝒩, we mean adding reticulation edges into 𝒩 such that T is displayed in the updated network 𝒩′. Note that addition of new reticulation edges increases R_𝒩. Thus, every time we insert a new tree, we seek to add as few new reticulation edges as possible by reusing existing reticulation edges. At the same time, we also ensure no cycle exists in 𝒩′. Often, it is unclear which order of inserting trees gives the best result. For now, we assume that K is relatively small so that we can enumerate all possible orders of insertion to find the best result. See Section 4.2 for ways to handle larger K. Thus, we can assume the order of tree insertion is fixed to T₁, T₂,…, T_K. The general procedure of the SIT bound (for a fixed order) is as follows.

1. Initialize 𝒩 to be T₁.
2. for i=2 to K
3.     Insert T_i into 𝒩 by adding the smallest number of new reticulation edges.

Note that we only add reticulation edges in 𝒩 and do not delete any existing edges. Thus, any tree already displayed in 𝒩 is still displayed in the updated 𝒩′ by choosing the original reticulation edges when the tree is first inserted for their display vectors in 𝒩′. This ensures that each of the input trees is displayed in the final 𝒩. Obviously, step 3 is most critical, which we will discuss next.
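The outer loop of the SIT procedure, including the enumeration of insertion orders for small K, might be sketched as below. This is only a skeleton under stated assumptions: `insert_tree` is a hypothetical callable that returns the updated network and the number of reticulation edges it added, and `toy_insert` is a placeholder that treats networks and trees as sets of clusters. The placeholder is not the min-cost insertion of Section 4.1 and is included only so that the loop can be run.

```python
from itertools import permutations


def sit_bound(trees, insert_tree):
    """Try every insertion order; return (fewest added edges, network)."""
    best = None
    for order in permutations(trees):
        network, total = order[0], 0      # initialize the network to T_1
        for tree in order[1:]:            # insert T_2, ..., T_K in turn
            network, added = insert_tree(network, tree)
            total += added
        if best is None or total < best[0]:
            best = (total, network)
    return best


def toy_insert(network, tree):
    """Placeholder (assumption): cost = number of clusters not yet present.

    A real implementation would solve the min-cost tree insertion problem
    and keep the network acyclic; this stand-in does neither.
    """
    return network | tree, len(tree - network)


# Hypothetical trees on taxa a, b, c, each given by its non-trivial cluster.
t1 = frozenset({frozenset("ab")})
t2 = frozenset({frozenset("bc")})
t3 = frozenset({frozenset("ab")})
best_total, best_network = sit_bound([t1, t2, t3], toy_insert)
```

Enumerating all K! orders is only reasonable for small K, which matches the assumption made in the text for this part of the method.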

4.1 Inserting tree T into 𝒩

We consider the ‘min-cost tree insertion problem’, where we want to update 𝒩 by adding the fewest reticulation edges to 𝒩 so that a given tree T is displayed in the updated 𝒩′, and 𝒩′ remains acyclic. Note that the min-cost tree insertion problem is NP-complete because it contains the two-tree minimum reticulate network problem (an NP-complete problem) as a sub-problem. That is, constructing the minimum reticulate network for trees T₁ and T₂ can be solved by inserting T₂ into T₁ with the minimum cost. In the following, we develop a practical method to solve the min-cost tree insertion problem. Each node of the reconstructed network here has one or two incoming edges (except the root), and one or two outgoing edges (except the leaves).

After inserting T (and some new reticulation edges are added), T is displayed in the updated network 𝒩′. Suppose we remove all the new reticulation edges in 𝒩′. The edge removals break tree T(𝒩′) (the tree created by keeping edges in 𝒩′ according to a display vector of T) into a forest F(T(𝒩′)). Thus, the number of newly added reticulation edges is exactly the number of trees in F(T(𝒩′)) minus one. To minimize the number of needed new reticulation events, we need to minimize the number of trees in F(T(𝒩′)). A useful observation is that the problem of finding F(T(𝒩′)) with the fewest subtrees is closely related to the maximum agreement forest problem (see the Supplementary Material) as follows.

Imagine that we choose a tree T′ that is displayed in 𝒩 so that the display vector of T′ agrees with that of T for 𝒩′ at each reticulation node of 𝒩. Recall that 𝒩′ may contain a number of new reticulation nodes that are not in 𝒩. Also note T′ is not necessarily one of the input trees T_i. We claim that F(T(𝒩′)) is an agreement forest for T and T′. To see this, we note that the display choices made by T′ are identical to T except those at the new reticulation nodes (where T′ follows the original edge and T(𝒩′) follows the new edge). So the subtrees in F(T(𝒩′)) must also be subtrees of T′. So, we have:

Lemma 4.1. —

The forest induced by removing newly added reticulation edges of T(𝒩′) is an agreement forest between T and some tree T′ that is displayed in the original 𝒩.

Lemma 4.1 implies that to find the best tree insertion, we can find some tree T′ displayed in 𝒩 s.t. the number of trees in the maximum agreement forest between T and T′ is minimized. Figure 3 shows an example of tree insertion. The dashed lines in the tree (left) divide the tree into a forest, which also appears in the existing network (middle, thick lines). Inserting the tree into the network amounts to adding new reticulation edges (right, thick lines) to the network so that the subtrees in the forest are properly connected to match the given tree.

Fig. 3. — Inserting a tree (left) to a network (middle). After adding new reticulation edges (thick lines), the resulting network (right) displays the tree.

When the number of reticulation nodes in 𝒩 is small, we may simply enumerate all trees T′ displayed in 𝒩 and then find which T′ gives the smallest agreement forest with T. This quickly becomes infeasible as the number of reticulation nodes in 𝒩 grows: when there are r reticulation nodes in 𝒩, there may exist 2^r trees T′ displayed in 𝒩. To develop a practical method, we develop an integer linear programming (ILP) formulation to solve the tree insertion problem in an optimal way (without explicit enumeration). The output of the ILP formulation includes the display choices of T′ as well as the associated agreement forest formed by cutting edges in T. See the Supplementary Material for detailed description of the formulation.
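The brute-force enumeration mentioned above (the approach the ILP is designed to avoid) can be sketched as follows. The representation is an assumption made for illustration: the network is a dictionary mapping each node to its list of parents, a reticulation node is any node with more than one parent, and a displayed tree is summarized by its set of leaf clusters. The function name `displayed_trees` is hypothetical.

```python
from itertools import product


def displayed_trees(parents, leaves):
    """Enumerate the distinct trees displayed in an acyclic network.

    parents: dict node -> list of parent nodes (the root maps to []).
    Returns a dict mapping each tree (a frozenset of leaf clusters) to one
    display vector (reticulation node -> chosen parent) that produces it.
    """
    retic = sorted(v for v, ps in parents.items() if len(ps) > 1)
    found = {}
    # One choice of incoming edge per reticulation node: up to 2^r vectors.
    for choice in product(*(parents[v] for v in retic)):
        chosen = dict(zip(retic, choice))
        # Keep exactly one incoming edge for every non-root node.
        up = {v: chosen.get(v, ps[0]) for v, ps in parents.items() if ps}
        clusters = {}
        for leaf in leaves:
            node = leaf
            while node is not None:        # walk from the leaf up to the root
                clusters.setdefault(node, set()).add(leaf)
                node = up.get(node)
        # Nodes with identical clusters collapse, which suppresses
        # the degree-two nodes left behind by dropped edges.
        tree = frozenset(frozenset(c) for c in clusters.values())
        found.setdefault(tree, chosen)
    return found
```

For a network with a single reticulation node `h` whose parents are `x` and `y`, this returns at most two trees, one for each incoming edge of `h`.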

Updating 𝒩: after tree T′ and the associated agreement forest are found, we update 𝒩 as follows. We add new reticulation edges in 𝒩 to connect subtrees of T′ in the found agreement forest to make T displayed in the updated network 𝒩′. First, we determine the order of subtree connection with an approach similar to the algorithm building two-tree hybridization networks in (Semple, 2007). The subtree with the special outgroup taxon o acts as the base. Then we repeatedly pick the subtree not intersecting any already connected subtree as the next to connect. Now, for each tree connection:

Find the root r of the next subtree (in 𝒩) to attach.
Find the node v in the existing network as the attaching point to accept this subtree.
Create a new reticulation node in 𝒩 to connect the subtree.

This operation depends on the types of r and v. Two cases are shown in Figure 4. The other cases are similar. In all the cases, only a single new reticulation edge is created to connect a subtree.

Fig. 4. — Attaching a subtree in 𝒩. Left: insert a reticulation edge between two tree nodes r and v. Right: insert a reticulation edge between two reticulation nodes r and v. The dashed lines are the newly added edges.

Cycles: a remaining issue is that cycles can be introduced when connecting subtrees in 𝒩. There are two sources of cycles. First, the found agreement forest may induce cycles (see Baroni et al., 2005). Enhancing the ILP formulation to avoid cycles may significantly complicate the formulation and slow the ILP solving. A practical observation is that cycles in an agreement forest are often caused by two pairs of leaves a, b and c, d such that the a/b pair is ancestral to the c/d pair in T and the c/d pair is ancestral to the a/b pair in T′. Here, we say a pair of leaves a and b is ancestral to a pair of leaves c and d if the MRCA of a and b is ancestral to the MRCA of c and d. MRCA stands for the most recent common ancestor, and node a is ancestral to node b in tree T if a is on the path from b to the root of T. To forbid this type of simple cycle, we enhance the ILP formulation: for such pairs a/b and c/d, we require that either a and b are not in the same subtree, or c and d are not in the same subtree of the resulting forest. Although this does not guarantee the removal of all cycles, we found that cycles in the agreement forest are rare after this change. This observation is also useful for the method of computing pairwise reticulation distances in (Wu and Wang, 2010).
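The pairwise test described above can be sketched as follows. This only detects the conflicting pairs; it does not show how the corresponding constraint is added to the ILP. The child-to-parent dictionaries and the function names are assumptions for illustration, and ancestry is taken to be strict (a node is not counted as its own ancestor).

```python
def ancestors(parent, node):
    """Nodes on the path from `node` to the root, starting with `node`."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(parent, a, b):
    """Most recent common ancestor of a and b in a child -> parent map."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def pair_is_ancestral(parent, pair1, pair2):
    """True if the MRCA of pair1 is a strict ancestor of the MRCA of pair2."""
    top, bottom = mrca(parent, *pair1), mrca(parent, *pair2)
    return top != bottom and top in ancestors(parent, bottom)


def conflicting(parent_t, parent_t2, pair1, pair2):
    """pair1 is ancestral to pair2 in T while pair2 is ancestral to pair1 in T'."""
    return (pair_is_ancestral(parent_t, pair1, pair2)
            and pair_is_ancestral(parent_t2, pair2, pair1))
```

For T = (a,((c,d),b)) and T′ = (c,((a,b),d)), the pairs a/b and c/d conflict, so a forest keeping both a, b together and c, d together would be excluded.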

Second, cycles can appear in other parts of the network when subtrees in the agreement forest are connected. In practice, however, we find this happens relatively rarely. When this type of cycle does occur, we simply start over and try another order of tree insertion. This works well in practice: in Section 5, we build acyclic networks successfully for all (thousands of) simulated datasets.

4.2 Handling larger datasets

When the size and the number of trees grow, the running time increases. To handle larger datasets, we make several simplifications. (i) Instead of enumerating all possible orders of tree insertion, we start with an arbitrary tree. At each step, we pick a tree with the smallest reticulation distance to one of the already inserted trees. (ii) Solving the min-cost tree insertion problem optimally becomes more difficult when data grows. So instead of considering all possible T′ displayed in 𝒩 when inserting T, we randomly choose a fixed number (say 10) of trees T′ displayed in 𝒩 (in addition to all the inserted gene trees) and find the best way of inserting T based on one of the chosen T′. This heuristic is called the coarse mode (and the original approach is called the full mode). Our experience shows that the coarse mode works reasonably well in practice (see Section 5).
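Simplification (i) can be sketched as a greedy ordering over a matrix of pairwise reticulation distances. The function name, the matrix layout and the tie-breaking by smallest index are assumptions for illustration; the distances in the usage example are made up.

```python
def greedy_insertion_order(dist, start=0):
    """Order trees so each new tree is closest to some already chosen tree.

    dist[i][j] is the pairwise reticulation distance between trees i and j.
    `start` is the arbitrary first tree; ties are broken by smallest index.
    """
    order = [start]
    remaining = set(range(len(dist))) - {start}
    while remaining:
        nxt = min(
            remaining,
            key=lambda j: (min(dist[i][j] for i in order), j),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Illustrative distances only (not taken from the paper).
example = [
    [0, 4, 1, 6],
    [4, 0, 5, 2],
    [1, 5, 0, 6],
    [6, 2, 6, 0],
]
print(greedy_insertion_order(example))  # [0, 2, 1, 3]
```

This replaces the K! enumeration with K−1 selection steps, at the price of possibly missing the best order.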

5 EXPERIMENTAL RESULTS

We have implemented a software tool called PIRN (which stands for Parsimonious Inference of Reticulate Network) to compute the RH and SIT bounds. Program PIRN is available for download from: http://www.engr.uconn.edu/~ywu/. The tool is written in C++ and uses either CPLEX (a commercial ILP solver) or GNU GLPK ILP solver (mainly a demo of the functionalities for users without a CPLEX license). In computing the SIT bound, PIRN can run full mode (slower but can give better results) or coarse mode (faster but less accurate). We test our methods for both simulated and biological data on a 3192 MHz Intel Xeon workstation.

5.1 Simulation data

We generate simulation data using a two-stage approach: first simulate reticulate networks, and then generate a fixed number of trees displayed in the networks according to randomly generated display vectors. We simulate reticulate networks using a scheme similar to the coalescent simulation implemented in program ms (Hudson, 2002). For a given number of taxa (denoted as n), we start with n isolated lineages and simulate reticulation backwards in time. At each step, there are two possible events: (i) lineage merging, which occurs at rate 1; (ii) lineage splitting, which occurs at rate r. We choose the next event according to relative probabilities of all feasible events. Lineage merging generates speciation events, while lineage splitting generates reticulation events. To speed up the simulation, lineage splitting is disabled when the number of current lineages is no more than three. The parameter r dictates the level of reticulation in the simulated network: larger r will lead to more reticulation events in simulation.
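The backward-in-time event process can be sketched as follows. It tracks only the sequence of events and the lineage count, not the network topology. The rate scaling is an assumption, since the text does not spell it out: every pair of lineages is taken to merge at rate 1 and every lineage to split at rate r, in the style of a coalescent with recombination. The function name is hypothetical.

```python
import random


def simulate_event_sequence(n, r, seed=None):
    """Return the list of 'merge'/'split' events until one lineage remains.

    Assumed rates: each pair of lineages merges at rate 1 and each lineage
    splits at rate r. Splitting is disabled when at most three lineages
    remain, as described in the text.
    """
    rng = random.Random(seed)
    lineages = n
    events = []
    while lineages > 1:
        merge_rate = lineages * (lineages - 1) / 2
        split_rate = r * lineages if lineages > 3 else 0.0
        # Pick the next event in proportion to the total rate of each type.
        if rng.random() < merge_rate / (merge_rate + split_rate):
            events.append("merge")     # speciation event (backwards: coalesce)
            lineages -= 1
        else:
            events.append("split")     # reticulation event
            lineages += 1
    return events
```

Under these assumptions every run satisfies merges − splits = n − 1, and r = 0 yields a plain tree with n − 1 merges.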

Full mode of the SIT bound: to test the performance of the bounds, we generate data with varying number of trees K, number of taxa n and level of reticulation r. For each setting of these three parameters, we simulate 100 datasets. We report the percentage of datasets where the optimal solution is found (i.e. lower bound matches upper bound) in Figure 5a. To show how close the lower and upper bounds are, we report the average gap (the difference between the upper and the lower bounds, divided by the lower bound) in Figure 5b. We also report the average lower bound in Figure 5c, which somewhat reflects how complex the simulated networks are. We also give the average running time for each setting in Figure 5d. For five larger datasets, there are a small number of test cases that are too slow to run the full mode, and are excluded. The percentage of unfinished computation is usually one or two out of 100 datasets, except the cases with n=30/r=3.0/K=5 (18% unfinished) and n=30/r=5.0/K=4 (6% unfinished). This suggests the current practical range of the full mode of the SIT bound.
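Written out, the gap for a single dataset is the SIT (upper) bound minus the RH (lower) bound, divided by the RH bound. As a worked instance, the five-tree grass dataset reported in Table 2 has RH = 11 and SIT = 13:

```latex
\[
\mathrm{gap} \;=\; \frac{\mathrm{SIT} - \mathrm{RH}}{\mathrm{RH}},
\qquad
\text{e.g.}\quad \frac{13 - 11}{11} \approx 0.182 \;\; (18.2\%).
\]
```

A gap of zero means the two bounds match, so the minimum reticulation number is known exactly.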

Fig. 5. — Performance of the RH bound and SIT bound (full mode), for 10, 20, 30, 40 and 50 taxa, number of trees K from 3 to 5, and reticulation level r at 1.0, 3.0 and 5.0. (a) The percentage of exact reticulation number found among 100 simulated datasets. (b) Average gaps (in percentage) between the SIT bound and RH bound (normalized by the RH bound), and the average RH bound is shown in (c). (d) The average running time (in seconds). The reticulation level r for left, middle and right figures is 1.0, 3.0 and 5.0, respectively. Horizontal axis is the number of taxa, and each curve in a figure is for a value of K.

As shown in Figure 5a, PIRN performs very well when the number of trees K=3 or reticulation level r is small: an optimal solution can be found for at least 80% of simulated datasets when r=1.0 and K=5. Even with a higher reticulation level (r=3.0) and a larger number of taxa (say 50), about 60% of datasets can still be solved exactly when K=3. As expected, as the number of taxa, reticulation level and the number of trees grow, fewer datasets can be solved exactly, and correspondingly, Figure 5b shows that the gaps between the RH and SIT bounds increase. Figure 5c shows that the complexity of the networks increases too. Nevertheless, the gaps are still relatively small in these cases. For the more difficult settings simulated (i.e. 30 taxa, high reticulation level and five trees as input), the gap is about 25%. Running time depends on the complexity of the networks. Figure 5d shows that PIRN is practical for data of medium size.

Coarse mode of the SIT bound: we also test the coarse mode of our methods for larger data, as described in Section 4.2. The results are shown in Figure 6. The figure on the left shows the effects (on accuracy and running time) of increasing the number of taxa n. The figure on the right shows the effects of having more trees (i.e. increasing K from three to nine) for 10 taxa and reticulation level 5.0. There is a clear trade-off between the accuracy of solutions and efficiency. The coarse mode under-performs in terms of the quality of solutions, but is more scalable, especially when K increases. When the number of taxa increases, the coarse mode is likely to run faster than the full mode, but the difference is less significant.

Fig. 6. — Comparing the coarse (C) mode and the full (F) mode of the SIT bound. Left: fix r=3.0 and K=4, while varying n. Right: fix r=5.0 and n=10 and vary K. Both percentages of optimal solutions found and running time (in minutes for the left figure and seconds for the right one) are shown.

GLPK: the CPLEX solver is used in the simulation. The GLPK version is in general less robust and can only handle smaller data than the CPLEX version. Our experience shows that the GLPK solver can often handle five trees with 30 taxa when the reticulation level is low, and fewer taxa when the reticulation level is higher.

The RH bound: we now compare the performance of the RH bound with the MAF bound. We note that the MAF bound is not easy to compute: finding the MAF of only two trees is known to be NP-hard. When data is small, the MAF bound can be computed (e.g. using ILP similar to that in (Wu, 2009)). In Table 1, we compare the RH bound and the MAF bound for 100 datasets. Each dataset has 10 or 20 taxa and contains three to seven correlated trees. The trees are selected from the local trees for recombining sequences generated by a coalescent simulator, program ms (Hudson, 2002).

Table 1.

Compare the RH and MAF bounds for K trees

		K=3	K=4	K=5	K=6	K=7
n=10	RH > MAF	46	56	61	58	72
	RH = MAF	54	44	38	41	27
	RH < MAF	0	0	1	1	1
	100 × (RH − MAF)/MAF	16.5	18.4	17.2	18.0	20.9

n=20	RH > MAF	65	66	68	70	-
	RH = MAF	34	34	30	30	-
	RH < MAF	1	0	2	0	-
	100 × (RH − MAF)/MAF	12.6	11.9	12.1	12.8	-


RH > MAF: number of datasets where the RH bound is larger than the MAF bound among 100 datasets. The average gaps (in percentage) between the RH and MAF bounds are also shown.

Table 1 shows that the RH bound outperforms the MAF bound in a majority of the simulated datasets, and only very rarely is the RH bound lower than the MAF bound. In general, as the number of trees increases, the RH bound tends to outperform the MAF bound in both accuracy and running time. For example, for a dataset with 20 sequences and six input trees, CPLEX runs for over 11 h without reporting a solution (it found a solution of 11, but did not validate its optimality). In contrast, it takes less than 1 min to compute the RH bound of value 12 (higher than the MAF result).

5.2 Biological data

To evaluate how well our bounds work for real biological data, we test our methods on a Poaceae dataset. The dataset was originally from the Grass Phylogeny Working Group (Grass Phylogeny Working Group, 2001). The dataset contains sequences for six loci: internal transcribed spacer of ribosomal DNA (ITS); NADH dehydrogenase, subunit F (ndhF); phytochrome B (phyB); ribulose 1,5-bisphosphate carboxylase/oxygenase, large subunit (rbcL); RNA polymerase II, subunit β′′ (rpoC2); and granule bound starch synthase I (waxy). The Poaceae dataset was previously analyzed and rooted binary trees were inferred for these loci (Schmidt, 2003). Pairwise comparisons were performed in (Bordewich et al., 2007; Wu and Wang, 2010). Here, we provide in Table 2 our results on estimating the reticulation number using multiple (three to five) trees. Only shared taxa of a set of trees are kept. Thus, the pairwise distances reported here are different from those in (Bordewich et al., 2007; Wu, 2009). As shown in Table 2, PIRN finds optimal solutions for both datasets with three trees, and computes lower and upper bounds that are close for the dataset with five trees. Figure 7 shows the reconstructed reticulate network for these five trees. See Supplementary Material for a graphical display of the five grass trees. The network contains 13 reticulation events, and the lower bound is 11. Although the network may not be optimal, the gap between the lower and upper bounds is relatively small.

Table 2.

Results for the grass data

Trees	n	D	RH	SIT	Time
rpoC2, waxy, ITS	10	1, 6, 6	7	7	1 s
ndhF, phyB, rbcL	21	4, 5, 6	8	8	1 s
ndhF, phyB, rbcL, rpoC2, ITS	14		11	13	26 min 38 s


Both RH and SIT bounds are shown, as well as the running time. n: the number of taxa. D: pairwise distances.

Fig. 7. — A reticulate network found by program PIRN for five trees of a grass dataset. Red (shaded) balls represent reticulation nodes.

Remark: the following lists several aspects of the performance of PIRN. (i) Simulation shows that PIRN is often able to find the exact reticulation number when K or r is small, even when the number of taxa increases to medium size (say 50). Moreover, PIRN can compute the RH and SIT bounds for a wide range of data, despite the fact that we do not currently have polynomial-time algorithms for computing the bounds. We achieve this with the help of integer linear programming. (ii) Computing the RH bound is often much faster and more scalable than computing the SIT bound. Experience shows that the ILP formulation for computing the RH bound is often very fast to solve, and computing the pairwise reticulation distances usually takes less time than finding a good upper bound for all trees. The RH bound computation will also benefit from future improvements in computing the pairwise reticulation distances. The simulation results in this section are based on an earlier version of the method in (Wu and Wang, 2010), and a speedup may be possible with the latest methods. (iii) The number of trees K and the similarity of tree topologies have an impact on PIRN's optimality and running time. Using a more powerful ILP solver (e.g. CPLEX) and/or more powerful machines may also help for more difficult cases. (iv) Finally, our general approaches can be applied to larger data by using efficiently computable lower bounds of pairwise rSPR distances in computing the RH bound, and faster but less accurate heuristics to insert trees into a network.

Supplementary Material

Supplementary Data: btq198_index.html (668B, html); btq198_1.pdf (166KB, pdf)

ACKNOWLEDGEMENT

I thank Simone Linz for useful discussions and for sharing the grass tree dataset.

Funding: U.S. National Science Foundation (IIS-0803440) and the Research Foundation of University of Connecticut.

Conflict of Interest: none declared.

Footnotes

¹Other terms have been used in literature, such as phylogenetic networks and hybridization networks.

²It is not known whether pairwise reticulation distance satisfies the triangle inequality, although it clearly does when cycles are allowed. In practice, pairwise reticulation distances often obey the triangle inequality: we have not found a counter-example from many simulation datasets.

REFERENCES

Baroni M, et al. Bounding the number of hybridisation events for a consistent evolutionary history. J. Math. Biol. 2005;51:171–182. doi: 10.1007/s00285-005-0315-9.
Baroni M, et al. A framework for representing reticulate evolution. Ann. Comb. 2004;8:391–408.
Bordewich M, et al. A reduction algorithm for computing the hybridization number of two trees. Evol. Bioinform. 2007;3:86–98.
Bordewich M, Semple C. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Comb. 2004;8:409–423.
Bordewich M, Semple C. Computing the minimum number of hybridization events for a consistent evolutionary history. Dis. Appl. Math. 2007;155:914–928.
Grass Phylogeny Working Group. Phylogeny and subfamilial classification of the grasses (Poaceae). Ann. Mo. Bot. Gard. 2001;88:373–457.
Gusfield D. Optimal, efficient reconstruction of Root-Unknown phylogenetic networks with constrained and structured recombination. J. Comput. Syst. Sci. 2005;70:381–398.
Hallett M, Lagergren J. Efficient algorithms for lateral gene transfer problems. In: Proceedings of Fifth Annual Conference on Research in Computational Molecular Biology (RECOMB 2001). New York, NY, USA: ACM; 2001. pp. 149–156.
Hein J, et al. On the complexity of comparing evolutionary trees. Dis. Appl. Math. 1996;71:153–169.
Hudson R. Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337.
Huson D. Split networks and reticulate networks. In: Gascuel O, Steel M, editors. Reconstructing Evolution: New Mathematical and Computational Advances. Oxford, UK: Oxford University Press; 2007. pp. 247–276.
Huson D, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 2006;23:254–267. doi: 10.1093/molbev/msj030.
Huson D, Klopper T. Beyond galled trees - decomposition and computation of galled networks. In: Speed T, Huang H, editors. Proceeding of RECOMB 2007: The 11th Annual International Conference Research in Computational Molecular Biology. Vol. 4453 of LNBI. Berlin / Heidelberg: Springer; 2007. pp. 211–225.
Huson D, et al. Reconstruction of reticulate networks from gene trees. In: Proceeding of RECOMB 2005: The 9th Annual International Conference Research in Computational Molecular Biology. Vol. 3500 of LNBI. Berlin / Heidelberg: Springer; 2005. pp. 233–249.
Huson D, et al. Computing galled networks from real data. Bioinformatics. 2009;25:i85–i93. doi: 10.1093/bioinformatics/btp217.
Linz S, Semple C. Hybridization in nonbinary trees. IEEE/ACM Trans. Comput. Biol. Bioinform. 2009;6:30–45. doi: 10.1109/TCBB.2008.86.
Nakhleh L. Evolutionary phylogenetic networks: models and issues. In: Heath L, Ramakrishnan N, editors. The Problem Solving Handbook for Computational Biology and Bioinformatics. New York, USA: Springer, New York Inc; 2009.
Nakhleh L, et al. Reconstructing reticulate evolution in species - theory and practice. In: Proceeding of 8th Annual International Conference on Computational Molecular Biology. Berlin / Heidelberg: Springer; 2004. pp. 337–346.
Nakhleh L, et al. Reconstructing reticulate evolution in species - theory and practice. J. Comp. Biol. 2005;12:796–811. doi: 10.1089/cmb.2005.12.796.
Schmidt H. Phylogenetic trees from large datasets. PhD Thesis. Dusseldorf: Heinrich-Heine-Universitat; 2003.
Semple C. Hybridization networks. In: Gascuel O, Steel M, editors. Reconstructing Evolution: New Mathematical and Computational Advances. Oxford: Oxford University Press; 2007. pp. 277–309.
van Iersel L, et al. Constructing level-2 phylogenetic networks from triplets. In: Vingron M, Wong L, editors. Proceeding of RECOMB 2008: The 12th Annual International Conference Research in Computational Molecular Biology. Vol. 4955 of LNBI. Berlin / Heidelberg: Springer; 2008. pp. 450–462.
Wu Y. A practical method for exact computation of subtree prune and regraft distance. Bioinformatics. 2009;25:190–196. doi: 10.1093/bioinformatics/btn606.
Wu Y, Wang J. Fast Computation of the exact hybridization number of two phylogenetic trees. In: Borodovsky M, et al., editors. Proceeding of ISBRA 2010: The 6th International Symposium on Bioinformatics Research and Applications. Vol. 6053 of Lecture Notes in Bioinformatics. Berlin / Heidelberg: Springer; 2010. pp. 203–214.
