Abstract
Motivation
Species tree estimation from gene trees can be complicated by gene duplication and loss, and “gene tree parsimony” (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs containing some incomplete gene trees (i.e., gene trees lacking one or more of the species).
Results
We present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e., true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove that the “standard” calculations for duplications and losses exactly solve GTP when incompleteness results from taxon sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the DP algorithm is freely available as open source code at https://github.com/smirarab/DynaDup.
Keywords: Gene duplication and loss, Gene tree parsimony, Deep coalescence, Dynamic programming
Background
The estimation of species trees is often performed by estimating multiple sequence alignments for some collection of genes, concatenating these alignments into one supermatrix, and then estimating a tree (often using maximum likelihood or a Bayesian technique) on the resultant supermatrix. However, this approach cannot be used when the species’ genomes contain multiple copies of some gene, which can result from gene duplication. Since gene duplication and loss is a common phenomenon, the estimation of species trees requires a different type of approach in this case.
The most powerful approaches for species tree estimation for multi-copy gene families are likely to be methods such as Phyldog [1], which co-estimate gene trees and species trees under parametric models of gene evolution that include duplications and losses. Another type of approach uses initial assignments of orthology and paralogy to inform gene tree and species tree estimation [2]. However, by far the most common approach for estimating species trees uses gene tree parsimony, which we now describe.
Gene tree parsimony (GTP) is an optimization problem for estimating species trees from a set of gene trees (estimated from individual gene sequence alignments). In its most typical formulations, only gene duplication and loss are considered, so that GTP depends upon two parameters: (the cost for a duplication) and (the cost for a loss). The two most popular versions of GTP are MGD (minimize gene duplication), for which and , and MGDL (minimize gene duplication and loss), for which . The version of GTP that seeks the tree minimizing the total number of losses has also been studied; for this, and . These variants of GTP are NP-hard optimization problems [3], but software such as DupTree [4] and iGTP [5] for GTP are in wide use.
Basic to all these problems is the ability to compute the number of duplications and losses implied by a species tree and gene tree. This problem is called the “reconciliation problem”, surveyed in [6], and intensively studied in the literature (see, for example, [3, 7–17]). The mathematical formulation of the reconciliation problem was derived for the case where the gene tree and the species tree have the same set of taxa, and then extended to be able to be used on incomplete gene trees, i.e., trees that can miss some taxa.
Incomplete gene trees are quite common, and can arise for two different reasons: (1) taxon sampling: the gene may be available in the species’ genome, but was not included for some reason in the dataset for that gene, or (2) gene birth/death: as a result of gene birth and death (true biological gene loss), the species does not have the gene in its genome.
Given a gene tree gt and a species tree ST, two formulations for the number of losses have been defined. The most commonly used one computes the number of losses by first computing the “homeomorphic subtree” ST(gt) of ST induced by gt, and then computing the number of losses required to reconcile gt with ST(gt) (see, for example, [3, 8, 17]). Although this second formulation is in wide use (and is the basis of both iGTP [5] and Duptree [4], two popular methods for “solving” GTP), we will show that this can be incorrect when incompleteness is due to true biological loss. We refer to this formulation as the “standard” approach because of this widespread use in both software and the theoretical literature on GTP. The other, described in [18, 19], correctly computes the number of losses when incompleteness is a result of true gene loss, as we will prove.
This paper addresses the GTP problem for the case where some of the input gene trees may be incomplete due to either sampling or true biological loss. The main results are as follows:
We formalize the duploss reconciliation problem when gene trees are incomplete due to taxon sampling as the “optimal completion of a gene tree”, and we prove (Theorem 1) that the standard calculation correctly computes losses for this case.
We show by example that the standard calculation for losses in GTP can be incorrect when incompleteness is due to true biological loss.
We show how to compute the number of losses implied by a gene tree and species tree, when incompleteness is due to true biological loss.
We formulate variants of the GTP problem (when gene tree incompleteness is due to true biological loss) as minimum weight maximum clique problems (see Theorem 11 for one duploss variant), and show how to solve these problems efficiently using dynamic programming. We show that these optimal cliques can be found in polynomial time in the number of vertices of the graph, because of the special structure of the graphs. We also show that a constrained version of these problems, where the subtree-bipartitions of the species tree are drawn from the subtree-bipartitions of the input gene trees, can be solved in time that is polynomial in the number of gene trees and taxa.
Basics
Notation and terminology
We now define some general terminology we will use throughout this paper; other terminology will be introduced as needed. Throughout this paper we will assume that gene trees and species trees are rooted binary trees, with leaves drawn from the set of n taxa, and we allow the gene trees to have multiple copies of the taxa, and even to miss some taxa. We orient each tree so that the root is on top and the leaves are at the bottom; hence, we also say that a node v is above node w if the path from w to the root of the tree goes through the node v (similarly we say that w is below v).
We let gt denote a gene tree and ST denote a species tree. We let L(t) denote the set of taxa at the leaves of the tree t, and require that . If we say that gt is complete, and otherwise we say that gt is incomplete.
Let T be a rooted binary tree. We denote the set of vertices of T by V(T), the set of edges of T by E(T), the root by r(T), the internal nodes by , and the set of taxa that appear at the leaves by L(T). Note that if T is a gene tree, it can be incomplete, and so it is possible for |L(T)| to be smaller than the number of leaves in T. A clade in T is a subtree of T rooted at a node in T, and the set of leaves of the clade is called a cluster. Given a node v in T, the cluster of leaves below v is denoted by , and the subtree of T rooted at v is denoted by . The most recent common ancestor (MRCA) of a set A of leaves in T is denoted by . Given a gene tree gt and a species tree ST, we define by . Finally, given a node u in a rooted binary tree, we let r denote the right child of u and l denote the left child of u.
For a rooted gene tree gt and a rooted species tree ST, where , an internal node v in gt is called a duplication node if for some child of v, and otherwise v is a speciation node [3, 8, 17, 20].
ST(gt) is the homeomorphic subtree of ST induced by the taxon set of gt, and is produced as follows: ST is restricted to the taxon set of gt, and then nodes with in-degree and out-degree 1 are suppressed. is the tree obtained by restricting ST to the taxon set of gt, but not suppressing nodes of in-degree and out-degree 1.
We say that clade cl in ST is a missing clade with respect to gt if , and a maximal missing clade if it is not contained in any other missing clade. Maximal missing clades that are descendants of are called the “lower” maximal missing clades, and those that are not descendants of are called the “upper” maximal missing clades. We denote by LMMC(gt, ST) (or LMMC), the set of lower maximal missing clades, and UMMC(gt, ST) (or UMMC), the set of upper maximal missing clades. Note if .
The standard formula for computing losses
The standard formula (see, for example, [3, 8, 16, 17, 20]) for computing the minimum number of losses of a (potentially incomplete) gene tree gt with respect to a species tree ST is denoted , and is defined to be where F(u, T) is defined for internal nodes u with children l and r (which can be interchanged in the formula below) by:
1 |
where is the number of internal nodes in T on the path from s to . When gt is complete, then , and this formula follows from [18].
Optimal completion of a gene tree:
Input rooted binary gene tree gt and rooted binary species tree with .
Output complete gene tree that is an extension of gt such that implies a minimum number of losses with respect to ST.
In other words, we add all the missing taxa into gt (each taxon added at least once, but perhaps several times) so as to produce a complete binary gene tree that has a minimum number of losses with respect to ST. Let . Thus, denotes the total number of losses needed for an optimal completion of gt. Similarly, we can define to be the total number of duplications and losses needed for a completion of gt that minimizes the duploss score.
The following theorem shows that the standard formula correctly computes the number of losses, when we treat incompleteness as due to taxon sampling
Theorem 1
Given a binary rooted gene tree gt and a binary rooted species tree ST such that , the MRCA mapping defines a reconciliation that minimizes the number of duplications, the number of losses, and hence also the total number of duplications and losses, where we treat losses as due to sampling. Furthermore, , which means the standard formula correctly computes the number of losses when we treat incompleteness as due to sampling.
Proof
Consider ST(gt), the homeomorphic subtree of ST defined by the taxon set of gt. Since gt is complete with respect to ST(gt), the optimal reconciliation that minimizes duplications, losses, and their sum, is defined by , the MRCA mapping from gt to ST, and the standard formula correctly computes the number of losses for this reconciliation [18]. Note that for any completion t of gt, ; in other words, the number of losses cannot decrease by making gt complete. Similarly, the number of duplications for t with respect to ST cannot be less than the number of duplications of gt with respect to ST. We will show that we can add all the remaining taxa into gt to produce a complete gene tree such that . Therefore, will be an optimal completion with respect to the problem. Furthermore, this will also imply that , as desired.
Recall the definition of the sets UMMC and LMMC, the upper and lower maximal missing clades, respectively. Since gt is not complete, there must be at least one missing taxon, and hence at least one maximal missing clade. If then and we set . Otherwise, and . Consider the path in ST from r(ST) down to , and the subtrees that hang off that path before we reach . Note that each of these subtrees is an upper maximal missing clade. Let be the tree created by starting with ST and replacing the subtree of ST rooted at by gt. Note also that the number of duplications has not changed, and that .
If we are done; otherwise, we now add the lower maximal missing clades to one at a time. Let , so that . We will define a sequence of gene trees , so that is the result of adding clade to , and is the result of adding clade to for . We will show that for , and that the number of duplications in is the same as the number of duplications in . Since is a completion of gt, our theorem will be proven.
So consider , the first lower maximal missing clade, and let q be the node in ST that is the parent of r(t) [i.e., ]. Consider the edges (x, y) in with , such that q lies in the path between and . Subdivide each such edge (creating a new node), and add t to by making its root the child of each such newly created node. Note that there must be at least one such edge in but there can be several such edges, and hence this step adds t at least once (and perhaps several times) to . Note that when we add to , we do not change the image under the MRCA mapping for any node v that is in .
We now show that has the same number of duplications as gt with respect to ST. Clearly, any node in t is a speciation node (since t is a subtree of ST, which only has speciation nodes). Now consider a node u created by subdividing an edge (x, y), where y is the parent of x in . One child of u is the root of t and the other child has an entirely disjoint leaf set; thus u is a speciation node. When we subdivide edge (x, y) we make y the parent of u. Therefore, . Thus, y is a duplication node in if and only if where z is the other child of y in . But then y is a duplication node in if and only if y is a duplication node in , since the MRCA mapping does not change. Hence, no node in changes duplication/speciation status, and the newly added nodes are always speciation nodes. Therefore the number of duplication nodes does not change.
We now show that the number of losses does not change, i.e., . Now consider an edge (x, y) that is subdivided through the addition of a node u that is made the parent of the subtree . Then x, y, and u all map (under ) to different vertices in . Also, a simple case analysis (using the standard formula) verifies that . Since for all other vertices , this means that the total number of losses does not change.
Therefore, the result of adding each lower maximal missing clade to does not imply any additional losses nor any additional duplications, and so also the total number of duplications and losses does not change. Let be the tree obtained after adding in all the missing maximal clades, and return . The result then follows by induction on p.
Incompleteness due to gene birth and death
As we will see, while the MRCA mapping is still an optimal reconciliation when gene trees are incomplete due to gene birth and death (implied from [18, 21]), the standard formula does not correctly compute the number of losses. We begin by summarizing some results that have already been established:
Theorem 2
(From [18, 21]:) Given a binary rooted gene tree gt and a binary rooted species tree ST such that , the MRCA mapping defines a reconciliation that minimizes the number of duplications and the number of losses where we treat losses as due to gene birth and death. The set of speciation nodes in gt are those vertices that satisfy , where l and r are the two children of v and is the MRCA mapping from gt to ST; all other nodes are duplication nodes. Furthermore, we can compute the MRCA mapping, the set of duplication nodes, and the set of speciation nodes, in time, where ST has n leaves and gt has leaves.
Proof
Chauve et al. [18] proved that the MRCA mapping minimizes the losses required to reconcile gt with ST(gt) for complete gene trees, but the proof also applies to incomplete gene trees, treating incompleteness as due to gene birth and death. Górecki and Tiuryn [21] showed that the MRCA mapping minimizes the number of duplications required to reconcile gt with ST(gt), treating incompleteness as due to gene birth and death. Therefore, the MRCA mapping is optimal for all three scores (number of duplications, number of losses, and number of duplications plus losses), when treating incompleteness as due to gene birth and death.
It is easy to see that the duplication nodes in gt are those nodes that have or (where l and r are the two children of v, and is the MRCA mapping), and all other nodes are speciation nodes. Since the MRCA mapping can be computed in , where ST has n leaves and gt has leaves, it follows that all these can be computed in time.
However, the standard calculation for the number of losses can be incorrect when incompleteness is due to true biological loss! Consider the simple example and . Under the standard formula, , since . Under the assumption that incompleteness is due to true biological loss, the genome for d does not have the gene. Because d is sister to b and all the other taxa have the gene, the gene must have been present in the parent of d, and lost on the branch leading to d. Therefore, the standard formula for the number of losses can be incorrect when gene trees are incomplete due to gene birth and death (i.e., true biological loss).
How to calculate losses
We now show how to calculate the number of losses for an incomplete gene tree gt and species tree ST, treating incomplete gene trees as due to gene birth and death. How this is defined will depend upon whether one assumes, a priori, that the gene is present in the genome of the common ancestor of the species in ST (i.e., at the root of ST). This needs to be taken into account as gene birth can happen only once whereas loss can happen repeatedly. Thus, this section shows how to calculate the following values:
, the minimum number of losses, under the assumption the gene is present in the common ancestor of the species in ST ( is defined similarly for the total number of duplications and losses), and
the minimum number of losses without assuming the gene is present in the common ancestor of the species in ST ( is defined similarly for duplications and losses).
We now show how to compute the number of losses (i.e., and ), using the fact that the MRCA mapping defines an optimal reconciliation.
Theorem 3
Let gt be a gene tree and ST a species tree such that . Then, , and Furthermore, these values can be calculated in time, where ST has n leaves and gt has leaves.
Proof
Note that we use a modification of the standard formula, F(u, ST), so that we do not replace ST by ST(gt) as was done in [18, 19].
Derivation of Recall that does not assume that the most recent common ancestor of the species in ST has the gene. Since gene birth can happen only once (although loss can happen repeatedly), we begin by determining the location of the gene birth. If , then the gene is born before r(ST), and is present at the root of ST. Otherwise, it is easy to see that the location of the gene birth that minimizes the number of losses is the edge above . Now consider the modification of the standard formula (i.e., using ST instead of ST(gt)):
2 |
It is easy to see that this correctly returns the number of inserted subtrees, and hence the number of losses.
Derivation of By definition of , the gene is assumed to be present at the root of the species tree ST. If , then , and the result follows. However, if , the gene must be present on the path between r(ST) and . Since the gene is not present in any leaf that is not below , to minimize losses, the gene must be lost on every edge off that path, since such edges lead to subtrees that do not have the gene present in any leaf. Note that if , then the number of edges that lead off that path is . Since the gene must be lost on each of those edges, and the total number of losses is the sum of this value and the number of losses that occur within the subtree rooted at , it follows that . Figure 1 illustrates an example distinguishing and . The running time follows easily from the fact that the MRCA mapping can be computed in linear time [22].
Now one of the most important questions in terms of estimating the optimal species tree is – given a set of (possibly incomplete) gene trees, is the species tree that minimizes or different than the one that minimizes ? If the same species tree optimizes both ways of calculating losses, then defining loss differently is not that important in the context of phylogenomic analyses. But this is not necessarily true, as we will show in the following theorem.
Theorem 4
Let be a set of incomplete gene trees and and are the species trees that minimizes and , respectively. Then is not necessarily identical to or
Proof
We will present an input of 14 gene trees and a species tree that optimizes the criterion and that provably is not optimal for the criterion. Consider the two gene tree topologies and as shown in Fig. 2a, b. Let be a set of 14 gene trees, with eight gene trees having topology and the remaining six gene trees having topology . It is easy to verify that the species tree with topology minimizes . Here, . Any other species tree will result into more than 24 losses, and the reason is as follows. There are three tree topologies on leafset : ((a, b), c), ((a, c), b) and ((b, c), a). Reconciling with ((a, c), b) or ((b, c), a) requires 3 losses. Therefore, any species tree T such that is not identical to requires losses to reconcile the eight gene trees having topology with T. Therefore, to achieve less than 24 losses, should be identical to . We now calculate the number of losses required to reconcile with a species tree T such that . Note that, Reconciling ((a, c), b) with ((a, b), c) requires three losses. Then taking into consideration, it is quite easy to verify that it requires more than three losses to reconcile with a species tree T such that . Hence, there is no species tree T so that . Therefore, minimizes .
However, the species tree minimizes . Here , which is less than .
Therefore, is not necessarily same as . Then the fact that is not necessarily identical to immediately follows.
Algorithms to find species trees
Here we address the problem of finding a species tree that has a minimum total number of duplications and losses, treating incompleteness as due to true biological loss. Prior results on GTP include a branch-and-bound algorithm in [23] based on techniques from [18], a randomized hill-climbing heuristic presented in [4], a probabilistic and computationally expensive method for co-estimating gene and species trees [1], and dynamic programming based solutions by Hallett and Lagergren [15], Bayzid et al. [20] and Chang et al. [24]. However, none of these studies takes the reasons of incompleteness into account, and we have already shown that the standard calculation for losses can be incorrect when incompleteness is due to true biological loss.
In this section, we derive a different approach for the GTP problems, treating incomplete gene trees as due to true biological loss (i.e., minimizing or ). The techniques we propose can be used to solve GTP exactly for small datasets, or approximately (though without any guaranteed error bounds) on larger datasets. The approach we take here is based on [20] (see also [15, 25, 26], which use very similar techniques). Bayzid et al. [20] provided a graph-theoretic formulation for , whereby an optimal solution to corresponded to finding a minimum weight maximum clique inside a graph called the “Compatibility Graph”. The nodes of the compatibility graph correspond to “subtree-bipartitions”, a concept Bayzid et al. [20] introduced and we will also use. Bayzid et al. [20] showed how to find a minimum weight max clique using a dynamic programming approach. We will use the same graph-theoretic formulation as in [20], but modify the weights appropriately, to show that the optimal solution to still corresponds to a minimum weight max clique. The DP algorithm in [20] can then be used directly to find the optimal solution to . To achieve this, we first derive an efficient formula for (and , similar to the one derived in [17] for , but somewhat more involved.
We will let denote the set of duplication nodes in gt with respect to ST and denote the set of speciation nodes in gt with respect to ST. When gt and ST are known, we may write these as and . The calculation for the number of losses depends on how we interpret incompleteness in gene trees. Therefore, rather than having a single optimization problem like MGDL, we have variants of this problem depending on how we treat incompleteness. As shown in Theorem 1, the term MGDL in the literature refers to , which (by Theorem 1) is identical to . Here, we consider the optimization problems , where we treat incompleteness as due to gene birth and death. And therefore, we also consider , , and .
Basic material
Subtree-bipartitions
Let T be a rooted binary tree and u an internal node in T. The subtree-bipartition of u, denoted by , is the unordered pair , where l and r are the two children of u. Note that subtree-bipartitions are not defined for leaf nodes. The set of subtree-bipartitions of a tree T is denoted by . Furthermore, any pair A and B of disjoint subsets of also define a subtree-bipartition (though we may refer to these as candidate subtree-bipartitions to emphasize this).
Subtree-bipartition domination Let and be two subtree-bipartitions. We say that is dominated by (and conversely that dominates ) if either of the following two conditions holds: (1) and or (2) and We say that subtree-bipartition (A|B) is dominated by a species tree T if one of T’s subtree-bipartitions dominates (A|B). Bayzid et al. showed that an internal node u in a gene tree gt is a duplication node with respect to a species tree ST if is dominated by ST [20]. Finally, for a set of gene trees on taxon set and for any candidate subtree-bipartition (A|B), we let be the total number of subtree-bipartitions in that are dominated by (A|B).
Subtree-bipartition containment and compatibility We say that contains if or , and that and are disjoint if . We say that two subtree bipartitions are compatible if they are disjoint or one contains the other.
The compatibility graph Let be a set of rooted binary gene trees on the set of n taxa. The compatibility graph has a vertex for a subtree-bipartition defined on , and there is an edge between two vertices if and only if the associated subtree-bipartitions are compatible.
Deep coalescence and the MDC problem
Deep coalescence (also called incomplete lineage sorting, or ILS) refers to the failure of alleles to coalesce (looking backwards in time) into a common ancestral allele until deeper than the most recent speciation events [27]. One of the measures for incongruence between a gene tree and a species tree under ILS is XL(gt, ST), the number of extra lineages defined for the pair ST and gt [27]. For a gene tree gt and a species tree ST such that the number of extra lineages (summing over all edges) is defined to be
where is the number of extra lineages on .
Minimize deep coalescence (MDC) is an optimization problem for estimating species trees in the presence of ILS. MDC problem can be defined as follows.
Problem Minimize deep coalescence (MDC)
INPUT A set of gene trees.
OUTPUT A species tree ST such that is minimized.
This problem is also NP-hard [17], and software for the problem exists in Phylonet [28] and iGTP [5], among others. We now describe theoretical material leading to the algorithmic approach in Phylonet [26].
Definition 5
(From [26]) For and gene tree gt, we set to be the number of B-maximal clusters in gt, where a B-maximal cluster is a cluster such that but no other cluster of gt containing Y is a subset of B.
Definition 6
We define for x either a subtree-bipartition or a subset of , as follows. If , then we set if and otherwise . If x is a subtree-bipartition, then we let for , and we set if , and otherwise . For a set of gene trees and ST a species tree, we set .
Yu et al. [26] showed that for any edge e in ST, where B is the cluster below e, then is the number of lineages going through edge e, and so is the number of extra lineages going through e. They defined weights on potential species tree clusters B by = 0 if and otherwise (i.e., is defined for clusters while is defined for subtree-bipartitions), and extended this to a set of gene trees by , and then to a set C of clusters by . From this, it follows easily that a set C of compatible clusters minimizing defines a rooted binary species tree with a minimum MDC score.
Deriving and
Theorem 7
[From [17]] Let gt be a rooted binary gene tree, ST a rooted binary species tree and the set of duplication nodes in gt with respect to ST. Then
We now derive formulas for and ; to obtain formulas for and , simply add . Recall that in the definition of F(u, T) given in Eq. 1, losses are associated with internal nodes, and the total number of losses is defined as the sum of losses associated to each internal node. However, the definition of the number of losses corresponding to a node can be rewritten in terms of edges, as we now show. Let be the number of edges in the path in ST between s and . Therefore, can be defined as follows.
Then, for a vertex u in gt with children r and l, we can rewrite Eq. 1 as follows:
It is easy to see that in all three branches of the equation above, the two terms of the sum correspond to the edges connecting u to its two children l and r. (The second term in the first branch and both terms in the third branch are 0, but we wrote them in terms of the function D(., .) for convenience.) Let p(x) be the parent of x in a tree T. Therefore, we can associate gene losses to edges instead of nodes, as follows:
We use the subscript ST in to emphasize the fact that the distance is taken within the tree ST and not within ST(gt). Therefore,
Lemma 8
For all gene trees gt and species trees ST with
3 |
and for a set of gene trees,
4 |
Finally, equalities concerning and can be obtained from these equalities by adding and , where .
Proof
We partition all the non-root nodes in gt into two sets: CD (children of duplications), consisting of those nodes whose parents are duplication nodes, and CS (children of speciations), consisting of those nodes whose parents are speciation nodes. Note that every edge can be associated with the set containing x. Therefore,
5 |
Since each internal node has two children, clearly the number of vertices x for which p(x) is a speciation node is twice the number of speciation nodes; therefore Since each internal node is a speciation node or a duplication node, it follows that , and the result follows.
Let L(gt, e) be the number of lineages that go through edge ; thus, , and so
6 |
Lemma 9
For any gene tree gt and species tree ST, and (by Eq. 6)
7 |
Thus, for a set of gene trees and species tree ST,
Proof
We establish the first equality, since the remaining ones follow directly from it. Consider the lists of edges in paths in ST from to , as x ranges over the internal vertices in gt. It is easy to see that the number of occurrences of an edge in these lists is (the number of lineages through ). Also, the edges will not be present in these lists, since these are the edges incident on the missing clades in ST with respect to gt. Therefore, the sum of the lengths of these lists is equal to and also equal to .
Theorem 10
For all gene trees gt, sets of gene trees, and species trees ST, , and
8 |
Proof
Corollary 1
For all gene trees gt and species trees ST,
Proof
The equalities concerning follow from Theorems 3 and 10. The equalities concerning follow by adding .
Assigning weights to subtree-bipartitions
To use the graph-theoretic formulation of , we have to assign weights to each node in the compatibility graph, , where is the input set of gene trees, so that a minimum weight clique of vertices defines an optimal solution to . We will define weights , and to each subtree-bipartition (i.e., node in the compatibility graph), and set
We then prove (see Theorem 11) that a set of compatible subtree-bipartitions that has minimum total weight defines a species tree that optimizes . Note that weights and have already been defined. Hence, all that remains is to define and , and then to prove Theorem 11.
Calculating and : We now show how to define weight for every vertex v in the compatibility graph so that for all species trees ST, is the sum of the vertex weights for the clique in corresponding to ST. To count the number of edges in , we need to exclude those edges from E(ST) that are incident on a clade that is missing in gt. For a vertex v associated with the subtree-bipartition (p|q), we define as follows (swapping p and q as needed):
9 |
Then, . We set . Then, for any species tree ST and set of gene trees,
10 |
where is the clique in that corresponds to ST.
Calculating and |UMMC(gt, ST)| We now show how to assign the weight to each vertex v of the compatibility graph so that for all species trees ST, |UMMC(gt, ST)| is the sum of the weights over all the vertices of the clique in corresponding to ST. Recall that UMMC(gt, ST) is the set of upper maximal missing clades in ST. For a vertex v associated with the subtree-bipartition (p|q), we define as follows (swapping p and q as needed):
11 |
Then Finally, we set Then, for any species tree ST and set of gene trees,
12 |
where is the clique in that corresponds to ST.
We can extend the techniques to allow for losses and duplications to have different costs, as follows. Let be the cost of a duplication and assume the cost of a loss () is 1. (Note that, our techniques work for any arbitrary and .) Let , and set Let be the problem that takes a set of gene trees and duplication cost as input, and finds the species tree that minimizes the weighted duploss score . Let . If , we omit the superscript and write .
Theorem 11
Let be a set of binary rooted gene trees on set of n species, and set the weights on the vertices in the compatibility graph using (a) A set of subtree-bipartitions in an -clique of minimum weight in defines a binary species tree ST that minimizes . Furthermore, the weighted duploss score of ST is given by , where (b) If we reset the weights to be , then a set of subtree-bipartitions in an -clique of minimum weight in defines a binary species tree ST that minimizes
Proof
We prove (a), since (b) follows directly from (a). Let be a clique of size in and ST the associated species tree. Let be the set of subtree-bipartitions in gt that are dominated by a subtree-bipartition in ST. Note that is the number of speciation nodes in gt with respect to ST [20]. Therefore, the total number of speciation nodes in is . Also, , and , where is the number of leaves in . Finally, since all gene trees are rooted binary trees, and . Recall that is the number of extra lineages contributed by the leaf set of the species tree (Definition 6). Therefore,
Note that does not depend on the topology of the species tree. Hence, the -clique with minimum weight defines a tree ST that minimizes . The proof for (b) follows trivially.
Dynamic programming algorithm
Let be a set of subtree-bipartitions, with equal to all possible subtree-bipartitions if an exact solution is desired, and otherwise a proper subset if a faster algorithm is desired or necessary. We present the DP algorithm for the problem. We compute score(A) in order, from the smallest cluster to the largest cluster
Algorithm
if |A| = 1 then score(A) = WXL(A)
else
score(A) = max{score(A1) + score(A − A1) + (A1|A − A1) : (A1|A−A1) ∈
If there is no , we set score(A) to , signifying that A cannot be further resolved. At the end of the algorithm, if includes at least one clique of size , we have computed as well as sufficient information to construct the optimal set of compatible clusters and hence the optimal species tree (subject to the constraint that all the subtree bipartitions in the output tree are in ). If subtree bipartitions in are not sufficient for building a fully resolved tree on , then will be , and our algorithm returns FAIL.
The optimal number of duplications and losses is given by , by Theorem 11. If contains all possible subtree-bipartitions, we have an exact but exponential time algorithm. However, if contains only those subtree-bipartitions from the input gene trees, then the algorithm finds the optimal constrained species tree in time that is polynomial in the number of gene trees and taxa.
Theorem 12
Let be a set of rooted binary gene trees, a set of subtree-bipartitions which contains only the subtree-bipartitions, or a subset of the subtree-bipartitions from the input gene trees. The dynamic programming (DP) algorithm finds the species tree ST minimizing the weighted duploss score, treating incomplete gene trees as resulting from gene birth and death, subject to the constraint that in time. Therefore, if is all possible subtree-bipartitions, we have an exact but exponential time algorithm. However, if contains only those subtree-bipartitions from the input gene trees, then the DP algorithm finds the optimal constrained species tree in time, where n is the number of species, k is the number of gene trees, and d the maximum number of times that any taxon appears in any gene tree.
Proof
The proof of correctness is given above. The running time analysis for an arbitrary follows the same argument as given in [20], since and can be computed in O(1) time after the preprocessing (as described in [20]). Finally, suppose Q is the set of subtree bipartitions from the input gene trees, and we use Q as the constraint set. Note that |Q| is O(dkn) (every internal node in every gene tree corresponds to subtree bipartition, and there are at most a total of dkn internal nodes across all the gene trees). Hence, when the constraint set is just the set Q of subtree bipartitions from the input set of gene trees, then the algorithm runs in time.
Extensions
It is trivial to extend the theory for and to and , as we now show. Recall that the only difference between and is whether the gene is assumed to be present at the ancestral species: in it is not assumed to be present there, but in it is. Therefore, and that . Therefore, to extend the algorithmic approach to solve and , we define and and then seek a minimum weight maximum clique in the compatibility graph with these modified weights.
Conclusion
The calculation of reconciliation costs between gene trees and species trees is a standard step in many bioinformatics analyses, including the estimation of species trees from a set of gene trees. This paper showed that different interpretations of incompleteness (i.e., species missing from genes) can impact the way that these reconciliation costs should be calculated, and need to be taken into account when using Gene Tree Parsimony to construct species trees from gene trees.
To address this issue, we presented a dynamic programming algorithm that provably finds an optimal species tree given a set of gene trees under the (weighted) GDL model within a constrained search space, treating incompleteness as due to true biological loss. This technique can be used on any input on which other gene tree parsimony is used. The use of dynamic programming to find provably optimal solutions within a constrained search space is also how ASTRAL (a coalescent-based species tree estimation method) [29–31] and FastRFS (a supertree method) [32] achieve good performance. For those methods, the constraints are based on bipartitions rather than subtree bipartitions, but the dynamic programming algorithm is nearly identical to the one we use here. As noted in [32], although setting the constraint set to just the bipartitions in the input source trees produced good results, expanding the set to include bipartitions from computed supertrees improved the topological accuracy of the resultant FastRFS supertree, without greatly increasing the running time. This suggests that expanding the subtree bipartition set to include estimated species trees based on GDL would be similarly beneficial for the dynamic programming method we present in this paper. In addition, changing the technique for defining subtree bipartitions is necessary when all the gene trees are incomplete, since in that case none of the gene trees can contribute any valid subtree bipartitions.
The results and the methods presented here are based on the assumption that the gene trees are discordant due to gene duplication and loss. However, these methods can be applied to both orthologous genes (in which case the gene trees could be single copy) and gene families that by definition will include both paralogs and orthologs. In cases where the genes are expected to be orthologs, all discordance between gene trees and species trees should be due to processes other than duplication and loss (e.g., incomplete lineage sorting), which could make approaches that attempt to minimize the total GDL cost potentially less accurate. Nevertheless, it is also possible that these GDL-based methods could be reasonably accurate under conditions where ILS rather than GDL is operating, and so these methods should be explored in that context.
Another natural source of discordance is gene tree estimation error, which is likely to occur with most biological datasets (see discussion in [33]). Therefore, the most accurate estimations of the number of duplications and losses (or weighted versions of these numbers) will only be obtained when the estimated gene trees and species trees are highly accurate, so that every attempt should be made to estimate these trees carefully. Gene trees, especially of multi-copy genes spanning large numbers of species, can be extremely large (i.e., greater than 100,000 leaves), and thus present enormous analytical and computational challenges (e.g., multiple sequence alignment and likelihood-based tree estimation are both difficult for datasets with more than about 1000 sequences, let alone 100,000) [34, 35]. Since completely accurate gene trees are not likely to be reliably obtained, these reconciliation methods and associated species tree estimation methods should be modified to take gene tree uncertainty into account. Methods such as NOTUNG [36] and ProfileNJ [37] are examples of methods that do this, but more work is needed.
Authors’ contributions
MSB designed the study and developed the algorithms. MSB and TW proved the theoretical results and wrote the paper. Both authors read and approved the final manuscript.
Acknowledgements
The authors thank Siavash Mirarab for his helpful suggestions and help in implementing the software. The authors wish to thank the anonymous reviewers for their helpful comments, which led to improvements to the manuscript.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
Not applicable.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
This work was supported by US National Science Foundation (1062335 and CCF-1535977) to TW and Fulbright Fellowship to MSB.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Md Shamsuzzoha Bayzid, Email: shams_bayzid@cse.buet.ac.bd.
Tandy Warnow, Email: warnow@illinois.edu.
References
- 1.Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013;23(2):323–330. doi: 10.1101/gr.141978.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. Proce Natl Acad Sci. 2015;112:2053–2063. doi: 10.1073/pnas.1412770112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ma B, Li M, Zhang L. From gene trees to species trees. SIAM J Comput. 2000;30(3):729–752. doi: 10.1137/S0097539798343362. [DOI] [Google Scholar]
- 4.Wehe A, Bansal MS, Burleigh JG, Eulenstein O. Duptree: a program for large-scale phylogenetic analyses using gene tree parsimony. Am J Bot. 2008;24(13):1540–1541. doi: 10.1093/bioinformatics/btn230. [DOI] [PubMed] [Google Scholar]
- 5.Chaudhary R, Bansal MS, Wehe A, Fernández-Baca D, Eulenstein O. iGTP: a software package for large-scale gene tree parsimony analysis. BMC Bioinform. 2010;11(1):574. doi: 10.1186/1471-2105-11-574. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Doyon JP, Ranwez V, Daubin V, Berry V. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011;12(5):392–400. doi: 10.1093/bib/bbr045. [DOI] [PubMed] [Google Scholar]
- 7.Goodman M, Czelusniak J, Moore G, Romero-Herrera E, Matsuda G. Fitting the gene lineage into its species lineage: a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979;28:132–163. doi: 10.2307/2412519. [DOI] [Google Scholar]
- 8.Guigo R, Muchnik I, Smith T. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996;6(2):189–213. doi: 10.1006/mpev.1996.0071. [DOI] [PubMed] [Google Scholar]
- 9.Page RD. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol. 1994;43(1):58–77. [Google Scholar]
- 10.Mirkin B, Muchnik I, Smith T. A biologically consistent model for comparing molecular phylogenies. J Comput Biol. 1995;2(4):493–507. doi: 10.1089/cmb.1995.2.493. [DOI] [PubMed] [Google Scholar]
- 11.Page RD. Genetree: comparing gene and species phylogenies using reconciled trees. Bioinformatics. 1998;14(9):819–820. doi: 10.1093/bioinformatics/14.9.819. [DOI] [PubMed] [Google Scholar]
- 12.Page RD, Charleston MA. Reconciled trees and incongruent gene and species trees. Math Hierarchies Biol. 1997;37:57–70. doi: 10.1090/dimacs/037/04. [DOI] [Google Scholar]
- 13.Stege U. Gene trees and species trees: the gene-duplication problem is fixed-parameter tractable. In: Dehne FKHA, Gupta A, Sack J-R, Tamassia R, editors. Proceedings of the 6th international workshop on algorithms and data structures (WADS’99), Lecture Notes in Computer Science 1663. Heidelberg: Springer; 1999. pp. 166–173. [Google Scholar]
- 14.Zhang L. On a Mirkin–Muchnik–Smith conjecture for comparing molecular phylogenies. J Comput Biol. 1997;4(2):177–188. doi: 10.1089/cmb.1997.4.177. [DOI] [PubMed] [Google Scholar]
- 15.Hallett MT, Lagergren J. New algorithms for the duplication-loss model. In: Shamir R, Miyano S, Istrail S, Pevzner PA, Waterman MS, editors. Proceedings of the Fourth annual international conference on computational molecular biology (RECOMB) New York: ACM; 2000. pp. 138–146. [Google Scholar]
- 16.Górecki P. Reconciliation problems for duplication, loss and horizontal gene transfer. In: Bourne PE, Gusfield D, editors. Proceedings of the Eighth annual international conference on computational molecular biology (RECOMB) New York: ACM; 2004. pp. 316–325. [Google Scholar]
- 17.Zhang L. From gene trees to species trees II: species tree inference by minimizing deep coalescence events. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(9):1685–1691. doi: 10.1109/TCBB.2011.83. [DOI] [PubMed] [Google Scholar]
- 18.Chauve C, Doyon JP, El-Mabrouk N. Gene family evolution by duplication, speciation, and loss. J Comput Biol. 2008;15(8):1043–1062. doi: 10.1089/cmb.2008.0054. [DOI] [PubMed] [Google Scholar]
- 19.Vernot B, Stolzer M, Goldman A, Durand D. Reconciliation with non-binary species trees. J Comput Biol. 2008;15(8):981–1006. doi: 10.1089/cmb.2008.0092. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Bayzid MS, Mirarab S, Warnow T. Inferring optimal species trees under gene duplication and loss. In: Altman R, editor. Proceedings of pacific symposium on biocomputing (PSB) Hackensack: World Scientific Publishing; 2013. pp. 250–261. [DOI] [PubMed] [Google Scholar]
- 21.Górecki P, Tiuryn J. DLS-trees: a model of evolutionary scenarios. Theor Comput Sci. 2006;359(8):378–399. doi: 10.1016/j.tcs.2006.05.019. [DOI] [Google Scholar]
- 22.Gabow HN, Tarjan RE. A linear-time algorithm for a special case of disjoint set union. In: Johnson DS, Fagin R, Fredman ML, Harel D, Karp RM, Lynch NA, Papadimitriou CH, Rivest RL, Ruzzo WL, Seiferas JI, editors. Proceedings of the 15th annual ACM symposium on theory of computing (STOC) New York: ACM; 1983. pp. 246–251. [Google Scholar]
- 23.Doyon JP, Chauve C. Branch-and-bound approach for parsimonious inference of a species tree from a set of gene family trees. Adv Exp Med Biol. 2011;696:287–295. doi: 10.1007/978-1-4419-7046-6_29. [DOI] [PubMed] [Google Scholar]
- 24.Chang W-C, Górecki P, Eulenstein O. Exact solutions for species tree inference from discordant gene trees. J Bioinform Comput Biol. 2013;11(05):1342005. doi: 10.1142/S0219720013420055. [DOI] [PubMed] [Google Scholar]
- 25.Than CV, Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Computati Biol. 2009;5(9):1000501. doi: 10.1371/journal.pcbi.1000501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Yu Y, Warnow T, Nakhleh L. Algorithms for MDC-based multi-locus phylogeny inference: beyond rooted binary gene trees on single alleles. J Comput Biol. 2011;18(11):1543–1559. doi: 10.1089/cmb.2011.0174. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Maddison WP. Gene trees in species trees. Syst Biol. 1997;46(3):523–536. doi: 10.1093/sysbio/46.3.523. [DOI] [Google Scholar]
- 28.Than CV, Ruths D, Nakhleh L. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinform. 2008;9(1):322. doi: 10.1186/1471-2105-9-322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014;30(17):541–548. doi: 10.1093/bioinformatics/btu462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015;31(12):44–52. doi: 10.1093/bioinformatics/btv234. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Zhang C, Sayyari E, Mirarab S. ASTRAL-III: increased scalability and impacts of contracting low support branches. In: Meidanis J, Nakhleh L, editors. Comparative genomics: 15th international workshop, RECOMB CG 2017, Barcelona, Spain, October 4–6, 2017, Proceedings. Cham: Springer; 2017. pp. 53–75. [Google Scholar]
- 32.Vachaspati P, Warnow T. FastRFS: fast and accurate Robinson–Foulds supertrees using constrained exact optimization. Bioinformatics. 2017;33(5):631–639. doi: 10.1093/bioinformatics/btw600. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Molloy EK, Warnow T. To include or not to include: the impact of gene filtering on species tree estimation methods. Syst Biol. 2017 doi: 10.1093/sysbio/syx077. [DOI] [PubMed] [Google Scholar]
- 34.Mirarab S, Nguyen N, Guo S, Wang L, Kim J, Warnow TJ. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol. 2015;22(5):377–386. doi: 10.1089/cmb.2014.0156. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Nguyen N, Mirarab S, Kumar K, Warnow T. Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 2015;16(1):124. doi: 10.1186/s13059-015-0688-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Durand D, Halldórsson BV, Vernot B. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J Comput Biol. 2006;13(2):320–335. doi: 10.1089/cmb.2006.13.320. [DOI] [PubMed] [Google Scholar]
- 37.Noutahi E, Semeria M, Lafond M, Seguin J, Boussau B, Guéguen L, et al. Efficient gene tree correction guided by genome evolution. PLoS ONE. 2016;11(8):0159559. doi: 10.1371/journal.pone.0159559. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
Not applicable.