Abstract
Background
While tree-oriented methods for inferring orthology and paralogy relations between genes are based on reconciling a gene tree with a species tree, many tree-free methods are also available (usually based on sequence similarity). Recently, the link between orthology relations and gene trees has been formally considered from the perspective of reconstructing phylogenies from orthology relations. In this paper, we consider this link from a correction point of view. Indeed, a gene tree induces a set of relations, but the converse is not always true: a set of relations is not necessarily in agreement with any gene tree. A natural question is thus how to minimally correct an infeasible set of relations. Another natural question, given a gene tree and a set of relations, is how to minimally correct a gene tree so that the resulting gene tree fits the set of relations.
Results
We consider four variants of relation and gene tree correction problems, and provide hardness results for all of them. More specifically, we show that it is NP-Hard to edit a minimum number of relations in a set to make it consistent with a given species tree. We also show that the problem of finding a maximum subset of genes that share consistent relations is hard to approximate. We then demonstrate that editing a gene tree to satisfy a given set of relations in a minimum way is NP-Hard, where “minimum” refers either to the number of modified relations depicted by the gene tree or the number of clades that are lost. We also discuss some of the algorithmic perspectives given these hardness results.
Keywords: Orthology, Paralogy, NP-Hardness, Gene tree, Species tree
Background
Genes, the molecular units of heredity, hold the information to build and maintain cells. In the course of evolution, they are duplicated, lost, and passed to organisms through speciation. Genes originating from the same ancestral copy are called homologs. Homologous genes are grouped into gene families, usually via sequence similarity methods. Moreover, homologous genes can be orthologous, if their parental origin is a speciation, or paralogous, if it is a duplication. Orthologous genes are considered to be more similar in function than paralogs, a conjecture known as the orthology conjecture [1]. This is a major motivation for inferring gene evolution, as it is a prerequisite for functional prediction purposes.
Starting usually from a DNA or protein sequence alignment, the tree-based method requires building a phylogenetic tree, called a gene tree, for the considered gene family. Reconciliation [2] with the species tree then allows the inference of the evolutionary events (duplications and speciations) associated with the internal nodes of the gene tree. Hence the internal nodes of a gene tree can be labeled as duplications or speciations, and such a labeling induces a full set of orthology and paralogy relations between gene pairs. Tree-free methods are also available to detect orthology. These methods are based on gene clustering according to sequence similarity (cf. e.g. the COG database [3], OrthoMCL [4], InParanoid [5], Proteinortho [6]), synteny [7, 8] or functional annotation of genes [9]. Such methods are usually not able to detect a full set of relations, but only a partial set, i.e. some relations among genes are not inferred.
Recent papers [10, 11] have investigated, from a graph theory point of view, the link between trees and orthology/paralogy relations (we just say “relations” in the following). Given a gene family and a set of pairwise relations on its genes, a first problem is whether we can reconstruct a labeled gene tree for the family that induces these relations. The problem can be subdivided into two parts. First, we can consider whether the relation set is satisfiable, i.e. whether there exists an event-labeled gene tree G in agreement with it. However, satisfiability is not sufficient to ensure that the relation set reflects a true history, as nodes of G labeled as speciations can be contradictory. This raises the second question, which is the existence of an S-consistent gene tree, namely an event-labeled tree that can be obtained by reconciliation with a species tree S. A simple characterization of satisfiability is given in [10] when the set is a full set of relations (i.e. each pair of genes of the family is in the set). On the other hand, checking for S-consistency can be done in polynomial time for full sets [12, 13], and also for partial sets of relations [14].
In this paper we explore the link between relations and trees in the perspective of relation and tree correction. Several gene tree databases from whole genomes are available, including for instance Ensembl Compara [15], Hogenom [16], Phog [17], MetaPHOrs [18], PhylomeDB [19], Panther [20]. However, due to various limitations such as alignment errors, systematic artifacts of inference methods or insufficient differentiation between sequences, trees are known to contain errors and uncertainties. Consequently, a great deal of effort has been put towards tools for gene tree editing [21–29]. Most of them are based on selecting, in a neighborhood of an input tree, one best fitting the species tree.
Two years ago, we developed the first algorithm for gene tree correction using orthology relations [7]. Here we address, from a complexity and approximation point of view, the more general problem of correcting a gene tree according to a set of orthology and paralogy relations. We consider two objective functions: the number of unchanged relations (from orthology to paralogy or vice-versa), leading to the Maximum Homology Correction problem, and the number of unchanged clades (the Robinson-Foulds distance [30]), leading to the Maximum Clade Correction problem. We provide NP-completeness results for these two problems.
Conversely, we also address the problem of correcting a set of relations so that it represents a valid history in terms of S-consistency. A set of relations is usually represented as a graph R, where edges represent orthology relations and non-edges represent paralogy relations. The satisfiability problem related to S-consistency reduces to adding or removing a minimum number of edges of R in order to make it $P_4$-free (that is, it contains no induced path of length three), as shown in [10]. The problem is known to be NP-Hard and fixed-parameter tractable [31]. In [11], an integer linear programming formulation is used to correct relation graphs of reasonable size. An approximation algorithm whose factor depends on the degree of the graph R is given in [32]. The S-consistency problem, however, has never been studied.
In this paper, two criteria are considered for correcting a set R of relations: minimize the number of modified relations, and maximize the number of genes inducing an S-consistent set of relations. The first problem is shown to be NP-complete, while the second problem is shown to be not approximable within factor for any and any constant .
Trees and orthology relations
All trees considered in this paper are assumed to be rooted. They are not necessarily binary, but we assume that all internal nodes are of degree at least three, except possibly the root that can be of degree two. Given a set X, a tree T for X is a tree whose leafset is in bijection with X. We denote by V(T) the set of nodes and by r(T) the root of T. Given an internal node u of T, the subtree rooted at u is denoted $T_u$ and we call the leafset of $T_u$ the clade of u. A node u is an ancestor of v if u is on the (inclusive) path between v and the root, and we then call v a descendant of u. If $u \neq v$, then v is a strict descendant of u, and if u and v are connected by an edge of T, then v is a child of u. The lowest common ancestor (lca) of u and v, denoted $lca(u, v)$, is the ancestor common to both nodes that is the most distant from the root. We say that u and v are separated if and only if $lca(u, v) \notin \{u, v\}$ (i.e. none is an ancestor of the other). We define $lca(U)$ analogously for a set U of nodes. Let $X'$ be a subset of X. The restriction $T|_{X'}$ of T to $X'$ is the tree with leaf set $X'$ obtained from the subtree of T rooted at $lca(X')$, by removing all leaves that are not in $X'$, and contracting all internal nodes of degree two, except the root. Let $T'$ be a tree for some $X' \subseteq X$. We say that T displays $T'$ if and only if $T|_{X'}$ is $T'$.
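The ancestor, lowest common ancestor and separation notions defined above can be illustrated with a short Python sketch. The child-to-parent dictionary encoding and the small example tree are assumptions made for illustration only; they are not taken from the paper.

```python
def ancestors(parent, u):
    """Return the inclusive path from u up to the root, as a list."""
    path = [u]
    while parent.get(u) is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Lowest common ancestor: the common ancestor farthest from the root."""
    seen = set(ancestors(parent, u))
    for w in ancestors(parent, v):  # walk upward from v
        if w in seen:
            return w
    raise ValueError("nodes are not in the same tree")


def separated(parent, u, v):
    """u and v are separated iff neither is an ancestor of the other."""
    return lca(parent, u, v) not in (u, v)


# Hypothetical rooted tree ((a, b)v, c)r, encoded as child -> parent.
parent = {"r": None, "v": "r", "c": "r", "a": "v", "b": "v"}
print(lca(parent, "a", "b"), separated(parent, "a", "c"), separated(parent, "v", "a"))
```

Walking up parent pointers keeps the sketch short; for large trees a preprocessed lca structure would be the more usual choice.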
Evolution of a gene family
Species evolve through speciation, which is the separation of one species into distinct ones. A species tree S for a species set represents an ordered set of speciation events that have led to : an internal node is an ancestral species at the moment of a speciation event, and its children are the new descendant species. Inside the species’ genomes, genes undergo speciation when the species to which they belong do, but also duplications, and losses (other events such as transfers can happen, but we ignore them here). A gene family is a set of genes accompanied by a mapping function mapping each gene to its corresponding species. The evolutionary history of can be represented as a node-labeled gene tree for , where each internal node refers to an ancestral gene at the moment of an event (either speciation or duplication), and is labeled as a speciation (Spec) or duplication (Dup) accordingly.
Formally, we call a DS-tree for a gene family a pair consisting of a tree G whose leafset is the gene family, and a function labeling each internal node of G as a duplication or a speciation node (we drop the G subscript from the labeling function when it is clear from the context). Given a species tree S, the LCA-mapping function maps each gene of G, ancestral or extant, to a species as follows: an extant gene (a leaf of G) is mapped to the species it belongs to; otherwise, an ancestral gene is mapped to the lowest common ancestor in S of the species of the extant genes descending from it. An example is given in Fig. 1, where the label of each node of G represents its LCA-mapping with respect to S.
According to the Fitch [33] terminology, we say that two genes x, y of the family are orthologous in G if their lowest common ancestor in G is labeled as a speciation, and paralogous in G if it is labeled as a duplication. We denote by , respectively , the set of all gene pairs that are orthologous, respectively paralogous in G. By we mean (the same applies for ). In Fig. 1, while . We say that (resp. ) is an orthology (resp. paralogy) relation induced by G.
While a history for can be represented as a DS-tree, the converse is not always true, as a DS-tree G for does not necessarily represent a valid history. For this to hold, any speciation node of G should reflect a clustering of species in agreement with S [14]. Formally G should be S-consistent, as defined below.
Definition 1
Let S be a species tree and G be a DS-tree. Let v be an internal node of G labeled as a speciation. Then the speciation node v is S-consistent if and only if, for any two distinct children of v, the species to which they are LCA-mapped are separated in S.
We say that G is S-consistent if and only if every speciation node of G is S-consistent.
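Definition 1 translates fairly directly into a check. The sketch below is only an illustration under assumed encodings: a gene tree is a nested `(label, children)` tuple with gene names as leaves, the species tree is a child-to-parent dictionary, and the names `lca_mapping` and `is_s_consistent` are hypothetical.

```python
from functools import reduce
from itertools import combinations


def ancestors(parent, u):
    path = [u]
    while parent[u] is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    seen = set(ancestors(parent, u))
    return next(w for w in ancestors(parent, v) if w in seen)


def lca_mapping(node, species_of, parent):
    """Species a gene-tree node maps to: the lca in S of its leaves' species."""
    if isinstance(node, str):  # extant gene
        return species_of[node]
    _, children = node
    return reduce(lambda a, b: lca(parent, a, b),
                  (lca_mapping(c, species_of, parent) for c in children))


def is_s_consistent(node, species_of, parent):
    """Check that every speciation node has children mapped to separated species."""
    if isinstance(node, str):
        return True
    label, children = node
    if label == "Spec":
        mapped = [lca_mapping(c, species_of, parent) for c in children]
        for a, b in combinations(mapped, 2):
            if lca(parent, a, b) in (a, b):  # not separated in S
                return False
    return all(is_s_consistent(c, species_of, parent) for c in children)


# Hypothetical species tree ((a, b)v, c)r and gene-to-species map.
S = {"r": None, "v": "r", "c": "r", "a": "v", "b": "v"}
species_of = {"a1": "a", "a2": "a", "b1": "b", "c1": "c"}
G_good = ("Spec", [("Spec", [("Dup", ["a1", "a2"]), "b1"]), "c1"])
G_bad = ("Spec", [("Spec", ["a1", "c1"]), "b1"])
print(is_s_consistent(G_good, species_of, S), is_s_consistent(G_bad, species_of, S))
```

In `G_bad`, the inner speciation maps to the root of S, which is an ancestor of species b, so the root speciation of the gene tree violates the definition.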
Notice that G and S are not required to be binary. In particular, the definition of S-consistency for a speciation node v of G does not require v to be binary, even if S is binary. The reason is that in such a case, one can “refine” v into a set of binary S-consistent speciation nodes based on the topology of S. This operation does not affect the orthology and paralogy relations of the genes of G (see Fig. 1). Duplication nodes can be refined as well. Lemma 1 formalizes this intuition. This will serve to show that our results hold for both non-binary and binary gene trees.
Lemma 1
Let G be an S-consistent DS-tree for some binary species tree S. Then there is a binary DS-tree $G'$ such that $G'$ is S-consistent and induces exactly the same orthology and paralogy relations as G.
Proof
Let v be a highest non-binary node (i.e. v has no non-binary ancestors) of G with children . We show that v can be made to be binary while preserving and , which suffices to prove the Lemma since we can repeat this operation successively on every non-binary node.
If , obtain a DS-tree by removing from the children of v, adding a child to v and adding as children of , setting . Notice that for every , implying that all speciations remain S-consistent. It is readily seen that and .
If instead , let be the two children of . Let is an ancestor of , for . Notice that for any child of v, is a strict descendant of . For if not, v has a child such that . But since v is a speciation, is separated from for any , implying that is not the lca of and , contradicting the definition of . This strict descendant condition implies that partitions . Also observe that and V_2 cannot be empty, for otherwise would be equal to either or . Obtain by removing the children of v, adding two children and to v, then adding as children of and as children of . Set . Note that the children of and are still from separated species, and so both are S-consistent. As for v, by the definition of and , is a descendant of and is a descendant of (not necessarily a strict descendant). Therefore, both are separated and so v is S-consistent. The species for every other node remaining unchanged, we conclude that preserves S-consistency and does not modify nor .
We can verify that both DS-trees in Fig. 1 are S-consistent. For example, the speciation node z in has children from species v, c, d and w, which are pairwise separated in S. Notice that, from Definition 1, if G is a DS-tree, then the lca of two leaves of G belonging to the same species must be a duplication node. The converse is not true. For example, in the S-consistent gene tree G of Fig. 1, the parental node of and is a duplication node even though and belong to two different species.
Relation graph
A set of orthology/paralogy relations on (or simply a relation set) is a pair of subsets such that and if , then . The relation set is said full if . A DS-tree G induces a full set of relations.
We adopt the graph representation considered in [14] for full relation sets. A relation graph R on a gene family is a graph whose vertex set is the gene family, in which we interpret each edge uv of the edge set E(R) of R as an orthology relation between u and v, and each missing edge (non-edge) as a paralogy relation. Notice that if , then . The relation graph of a DS-tree G, denoted by R(G), is the graph whose vertex set is the gene family and whose edge set is the set of gene pairs that are orthologous in G (for example, see the relation graph R in Fig. 2).
A DS-tree for a gene family leads to a relation graph, but the converse is not always true. A relation graph R is satisfiable if there exists a DS-tree G such that $R(G) = R$. The problem of relation graph satisfiability has been addressed in [10]. The following theorem is a reformulation of one of the main results of that paper.
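As an illustration of how a DS-tree induces its relation graph, the following Python sketch collects the orthology edges, i.e. the gene pairs whose lowest common ancestor is a speciation. The nested `(label, children)` encoding and the function names are assumptions made for this example.

```python
from itertools import product


def leaves(node):
    """Leaf genes below a node of a nested (label, children) gene tree."""
    if isinstance(node, str):
        return [node]
    return [x for child in node[1] for x in leaves(child)]


def orthology_edges(node):
    """Edges of R(G): gene pairs whose lca in G is labeled Spec."""
    if isinstance(node, str):
        return set()
    label, children = node
    edges = set()
    for child in children:
        edges |= orthology_edges(child)
    if label == "Spec":
        groups = [leaves(child) for child in children]
        # Two genes have this node as lca iff they lie below different children.
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                edges |= {frozenset(p) for p in product(groups[i], groups[j])}
    return edges


# Hypothetical DS-tree: a1 and a2 arise from a duplication, b1 from a speciation.
G = ("Spec", [("Dup", ["a1", "a2"]), "b1"])
print(sorted(sorted(e) for e in orthology_edges(G)))
```

Every pair of genes missing from the returned set is a non-edge, hence a paralogy relation.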
Theorem 1
([10]) A relation graph R is satisfiable if and only if R is $P_4$-free, meaning that no four vertices of R induce a path of length three.
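For small graphs, the condition of Theorem 1 can be tested by brute force. The sketch below is an assumption-laden illustration rather than an efficient cograph recognition algorithm: it relies on the fact that four vertices induce a path of length three exactly when the induced degrees are 1, 1, 2, 2.

```python
from itertools import combinations


def has_induced_p4(vertices, edges):
    """True if some four vertices induce a path of length three."""
    adj = {frozenset(e) for e in edges}
    for quad in combinations(vertices, 4):
        deg = {v: 0 for v in quad}
        for x, y in combinations(quad, 2):
            if frozenset((x, y)) in adj:
                deg[x] += 1
                deg[y] += 1
        # Among 4-vertex graphs, only the path has degree sequence 1, 1, 2, 2.
        if sorted(deg.values()) == [1, 1, 2, 2]:
            return True
    return False


def is_satisfiable(vertices, edges):
    """Satisfiability test following Theorem 1 (brute force over 4-subsets)."""
    return not has_induced_p4(vertices, edges)


path = [("a", "b"), ("b", "c"), ("c", "d")]
cycle = path + [("d", "a")]
print(is_satisfiable(list("abcd"), path), is_satisfiable(list("abcd"), cycle))
```

The enumeration visits every 4-subset of vertices, so it is only meant for toy instances.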
For example, in Fig. 2, the relation graphs R and are satisfiable, while the graph is not. As a DS-tree does not necessarily represent a true history for (see previous section and Definition 1), satisfiability of a relation graph does not ensure a possible translation in terms of a history for . For this to hold, R should be consistent with the species tree, according to the following definition.
Definition 2
Given a species tree S, a relation graph R for is S-consistent if and only if R is satisfiable by a DS-tree G which is itself S-consistent.
For example the graph R in Fig. 2 is S-consistent. Notice that S-consistency implies satisfiability. Results from [14] complete the characterization of S-consistent graphs through Theorem 2. A triplet is a binary tree with leafset L of size three. For , we denote by xy|z the unique triplet T on L for which holds. Now is the subset of triplets of species induced by paths having exactly three vertices in :
We present in Theorem 2 a necessary and sufficient condition for S-consistency of a relation graph in terms of . First, we introduce in Lemma 2 an intermediate property, that is useful for proving Theorem 2.
Lemma 2
Let G be a DS-tree and S be a species tree. Then for any internal node v of G, there exist leaves x, y of such that both the following hold: (1) and (2).
Proof
We first show that (1) must hold for some . If has two children and for which there exist leaves x and y of such that is an ancestor of s(x) and an ancestor of s(y), then (1) holds. Thus if we suppose that (1) does not hold, then has a child such that all leaves of belong to a species that has as an ancestor. This implies that is a lower common ancestor than for the species present in , contradicting the definition of .
Now, take x and y satisfying (1). Suppose that (2) does not hold for x and y, i.e. , but that . Take such that z is separated from by v (i.e. ). We have . If , then we are done as x and z satisfy both (1) and (2). Otherwise, is on the path, implying that since separates s(x) from s(y). In this last case, y and z are the leaves of interest, ending the proof.
Theorem 2
Let be a satisfiable relation graph and let S be a species tree. Then R is S-consistent if and only if S displays all the triplets of .
Proof
: let G be an S-consistent gene tree satisfying R, and let such that but and . Then we must have and . We claim that S must display the s(x)s(y)|s(z) triplet. Let and . Since , xy|z must be a triplet of G. Moreover, since , and s(z) must be separated in S, implying that s(x)s(y)|s(z) is a triplet of S.
: by assumption, R is satisfiable by some DS-tree . We first obtain from a least-resolved DS-tree G satisfying R, in terms of speciation. That is, if has any speciation node v that has a speciation child w, we obtain by contracting v and w (delete w and give its children to v). Note that we have and so still satisfies R. We obtain the DS-tree G by repeating this operation until we cannot find such a v and w. We claim that if S displays the triplets of , then G is S-consistent.
Let v be a speciation node of G, and let be any two distinct children of v. By the construction of G, . By Lemma 2, has two leaves such that and . Similarly, has two leaves with and . Since v is a speciation while are duplications, we have while . Thus, and are induced paths of length two in R, which implies that S displays the and triplets. Analogously, S displays the and triplets. This is only possible if and are separated in S. We deduce that all child pairs of v are from separated species, and hence that G is S-consistent.
As an example, the graph in Fig. 2 is satisfiable but not S-consistent as the path of length 2 containing induces the triplet ac|b, while the triplet displayed by S is ab|c.
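The triplet condition of Theorem 2 can be sketched as follows. This is a hedged illustration: it assumes that an induced path x, z, y (edges xz and zy, non-edge xy) whose three genes come from pairwise distinct species induces the species triplet s(x)s(y)|s(z), as in the proof above, and it skips paths where two genes share a species. Satisfiability must still be checked separately. The encodings and function names are hypothetical.

```python
from itertools import combinations


def ancestors(parent, u):
    path = [u]
    while parent[u] is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    seen = set(ancestors(parent, u))
    return next(w for w in ancestors(parent, v) if w in seen)


def displays(parent, a, b, c):
    """True if the species tree displays the triplet ab|c (a, b, c distinct leaves)."""
    ab = lca(parent, a, b)
    return lca(parent, ab, c) != ab  # c must lie outside the clade of lca(a, b)


def induced_triplets(vertices, edges, species_of):
    """Species triplets induced by paths on three vertices of the relation graph."""
    adj = {frozenset(e) for e in edges}
    found = set()
    for z in vertices:  # z is the middle vertex of the path
        nbrs = [v for v in vertices if frozenset((v, z)) in adj]
        for x, y in combinations(nbrs, 2):
            if frozenset((x, y)) in adj:
                continue  # x, z, y form a triangle, not an induced path
            a, b, c = species_of[x], species_of[y], species_of[z]
            if len({a, b, c}) == 3:
                found.add((frozenset((a, b)), c))
    return found


def triplets_displayed(vertices, edges, species_of, parent):
    return all(displays(parent, *sorted(pair), c)
               for pair, c in induced_triplets(vertices, edges, species_of))


# Hypothetical species tree ((a, b)v, c)r with one gene per species.
S = {"r": None, "v": "r", "c": "r", "a": "v", "b": "v"}
species_of = {"a1": "a", "b1": "b", "c1": "c"}
genes = ["a1", "b1", "c1"]
bad = [("a1", "b1"), ("b1", "c1")]   # path a1, b1, c1 induces ac|b
good = [("a1", "c1"), ("b1", "c1")]  # path a1, c1, b1 induces ab|c
print(triplets_displayed(genes, bad, species_of, S),
      triplets_displayed(genes, good, species_of, S))
```

The first graph mirrors the situation described above: it is satisfiable, but its induced triplet ac|b conflicts with the triplet ab|c displayed by the species tree.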
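We end this section with additional notations that will be of use later. A subgraph $H'$ of H is a graph with $V(H') \subseteq V(H)$ and $E(H') \subseteq E(H)$. For a graph H and some $V' \subseteq V(H)$, the subgraph of H induced by $V'$, denoted $H[V']$, is the subgraph of H with vertex-set $V'$ having the maximum number of edges. We say that $H'$ is an induced subgraph of H if there is a subset $V' \subseteq V(H)$ such that $H' = H[V']$. If I is another graph, we say H is I-free if there is no $V' \subseteq V(H)$ such that $H[V']$ is isomorphic to I. Finally, for some edge set $E' \subseteq E(H)$, $H - E'$ is the subgraph with $V(H - E') = V(H)$ and $E(H - E') = E(H) \setminus E'$.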
Relation correction problems
We raise the issue of leaving out a minimum of information from a relation graph R in order to reach satisfiability and S-consistency. Two optimality criteria are considered: (1) the minimum number of edges that need to be removed; (2) the maximum number of genes that can be kept.
The minimum edge-removal consistency problem
Based on the same construction used in [34], we show that adding the information on the species tree S does not make the problem of removing the minimum number of edges leading to a $P_4$-free graph simpler. Although a similar reduction is likely to hold in the general case of edge-modification (removal or insertion) [31], here we focus on edge removal, as this formulation is needed in subsequent developments. We show the NP-Completeness of this problem, even when every gene from the family comes from a distinct species.
Minimum edge-removal consistency problem:
Input: A relation graph R for a gene family, a species tree S and an integer k;
Output: “Yes” if and only if there exists an S-consistent subgraph $R'$ of R with $V(R') = V(R)$ such that $|E(R)| - |E(R')| \le k$.
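To make the problem statement concrete, here is a minimal exhaustive-search sketch in Python. It is not the paper's method: it simply tries edge subsets of increasing size until a user-supplied predicate accepts the remaining graph, which takes exponential time. In the demonstration the predicate only tests $P_4$-freeness (satisfiability); a full S-consistency test would also need the species tree. All names are hypothetical.

```python
from itertools import combinations


def is_p4_free(vertices, edges):
    """Brute-force test that no four vertices induce a path of length three."""
    adj = {frozenset(e) for e in edges}
    for quad in combinations(vertices, 4):
        deg = {v: 0 for v in quad}
        for x, y in combinations(quad, 2):
            if frozenset((x, y)) in adj:
                deg[x] += 1
                deg[y] += 1
        if sorted(deg.values()) == [1, 1, 2, 2]:
            return False
    return True


def min_edge_removal(vertices, edges, is_ok):
    """Smallest edge set whose removal makes is_ok(vertices, kept_edges) true."""
    edges = [frozenset(e) for e in edges]
    for k in range(len(edges) + 1):  # try 0, 1, 2, ... removals
        for removed in combinations(edges, k):
            kept = [e for e in edges if e not in removed]
            if is_ok(vertices, kept):
                return k, set(removed)
    return None  # only reached if even the edgeless graph is rejected


path = [("a", "b"), ("b", "c"), ("c", "d")]
k, removed = min_edge_removal(list("abcd"), path, is_p4_free)
print(k, [sorted(e) for e in removed])
```

A decision-version answer for a given integer k is obtained by comparing the returned number of removals with k.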
Theorem 3
The Minimum Edge-Removal Consistency Problem is NP-Complete, even if $s(x) \neq s(y)$ for any two distinct genes x, y of the family.
Proof
Given a candidate subgraph as a certificate, Theorem 2 easily translates into a polynomial-time algorithm to verify that it is S-consistent. It is also clear that verifying that at most k edges were removed can be done quickly. The problem is therefore in NP. As for NP-Hardness, the reduction is from the exact 3-cover problem, a classic NP-Hard problem [35]: given a set W and a collection Z of 3-element subsets of W, does there exist a subcollection of Z that is a partition of W? We assume that .
Given arbitrary W and Z, we construct R and S by first defining the species set . Let and let and be two collections of all disjoint sets of species (i.e. for any distinct set , ), with and , for all . Let and be the species in X and Y. Then the species set is . Let and be three trees such that and . Then S is obtained by first connecting with to obtain a new tree , then connecting with (see Fig. 3). Therefore S has exactly leaves. The gene family is then constructed so that it contains exactly one gene per species, as mentioned in the Theorem statement. In other words the mapping is a bijection. Thus for simplicity, we make no distinction between a gene g and its species s(g). We then define R with such that each of the sets forms an individual clique. Finally we add two edge-sets and to R, where and . Then R has cliques, namely . Also, for , all edges between and are present, as well as all edges between and . Figure 3 gives an example with and .
Notice that the construction of R described above can clearly be done in polynomial time. We now show that W and Z admit an exact 3-cover if and only if R admits an S-consistent DS-tree after the deletion of at most edges.
() : let be a partition of W, . Let be the subgraph of R in which all edges between and are removed if and only if (which removes edges), and the only edges not removed from the W-clique are those belonging to a triangle with (which removes edges). An example of is given in Fig. 3. Thus there are exactly edges of R missing from , as desired. Clearly, is -free and thus satisfiable. To see that is S-consistent, we use Theorem 2. Notice that any path of length 3 in has the form with and for some i, inducing the speciation triplet, which is in agreement with S. Therefore there exists an S-consistent gene tree satisfying .
() : let be an S-consistent relation graph obtained by deleting at most edges from R. Then, must be -free. We show that is partitioned into triangles which form a solution to the 3-cover instance. Let . We claim that in , there is exactly one such that w has neighbors in . Suppose first there are and , , such that both and are neighbors of w in . Then there is some such that induce a , unless all edges between and were deleted. But we reach a contradiction since there are such edges. Therefore w has neighbors in at most one . Using that fact, we can see that w must have at least one neighbor in X, since otherwise at most edges between X and W would remain, implying the deletion of edges, more than permitted. This proves our claim.
Thus at best, each has neighbors in X, implying that at least deleted edges are between X and W. This leaves a maximum of other edges that can be deleted.
Now, let C be a connected component of . We claim that all vertices of C must have their X neighbors in the same . For suppose otherwise that there are two vertices of C such that has a neighbor and a neighbor with . It is easy to see that such and can be chosen to be neighbors. Then induce a $P_4$, a contradiction. Thus all vertices of C have their X neighbors in a common . Since each vertex of has three neighbors in W, this implies that C has at most three vertices. Suppose that C has at most two vertices. Then since all vertices of have at most two neighbors, it can have at most edges (obtained by counting the sum of degrees). This, however, implies that at least additional edges were deleted, more than the available.
We conclude that is partitioned into t connected components, each having three vertices. Moreover, each vertex in a given component C has neighbors in the same , implying that contains the members of C. Finally, since the components are all associated with a distinct , effectively defines a solution to the exact cover instance.
The Maximum Node Consistency problem
We introduce the Maximum Node Consistency Problem (in its decision version) and we consider the approximation complexity of the corresponding optimization version.
Maximum Node Consistency problem:
Input: A relation graph R for a gene family , a species tree S and an integer k;
Output: “Yes” if and only if there exists an S-consistent induced subgraph of R with .
We show that Maximum Node Consistency is hard to approximate within a factor for any and any constant , by giving a gap-preserving reduction from Maximum Independent Set (n is the number of nodes of R). We refer the reader to [36] for a definition of gap-preserving reduction. Consider an instance of Maximum Independent Set, with . We construct an instance of Maximum Node Consistency as follows.
First, we define the set of genes , i.e. the nodes of the relation graph R. Denote and for each , we define a set of m genes: . The gene set is .
Now, we define the species tree S. First consider S as any binary tree over m leaves , and replace each leaf by any binary subtree having m leaves (thus S has leaves). Each gene in is mapped to a leaf of in a bijective manner, and so each species has exactly one gene in R. We make no distinction between and s(g).
Now, define the relation graph . Set , and we get that . For each , forms a clique in R. Moreover, for each , define an edge , for each t with .
Let be a solution of Maximum Node Consistency over instance (R, S). Denote by the subset of nodes , that is those nodes of that have not been removed. We pay a particular attention to those that contain more than one node.
Lemma 3
Let be two subsets of nodes of a solution of Maximum Node Consistency over instance (R, S) such that and . Then there is no edge with one endpoint in and the other in .
Proof
Assume on the contrary that there is some q such that and share an edge. Consider a node of , which must exist since . The induced by and implies the triplet , while S contains the triplet .
Now, we are ready to prove the main result of this section.
Lemma 4
Let a graph H be an instance of Maximum Independent Set with m nodes, and let (R, S) be the corresponding instance of Maximum Node Consistency with $m^2$ nodes. Then:
Given an independent set of H, we can compute in polynomial time a solution of Maximum Node Consistency whose size is at least m times the size of the independent set;
Given a solution of Maximum Node Consistency on instance (R, S) of size at least km, we can compute in polynomial time an independent set of H of size at least k.
Proof
Consider an independent set of H and define a solution of Maximum Node Consistency on instance (R, S) of size at least as follows: remove each node of if and only if . Let be the corresponding solution of Maximum Node Consistency. Since is an independent set, it follows that consists only of cliques , disconnected one from the other. It has nodes and as is -free, it is S-consistent.
The case is trivial so we assume . Consider a solution of Maximum Node Consistency on instance (R, S) of size at least km, and consider the subsets in such that . Notice that we can assume that there exist at least k such sets, otherwise would contain at most nodes.
Given an index j, consider the set , i.e. the nodes with index j that belong to some subset larger than one. By Lemma 3 each set is an independent set. Now, pick the set having maximum cardinality. It follows that contains at least k nodes, since otherwise would have at most nodes. Hence, is an independent set of size at least k, thus concluding the proof.
We say a maximization problem cannot be approximated within a factor if, unless , for any approximation algorithm there are infinitely many instances for which outputs a solution with value AP such that , where OPT is the optimal value of a solution to the problem (note that equivalently, ). It is well-known that Maximum Independent Set cannot be approximated within a factor for any and for any constant [37].
Theorem 4
The optimization version of Maximum Node Consistency cannot be approximated within a factor for any and for any constant where n is the number of nodes of the given relation graph. Moreover, this result holds even on instances in which $s(x) \neq s(y)$ for any two distinct genes x, y.
Proof
Let H be a graph with m nodes and let (R, S) be the corresponding instance of Maximum Node Consistency with $m^2$ nodes. Denote by and , respectively, the values of an optimal solution for Maximum Independent Set and Maximum Node Consistency. Let be any approximation algorithm for Maximum Node Consistency, and let be the approximation algorithm for Maximum Independent Set that on input H, runs on the corresponding instance (R, S) and returns the independent set resulting from Lemma 4. Let and denote, respectively, the sizes of the solutions found by and . By Lemma 4 we get that and . Now,
as we may assume that . Since Maximum Independent Set cannot be approximated within a factor , for any and any constant , then for any and any constant there exist infinitely many instances H on which . Thus, it follows that
on infinitely many instances. Finally, the fact that the result holds even on instances in which $s(x) \neq s(y)$ for any two distinct genes x, y follows from the construction of R.
We get the following as an immediate corollary, which will be of use later:
Corollary 1
The decision version of Maximum Node Consistency is NP-Hard, even on instances in which $s(x) \neq s(y)$ for any two distinct genes x, y.
Gene tree correction problems
In this section, we are given a gene family, a species tree S, an S-consistent DS-tree G for the family, and a set C of orthology/paralogy constraints (not necessarily full). We focus on the problem of correcting G according to C in a minimal way. The goal is thus to find a DS-tree $G'$ inducing C such that the difference between G and $G'$ is minimum. We consider two ways of measuring the difference (or symmetrically the similarity) between gene trees, one based on conserved orthology/paralogy relations induced by the two trees, and one based on the number of conserved clades between the two trees, which corresponds to the Robinson-Foulds distance in the case that G, $G'$ and S are all binary trees.
The Maximum Homology Correction problem
Maximum Homology Correction problem:
Input: A species tree S, an S-consistent DS-tree G for a gene family , an integer k, a set O of orthology and a set P of paralogy relations;
Output: “Yes” if there exists an S-consistent DS-tree for with , such that .
Theorem 5
The Maximum Homology Correction problem is NP-Complete, even if S, G and the output DS-tree are required to be binary.
Proof
The problem is clearly in NP, as verifying S-consistency can be done in polynomial time, as well as counting the common orthologs/paralogs relations (the set of relations is quadratic in size). For our reduction, we use the Minimum Edge-Removal Consistency problem for the case of a gene family with at most one gene per genome, which is NP-Hard by Theorem 3. Given a species tree S, a relation graph R with V(R) in bijection with and an integer k, we construct an instance of the Maximum Homology Correction Problem, i.e. a species tree , a DS-tree G, an orthologous set O and paralogous set P.
Let and construct G by mimicking S - that is by first copying S and its leaf labels, then replacing each leaf of G by the gene . Note that if S is binary, then so is G. All internal nodes of G are labeled as speciations, so all genes of are pairwise orthologous. Thus R(G) is a clique. Finally, let and . Therefore the objective is to break a minimum of orthologies of G in order to satisfy P.
We show that that there is an S-consistent subgraph of R obtained by removing at most k edges if and only if there is an -consistent DS-tree satisfying O and P with at most relations that are not induced by G.
(⇒) Let be a solution to the Minimum Edge-Removal Consistency Problem for R and S. Then there exists an S-consistent DS-tree satisfying , which is obtained by deleting at most k edges from R. By Lemma 1, we may assume that if S is binary, then so is . Now, since has at most non-edges, has at most paralogs and is therefore a solution to the constructed instance of the Maximum Homology Correction Problem that breaks at most orthologies of R(G).
(⇐) Let be a solution, binary or not, to the constructed Maximum Homology Correction Problem instance and let . Since satisfies P and breaks at most orthologies, must have P as non-edges, plus at most k other non-edges. Thus can be obtained by removing at most k edges from , as desired.
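The polynomial-time counting step invoked for NP membership in the proof above can be illustrated with a short Python sketch. It computes the relations induced by a DS-tree and counts the relations shared by two DS-trees on the same leaf set. The nested-tuple encoding, the string labels "Spec" and "Dup", and all function names are assumptions made for this illustration only; the S-consistency test is not included.

```python
from itertools import combinations

# Assumed encoding: a leaf is a string (gene name); an internal node is a
# pair (label, children) where label is "Spec" or "Dup".

def leaves(node):
    """Return the list of leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]

def induced_relations(node):
    """Map each unordered pair of leaves to the label of their lowest common ancestor."""
    relations = {}
    if isinstance(node, str):
        return relations
    label, children = node
    for child in children:
        relations.update(induced_relations(child))
    # Two leaves taken from different children have this node as their LCA.
    for left, right in combinations(children, 2):
        for a in leaves(left):
            for b in leaves(right):
                relations[frozenset((a, b))] = label
    return relations

def common_relations(tree1, tree2):
    """Count the leaf pairs that receive the same relation in both trees."""
    rel1, rel2 = induced_relations(tree1), induced_relations(tree2)
    return sum(1 for pair, label in rel1.items() if rel2.get(pair) == label)

# Hypothetical example: the two trees differ only on the pair {a, b}.
G = ("Spec", [("Spec", ["a", "b"]), "c"])
G_corrected = ("Spec", [("Dup", ["a", "b"]), "c"])
print(common_relations(G, G_corrected))  # prints 2
```

Since the number of leaf pairs is quadratic in the number of genes, this enumeration stays polynomial, in line with the argument of the proof.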
The maximum clade correction problem
Maximum clade correction problem:
Input: A gene tree G, a species tree S, a set O of orthology and a set P of paralogy relations and an integer k;
Output: “Yes” if there exists an S-consistent DS-tree satisfying O and P that has at least k clades in common with G.
Notice that if S, G and the output tree are required to be binary, the effective measure between G and the output tree is the Robinson-Foulds distance. This special case is handled as part of the general proof. But first we need the following lemma, which uses grafting operations to add leaves to G and satisfy a prescribed relation without breaking other relations.
Connecting two trees corresponds to creating a new node x and giving it the roots of the two trees as its two children. Grafting a new leaf x to a tree T corresponds to adding x to T by either: (1) adding x as a new child of some node u of T; (2) connecting T with x; (3) subdividing an edge uv and adding x as a child of the newly created vertex.
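The three grafting operations just defined can be sketched in Python as follows. The `Node` class, the function names and the Newick-like printing helper are hypothetical and serve only to make the operations concrete; event labels and species mappings are left out.

```python
class Node:
    """Minimal rooted tree node (illustrative only)."""
    def __init__(self, name=None, children=None):
        self.name = name
        self.parent = None
        self.children = []
        for child in children or []:
            self.add_child(child)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

def connect(root1, root2):
    """Create a new node whose two children are the given roots; return it."""
    return Node(children=[root1, root2])

def graft_as_child(u, x):
    """Operation (1): add leaf x as a new child of node u."""
    u.add_child(x)

def graft_above_root(root, x):
    """Operation (2): connect the tree rooted at `root` with leaf x; return the new root."""
    return connect(root, x)

def graft_on_edge(u, v, x):
    """Operation (3): subdivide edge uv with a new node p and add x as a child of p."""
    p = Node()
    u.children[u.children.index(v)] = p  # p takes the place of v under u
    p.parent = u
    p.add_child(v)
    p.add_child(x)
    return p

def newick(node):
    """Parenthesized rendering of the topology, for inspection."""
    if not node.children:
        return node.name
    return "(" + ",".join(newick(child) for child in node.children) + ")"

# Hypothetical example starting from the tree ((a,b),c).
c = Node("c")
root = Node(children=[Node(children=[Node("a"), Node("b")]), c])
graft_on_edge(root, c, Node("x"))
print(newick(root))  # prints ((a,b),(c,x))
```

Operation (3) always creates a binary node, whereas operation (1) may increase the degree of u, which matters when the resulting tree is required to be binary.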
Lemma 5
Let G be an S-consistent gene tree, for some species tree S. Let x be a gene not in G and y be some gene in G with . Then there exists a gene tree obtained by grafting x to G such that the following conditions are satisfied:
x and y are orthologs in ;
and;
is S-consistent;
Proof
If is a strict descendant of , then it is easy to see that connecting x to r(G) under a common parent yields the desired result. So we assume is an ancestor of . If there is some node u of G such that adding x as a child of u satisfies the three conditions of the Lemma, then we are done. So assume that there is no node u to which we can add x as a child.
Let uv be an edge of G, and suppose that we graft x on uv to obtain . Call p the parent of x on , and say that u is the parent of p (i.e. p has children x and v). Note that if , then for any , implying that setting for all preserves S-consistency. We will find such a uv that guarantees this property, while ensuring that can be a speciation (i.e. is S-consistent), and that is S-consistent. This will prove the Lemma.
Let , and let g be the lowest ancestor of y in G such that is or an ancestor of . Note that the case in which g does not exist was handled in the beginning of the proof. Now suppose that . Denote by the child of g that is also an ancestor of y. Note that is an ancestor of and is a strict descendant of . We claim that . To see this, obtain by grafting x to , p being the parent of x and g the parent of p. Then, , and its children species and are separated in S by our choice of . Thus setting preserves S-consistency. Also, since is already an ancestor of . Finally, we are free to set , satisfying all the required conditions.
So instead suppose that . Recall that adding x as a child of g to obtain a new tree is not a solution. Since in , , we must either have , or is not S-consistent. By the choice of g, only the latter is possible, implying that all children of g are from separated species in G, but not in . Therefore, there must be a child of g such that is an ancestor of s(x). Note that must be unique since otherwise, would not be possible. We then claim that . Indeed, obtain by grafting x to , p being the parent of x. We have , and we set . The species of the children of g remain unchanged in , and so and is S-consistent, again satisfying all required conditions.
Theorem 6
The Maximum Clade Correction Problem is NP-Complete, even if S, G and the output DS-tree are required to be binary.
Proof
Verifying S-consistency and comparing the set of clades from G and can clearly be done in polynomial time, thus the problem is in NP. We use the Maximum Node Consistency problem for our reduction, which is NP-Hard by Corollary 1. Let R, S and k be the Maximum Node Consistency instance, letting R be the relation graph with , S the species tree and k an integer. Let (noting that when ). The constructed instance of the Maximum Clade Correction Problem uses the same species tree S. Construct G as follows: first consider G as any binary tree with n leaves , where each leaf is mapped to vertex of R. Then, replace each leaf by a subtree constructed as follows: is a caterpillar tree with leaves, and each leaf of is such that (recall that a caterpillar tree is a path to which we add a leaf child to each internal node). Let denote the set of the deepest leaves of (the depth of a leaf being the number of nodes on the path between and the root). Each leaf of is mapped to a distinct node of . Denote by the leaf of mapped to , and by the subtree of rooted at . Then G has exactly leaves and clades (since it is binary). An example is given in Fig. 4. Finally define the set of orthology relations to satisfy and the set of paralogy relations to satisfy. Note that each is present in exactly one relation.
We show that R admits an S-consistent induced subgraph with at least k nodes if and only if G, O and P admit an S-consistent DS-tree satisfying O and P such that G and share at least clades.
(⇒) Let be a solution to the Maximum Node Consistency instance, , and let H be a DS-tree satisfying that is S-consistent. By Lemma 1, we may assume that if S is binary, then so is H. Now, since , to each leaf of H corresponds a subtree in G. Then build a DS-tree from H by replacing each leaf of H by , and labeling all internal nodes of inserted trees as Dup (in Fig. 4, is the subtree of rooted at the common ancestor of and ). We first argue that is S-consistent and satisfies the subsets of O and P restricted to . In a subsequent step, we will graft the genes missing from using Lemma 5. Notice that and that all nodes of H that are in have the same LCA-mapping in both trees. It follows that is also S-consistent. Also, for all , . Thus for any pair of leaves in such that , is a speciation (by the construction of O from R and the fact that H satisfies ). By the same reasoning, if then and are paralogous in .
The solution is obtained by grafting to every leaf of G missing from whilst preserving the clades, maintaining satisfiability of O and P and S-consistency. If such a exists, then G and share at least k identical subtrees from , and since each contains clades, it follows that G and share at least clades as required. Let be the leaves yet missing from . Let (i.e. the leaves of L subject to an orthology constraint with some leaf already in ). The complement is the set of leaves of L that are either subject to a paralogy constraint with some leaf of , or not subject to any constraint with any leaf of . Let be the relation graph with vertex set and edge set , depicting the required orthologies within . Recall that each leaf of L is contained in at most one relation, implying that each node of has maximum degree 1. Thus is -free and therefore is S-consistent. Let be a DS-tree satisfying that is S-consistent, assuming that is binary if S is. We update by joining and under a common parent x, and labeling x as Dup. Notice that this does not modify any orthology or paralogy relation previously in or in , nor does it break S-consistency. This also ensures that paralogies with and are satisfied.
The final step is to graft the leaves of to in a way satisfying orthology requirements. This is done by successively applying Lemma 5 to each . As shown, each such can be grafted into without modifying any orthology or paralogy relation already in whilst satisfying the orthology requirement that is subject to. It is straightforward to see that in addition, can be grafted without breaking any clade present in , since every vertex in is mapped to the same species. The tree obtained after all these grafting operations, satisfies every O and P and has the required common clades with G.
(⇐) Let be a solution, binary or not, to the Maximum Clade Correction Problem instance. Denote by C the number of clades shared by both G and , with . Recall that is the set of the deepest leaves of in G, with being the subtree rooted at . Denote by the subtree of rooted at . We say that was preserved if every leaf of belongs to (in other words, the clade might have been extended, but only to include other leaves from ). We claim that at least k of the subtrees are preserved in . Assume, on the contrary, that at least subtrees from N are not preserved. Take a non-preserved subtree . Then some leaf belongs to the clade. This implies that for any ancestor x of in G, cannot contain the x clade. By construction of , has at least ancestors in G. Therefore, . This leads to , and then to , contradicting our choice of .
Now, let is preserved in . We have . Let and . Notice that to each corresponds exactly one subtree in H such that (and all such subtrees are disjoint). Let be the tree obtained by replacing every subtree in H by . Replacing by changes no LCA-mapping value since all vertices of map to . Thus as is S-consistent, then so are H and . Now, we claim that induces the set of relations represented by , which proves the theorem since . By contradiction, suppose that but is labeled Dup. Then is also labeled Dup, and so is . But , contradicting our assumption that is a solution. The same reasoning applies when , ending the proof.
Algorithmic avenues
As the problems considered in this paper are all computationally hard, only non-polynomial exact algorithms or approximation algorithms can realistically be explored. Let us generalize the Minimum Edge-Removal Consistency problem to the minimum editing problem (i.e. minimizing edge removals and insertions). It is not hard to imagine a branch-and-bound algorithm that solves the problem. Call an induced subgraph H of a relation graph R bad if it is a P4, or if it yields a triplet not displayed by S. Each P4 can be solved by six possible edge editings, and each contradictory triplet can be solved by three possible editings. Therefore, in a branch-and-bound process, one would verify if a given graph contains a bad subgraph and if so, proceed recursively on each graph obtained by an editing that removes it. If no bad subgraph exists, then the current graph is a possible solution and its number of editings is retained. If, at any point, the current graph has had more editings than the best solution encountered so far, the algorithm can stop the recursion. Notice however that an edge should not be edited more than once in order to avoid infinite loops. The idea of this branch-and-bound algorithm can also be applied to the Maximum Node Consistency problem. It is known that a P4, if one exists, can be found in linear time [38], and clearly a contradictory triplet, if any, can be found in polynomial time (though a more efficient algorithm may exist). A similar approach has been applied in [31] to design an FPT algorithm for the satisfiability problem.
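The branch-and-bound idea described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not an implementation of the full method: only induced P4s are treated as bad subgraphs (the test for triplets not displayed by S is omitted, so the species tree plays no role here), the P4 search is a brute-force enumeration rather than the linear-time algorithm of [38], and all function names are hypothetical.

```python
from itertools import combinations, permutations

def edge_set(pairs):
    """Build a set of undirected edges from an iterable of vertex pairs."""
    return {frozenset(pair) for pair in pairs}

def find_p4(vertices, edges):
    """Return (a, b, c, d) inducing the path a-b-c-d, or None (brute force)."""
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            path = edge_set([(a, b), (b, c), (c, d)])
            chords = edge_set([(a, c), (a, d), (b, d)])
            if path <= edges and not (chords & edges):
                return (a, b, c, d)
    return None

def min_p4_editing(vertices, edges):
    """Return (cost, edited edge set) minimizing edge insertions and removals."""
    best = {"cost": float("inf"), "edges": None}

    def recurse(current, edited):
        # Bound: stop if this branch cannot improve on the best solution so far.
        if len(edited) >= best["cost"]:
            return
        p4 = find_p4(vertices, current)
        if p4 is None:
            best["cost"], best["edges"] = len(edited), set(current)
            return
        # Branch on the six vertex pairs of the P4: toggling any of them destroys it.
        for pair in map(frozenset, combinations(p4, 2)):
            if pair in edited:  # never edit the same pair twice
                continue
            recurse(current ^ {pair}, edited | {pair})

    recurse(frozenset(edges), frozenset())
    return best["cost"], best["edges"]

# Hypothetical example: the path a-b-c-d needs a single editing.
cost, result = min_p4_editing("abcd", edge_set(["ab", "bc", "cd"]))
print(cost)  # prints 1
```

Forbidding a second editing of the same pair bounds the recursion depth by the number of vertex pairs, which is what rules out infinite loops.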
As for approximations, an algorithm proposed in [32] can be directly applied to the Minimum Edge-Removal Consistency problem and guarantees that the number of removed edges exceeds that of an optimal solution by at most a factor depending on the maximum degree of R. The idea is simple: as long as R has a bad subgraph H, remove every edge incident to a vertex of H and continue. Even though this is the best known approximation algorithm so far, it has the undesirable effect of isolating many vertices, motivating the exploration of alternative algorithms. One direction would be to consider existing ideas on the problem of satisfiability, i.e. finding the minimum number of editings required to make a graph P4-free, and adapt them to the consistency problem, for instance the Min-Cut algorithm proposed in [39].
As for gene tree correction, we have developed in [14] a polynomial-time algorithm which, given a species tree S and partial sets of relations O and P, verifies if there exists an S-consistent gene tree satisfying O and P and if so, constructs one among the set of all possible solutions. In order to correct a gene tree G, we can envisage an extension of this algorithm that takes G as input and picks, among the solutions of the algorithm, the one closest to G (either in terms of common homology relations or clades).
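The removal idea described above, and its tendency to isolate vertices, can be seen on a small example with the following Python sketch. It is a loose illustration under stated assumptions rather than the algorithm of [32] itself: only induced P4s are treated as bad subgraphs, the triplet test against S is omitted, the P4 search is brute force, and the function names are hypothetical.

```python
from itertools import combinations, permutations

def edge_set(pairs):
    """Build a set of undirected edges from an iterable of vertex pairs."""
    return {frozenset(pair) for pair in pairs}

def find_p4(vertices, edges):
    """Return (a, b, c, d) inducing the path a-b-c-d, or None (brute force)."""
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            path = edge_set([(a, b), (b, c), (c, d)])
            chords = edge_set([(a, c), (a, d), (b, d)])
            if path <= edges and not (chords & edges):
                return (a, b, c, d)
    return None

def remove_around_bad_subgraphs(vertices, edges):
    """Repeatedly delete every edge incident to a vertex of an induced P4."""
    remaining = set(edges)
    removed = set()
    while True:
        p4 = find_p4(vertices, remaining)
        if p4 is None:
            return remaining, removed
        # Every edge touching one of the four vertices is deleted.
        incident = {edge for edge in remaining if edge & set(p4)}
        remaining -= incident
        removed |= incident

# Hypothetical example: the path a-b-c-d-e.
remaining, removed = remove_around_bad_subgraphs("abcde", edge_set(["ab", "bc", "cd", "de"]))
print(len(removed), len(remaining))  # prints 4 0
```

On this path, removing the single edge bc would already leave a P4-free graph, while the sketch deletes all four edges and isolates every vertex.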
Conclusions
A gene tree induces a set of orthology and paralogy relations between members of a gene family, but the converse is not always true. In this paper we have shown that attempting to modify a set of relations as little as possible in order to ensure consistency with a species tree leads to the formulation of NP-Complete problems. Moreover, even assuming that the given relations are error-free, it remains computationally difficult to correct a gene tree in order to fit the given set of relations. As various model-free methods are available to infer orthology and paralogy, these correction problems are of practical biological interest. A future direction would be to explore the exact branch-and-bound algorithms and heuristics mentioned in the last section, and design fast approximation algorithms for the relation graph and gene tree editing problems.
Authors’ contribution
ML, RD and NE modeled the four problems presented, and devised and wrote the hardness proofs. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Funding
Publication of this work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de Recherche Nature et Technologies of Quebec (FRQNT).
Footnotes
The term ‘relation graph’ is also used in phylogenetics in the form of a generalization of a median network to a set of partitions. To make it clear, relation graphs in this paper have nothing to do with this notion.
Manuel Lafond, Riccardo Dondi and Nadia El-Mabrouk contributed equally to this work
Contributor Information
Manuel Lafond, Email: lafonman@iro.umontreal.ca.
Riccardo Dondi, Email: riccardo.dondi@unibg.it.
Nadia El-Mabrouk, Email: mabrouk@iro.umontreal.ca.
References
- 1. Ohno S. Evolution by gene duplication. Berlin: Springer; 1970.
- 2. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979;28:132–163. doi: 10.2307/2412519.
- 3. Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl Acids Res. 2000;28:33–36. doi: 10.1093/nar/28.1.33.
- 4. Li L, Stoeckert CJJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503.
- 5. Berglund AC, Sjolund E, Ostlund G, Sonnhammer EL. InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucl Acids Res. 2008;36:D263–D266. doi: 10.1093/nar/gkm1020.
- 6. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinform. 2011;12:124. doi: 10.1186/1471-2105-12-124.
- 7. Lafond M, Semeria M, Swenson KM, Tannier E, El-Mabrouk N. Gene tree correction guided by orthology. BMC Bioinform. 2013;14(supp 15):S5. doi: 10.1186/1471-2105-14-S15-S5.
- 8. Lafond M, Swenson K, El-Mabrouk N. Error detection and correction of gene trees. Models and algorithms for genome evolution. London: Springer; 2013.
- 9. Consortium TGO. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–29. doi: 10.1038/75556.
- 10. Hellmuth M, Hernandez-Rosales M, Huber K, Moulton V, Stadler P, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013;66(1–2):399–420. doi: 10.1007/s00285-012-0525-x.
- 11. Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. PNAS. 2014;112(7):2058–2063. doi: 10.1073/pnas.1412770112.
- 12. Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981;10:405–421. doi: 10.1137/0210030.
- 13. Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler P. From event-labeled gene trees to species trees. BMC Bioinform. 2012;13(Suppl. 19):56.
- 14. Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics. 2014;15(Suppl 6):12. doi: 10.1186/1471-2164-15-S6-S12.
- 15. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara gene trees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–335. doi: 10.1101/gr.073585.107.
- 16. Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M, Perrière G. Databases of homologous gene families for comparative genomics. BMC Bioinform. 2009;10(Suppl 6):S3. doi: 10.1186/1471-2105-10-S6-S3.
- 17. Datta RS, Meacham C, Samad B, Neyer C, Sjölander K. Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 2009;37:84–89. doi: 10.1093/nar/gkp373.
- 18. Pryszcz LP, Huerta-Cepas J, Gabaldón T. MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res. 2011;39:32. doi: 10.1093/nar/gkq953.
- 19. Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Denisov I, Kormes D, Marcet-Houben M, Gabaldón T. Phylomedb v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions. Nucleic Acids Res. 2011;39:556–560. doi: 10.1093/nar/gkq1109.
- 20. Mi H, Muruganujan A, Thomas PD. Panther in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2012;41:377–386. doi: 10.1093/nar/gks1118.
- 21. Chaudhary R, Burleigh JG, Eulenstein O. Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence. BMC Bioinform. 2011;13(Supp. 10):11. doi: 10.1186/1471-2105-13-S10-S11.
- 22. Chen K, Durand D, Farach-Colton M. Notung: dating gene duplications using gene family trees. J Comput Biol. 2000;7:429–447. doi: 10.1089/106652700750050871.
- 23. Dondi R, El-Mabrouk N, Swenson KM. Gene tree correction for reconciliation and species tree inference: complexity and algorithms. J Discret Algorithms. 2014;25:51–65. doi: 10.1016/j.jda.2013.06.001.
- 24. Doroftei A, El-Mabrouk N. Removing noise from gene trees. In: Przytycka TM, Sagot M-F, editors. WABI 2011. Lecture notes in bioinformatics. vol. 6833. Berlin, Heidelberg: Springer; 2011. p. 76–91.
- 25. Gorecki P, Eulenstein O. Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem. BMC Bioinform. 2011;13(Supp 10):14. doi: 10.1186/1471-2105-13-S10-S14.
- 26. Gorecki P, Eulenstein O. A linear-time algorithm for error-corrected reconciliation of unrooted gene trees. In: Chen J, Wang J, Zelikovsky A, editors. ISBRA 2011. Lecture notes in bioinformatics. vol. 6674. Berlin, Heidelberg: Springer; 2011. p. 148–159.
- 27. Lafond M, Chauve C, Dondi R, El-Mabrouk N. Polytomy refinement for the correction of dubious duplications in gene trees. Bioinformatics. 2014;30(17):519–526. doi: 10.1093/bioinformatics/btu463.
- 28. Swenson KM, Doroftei A, El-Mabrouk N. Gene tree correction for reconciliation and species tree inference. Algorithms Mol Biol. 2012;7:31. doi: 10.1186/1748-7188-7-31.
- 29. Nguyen TH, Ranwez V, Pointet S, Chifolleau AM, Doyon JP, Berry V. Reconciliation and local gene tree rearrangement can be of mutual profit. Algorithms Mol Biol. 2013;8(8):12. doi: 10.1186/1748-7188-8-12.
- 30. Robinson D, Foulds L. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–147. doi: 10.1016/0025-5564(81)90043-2.
- 31. Liu Y, Wang J, Guo J, Chen J. Complexity and parameterized algorithms for cograph editing. Theor Comput Sci. 2012;461:45–54. doi: 10.1016/j.tcs.2011.11.040.
- 32. Natanzon A, Shamir R, Sharan R. Complexity classification of some edge modification problems. Discret Appl Math. 2001;113(1):109–128. doi: 10.1016/S0166-218X(00)00391-7.
- 33. Fitch WM. Homology a personal view on some of the problems. Trends Genet. 2000;16(5):227–231. doi: 10.1016/S0168-9525(00)02005-9.
- 34. El-Mallah ES, Colbourn CJ. The complexity of some edge deletion problems. IEEE Trans Circuits Syst. 1988;35(3):354–362. doi: 10.1109/31.1748.
- 35. Garey MR, Johnson DS. Computers and intractability: a guide to the theory of NP-completeness. San Francisco: WH Freeman & Co.; 1979.
- 36. Vazirani VV. Approximation algorithms. New York: Springer; 2003.
- 37. Zuckerman D. Linear degree extractors and the inapproximability of max clique and chromatic number. Proc Thirty Eight Annu ACM Symp Theor Comput. 2007;3(1):103–128.
- 38. Bretscher A, Corneil DG, Habib M, Paul C. A simple linear time lexbfs cograph recognition algorithm. SIAM J Discret Math. 2008;22(4):1277–1296. doi: 10.1137/060664690.
- 39. Altenhoff AM, Gil M, Gonnet GH, Dessimoz C. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS One. 2013;8(1):53786. doi: 10.1371/journal.pone.0053786.