Algorithms for Molecular Biology. 2016 Apr 16;11:4. doi: 10.1186/s13015-016-0067-7

The link between orthology relations and gene trees: a correction perspective

Manuel Lafond 1,✉,#, Riccardo Dondi 2,#, Nadia El-Mabrouk 1,#
PMCID: PMC4833969  PMID: 27087831

Abstract

Background

While tree-oriented methods for inferring orthology and paralogy relations between genes are based on reconciling a gene tree with a species tree, many tree-free methods are also available (usually based on sequence similarity). Recently, the link between orthology relations and gene trees has been formally considered from the perspective of reconstructing phylogenies from orthology relations. In this paper, we consider this link from a correction point of view. Indeed, a gene tree induces a set of relations, but the converse is not always true: a set of relations is not necessarily in agreement with any gene tree. A natural question is thus how to minimally correct an infeasible set of relations. Another natural question, given a gene tree and a set of relations, is how to minimally correct a gene tree so that the resulting gene tree fits the set of relations.

Results

We consider four variants of relation and gene tree correction problems, and provide hardness results for all of them. More specifically, we show that it is NP-Hard to edit a minimum set of relations to make them consistent with a given species tree. We also show that the problem of finding a maximum subset of genes that share consistent relations is hard to approximate. We then demonstrate that editing a gene tree to satisfy a given set of relations in a minimum way is NP-Hard, where “minimum” refers either to the number of modified relations depicted by the gene tree or the number of clades that are lost. We also discuss some of the algorithmic perspectives given these hardness results.

Keywords: Orthology, Paralogy, NP-Hardness, Gene tree, Species tree

Background

Genes, the molecular units of heredity, hold the information to build and maintain cells. In the course of evolution, they are duplicated, lost, and passed to organisms through speciation. Genes originating from the same ancestral copy are called homologs. Homologous genes are grouped into gene families, usually via sequence similarity methods. Moreover, homologous genes can be orthologous, if their parental origin is a speciation, or paralogous, if it is a duplication. Orthologous genes are considered to be more similar in function than paralogs, a conjecture known as the orthology conjecture [1]. This is a major motivation for inferring gene evolution, as it is a prerequisite for functional prediction purposes.

Starting usually from a DNA or protein sequence alignment, tree-based methods first build a phylogenetic tree, called a gene tree, for the gene family under consideration. Reconciliation [2] with the species tree then makes it possible to infer the evolutionary events (duplications and speciations) associated with the internal nodes of the gene tree. Hence the internal nodes of a gene tree can be labeled as duplications or speciations, and such a labeling induces a full set of orthology and paralogy relations between gene pairs. Tree-free methods for detecting orthology are also available. These methods are based on gene clustering according to sequence similarity (e.g. the COG database [3], OrthoMCL [4], InParanoid [5], Proteinortho [6]), synteny [7, 8] or functional annotation of genes [9]. Such methods usually detect only a partial set of relations, i.e. some relations among genes are not inferred.

Recent papers [10, 11] have investigated, from a graph theory point of view, the link between trees and orthology/paralogy relations (we just say “relations” in the following). Given a gene family Γ and a set C of pairwise relations, a first problem is whether we can reconstruct a labeled gene tree for Γ inducing C. The problem can be subdivided into two parts. First, we can consider whether C is satisfiable, i.e. whether there exists an event-labeled gene tree G in agreement with C. However satisfiability is not sufficient to ensure the possibility for the relation set to reflect a true history, as nodes of G labeled as speciations can be contradictory. This raises the second question which is the existence of an S-consistent gene tree, namely an event-labeled tree that can be obtained by reconciliation with a species tree S. A simple characterization of satisfiability is given in [10], when the set C is a full set of relations (i.e. each pair of genes of Γ is in C). On the other hand, checking for S-consistency can be done in polynomial-time for full sets [12, 13], and also partial sets of relations [14].

In this paper we explore the link between relations and trees in the perspective of relation and tree correction. Several gene tree databases from whole genomes are available, including for instance Ensembl Compara [15], Hogenom [16], Phog [17], MetaPHOrs [18], PhylomeDB [19], Panther [20]. However, due to various limitations such as alignment errors, systematic artifacts of inference methods or insufficient differentiation between sequences, trees are known to contain errors and uncertainties. Consequently, a great deal of effort has been put towards tools for gene tree editing [21–29]. Most of them are based on selecting, in a neighborhood of an input tree, one best fitting the species tree.

Two years ago, we developed the first algorithm for gene tree correction using orthology relations [7]. Here we address, from a complexity and approximation point of view, the more general problem of correcting a gene tree according to a set of orthology and paralogy relations. We consider two objective functions: the number of unchanged relations (from orthology to paralogy or vice-versa), leading to the Maximum Homology Correction problem, and the number of unchanged clades (the Robinson-Foulds distance [30]), leading to the Maximum Clade Correction problem. We provide NP-completeness results for these two problems.

Conversely, we also address the problem of correcting a set of relations so that it represents a valid history in terms of S-consistency. A set of relations is usually represented as a graph R, where edges represent orthologous relations and non-edges represent paralogous relations. The satisfiability problem related to S-consistency reduces to adding or removing a minimum number of edges of R in order to make it $P_4$-free (that is, it contains no induced path of length three), as shown in [10]. The problem is known to be NP-Hard and fixed parameter tractable [31]. In [11], an integer linear programming formulation is used to correct relation graphs of reasonable size. An approximation algorithm of factor $4\Delta$, where Δ is the degree of the graph R, is given in [32]. The S-consistency problem, however, has never been studied.

In this paper, two criteria are considered for correcting a set R of relations: minimize the number of modified relations, and maximize the number of genes inducing an S-consistent set of relations. The first problem is shown to be NP-complete, while the second problem is shown to be not approximable within factor $d\,n^{\frac{1}{2}(1-\varepsilon)}$, for any $0<\varepsilon<1$ and any constant $d>0$.

Trees and orthology relations

All trees considered in this paper are assumed to be rooted. They are not necessarily binary, but we assume that all internal nodes are of degree at least three, except possibly the root that can be of degree two. Given a set X, a tree T for X is a tree whose leafset $L(T)$ is in bijection with X. We denote by $V(T)$ the set of nodes and by $r(T)$ the root of T. Given an internal node u of T, the subtree rooted at u is denoted $T_u$ and we call the leafset $L(T_u)$ the clade of u. A node u is an ancestor of v if u is on the (inclusive) path between v and the root, and we then call v a descendant of u. If $u \neq v$, then v is a strict descendant of u, and if u and v are connected by an edge of T, then v is a child of u. The lowest common ancestor (lca) of u and v, denoted $lca_T(u,v)$, is the ancestor common to both nodes that is the most distant from the root. We say that u and v are separated if and only if $lca_T(u,v) \notin \{u,v\}$ (i.e. none is an ancestor of the other). We define $lca_T(U)$ analogously for a set U of nodes. Let $L'$ be a subset of $L(T)$. The restriction $T|_{L'}$ of T to $L'$ is the tree with leaf set $L'$ obtained from the subtree of T rooted at $lca_T(L')$, by removing all leaves that are not in $L'$, and contracting all internal nodes of degree two, except the root. Let $T'$ be a tree such that $L(T') = L' \subseteq L(T)$. We say that T displays $T'$ if and only if $T|_{L'}$ is $T'$.
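The notions of ancestor, lowest common ancestor and separation can be made concrete with a minimal sketch. It assumes a rooted tree encoded as a child-to-parent dictionary; the encoding, the function names and the small example tree are illustrative choices, not taken from the paper.

```python
def path_to_root(parent, u):
    """Inclusive path from u up to the root (u's ancestors, u included)."""
    path = [u]
    while u in parent:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Common ancestor of u and v that is the most distant from the root."""
    ancestors_of_u = set(path_to_root(parent, u))
    for w in path_to_root(parent, v):  # walk upward from v
        if w in ancestors_of_u:
            return w
    raise ValueError("nodes do not belong to the same tree")


def separated(parent, u, v):
    """u and v are separated iff neither is an ancestor of the other."""
    return lca(parent, u, v) not in (u, v)


# Hypothetical tree ((a,b)x,c)root, encoded as child -> parent.
example_parent = {"a": "x", "b": "x", "x": "root", "c": "root"}
print(lca(example_parent, "a", "b"), separated(example_parent, "a", "x"))
```

The quadratic-time walk is chosen for readability; constant-time lca queries after linear preprocessing exist but are not needed to follow the definitions.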

Evolution of a gene family

Species evolve through speciation, which is the separation of one species into distinct ones. A species tree S for a species set Σ represents an ordered set of speciation events that have led to Σ: an internal node is an ancestral species at the moment of a speciation event, and its children are the new descendant species. Inside the species’ genomes, genes undergo speciation when the species to which they belong do, but also duplications, and losses (other events such as transfers can happen, but we ignore them here). A gene family is a set of genes Γ accompanied by a mapping function $s: \Gamma \to \Sigma$ mapping each gene to its corresponding species. The evolutionary history of Γ can be represented as a node-labeled gene tree for Γ, where each internal node refers to an ancestral gene at the moment of an event (either speciation or duplication), and is labeled as a speciation (Spec) or duplication (Dup) accordingly.

Formally, we call a DS-tree for Γ a pair $(G, ev_G)$, where G is a tree with $L(G)=\Gamma$, and $ev_G : V(G) \setminus L(G) \to \{Dup, Spec\}$ is a function labeling each internal node of G as a duplication or a speciation node (we drop the G subscript from $ev_G$ when it is clear from the context). Given a species tree S, the LCA-mapping function $s_G : V(G) \to V(S)$ maps each gene of G, ancestral or extant, to a species as follows: if $g \in L(G)$, then $s_G(g)=s(g)$; otherwise, $s_G(g)=lca_S(\{s(g') : g' \in L(G_g)\})$. An example is given in Fig. 1, where the label of each node of G represents its LCA-mapping with respect to S.
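The LCA-mapping can be computed bottom-up, since the lca of the species below a node equals the lca of the values already computed for its children. The following is a minimal sketch under assumed encodings: the gene tree is a dictionary from each internal node to its list of children, and the species tree is a child-to-parent dictionary. All names and the toy data are hypothetical.

```python
def lca(parent, u, v):
    """Lowest common ancestor in a tree given as a child -> parent dict."""
    seen = {u}
    while u in parent:
        u = parent[u]
        seen.add(u)
    while v not in seen:
        v = parent[v]  # KeyError here means u and v share no ancestor
    return v


def lca_mapping(children, root, s, species_parent):
    """Return the dict g -> s_G(g) for every node g of the gene tree."""
    s_G = {}

    def visit(g):
        kids = children.get(g, [])
        if not kids:                      # extant gene: s_G(g) = s(g)
            s_G[g] = s[g]
        else:                             # ancestral gene: lca over the children
            mapped = visit(kids[0])
            for k in kids[1:]:
                mapped = lca(species_parent, mapped, visit(k))
            s_G[g] = mapped
        return s_G[g]

    visit(root)
    return s_G


# Toy data: species tree ((a,b)ab,c)abc and gene tree ((a1,b1)u,c1)r.
species_parent = {"a": "ab", "b": "ab", "ab": "abc", "c": "abc"}
children = {"r": ["u", "c1"], "u": ["a1", "b1"]}
s = {"a1": "a", "b1": "b", "c1": "c"}
print(lca_mapping(children, "r", s, species_parent))
```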

Fig. 1.


A species tree S, a binary DS-tree G and a non-binary DS-tree $G'$. In DS-trees, Dup nodes are indicated by squares. All other internal nodes are speciation nodes. Each leaf $\alpha_i$ denotes a gene belonging to the genome α. G is a refinement of $G'$ such that $O(G)=O(G')$ and $P(G)=P(G')$. Notice that, although in this example the gene trees contain exactly one gene copy from each genome, this is not a requirement. Another example with multiple gene copies in genome a is given in Fig. 2

According to the Fitch [33] terminology, we say that two genes x, y of Γ are orthologous in G if $ev(lca_G(x,y))=Spec$, and paralogous in G if $ev(lca_G(x,y))=Dup$. We denote by $O(G)$, respectively $P(G)$, the set of all gene pairs that are orthologous, respectively paralogous in G. By $xy \in O(G)$ we mean $\{x,y\} \in O(G)$ (the same applies for $P(G)$). In Fig. 1, $a_1c_1 \in O(G)$ while $a_1b_1 \in P(G)$. We say that $a_1c_1$ (resp. $a_1b_1$) is an orthology (resp. paralogy) relation induced by G.
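The relations induced by a DS-tree can be enumerated directly from this definition: two leaves have a given node as their lca exactly when they lie below two different children of that node. The sketch below assumes a DS-tree encoded as nested tuples `(label, [children])` with leaves as strings; this encoding and the toy tree are assumptions made for illustration only.

```python
from itertools import combinations


def _collect(node, orthologs, paralogs):
    """Return the leaves below node, recording every pair whose lca is node."""
    if isinstance(node, str):             # a leaf (extant gene)
        return [node]
    label, kids = node
    groups = [_collect(k, orthologs, paralogs) for k in kids]
    target = orthologs if label == "Spec" else paralogs
    for left, right in combinations(groups, 2):
        for x in left:
            for y in right:               # lca(x, y) is the current node
                target.add(frozenset((x, y)))
    return [leaf for group in groups for leaf in group]


def relations(ds_tree):
    """Return (O(G), P(G)) as two sets of unordered gene pairs."""
    orthologs, paralogs = set(), set()
    _collect(ds_tree, orthologs, paralogs)
    return orthologs, paralogs


# Toy DS-tree: a speciation root above a duplication of a1 and b1, and c1.
toy = ("Spec", [("Dup", ["a1", "b1"]), "c1"])
O, P = relations(toy)
print(sorted(map(sorted, O)), sorted(map(sorted, P)))
```

Every pair of leaves is visited once, at its lca, so the two sets always form a full relation set.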

While a history for Γ can be represented as a DS-tree, the converse is not always true, as a DS-tree G for Γ does not necessarily represent a valid history. For this to hold, any speciation node of G should reflect a clustering of species in agreement with S [14]. Formally G should be S-consistent, as defined below.

Definition 1

Let S be a species tree and G be a DS-tree. Let v be an internal node of G such that $ev(v)=Spec$. Then the speciation node v is S-consistent if and only if for any two distinct children $v_1, v_2$ of v, $s_G(v_1)$ and $s_G(v_2)$ are separated in S.

We say that G is S-consistent if and only if every speciation node of G is S-consistent.
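Definition 1 translates into a single bottom-up pass: compute the LCA-mapping of each node and, at every speciation, test that the images of its children are pairwise separated in S. The sketch below assumes the nested-tuple encoding `(label, [children])` for DS-trees and a child-to-parent dictionary for S; the names and example trees are hypothetical.

```python
from itertools import combinations


def lca(parent, u, v):
    """Lowest common ancestor in a tree given as a child -> parent dict."""
    seen = {u}
    while u in parent:
        u = parent[u]
        seen.add(u)
    while v not in seen:
        v = parent[v]
    return v


def is_s_consistent(ds_tree, s, species_parent):
    """True iff every speciation node of the DS-tree is S-consistent."""
    consistent = True

    def visit(node):
        nonlocal consistent
        if isinstance(node, str):
            return s[node]
        label, kids = node
        images = [visit(k) for k in kids]          # s_G of each child
        if label == "Spec":
            for a, b in combinations(images, 2):
                if lca(species_parent, a, b) in (a, b):   # not separated
                    consistent = False
        mapped = images[0]
        for other in images[1:]:
            mapped = lca(species_parent, mapped, other)
        return mapped                              # s_G of this node

    visit(ds_tree)
    return consistent


# Toy species tree ((a,b)ab,c)abc.
species_parent = {"a": "ab", "b": "ab", "ab": "abc", "c": "abc"}
s = {"a1": "a", "a2": "a", "b1": "b", "c1": "c"}
good = ("Spec", [("Spec", ["a1", "b1"]), "c1"])
bad = ("Spec", [("Spec", ["a1", "c1"]), "b1"])   # b lies below lca(a, c)
print(is_s_consistent(good, s, species_parent), is_s_consistent(bad, s, species_parent))
```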

Notice that G and S are not required to be binary. In particular, the definition of S-consistency for a speciation node v of G does not require v to be binary, even if S is binary. The reason is that in such a case, one can “refine” v into a set of binary S-consistent speciation nodes based on the topology of S. This operation does not affect the orthology and paralogy relations of the genes of G (see Fig. 1). Duplication nodes can be refined as well. Lemma 1 formalizes this intuition. This will serve to show that our results hold for both non-binary and binary gene trees.

Lemma 1

Let G be an S-consistent DS-tree for some binary species tree S. Then there is a binary DS-tree $G'$ such that $G'$ is S-consistent, and such that $O(G)=O(G')$ and $P(G)=P(G')$.

Proof

Let v be a highest non-binary node (i.e. v has no non-binary ancestors) of G with children $v_1, \ldots, v_k$. We show that v can be made to be binary while preserving $O(G)$ and $P(G)$, which suffices to prove the Lemma since we can repeat this operation successively on every non-binary node.

If $ev_G(v)=Dup$, obtain a DS-tree $G'$ by removing $v_2, \ldots, v_k$ from the children of v, adding a child $v'$ to v and adding $v_2, \ldots, v_k$ as children of $v'$, setting $ev_{G'}(v')=Dup$. Notice that $s_G(w)=s_{G'}(w)$ for every $w \in V(G) \cap V(G') = V(G') \setminus \{v'\}$, implying that all speciations remain S-consistent. It is readily seen that $O(G)=O(G')$ and $P(G)=P(G')$.

If instead $ev_G(v)=Spec$, let $s_1, s_2 \in V(S)$ be the two children of $s_G(v)$. Let $V_j=\{v_i : s_j \text{ is an ancestor of } s_G(v_i),\ 1 \le i \le k\}$ for $j \in \{1,2\}$. Notice that for any child $v_i$ of v, $s_G(v_i)$ is a strict descendant of $s_G(v)$. For if not, v has a child $v_i$ such that $s_G(v_i)=s_G(v)$. But since v is a speciation, $s_G(v_i)=s_G(v)$ is separated from $s_G(v_j)$ for any $j \neq i$, implying that $s_G(v)$ is not the lca of $s_G(v_i)$ and $s_G(v_j)$, contradicting the definition of $s_G$. This strict descendant condition implies that $\{V_1, V_2\}$ partitions $\{v_1, \ldots, v_k\}$. Also observe that $V_1$ and $V_2$ cannot be empty, for otherwise $s_G(v)$ would be equal to either $s_1$ or $s_2$. Obtain $G'$ by removing the children of v, adding two children $w_1$ and $w_2$ to v, then adding $V_1$ as children of $w_1$ and $V_2$ as children of $w_2$. Set $ev_{G'}(v)=ev_{G'}(w_1)=ev_{G'}(w_2)=Spec$. Note that the children of $w_1$ and $w_2$ are still from separated species, and so both are S-consistent. As for v, by the definition of $V_1$ and $V_2$, $s_{G'}(w_1)$ is a descendant of $s_1$ and $s_{G'}(w_2)$ is a descendant of $s_2$ (not necessarily a strict descendant). Therefore, both are separated and so v is S-consistent. The species for every other node remaining unchanged, we conclude that $G'$ preserves S-consistency and does not modify $O(G)$ nor $P(G)$.

We can verify that both DS-trees in Fig. 1 are S-consistent. For example, the speciation node z in G has children from species v, c, d and w, which are pairwise separated in S. Notice that, from Definition 1, if G is an S-consistent DS-tree, then the lca of two leaves of G belonging to the same species must be a duplication node. The converse is not true. For example, in the S-consistent gene tree G of Fig. 1, the parental node of $e_1$ and $f_1$ is a duplication node even though $e_1$ and $f_1$ belong to two different species.

Relation graph

A set of orthology/paralogy relations on Γ (or simply a relation set) is a pair $C=(C_O, C_P)$ of subsets $C_O, C_P \subseteq \binom{\Gamma}{2}$ such that $C_O \cap C_P = \emptyset$ and if $s(x)=s(y)$, then $\{x,y\} \in C_P$. The relation set is said to be full if $C_O \cup C_P = \binom{\Gamma}{2}$. A DS-tree G induces a full set $(O(G), P(G))$ of relations.

We adopt the graph representation considered in [14] for full relation sets. A relation graph R on a gene family Γ is a graph with vertex set $V(R)=\Gamma$, in which we interpret each edge uv of the edge set $E(R)$ of R as an orthology relation between u and v, and each missing edge (non-edge) $uv \notin E(R)$ as a paralogy relation.$^{1}$ Notice that if $s(u)=s(v)$, then $uv \notin E(R)$. The relation graph of a DS-tree G, denoted by $R(G)$, is the graph with vertex set $L(G)$ and edge set $O(G)$ (for example, see the relation graph R in Fig. 2).

Fig. 2.


A species tree S and a DS-tree G which is S-consistent. The full orthology set induced by G is represented by the relation graph R. The graph $R'$ is an example of a graph that is not satisfiable, as $\{c_1, b_1, d_1, a_2\}$ induces a $P_4$, while $R''$ is an example of a graph that is satisfiable (it has no induced $P_4$) but not S-consistent (explanation is given in the text)

A DS-tree for a gene family Γ leads to a relation graph, but the converse is not always true. A relation graph R is satisfiable if there exists a DS-tree G such that $R(G)=R$. The problem of relation graph satisfiability has been addressed in [10]. The following theorem is a reformulation of one of the main results of that paper.

Theorem 1

([10]) A relation graph R is satisfiable if and only if R is $P_4$-free, meaning that no four vertices of R induce a path of length three.
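For small relation graphs, the condition of Theorem 1 can be checked by brute force over all 4-subsets of vertices. This is only an illustrative sketch with hypothetical names; $P_4$-free graphs (cographs) can be recognized in linear time by dedicated algorithms that are not shown here.

```python
from itertools import combinations, permutations


def has_induced_p4(vertices, edges):
    """True iff some four vertices induce a path a-b-c-d with no other edge."""
    edge_set = {frozenset(e) for e in edges}

    def adjacent(u, v):
        return frozenset((u, v)) in edge_set

    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            path_edges = adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
            chords = adjacent(a, c) or adjacent(a, d) or adjacent(b, d)
            if path_edges and not chords:
                return True
    return False


def is_satisfiable(vertices, edges):
    """Theorem 1: a relation graph is satisfiable iff it is P4-free."""
    return not has_induced_p4(vertices, edges)


# A path on four genes is not satisfiable; a 4-cycle is.
genes = ["g1", "g2", "g3", "g4"]
print(is_satisfiable(genes, [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]))
print(is_satisfiable(genes, [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g1")]))
```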

For example, in Fig. 2, the relation graphs R and $R''$ are satisfiable, while the graph $R'$ is not. As a DS-tree does not necessarily represent a true history for Γ (see previous section and Definition 1), satisfiability of a relation graph does not ensure a possible translation in terms of a history for Γ. For this to hold, R should be consistent with the species tree, according to the following definition.

Definition 2

Given a species tree S, a relation graph R for Γ is S-consistent if and only if R is satisfiable by a DS-tree G which is itself S-consistent.

For example the graph R in Fig. 2 is S-consistent. Notice that S-consistency implies satisfiability. Results from [14] complete the characterization of S-consistent graphs through Theorem 2. A triplet is a binary tree with leafset L of size three. For $L=\{x,y,z\}$, we denote by $xy|z$ the unique triplet T on L for which $lca_T(x,y) \neq r(T)$ holds. Now $P_3(R)$ is the subset of triplets of species induced by paths having exactly three vertices in $R=(V,E)$:

$$P_3(R)=\{\, s(x)s(y)|s(z) \;:\; zx, zy \in E \ \text{and}\ xy \notin E \ \text{and}\ s(x) \neq s(y) \,\}$$

We present in Theorem 2 a necessary and sufficient condition for S-consistency of a relation graph in terms of P3(R). First, we introduce in Lemma 2 an intermediate property, that is useful for proving Theorem 2.

Lemma 2

Let G be a DS-tree and S be a species tree. Then for any internal node v of G, there exist leaves x, y of $G_v$ such that both the following hold: (1) $lca_S(s(x),s(y))=s_G(v)$ and (2) $lca_G(x,y)=v$.

Proof

We first show that (1) must hold for some $x, y \in L(G_v)$. If $s_G(v)$ has two children $s_1$ and $s_2$ for which there exist leaves x and y of $G_v$ such that $s_1$ is an ancestor of $s(x)$ and $s_2$ an ancestor of $s(y)$, then (1) holds. Thus if we suppose that (1) does not hold, then $s_G(v)$ has a child $s'$ such that all leaves of $G_v$ belong to a species that has $s'$ as an ancestor. This implies that $s'$ is a lower common ancestor than $s_G(v)$ for the species present in $G_v$, contradicting the definition of $s_G$.

Now, take x and y satisfying (1). Suppose that (2) does not hold for x and y, i.e. $lca_S(s(x),s(y))=s_G(v)$, but that $lca_G(x,y) \neq v$. Take $z \in L(G_v)$ such that z is separated from $lca_G(x,y)$ by v (i.e. $lca_G(z, lca_G(x,y))=v$). We have $lca_G(x,z)=lca_G(y,z)=v$. If $lca_S(s(x),s(z))=s_G(v)$, then we are done as x and z satisfy both (1) and (2). Otherwise, $lca_S(s(x),s(z))$ is on the $s(x)$-$s_G(v)$ path, implying that $lca_S(s(y),s(z))=s_G(v)$ since $s_G(v)$ separates $s(x)$ from $s(y)$. In this last case, y and z are the leaves of interest, ending the proof.

Theorem 2

Let $R=(V,E)$ be a satisfiable relation graph and let S be a species tree. Then R is S-consistent if and only if S displays all the triplets of $P_3(R)$.

Proof

($\Rightarrow$): let G be an S-consistent gene tree satisfying R, and let $x, y, z \in V(R)$ such that $zx, zy \in E(R)$ but $xy \notin E(R)$ and $s(x) \neq s(y)$. Then we must have $zx, zy \in O(G)$ and $xy \in P(G)$. We claim that S must display the $s(x)s(y)|s(z)$ triplet. Let $\alpha=lca_G(x,y)$, $\beta=lca_G(x,z)$ and $\gamma=lca_G(y,z)$. Since $ev_G(\alpha) \neq ev_G(\beta)=ev_G(\gamma)$, $xy|z$ must be a triplet of G. Moreover, since $ev_G(\gamma)=ev_G(\beta)=Spec$, $lca_S(s(x),s(y))$ and $s(z)$ must be separated in S, implying that $s(x)s(y)|s(z)$ is a triplet of S.

($\Leftarrow$): by assumption, R is satisfiable by some DS-tree G. We first obtain from G a least-resolved DS-tree $G'$ satisfying R, in terms of speciation. That is, if G has any speciation node v that has a speciation child w, we obtain $G'$ by contracting v and w (delete w and give its children to v). Note that we have $O(G)=O(G')$ and so $G'$ still satisfies R. We obtain the DS-tree $G'$ by repeating this operation until we cannot find such a v and w. We claim that if S displays the triplets of $P_3(R)$, then $G'$ is S-consistent.

Let v be a speciation node of $G'$, and let $v_1, v_2$ be any two distinct children of v. By the construction of $G'$, $ev_{G'}(v_1)=ev_{G'}(v_2)=Dup$. By Lemma 2, $G'_{v_1}$ has two leaves $x_1, x_2$ such that $lca_S(s(x_1),s(x_2))=s_{G'}(v_1)$ and $lca_{G'}(x_1,x_2)=v_1$. Similarly, $G'_{v_2}$ has two leaves $y_1, y_2$ with $lca_S(s(y_1),s(y_2))=s_{G'}(v_2)$ and $lca_{G'}(y_1,y_2)=v_2$. Since v is a speciation while $v_1, v_2$ are duplications, we have $x_1x_2, y_1y_2 \notin E(R)$ while $x_1y_1, x_1y_2, x_2y_1, x_2y_2 \in E(R)$. Thus, $x_1y_1x_2$ and $x_1y_2x_2$ are induced paths of length two in R, which implies that S displays the $s(x_1)s(x_2)|s(y_1)$ and $s(x_1)s(x_2)|s(y_2)$ triplets. Analogously, S displays the $s(y_1)s(y_2)|s(x_1)$ and $s(y_1)s(y_2)|s(x_2)$ triplets. This is only possible if $lca_S(s(x_1),s(x_2))=s_{G'}(v_1)$ and $lca_S(s(y_1),s(y_2))=s_{G'}(v_2)$ are separated in S. We deduce that all child pairs of v are from separated species, and hence that $G'$ is S-consistent.
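The contraction step used in the second half of the proof (merging every speciation child into its speciation parent) can be sketched as a bottom-up rewrite. The nested-tuple encoding `(label, [children])`, with leaves as strings, is an assumption made for illustration and is not part of the paper.

```python
def contract_speciations(node):
    """Return a least-resolved copy in which no Spec node has a Spec child."""
    if isinstance(node, str):                 # leaves are left untouched
        return node
    label, kids = node
    kids = [contract_speciations(k) for k in kids]   # children first
    if label == "Spec":
        merged = []
        for k in kids:
            if not isinstance(k, str) and k[0] == "Spec":
                merged.extend(k[1])           # delete k, give its children to node
            else:
                merged.append(k)
        kids = merged
    return (label, kids)


# Toy DS-tree with a chain of two speciations above a duplication.
toy = ("Spec", [("Spec", ["a1", ("Dup", ["b1", "b2"])]), "c1"])
print(contract_speciations(toy))
```

Because children are processed before their parent, a speciation child that is merged has itself no speciation child left, so one pass suffices. The lca of two leaves may move to the merged node, but its label stays Spec, which is why the orthology set is unchanged.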

As an example, the graph $R''$ in Fig. 2 is satisfiable but not S-consistent as the path of length 2 containing $\{a_1, b_1, c_1\}$ induces the triplet $ac|b$, while the triplet displayed by S is $ab|c$.
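The criterion of Theorem 2 can be sketched as follows: enumerate the triplets of $P_3(R)$, then test each one against S using the fact that S displays $ab|c$ exactly when $lca_S(a,b)$ is a strict descendant of the lca of all three species. The encodings (edge list, child-to-parent dictionary for S) and all names are illustrative assumptions, and the input graph is assumed to be already known to be $P_4$-free.

```python
from itertools import combinations


def lca(parent, u, v):
    """Lowest common ancestor in a tree given as a child -> parent dict."""
    seen = {u}
    while u in parent:
        u = parent[u]
        seen.add(u)
    while v not in seen:
        v = parent[v]
    return v


def p3_triplets(vertices, edges, s):
    """Triplets s(x)s(y)|s(z) for induced paths x - z - y with s(x) != s(y)."""
    edge_set = {frozenset(e) for e in edges}
    triplets = set()
    for z in vertices:
        neighbours = [x for x in vertices if frozenset((x, z)) in edge_set]
        for x, y in combinations(neighbours, 2):
            if frozenset((x, y)) not in edge_set and s[x] != s[y]:
                triplets.add((frozenset((s[x], s[y])), s[z]))
    return triplets


def displays(species_parent, pair, c):
    """True iff the species tree displays the triplet ab|c, with pair = {a, b}."""
    a, b = tuple(pair)
    ab = lca(species_parent, a, b)
    return lca(species_parent, ab, c) != ab


def is_s_consistent_graph(vertices, edges, s, species_parent):
    """Theorem 2 test; assumes the relation graph is satisfiable (P4-free)."""
    return all(displays(species_parent, pair, c)
               for pair, c in p3_triplets(vertices, edges, s))


# Toy species tree ((a,b)ab,c)root with one gene per species.
species_parent = {"a": "ab", "b": "ab", "ab": "root", "c": "root"}
s = {"a1": "a", "b1": "b", "c1": "c"}
genes = ["a1", "b1", "c1"]
print(is_s_consistent_graph(genes, [("a1", "b1"), ("b1", "c1")], s, species_parent))
```

In this toy instance the path $a_1 - b_1 - c_1$ yields the triplet $ac|b$, which the toy species tree does not display, mirroring the kind of conflict described for Fig. 2.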

We end this section with additional notations that will be of use later. A subgraph $H'$ of H is a graph with $V(H') \subseteq V(H)$ and $E(H') \subseteq E(H)$. For a graph H and some $V' \subseteq V(H)$, the subgraph of H induced by $V'$, denoted $H[V']$, is the subgraph of H with vertex-set $V'$ having the maximum number of edges. We say that $H'$ is an induced subgraph of H if there is a subset $V' \subseteq V(H)$ such that $H'=H[V']$. If I is another graph, we say H is I-free if there is no $V' \subseteq V(H)$ such that $H[V']$ is isomorphic to I. Finally, for some edge set $E' \subseteq E(H)$, $H-E'$ is the subgraph $H'$ with $V(H')=V(H)$ and $E(H')=E(H) \setminus E'$.

Relation correction problems

We raise the issue of leaving out a minimum of information from a relation graph R in order to reach satisfiability and S-consistency. Two optimality criteria are considered: (1) the minimum number of edges that need to be removed; (2) the maximum number of genes that can be kept.

The minimum edge-removal consistency problem

Based on the same construction used in paper [34], we show that adding the information on the species tree S does not make the problem of removing the minimum number of edges leading to a P4-free graph simpler. Although a similar reduction is likely to hold in the general case of edge-modification (removal or insertion) [31], here we focus on edge removal, as this formulation is needed in subsequent developments. We show the NP-Completeness of this problem, even when every gene from the family Γ comes from a distinct species.

Minimum edge-removal consistency problem:

Input: A relation graph R for a gene family Γ, a species tree S and an integer k;

Output: “Yes” if and only if there exists an S-consistent subgraph $R'$ of R with $V(R')=V(R)$ such that $|E(R) \setminus E(R')| \le k$.

Theorem 3

The Minimum Edge-Removal Consistency Problem is NP-Complete, even if for any distinct $g_1, g_2 \in \Gamma$, $s(g_1) \neq s(g_2)$.

Proof

Given $R'$ as a certificate, Theorem 2 easily translates into a polynomial-time algorithm to verify that $R'$ is S-consistent. It is also clear that verifying if $|E(R) \setminus E(R')| \le k$ can be done quickly. The problem is therefore in NP. As for NP-Hardness, the reduction is from the exact 3-cover problem, a classic NP-Hard problem [35]: given a set $W=\{w_1, \ldots, w_{3t}\}$ and a collection $Z=\{Z_1, \ldots, Z_r\}$ of 3-element subsets of W, does there exist $Z' \subseteq Z$ such that $|Z'|=t$ and $Z'$ is a partition of W? We assume that $r \ge t$.

Given arbitrary W and Z, we construct R and S by first defining the species set Σ. Let $\alpha=\binom{3t}{2}$ and let $X=\{X_1, \ldots, X_r\}$ and $Y=\{Y_1, \ldots, Y_r\}$ be two collections of pairwise disjoint sets of species (i.e. for any distinct sets $A, B \in X \cup Y$, $A \cap B = \emptyset$), with $|X_i|=\alpha$ and $|Y_i|=r^2\alpha$, for all $1 \le i \le r$. Let $X_\Sigma=\bigcup_{1 \le i \le r} X_i$ and $Y_\Sigma=\bigcup_{1 \le i \le r} Y_i$ be the species in X and Y. Then the species set is $\Sigma=W \cup X_\Sigma \cup Y_\Sigma$. Let $S_W$, $S_X$ and $S_Y$ be three trees such that $L(S_W)=W$, $L(S_X)=X_\Sigma$ and $L(S_Y)=Y_\Sigma$. Then S is obtained by first connecting $r(S_Y)$ with $r(S_W)$ to obtain a new tree $S_{WY}$, then connecting $r(S_{WY})$ with $r(S_X)$ (see Fig. 3). Therefore S has exactly $|\Sigma|=3t+r(\alpha+r^2\alpha)$ leaves. The gene family Γ is then constructed so that it contains exactly one gene per species, as mentioned in the Theorem statement. In other words the mapping $s: \Gamma \to \Sigma$ is a bijection. Thus for simplicity, we make no distinction between a gene g and its species $s(g)$. We then define R with $V(R)=\Sigma$ such that each of the sets $W, X_1, \ldots, X_r, Y_1, \ldots, Y_r$ forms an individual clique. Finally we add two edge-sets $E_1$ and $E_2$ to R, where $E_1=\{g_1g_2 : g_1 \in X_i,\ g_2 \in Z_i,\ \text{for a given } 1 \le i \le r\}$ and $E_2=\{g_1g_2 : g_1 \in X_i,\ g_2 \in Y_i,\ \text{for a given } 1 \le i \le r\}$. Then R has $2r+1$ cliques, namely $W, X_1, \ldots, X_r, Y_1, \ldots, Y_r$. Also, for $1 \le i \le r$, all edges between $X_i$ and $Y_i$ are present, as well as all edges between $X_i$ and $Z_i$. Figure 3 gives an example with $t=2$ and $W=\{1,2,3,4,5,6\}$.
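The relation graph of this reduction can be generated mechanically, which is a convenient way to check the sizes involved. The sketch below assumes the reading $\alpha=\binom{3t}{2}$ and $|Y_i|=r^2\alpha$ of the construction; the vertex labels `("x", i, j)` and `("y", i, j)` and the function name are hypothetical, and the species tree itself is not built.

```python
from itertools import combinations
from math import comb


def build_reduction_graph(W, Z):
    """Build X, Y, the edge set of R and the edge-deletion budget for (W, Z)."""
    t, r = len(W) // 3, len(Z)
    alpha = comb(3 * t, 2)
    X = [[("x", i, j) for j in range(alpha)] for i in range(r)]
    Y = [[("y", i, j) for j in range(r * r * alpha)] for i in range(r)]

    edges = set()
    for clique in [list(W)] + X + Y:            # W, each X_i and each Y_i are cliques
        edges.update(frozenset(p) for p in combinations(clique, 2))
    for i in range(r):
        edges.update(frozenset((x, w)) for x in X[i] for w in Z[i])   # E_1
        edges.update(frozenset((x, y)) for x in X[i] for y in Y[i])   # E_2

    budget = 3 * alpha * (r - t) + (alpha - 3 * t)
    return X, Y, edges, budget


# Instance used in the caption of Fig. 3.
W = [1, 2, 3, 4, 5, 6]
Z = [{1, 2, 3}, {2, 3, 4}, {3, 5, 6}, {4, 5, 6}]
X, Y, edges, budget = build_reduction_graph(W, Z)
print(len(X[0]), len(Y[0]), budget)
```

Each $Y_i$ is deliberately much larger than the budget, which is what later forbids deleting all edges between a vertex of $X_i$ and $Y_i$.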

Fig. 3.


S represents the species tree and R the relation graph constructed from the sets W, Z, X and Y. The illustration is given for $W=\{1,2,3,4,5,6\}$ and $Z=\{\{1,2,3\},\{2,3,4\},\{3,5,6\},\{4,5,6\}\}$. $Z'=\{\{1,2,3\},\{4,5,6\}\}$ is a subset of Z which is a partition of W. $R'$ is the “corrected” relation graph corresponding to $Z'$

Notice that the construction of R described above can clearly be done in polynomial time. We now show that W and Z admit an exact 3-cover if and only if R admits an S-consistent DS-tree after the deletion of at most 3α(r-t)+(α-3t) edges.
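As a sanity check on these quantities, the instance of Fig. 3 ($t=2$, $r=4$) can be worked out by hand, reading $\alpha$ as $\binom{3t}{2}$ and $|Y_i|$ as $r^2\alpha$. This is only a worked instance of the formulas, not an additional result.

```latex
\begin{align*}
\alpha &= \binom{3t}{2} = \binom{6}{2} = 15, \qquad |X_i| = \alpha = 15, \qquad |Y_i| = r^2\alpha = 16 \cdot 15 = 240,\\
3\alpha(r-t) + (\alpha - 3t) &= 3 \cdot 15 \cdot 2 + (15 - 6) = 90 + 9 = 99 < 240 = |Y_i| .
\end{align*}
```

The first term counts the edges between W and the sets $X_i$ whose $Z_i$ is not selected, and the second counts the edges of the W-clique that lie outside the $t$ selected triangles.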

($\Rightarrow$): let $Z' \subseteq Z$ be a partition of W, $|Z'|=t$. Let $R'$ be the subgraph of R in which all edges between $Z_i$ and $X_i$ are removed if and only if $Z_i \notin Z'$ (which removes $3\alpha(r-t)$ edges), and the only edges not removed from the W-clique are those belonging to a $Z_i$ triangle with $Z_i \in Z'$ (which removes $\alpha-3t$ edges). An example of $R'$ is given in Fig. 3. Thus there are exactly $3\alpha(r-t)+(\alpha-3t)$ edges of R missing from $R'$, as desired. Clearly, $R'$ is $P_4$-free and thus satisfiable. To see that $R'$ is S-consistent, we use Theorem 2. Notice that any induced path on three vertices in $R'$ has the form $w x_i y_i$ with $w \in W$, $x_i \in X_i$ and $y_i \in Y_i$ for some i, inducing the $w y_i | x_i$ speciation triplet, which is in agreement with S. Therefore there exists an S-consistent gene tree G satisfying $R'$.

($\Leftarrow$): let $R'$ be an S-consistent relation graph obtained by deleting at most $3\alpha(r-t)+(\alpha-3t)$ edges from R. Then, $R'$ must be $P_4$-free. We show that $R'[W]$ is partitioned into triangles which form a solution to the 3-cover instance. Let $w \in W$. We claim that in $R'$, there is exactly one $X_i \in X$ such that w has neighbors in $X_i$. Suppose first there are $x_1 \in X_i$ and $x_2 \in X_j$, $i \neq j$, such that both $x_1$ and $x_2$ are neighbors of w in $R'$. Then there is some $y \in Y_i$ such that $y x_1 w x_2$ induce a $P_4$, unless all edges between $x_1$ and $Y_i$ were deleted. But we reach a contradiction since there are $r^2\alpha > 3\alpha(r-t)+(\alpha-3t)$ such edges. Therefore w has neighbors in at most one $X_i \in X$. Using that fact, we can see that w must have at least one neighbor in X, since otherwise at most $(3t-1)\alpha$ edges between X and W would remain, implying the deletion of $3\alpha r-(3t-1)\alpha=3\alpha(r-t)+\alpha$ edges, more than permitted. This proves our claim.

Thus at best, each $w \in W$ has α neighbors in X, implying that at least $3\alpha r-3t\alpha=3\alpha(r-t)$ deleted edges are between X and W. This leaves a maximum of $\alpha-3t$ other edges that can be deleted.

Now, let C be a connected component of $R'[W]$. We claim that all vertices of C must have their X neighbors in the same $X_i \in X$. For suppose otherwise that there are two vertices $c_1, c_2$ of C such that $c_1$ has a neighbor $x_1 \in X_i$ and $c_2$ a neighbor $x_2 \in X_j$ with $i \neq j$. It is easy to see that such $c_1$ and $c_2$ can be chosen to be neighbors. Then $x_1, c_1, c_2, x_2$ induce a $P_4$, a contradiction. Thus all vertices of C have their X neighbors in a common $X_i \in X$. Since each vertex of $X_i$ has three neighbors in W, this implies that C has at most three vertices. Suppose that some component C has two vertices or less. Then since all vertices of $R'[W]$ have at most two neighbors, it can have at most $\frac{1}{2}(2(3t-2)+2)=3t-1$ edges (obtained by counting the sum of degrees). This, however, implies that at least $\alpha-(3t-1)$ additional edges were deleted, more than the $\alpha-3t$ available.

We conclude that $R'[W]$ is partitioned into t connected components, each having three vertices. Moreover, each vertex in a given component C has neighbors in the same $X_i \in X$, implying that $Z_i$ contains the members of C. Finally since the components are all associated with a distinct $Z_i$, $R'[W]$ effectively defines a solution to the exact cover instance.

The Maximum Node Consistency problem

We introduce the Maximum Node Consistency Problem (in its decision version) and we consider the approximation complexity of the corresponding optimization version.

Maximum Node Consistency problem:

Input: A relation graph R for a gene family Γ, a species tree S and an integer k;

Output: “Yes” if and only if there exists an S-consistent induced subgraph $R'$ of R with $|V(R')| \ge k$.

We show that Maximum Node Consistency is hard to approximate within a factor $d\,n^{\frac{1}{2}(1-\varepsilon)}$ for any $0<\varepsilon<1$ and any constant $d>0$, by giving a gap-preserving reduction from Maximum Independent Set (n is the number of nodes of R). We refer the reader to [36] for a definition of gap-preserving reduction. Consider an instance $H=(V_H,E_H)$ of Maximum Independent Set, with $|V_H|=m$. We construct an instance of Maximum Node Consistency as follows.

First, we define the set of genes Γ, i.e. the nodes of the relation graph R. Denote $V_H=\{v_1, \ldots, v_m\}$ and for each $v_i \in V_H$, we define a set $I(v_i)$ of m genes: $I(v_i)=\{r_{i,j} : 1 \le j \le m\}$. The gene set Γ is $\bigcup_{v_i \in V_H} I(v_i)$.

Now, we define the species tree S. First consider S as any binary tree over m leaves $1, \ldots, m$, and replace each leaf i by any binary subtree $T_i$ having m leaves (thus S has $m^2$ leaves). Each gene in $I(v_i)$ is mapped to a leaf of $T_i$ in a bijective manner, and so each species has exactly one gene in R. We make no distinction between $g \in \Gamma$ and $s(g)$.

Now, define the relation graph $R=(V_R,E_R)$. Set $V_R=\Gamma$, and we get that $n=|V_R|=m^2$. For each $v_i \in V_H$, $I(v_i)$ forms a clique in R. Moreover, for each $\{v_i,v_j\} \in E_H$, define an edge $\{r_{i,t}, r_{j,t}\} \in E_R$, for each t with $1 \le t \le m$.
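The relation graph of this second reduction is simple to generate. The sketch below uses 0-based pairs `(i, j)` for the genes $r_{i,j}$; this indexing and the function name are illustrative assumptions, and the species tree (any binary tree over the $T_i$) is not constructed.

```python
from itertools import combinations


def build_relation_graph(m, EH):
    """Return the genes (i, j) and the edge set of R for a graph H on m vertices.

    EH is an iterable of pairs (i, k) of vertex indices of H, 0-based.
    """
    genes = [(i, j) for i in range(m) for j in range(m)]
    edges = set()
    for i in range(m):
        block = [(i, j) for j in range(m)]               # I(v_i) forms a clique
        edges.update(frozenset(p) for p in combinations(block, 2))
    for i, k in EH:
        for t in range(m):                               # copy edge {v_i, v_k} in each column t
            edges.add(frozenset(((i, t), (k, t))))
    return genes, edges


# Hypothetical H: three vertices and the single edge {v_0, v_1}.
genes, edges = build_relation_graph(3, [(0, 1)])
print(len(genes), len(edges))
```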

Let $R'$ be a solution of Maximum Node Consistency over instance (R, S). Denote by $R'(v_i)$ the subset of nodes $V(R') \cap I(v_i)$, that is those nodes of $I(v_i)$ that have not been removed. We pay particular attention to those $R'(v_i)$ that contain more than one node.

Lemma 3

Let $R'(v_i), R'(v_j)$ be two subsets of nodes of a solution $R'$ of Maximum Node Consistency over instance (R, S) such that $|R'(v_i)| \ge 2$ and $|R'(v_j)| \ge 2$. Then there is no edge with one endpoint in $R'(v_i)$ and the other in $R'(v_j)$.

Proof

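Assume on the contrary that there is some q such that $r_{i,q} \in R'(v_i)$ and $r_{j,q} \in R'(v_j)$ share an edge. Consider a node $r_{i,z}$ of $R'(v_i) \setminus \{r_{i,q}\}$, which must exist since $|R'(v_i)| \ge 2$. The $P_3$ induced by $r_{i,z}$, $r_{i,q}$ and $r_{j,q}$ implies the triplet $(r_{i,z}\, r_{j,q} \,|\, r_{i,q})$, while S contains the triplet $(r_{i,z}\, r_{i,q} \,|\, r_{j,q})$. By Theorem 2, $R'$ is then not S-consistent, a contradiction.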

Now, we are ready to prove the main result of this section.

Lemma 4

Let a graph H be an instance of Maximum Independent Set with m nodes, and let (R, S) be the corresponding instance of Maximum Node Consistency with $n=m^2$ nodes. Then

  1. Given an independent set $V'$ of H, we can compute in polynomial time a solution of Maximum Node Consistency of size at least $|V'|m$;

  2. Given a solution of Maximum Node Consistency on instance (R, S) of size at least $km$, we can compute in polynomial time an independent set $V'$ of H such that $|V'| \ge k$.

Proof

  1. Consider an independent set $V'$ of H and define a solution of Maximum Node Consistency on instance (R, S) of size at least $|V'|m$ as follows: remove each node of $I(v_i)$ if and only if $v_i \notin V'$. Let $R'$ be the corresponding solution of Maximum Node Consistency. Since $V'$ is an independent set, it follows that $R'$ consists only of cliques $R'(v_i)$, disconnected one from the other. It has $|V'|m$ nodes and as $R'$ is $P_3$-free, it is S-consistent.

  2. The case $k=1$ is trivial so we assume $k>1$. Consider a solution $R'$ of Maximum Node Consistency on instance (R, S) of size at least $km$, and consider the subsets $R'(v_i)$ in $R'$ such that $|R'(v_i)|>1$. Notice that we can assume that there exist at least k such sets, otherwise $R'$ would contain at most $(k-1)m+m-(k-1)<km$ nodes.

Given an index j, consider the set $R'_j=\{r_{i,j} \in R'(v_i) : |R'(v_i)|>1,\ 1 \le i \le m\}$, i.e. the nodes with index j that belong to some subset $R'(v_i)$ larger than one. By Lemma 3 each set $R'_j$ is an independent set. Now, pick the set $R'_j$ having maximum cardinality. It follows that $R'_j$ contains at least k nodes, since otherwise $R'$ would have at most $m(k-1)+m-k<mk$ nodes. Hence, $V'=\{v_i : r_{i,j} \in R'_j\}$ is an independent set of size at least k, thus concluding the proof.
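Both directions of Lemma 4 are constructive, and can be sketched as two small functions. Genes $r_{i,j}$ are represented as 0-based pairs `(i, j)`; the function names and the fallback for the trivial case are assumptions made for this illustration, and the second function presumes that its input really is an S-consistent solution (it does not check this).

```python
from collections import defaultdict


def solution_from_independent_set(m, independent_set):
    """Part 1: keep the whole block I(v_i) of every v_i in the independent set."""
    return {(i, j) for i in independent_set for j in range(m)}


def independent_set_from_solution(kept):
    """Part 2: from the kept genes, return indices i of an independent set of H."""
    rows = defaultdict(set)                  # i -> column indices j kept in R'(v_i)
    for i, j in kept:
        rows[i].add(j)
    big_rows = [i for i, cols in rows.items() if len(cols) > 1]

    columns = defaultdict(set)               # j -> rows i of the set R'_j
    for i in big_rows:
        for j in rows[i]:
            columns[j].add(i)
    if not columns:                          # trivial case: only singleton rows
        return {next(iter(rows))} if rows else set()
    return max(columns.values(), key=len)    # the largest R'_j


# Round trip on a hypothetical instance with m = 3 and independent set {v_0, v_2}.
kept = solution_from_independent_set(3, {0, 2})
print(len(kept), sorted(independent_set_from_solution(kept)))
```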

We say a maximization problem cannot be approximated within a factor $\alpha$ if, unless P = NP, for any approximation algorithm $A$ there are infinitely many instances for which $A$ outputs a solution with value $AP$ such that $AP < \frac{1}{\alpha} OPT$, where $OPT$ is the optimal value of a solution to the problem (note that equivalently, $\frac{OPT}{AP} > \alpha$). It is well-known that Maximum Independent Set cannot be approximated within a factor $c\,m^{1-\varepsilon}$ for any $0 < \varepsilon < 1$ and for any constant $c > 0$ [37].

Theorem 4

The optimization version of Maximum Node Consistency cannot be approximated within a factor $d\,n^{\frac{1}{2}(1-\varepsilon)}$ for any $0 < \varepsilon < 1$ and for any constant $d > 0$, where $n$ is the number of nodes of the given relation graph. Moreover, this result holds even on instances in which for any distinct $g_1, g_2 \in \Gamma$, $s(g_1) \neq s(g_2)$.

Proof

Let $H$ be a graph with $m$ nodes and let $(R, S)$ be the corresponding instance of Maximum Node Consistency with $n = m^2$ nodes. Denote by $OPT_I$ and $OPT_N$, respectively, the values of an optimal solution for Maximum Independent Set and Maximum Node Consistency. Let $A_N$ be any approximation algorithm for Maximum Node Consistency, and let $A_I$ be the approximation algorithm for Maximum Independent Set that on input $H$, runs $A_N$ on the corresponding instance $(R, S)$ and returns the independent set resulting from Lemma 4. Let $AP_I(H)$ and $AP_N(R,S)$ denote, respectively, the sizes of the solutions found by $A_I(H)$ and $A_N(R,S)$. By Lemma 4 we get that $AP_I(H) \geq \lfloor AP_N(R,S)/m \rfloor \geq AP_N(R,S)/m - 1$ and $OPT_N(R,S) \geq OPT_I(H)\,m$. Now,

$$\frac{OPT_N(R,S)}{AP_N(R,S)} \geq \frac{OPT_I(H)\,m}{AP_I(H)\,m + m} = \frac{OPT_I(H)}{AP_I(H) + 1} \geq \frac{OPT_I(H)}{2\,AP_I(H)}$$

as we may assume that $AP_I(H) \geq 1$. Since Maximum Independent Set cannot be approximated within a factor $c\,m^{1-\varepsilon}$, for any $0 < \varepsilon < 1$ and any constant $c > 0$, then for any $0 < \varepsilon < 1$ and any constant $c > 0$ there exist infinitely many instances $H$ on which $\frac{OPT_I(H)}{2\,AP_I(H)} > c\,m^{1-\varepsilon}$. Thus, it follows that

$$\frac{OPT_N(R,S)}{AP_N(R,S)} \geq \frac{OPT_I(H)}{2\,AP_I(H)} > \frac{c}{2}\,m^{1-\varepsilon} = d\,n^{\frac{1}{2}(1-\varepsilon)}$$

on infinitely many instances, where $d = c/2$ and $m = n^{1/2}$. Finally, the fact that the result holds even on instances in which for any distinct $g_1, g_2 \in \Gamma$, $s(g_1) \neq s(g_2)$ follows from the construction of $R$.

We get the following as an immediate corollary, which will be of use later:

Corollary 1

The decision version of Maximum Node Consistency is NP-Hard, even on instances in which for any distinct $g_1, g_2 \in \Gamma$, $s(g_1) \neq s(g_2)$.

Gene tree correction problems

In this section, we are given a gene family $\Gamma$, a species tree $S$, an $S$-consistent DS-tree $G$ for $\Gamma$, and a set $C = (O, P)$ of orthology/paralogy constraints (not necessarily full). We focus on the problem of correcting $G$ according to $C$ in a minimal way. The goal is thus to find a DS-tree $G'$ inducing $C$ such that the difference between $G$ and $G'$ is minimum. We consider two ways of measuring the difference (or symmetrically the similarity) between gene trees, one based on conserved orthology/paralogy relations induced by the two trees, and one based on the number of conserved clades between the two trees, which corresponds to the Robinson-Foulds distance in the case that $G$, $G'$ and $S$ are all binary trees.

The Maximum Homology Correction problem

Maximum Homology Correction problem:

Input: A species tree S, an S-consistent DS-tree G for a gene family Γ, an integer k, a set O of orthology and a set P of paralogy relations;

Output: “Yes” if there exists an $S$-consistent DS-tree $G'$ for $\Gamma$ with $O \subseteq O(G')$, $P \subseteq P(G')$ such that $|O(G) \cap O(G')| + |P(G) \cap P(G')| \geq k$.

Theorem 5

The Maximum Homology Correction problem is NP-Complete, even if $S$, $G$ and $G'$ are required to be binary.

Proof

The problem is clearly in NP, as verifying $S$-consistency can be done in polynomial time, as well as counting the common orthology/paralogy relations (the set of relations is quadratic in size). For our reduction, we use the Minimum Edge-Removal Consistency problem for the case of a gene family with at most one gene per genome, which is NP-Hard by Theorem 3. Given a species tree $S$, a relation graph $R$ with $V(R)$ in bijection with $L(S)$ and an integer $k$, we construct an instance of the Maximum Homology Correction Problem, i.e. a species tree $S'$, a DS-tree $G$, an orthologous set $O$ and paralogous set $P$.

Let $S' = S$ and construct $G$ by mimicking $S$, that is by first copying $S$ and its leaf labels, then replacing each leaf $\ell$ of $G$ by the gene $s^{-1}(\ell)$. Note that if $S$ is binary, then so is $G$. All internal nodes of $G$ are labeled as speciations, so all genes of $\Gamma$ are pairwise orthologous. Thus $R(G)$ is a clique. Finally, let $O = \emptyset$ and $P = \{g_1g_2 : g_1g_2 \notin E(R)\}$. Therefore the objective is to break a minimum number of orthologies of $G$ in order to satisfy $P$.

We show that there is an $S$-consistent subgraph $R'$ of $R$ obtained by removing at most $k$ edges if and only if there is an $S$-consistent DS-tree $G'$ satisfying $O$ and $P$ with at most $|P| + k$ relations that are not induced by $G$.

($\Rightarrow$): Let $R'$ be a solution to the Minimum Edge-Removal Consistency Problem for $R$ and $S$, obtained by deleting at most $k$ edges from $R$. Then there exists an $S$-consistent DS-tree $G'$ satisfying $R'$. By Lemma 1, we may assume that if $S$ is binary, then so is $G'$. Now, since $R'$ has at most $|P| + k$ non-edges, $G'$ has at most $k + |P|$ paralogies and is therefore a solution to the constructed instance of the Maximum Homology Correction Problem that breaks at most $k + |P|$ orthologies of $R(G)$.

($\Leftarrow$): Let $G'$ be a solution, binary or not, to the constructed Maximum Homology Correction Problem instance and let $R' = R(G')$. Since $G'$ satisfies $P$ and breaks at most $|P| + k$ orthologies, $R'$ must have $P$ as non-edges, plus at most $k$ other non-edges. Thus $R'$ can be obtained by removing at most $k$ edges from $R(G) - P = R$, as desired.
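The instance built in this reduction can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the species tree is a nested tuple of species names, `genes` is a hypothetical dictionary playing the role of $s^{-1}$ (one gene per species), and relations are stored as `frozenset` pairs.

```python
from itertools import combinations

def build_homology_correction_instance(species_tree, genes, edges):
    """Build (G, O, P) from a species tree S and a relation graph R.

    species_tree: nested tuples whose leaves are species names (assumed encoding).
    genes: dict mapping each species to its unique gene, i.e. s^{-1}.
    edges: set of frozensets, the edges (orthologies) of R.
    """
    def mimic(node):
        if isinstance(node, tuple):
            # Internal node: copy the topology of S and label it as a speciation.
            return ("Spec", tuple(mimic(child) for child in node))
        return genes[node]  # Leaf: replace the species by its gene.

    gene_tree = mimic(species_tree)
    orthology = set()  # O is empty in the reduction.
    # P contains every pair of genes that is a non-edge of R.
    paralogy = {frozenset(pair)
                for pair in combinations(sorted(genes.values()), 2)
                if frozenset(pair) not in edges}
    return gene_tree, orthology, paralogy

S = (("a", "b"), "c")
genes = {"a": "g_a", "b": "g_b", "c": "g_c"}
R_edges = {frozenset({"g_a", "g_b"})}
G, O, P = build_homology_correction_instance(S, genes, R_edges)
```

Because every internal node of the constructed tree is a speciation, all gene pairs start as orthologs, and the pairs in `P` are exactly the relations a corrected tree is forced to turn into paralogies.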

The maximum clade correction problem

Maximum clade correction problem:

Input: A gene tree G, a species tree S, a set O of orthology and a set P of paralogy relations and an integer k;

Output: “Yes” if there exists an $S$-consistent DS-tree $G'$ satisfying $O$ and $P$ such that $G$ and $G'$ have at least $k$ clades in common.

Notice that if $S$, $G$ and $G'$ are required to be binary, the effective measure between $G$ and $G'$ is the Robinson-Foulds distance. This special case is handled as part of the general proof. First, however, we need the following lemma, which uses grafting operations to add leaves to $G$ and satisfy a prescribed relation without breaking other relations.

Given two trees $T_1$ and $T_2$, connecting $T_1$ with $T_2$ corresponds to creating a new node $x$ and giving it $r(T_1)$ and $r(T_2)$ as its two children. Grafting a new leaf $x$ to a tree $T$ corresponds to adding $x$ to $L(T)$ by either: (1) adding $x$ as a new child of some node $u$ of $T$; (2) connecting $T$ with $x$; (3) subdividing an edge $uv$ and adding $x$ as a child of the newly created vertex.
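The connecting and grafting operations just defined can be illustrated with a small Python sketch. The `Node` class and the function names are hypothetical and chosen only for this example; event labels and species mappings are deliberately left out.

```python
class Node:
    """A rooted tree node; leaves carry a label, internal nodes may not."""
    def __init__(self, label=None, children=None):
        self.label = label
        self.parent = None
        self.children = []
        for child in (children or []):
            self.add_child(child)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

def connect(root1, root2):
    """Connect two trees: a new node gets r(T1) and r(T2) as its two children."""
    return Node(children=[root1, root2])

def graft_as_child(u, x):
    """Grafting (1): add the leaf x as a new child of node u."""
    u.add_child(x)

def graft_above_root(root, x):
    """Grafting (2): connect T with x and return the new root."""
    return connect(root, x)

def graft_on_edge(u, v, x):
    """Grafting (3): subdivide edge uv (u is the parent of v), hang x on the new vertex."""
    assert v.parent is u, "uv must be an edge with u the parent of v"
    p = Node()
    u.children[u.children.index(v)] = p  # p takes the place of v under u
    p.parent = u
    p.add_child(v)
    p.add_child(x)
    return p

def leaves(node):
    """Leaf labels in left-to-right order."""
    if not node.children:
        return [node.label]
    return [label for child in node.children for label in leaves(child)]

a, b = Node("a"), Node("b")
root = Node(children=[a, b])
p = graft_on_edge(root, a, Node("x"))
new_root = graft_above_root(root, Node("y"))
graft_as_child(root, Node("z"))
```

After these three calls, `leaves(new_root)` lists `a, x, b, z, y`, and `p` is the vertex created by subdividing the edge above `a`.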

Lemma 5

Let $G$ be an $S$-consistent gene tree, for some species tree $S$. Let $x$ be a gene not in $G$ and $y$ be some gene in $G$ with $s(x) \neq s(y)$. Then there exists a gene tree $G'$ obtained by grafting $x$ to $G$ such that the following conditions are satisfied:

  1. $x$ and $y$ are orthologs in $G'$;

  2. $O(G) \subseteq O(G')$ and $P(G) \subseteq P(G')$;

  3. $G'$ is $S$-consistent.

Proof

If $s_G(r(G))$ is a strict descendant of $lca_S(s(x), s(y))$, then it is easy to see that connecting $x$ to $r(G)$ under a common parent yields the desired result. So we assume $s_G(r(G))$ is an ancestor of $lca_S(s(x), s(y))$. If there is some node $u$ of $G$ such that adding $x$ as a child of $u$ satisfies the three conditions of the Lemma, then we are done. So assume that there is no node $u$ to which we can add $x$ as a child.

Let $uv$ be an edge of $G$, and suppose that we graft $x$ on $uv$ to obtain $G'$. Call $p$ the parent of $x$ in $G'$, and say that $u$ is the parent of $p$ (i.e. $p$ has children $x$ and $v$). Note that if $s_{G'}(u) = s_G(u)$, then $s_{G'}(w) = s_G(w)$ for any $w \in V(G) \cap V(G')$, implying that setting $ev_{G'}(z) = ev_G(z)$ for all $z \in V(G) \cap V(G') \setminus \{u\}$ preserves $S$-consistency. We will find such a $uv$ that guarantees this $s_{G'}(u) = s_G(u)$ property, while ensuring that $lca_{G'}(x, y)$ can be a speciation (i.e. $ev_{G'}(lca_{G'}(x, y)) = Spec$ is $S$-consistent), and that $ev_{G'}(u) = ev_G(u)$ is $S$-consistent. This will prove the Lemma.

Let $s_{xy} = lca_S(s(x), s(y))$, and let $g$ be the lowest ancestor of $y$ in $G$ such that $s_G(g)$ is $s_{xy}$ or an ancestor of $s_{xy}$. Note that the case in which $g$ does not exist was handled in the beginning of the proof. Now suppose that $ev_G(g) = Dup$. Denote by $g'$ the child of $g$ that is also an ancestor of $y$. Note that $s_G(g)$ is an ancestor of $s_{xy}$ and $s_G(g')$ is a strict descendant of $s_{xy}$. We claim that $uv = gg'$. To see this, obtain $G'$ by grafting $x$ to $gg'$, $p$ being the parent of $x$ and $g$ the parent of $p$. Then, $s_{G'}(p) = s_{xy}$, and its children species $s_{G'}(g')$ and $s_{G'}(x)$ are separated in $S$ by our choice of $g$. Thus setting $ev_{G'}(p) = ev_{G'}(lca_{G'}(x, y)) = Spec$ preserves $S$-consistency. Also, $s_{G'}(g) = s_G(g)$ since $s_G(g)$ is already an ancestor of $s_{xy} = s_{G'}(p)$. Finally, we are free to set $ev_{G'}(g) = Dup$, satisfying all the required conditions.

So instead suppose that $ev_G(g) = Spec$. Recall that adding $x$ as a child of $g$ to obtain a new tree $G''$ is not a solution. Since in $G''$, $lca_{G''}(x, y) = g$, we must either have $s_{G''}(g) \neq s_G(g)$, or $ev_{G''}(g) = Spec$ is not $S$-consistent. By the choice of $g$, only the latter is possible, implying that all children of $g$ are from separated species in $G$, but not in $G''$. Therefore, there must be a child $g'$ of $g$ such that $s_G(g')$ is an ancestor of $s(x)$. Note that $g'$ must be unique since otherwise, $ev_G(g) = Spec$ would not be possible. We then claim that $uv = gg'$. Indeed, obtain $G'$ by grafting $x$ to $gg'$, $p$ being the parent of $x$. We have $s_{G'}(p) = s_G(g')$, and we set $ev_{G'}(p) = Dup$. The species of the children of $g$ remain unchanged in $G'$, and so $s_{G'}(g) = s_G(g)$ and $ev_{G'}(g) = ev_{G'}(lca_{G'}(x, y)) = Spec$ is $S$-consistent, again satisfying all required conditions.

Theorem 6

The Maximum Clade Correction Problem is NP-Complete, even if $S$, $G$ and $G'$ are required to be binary.

Proof

Verifying $S$-consistency and comparing the set of clades from $G$ and $G'$ can clearly be done in polynomial time, thus the problem is in NP. We use the Maximum Node Consistency problem for our reduction, which is NP-Hard by Corollary 1. Let $(R, S)$ and $k$ be the Maximum Node Consistency instance, letting $R$ be the relation graph with $V(R) = \{v_1, \ldots, v_n\}$, $S$ the species tree and $k$ an integer. Let $\alpha = n(n-1-k) + 2k$ (noting that $\alpha > 0$ when $k \leq n$). The constructed instance of the Maximum Clade Correction Problem uses the same species tree $S$. Construct $G$ as follows: first consider $G$ as any binary tree with $n$ leaves $l_1, \ldots, l_n$, where each leaf $l_i$ is mapped to vertex $v_i$ of $R$. Then, replace each leaf $l_i$ by a subtree $T_i$ constructed as follows: $T_i$ is a caterpillar tree with $n-1+\alpha$ leaves, and each leaf $\ell$ of $T_i$ is such that $s(\ell) = s(v_i)$ (recall that a caterpillar tree is a path to which we add a leaf child to each internal node). Let $L_i$ denote the set of the $n-1$ deepest leaves of $T_i$ (the depth of a leaf $\ell$ being the number of nodes on the path between $\ell$ and the root). Each leaf of $L_i$ is mapped to a distinct node of $V(R) \setminus \{v_i\}$. Denote by $\ell_{i,j}$ the leaf of $T_i$ mapped to $v_j$, and by $N_i$ the subtree of $T_i$ rooted at $lca(L_i)$. Then $G$ has exactly $n(n-1+\alpha)$ leaves and $n(n-1+\alpha)-1$ clades (since it is binary). An example is given in Fig. 4. Finally define $O = \{\{\ell_{i,j}, \ell_{j,i}\} : v_iv_j \in E(R)\}$ the set of orthology relations to satisfy and $P = \{\{\ell_{i,j}, \ell_{j,i}\} : v_iv_j \notin E(R)\}$ the set of paralogy relations to satisfy. Note that each $\ell_{i,j}$ is present in exactly one relation.
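As a sanity check on the counts in this construction, the following Python sketch builds the caterpillar subtrees and the relation sets for a toy instance. It is an illustration under our own assumptions: vertices of the relation graph are the integers `0..n-1`, leaves are encoded as strings (`l_i_j` for the leaf of the i-th subtree mapped to the j-th vertex, `pad_i_t` for the remaining shallow leaves), trees are nested pairs, the top of the gene tree is arbitrarily chosen to be a caterpillar, and species labels are omitted.

```python
def alpha(n, k):
    """The padding parameter alpha = n(n-1-k) + 2k from the proof."""
    return n * (n - 1 - k) + 2 * k

def caterpillar(items):
    """Binary caterpillar over at least two items; later items sit deeper."""
    tree = (items[-2], items[-1])
    for item in reversed(items[:-2]):
        tree = (item, tree)
    return tree

def build_clade_instance(n, edges, k):
    """Return (G, [T_0..T_{n-1}], O, P) for a relation graph on vertices 0..n-1."""
    a = alpha(n, k)
    subtrees = []
    for i in range(n):
        shallow = [f"pad_{i}_{t}" for t in range(a)]       # the alpha shallow leaves
        deep = [f"l_{i}_{j}" for j in range(n) if j != i]   # L_i, the n-1 deepest leaves
        subtrees.append(caterpillar(shallow + deep))
    gene_tree = caterpillar(subtrees)  # "any binary tree" on the n subtrees
    orthology, paralogy = set(), set()
    for i in range(n):
        for j in range(i + 1, n):
            relation = frozenset({f"l_{i}_{j}", f"l_{j}_{i}"})
            if frozenset({i, j}) in edges:
                orthology.add(relation)
            else:
                paralogy.add(relation)
    return gene_tree, subtrees, orthology, paralogy

def count(tree):
    """Return (number of leaves, number of internal nodes) of a nested-pair tree."""
    if isinstance(tree, str):
        return 1, 0
    left, right = count(tree[0]), count(tree[1])
    return left[0] + right[0], left[1] + right[1] + 1

n, k = 3, 2
G, T, O, P = build_clade_instance(n, {frozenset({0, 1})}, k)
```

With `n = 3` and `k = 2` the parameter alpha equals 4, so each subtree has 6 leaves and 5 internal nodes, and the whole tree has 18 leaves and 17 internal nodes, matching the formulas in the text.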

Fig. 4.

$R$ and $S$ are the input relation graph and species tree, respectively, for an instance of the Maximum Node Consistency Problem (genes of $R$ are labeled by their species). The gene tree $G$ is constructed from $R$ as described in the proof. Leaves of $G$ related through orthology in $O$ are joined by a black solid edge, while paralogs are joined by a dotted line. The graph $R'$ is a solution to the Maximum Node Consistency instance with $k = 3$, and $H$ is the DS-tree corresponding to $R'$. The tree $G'$ is a solution to the Maximum Clade Correction Problem constructed from $H$.

We show that $R$ admits an $S$-consistent induced subgraph with at least $k$ nodes if and only if $G$, $O$ and $P$ admit an $S$-consistent DS-tree $G'$ satisfying $O$ and $P$ such that $G$ and $G'$ share at least $k(\alpha+n-2)$ clades.

($\Rightarrow$) Let $R'$ be a solution to the Maximum Node Consistency instance, $|V(R')| \geq k$, and let $H$ be a DS-tree satisfying $R'$ that is $S$-consistent. By Lemma 1, we may assume that if $S$ is binary, then so is $H$. Now, since $L(H) \subseteq V(R)$, to each leaf $v_i$ of $H$ corresponds a subtree $T_i$ in $G$. Then build a DS-tree $G''$ from $H$ by replacing each leaf $v_i$ of $H$ by $T_i$, and labeling all internal nodes of inserted trees as $Dup$ (in Fig. 4, $G''$ is the subtree of $G'$ rooted at the common ancestor of $T_a$, $T_b$ and $T_c$). We first argue that $G''$ is $S$-consistent and satisfies the subsets of $O$ and $P$ restricted to $L(G'')$. In a subsequent step, we will graft the genes missing from $G''$ using Lemma 5. Notice that $s_H(v_i) = s_{G''}(r(T_i))$ and that all nodes of $H$ that are in $G''$ have the same LCA-mapping in both trees. It follows that $G''$ is also $S$-consistent. Also, for all $v_i, v_j \in L(H)$, $ev_H(lca_H(v_i, v_j)) = ev_{G''}(lca_{G''}(r(T_i), r(T_j)))$. Thus for any pair of leaves $\ell_{i,j}, \ell_{j,i}$ in $L(G'')$ such that $\{\ell_{i,j}, \ell_{j,i}\} \in O$, $lca_{G''}(\ell_{i,j}, \ell_{j,i})$ is a speciation (by the construction of $O$ from $R$ and the fact that $H$ satisfies $R'$). By the same reasoning, if $\{\ell_{i,j}, \ell_{j,i}\} \in P$ then $\ell_{i,j}$ and $\ell_{j,i}$ are paralogous in $G''$.

The solution $G'$ is obtained by grafting to $G''$ every leaf of $G$ missing from $G''$ whilst preserving the $T_i$ clades, maintaining satisfiability of $O$ and $P$ and $S$-consistency. If such a $G'$ exists, then $G$ and $G'$ share at least $k$ identical subtrees from $\{T_1, \ldots, T_n\}$, and since each $T_i$ contains $\alpha+n-2$ clades, it follows that $G$ and $G'$ share at least $k(\alpha+n-2)$ clades as required. Let $L = L(G) \setminus L(G'')$ be the leaves yet missing from $G''$. Let $L_O = \{\ell \in L : \exists \ell' \in L(G''), \ell\ell' \in O\}$ (i.e. the leaves of $L$ subject to an orthology constraint with some leaf already in $G''$). The complement $\overline{L_O}$ is the set of leaves of $L$ that are either subject to a paralogy constraint with some leaf of $G''$, or not subject to any constraint with any leaf of $G''$. Let $R(\overline{L_O})$ be the relation graph with vertex set $\overline{L_O}$ and edge set $\{\ell_1\ell_2 : \ell_1, \ell_2 \in \overline{L_O}, \ell_1\ell_2 \in O\}$, depicting the required orthologies within $\overline{L_O}$. Recall that each leaf of $L$ is contained in at most one relation, implying that each node of $R(\overline{L_O})$ has degree at most 1. Thus $R(\overline{L_O})$ is $P_3$-free and therefore is $S$-consistent. Let $G_{L_O}$ be a DS-tree satisfying $R(\overline{L_O})$ that is $S$-consistent, assuming that $G_{L_O}$ is binary if $S$ is. We update $G''$ by joining $r(G'')$ and $r(G_{L_O})$ under a common parent $x$, and labeling $x$ as $Dup$. Notice that this does not modify any orthology or paralogy relation previously in $G''$ or in $G_{L_O}$, nor does it break $S$-consistency. This also ensures that paralogies $\ell_1\ell_2 \in P$ with $\ell_1 \in \overline{L_O}$ and $\ell_2 \in L(G'')$ are satisfied.

The final step is to graft the leaves of $L_O$ to $G''$ in a way satisfying orthology requirements. This is done by successively applying Lemma 5 to each $\ell \in L_O$. As shown, each such $\ell$ can be grafted into $G''$ without modifying any orthology or paralogy relation already in $G''$ whilst satisfying the orthology requirement that $\ell$ is subject to. It is straightforward to see that in addition, $\ell$ can be grafted without breaking any $T_i$ clade present in $G''$, since every vertex in $T_i$ is mapped to the same species. The tree $G'$ obtained after all these grafting operations satisfies every relation of $O$ and $P$ and has the required common clades with $G$.

($\Leftarrow$) Let $G'$ be a solution, binary or not, to the Maximum Clade Correction Problem instance. Denote by $C$ the number of clades shared by both $G$ and $G'$, with $C \geq k(\alpha+n-2)$. Recall that $L_i$ is the set of the $n-1$ deepest leaves of $T_i$ in $G$, with $N_i$ being the subtree rooted at $lca_G(L_i)$. Denote by $G'_{L_i}$ the subtree of $G'$ rooted at $lca_{G'}(L_i)$. We say that $N_i$ was preserved if every leaf of $G'_{L_i}$ belongs to $L(T_i)$ (in other words, the $N_i$ clade might have been extended, but only to include other leaves from $T_i$). We claim that at least $k$ of the $N = \{N_1, \ldots, N_n\}$ subtrees are preserved in $G'$. Assume, on the contrary, that at least $n-k+1$ subtrees from $N$ are not preserved. Take a non-preserved subtree $N_i$. Then some leaf $\ell \notin L(T_i)$ belongs to the $lca_{G'}(L_i)$ clade. This implies that for any ancestor $x$ of $r(N_i)$ in $G$, $G'$ cannot contain the $x$ clade. By construction of $T_i$, $r(N_i)$ has at least $\alpha$ ancestors in $G$. Therefore, $C \leq n(n-1+\alpha) - 1 - \alpha(n-k+1)$. This leads to $k(\alpha+n-2) \leq C \leq n(n-1+\alpha) - 1 - \alpha(n-k+1)$, and then to $\alpha \leq n(n-1-k) + 2k - 1$, contradicting our choice of $\alpha$.
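The final inequality of this paragraph follows from a short calculation, spelled out here as a worked sketch (with $C$ the number of shared clades and $\alpha = n(n-1-k)+2k$ as chosen in the construction):

```latex
\begin{align*}
k(\alpha+n-2) &\le C \le n(n-1+\alpha) - 1 - \alpha(n-k+1) \\
k\alpha + kn - 2k &\le n^2 - n + n\alpha - 1 - n\alpha + k\alpha - \alpha \\
kn - 2k &\le n^2 - n - 1 - \alpha \\
\alpha &\le n(n-1-k) + 2k - 1 \;<\; n(n-1-k) + 2k = \alpha .
\end{align*}
```

The last line is impossible, which is the contradiction with the choice of $\alpha$.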

Now, let $N_p = \{N_i \in N : N_i \text{ is preserved in } G'\}$. We have $|N_p| \geq k$. Let $L = \bigcup_{i : N_i \in N_p} L_i$ and $H' = G'|_L$. Notice that to each $N_i \in N_p$ corresponds exactly one subtree $N'_i$ in $H'$ such that $L(N'_i) = L(N_i)$ (and all such $N'_i$ subtrees are disjoint). Let $H$ be the tree obtained by replacing every subtree $N'_i$ in $H'$ by $v_i$. Replacing $N'_i$ by $v_i$ changes no LCA-mapping value since all vertices of $N'_i$ map to $s(v_i)$. Thus as $G'$ is $S$-consistent, then so are $H'$ and $H$. Now, we claim that $H$ induces the set of relations represented by $R' = R[L(H)]$, which proves the theorem since $|L(H)| = |N_p| \geq k$. By contradiction, suppose that $v_iv_j \in E(R')$ but $lca_H(v_i, v_j)$ is labeled $Dup$. Then $lca_{H'}(\ell_{i,j}, \ell_{j,i})$ is also labeled $Dup$, and so is $lca_{G'}(\ell_{i,j}, \ell_{j,i})$. But $\ell_{i,j}\ell_{j,i} \in O$, contradicting our assumption that $G'$ is a solution. The same reasoning applies when $v_iv_j \notin E(R')$, ending the proof.

Algorithmic avenues

As the problems considered in this paper are all computationally hard, only non-polynomial exact algorithms or approximation algorithms can realistically be explored. Let us generalize the Minimum Edge-Removal Consistency problem to the minimum editing problem (i.e. minimizing edge removals and insertions). It is not hard to imagine a branch-and-bound algorithm that solves the problem. Call an induced subgraph $H$ of a relation graph $R$ bad if it is a $P_4$, or if there is a triplet of $P_3(H)$ not displayed by $S$. Each $P_4$ can be solved by six possible edge editings, and each contradictory triplet of $P_3(H)$ can be solved by three possible editings. Therefore, in a branch-and-bound process, one would verify if a given graph $R$ contains a bad subgraph and if so, proceed recursively on each graph obtained by an editing that removes it. If no bad subgraph exists, then $R$ is a possible solution and its number of editings is retained. If, at any point, $R$ has had more editings than the best solution encountered so far, the algorithm can stop the recursion. Notice however that an edge should not be edited more than once in order to avoid infinite loops. The idea of this branch-and-bound algorithm can also be applied to the Maximum Node Consistency problem. It is known that a $P_4$, if one exists, can be found in linear time [38], and clearly a contradictory triplet, if any, can be found in time $O(n^3)$ (though a more efficient algorithm may exist). A similar approach has been applied in [31] to design an FPT algorithm for the satisfiability problem.
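The branch-and-bound idea can be sketched as follows. This is a minimal, brute-force illustration and not an implementation from the paper: it assumes at most one gene per species (so vertices of the relation graph are identified with species names), encodes the species tree as nested tuples, and searches for bad subgraphs by enumeration rather than with the linear-time method of [38]. All function names are hypothetical.

```python
from itertools import permutations

def clades(tree):
    """Leaf sets of all subtrees of a species tree given as nested tuples."""
    found = []
    def walk(node):
        if isinstance(node, tuple):
            leaf_set = frozenset().union(*(walk(child) for child in node))
        else:
            leaf_set = frozenset([node])
        found.append(leaf_set)
        return leaf_set
    walk(tree)
    return found

def make_displays(species_tree):
    """Return displays(a, b, c): True if the species tree displays the triplet ab|c."""
    all_clades = clades(species_tree)
    return lambda a, b, c: any(a in cl and b in cl and c not in cl for cl in all_clades)

def find_bad(vertices, edges, displays):
    """Return the vertex pairs of one bad subgraph, or None if there is none."""
    adj = lambda u, v: frozenset((u, v)) in edges
    for x, y, z in permutations(vertices, 3):
        # An induced P3 x - y - z (xz a non-edge) implies the triplet xz|y.
        if x < z and adj(x, y) and adj(y, z) and not adj(x, z) and not displays(x, z, y):
            return [(x, y), (y, z), (x, z)]          # three possible editings
    for a, b, c, d in permutations(vertices, 4):
        if (a < d and adj(a, b) and adj(b, c) and adj(c, d)
                and not adj(a, c) and not adj(a, d) and not adj(b, d)):
            return [(a, b), (b, c), (c, d), (a, c), (a, d), (b, d)]  # six editings
    return None

def min_editing(vertices, edges, displays, edited=frozenset(), cost=0, best=None):
    """Branch on every editing of a bad subgraph; prune with the best cost so far."""
    if best is None:
        best = {"cost": float("inf"), "edges": None}
    if cost >= best["cost"]:
        return best                                   # bound: cannot improve
    bad = find_bad(vertices, edges, displays)
    if bad is None:
        best["cost"], best["edges"] = cost, set(edges)
        return best
    for pair in bad:
        pair = frozenset(pair)
        if pair in edited:
            continue                                  # never edit the same pair twice
        # Symmetric difference removes the pair if it is an edge, inserts it otherwise.
        min_editing(vertices, set(edges) ^ {pair}, displays, edited | {pair}, cost + 1, best)
    return best

displays = make_displays((("a", "b"), "c"))
result = min_editing(["a", "b", "c"], {frozenset("ab"), frozenset("bc")}, displays)
```

In this toy run the path a - b - c implies the triplet ac|b, which the species tree ((a,b),c) does not display, so one editing is needed. The enumeration is exponential and only meant to make the recursion concrete.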

As for approximations, an algorithm proposed in [32] can be directly applied to the Minimum Edge-Removal Consistency problem and guarantees that we do not remove more than 4Δ(R) times more edges than the optimal solution, where Δ(R) is the maximum degree of R. The idea is simple: as long as R has a bad subgraph H, remove every edge incident to a vertex of H and continue. Even though this is the best known approximation algorithm so far, it has the undesirable effect of isolating many vertices, motivating the exploration of alternative algorithms. One direction would be to consider existing ideas on the problem of satisfiability, i.e. finding the minimum number of editings required to make a graph P4-free, and adapt them to the consistency problem - for instance the Min-Cut algorithm proposed in [39].
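The removal loop described above ("as long as the graph has a bad subgraph, remove every edge incident to one of its vertices") can be sketched in Python. This is our own illustration, not the algorithm of [32] as published: the detector passed in is a brute-force induced-path search on four vertices, and a detector for contradictory triplets would be plugged in through the same hypothetical `find_bad` parameter.

```python
from itertools import permutations

def find_induced_p4(vertices, edges):
    """Return the vertex set of one induced path on four vertices, or None."""
    adj = lambda u, v: frozenset((u, v)) in edges
    for a, b, c, d in permutations(vertices, 4):
        if (adj(a, b) and adj(b, c) and adj(c, d)
                and not adj(a, c) and not adj(a, d) and not adj(b, d)):
            return {a, b, c, d}
    return None

def isolate_bad_subgraphs(vertices, edges, find_bad=find_induced_p4):
    """Repeatedly isolate the vertices of a bad subgraph until none remains."""
    edges = set(edges)
    while True:
        bad = find_bad(vertices, edges)
        if bad is None:
            return edges
        # Drop every edge that touches a vertex of the bad subgraph.
        edges = {edge for edge in edges if not (edge & bad)}

vertices = list("abcdefg")
path_plus_edge = {frozenset(p) for p in ("ab", "bc", "cd", "de", "fg")}
remaining = isolate_bad_subgraphs(vertices, path_plus_edge)
```

On this toy graph the path a - b - c - d is found first, all edges touching it (including d - e) are removed, and only the edge f - g survives, which illustrates the vertex-isolating side effect mentioned in the text.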

As for gene tree correction, we have developed in [14] a polynomial-time algorithm which, given a species tree S and partial sets of relations O and P, verifies if there exists an S-consistent gene tree G satisfying O and P and if so, constructs one among the set of all possible solutions. In order to correct a gene tree G, we can envisage an extension of this algorithm that takes G as input and picks, among the solutions of the algorithm, the one which is the closest to G (either in terms of common homology relations or clades).

Conclusions

A gene tree induces a set of orthology and paralogy relations between members of a gene family, but the converse is not always true. In this paper we have shown that attempting to modify a set of relations as little as possible in order to ensure consistency with a species tree leads to the formulation of NP-Complete problems. Moreover, even assuming that the given relations are error-free, it remains computationally difficult to correct a gene tree in order to fit the given set of relations. As various model-free methods are available to infer orthology and paralogy, these correction problems are of practical biological interest. A future direction would be to explore the exact branch-and-bound algorithms and heuristics mentioned in the last section, and design fast approximation algorithms for the relation graph and gene tree editing problems.

Authors’ contribution

ML, RD and NE modeled the four problems presented, and devised and wrote the hardness proofs. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Funding

Publication of this work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de Recherche Nature et Technologies of Quebec (FRQNT).

Footnotes

1

The term ‘relation graph’ is also used in phylogenetics in the form of a generalization of a median network to a set of partitions. To make it clear, relation graphs in this paper have nothing to do with this notion.

Manuel Lafond, Riccardo Dondi and Nadia El-Mabrouk contributed equally to this work

Contributor Information

Manuel Lafond, Email: lafonman@iro.umontreal.ca.

Riccardo Dondi, Email: riccardo.dondi@unibg.it.

Nadia El-Mabrouk, Email: mabrouk@iro.umontreal.ca.

References

  • 1. Ohno S. Evolution by gene duplication. Berlin: Springer; 1970.
  • 2. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979;28:132–163. doi: 10.2307/2412519.
  • 3. Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl Acids Res. 2000;28:33–36. doi: 10.1093/nar/28.1.33.
  • 4. Li L, Stoeckert CJJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503.
  • 5. Berglund AC, Sjolund E, Ostlund G, Sonnhammer EL. InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucl Acids Res. 2008;36:D263–D266. doi: 10.1093/nar/gkm1020.
  • 6. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinform. 2011;12:124. doi: 10.1186/1471-2105-12-124.
  • 7. Lafond M, Semeria M, Swenson KM, Tannier E, El-Mabrouk N. Gene tree correction guided by orthology. BMC Bioinform. 2013;14(supp 15):S5. doi: 10.1186/1471-2105-14-S15-S5.
  • 8. Lafond M, Swenson K, El-Mabrouk N. Error detection and correction of gene trees. Models and algorithms for genome evolution. London: Springer; 2013.
  • 9. Consortium TGO. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–29. doi: 10.1038/75556.
  • 10. Hellmuth M, Hernandez-Rosales M, Huber K, Moulton V, Stadler P, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013;66(1–2):399–420. doi: 10.1007/s00285-012-0525-x.
  • 11. Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. PNAS. 2014;112(7):2058–2063. doi: 10.1073/pnas.1412770112.
  • 12. Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981;10:405–421. doi: 10.1137/0210030.
  • 13. Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler P. From event-labeled gene trees to species trees. BMC Bioinform. 2012;13(Suppl. 19):56.
  • 14. Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics. 2014;15(Suppl 6):12. doi: 10.1186/1471-2164-15-S6-S12.
  • 15. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara gene trees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–335. doi: 10.1101/gr.073585.107.
  • 16. Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M, Perrière G. Databases of homologous gene families for comparative genomics. BMC Bioinform. 2009;10(Suppl 6):S3. doi: 10.1186/1471-2105-10-S6-S3.
  • 17. Datta RS, Meacham C, Samad B, Neyer C, Sjölander K. Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 2009;37:84–89. doi: 10.1093/nar/gkp373.
  • 18. Pryszcz LP, Huerta-Cepas J, Gabaldón T. MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res. 2011;39:32. doi: 10.1093/nar/gkq953.
  • 19. Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Denisov I, Kormes D, Marcet-Houben M, Gabaldón T. Phylomedb v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions. Nucleic Acids Res. 2011;39:556–560. doi: 10.1093/nar/gkq1109.
  • 20. Mi H, Muruganujan A, Thomas PD. Panther in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2012;41:377–386. doi: 10.1093/nar/gks1118.
  • 21. Chaudhary R, Burleigh JG, Eulenstein O. Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence. BMC Bioinform. 2011;13(Supp. 10):11. doi: 10.1186/1471-2105-13-S10-S11.
  • 22. Chen K, Durand D, Farach-Colton M. Notung: dating gene duplications using gene family trees. J Comput Biol. 2000;7:429–447. doi: 10.1089/106652700750050871.
  • 23. Dondi R, El-Mabrouk N, Swenson KM. Gene tree correction for reconciliation and species tree inference: complexity and algorithms. J Discret Algorithms. 2014;25:51–65. doi: 10.1016/j.jda.2013.06.001.
  • 24. Doroftei A, El-Mabrouk N. Removing noise from gene trees. In: Przytycka TM, Sagot M-F, editors. WABI 2011. Lecture notes in bioinformatics. vol. 6833. Berlin, Heidelberg: Springer; 2011. p. 76–91.
  • 25. Gorecki P, Eulenstein O. Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem. BMC Bioinform. 2011;13(Supp 10):14. doi: 10.1186/1471-2105-13-S10-S14.
  • 26. Gorecki P, Eulenstein O. A linear-time algorithm for error-corrected reconciliation of unrooted gene trees. In: Chen J, Wang J, Zelikovsky A, editors. ISBRA 2011. Lecture notes in bioinformatics. vol. 6674. Berlin, Heidelberg: Springer; 2011. p. 148–159.
  • 27. Lafond M, Chauve C, Dondi R, El-Mabrouk N. Polytomy refinement for the correction of dubious duplications in gene trees. Bioinformatics. 2014;30(17):519–526. doi: 10.1093/bioinformatics/btu463.
  • 28. Swenson KM, Doroftei A, El-Mabrouk N. Gene tree correction for reconciliation and species tree inference. Algorithms Mol Biol. 2012;7:31. doi: 10.1186/1748-7188-7-31.
  • 29. Nguyen TH, Ranwez V, Pointet S, Chifolleau AM, Doyon JP, Berry V. Reconciliation and local gene tree rearrangement can be of mutual profit. Algorithms Mol Biol. 2013;8(8):12. doi: 10.1186/1748-7188-8-12.
  • 30. Robinson D, Foulds L. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–147. doi: 10.1016/0025-5564(81)90043-2.
  • 31. Liu Y, Wang J, Guo J, Chen J. Complexity and parameterized algorithms for cograph editing. Theor Comput Sci. 2012;461:45–54. doi: 10.1016/j.tcs.2011.11.040.
  • 32. Natanzon A, Shamir R, Sharan R. Complexity classification of some edge modification problems. Discret Appl Math. 2001;113(1):109–128. doi: 10.1016/S0166-218X(00)00391-7.
  • 33. Fitch WM. Homology a personal view on some of the problems. Trends Genet. 2000;16(5):227–231. doi: 10.1016/S0168-9525(00)02005-9.
  • 34. El-Mallah ES, Colbourn CJ. The complexity of some edge deletion problems. IEEE Trans Circuits Syst. 1988;35(3):354–362. doi: 10.1109/31.1748.
  • 35. Garey MR, Johnson DS. Computers and intractability: a guide to the theory of NP-completeness. San Francisco: WH Freeman & Co.; 1979.
  • 36. Vazirani VV. Approximation algorithms. New York: Springer; 2003.
  • 37. Zuckerman D. Linear degree extractors and the inapproximability of max clique and chromatic number. Proc Thirty Eight Annu ACM Symp Theor Comput. 2007;3(1):103–128.
  • 38. Bretscher A, Corneil DG, Habib M, Paul C. A simple linear time lexbfs cograph recognition algorithm. SIAM J Discret Math. 2008;22(4):1277–1296. doi: 10.1137/060664690.
  • 39. Altenhoff AM, Gil M, Gonnet GH, Dessimoz C. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS One. 2013;8(1):53786. doi: 10.1371/journal.pone.0053786.

Articles from Algorithms for Molecular Biology : AMB are provided here courtesy of BMC
