Abstract
A wide variety of problems in computational biology, most notably the assessment of orthology, are solved with the help of reciprocal best matches. Using an evolutionary definition of best matches that captures the intuition behind the concept, we clarify rigorously the relationships between reciprocal best matches, orthology, and evolutionary events under the assumption of duplication/loss scenarios. We show that the orthology graph is a subgraph of the reciprocal best match graph (RBMG). We furthermore give conditions under which an RBMG that is a cograph identifies the correct orthology relation. Using computer simulations we find that most false positive orthology assignments can be identified as so-called good quartets (and thus corrected) in the absence of horizontal transfer. Horizontal transfer, however, may also introduce false-negative orthology assignments.
Electronic supplementary material
The online version of this article (10.1007/s00285-020-01469-y) contains supplementary material, which is available to authorized users.
Keywords: Phylogenetic combinatorics, Colored digraph, Orthology, Horizontal gene transfer
Introduction
The distinction between orthologous and paralogous genes has important consequences for gene annotation, comparative genomics, as well as molecular phylogenetics due to their close correlation with gene function (Koonin 2005). Orthologous genes, which derive from a speciation as their last common ancestor (Fitch 1970), usually have at least approximately equivalent functions (Gabaldón and Koonin 2013). Paralogs, in contrast, tend to have related, but clearly distinct functions (Studer and Robinson-Rechavi 2009; Innan and Kondrashov 2010; Altenhoff et al. 2012; Zallot et al. 2016). Phylogenetic studies strive to restrict their input data to one-to-one orthologs since these often evolve in an approximately clock-like fashion. In comparative genomics, orthologs serve as anchors for chromosome alignments and thus are an important basis for synteny-based methods (Sonnhammer et al. 2014).
Despite its practical importance, the mathematical interrelationships of empirical “pairwise best hits” on the one hand, and reconciliations of gene and species trees on the other hand, have remained largely unexplored. Practical workflows for orthology assignment directly use pairwise best hits as an initial estimate of orthologous gene pairs. Many of the commonly used methods for orthology identification, such as OrthoMCL (Li et al. 2003), ProteinOrtho (Lechner et al. 2014), OMA (Roth et al. 2008), or eggNOG (Jensen et al. 2008), belong to this class. Extensive benchmarking (Altenhoff et al. 2016; Nichio et al. 2017) has shown that these tools perform at least as well as methods such as Orthostrapper (Storm and Sonnhammer 2002), PHOG (Datta et al. 2009), EnsemblCompara (Vilella et al. 2009), or HOGENOM (Dufayard et al. 2005) that first independently reconstruct a gene tree T and a species tree S and then determine orthologous and paralogous genes.
The intuition behind the pairwise best hit approach is that a gene y in species s can only be an ortholog of a gene x in species r if y is the closest relative of x in s and x is at the same time the closest relative of y in r. Evolutionary relatedness is defined in terms of an – often unknown – phylogenetic tree T. The notion of a best match or closest relative thus is made precise by considering the last common ancestors in T: y is a best match for x if the last common ancestor $\mathrm{lca}_T(x,y)$ is not further away from x (and thus not closer to the root of the tree) than $\mathrm{lca}_T(x,y')$ for any other gene $y'$ in species s. This formally defines the best match relation studied in (Geiß et al. 2019a). The reciprocal best match relation identifies the pairs of genes that are mutually closest relatives between pairs of species, see (Geiß et al. 2019b).
Two approximations are introduced when pairwise best hit approaches are employed for orthology assessment. First, it is well known that two genes can be mutual closest relatives without being orthologs. The usual example is the complementary loss of ancestrally present paralogs following a gene duplication (Fig. 1a). Second, pairwise best hits as determined by sequence (dis)similarity are not necessarily pairs of most closely related genes and, vice versa, evolutionarily most closely related gene pairs do not necessarily appear as pairwise best hits (Fig. 1b).
We argue, therefore, that the relationship of pairwise best hits and orthology has to be understood in (at least) two conceptually and practically separate steps:
What is the relationship of pairwise best hits and reciprocal best matches?
What is the relation of reciprocal best matches and orthology?
In this contribution we focus on the second question, which is largely a mathematical problem. The main aim of the present contribution is to connect formal results on the structure of the orthology relation and the associated reconciliation maps and gene trees with recent results on the mathematical structure of (reciprocal) best match relations.
The first question, which is primarily a question of inference from data, is investigated in a companion paper (Stadler et al. 2020) that makes use of several of the mathematical results derived here. In a nutshell, the best hits inferred from estimates of genetic distances may differ from best matches whenever paralogs evolve with different rates in different species. In most situations this can be detected – and in most cases corrected – by considering quartets of genes from three different species, provided it is known that $c$ is an outgroup to $a$, $b_1$, and $b_2$. Using the approximate additivity of empirical genetic distances, it can then be checked which one of the paralogs $b_1$ and $b_2$ is more closely related to $a$. The main practical difficulty is to ensure that $c$ is correctly identified as outgroup.
Symbolic ultrametrics (Böcker and Dress 1998) and 2-structures (Ehrenfeucht and Rozenberg 1990a, b) provided a basis to show that orthology relations are essentially equivalent to cographs (Hellmuth et al. 2013, 2017; Hellmuth and Wieseke 2016). Moreover, in the absence of horizontal gene transfer (HGT), reconciliation maps for an event-labeled gene tree exist if and only if the species tree S displays all triples rooted in a speciation event that have leaves from three distinct species (Hernandez-Rosales et al. 2012; Hellmuth 2017). This shows that it is possible to infer species phylogenies from empirical estimates of orthology (Hellmuth et al. 2015; Lafond et al. 2016; Lafond and El-Mabrouk 2014; Dondi et al. 2017). Although it is possible to generalize many of the results, such as the characterization of reconciliation maps for event-labeled gene trees to scenarios with horizontal gene transfer (Nøjgaard et al. 2018; Hellmuth et al. 2019; Hellmuth 2017) this remains an active area of research.
Best matches as a mathematical structure have been studied only very recently. Geiß et al. (2019a) gave two alternative characterizations of best match digraphs and showed that they can be recognized in polynomial time. In particular, there is a unique least resolved tree for each best match digraph, which is displayed by the gene tree and can also be computed in polynomial time. Reciprocal best matches naturally appear as the symmetric part of these digraphs. Somewhat surprisingly, the undirected reciprocal best match graphs seem to have a much more difficult structure (Geiß et al. 2019b).
Although pairwise best hit methods do not attempt to explicitly construct the gene tree T, they still make the assumption that there is some underlying phylogeny for the provided homologous genes. The distinction of orthology and paralogy then amounts to assigning event labels (“speciation”, “duplication”, and possibly “HGT”) to the inner vertices of T. While it is true that any gene tree, and thus also any best match graph, can be reconciled with any species tree (Guigó et al. 1996; Page and Charleston 1997; Górecki and Tiuryn 2006), such a reconciliation may imply unrealistically many duplication and deletion events. In the extreme case, all inner vertices are duplication events before the first speciation. The root of the species tree then already contains a separate gene for each leaf of T. All the additional copies created by speciations are therefore eliminated again by subsequent loss events. More parsimonious reconciliations are thus usually modeled by minimizing the number of duplication and loss events, as reviewed e.g. by Doyon et al. (2011).
Moreover, the existence of reconciliation maps for T to some species tree cannot generally be ensured if the event labels are given (Hernandez-Rosales et al. 2012; Hellmuth 2017). Hence, the best match relation (which constrains the gene tree (Geiß et al. 2019a)), the event labels, the existence of one or a particular reconciliation map, and the species tree depend on each other or at least constrain each other. In this contribution we explore these dependencies in detail in the absence of horizontal gene transfer.
We show that, in this setting, the true orthology graph (TOG) is a subgraph of the reciprocal best match graph (RBMG). In other words, reciprocal best matches can only produce false positive orthology assignments as long as the evolution of a gene family proceeds via duplications, losses, and speciations. Computer simulations show that over a broad parameter range the TOG and RBMG are very similar, providing an a posteriori justification for the use of reciprocal best matches in orthology estimation. In addition, we characterize a subset of the “false positive” edges in the RBMG that cannot be present in the TOG. Experimental results show that – using so-called good quartets – it is possible to remove nearly all false positive orthology assignments. Our aim here is to understand those sources of error and ambiguities in orthology detection that still persist even if reciprocal best matches are inferred with perfect accuracy. Therefore, all computer simulations reported here use perfect data as input. In a companion paper, we address the question of how well reciprocal best matches can be inferred from (dis)similarity data, and what can be done to make this initial step more accurate. Finally, we discuss how these results can potentially be generalized to the case that the evolutionary scenarios contain HGT.
Preliminaries
A planted (phylogenetic) tree is a rooted tree $T$ with vertex set $V(T)$ and edge set $E(T)$ such that (i) the root $0_T$ has degree 1 and (ii) all inner vertices have degree $\ge 3$. We write $L(T)$ for the leaves (not including $0_T$) and $V^0(T)$ for the inner vertices (also not including $0_T$). To avoid trivial cases, we will always assume that $|L(T)|\ge 2$. The conventional root $\rho_T$ of $T$ is the unique neighbor of $0_T$. The main reason for using planted phylogenetic trees instead of modeling phylogenetic trees simply as rooted trees, which is the much more common practice in the field, is that we will often need to refer to the time before the first branching event. Conceptually, it corresponds to explicitly representing an outgroup. For some vertex $v\in V(T)$, we denote by $T(v)$ the subtree of $T$ that is rooted in $v$. Its leaf set is $L(T(v))$.
On a rooted tree $T$ we define the ancestor order: if $y\ne x$ is a vertex of the unique path connecting $x$ with the root $0_T$, we write $x\prec_T y$. As usual we write $x\preceq_T y$ if $x\prec_T y$ or $x=y$. In particular, the leaves are the minimal elements w.r.t. $\prec_T$, and we have $x\preceq_T 0_T$ for all $x\in V(T)$. This partial order is conveniently extended to the edge set by defining each edge to be located between its incident vertices, i.e., if $y\prec_T x$ and $e=xy$ is an edge, we set $y\prec_T e\prec_T x$. In this case, we write $e=xy$ to denote that $x$ is closer to the root than $y$. If $xy\in E(T)$, we say that $y$ is a child of $x$, in symbols $y\in\mathrm{child}(x)$, and $x$ is the parent of $y$ in $T$. We sometimes also write $y\succ_T x$ instead of $x\prec_T y$. Moreover, if $x\preceq_T y$ or $y\preceq_T x$ in $T$, then $x$ and $y$ are called comparable, otherwise the two vertices are incomparable.
For a non-empty subset of vertices $A\subseteq V$ of a rooted tree $T=(V,E)$, we define $\mathrm{lca}_T(A)$, the last common ancestor of $A$, to be the unique $\preceq_T$-minimal vertex of $T$ that is an ancestor of every vertex in $A$. For simplicity we write $\mathrm{lca}_T(x_1,\dots,x_k):=\mathrm{lca}_T(\{x_1,\dots,x_k\})$ for a set of vertices. The definition of $\mathrm{lca}_T$ is conveniently extended to edges by setting $\mathrm{lca}_T(x,e):=\mathrm{lca}_T(\{x\}\cup e)$ and $\mathrm{lca}_T(e,f):=\mathrm{lca}_T(e\cup f)$, where the edges $e,f\in E(T)$ are simply treated as sets of vertices. We note for later reference that $\mathrm{lca}_T(A\cup B)=\mathrm{lca}_T(\mathrm{lca}_T(A),\mathrm{lca}_T(B))$ holds for non-empty vertex sets $A$, $B$ of a tree.
Binary trees on three leaves are called triples. We say that a triple xy|z is displayed in a rooted tree T if x, y, and z are leaves in T and the path from x to y does not intersect the path from z to the root. The set of all triples that are displayed by the tree T is denoted by r(T), and a triple set R is said to be compatible if there exists a tree T that displays R, i.e., $R\subseteq r(T)$.
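The display condition can be checked through last common ancestors: xy|z is displayed exactly when lca(x, y) lies strictly below lca(x, z). The following is a minimal sketch of that test, not part of the original text. It assumes a hypothetical encoding in which a tree is a nested tuple and leaves are strings; the function names are illustrative only.

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf name to the tuple of child indices on its root-to-leaf path."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def lca_depth(paths, x, y):
    """Depth of lca(x, y), i.e. the length of the common prefix of both paths."""
    depth = 0
    for a, b in zip(paths[x], paths[y]):
        if a != b:
            break
        depth += 1
    return depth


def displays(tree, x, y, z):
    """True if the triple xy|z is displayed: lca(x, y) is strictly below lca(x, z)."""
    paths = leaf_paths(tree)
    return lca_depth(paths, x, y) > lca_depth(paths, x, z)


# Example tree (((a, b), c), d): ab|c is displayed, ac|b is not.
example = ((("a", "b"), "c"), "d")
```

Because lca(x, y) and lca(x, z) both lie on the path from x to the root, comparing their depths is enough to decide which one is the ancestor of the other.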
Denote by L(S) a set of species and denote by $\sigma\colon L(T)\to L(S)$ the map that assigns to each gene $x\in L(T)$ a species $\sigma(x)\in L(S)$. A tree T together with such a map $\sigma$ is denoted by $(T,\sigma)$ and called a leaf-colored tree.
Definition 1
Let $(T,\sigma)$ be a leaf-colored tree. A leaf $y\in L(T)$ is a best match of the leaf $x\in L(T)$ if $\sigma(x)\ne\sigma(y)$ and $\mathrm{lca}_T(x,y)\preceq_T\mathrm{lca}_T(x,y')$ holds for all leaves $y'$ from species $\sigma(y')=\sigma(y)$. The leaves $x,y\in L(T)$ are reciprocal best matches if y is a best match for x and x is a best match for y.
The directed graph $\vec{G}(T,\sigma)$ with vertex set L(T), vertex-coloring $\sigma$, and edges defined by the best matches in $(T,\sigma)$ is known as colored best match graph (BMG) (Geiß et al. 2019a). The undirected graph $G(T,\sigma)$ with vertex set L(T), vertex-coloring $\sigma$, and edges defined by the reciprocal best matches in $(T,\sigma)$ is known as colored reciprocal best match graph (RBMG) (Geiß et al. 2019b). We sometimes write n-BMG, resp., n-RBMG to specify the number n of colors.
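To make Definition 1 concrete, the sketch below computes the arcs of the BMG and the edges of the RBMG for a small leaf-colored tree. It is an illustration only: the nested-tuple tree encoding, the dictionary `sigma`, and all function names are assumptions introduced here, and the brute-force loops are meant for toy examples, not for the polynomial-time algorithms cited above.

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf to the tuple of child indices on its root-to-leaf path."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def lca_depth(paths, x, y):
    """Depth of lca(x, y); a larger depth means an lca closer to the leaves."""
    depth = 0
    for a, b in zip(paths[x], paths[y]):
        if a != b:
            break
        depth += 1
    return depth


def best_match_graph(tree, sigma):
    """Arcs (x, y) such that y is a best match of x in the sense of Definition 1."""
    paths = leaf_paths(tree)
    arcs = set()
    for x in paths:
        for y in paths:
            if sigma[x] == sigma[y]:
                continue
            d = lca_depth(paths, x, y)
            # y is a best match iff no leaf y2 of the same color has a deeper lca with x
            if all(lca_depth(paths, x, y2) <= d
                   for y2 in paths if sigma[y2] == sigma[y]):
                arcs.add((x, y))
    return arcs


def reciprocal_best_matches(tree, sigma):
    """Edges of the RBMG: the symmetric part of the best match digraph."""
    arcs = best_match_graph(tree, sigma)
    return {frozenset((x, y)) for (x, y) in arcs if (y, x) in arcs}


# A duplication followed by a speciation in each paralog lineage.
sigma = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}
full_tree = (("a1", "b1"), ("a2", "b2"))
# The same family after complementary loss of b1 and a2.
after_loss = ("a1", "b2")
```

In `full_tree` the reciprocal best matches are a1–b1 and a2–b2. In `after_loss` the surviving paralogs a1 and b2 become reciprocal best matches, which is the complementary-loss situation mentioned in the Introduction.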
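Throughout this contribution, $G=(V,E)$ and $\vec{G}=(V,\vec{E})$ denote simple undirected and simple directed graphs, respectively. We distinguish directed arcs (x, y) in a digraph $\vec{G}$ from edges xy in an undirected graph G or tree T. For an undirected graph G we denote by $N(x)$ the neighborhood of some vertex x in G. The disjoint union $G\mathbin{\dot\cup}H$ of two graphs $G=(V,E)$ and $H=(W,F)$ has vertex set $V\mathbin{\dot\cup}W$ and edge set $E\mathbin{\dot\cup}F$. Their join $G\mathbin{\triangledown}H$ has again vertex set $V\mathbin{\dot\cup}W$ and its edge set is given by $E\mathbin{\dot\cup}F\mathbin{\dot\cup}\{xy\mid x\in V,\ y\in W\}$. Thus the join of G and H is obtained by connecting every vertex of G to every vertex of H.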
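The two graph operations can be written down directly. The sketch below is illustrative only and assumes a hypothetical representation of a graph as a pair `(vertices, edges)` of frozensets, with each undirected edge stored as a two-element frozenset.

```python
def K1(v):
    """The single-vertex graph on vertex v."""
    return (frozenset([v]), frozenset())


def disjoint_union(g, h):
    """Disjoint union: vertex sets and edge sets are simply merged."""
    (V, E), (W, F) = g, h
    assert not (V & W), "vertex sets must be disjoint"
    return (V | W, E | F)


def join(g, h):
    """Join: disjoint union plus every edge between a vertex of g and a vertex of h."""
    (V, E), (W, F) = g, h
    assert not (V & W), "vertex sets must be disjoint"
    return (V | W, E | F | {frozenset((x, y)) for x in V for y in W})


# The path a - b - c arises as the join of b with the disjoint union of a and c.
a, b, c = K1("a"), K1("b"), K1("c")
path3 = join(b, disjoint_union(a, c))
```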
A class of undirected graphs that plays an important role in this contribution are cographs, which are recursively defined (Corneil et al. 1981):
Definition 2
An undirected graph G is a cograph if one of the following conditions is satisfied:
$G=K_1$, the single-vertex graph,
$G=H\mathbin{\triangledown}H'$, where $H$ and $H'$ are cographs,
$G=H\mathbin{\dot\cup}H'$, where $H$ and $H'$ are cographs.
An undirected graph is a cograph if and only if it does not contain an induced $P_4$ (path on four vertices) (Corneil et al. 1981).
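The forbidden-subgraph characterization gives a simple, if inefficient, test. The sketch below checks every four-vertex subset for an induced $P_4$. It is a brute-force illustration for small graphs only and does not implement the linear-time recognition algorithms known for cographs; the function name and input format are assumptions.

```python
from itertools import combinations


def is_cograph(vertices, edges):
    """Return True if the graph has no induced path on four vertices.

    vertices: iterable of hashable vertex names
    edges:    iterable of 2-element iterables (undirected edges)
    """
    edge_set = {frozenset(e) for e in edges}
    for quad in combinations(vertices, 4):
        induced = [e for e in combinations(quad, 2) if frozenset(e) in edge_set]
        degrees = sorted(sum(v in e for e in induced) for v in quad)
        # Among graphs on four vertices, only P4 has three edges and degrees 1, 1, 2, 2.
        if len(induced) == 3 and degrees == [1, 1, 2, 2]:
            return False
    return True


p4 = [("a", "b"), ("b", "c"), ("c", "d")]               # the path itself
c4 = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]   # the 4-cycle
```

Here `is_cograph("abcd", p4)` is false, while the 4-cycle passes the test, in line with its construction as the join of two edgeless graphs on two vertices each.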
Every cograph $G$ is associated with a set $\mathcal{T}_G$ of phylogenetic trees, usually referred to as the cotrees of $G$. Every cotree $T_G\in\mathcal{T}_G$ corresponds to a possible recursive construction of $G$, where the cotree for the single-vertex graph $K_1$ is simply the tree with a single leaf. Since both the disjoint union and the join operation are associative, it is possible to join or unify two or more component cographs in a single construction step. The leaves of $T_G$ correspond to the vertices of $G$. Each interior vertex of $T_G$ corresponds to either a join or a disjoint union operation. Its child-subtrees, furthermore, are exactly the cotrees of the component cographs that are joined or disjointly unified, respectively. The event type associated with an inner vertex $u$ will be denoted by $t(u)$. Each vertex $u$ of $T_G$ can be associated with an induced subgraph $G[L(T_G(u))]$. A cotree is called discriminating if any two adjacent inner nodes represent different types of events. If $T_G\in\mathcal{T}_G$ and $T'_G$ is obtained from $T_G$ by contracting a non-discriminating edge, i.e., an edge $uv$ with $t(u)=t(v)$, then $T'_G\in\mathcal{T}_G$. Every cograph has a unique discriminating cotree, which is obtained from any of its cotrees by contracting all non-discriminating edges (Corneil et al. 1981). We note, finally, that the discriminating cotree of $G$ coincides with the modular decomposition tree of $G$.
Reconciliation maps, event labelings, and orthology relations
A gene tree $T=(V,E)$ and a species tree $S=(W,F)$ are planted phylogenetic trees on a set of (extant) genes L(T) and species L(S), respectively. We assume that we know which gene comes from which species. Mathematically, this knowledge is represented by a map $\sigma\colon L(T)\to L(S)$ that assigns to each gene the species in whose genome it resides. Best match approaches start from a set of genes taken from a set of species. Hence, the “gene-species-association” $\sigma$ is known. Moreover, species without sampled genes do not affect the best match graph and we can w.l.o.g. assume that $\sigma$ is a surjective map to avoid trivial cases. Note, however, that the definitions and results presented below naturally extend to general maps $\sigma$. We write $(T,\sigma)$ for a gene tree with given map $\sigma$.
An evolutionary scenario comprises a gene tree and a species tree together with a map $\mu$ from T to S that identifies the locations in the species tree S at which evolutionary events took place that are represented by the vertices of the gene tree T. The properties of the map $\mu$ of course depend on which types of evolutionary events are considered. In order to model evolutionary scenarios we assume that evolutionary events of different types do not occur concurrently. In particular, speciation and duplication are always strictly temporally ordered. Gene duplications therefore always occur along the edges of the species tree. Vertices of T that model speciation events, on the other hand, must be mapped to inner vertices of S.
From here on we will consider only Duplication/Loss scenarios, that is, we explicitly exclude horizontal gene transfer (HGT). We will briefly discuss the effects of HGT in Sect. 8.
Definition 3
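(Reconciliation Map) Let $T=(V,E)$ and $S=(W,F)$ be two planted phylogenetic trees and let $\sigma\colon L(T)\to L(S)$ be a surjective map. A reconciliation from $(T,\sigma)$ to $S$ is a map $\mu\colon V\to W\cup F$ satisfying
(R0) Root Constraint. $\mu(x)=0_S$ if and only if $x=0_T$.
(R1) Leaf Constraint. If $x\in L(T)$, then $\mu(x)=\sigma(x)$.
(R2) Ancestor Preservation. $x\prec_T y$ implies $\mu(x)\preceq_S\mu(y)$.
(R3) Speciation Constraints. Suppose $\mu(x)\in W^0$.
(i) $\mu(x)=\mathrm{lca}_S(\mu(v'),\mu(v''))$ for at least two distinct children $v',v''$ of $x$ in $T$.
(ii) $\mu(v')$ and $\mu(v'')$ are incomparable in $S$ for any two distinct children $v'$ and $v''$ of $x$ in $T$.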
Several alternative definitions of reconciliation maps for Duplication/Loss scenarios have been proposed in the literature, many of which have been shown to be equivalent. Nevertheless, we add yet another one because earlier variants do not clearly separate conditions pertaining to the structural congruence of gene tree and species tree (Axioms (R0), (R1), and (R2)) from conditions that (implicitly) distinguish event types, here (R3.i) and (R3.ii). This axiom system also generalizes easily to situations with horizontal transfer as we shall see in Sect. 8. We proceed by showing that it is equivalent to axioms that are commonly used in the literature, see e.g. Górecki and Tiuryn (2006), Vernot et al. (2008), Doyon et al. (2011), Rusin et al. (2014), Hellmuth (2017), Nøjgaard et al. (2018), and the references therein.
Lemma 1
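Let $\mu$ be a map from $(T,\sigma)$ to $S$ that satisfies (R0) and (R1). Then, $\mu$ satisfies Axioms (R2) and (R3) if and only if $\mu$ satisfies
(R2’) Ancestor Constraint. Suppose $x,y\in V$ with $x\prec_T y$.
(i) If $\mu(x),\mu(y)\in F$, then $\mu(x)\preceq_S\mu(y)$,
(ii) otherwise, i.e., at least one of $\mu(x)$ and $\mu(y)$ is contained in $W$, $\mu(x)\prec_S\mu(y)$.
(R3’) Inner Vertex Constraint. If $\mu(x)\in W^0$, then
(i) $\mu(x)=\mathrm{lca}_S(\sigma(L(T(x))))$ and
(ii) $\mu(v')$ and $\mu(v'')$ are incomparable in $S$ for any two distinct children $v'$ and $v''$ of $x$ in $T$.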
Proof
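Assume first that (R2) and (R3) are satisfied for $\mu$.
Then property (R2’.i) is satisfied since it is the restriction of (R2) to $\mu^{-1}(F)$.
To see that (R2’.ii) holds, let $x\prec_T y$ and $\mu(x)\in W$ or $\mu(y)\in W$. Assume first that $\mu(y)\in W$. Property (R2) implies $\mu(x)\preceq_S\mu(y)$. Let $v$ be the child of $y$ that lies on the path from $y$ to $x$ in $T$, i.e., $x\preceq_T v\prec_T y$. Assume for contradiction that $\mu(x)=\mu(y)$. By Property (R2) we have $\mu(x)\preceq_S\mu(v)\preceq_S\mu(y)$ and hence $\mu(v)=\mu(y)$. For every other child $v'$ of $y$, Property (R2) implies $\mu(v')\preceq_S\mu(y)=\mu(v)$. Thus, $\mu(v)$ and $\mu(v')$ are comparable; a contradiction to (R3.ii). Hence, $\mu(x)\prec_S\mu(y)$ and (R2’.ii) is satisfied. Now suppose $\mu(x)\in W$ and assume for contradiction that $\mu(x)=\mu(y)$. Thus $\mu(y)\in W$ and we can apply the same arguments as above to conclude that (R3.ii) is not satisfied. Hence, $\mu(x)\prec_S\mu(y)$ and (R2’.ii) is satisfied.
In order to show that (R3’) is satisfied, let $x\in V$ such that $\mu(x)\in W^0$. Properties (R3’.ii) and (R3.ii) are equivalent. It remains to show that (R3’.i) is satisfied. From (R2) we infer $\sigma(z)=\mu(z)\preceq_S\mu(x)$ for all $z\in L(T(x))$. Thus,
$$\mathrm{lca}_S(\sigma(L(T(x))))\preceq_S\mu(x). \qquad (1)$$
Property (R3.i) implies that there are two distinct children $v',v''\in\mathrm{child}(x)$ with $\mu(x)=\mathrm{lca}_S(\mu(v'),\mu(v''))$. Again using (R3.ii), we know that the images $\mu(v')$ and $\mu(v'')$ are incomparable in S. The latter together with $\sigma(z)\preceq_S\mu(v')$ for all $z\in L(T(v'))$ and $\sigma(z)\preceq_S\mu(v'')$ for all $z\in L(T(v''))$ implies
$$\mu(x)=\mathrm{lca}_S(\mu(v'),\mu(v''))\preceq_S\mathrm{lca}_S(\sigma(L(T(x)))).$$
In summary, $\mu(x)\in W^0$ implies that $\mu(x)=\mathrm{lca}_S(\sigma(L(T(x))))$ and Property (R3’.i) is satisfied.
Therefore, (R2) and (R3) imply (R2’) and (R3’).
Conversely, assume now that (R2’) and (R3’) are satisfied for $\mu$. Clearly (R2’) implies (R2), and (R3’.ii) implies (R3.ii). It remains to show that (R3.i) is satisfied. Let $\mu(x)\in W^0$. By (R2’.ii) we have $\mu(v)\prec_S\mu(x)$ for all children $v\in\mathrm{child}(x)$. Therefore, $\mathrm{lca}_S(\mu(\mathrm{child}(x)))\preceq_S\mu(x)$. By (R3’.ii), the images $\mu(v)$, $v\in\mathrm{child}(x)$, are pairwise incomparable in S. The latter and (R3’.i) imply $\mu(x)=\mathrm{lca}_S(\mu(\mathrm{child}(x)))$. It is easy to verify that $\mathrm{lca}_S(\mu(\mathrm{child}(x)))=\mathrm{lca}_S(\mu(v'),\mu(v''))$ for at least two children $v',v''\in\mathrm{child}(x)$ is always satisfied. Hence, $\mu(x)=\mathrm{lca}_S(\mu(v'),\mu(v''))$ for some $v',v''\in\mathrm{child}(x)$ and thus, (R3.i) is satisfied.
Therefore, (R2’) and (R3’) imply (R2) and (R3).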
A reconciliation map from to a species tree S implicitly determines whether an inner node of T corresponds to a speciation or a duplication. Since we assume that distinct events are represented by distinct nodes of the gene tree, all duplication events are mapped to the edges of S. Vertices of T mapped to vertices of S thus represent speciations. We formalize this idea as follows:
Definition 4
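Given a reconciliation map $\mu$ from $(T,\sigma)$ to S, the event labeling on T (determined by $\mu$) is the map $t_\mu\colon V(T)\to\{\circledcirc,\odot,\bullet,\square\}$ given by:
$$t_\mu(u)=\begin{cases}\circledcirc & \text{if } u=0_T \text{ (root)},\\ \odot & \text{if } u\in L(T) \text{ (leaf)},\\ \bullet & \text{if } \mu(u)\in W^0 \text{ (speciation)},\\ \square & \text{otherwise (duplication)}.\end{cases}$$
The symbols $\circledcirc$ and $\odot$ identify the planted root and the leaves of T, respectively. Inner vertices are labeled $\square$ for duplication and $\bullet$ for speciation, respectively.
The event labeling $t_\mu$, by definition, is completely determined by a reconciliation map $\mu$. This raises two related questions: (1) which pattern of event labels can arise for reconciliation maps, and (2) what restriction does a given event labeling impose on the reconciliation map? To study these questions, we consider event-labeled trees (T, t) where the event labeling of T is a map $t\colon V(T)\to\{\circledcirc,\odot,\bullet,\square\}$ satisfying $t(0_T)=\circledcirc$, $t(u)=\odot$ for all $u\in L(T)$, and $t(u)\in\{\bullet,\square\}$ for $u\in V^0(T)$. We interpret $\square$ as gene duplication event and $\bullet$ as speciation event.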
A simple consequence of the Axioms (R0)-(R3) is the following result which is stated here for later reference. For the sake of completeness, we also provide a short proof.
Lemma 2
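Let $\mu$ be a reconciliation map from the leaf-colored tree $(T,\sigma)$ to $S$ and suppose that x is a vertex in V(T) with $\mu(x)\in W^0$. Then, $\sigma(L(T(v')))\cap\sigma(L(T(v'')))=\emptyset$ for any two distinct $v',v''\in\mathrm{child}(x)$.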
Proof
Assume for contradiction that there is a vertex $z\in\sigma(L(T(v')))\cap\sigma(L(T(v'')))$. By Condition (R2’), we have $z\preceq_S\mu(v')$ and $z\preceq_S\mu(v'')$. Thus, there is a path $P'$ from $\mu(x)$ to z that contains $\mu(v')$ and a path $P''$ from $\mu(x)$ to z that contains $\mu(v'')$. However, Condition (R3.ii) implies that $\mu(v')$ and $\mu(v'')$ are incomparable in S, that is, the subtree of S consisting of the two paths $P'$ and $P''$ must contain a cycle; a contradiction.
Lemma 2 has a simple interpretation: Since $\mu(x)\in W^0$, we have $t_\mu(x)=\bullet$, i.e., x represents a speciation. The lemma thus states that any two subtrees of T rooted in distinct children of a speciation event are composed of genes from disjoint sets of species. It suggests the following
Definition 5
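An event labeling $t\colon V(T)\to\{\circledcirc,\odot,\bullet,\square\}$ is well-formed if $t(x)=\bullet$ implies that $\sigma(L(T(v')))\cap\sigma(L(T(v'')))=\emptyset$ for any two distinct $v',v''\in\mathrm{child}(x)$.
Lemma 2 suggests asking for a characterization of the event maps t for a given leaf-labeled tree $(T,\sigma)$ for which $(T,t,\sigma)$ admits a reconciliation map to some species tree. Definition 5 suggests starting with one particular well-formed event labeling: the one that designates as a speciation every inner vertex of T that is not forced to be a duplication by violating the disjointness condition of Lemma 2.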
Definition 6
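Let $(T,\sigma)$ be a leaf-labeled tree. The extremal event labeling of T is the map $\hat{t}_T\colon V(T)\to\{\circledcirc,\odot,\bullet,\square\}$ defined for $u\in V(T)$ by
$$\hat{t}_T(u)=\begin{cases}\circledcirc & \text{if } u=0_T,\\ \odot & \text{if } u\in L(T),\\ \square & \text{if there are two distinct children } u_1,u_2\in\mathrm{child}(u) \text{ with } \sigma(L(T(u_1)))\cap\sigma(L(T(u_2)))\ne\emptyset,\\ \bullet & \text{otherwise}.\end{cases}$$
The extremal event labeling $\hat{t}_T$ is completely determined by $(T,\sigma)$. By construction, if $u$ is a duplication w.r.t. the extremal event labeling $\hat{t}_T$, then $t(u)=\square$ for every well-formed event labeling t on $(T,\sigma)$.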
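The extremal labeling can be computed in a single pass over the tree. The sketch below is a hedged illustration: it assumes that Definition 6 labels an inner vertex as a duplication exactly when two of its child subtrees share a color, and as a speciation otherwise, which is what the remark above and the proof of Lemma 4 suggest. The nested-tuple encoding and the practice of identifying an inner vertex by the set of leaves below it are choices made here for brevity.

```python
from itertools import combinations


def leaves(tree):
    """Set of leaf names below (and including) the given subtree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def extremal_labeling(tree, sigma, labels=None):
    """Label every inner vertex, keyed by its leaf set, as speciation or duplication."""
    if labels is None:
        labels = {}
    if isinstance(tree, str):
        return labels
    color_sets = [{sigma[x] for x in leaves(child)} for child in tree]
    # A vertex must be a duplication if two child subtrees share a color.
    overlap = any(a & b for a, b in combinations(color_sets, 2))
    labels[leaves(tree)] = "duplication" if overlap else "speciation"
    for child in tree:
        extremal_labeling(child, sigma, labels)
    return labels


sigma = {"a1": "A", "c1": "C", "b2": "B", "c2": "C"}
tree = (("a1", "c1"), ("b2", "c2"))
labels = extremal_labeling(tree, sigma)
```

In this example both child subtrees of the root contain a gene from species C, so the root is a duplication, while the two lower inner vertices are speciations.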
It is a well-known result that it is always possible to reconcile a given pair of gene tree T and species tree S, see e.g. (Guigó et al. 1996; Page and Charleston 1997; Górecki and Tiuryn 2006). For convenience, we include a short direct proof of this fact.
Lemma 3
For every tree $(T,\sigma)$ there is a reconciliation map $\mu$ to any species tree S with leaf set $L(S)=\sigma(L(T))$.
Proof
Let $S=(W,F)$ be an arbitrary species tree with leaf set L(S) and $e=0_S\rho_S$ be the unique root-edge of S. Set $\mu(0_T)=0_S$ and $\mu(x)=\sigma(x)$ for all $x\in L(T)$. Thus, (R0) and (R1) are satisfied. Now, set $\mu(v)=e$ for all $v\in V^0(T)$. Thus, $\mu(v)\notin W^0$ for all $v\in V(T)$ and (R3) is trivially satisfied. Finally, for all $x,y\in V(T)$ with $x\prec_T y$ we have by construction of $\mu$ that $\mu(x)\preceq_S\mu(y)$. Thus, (R2) is satisfied.
The reconciliation map constructed in the proof of Lemma 3 maps all inner vertices of the gene tree to the edge above the root of the species tree S, and hence $t_\mu(v)=\square$ for all inner vertices $v$ of T. The root of S already contains |L(T)| genes, one for each leaf of T. Every speciation event is therefore accompanied by complementary losses, and there are no further gene duplication events below the root.
The assignment of genes to species, i.e., a prescribed leaf coloring $\sigma$, however, implies further restrictions. In fact, it is not sufficient to require that the event labeling is well-formed. Instead, the simultaneous knowledge of $(T,t,\sigma)$ gives rise to stronger conditions on the species trees S with which $(T,t,\sigma)$ can be reconciled. Following (Hernandez-Rosales et al. 2012), we denote by $\mathcal{S}(T,t,\sigma)$ the set of triples $\sigma(a)\sigma(b)|\sigma(c)$ for which ab|c is a triple displayed by T such that (i) $\sigma(a)$, $\sigma(b)$, $\sigma(c)$ are pairwise distinct species and (ii) the root of the triple is a speciation event, i.e., $t(\mathrm{lca}_T(a,b,c))=\bullet$. This set of triples characterizes the existence of a reconciliation map:
Proposition 1
(Hernandez-Rosales et al. 2012; Hellmuth 2017) Given a leaf-labeled tree $(T,\sigma)$ with a well-formed event labeling t and a species tree S with $L(S)=\sigma(L(T))$, there is a reconciliation map $\mu$ from $(T,\sigma)$ to S such that the event labeling $t$ is consistent with Definition 4, i.e., $t=t_\mu$, if and only if S displays $\mathcal{S}(T,t,\sigma)$. In particular, $(T,t,\sigma)$ can be reconciled with a species tree if and only if $\mathcal{S}(T,t,\sigma)$ is a compatible set of triples.
An example for a $(T,t,\sigma)$ that does not admit a reconciliation map is given in Fig. 2 (top left). We note that the characterization in Proposition 1 can be evaluated in polynomial time (Hellmuth 2017).
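The informative species triples can be enumerated directly from an event-labeled gene tree. The sketch below follows the description of the triple set given before Proposition 1: ab|c must be displayed by T, the three species must be pairwise distinct, and the last common ancestor of a, b, c must be a speciation. The encoding of inner vertices as `(label, children)` pairs, the function names, and the example tree are assumptions made for illustration; the example is not the one shown in Fig. 2.

```python
from itertools import combinations


def leaves(node):
    """Leaf names below a node; inner nodes are (label, children) pairs."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [x for child in children for x in leaves(child)]


def informative_triples(node, sigma, out=None):
    """Species triples AB|C, stored as (frozenset({A, B}), C), rooted in a speciation."""
    if out is None:
        out = set()
    if isinstance(node, str):
        return out
    label, children = node
    if label == "speciation":
        # ab|c has this node as root iff a and b share one child subtree and c lies in another.
        for i, side in enumerate(children):
            for j, other in enumerate(children):
                if i == j:
                    continue
                for a, b in combinations(leaves(side), 2):
                    for c in leaves(other):
                        A, B, C = sigma[a], sigma[b], sigma[c]
                        if len({A, B, C}) == 3:
                            out.add((frozenset((A, B)), C))
    for child in children:
        informative_triples(child, sigma, out)
    return out


def has_direct_conflict(triples):
    """True if two different triples on the same three species occur.

    This is sufficient, but not necessary, for incompatibility.
    """
    seen = {}
    for pair, single in triples:
        key = pair | {single}
        if seen.setdefault(key, (pair, single)) != (pair, single):
            return True
    return False


sigma = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "c2": "C", "b2": "B"}
gene_tree = ("duplication", [
    ("speciation", [("duplication", ["a1", "b1"]), "c1"]),
    ("speciation", [("duplication", ["a2", "c2"]), "b2"]),
])
triples = informative_triples(gene_tree, sigma)
```

The labeling of `gene_tree` is well-formed, yet the resulting triples AB|C and AC|B cannot both be displayed by one species tree, so by Proposition 1 no reconciliation map exists for this hypothetical example. A complete compatibility test would require a triple-consistency algorithm, which is beyond this sketch.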
The event labeling t on T defines the orthology relation:
Definition 7
(Fitch 2000) Two distinct leaves $x,y\in L(T)$ are orthologs (w.r.t. t) if $t(\mathrm{lca}_T(x,y))=\bullet$; they are paralogs if $t(\mathrm{lca}_T(x,y))=\square$.
For completeness, we note that $t(\mathrm{lca}_T(x,y))=\odot$ if and only if $x=y$, and $0_T$ is never the last common ancestor of any pair of leaves since the planted root has degree 1 by construction. We write $\Theta(T,t)$ for the orthology relation obtained from (T, t), i.e., the set of all unordered pairs of orthologous genes in L(T). For convenience we will not distinguish between the irreflexive, symmetric binary relation $\Theta(T,t)$ and the graph with vertex set L(T) and edge set $\Theta(T,t)$. Naturally, we say that an arbitrary relation $R$ is an orthology relation if there is an event-labeled phylogenetic tree (T, t) such that $R=\Theta(T,t)$. It is important to note that the orthology relation explicitly depends on the event labeling. Analogously, one can also define the paralogy relation by $\overline{\Theta}(T,t):=\{xy\mid x,y\in L(T),\ t(\mathrm{lca}_T(x,y))=\square\}$. Both orthology and paralogy are irreflexive and symmetric but not transitive, see Fig. 3. We note that orthology and paralogy are complementary in the graph-theoretical sense, i.e., every pair $xy$ of distinct leaves is contained in exactly one of $\Theta(T,t)$ or $\overline{\Theta}(T,t)$.
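Definition 7 translates into a short recursion: two leaves in different child subtrees of a vertex u have u as their last common ancestor, so they are orthologs exactly when u is labeled as a speciation. The sketch below assumes a hypothetical encoding in which inner vertices are `(label, children)` pairs and leaves are strings.

```python
from itertools import combinations


def leaves(node):
    """Leaf names below a node; inner nodes are (label, children) pairs."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [x for child in children for x in leaves(child)]


def orthology_relation(node, pairs=None):
    """Set of unordered leaf pairs whose last common ancestor is a speciation."""
    if pairs is None:
        pairs = set()
    if isinstance(node, str):
        return pairs
    label, children = node
    if label == "speciation":
        # Leaves taken from two different child subtrees have this node as their lca.
        for left, right in combinations(children, 2):
            pairs.update(frozenset((x, y))
                         for x in leaves(left) for y in leaves(right))
    for child in children:
        orthology_relation(child, pairs)
    return pairs


# Gene a diverged by speciation from the ancestor of the paralogs b1 and b2.
tree = ("speciation", ["a", ("duplication", ["b1", "b2"])])
orthologs = orthology_relation(tree)
```

In this example a is orthologous to both b1 and b2, while b1 and b2 are paralogs, which illustrates why orthology is not transitive.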
Based on the work of Böcker and Dress (1998) it has been shown by Hellmuth et al. (2013) that valid orthology relations are exactly cographs:
Proposition 2
An irreflexive, symmetric relation $R$ on L is an orthology relation if and only if it is a cograph. In this case, every cotree T of $R$ with an event labeling t assigning $\bullet$ to join operations and $\square$ to disjoint union operations satisfies $\Theta(T,t)=R$.
There is a unique discriminating cotree $(T_\Theta,t_\Theta)$ for an orthology relation $\Theta$, which is obtained from every other (non-discriminating) cotree (T, t) for $\Theta$ by contracting the inner edges uv of T if and only if $t(u)=t(v)$ (Böcker and Dress 1998; Hellmuth et al. 2013).
It is natural then to ask under which conditions a given orthology relation $\Theta$ is consistent with a leaf-labeled tree $(T,\sigma)$ in the sense that there is a reconciliation map $\mu$ from $(T,\sigma)$ to some species tree such that $\Theta=\Theta(T,t_\mu)$. We first consider the special case $(T,t)=(T_\Theta,t_\Theta)$. As shown by Hellmuth and Wieseke (2016), it is possible to obtain the set of informative triples $\mathcal{S}(T_\Theta,t_\Theta,\sigma)$ directly from $\Theta$ using the following rule:
$\sigma(a)\sigma(b)|\sigma(c)\in\mathcal{S}(T_\Theta,t_\Theta,\sigma)$ if and only if $\sigma(a)$, $\sigma(b)$, and $\sigma(c)$ are pairwise different species and either
$ac,bc\in\Theta$ and $ab\notin\Theta$ or
$ab,ac,bc\in\Theta$ and there is a vertex $d\ne a,b,c$ with $cd\in\Theta$ and $ad,bd\notin\Theta$.
Theorem 1
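Let $\Theta$ be a cograph with vertex set L and associated cotree $(T_\Theta,t_\Theta)$ with leaf set L and let $\sigma\colon L\to L(S)$ be a leaf coloring. Then there exists a reconciliation map from $(T_\Theta,t_\Theta,\sigma)$ to some species tree S if and only if (i) $\mathcal{S}(T_\Theta,t_\Theta,\sigma)$ is compatible and (ii) the cograph $(\Theta,\sigma)$ is properly colored, i.e., for all $xy\in\Theta$ we have $\sigma(x)\ne\sigma(y)$.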
Proof
By Proposition 1, it is necessary and sufficient that (i) the set of informative triples $\mathcal{S}(T_\Theta,t_\Theta,\sigma)$ is compatible and (ii) the event map $t_\Theta$ is well-formed. Since $t_\Theta$ is the event labeling of the cotree, Condition (ii) amounts to requiring that the leaf sets $L(T_\Theta(v))$ have pairwise disjoint sets of colors for all children $v$ of every join node u. Since the join of the two cographs associated with $v'$ and $v''$ introduces an edge xy for all $x\in L(T_\Theta(v'))$ and all $y\in L(T_\Theta(v''))$, the resulting graph can only be properly colored if $\sigma(L(T_\Theta(v')))\cap\sigma(L(T_\Theta(v'')))=\emptyset$. On the other hand, every edge in $\Theta$ is the result of a join operation, thus $(\Theta,\sigma)$ can only be properly colored if joins only appear between induced subgraphs with disjoint color sets. Thus $t_\Theta$ is well-formed if and only if $\sigma$ is a proper vertex coloring for $\Theta$.
Under the assumption that a reconciliation map $\mu$ exists for $(T,\sigma)$ to some species tree, the next result shows that the orthology relation $\Theta(T,t_\mu)$ is always a subgraph of the orthology relation $\Theta(T,\hat{t}_T)$ implied by $(T,\sigma)$ and its extremal labeling $\hat{t}_T$.
Lemma 4
Let $(T,\sigma)$ be a leaf-labeled tree and $\mu$ a reconciliation map from $(T,\sigma)$ to some species tree S. Then $\Theta(T,t_\mu)\subseteq\Theta(T,\hat{t}_T)$.
Proof
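Let $xy\in\Theta(T,t_\mu)$ and suppose $u=\mathrm{lca}_T(x,y)$. Then, $t_\mu(u)=\bullet$ by definition of $\Theta(T,t_\mu)$, i.e., $\mu(u)\in W^0$. Therefore, Lemma 2 implies $\sigma(L(T(v')))\cap\sigma(L(T(v'')))=\emptyset$ for all distinct $v',v''\in\mathrm{child}(u)$. Hence, $\hat{t}_T(u)=\bullet$ by definition of the extremal event labeling and thus $xy\in\Theta(T,\hat{t}_T)$.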
The converse of Lemma 4 is generally not true, see Fig. 2 for an example. For later reference, we note the following result which is an immediate consequence of Lemma 4 due to the fact that orthology and paralogy relations are complementary.
Corollary 1
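Let $(T,\sigma)$ be a leaf-labeled tree and $\mu$ a reconciliation map from $(T,\sigma)$ to some species tree S. Then $\overline{\Theta}(T,\hat{t}_T)\subseteq\overline{\Theta}(T,t_\mu)$.
Lemma 4, in particular, implies that none of the labelings $t_\mu$ (provided by any reconciliation map $\mu$) can yield more speciation events in T than the extremal labeling $\hat{t}_T$. Moreover, it is easy to see that $t_\mu(u)=\bullet$ always implies $\hat{t}_T(u)=\bullet$, while $\hat{t}_T(u)=\square$ implies $t_\mu(u)=\square$.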
We briefly compare the formalism introduced here with the literature on maximum parsimony reconciliations. There, one considers reconciliation maps $\mu$ that map duplication events in T also to vertices of S. The mapping is then interpreted in such a way that the duplication event u took place along an edge in S that is ancestral to $\mu(u)$. The map $\mu$ in this setting does not completely determine the event labeling. The least common ancestor map
$$\mu(u):=\mathrm{lca}_S(\sigma(L(T(u)))) \qquad (2)$$
corresponds to one of the “most parsimonious reconciliations” (Górecki and Tiuryn 2006; Doyon et al. 2009) and can be obtained in polynomial time. A closely related reconciliation map can be defined in our setting. The LCA-reconciliation map introduced by Hellmuth (2017) satisfies the additional axiom
(LCA) $\mu(u)=v\,w$ with $w=\mathrm{lca}_S(\sigma(L(T(u))))$ for all $u$ with $t_\mu(u)=\square$, where v denotes the unique parent of $w$ in S.
The Axiom (LCA) is the analog of Eq. (2) for duplication vertices in T, which in our formalism are necessarily mapped to edges. For speciation events, the corresponding condition is expressed by (R3.i). Hellmuth (2017) showed that the existence of a reconciliation map from $(T,t,\sigma)$ implies also the existence of an LCA-reconciliation map. Figure 2 shows that an LCA-reconciliation map $\mu$ does not necessarily have $\hat{t}_T$ as its event labeling. Even if $t_\mu=\hat{t}_T$, then $\mu$ is not necessarily an LCA-reconciliation map, see Fig. 4.
Orthology and reciprocal best matches
In this section, we further clarify the relationship between the orthology relation and (reciprocal) best matches. As a main result, we find that the reciprocal best match graph contains any possible orthology relation.
Lemma 5
If $(T,\sigma)$ with leaf set L explains the RBMG $(G,\sigma)$ and $\hat{t}_T$ is the extremal event labeling of $(T,\sigma)$, then $\Theta(T,\hat{t}_T)$ is a subgraph of the RBMG $(G,\sigma)$.
Proof
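Consider a vertex $u\in V^0(T)$ with distinct children $v',v''\in\mathrm{child}(u)$. If $\hat{t}_T(u)=\square$, then none of the edges xy in G with $x\in L(T(v'))$ and $y\in L(T(v''))$ is contained in $\Theta(T,\hat{t}_T)$.
Now suppose $\hat{t}_T(u)=\bullet$. For $x\in L(T(v'))$ and $y\in L(T(v''))$ with $v'\ne v''$, we have $\mathrm{lca}_T(x,y)=u$ and, by construction of $\hat{t}_T$, $\sigma(x)\ne\sigma(y)$. In particular, $\hat{t}_T(u)=\bullet$ implies that all distinct children $v',v''\in\mathrm{child}(u)$ satisfy $\sigma(L(T(v')))\cap\sigma(L(T(v'')))=\emptyset$. Thus, $\mathrm{lca}_T(x,y)\preceq_T\mathrm{lca}_T(x,y')$ for all $y'$ with $\sigma(y')=\sigma(y)$ and $\mathrm{lca}_T(x,y)\preceq_T\mathrm{lca}_T(x',y)$ for all $x'$ with $\sigma(x')=\sigma(x)$, i.e., x and y are reciprocal best matches. Hence, $xy\in E(G)$ and thus $\Theta(T,\hat{t}_T)\subseteq E(G)$.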
Theorem 2
Let T and S be planted trees, $\sigma\colon L(T)\to L(S)$ a surjective map, and $\mu$ a reconciliation map from $(T,\sigma)$ to S. If $t_\mu(\mathrm{lca}_T(x,y))=\bullet$, then x and y are reciprocal best matches in $(T,\sigma)$.
Observation 1
Reciprocal best matches therefore cannot produce false negative orthology assignments as long as the evolution of a gene family proceeds via duplications, losses, and speciations only.
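The containment of the orthology relation in the RBMG can be checked on small examples. The sketch below is only illustrative and rests on two assumptions that are spelled out here because the corresponding formulas are not reproduced in this section: (i) y is a best match of x if no gene y' of the same species as y has a last common ancestor with x that lies strictly below lca(x, y), and (ii) the extremal event labeling marks an inner vertex as a speciation exactly when the species sets of its children are pairwise disjoint. All function and variable names are hypothetical.

```python
from itertools import combinations


def path_to_root(parent, v):
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, x, y):
    anc_y = set(path_to_root(parent, y))
    return next(a for a in path_to_root(parent, x) if a in anc_y)


def depth(parent, v):
    return len(path_to_root(parent, v)) - 1


def best_matches(parent, sigma):
    """Arcs (x, y) such that y is a best match of x (assumption (i))."""
    arcs = set()
    for x in sigma:
        for colour in set(sigma.values()) - {sigma[x]}:
            cands = [y for y in sigma if sigma[y] == colour]
            deepest = max(depth(parent, lca(parent, x, y)) for y in cands)
            arcs |= {(x, y) for y in cands
                     if depth(parent, lca(parent, x, y)) == deepest}
    return arcs


def rbmg(parent, sigma):
    """Undirected edges formed by reciprocal best matches."""
    arcs = best_matches(parent, sigma)
    return {frozenset((x, y)) for (x, y) in arcs if (y, x) in arcs}


def species_below(parent, sigma, u):
    return {sigma[x] for x in sigma if u in path_to_root(parent, x)}


def is_speciation(parent, sigma, u):
    """Extremal labeling (assumption (ii)): children have disjoint species sets."""
    kids = [v for v, p in parent.items() if p == u]
    return all(
        species_below(parent, sigma, v).isdisjoint(species_below(parent, sigma, w))
        for v, w in combinations(kids, 2)
    )


def orthology(parent, sigma):
    return {frozenset((x, y)) for x, y in combinations(sigma, 2)
            if is_speciation(parent, sigma, lca(parent, x, y))}


# Hypothetical example: a duplication at the root d followed by complementary losses.
parent = {"a1": "u", "b1": "u", "a2": "v", "c1": "v", "u": "d", "v": "d"}
sigma = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
G = rbmg(parent, sigma)
Theta = orthology(parent, sigma)
```

Here `Theta` consists of the pairs a1b1 and a2c1, and `G` contains one additional edge, b1c1, whose last common ancestor is the duplication at the root. No orthologous pair is missing from `G`.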
The “false positive” edges in the RBMG compared to the orthology relation are the consequence of a particular class of duplication events:
Theorem 3
Let be a leaf- and event-labeled gene tree, and its corresponding RBMG and orthology relation, respectively. Moreover, let , , and such that , . Then, if and only if and , where “” denotes the usual symmetric set difference.
Proof
Suppose first . By definition of , we immediately find . Since , i.e., a and b are reciprocal best matches, it must hold for any of color . Hence, . Analogously, we conclude and thus, .
Conversely, assume and . Since , a and b cannot be orthologs, i.e., . Moreover, in particular implies and therefore, for any with . Hence, b is a best match for a in species . One similarly concludes that a is a best match for b. Hence, a and b are reciprocal best matches, which concludes the proof.
In practical applications we usually do not know the event-labeled gene tree. It is possible, however, to compute the reciprocal best matches directly from sequence data. Therefore, it is of interest to investigate the relationship of reciprocal best match graphs and orthology relations.
Definition 8
(Geiß et al. 2019b) A tree is least resolved (w.r.t. the RBMG that it explains) if the contraction of any inner edge implies .
Since is completely determined by we can drop the reference to and often simply speak about a “least resolved tree”.
Lemma 6
Let be an RBMG that is explained by . If is least resolved w.r.t. , then every inner edge satisfies .
Proof
For contraposition, assume that there is an inner edge with . Hence, for all and we have and . It is easy to see that all such x and y form a reciprocal best match and thus, . Clearly, x and y also form a reciprocal best match in and thus, each edge with and is contained in . Since we have not changed the relative ordering of the last common ancestors of the remaining vertices, all edges in E(G) are contained in .
The converse of Lemma 6 is not necessarily true. As an example, consider an inner edge with . It is easy to see that e can be contracted.
Lemma 6 implies that if is least resolved w.r.t. and such that u is incident to some other inner vertex , then there is a child of u which satisfies . By construction of we have . The latter observation also implies the following:
Corollary 2
Suppose that is least resolved w.r.t. and let be the extremal event labeling for . Then if and only if all children of u are leaves that are from pairwise distinct species.
Lemma 7
Let be some least resolved tree (w.r.t. some RBMG) with extremal event map and let S(W, F) be a species tree with . Then there is a reconciliation map such that .
Proof
By Cor. 2, every inner vertex u with is only incident to leaves from pairwise distinct species. However, this implies that the set of informative species triples is empty, and thus, compatible. Hence, Proposition 1 implies that there is a reconciliation map from to any species tree S, defined by , for every inner vertex that is incident to another inner vertex in T, and for any inner vertex v that is only incident to leaves that are from pairwise distinct species, and for all leaves of T. By construction of , we have with specified by Def. 4 for all .
Corollary 3
Let be a least resolved tree explaining a co-RBMG . Then is a disjoint union of cliques.
Proof
By Cor. 2 all children of a speciation node u w.r.t. are leaves from pairwise distinct species. Thus the leaves L(T(u)) form a complete subgraph in . On the other hand, no ancestor of u is a speciation, i.e., there is no edge ab with and . Thus is a disjoint union of the cliques formed by the L(T(u)) with possibly together with isolated vertices that are not children of any speciation node in .
Suppose that we know the orthology relation that is obtained from a least resolved tree that explains the RBMG . Lemma 7 implies that there is always a reconciliation map from to any species tree S with such that is determined by as in Def. 4. Now we can apply Theorem 2 to conclude that all orthologous pairs in are reciprocal best matches. In other words, all complete subgraphs of are also induced subgraphs of the underlying RBMG . Hence, is obtained from by removing edges such that the resulting graph is the disjoint union of cliques, see the top-right tree in Fig. 5 for an example. However, Fig. 5 also shows that many edges have to be removed to obtain .
This observation establishes the precise relationship of orthology detection and clustering, since (graph) clustering can be interpreted as the graph editing problem for disjoint unions of complete graphs (Böcker et al. 2011). In many orthology prediction tools, such as OMA (Roth et al. 2008), orthologs are summarized as clusters of orthologous groups (COGs) (Tatusov et al. 1997) that are obtained from reciprocal best matches.
The results above show that the RBMGs contain the orthology relation. Equivalently, RBMGs imply constraints on the event labeling. We also observe that the RBMGs cannot provide conclusive evidence regarding edges that must correspond to orthologous pairs. In the following sections we consider the constraints implied by the detailed structure of RBMGs or BMGs in more detail.
Classification of RBMGs
The structure of RBMGs has been studied in extensive detail by Geiß et al. (2019b). Although we do not have an algorithmically useful complete characterization of RBMGs, there are partial results that can be used to identify different subclasses of RBMGs based on the structure of the connected components of the 3-colored subgraphs (Geiß et al. 2019b, Thm. 7). Let be the set of the connected components of the induced subgraphs on three colors of an RBMG . Then every is precisely of one of the three types (Geiß et al. 2019b, Thm. 5):
Type (A) contains a K3 on three colors but no induced P4.
Type (B) contains an induced P4 on three colors whose endpoints have the same color, but no induced cycle Cn on n ≥ 5 vertices.
Type (C) contains an induced cycle C6, called hexagon, such that any three consecutive vertices have pairwise distinct colors.
The graphs for which all are of Type (A) are exactly the RBMGs that are cographs, or co-RBMGs for short (Geiß et al. 2019b, Thm. 8 and Remark 2). Intuitively, these have a close connection to orthology graphs because orthology graphs are cographs.
Connected components of Type (B) and Type (C), on the other hand, contain induced P4s and thus are neither cographs nor connected components of cographs. Obs. 1 implies that RBMGs that contain connected components of Type (B) or Type (C) introduce false positive edges into estimates of the orthology relation. In Sect. 6 below we will address the question to what extent and how such false-positive edges can be identified. We distinguish here co-RBMGs, (B)-RBMGs, and (C)-RBMGs depending on whether contains only Type (A) components, at least one Type (B) but not Type (C) component, or at least one Type (C) component.
Co-RBMGs have a convenient structure that can be readily understood in terms of hierarchically colored cographs (hc-cographs) introduced by Geiß et al. (2019b, Sect. 7).
Definition 9
An undirected colored graph is a hierarchically colored cograph (hc-cograph) if
(K1) , i.e., a colored vertex, or
(K2) and , or
(K3) and ,
where both and are hc-cographs and for any for .
Not all properly colored cographs are hc-cographs, see e.g. Geiß et al. (2019b) for counterexamples. However, for each cograph G, there exists a coloring (with a sufficient number of colors) such that is an hc-cograph.
Proposition 3
(Thm. 9 in (Geiß et al. 2019b)) A graph is a co-RBMG if and only if it is an hc-cograph.
Since orthology relations are necessarily cographs we can interpret Proposition 3 as a necessary condition for an RBMG to correctly represent orthology.
The recursive construction of in Def. 9 also defines a corresponding hc-cotree whose leaves are the vertices of , i.e., the appearing in (K1). Each internal node u of corresponds to either a join (K2) or a disjoint union (K3) and is labeled by such that if u represents a join, and if u corresponds to a disjoint union. Each inner vertex u of represents the induced subgraph .
Proposition 4
(Thm. 10 in (Geiß et al. 2019b)) Every co-RBMG is explained by its hc-cotree .
Now let be the hc-cotree of a co-RBMG . Note that the structure of is solely determined by the hc-cograph structure of . Somewhat surprisingly, the mathematical structure of the hc-cotree and, in particular, its coloring has a simple biological interpretation. Consider . If in the hc-cotree, then in agreement with Lemma 2. On the other hand, if , then (K3) implies , in which case u indeed must be a duplication from the biological point of view (contraposition of Lemma 2).
The hc-cotree of will in general not be discriminating and it is not necessarily possible to reduce to a discriminating hc-cotree that still explains . Although it is always possible to contract edges uv of with (cf. Geiß et al. 2019b, Cor. 11), there are examples where edges uv with cannot be contracted to obtain a tree that still explains (cf. Geiß et al. 2019b, Fig. 15). We refer to Geiß et al. (2019b) for more details and a characterization of edges that are contractable. It is of interest, therefore, to ask whether there are true orthology relations that are not hc-cographs, or equivalently, when does a discriminating hc-cotree that is obtained by edge-contraction from a given hc-cotree still explain an RBMG ? To answer this question we first provide
Definition 10
A tree contains no losses, if for all with we have for all .
Theorem 4
Let be a leaf-labeled tree such that there is a reconciliation map to some species tree and assume that does not contain losses. Then
(1) The RBMG explained by equals the colored cograph .
(2) The unique discriminating cotree of explains the RBMG .
Proof
To simplify the notation, we set and .
We start with proving Statement (1). By Theorem 2, is a subgraph of and , hence it suffices to show that every edge is also contained in E(H). Assume, for contradiction, that this is not the case, i.e., , and thus for . Since has no losses, we have for all , and thus and for some pair of distinct children of x. From we know that there is a vertex with . Thus, for some , which implies that ; a contradiction. We conclude that if and only if and thus .
Let us now turn to Statement (2). In order to show that explains the RBMG we first note that, since is a cograph by Statement (1), there is a unique discriminating cotree for . Furthermore, is obtained from any cotree for by contracting all edges uv in T with (Hellmuth et al. 2013). It remains to show that ab is an edge in if and only if ab forms a reciprocal best match in .
First consider duplications. Suppose, we have contracted the edge xv with . By assumption, for all children of v we have . Moreover, since is the union of species of its children w, we have . Hence, after contraction of xv, the vertices and are now children of x and still satisfy . In particular, for every child w of x. By induction on the number of contracted edges, every vertex x in with still satisfies for all children of x in . Thus, the same argument as in the proof of Statement (1) implies that ab cannot be a reciprocal best match in for all and . We also have for and , and thus . Since is a cotree for the cograph , implies . Therefore, unless a and b form a reciprocal best match in .
Let us now turn to speciation vertices. Lemma 47 in (Geiß et al. 2019b) states, in particular, that all non-discriminating edges uv with can be contracted to obtain a tree that still explains . Thus, if a and b are reciprocal best matches in , then . We conclude, therefore, that if and only if a and b are reciprocal best matches in .
Prop. 3 shows that if the no loss condition of Def. 10 holds, then is a co-RBMG, an hc-cograph, and an orthology relation.
The no loss condition of Def. 10 is very restrictive, however, and thus in general will not be satisfied in real-life data. Theorem 1 shows that orthology relations correspond to properly colored cographs with compatible sets of the informative triples. The characterization of co-RBMGs in (Geiß et al. 2019b), on the other hand, shows that only hc-colorings may appear. Since the requirement that is a proper coloring already implies disjointness of the color sets for join operations, we can interpret the hc-coloring condition as a condition on duplication vertices. The offending vertices are exactly those for which (i) and (ii) there are two children such that both and . In this case, there is a pair of species such that a different “paralog group” (that is, a lineage of genes descending from a duplication) is missing in each of them. Every pair of vertices with and with forms a best match and thus a false positive orthology assignment. Since an RBMG is a cograph only if it is hierarchically colored, the presence of such duplications implies that the RBMG is not a cograph. At least in principle, therefore, it should be possible to identify the false positive edges by means of a suitable cograph-editing approach.
Before closing this section, we briefly return to the existence of reconciliation maps. Since every hc-cograph is a properly colored cograph, Theorem 1 immediately implies
Corollary 4
Let be an hc-cograph with vertex set L and associated hc-cotree with leaf set L. Then there exists a reconciliation map from to some species tree S if and only if is compatible.
By Cor. 4, it is not necessarily possible to reconcile a (discriminating) hc-cotree with any species tree. An example is shown in Fig. 5. To be more precise, the hc-cotree in Fig. 5 yields the conflicting species triples AB|C and AC|B. Hence, Prop. 1 implies that cannot be reconciled with any species tree even though explains the RBMG . One can contract edges of to obtain a least resolved tree that still explains , see Fig. 5 (top right). In agreement with Lemma 7, and thus, there is always a reconciliation map from to any species tree S with . Moreover, in agreement with Theorem 2, all orthologous pairs in are best matches. Although explains , the two graphs and are very different. In particular, by Corollary 3, is the disjoint union of cliques.
Observation 2
In general it is not necessary to edit to a disjoint union of cliques to obtain a valid orthology relation.
An example is provided by the tree in Fig. 5. Obviously, is not the disjoint union of cliques. Moreover, AB|C is the only informative triple displayed by where A, B, and C correspond to the red, blue and green species, respectively. Prop. 1 implies that can be reconciled with any species tree that displays AB|C. In other words, is already “biologically feasible” and there is no need to remove further edges from .
Non-orthologous reciprocal best matches
In this section we investigate to what extent false positive orthology assignments in the reciprocal best match graph can be identified. Since the orthology relation must be a cograph, it is natural to consider the smallest obstructions, i.e., induced P4s, in more detail. First we note that every induced P4 in an RBMG contains either three or four distinct colors (Geiß et al. 2019b, Sect. E). Each P4 in an RBMG spans an induced subgraph of every BMG that contains as its symmetric part. These induced subgraphs of a BMG with four vertices are known as quartets. With respect to a fixed BMG, every induced P4 belongs to one of three distinct types which are defined in terms of its coloring and the quartet in which it resides. An induced P4 with edges ab, bc, and cd is denoted by or, equivalently, .
Definition 11
Let be a BMG explained by the tree , with symmetric part and let with and pairwise distinct colors , , and . The set Q, resp., the induced subgraph is
a good quartet if (i) is an induced in and (ii) and ,
a bad quartet if (i) is an induced in and (ii) and , and
an ugly quartet if is an induced in .
If Q is a good, bad, or ugly quartet we will refer to the underlying induced P4 as a good, bad, or ugly quartet, respectively. Lemma 32 of (Geiß et al. 2019b) states that every quartet Q in an RBMG that is contained in a BMG is either good, bad, or ugly. An example of an RBMG containing good, bad, and ugly quartets is shown in Fig. 6. Note that good, bad, and ugly quartets cannot appear in RBMGs of Type (A). These are cographs and thus by definition do not contain induced P4s.
The location of good quartets (in contrast to bad and ugly quartets) turns out to be strictly constrained. This fact can be used to show that the “middle” edge of any good quartet must be a false positive orthology assignment:
Lemma 8
Let be some leaf-labeled tree and the extremal event labeling for . If is a good quartet in the BMG , then for .
Proof
Lemma 36 of Geiß et al. (2019b) implies that for a good quartet in with there are two distinct children such that and . Thus, in particular, and must be inner vertices in . Since by definition of a good quartet, we have . Hence, by definition of (cf. Definition 6).
As an immediate consequence of Lemma 8 and Cor. 1, an analogous statement is true for event labelings for a given reconciliation map:
Corollary 5
Let T and S be planted trees, a surjective map, and a reconciliation map from to S. If is a good quartet in the BMG , then for .
Given an RBMG that contains a good quartet (w.r.t. the underlying BMG ), the edge yz therefore always corresponds to a false positive orthology assignment, i.e., it is not contained in the true orthology relation .
Not all false positives can be identified in this way from good quartets, however. The RBMG in Fig. 7, for instance, contains only one good quartet, that is . After removal of the false positive edge , the remaining undirected graph still contains the bad quartet , hence, in particular, it still contains an induced P4 and is, therefore, not an orthology relation.
Neither bad nor ugly quartets can be used to unambiguously identify false positive edges. For an example, consider Fig. 7. The two 3-RBMGs and both contain the bad quartet . As a consequence of Lemma 2, neither the root of nor the root of can be labeled by a speciation event. Hence, as reside all in different subtrees below the root of , all edges in correspond to false positive orthology assignments. On the other hand, the vertices and reside within the same 2-colored subtree below the root of and are incident to the same parent in . Therefore, one easily checks that there exist reconciliation scenarios where and are orthologous, hence the edge must indeed be contained in the orthology relation. Similarly, and are ugly quartets in and , respectively. By the same argumentation as before, the edges , , and are false positives in . For , however, there exist reconciliation scenarios, where and are orthologs.
Cor. 9 of Geiß et al. (2019b), finally, implies that every (B)-RBMG and every (C)-RBMG contains at least one good quartet. In particular, therefore, there is at least one false positive orthology assignment that can be identified with the help of good quartets. We shall see in Sect. 7.2, using simulated data, that in practice the overwhelming majority of false positive orthology assignments is already identified by good quartets.
From a theoretical point of view it is interesting nevertheless that it is possible to identify even more false positive orthology assignments starting from Lemma 2. It implies that whenever x and y are located in two distinct leaf sets defined for the same connected component of an induced 3-RBMG of Type (B) or (C). Details can be found in (Geiß et al. 2019b, Lemma 25) and the Supplemental Material. At least in our simulation data, scenarios of this type that are not already covered by a good quartet seem to be exceedingly rare, and hence of little practical relevance.
Simulations
Although the edges in the RBMG cannot identify orthologous pairs with certainty (as a consequence of Lemma 3), there is a close resemblance in practice, i.e., for empirically determined scenarios. In order to explore this connection in more detail, we consider simulated evolutionary scenarios . These uniquely determine both the (reciprocal) best match graph and , resp., and the orthology graph , thus allowing a direct comparison of these graphs. Since we only analyze scenarios , we did not use simulation tools such as ALF (Dalquén et al. 2011) that are designed to simulate sequence data.
Simulation methods
In order to simulate evolutionary scenarios we employ a stepwise procedure:
(1) Construction of the species tree S. We regard S as an ultrametric tree, i.e., its branch lengths are interpreted as real time. Given a user-defined number of species N we generate S under the innovations model as described by Keller-Schmidt and Klemm (2012). The binary trees generated by this model have similar depth and imbalances as those of real phylogenetic trees from databases.
(2) Construction of the true gene tree. Traversing the species tree S top-down, one gene tree is generated with user-defined rates for duplications, for losses, and for horizontal transfer events. For each type of event, the number of events along each edge of the species tree is drawn from a Poisson distribution with parameter , where is the length of the edge e and is the rate of the event type. Duplication and horizontal transfer events duplicate an active lineage and occur only inside edges of S. For duplications, both offspring lineages remain inside the same edge of the species tree as the parental gene. In contrast, one of the two offspring of an HGT event is transferred to another, randomly selected, branch of the species tree at the same time. At speciation nodes all branches of the gene tree are copied into each offspring. Loss events terminate branches of . Loss events may occur only within edges of the species tree that harbor more than one branch of the gene tree. Thus every leaf of S is reached by at least one branch of the gene tree . All vertices v of are labeled with their event type t(v); in particular, there are different leaf labels for extant genes and lost genes. The simulation explicitly records the reconciliation map, i.e., the assignment of each vertex of to a vertex or edge of S.
(3) Construction of the observable gene tree T from the true gene tree. The leaves of are either observable extant genes or unobservable losses. As described by Hernandez-Rosales et al. (2012), we prune in bottom-up order by removing all loss events and omitting all inner vertices with only a single remaining child.
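The pruning step that turns the true gene tree into the observable gene tree can be sketched as follows. This is a minimal illustration under assumed data structures (a `children` dictionary and a set `lost` of loss leaves, both hypothetical names); it is not the simulation code used for the results reported here, and planted roots are not treated specially.

```python
def observable_tree(root, children, lost):
    """Remove loss leaves and suppress inner vertices left with one child.

    children: dict mapping each inner vertex to the list of its children
    lost:     set of leaves that represent loss events
    Returns (new_root, pruned_children); new_root is None if no gene survives.
    """
    pruned = {}

    def visit(v):
        kids = children.get(v, [])
        if not kids:                       # leaf: keep extant genes only
            return None if v in lost else v
        # children are processed first, so the pruning proceeds bottom-up
        kept = [k for k in map(visit, kids) if k is not None]
        if not kept:                       # whole subtree was lost
            return None
        if len(kept) == 1:                 # degree-two vertex: omit it
            return kept[0]
        pruned[v] = kept
        return v

    return visit(root), pruned


# Hypothetical true gene tree with three loss leaves x1, x2, x3.
children = {
    "r": ["d", "c1"],
    "d": ["s1", "s2"],
    "s1": ["a1", "x1"],
    "s2": ["x2", "x3"],
}
new_root, pruned = observable_tree("r", children, lost={"x1", "x2", "x3"})
```

In the example, the subtree below `s2` disappears entirely, `s1` and `d` are suppressed because each retains a single child, and the observable tree consists of the root `r` with the two leaves `a1` and `c1`.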
Using steps (1) and (2), we simulated 10,000 scenarios for species trees with 3 to 100 species (=leaves) and an additional 4000 scenarios for species trees with 3 to 50 leaves, drawn from a uniform distribution. For each of these species trees, exactly one gene tree was simulated as described above. The rate parameters were varied between 0.65 and 0.99 in steps of 0.01 for duplication and loss events. For HGTs, a rate in the range between 0.1 and 0.24, again in steps of 0.01, was used. A detailed list of all simulated scenarios can be found in the Supplemental Material. For each of the 14,000 true gene trees the total number of speciation events, of losses, of duplications, and of HGTs was determined. Summary statistics of the simulated scenarios are compiled in the Supplemental Material.
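For illustration, drawing the number of events on a single species-tree edge might look as follows. The sketch assumes that the Poisson parameter is the product of the event rate and the edge length, and it uses Knuth's multiplication method so that only the Python standard library is needed. The function names, the rate values in the example, and the edge length are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method, suitable for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def draw_events(rng, edge_length, rates):
    """Number of events of each type along one edge of the species tree.

    Assumption: the Poisson parameter is rate * edge_length.
    """
    return {event: poisson(rng, rate * edge_length)
            for event, rate in rates.items()}


rng = random.Random(0)  # fixed seed for reproducibility of the example
rates = {"duplication": 0.80, "loss": 0.75, "hgt": 0.15}
counts = draw_events(rng, edge_length=2.0, rates=rates)
```

The returned dictionary only contains event counts. Placing the events along the edge and enforcing the restriction that losses occur only where more than one gene lineage is present would require additional bookkeeping that is not shown.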
From each true gene tree we extracted the observable gene tree T as described in Step (3). For all retained vertices the reconciliation map and thus the event labeling remains unchanged. Since for all extant genes , it suffices to consider T. The leaf coloring map is obtained from its definition, i.e., setting for all . We can now extract the orthology relation and reciprocal best match relation from each scenario.
The orthology relation is easily constructed from the event labeled gene tree (T, t), since if and only if . An efficient way to compute and the RBMG that avoids the explicit evaluation of is described in the Supplemental Material. For each reconciliation scenario , we also identify all good quartets in the BMG and then delete the middle edge of the corresponding from the RBMG . The resulting graph will be referred to as .
Simulation results for duplication/loss scenarios
In order to assess the practical relevance of co-RBMGs we measured the abundance of non-cograph components in the simulated RBMGs. More precisely, we determined for each simulated RBMG the connected components of its restrictions to any three distinct colors and determined whether these components are cographs, graphs of Type (B), or graphs of Type (C). In order to identify these graph types, we used the algorithms of Hoàng et al. (2013) to first identify an induced P4 belonging to a good quartet. If one exists, we check for the existence of an induced P6 and then test whether its endpoints are connected, thus forming a hexagon characteristic of a Type (C) graph. Otherwise, the presence of the P4 implies Type (B), while the absence of induced P4s guarantees that the component is a cograph.
We did not encounter a single Type (C) component in 14,000 simulated scenarios. As we shall see this is a consequence of the fact that all simulated trees are binary. To see this, we consider the structure of connected 3-RBMGs of Type (C) in some more detail, generalizing some technical results by Geiß et al. (2019b):
Lemma 9
Let be a connected 3-RBMG containing the induced with three distinct colors r, s, and t such that , , and . Then, every tree that explains must satisfy the following property: There exist distinct where such that either , , or , , .
Proof
If , then, due to the connectedness of , at least one of the six vertices of the induced is adjacent to more than one vertex of one of the colors r, s, t, hence the first statement immediately follows from Lemma 39(iii) in Geiß et al. (2019b). Now consider the special case . By Cor. 9 of Geiß et al. (2019b), contains a good quartet. W.l.o.g. let be a good quartet, thus and . This, in particular, implies , thus there are distinct children such that and . Moreover, as and , we have , hence . Now consider . Since and , it must hold , hence . Assume, for contradiction, that . Then, as and , we clearly have . However, this implies , contradicting . We therefore conclude that there must exist a vertex such that . One easily checks that this implies , which completes the proof.
Theorem 5
If is a binary leaf-labeled tree, then does not contain a connected component of Type (C).
Proof
By Obs. 6 of (Geiß et al. 2019b), the restriction of explains the subgraph of that is induced by vertices with color r, s, or t. Thm. 2 of (Geiß et al. 2019b) shows, furthermore, that every connected component of is explained by restriction of to the corresponding vertices. Now suppose is binary. Then both and are also binary. By contraposition of Lemma 9, no as specified in Lemma 9 can be explained by , and thus cannot contain a connected component of Type (C).
Although events that generate more than two offspring lineages are logically possible in real data, most multifurcations in phylogenetic trees are considered to be “soft polytomies”, arising from data that are insufficient to produce a fully resolved, binary tree (Purvis and Garland Jr. 1993; Kuhn et al. 2011; Sayyari and Mirarab 2018). Type (C) 3-RBMGs thus should be very unlikely under biologically plausible assumptions on the model of evolution. Here we only consider the abundance of Type (B) components relative to all Type (A) and (B) components. We denote their ratio by . The results are summarized in Fig. 8. We find that is usually below 20% and increases with the number of loss and HGT events. More precisely, 83.47% of the 14,000 scenarios have at least one Type (B) component and 16.53% do not have Type (B) components at all. Among all 3-colored connected components taken from the restrictions to any three colors, 94.41% are of Type (A) and 5.59% are of Type (B).
A graph G is called P4-sparse if every induced subgraph on five vertices contains at most one induced P4 (Jamison and Olariu 1992). The interest in P4-sparse graphs derives from the fact that the cograph editing problem is solvable in linear time for P4-sparse graphs (Liu et al. 2012). It is of immediate practical interest, therefore, to determine the abundance of P4-sparse RBMGs that are not cographs. Among the 14,000 simulated scenarios, we found that about 20.9% of the 3-colored Type (B) components are P4-sparse, while the majority contains “overlapping” P4s. We then investigated the corresponding S-thin graphs. An undirected colored graph is called S-thin if no distinct vertices are in relation S. Two vertices a and b are in relation S if N(a) = N(b) and σ(a) = σ(b). Somewhat surprisingly, this yields a reversed situation, where more than two thirds of the S-thin 3-colored Type (B) components are now P4-sparse, while only a minority of 31.32% is not P4-sparse. An example of an undirected colored graph and its corresponding S-thin version , which we found during our simulations, is shown in Panel (B) of Fig. 9.
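The defining property of P4-sparse graphs (Jamison and Olariu 1992) can be tested directly, if inefficiently, by brute force. The following sketch is meant only as an executable restatement of that definition; the adjacency-dictionary encoding and the function names are assumptions, and the enumeration over all five-element vertex subsets is far slower than the recognition algorithms one would use on real data.

```python
from itertools import combinations


def induces_p4(adj, quad):
    """True if the four vertices in quad induce a path on four vertices.

    Among graphs on four vertices, the degree sequence (1, 1, 2, 2)
    is realized only by the path P4.
    """
    degrees = sorted(sum(1 for w in quad if w in adj[v]) for v in quad)
    return degrees == [1, 1, 2, 2]


def is_p4_sparse(adj):
    """Every induced subgraph on five vertices has at most one induced P4."""
    return all(
        sum(induces_p4(adj, quad) for quad in combinations(five, 4)) <= 1
        for five in combinations(adj, 5)
    )


# A path on five vertices contains two overlapping induced P4s.
p5 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
# A P4 together with an isolated vertex contains exactly one.
p4_plus_k1 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: set()}

p5_sparse = is_p4_sparse(p5)
p4_plus_k1_sparse = is_p4_sparse(p4_plus_k1)
```

The path on five vertices is the smallest example with “overlapping” P4s and is therefore not P4-sparse, whereas a single P4 with an additional isolated vertex is.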
Next we investigated the relationship of the RBMG and the orthology graph (see Fig. 10). We empirically confirmed that in the absence of HGT (not shown). Also following our expectations, the fraction of false-positive orthology predictions in an RBMG is small as long as duplications and losses remain moderate (l.h.s. panel in Fig. 10). Most of the false positive orthology calls are associated with large numbers of losses for a given number of duplications.
We find that good quartets eliminate nearly all false positive edges from the RBMG and leave a nearly perfect orthology graph (r.h.s. panel in Fig. 10). As we have seen so far, reciprocal best matches indeed form an excellent approximation of orthology in duplication-loss scenarios. In particular, the good quartets identify nearly all false positive edges, making it easy to remove the few remaining using a generic cograph editing algorithm (Liu et al. 2012).
Outlook: evolutionary scenarios with horizontal gene transfer
The benign results above beg the question how robust they are under HGT. Gene family histories with HGT have been a topic of intense study in recent years (Doyon et al. 2010; Tofigh et al. 2011; Bansal et al. 2012; Nøjgaard et al. 2018). Following the so-called DTL-scenarios as proposed, e.g., by Tofigh et al. (2011) and Bansal et al. (2012), we relax the notion of reconciliation maps, since ancestry is no longer preserved. We replace Axiom (R2) by
(R2w) Weak Ancestor Preservation. If , then either or and are incomparable w.r.t. .
and add the following constraints
(R3.iii) Addition to the Speciation Constraint. If , then for all .
(R4) HGT Constraint. If x has a child y such that and are incomparable, then x also has a child with .
Property (R2w) equivalently states that if , then we must not have , which would invert the temporal order. Property (R3.iii) (which follows from (R2) but not from (R2w)) ensures that the children of speciation events are still mapped to positions that are comparable to the image of the speciation node. Condition (R4), finally, requires that every horizontal transfer event also has a vertically inherited offspring. Note that condition (R4) is void if (R2) holds. In summary the axioms (R0), (R1), (R2w), (R3.i), (R3.ii), (R3.iii), and (R4) are a proper generalization of Def. 3. We note that these axioms are not sufficient to ensure time consistency, however. We refer to Nøjgaard et al. (2018) for details. Our choice of axioms also rules out some scenarios that may appear in reality (or simulations), but which are not observable when only evolutionary divergence is available as measurement. For example, Condition (R4) excludes scenarios in which HGT events have no surviving vertically inherited offspring.
We furthermore extend the event map t for a gene tree T to include HGT as an additional event type denoted by the symbol . We define such that if and only if u has a child v such that and are incomparable. Since the offspring of an HGT event are not equivalent, it is useful to introduce an edge labeling such that if and are incomparable w.r.t. . This edge labeling is investigated in detail by Geiß et al. (2018) as the basis of Fitch’s xenology relation. Alternatively, the asymmetry can be handled by enforcing an ordering of the vertices, see (Hellmuth et al. 2017).
Evolutionary scenarios with horizontal transfer may lead to a situation where two genes x, y in the same species, i.e., with $\sigma(x) = \sigma(y)$, derive from a speciation, i.e., $t(\mathrm{lca}_T(x,y)) = \bullet$. This is the case when the two lineages underwent an HGT event that transferred a copy back into the lineage in which the other gene has been vertically transmitted. We call such genes xeno-orthologs and exclude them from the orthology relation, see Fig. 11. This choice is motivated (1) by the fact that, by definition, genes of the same species cannot be recognized as reciprocal best matches, and (2) by the observation that, from a biological perspective, they behave rather like paralogs. In scenarios with HGT we therefore modify the definition of the orthology graph $\Theta$ such that its edge set is replaced by
$$E(\Theta) = \{\, xy \mid t(\mathrm{lca}_T(x,y)) = \bullet \text{ and } \sigma(x) \neq \sigma(y) \,\} \qquad (3)$$
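The modified orthology graph can be computed directly from an event-labeled gene tree. The sketch below is an illustration under assumed conventions: the tree is a child-to-parent dictionary, events are the strings "speciation", "duplication" and "hgt", and `orthology_edges` is a hypothetical helper name.

```python
from itertools import combinations


def lca(parent, x, y):
    """Last common ancestor of x and y in a tree given as child -> parent dict."""
    seen = {x}
    while x in parent:
        x = parent[x]
        seen.add(x)
    while y not in seen:
        y = parent[y]
    return y


def orthology_edges(parent, event, sigma):
    """Pairs of genes from different species whose lca is a speciation."""
    edges = set()
    for x, y in combinations(sigma, 2):
        if sigma[x] != sigma[y] and event[lca(parent, x, y)] == "speciation":
            edges.add(frozenset((x, y)))
    return edges


# u is a speciation; w is an HGT event that sends copy a2 back into species A.
parent = {"a1": "u", "w": "u", "b": "w", "a2": "w"}
event = {"u": "speciation", "w": "hgt"}
sigma = {"a1": "A", "a2": "A", "b": "B"}
print(orthology_edges(parent, event, sigma))
```

Here a1 and a2 derive from the speciation u but reside in the same species, so they are xeno-orthologs and are excluded; only the pair a1, b remains.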
The extremal map as in Def. 6 cannot easily be extended to include HGT, as the events $\bullet$ and $\square$ on some vertex u are solely defined on two exclusive cases: either $\sigma(L(T(u_1)))$ and $\sigma(L(T(u_2)))$ are disjoint or not for $u_1, u_2 \in \mathsf{child}(u)$. Both cases, however, can also appear when we have HGT (see Fig. 11 for an example). That is, whether $\sigma(L(T(u_1)))$ and $\sigma(L(T(u_2)))$ are disjoint or not does not help to unambiguously identify the event types in the presence of HGT.
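Prop. 1 can be generalized to the case that $(T,t,\sigma)$ contains HGT events. The existence of reconciliation maps from an event-labeled tree $(T,t,\sigma)$ to an unknown species tree can be characterized in terms of species triples that can be derived from $(T,t,\sigma)$ as follows: Denote by $\mathcal{E}$ the set of all transfer edges in the labeled gene tree $(T,t,\sigma)$ and let $T_{\bar{\mathcal{E}}}$ be the forest obtained from $T$ by removing all transfer edges. By definition, $\mu(x)$ and $\mu(y)$ are incomparable for every transfer edge xy in T. The set $\mathcal{S}(T,t,\sigma)$ is the set of triples $\sigma(a)\sigma(b)|\sigma(c)$ where $\sigma(a)$, $\sigma(b)$, $\sigma(c)$ are pairwise distinct and either
(1) ab|c is a triple displayed by a connected component $T'$ of $T_{\bar{\mathcal{E}}}$ such that the root of the triple is a speciation event, i.e., $t(\mathrm{lca}_{T'}(a,b,c)) = \bullet$,
(2) or $a, b \in L(T_{\bar{\mathcal{E}}}(x))$ and $c \in L(T_{\bar{\mathcal{E}}}(y))$ for some transfer edge xy or yx of T.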
Proposition 5
(Hellmuth 2017) Let $(T,t,\sigma)$ be an event-labeled, leaf-labeled tree. Then there is a reconciliation map from $(T,t,\sigma)$ to some species tree S if and only if $\mathcal{S}(T,t,\sigma)$ is compatible. In this case, $(T,t,\sigma)$ can be reconciled with every species tree S that displays the triples in $\mathcal{S}(T,t,\sigma)$.
Here, we have not added additional constraints on reconciliation maps that ensure that the map is also “time-consistent”, that is, genes do not travel “back” in the species tree, see (Nøjgaard et al. 2018) for further discussion on this. However, Prop. 5 gives at least a necessary condition for the existence of time-consistent reconciliation maps. A simple proof of Prop. 5 for the case that T is binary and does not contain HGT events can be found in (Hernandez-Rosales et al. 2012). Moreover, generalizations of reconciling event-labeled gene trees with species networks have been established by Hellmuth et al. (2019).
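The criterion of Prop. 5 can be sketched in two steps: collect the species triples, then test them for compatibility. The code below is an illustration, not the authors' implementation. It assumes that the second condition of the triple definition reads "a and b are leaves below one end of a transfer edge and c is a leaf below the other end, both taken in the forest without transfer edges", it uses the classical BUILD recursion of Aho et al. as the compatibility test, and all function names and the example tree are hypothetical.

```python
from itertools import combinations


def species_triples(parent, event, sigma, transfer):
    """Species triples of an event-labeled gene tree with transfer edges.

    parent: child -> parent dict; event: inner vertex -> event label;
    sigma: leaf -> species; transfer: set of (parent, child) transfer edges.
    A triple AB|C is encoded as (frozenset({"A", "B"}), "C").
    """
    # Forest obtained from the gene tree by deleting all transfer edges.
    forest = {c: p for c, p in parent.items() if (p, c) not in transfer}

    def chain(v):
        """v followed by its ancestors within its forest component."""
        out = [v]
        while out[-1] in forest:
            out.append(forest[out[-1]])
        return out

    def lca(x, y):
        """Last common ancestor in the forest (None across components)."""
        above_x = set(chain(x))
        return next((v for v in chain(y) if v in above_x), None)

    def leaves_below(v):
        return [leaf for leaf in sigma if v in chain(leaf)]

    triples = set()

    def add(a, b, c):
        if len({sigma[a], sigma[b], sigma[c]}) == 3:
            triples.add((frozenset((sigma[a], sigma[b])), sigma[c]))

    # Condition 1: ab|c is displayed by a component and rooted at a speciation.
    for trio in combinations(sigma, 3):
        for c in trio:
            a, b = (v for v in trio if v != c)
            root = lca(a, c)
            if (root is not None and root == lca(b, c)
                    and root != lca(a, b) and event.get(root) == "speciation"):
                add(a, b, c)
    # Condition 2: a, b below one end of a transfer edge, c below the other.
    for x, y in transfer:
        for top, other in ((x, y), (y, x)):
            for a, b in combinations(leaves_below(top), 2):
                for c in leaves_below(other):
                    add(a, b, c)
    return triples


def compatible(triples, species):
    """BUILD-style test: True if some tree displays all given triples."""
    species = set(species)
    if len(species) <= 2:
        return True
    inside = [(ab, c) for ab, c in triples if ab <= species and c in species]
    # Connected components of the Aho graph (a and b joined for each ab|c).
    comp = {s: {s} for s in species}
    for ab, _ in inside:
        a, b = ab
        if comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for s in merged:
                comp[s] = merged
    blocks = {frozenset(block) for block in comp.values()}
    if len(blocks) == 1:
        return False  # connected Aho graph: the triples contradict each other
    return all(compatible(inside, block) for block in blocks)


# u, v are speciations; w is an HGT event with transfer edge (w, d1).
parent = {"w": "u", "c1": "u", "v": "w", "d1": "w", "a1": "v", "b1": "v"}
event = {"u": "speciation", "w": "hgt", "v": "speciation"}
sigma = {"a1": "A", "b1": "B", "c1": "C", "d1": "D"}
triples = species_triples(parent, event, sigma, transfer={("w", "d1")})
print(triples, compatible(triples, sigma.values()))
```

In this example the component containing a1, b1 and c1 contributes AB|C and the transfer edge contributes AB|D. Both triples can be displayed by one species tree, so a reconciliation map exists in the sense of Prop. 5; time consistency is not checked by this sketch.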
In contrast to pure DL scenarios, it is no longer guaranteed that all true orthology relationships are also reciprocal best matches. Figure 12 gives counterexamples. In three of these scenarios the RBMG contains an induced $P_4$ that mimics a good quartet. Removal of the middle edge of good quartets therefore not only reduces false positives in DL scenarios but also introduces additional false negatives in the presence of HGT (Fig. 13).
Discussion
In the theoretical part of this contribution we have clarified the relationships between (reciprocal) best match graphs (RBMGs), orthology, reconciliation maps, gene trees, species trees, and event maps for the case of duplication/loss scenarios.
The orthology graph is necessarily a subgraph of the RBMG. In the absence of HGT, RBMGs therefore produce only false positive but no false negative orthology assignments. Using not only reciprocal best matches but all best matches furthermore shows that good quartets identify almost all false positive edges. Removing the central edge of every good quartet yields nearly perfect orthology estimates. This, however, implies that orthology inference is not solely based on reciprocal best matches. Instead, it is necessary to also include certain directional best matches, namely those that identify good quartets.
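The role of the directional best matches can be made concrete with a small sketch. It rests on the following reading of the good quartet definition given earlier in the paper, stated here as an assumption: four genes x, y, z, x' with $\sigma(x) = \sigma(x')$ form a good quartet if x, y, z, x' is an induced $P_4$ (in this order) in the RBMG and the best match graph additionally contains the arcs (x, z) and (x', y). The function name and the encoding of the best match graph as a list of arcs are hypothetical.

```python
def good_quartet_middle_edges(arcs, sigma):
    """Middle edges {y, z} of good quartets x - y - z - x2 (assumed definition).

    arcs: directed best matches (u, v); sigma: gene -> species.
    """
    arcs = set(arcs)
    # Symmetric part of the best match graph: reciprocal best matches.
    rbm = {v: set() for v in sigma}
    for u, v in arcs:
        if (v, u) in arcs:
            rbm[u].add(v)

    middle = set()
    for y in sigma:
        for z in rbm[y]:
            # End vertices: x next to y only, x2 next to z only.
            for x in rbm[y] - rbm[z] - {z}:
                for x2 in rbm[z] - rbm[y] - {y}:
                    if (sigma[x] == sigma[x2] and x2 not in rbm[x]
                            and (x, z) in arcs and (x2, y) in arcs):
                        middle.add(frozenset((y, z)))
    return middle


# Complementary loss after a duplication: copy 1 survives in A (x) and B (y),
# copy 2 survives in A (x2) and C (z); y and z are paralogs.
sigma = {"x": "A", "x2": "A", "y": "B", "z": "C"}
arcs = [("x", "y"), ("y", "x"), ("x2", "z"), ("z", "x2"),
        ("y", "z"), ("z", "y"), ("x", "z"), ("x2", "y")]
print(good_quartet_middle_edges(arcs, sigma))
```

In this toy scenario the reciprocal best match between y and z is a false positive orthology assignment, and it is exactly the middle edge reported by the sketch. The one-directional arcs (x, z) and (x2, y) are what distinguishes the good quartet; they are not visible in the RBMG alone.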
We observed that a small number of HGT events can cause large deviations between the RBMG and the orthology graph. However, we have considered here the worst-case scenario, where HGT events occur between relatively closely related organisms. While this is of utmost relevance in some cases, for instance for toxin and virulence genes in bacteria, it is of little concern e.g. for the evolution of animals. In the latter case, xenologs almost always originate from bacteria or viruses, i.e., from outgroups. The xenologs then form their own group of co-orthologs and behave as if they had been lost in the species outside the subtree that received the horizontally transferred gene.
From a more theoretical point of view, our empirical findings in the HGT case raise two questions: (1) Are there local features in the (R)BMG that make it possible to unambiguously identify HGT, at least in some cases? (2) What kind of additional information can be integrated to distinguish good quartets arising from duplication/loss events, which can be safely removed, from those that are introduced by HGT and should be “repaired” in a different manner? Most obviously, one may ask whether the Fitch relation is sufficient (we conjecture that this is the case) (Geiß et al. 2018; Hellmuth and Seemann 2019), or whether it suffices to know that a leaf is a (recent) result of transfer (we conjecture that this is not enough in general).
The identification of edges in the RBMG that should or should not be removed has important implications for orthology detection approaches that enforce the cograph structure of the predicted orthology relation by means of cograph editing. While this is an NP-complete problem (Liu et al. 2012) in general, the complexity of the colored version, i.e., editing a properly colored graph to the nearest hc-cograph, remains open. The removal of false positive edges identified by good quartets empirically reduces the number of induced $P_4$s drastically. This observation also suggests considering hc-cograph editing with a given best match relation. We suspect that the additional knowledge of the directed edges makes the problem tractable since it already implies a unique least resolved tree that captures much of the cograph structure.
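Since cographs are exactly the graphs without an induced $P_4$ (Corneil et al. 1981), the number of induced $P_4$s is a simple measure of how far an estimated orthology graph is from a cograph. The following brute-force sketch enumerates them; the function name is hypothetical and vertices are assumed to be mutually comparable (e.g. all integers or all strings) so that each path can be stored in a canonical orientation.

```python
def induced_p4s(edges):
    """Return all induced paths on four vertices of an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    paths = set()
    for b in adj:
        for c in adj[b]:  # (b, c) plays the role of the middle edge
            for a in adj[b] - adj[c] - {c}:
                for d in adj[c] - adj[b] - {b}:
                    if d not in adj[a]:
                        path = (a, b, c, d)
                        # Store each path once, independent of its direction.
                        paths.add(min(path, path[::-1]))
    return paths


path_graph = [(1, 2), (2, 3), (3, 4)]
print(induced_p4s(path_graph))
print(induced_p4s([(1, 2), (3, 4)]))  # middle edge removed: a cograph
```

Deleting the middle edge of the path leaves two disjoint edges, which contain no induced $P_4$. This mirrors, in the simplest possible case, the effect of removing the central edge of a good quartet.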
Cograph editing would be fully content with hc-cographs, i.e., co-RBMGs. These are not necessarily “biologically feasible” in the sense that they can be reconciled with a species tree. It will therefore be of interest to consider the problem of editing an hc-cograph to another hc-cograph that is reconcilable with some or a given species tree – a problem that has been considered already for orthology relations (Lafond et al. 2016; Lafond and El-Mabrouk 2014). Since the obstructions are conflicting triples with a speciation at their top node, the offending data are conflicting orthology assignments. It seems natural therefore to phrase the problem not as an arbitrary editing problem but instead to ask for a maximal induced sub-hc-cograph that implies a compatible triple set. If it is indeed true that triples necessarily displayed by the species tree can be extracted directly from the c(R)BMG, it will be of practical use to consider the corresponding edge deletion problem for c(R)BMGs. In particular, it would be interesting to know whether the latter problem is the same as asking for the maximal compatible subset of triples implied by the c(R)BMG or co-BMG.
Acknowledgements
Open Access funding provided by Projekt DEAL. This work was supported in part by the German Federal Ministry of Education and Research (BMBF, project no. 031A538A, de.NBI-RBC) and the Mexican Consejo Nacional de Ciencia y Tecnología (CONACyT, 278966 FONCICYT 2).
Contributor Information
Manuela Geiß, Email: manuela@bioinf.uni-leipzig.de.
Marcos E. González Laffitte, Email: marcoslaffitte@gmail.com.
Alitzel López Sánchez, Email: lopez.alitzel@gmail.com.
Dulce I. Valdivia, Email: dulce.i.valdivia@gmail.com.
Marc Hellmuth, Email: mhellmuth@mailbox.org.
Maribel Hernández Rosales, Email: maribel@im.unam.mx.
Peter F. Stadler, Email: studla@bioinf.uni-leipzig.de.
References
- Altenhoff AM, Boeckmann B, Capella-Gutierrez S, Dalquen DA, DeLuca T, Forslund K, Huerta-Cepas J, Linard B, Pereira C, Pryszcz LP, Schreiber F, da Silva AS, Szklarczyk D, Train CM, Bork P, Lecompte O, von Mering C, Xenarios I, Sjölander K, Jensen LJ, Martin MJ, Muffato M, Gabaldón T, Lewis SE, Thomas PD, Sonnhammer E, Dessimoz C. Standardized benchmarking in the quest for orthologs. Nat Methods. 2016;13:425–430. doi: 10.1038/nmeth.3830.
- Altenhoff AM, Studer RA, Robinson-Rechavi M, Dessimoz C. Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Comp Biol. 2012;8:e1002514. doi: 10.1371/journal.pcbi.1002514.
- Bansal M, Alm E, Kellis M. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics. 2012;28:i283–i291. doi: 10.1093/bioinformatics/bts225.
- Böcker S, Briesemeister S, Klau GW. Exact algorithms for cluster editing: evaluation and experiments. Algorithmica. 2011;60:316–334.
- Böcker S, Dress AWM. Recovering symbolically dated, rooted trees from symbolic ultrametrics. Adv Math. 1998;138:105–125.
- Corneil DG, Lerchs H, Steward Burlingham L. Complement reducible graphs. Discr Appl Math. 1981;3:163–174.
- Dalquén DA, Anisimova M, Gonnet GH, Dessimoz C. ALF—A simulation framework for genome evolution. Mol Biol Evol. 2011;29:1115–1123. doi: 10.1093/molbev/msr268.
- Datta RS, Meacham C, Samad B, Neyer C, Sjölander K. Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 2009;37:W84–W89. doi: 10.1093/nar/gkp373.
- Dondi R, Lafond M, El-Mabrouk N. Approximating the correction of weighted and unweighted orthology and paralogy relations. Algorithms Mol Biol. 2017;12:4. doi: 10.1186/s13015-017-0096-x.
- Doyon JP, Chauve C, Hamel S. Space of gene/species trees reconciliations and parsimonious models. J Comp Biol. 2009;16:1399–1418. doi: 10.1089/cmb.2009.0095.
- Doyon JP, Ranwez V, Daubin V, Berry V. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011;12:392–400. doi: 10.1093/bib/bbr045.
- Doyon JP, Scornavacca C, Gorbunov KY, Szöllősi GJ, Ranwez V, Berry V. An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. In: Tannier E, editor. Comparative genomics: international workshop, RECOMB-CG 2010. Berlin: Springer; 2010. pp. 93–108.
- Dufayard JF, Duret L, Penel S, Gouy M, Rechenmann F, Perriere G. Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics. 2005;21:2596–2603. doi: 10.1093/bioinformatics/bti325.
- Ehrenfeucht A, Rozenberg G. Theory of 2-structures, part I: clans, basic subclasses, and morphisms. Theor Comp Sci. 1990;70:277–303.
- Ehrenfeucht A, Rozenberg G. Theory of 2-structures, part II: representation through labeled tree families. Theor Comp Sci. 1990;70:305–342.
- Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970;19:99–113.
- Fitch WM. Homology: a personal view on some of the problems. Trends Genet. 2000;16:227–231. doi: 10.1016/s0168-9525(00)02005-9.
- Gabaldón T, Koonin EV. Functional and evolutionary implications of gene orthology. Nat Rev Genet. 2013;14:360–366. doi: 10.1038/nrg3456.
- Geiß M, Anders J, Stadler PF, Wieseke N, Hellmuth M. Reconstructing gene trees from Fitch’s xenology relation. J Math Biol. 2018;77:1459–1491. doi: 10.1007/s00285-018-1260-8.
- Geiß M, Chávez E, González Laffitte M, López Sánchez A, Stadler BMR, Valdivia DI, Hellmuth M, Hernández Rosales M, Stadler PF. Best match graphs. J Math Biol. 2019;78:2015–2057. doi: 10.1007/s00285-019-01332-9.
- Geiß M, Stadler PF, Hellmuth M. Reciprocal best match graphs. J Math Biol. 2019;80:865–953. doi: 10.1007/s00285-019-01444-2.
- Górecki P, Tiuryn J. DLS-trees: a model of evolutionary scenarios. Theor Comp Sci. 2006;359:378–399.
- Guigó R, Muchnik I, Smith TF. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996;6:189–213. doi: 10.1006/mpev.1996.0071.
- Hellmuth M. Biologically feasible gene trees, reconciliation maps and informative triples. Alg Mol Biol. 2017;12:23. doi: 10.1186/s13015-017-0114-z.
- Hellmuth M, Hernandez-Rosales M, Huber KT, Moulton V, Stadler PF, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013;66:399–420. doi: 10.1007/s00285-012-0525-x.
- Hellmuth M, Huber K, Moulton V. Reconciling event-labeled gene trees with MUL-trees and species networks. J Math Biol. 2019;79:1885–1925. doi: 10.1007/s00285-019-01414-8.
- Hellmuth M, Seemann CR. Alternative characterizations of Fitch’s xenology relation. J Math Biol. 2019;79:969–986. doi: 10.1007/s00285-019-01384-x.
- Hellmuth M, Stadler PF, Wieseke N. The mathematics of xenology: Di-cographs, symbolic ultrametrics, 2-structures and tree-representable systems of binary relations. J Math Biol. 2017;75:199–237. doi: 10.1007/s00285-016-1084-3.
- Hellmuth M, Wieseke N. From sequence data including orthologs, paralogs, and xenologs to gene and species trees. In: Evolutionary Biology. Cham: Springer International Publishing; 2016. pp. 373–392.
- Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. Proc Natl Acad Sci USA. 2015;112:2058–2063. doi: 10.1073/pnas.1412770112.
- Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler PF. From event-labeled gene trees to species trees. BMC Bioinform. 2012;13:S6.
- Hoàng CT, Kamiński M, Sawada J, Sritharan R. Finding and listing induced paths and cycles. Discr Appl Math. 2013;161:633–641.
- Innan H, Kondrashov F. The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet. 2010;11:97–108. doi: 10.1038/nrg2689.
- Jamison B, Olariu S. Recognizing $P_4$-sparse graphs in linear time. SIAM J Comput. 1992;21:381–406.
- Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 2008;36:D250–D254. doi: 10.1093/nar/gkm796.
- Keller-Schmidt S, Klemm K. A model of macroevolution as a branching process based on innovations. Adv Complex Syst. 2012;15:1250043.
- Koonin E. Orthologs, paralogs, and evolutionary genomics. Ann Rev Genet. 2005;39:309–338. doi: 10.1146/annurev.genet.39.073003.114725.
- Kuhn TS, Mooers AØ, Thomas GH. A simple polytomy resolver for dated phylogenies. Methods Ecol Evo. 2011;2:427–436.
- Lafond M, Dondi R, El-Mabrouk N. The link between orthology relations and gene trees: a correction perspective. Algorithms Mol Biol. 2016;11:4. doi: 10.1186/s13015-016-0067-7.
- Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genom. 2014;15:S12. doi: 10.1186/1471-2164-15-S6-S12.
- Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, Hartmann RK, Prohaska SJ, Stadler PF. Orthology detection combining clustering and synteny for very large datasets. PLoS ONE. 2014;9:e105015. doi: 10.1371/journal.pone.0105015.
- Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503.
- Liu Y, Wang J, Guo J, Chen J. Complexity and parameterized algorithms for cograph editing. Theor Comp Sci. 2012;461:45–54.
- Nichio BTL, Marchaukoski JN, Raittz RT. New tools in orthology analysis: a brief review of promising perspectives. Front Genet. 2017;8:165. doi: 10.3389/fgene.2017.00165.
- Nøjgaard N, Geiß M, Merkle D, Stadler PF, Wieseke N, Hellmuth M. Time-consistent reconciliation maps and forbidden time travel. Alg Mol Biol. 2018;13:2. doi: 10.1186/s13015-018-0121-8.
- Page RDM, Charleston MA. Reconciled trees and incongruent gene and species trees. DIMACS Ser Discrete Mathematics and Theor Comput Sci. 1997;37:57–70.
- Purvis A, Garland T Jr. Polytomies in comparative analyses of continuous characters. Syst Biol. 1993;42:569–575.
- Roth ACJ, Gonnet GH, Dessimoz C. Algorithm of OMA for large-scale orthology inference. BMC Bioinform. 2008;9:518. doi: 10.1186/1471-2105-9-518.
- Rusin LY, Lyubetskaya E, Gorbunov KY, Lyubetsky V. Reconciliation of gene and species trees. BioMed Res Int. 2014;2014:642089. doi: 10.1155/2014/642089.
- Sayyari E, Mirarab S. Testing for polytomies in phylogenetic species trees using quartet frequencies. Genes. 2018;9:E132. doi: 10.3390/genes9030132.
- Sonnhammer ELL, Gabaldon T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas PD, Dessimoz C. Big data and other challenges in the quest for orthologs. Bioinformatics. 2014;30:2993–2998. doi: 10.1093/bioinformatics/btu492.
- Stadler PF, Geiß M, Schaller D, López A, Gonzalez Laffitte M, Valdivia D, Hellmuth M, Hernandez Rosales M. From best hits to best matches. Tech Rep 2001.00958, arXiv; 2020.
- Storm CE, Sonnhammer EL. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics. 2002;18:92–99. doi: 10.1093/bioinformatics/18.1.92.
- Studer RA, Robinson-Rechavi M. How confident can we be that orthologs are similar, but paralogs differ? Trends Genet. 2009;25:210–216. doi: 10.1016/j.tig.2009.03.004.
- Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637. doi: 10.1126/science.278.5338.631.
- Tofigh A, Hallett M, Lagergren J. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinform. 2011;8:517–535. doi: 10.1109/TCBB.2010.14.
- Vernot B, Stolzer M, Goldman A, Durand D. Reconciliation with non-binary species trees. J Comput Biol. 2008;15:981–1006. doi: 10.1089/cmb.2008.0092.
- Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–335. doi: 10.1101/gr.073585.107.
- Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional annotations of paralogs: a blessing and a curse. Life. 2016;6:39. doi: 10.3390/life6030039.