Best match graphs and reconciliation of gene trees with species trees

Manuela Geiß; Marcos E González Laffitte; Alitzel López Sánchez; Dulce I Valdivia; Marc Hellmuth; Maribel Hernández Rosales; Peter F Stadler

doi:10.1007/s00285-020-01469-y

. 2020 Jan 30;80(5):1459–1495. doi: 10.1007/s00285-020-01469-y


PMCID: PMC7052050 PMID: 32002659

Abstract

A wide variety of problems in computational biology, most notably the assessment of orthology, are solved with the help of reciprocal best matches. Using an evolutionary definition of best matches that captures the intuition behind the concept, we clarify rigorously the relationships between reciprocal best matches, orthology, and evolutionary events under the assumption of duplication/loss scenarios. We show that the orthology graph is a subgraph of the reciprocal best match graph (RBMG). We furthermore give conditions under which an RBMG that is a cograph identifies the correct orthology relation. Using computer simulations we find that most false positive orthology assignments can be identified as so-called good quartets, and thus corrected, in the absence of horizontal transfer. Horizontal transfer, however, may also introduce false-negative orthology assignments.

Electronic supplementary material

The online version of this article (10.1007/s00285-020-01469-y) contains supplementary material, which is available to authorized users.

Keywords: Phylogenetic combinatorics, Colored digraph, Orthology, Horizontal gene transfer

Introduction

The distinction between orthologous and paralogous genes has important consequences for gene annotation, comparative genomics, as well as molecular phylogenetics due to their close correlation with gene function (Koonin 2005). Orthologous genes, which derive from a speciation as their last common ancestor (Fitch 1970), usually have at least approximately equivalent functions (Gabaldón and Koonin 2013). Paralogs, in contrast, tend to have related, but clearly distinct functions (Studer and Robinson-Rechavi 2009; Innan and Kondrashov 2010; Altenhoff et al. 2012; Zallot et al. 2016). Phylogenetic studies strive to restrict their input data to one-to-one orthologs since these often evolve in an approximately clock-like fashion. In comparative genomics, orthologs serve as anchors for chromosome alignments and thus are an important basis for synteny-based methods (Sonnhammer et al. 2014).

Despite their practical importance, the mathematical interrelationships between empirical “pairwise best hits” on the one hand and reconciliations of gene and species trees on the other hand have remained largely unexplored. Practical workflows for orthology assignment directly use pairwise best hits as an initial estimate of orthologous gene pairs. Many of the commonly used methods for orthology identification, such as OrthoMCL (Li et al. 2003), ProteinOrtho (Lechner et al. 2014), OMA (Roth et al. 2008), or eggNOG (Jensen et al. 2008), belong to this class. Extensive benchmarking (Altenhoff et al. 2016; Nichio et al. 2017) has shown that these tools perform at least as well as methods such as Orthostrapper (Storm and Sonnhammer 2002), PHOG (Datta et al. 2009), EnsemblCompara (Vilella et al. 2009), or HOGENOM (Dufayard et al. 2005) that first independently reconstruct a gene tree T and a species tree S and then determine orthologous and paralogous genes.

The intuition behind the pairwise best hit approach is that a gene y in species s can only be an ortholog of a gene x in species r if y is the closest relative of x in s and x is at the same time the closest relative of y in r. Evolutionary relatedness is defined in terms of an – often unknown – phylogenetic tree T. The notion of a best match or closest relative thus is made precise by considering the last common ancestors in T: y is a best match for x if the last common ancestor ${lca}_{T} (x, y)$ is not further away from x (and thus not closer to the root of the tree) than ${lca}_{T} (x, y^{'})$ for any other gene $y^{'}$ in species s. This formally defines the best match relation studied in (Geiß et al. 2019a). The reciprocal best match relation identifies the pairs of genes that are mutually closest relatives between pairs of species, see (Geiß et al. 2019b).

Two approximations are introduced when pairwise best hit approaches are employed for orthology assessment. First, it is well known that two genes can be mutual closest relatives without being orthologs. The usual example is the complementary loss of ancestrally present paralogs following a gene duplication (Fig. 1a). Second, pairwise best hits as determined by sequence (dis)similarity are not necessarily pairs of most closely related genes and, vice versa, evolutionarily most closely related gene pairs do not necessarily appear as pairwise best hits (Fig. 1b).

Fig. 1 — Pairwise best hits are not equivalent to orthology. a Complementary losses of ancient paralogs following a later speciation event leave only a single member of the gene family in each species. Hence, x and y are reciprocal best matches but not orthologs since their last common ancestor by construction is a duplication event. b Lineage specific rate differences between paralogs cause discrepancies between best hits and best matches. Here, the branch length in the tree represents sequence dissimilarity. In this example, the species (indicated by the leaf color) retain copies of the two paralogs originating from a duplication event pre-dating the separation of red and blue. While the gene $x_{2}$ evolves faster in the red species, the situation is reversed for $y_{2}$ in the blue species. While $\{x_{1}, y_{1}\}$ and $\{x_{2}, y_{2}\}$ are orthologs and reciprocal best matches in the evolutionary sense, neither appears as a reciprocal best hit in terms of similarity (i.e., branch length). The only reciprocal best hit is $\{x_{1}, y_{2}\}$ , which is neither a best match nor a pair of orthologs (color figure online)

We argue, therefore, that the relationship of pairwise best hits and orthology has to be understood in (at least) two conceptually and practically separate steps:

(1) What is the relationship of pairwise best hits and reciprocal best matches?
(2) What is the relation of reciprocal best matches and orthology?

In this contribution we focus on the second question, which is largely a mathematical problem. The main aim of the present contribution is to connect formal results on the structure of the orthology relation and the associated reconciliation maps and gene trees with recent results on the mathematical structure of (reciprocal) best match relations.

The first question, which is primarily a question of inference from data, is investigated in a companion paper (Stadler et al. 2020) that makes use of several of the mathematical results derived here. In a nutshell, the best hits inferred from estimates of genetic distances may differ from best matches whenever paralogs evolve with different rates in different species. In most situations this can be detected – and in most cases corrected – by considering quartets of genes $\{a, b_{1}, b_{2}, c\}$ from three different species, provided it is known that c is an outgroup to a, $b_{1}$ , and $b_{2}$ . Using the approximate additivity of empirical genetic distances, it can then be checked which one of the paralogs $b_{1}$ and $b_{2}$ is more closely related to a. The main practical difficulty is to ensure that c is correctly identified as outgroup.

Symbolic ultrametrics (Böcker and Dress 1998) and 2-structures (Ehrenfeucht and Rozenberg 1990a, b) provided a basis to show that orthology relations are essentially equivalent to cographs (Hellmuth et al. 2013, 2017; Hellmuth and Wieseke 2016). Moreover, in the absence of horizontal gene transfer (HGT), reconciliation maps for an event-labeled gene tree exist if and only if the species tree S displays all triples rooted in a speciation event that have leaves from three distinct species (Hernandez-Rosales et al. 2012; Hellmuth 2017). This shows that it is possible to infer species phylogenies from empirical estimates of orthology (Hellmuth et al. 2015; Lafond et al. 2016; Lafond and El-Mabrouk 2014; Dondi et al. 2017). Although it is possible to generalize many of the results, such as the characterization of reconciliation maps for event-labeled gene trees to scenarios with horizontal gene transfer (Nøjgaard et al. 2018; Hellmuth et al. 2019; Hellmuth 2017) this remains an active area of research.

Best matches as a mathematical structure have been studied only very recently. Geiß et al. (2019a) gave two alternative characterizations of best match digraphs and showed that they can be recognized in polynomial time. In particular, there is a unique least resolved tree for each best match digraph, which is displayed by the gene tree and can also be computed in polynomial time. Reciprocal best matches naturally appear as the symmetric part of these digraphs. Somewhat surprisingly, the undirected reciprocal best match graphs seem to have a much more difficult structure (Geiß et al. 2019b).

Although pairwise best hit methods do not attempt to explicitly construct the gene tree T, they still make the assumption that there is some underlying phylogeny for the provided homologous genes. The distinction of orthology and paralogy then amounts to assigning event labels (“speciation”, “duplication”, and possibly “HGT”) to the inner vertices of T. While it is true that any gene tree, and thus also any best match graph, can be reconciled with any species tree (Guigó et al. 1996; Page and Charleston 1997; Górecki and Tiuryn 2006), such a reconciliation may imply unrealistically many duplication and deletion events. In the extreme case, all inner vertices are duplication events before the first speciation. The root of the species tree then contains already a separate gene for each leaf of T. All the additional copies created by speciations therefore are eliminated again by subsequent loss events. More parsimonious reconciliations are thus usually modeled by minimizing the number of duplication and loss events, reviewed e.g. by Doyon et al. (2011).

Moreover, the existence of reconciliation maps for T to some species tree cannot generally be ensured if the event labels are given (Hernandez-Rosales et al. 2012; Hellmuth 2017). Hence, the best match relation (which constrains the gene tree (Geiß et al. 2019a)), the event labels, the existence of one or a particular reconciliation map, and the species tree depend on each other or at least constrain each other. In this contribution we explore these dependencies in detail in the absence of horizontal gene transfer.

We show that, in this setting, the true orthology graph (TOG) is a subgraph of the reciprocal best match graph (RBMG). In other words, reciprocal best matches can only produce false positive orthology assignments as long as the evolution of a gene family proceeds via duplications, losses, and speciations. Computer simulations show that in a broad parameter range the TOG and RBMG are very similar, providing an a posteriori justification for the use of reciprocal best matches in orthology estimation. In addition, we characterize a subset of the “false positive” edges in the RBMG that cannot be present in the TOG. Experimental results show that – using so-called good quartets – it is possible to remove nearly all false positive orthology assignments. Our aim here is to understand those sources of error and ambiguities in orthology detection that still persist even if reciprocal best matches are inferred with perfect accuracy. Therefore, all computer simulations reported here use perfect data as input. In a companion paper, we address the question how well reciprocal best matches can be inferred from (dis)similarity data, and what can be done to make this initial step more accurate. Finally, we discuss how these results can potentially be generalized to the case that the evolutionary scenarios contain HGT.

Preliminaries

A planted (phylogenetic) tree is a rooted tree T with vertex set V(T) and edge set E(T) such that (i) the root $0_{T}$ has degree 1 and (ii) all inner vertices have degree ${deg}_{T} (u) \geq 3$ . We write L(T) for the leaves (not including $0_{T}$ ) and $V^{0} = V (T) \setminus (L (T) \cup \{0_{T}\})$ for the inner vertices (also not including $0_{T}$ ). To avoid trivial cases, we will always assume that $| L (T) | \geq 2$ . The conventional root $ρ_{T}$ of T is the unique neighbor of $0_{T}$ . The main reason for using planted phylogenetic trees instead of modeling phylogenetic trees simply as rooted trees, which is the much more common practice in the field, is that we will often need to refer to the time before the first branching event. Conceptually, it corresponds to explicitly representing an outgroup. For some vertex $v \in V (T)$ , we denote by T(v) the subtree of T that is rooted in v. Its leaf set is L(T(v)).

On a rooted tree T we define the ancestor order: if $y \neq x$ is a vertex of the unique path connecting x with the root $0_{T}$ , we write $x ≺_{T} y$ . As usual we write $x ⪯_{T} y$ if $x = y$ or $x ≺_{T} y$ . In particular, the leaves are the minimal elements w.r.t. $≺_{T}$ , and we have $x ⪯_{T} 0_{T}$ for all $x \in V (T)$ . This partial order is conveniently extended to the edge set by defining each edge to be located between its incident vertices, i.e., if $y ≺_{T} x$ and $e = x y$ is an edge, we set $y ≺_{T} e ≺_{T} x$ . In this case, we write $e = x y$ to denote that x is closer to the root than y. If $e = x y \in E (T)$ , we say that y is a child of x, in symbols $y \in child (x)$ , and x is the parent of y in T. We sometimes also write $y ⪰_{T} x$ instead of $x ⪯_{T} y$ . Moreover, if $x ⪯_{T} y$ or $y ⪯_{T} x$ in T, then x and y are called comparable, otherwise the two vertices are incomparable.

For a non-empty subset of vertices $A \subseteq V$ of a rooted tree $T = (V, E)$ , we define ${lca}_{T} (A)$ , the last common ancestor of A, to be the unique $⪯_{T}$ -minimal vertex of T that is an ancestor of every vertex in A. For simplicity we write ${lca}_{T} (x_{1}, \dots, x_{k}) : = {lca}_{T} (\{x_{1}, \dots, x_{k}\})$ for a set $A = \{x_{1}, \dots, x_{k}\}$ of vertices. The definition of ${lca}_{T} (A)$ is conveniently extended to edges by setting ${lca}_{T} (x, e) : = {lca}_{T} (\{x\} \cup e)$ and ${lca}_{T} (e, f) : = {lca}_{T} (e \cup f)$ , where the edges $e, f \in E (T)$ are simply treated as sets of vertices. We note for later reference that $lca (A \cup B) = lca (lca (A), lca (B))$ holds for non-empty vertex sets A, B of a tree.

Binary trees on three leaves are called triples. We say that a triple xy|z is displayed in a rooted tree T if x, y, and z are leaves in T and the path from x to y does not intersect the path from z to the root. The set of all triples that are displayed by the tree T, is denoted by r(T) and a triple set R is said to be compatible if there exists a tree T that displays R, i.e., $R \subseteq r (T)$ .
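As an illustration of these definitions, the following Python sketch computes last common ancestors and tests whether a triple is displayed. It assumes a tree stored as a child-to-parent dictionary; the example tree and the function names `ancestors`, `lca`, and `displays` are hypothetical and not taken from the paper. The test for xy|z uses the equivalent condition that ${lca}_{T} (x, y)$ lies strictly below ${lca}_{T} (x, y, z)$ .

```python
# Hypothetical planted tree, stored as child -> parent.
# "0" plays the role of the planted root 0_T, "rho" of the conventional root.
PARENT = {
    "0": None,
    "rho": "0",
    "u": "rho",
    "x": "u",
    "y": "u",
    "z": "rho",
}


def ancestors(parent, v):
    """Vertices on the path from v up to the planted root, starting with v."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path


def lca(parent, vertices):
    """Last common ancestor of a non-empty collection of vertices."""
    vertices = list(vertices)
    common = set(ancestors(parent, vertices[0]))
    for v in vertices[1:]:
        common &= set(ancestors(parent, v))
    # Walking upwards from any member, the first common vertex is the lca.
    return next(a for a in ancestors(parent, vertices[0]) if a in common)


def displays(parent, x, y, z):
    """True if the triple xy|z is displayed (x, y, z distinct leaves)."""
    return lca(parent, [x, y]) != lca(parent, [x, y, z])
```

In this toy tree, `displays(PARENT, "x", "y", "z")` holds while `displays(PARENT, "x", "z", "y")` does not, since x and z only meet at the conventional root.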

Denote by L(S) a set of species and denote by $σ : L (T) \to L (S)$ the map that assigns to each gene $x \in L (T)$ a species $σ (x) \in L (S)$ . A tree T together with such a map $σ$ is denoted by $(T, σ)$ and called a leaf-colored tree.

Definition 1

Let $(T, σ)$ be a leaf-colored tree. A leaf $y \in L (T)$ is a best match of the leaf $x \in L (T)$ if $σ (x) \neq σ (y)$ and $lca (x, y) ⪯_{T} lca (x, y^{'})$ holds for all leaves $y^{'}$ from species $σ (y^{'}) = σ (y)$ . The leaves $x, y \in L (T)$ are reciprocal best matches if y is a best match for x and x is a best match for y.

The directed graph $\vec{G} (T, σ)$ with vertex set L(T), vertex-coloring $σ$ , and edges defined by the best matches in $(T, σ)$ is known as the colored best match graph (BMG) (Geiß et al. 2019a). The undirected graph $G (T, σ)$ with vertex set L(T), vertex-coloring $σ$ , and edges defined by the reciprocal best matches in $(T, σ)$ is known as the colored reciprocal best match graph (RBMG) (Geiß et al. 2019b). We sometimes write n-BMG, resp., n-RBMG to specify the number n of colors.
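Definition 1 translates directly into a brute-force procedure. The following Python sketch is one possible illustration, not an algorithm from the paper: it assumes a child-to-parent dictionary for T and a dictionary for $σ$ , and all names are hypothetical. Since every ${lca}_{T} (x, y^{'})$ lies on the path from x to the root, comparing these vertices w.r.t. $⪯_{T}$ amounts to comparing their depths.

```python
# Hypothetical leaf-colored tree: child -> parent, and gene -> species.
PARENT = {"0": None, "rho": "0", "u": "rho", "a2": "rho", "a1": "u", "b1": "u"}
SIGMA = {"a1": "A", "b1": "B", "a2": "A"}


def ancestors(parent, v):
    """Vertices on the path from v up to the planted root, starting with v."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path


def lca(parent, x, y):
    """Last common ancestor of two vertices."""
    above_y = set(ancestors(parent, y))
    return next(a for a in ancestors(parent, x) if a in above_y)


def depth(parent, v):
    """Number of edges between v and the planted root."""
    return len(ancestors(parent, v)) - 1


def best_matches(parent, sigma):
    """Arcs (x, y) of the best match graph: y is a best match of x."""
    arcs = set()
    for x in sigma:
        for y in sigma:
            if sigma[x] == sigma[y]:
                continue
            rivals = [w for w in sigma if sigma[w] == sigma[y]]
            d = depth(parent, lca(parent, x, y))
            # lca(x, y) must be at least as deep as lca(x, w) for all rivals w.
            if all(d >= depth(parent, lca(parent, x, w)) for w in rivals):
                arcs.add((x, y))
    return arcs


def reciprocal_best_matches(arcs):
    """Edges of the RBMG: the symmetric part of the best match graph."""
    return {frozenset((x, y)) for (x, y) in arcs if (y, x) in arcs}
```

In the toy tree, a2 has b1 as a best match, but b1 prefers a1, so the only reciprocal best match is the pair a1, b1.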

Throughout this contribution, $G = (V, E)$ and $\vec{G} = (V, \vec{E})$ denote simple undirected and simple directed graphs, respectively. We distinguish directed arcs (x, y) in a digraph $\vec{G}$ from edges xy in an undirected graph G or tree T. For an undirected graph G we denote by $N (x) = \{y ∣ y \in V (G), x y \in E (G)\}$ the neighborhood of some vertex x in G. The disjoint union $G \dot{\cup} H$ of two graphs $G = (V, E)$ and $H = (W, F)$ has vertex set $V \dot{\cup} W$ and edge set $E \dot{\cup} F$ . Their join $G ⋈ H$ has again vertex set $V \dot{\cup} W$ and its edge set is given by $E \dot{\cup} F \dot{\cup} \{x y ∣ x \in V, y \in W\}$ . Thus the join of G and H is obtained by connecting every vertex of G to every vertex of H.

A class of undirected graphs that plays an important role in this contribution are cographs, which are recursively defined (Corneil et al. 1981):

Definition 2

An undirected graph G is a cograph if one of the following conditions is satisfied:

$G = K_{1}$ , the single-vertex graph,
$G = H ⋈ H^{'}$ , where H and $H^{'}$ are cographs,
$G = H \dot{\cup} H^{'}$ , where H and $H^{'}$ are cographs.

An undirected graph is a cograph if and only if it does not contain an induced $P_{4}$ (path on four vertices) (Corneil et al. 1981).
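The $P_{4}$ characterization yields a simple, if inefficient, recognition test. The Python sketch below checks all ordered 4-tuples of vertices for an induced path; it is meant only as an illustration for small graphs, the function names are hypothetical, and linear-time recognition algorithms are not attempted here.

```python
from itertools import permutations


def has_induced_p4(vertices, edges):
    """True if some four vertices a-b-c-d induce a path in the graph."""
    edge_set = {frozenset(e) for e in edges}

    def adjacent(u, v):
        return frozenset((u, v)) in edge_set

    for a, b, c, d in permutations(vertices, 4):
        path_edges = adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
        chords = adjacent(a, c) or adjacent(a, d) or adjacent(b, d)
        if path_edges and not chords:
            return True
    return False


def is_cograph(vertices, edges):
    """Brute-force cograph test via the forbidden induced P4."""
    return not has_induced_p4(vertices, edges)
```

For instance, the path 1-2-3-4 is rejected, whereas the 4-cycle, which is the join of two edgeless graphs on two vertices, is accepted.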

Every cograph G is associated with a set of phylogenetic trees $\mathcal{T}_{G}$ , usually referred to as the cotrees of G. Every cotree $T_{G} \in \mathcal{T}_{G}$ corresponds to a possible recursive construction of G, where the cotree for the single-vertex graph $K_{1}$ is simply $K_{1}$ . Since both the disjoint union and the join operation are associative, it is possible to join or unify two or more component cographs in a single construction step. The leaves of $T_{G}$ correspond to the vertices of G. Each interior vertex of $T_{G}$ corresponds to either a join or a disjoint union operation. Its child-subtrees, furthermore, are exactly the cotrees of the component cographs that are joined or disjointly unified, respectively. The event type associated with an inner vertex u will be denoted by $t_{G} (u)$ . Each vertex u of $T_{G}$ can be associated with an induced subgraph $G [L (T_{G} (u))]$ . A cotree $T_{G}$ is called discriminating if any two adjacent inner nodes represent different types of events. If $T_{G} \in \mathcal{T}_{G}$ and $T_{G}^{'}$ is obtained from $T_{G}$ by contracting a non-discriminating edge, i.e., an edge uv with $t_{G} (u) = t_{G} (v)$ , then $T_{G}^{'} \in \mathcal{T}_{G}$ . Every cograph has a unique discriminating cotree, which is obtained from any of its cotrees by contracting all non-discriminating edges (Corneil et al. 1981). We note, finally, that the discriminating cotree of G coincides with the modular decomposition tree of G.

Reconciliation maps, event labelings, and orthology relations

A gene tree $T = (V, E)$ and a species tree $S = (W, F)$ are planted phylogenetic trees on a set of (extant) genes L(T) and species L(S), respectively. We assume that we know which gene comes from which species. Mathematically, this knowledge is represented by a map $σ : L (T) \to L (S)$ that assigns to each gene the species in whose genome it resides. Best match approaches start from a set of genes taken from a set of species. Hence, the “gene-species-association” is known. Moreover, species without sampled genes do not affect the best match graph and we can w.l.o.g. assume that $σ$ is a surjective map to avoid trivial cases. Note, however, that the definitions and results presented below naturally extend to general maps $σ$ . We write $(T, σ)$ for a gene tree with given map $σ$ .

An evolutionary scenario comprises a gene tree and a species tree together with a map $μ$ from T to S that identifies the locations in the species tree S at which evolutionary events took place that are represented by the vertices of the gene tree T. The properties of the map $μ$ of course depend on which types of evolutionary events are considered. In order to model evolutionary scenarios we assume that evolutionary events of different types do not occur concurrently. In particular, speciation and duplication are always strictly temporally ordered. Gene duplications therefore always occur along the edges of the species tree. Vertices on T that model speciation events, on the other hand, must be mapped to inner vertices of S.

From here on we will consider only Duplication/Loss scenarios, that is, we explicitly exclude horizontal gene transfer (HGT). We will briefly discuss the effects of HGT in Sect. 8.

Definition 3

(Reconciliation Map) Let $S = (W, F)$ and $T = (V, E)$ be two planted phylogenetic trees and let $σ : L (T) \to L (S)$ be a surjective map. A reconciliation from $(T, σ)$ to S is a map $μ : V \to W \cup F$ satisfying

(R0) Root Constraint. $μ (x) = 0_{S}$ if and only if $x = 0_{T}$ .
(R1) Leaf Constraint. If $x \in L (T)$ , then $μ (x) = σ (x)$ .
(R2) Ancestor Preservation. $x ≺_{T} y$ implies $μ (x) ⪯_{S} μ (y)$ .
(R3) Speciation Constraints. Suppose $μ (x) \in W^{0}$ .
- (i) $μ (x) = {lca}_{S} (μ (v^{'}), μ (v^{''}))$ for at least two distinct children $v^{'}, v^{''}$ of x in T.
- (ii) $μ (v^{'})$ and $μ (v^{''})$ are incomparable in S for any two distinct children $v^{'}$ and $v^{''}$ of x in T.

Several alternative definitions of reconciliation maps for Duplication/Loss scenarios have been proposed in the literature, many of which have been shown to be equivalent. Nevertheless, we add yet another one because earlier variants do not clearly separate conditions pertaining to the structural congruence of gene tree and species tree (Axioms (R0), (R1), and (R2)) from conditions that (implicitly) distinguish event types, here (R3.i) and (R3.ii). This axiom system also generalizes easily to situations with horizontal transfer as we shall see in Sect. 8. We proceed by showing that it is equivalent to axioms that are commonly used in the literature, see e.g. Górecki and Tiuryn (2006), Vernot et al. (2008), Doyon et al. (2011), Rusin et al. (2014), Hellmuth (2017), Nøjgaard et al. (2018), and the references therein.

Lemma 1

Let $μ$ be a map from $(T = (V, E), σ)$ to $S = (W, F)$ that satisfies (R0) and (R1). Then, $μ$ satisfies Axioms (R2) and (R3) if and only if $μ$ satisfies

(R2’) Ancestor Constraint. Suppose $x, y \in V$ with $x ≺_{T} y$ .
- (i) If $μ (x), μ (y) \in F$ , then $μ (x) ⪯_{S} μ (y)$ ,
- (ii) otherwise, i.e., at least one of $μ (x)$ and $μ (y)$ is contained in W, $μ (x) ≺_{S} μ (y)$ .
(R3’) Inner Vertex Constraint. If $μ (x) \in W^{0}$ , then
- (i) $μ (x) = {lca}_{S} (σ (L (T (x))))$ and
- (ii) $μ (v^{'})$ and $μ (v^{''})$ are incomparable in S for any two distinct children $v^{'}$ and $v^{''}$ of x in T.

Proof

Assume first that (R2) and (R3) are satisfied for $μ$ .

Then property (R2’.i) is satisfied since it is the restriction of (R2) to $μ (x), μ (y) \in F$ .

To see that (R2’.ii) holds, let $x ≺_{T} y$ and $μ (x) \in W$ or $μ (y) \in W$ . Assume first that $μ (y) \in W$ . Property (R2) implies $μ (x) ⪯_{S} μ (y)$ . Let v be the child of y that lies on the path from y to x in T, i.e., $x ⪯_{T} v ≺_{T} y$ . Assume for contradiction that $μ (x) = μ (y)$ . By Property (R2) we have $μ (x) = μ (v) = μ (y)$ . For every other child $v^{'}$ of y, Property (R2) implies $μ (v^{'}) ⪯_{S} μ (y) = μ (v)$ . Thus, $μ (v)$ and $μ (v^{'})$ are comparable; a contradiction to (R3.ii). Hence, $μ (x) ≺_{S} μ (y)$ and (R2’.ii) is satisfied. Now suppose $μ (x) \in W$ and assume for contradiction that $μ (x) = μ (y)$ . Thus $μ (y) \in W$ and we can apply the same arguments as above to conclude that (R3.ii) is not satisfied. Hence, $μ (x) ≺_{S} μ (y)$ and (R2’.ii) is satisfied.

In order to show that (R3’) is satisfied, let $x \in V$ such that $μ (x) \in W^{0}$ . Properties (R3’.ii) and (R3.ii) are equivalent. It remains to show that (R3’.i) is satisfied. From (R2) we infer $μ (y) ⪯_{S} μ (x)$ for all $y \in ⋃_{v \in child (x)} L (T (v)) = L (T (x))$ . Thus,

$${lca}_{S} (σ (L (T (x)))) ⪯_{S} μ (x) .$$

Property (R3.i) implies that there are two distinct children $v^{'}, v^{''} \in child (x)$ with $μ (x) = {lca}_{S} (μ (v^{'}), μ (v^{''}))$ . Again using (R3.ii), we know that the images $μ (v^{'})$ and $μ (v^{''})$ are incomparable in S. The latter together with $μ (y) ⪯_{S} μ (v^{'})$ for all $y \in L (T (v^{'}))$ and $μ (y^{'}) ⪯_{S} μ (v^{''})$ for all $y^{'} \in L (T (v^{''}))$ implies

$${lca}_{S} (μ (v^{'}), μ (v^{''})) = {lca}_{S} (σ (L (T (v^{'}))) \cup σ (L (T (v^{''})))) ⪯_{S} {lca}_{S} (σ (L (T (x)))) .$$

In summary, ${lca}_{S} (σ (L (T (x)))) ⪯_{S} μ (x) = {lca}_{S} (μ (v^{'}), μ (v^{''})) ⪯_{S} {lca}_{S} (σ (L (T (x))))$ implies that $μ (x) = {lca}_{S} (σ (L (T (x))))$ and Property (R3’.i) is satisfied.

Therefore, (R2) and (R3) imply (R2’) and (R3’).

Conversely, assume now that (R2’) and (R3’) are satisfied for $μ$ . Clearly (R2’) implies (R2), and (R3’.ii) implies (R3.ii). It remains to show that (R3.i) is satisfied. Let $μ (x) \in W^{0}$ . By (R2’.ii) we have $μ (x) ≻_{S} μ (v_{i})$ for all children $v_{i} \in child (x) = \{v_{1}, \dots, v_{k}\}$ , $k \geq 2$ . Therefore, $μ (x) ⪰_{S} {lca}_{S} (μ (v_{1}), \dots, μ (v_{k}))$ . By (R3’.ii), the images $μ (v_{1}), \dots, μ (v_{k})$ are pairwise incomparable in S. The latter and (R2’.i) imply ${lca}_{S} (μ (v_{1}), \dots, μ (v_{k})) = {lca}_{S} (⋃_{i = 1}^{k} σ (L (T (v_{i})))) = {lca}_{S} (σ (L (T (x)))) = μ (x)$ . It is easy to verify that ${lca}_{S} (μ (v_{1}), \dots, μ (v_{k})) = {lca}_{S} (μ (v^{'}), μ (v^{''}))$ for at least two children $v^{'}, v^{''} \in child (x)$ is always satisfied. Hence, $μ (x) = {lca}_{S} (μ (v^{'}), μ (v^{''}))$ for some $v^{'}, v^{''} \in child (x)$ and thus, (R3.i) is satisfied.

Therefore, (R2’) and (R3’) imply (R2) and (R3). $□$

A reconciliation map $μ$ from $(T, σ)$ to a species tree S implicitly determines whether an inner node of T corresponds to a speciation or a duplication. Since we assume that distinct events are represented by distinct nodes of the gene tree, all duplication events are mapped to the edges of S. Vertices of T mapped to vertices of S thus represent speciations. We formalize this idea as follows:

Definition 4

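Given a reconciliation map $μ$ from $(T, σ)$ to S, the event labeling on T (determined by $μ$ ) is the map $t_{μ} : V (T) \to \{⊚, ⊙, □, ●\}$ given by:

$$t_{μ} (u) = \begin{cases} ⊚ & \text{if } u = 0_{T} , \text{ i.e., } μ (u) = 0_{S} \text{ (root)} \\ ⊙ & \text{if } u \in L (T) , \text{ i.e., } μ (u) \in L (S) \text{ (leaf)} \\ ● & \text{if } μ (u) \in W^{0} \text{ (speciation)} \\ □ & \text{otherwise, i.e., } μ (u) \in F \text{ (duplication)} \end{cases}$$

The symbols $⊚$ and $⊙$ identify the planted root $0_{T}$ and the leaves of T, respectively. Inner vertices are labeled $□$ for duplication and $●$ for speciation, respectively.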

The event labeling $t_{μ}$ , by definition, is completely determined by a reconciliation map $μ$ . This raises two related questions: (1) which patterns of event labels can arise for reconciliation maps, and (2) what restriction does a given event labeling impose on the reconciliation map? To study these questions, we consider event-labeled trees (T, t) where the event labeling of T is a map $t : V (T) \to \{⊚, ⊙, □, ●\}$ satisfying $t (0_{T}) = ⊚$ , $t (x) = ⊙$ for all $x \in L (T)$ , and $t (x) \in \{□, ●\}$ for $x \in V^{0} (T)$ . We interpret $□$ as a gene duplication event and $●$ as a speciation event.

A simple consequence of the Axioms (R0)-(R3) is the following result which is stated here for later reference. For the sake of completeness, we also provide a short proof.

Lemma 2

Let $μ$ be a reconciliation map from the leaf-colored tree $(T, σ)$ to $S = (W, F)$ and suppose that x is a vertex in V(T) with $μ (x) \in W^{0}$ . Then, $σ (L (T (v^{'}))) \cap σ (L (T (v^{''}))) = \emptyset$ for any two distinct $v^{'}, v^{''} \in child (x)$ .

Proof

Assume for contradiction that there is a vertex $z \in σ (L (T (v^{'}))) \cap σ (L (T (v^{''})))$ . By Condition (R2’), we have $μ (x) ≻_{S} μ (v^{'}) ⪰_{S} z$ and $μ (x) ≻_{S} μ (v^{''}) ⪰_{S} z$ . Thus, there is a path $P_{1}$ from $μ (x)$ to z that contains $μ (v^{'})$ and a path $P_{2}$ from $μ (x)$ to z that contains $μ (v^{''})$ . However, Condition (R3.ii) implies that $μ (v^{'})$ and $μ (v^{''})$ are incomparable in S, that is, the subtree of S consisting of the two paths $P_{1}$ and $P_{2}$ must contain a cycle; a contradiction. $□$

Lemma 2 has a simple interpretation: Since $μ (x) \in W^{0}$ , we have $t_{μ} (x) = ●$ , i.e., x represents a speciation. The lemma thus states that any two subtrees of T rooted in distinct children of a speciation event are composed of genes from disjoint sets of species. It suggests the following

Definition 5

An event labeling $t : V (T) \to \{⊚, ⊙, □, ●\}$ is well-formed if $t (x) = ●$ , i.e., x is labeled as a speciation, implies that $σ (L (T (v^{'}))) \cap σ (L (T (v^{''}))) = \emptyset$ for any two distinct $v^{'}, v^{''} \in child (x)$ .

Lemma 2 suggests asking for a characterization of the event maps t for a given leaf-labeled tree $(T, σ)$ for which $(T, t, σ)$ admits a reconciliation map to some species tree. Definition 5 suggests starting with the well-formed event labeling that designates every inner vertex of T as a speciation unless it is identified as a duplication because it violates the condition of Lemma 2.

Definition 6

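Let $(T, σ)$ be a leaf-labeled tree. The extremal event labeling of T is the map ${\hat{t}}_{T} : V (T) \to \{⊚, ⊙, □, ●\}$ defined for $u \in V (T)$ by

$${\hat{t}}_{T} (u) = \begin{cases} ⊚ & \text{if } u = 0_{T} \\ ⊙ & \text{if } u \in L (T) \\ □ & \text{if there are two distinct children } v^{'}, v^{''} \in child (u) \text{ with } σ (L (T (v^{'}))) \cap σ (L (T (v^{''}))) \neq \emptyset \text{ (duplication)} \\ ● & \text{otherwise (speciation)} \end{cases}$$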

The extremal event labeling ${\hat{t}}_{T}$ is completely determined by $(T, σ)$ . By construction, if $u \in V^{0} (T)$ is a duplication w.r.t. the extremal event labeling, i.e., ${\hat{t}}_{T} (u) = □$ , then $t (u) = □$ for every well-formed event labeling t on $(T, σ)$ .
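The extremal event labeling can be computed by a direct traversal. The Python sketch below assumes that an inner vertex is marked as a duplication exactly when two of its child subtrees share a species (the situation excluded for speciations by Lemma 2) and as a speciation otherwise. The tree, the string labels, and the function names are hypothetical choices for illustration.

```python
from itertools import combinations

# Hypothetical gene tree: vertex -> list of children; "0" is the planted root.
CHILDREN = {"0": ["rho"], "rho": ["u", "a2"], "u": ["a1", "b1"]}
# Hypothetical leaf coloring sigma: gene -> species.
SIGMA = {"a1": "A", "b1": "B", "a2": "A"}


def species_below(children, sigma, v):
    """sigma(L(T(v))): the set of species with at least one gene below v."""
    if v in sigma:
        return {sigma[v]}
    return set().union(*(species_below(children, sigma, c) for c in children[v]))


def extremal_labeling(children, sigma, planted_root):
    """Label each vertex as 'root', 'leaf', 'duplication', or 'speciation'."""
    label = {planted_root: "root"}
    label.update({leaf: "leaf" for leaf in sigma})
    for v in children:
        if v == planted_root:
            continue
        sets = [species_below(children, sigma, c) for c in children[v]]
        shared = any(a & b for a, b in combinations(sets, 2))
        label[v] = "duplication" if shared else "speciation"
    return label


def is_well_formed(children, sigma, planted_root, label):
    """A labeling is well-formed if each of its speciations is also a
    speciation in the extremal labeling (disjoint child species sets)."""
    extremal = extremal_labeling(children, sigma, planted_root)
    return all(extremal[v] == "speciation"
               for v, event in label.items() if event == "speciation")
```

In the example, `rho` is forced to be a duplication because species A occurs below both of its children, while `u` may be a speciation. Relabeling `u` as a duplication keeps the labeling well-formed; relabeling `rho` as a speciation does not.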

It is a well-known result that it is always possible to reconcile a given pair of gene tree T and species tree S, see e.g. (Guigó et al. 1996; Page and Charleston 1997; Górecki and Tiuryn 2006). For convenience, we include a short direct proof of this fact.

Lemma 3

For every tree $(T = (V, E), σ)$ there is a reconciliation map $μ$ to any species tree S with leaf set $L (S) = σ (L (T))$ .

Proof

Let $S = (W, F)$ be an arbitrary species tree with leaf set L(S) and $e_{0} = 0_{S} ρ_{S}$ be the unique root-edge of S. Set $μ (0_{T}) = 0_{S}$ and $μ (v) = σ (v)$ for all $v \in L (T)$ . Thus, (R0) and (R1) are satisfied. Now, set $μ (v) = e_{0}$ for all $v \in V^{0} = V \setminus (L (T) \cup \{0_{T}\})$ . Thus, $μ (v) \notin W^{0}$ for all $v \in V^{0}$ and (R3) is trivially satisfied. Finally, for all $v, v^{'} \in V^{0}$ and $y \in L (T)$ with $y ≺_{T} v ≺_{T} v^{'}$ we have by construction of $μ$ that $μ (y) ≺_{S} μ (v) = μ (v^{'}) ≺_{S} μ (0_{T})$ . Thus, (R2) is satisfied. $□$

The reconciliation map $μ$ constructed in the proof of Lemma 3 maps all inner vertices of the gene tree to the edge above the root of the species tree S, and hence $t_{μ} (x) = □$ for all inner vertices of T. The root of S already contains |L(T)| genes, one for each leaf of T. Every speciation event is therefore accompanied by complementary losses, and there are no further gene duplication events below the root.
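The construction in the proof of Lemma 3 is easy to make concrete. The following Python sketch builds this map for a toy example and checks Ancestor Preservation (R2). It assumes trees given as child-to-parent dictionaries, represents an edge of S as a tuple (parent, child), and uses hypothetical names throughout; it is an illustration, not code from the paper.

```python
# Hypothetical gene tree T and species tree S, both stored as child -> parent.
PARENT_T = {"0T": None, "rho": "0T", "u": "rho", "a2": "rho", "a1": "u", "b1": "u"}
PARENT_S = {"0S": None, "rhoS": "0S", "A": "rhoS", "B": "rhoS"}
SIGMA = {"a1": "A", "b1": "B", "a2": "A"}


def ancestors(parent, v):
    """Vertices on the path from v up to the planted root, starting with v."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path


def trivial_reconciliation(parent_t, sigma, parent_s):
    """Map 0_T to 0_S, leaves to their species, inner vertices to the root edge."""
    root_t = next(v for v, p in parent_t.items() if p is None)
    root_s = next(v for v, p in parent_s.items() if p is None)
    rho_s = next(v for v, p in parent_s.items() if p == root_s)
    mu = {}
    for v in parent_t:
        if v == root_t:
            mu[v] = root_s
        elif v in sigma:
            mu[v] = sigma[v]
        else:
            mu[v] = (root_s, rho_s)  # the edge e_0 above the conventional root
    return mu


def leq(parent_s, a, b):
    """a precedes or equals b in S, where a and b are vertices or edges.
    An edge (p, c) is located strictly between c and p."""
    if a == b:
        return True
    upper_a = a[0] if isinstance(a, tuple) else a
    lower_b = b[1] if isinstance(b, tuple) else b
    return lower_b in ancestors(parent_s, upper_a)


def preserves_ancestors(parent_t, parent_s, mu):
    """Check (R2): x below y in T implies mu(x) precedes or equals mu(y) in S."""
    return all(leq(parent_s, mu[x], mu[y])
               for x in parent_t
               for y in ancestors(parent_t, x)[1:])
```

Every inner vertex of T is mapped to an edge of S, so all of them are duplications under the event labeling determined by this map.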

The assignment of genes to species, i.e., a prescribed leaf coloring $σ$ , however, implies further restrictions. In fact, it is not sufficient to require that the event labeling is well-formed. Instead, the simultaneous knowledge of $(T, t, σ)$ gives rise to stronger conditions on the species trees S with which $(T, t, σ)$ can be reconciled. Following Hernandez-Rosales et al. (2012), we denote by $S (T, t, σ)$ the set of triples $σ (a) σ (b) | σ (c)$ for which ab|c is a triple displayed by T such that (i) $σ (a)$ , $σ (b)$ , $σ (c)$ are pairwise distinct species and (ii) the root of the triple is a speciation event, i.e., $t ({lca}_{T} (a, b, c)) = ●$ . This set of triples characterizes the existence of a reconciliation map:

Proposition 1

(Hernandez-Rosales et al. 2012; Hellmuth 2017) Given a leaf-labeled tree $(T, t, σ)$ with a well-formed event labeling t and a species tree S with $L (S) = σ (L (T))$ , there is a reconciliation map $μ : V (T) \to V (S) \cup E (S)$ such that the event labeling is consistent with Definition 4 if and only if S displays $S (T, t, σ)$ . In particular, $(T, t, σ)$ can be reconciled with a species tree if and only if $S (T, t, σ)$ is a compatible set of triples.

An example of a tree $(T, t, σ)$ that does not admit a reconciliation map is given in Fig. 2 (top left). We note that the characterization in Proposition 1 can be evaluated in polynomial time (Hellmuth 2017).
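Compatibility of a triple set can be tested with the classical BUILD procedure of Aho et al., which is a standard technique and not necessarily the procedure used in the works cited above. The following is a minimal sketch under that assumption; the function name `triples_compatible` and the encoding of a triple ab|c as the tuple `(a, b, c)` are illustrative choices.

```python
def triples_compatible(leaves, triples):
    """BUILD-style test. Each triple (a, b, c) stands for ab|c."""
    leaves = set(leaves)
    if len(leaves) <= 2:
        return True  # no triple fits on fewer than three leaves
    # keep only triples whose three leaves all lie in the current leaf set
    triples = [t for t in triples if set(t) <= leaves]
    # union-find over the leaves: ab|c puts a and b into the same block
    rep = {x: x for x in leaves}

    def find(x):
        while rep[x] != x:
            x = rep[x]
        return x

    for a, b, _ in triples:
        rep[find(a)] = find(b)
    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), set()).add(x)
    if len(blocks) == 1:
        return False  # connected auxiliary graph: no tree displays all triples
    return all(triples_compatible(block, triples) for block in blocks.values())


# Hypothetical inputs: two conflicting triples, then a single triple
conflicting = [("A", "B", "C"), ("A", "C", "B")]
print(triples_compatible("ABC", conflicting))
print(triples_compatible("ABC", [("A", "B", "C")]))
```

The recursion splits the leaf set into the connected blocks induced by the triples and fails exactly when a block with at least three leaves cannot be split.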

The event labeling t on T defines the orthology relation:

Definition 7

(Fitch 2000) Two distinct leaves $x, y \in L (T)$ are orthologs (w.r.t. t) if $t ({lca}_{T} (x, y)) = ●$ ; they are paralogs if $t ({lca}_{T} (x, y)) = □$ .

For completeness, we note that $t ({lca}_{T} (x, y)) = ⊙$ if and only if $x = y$ , and $0_{T}$ is never the $lca$ of any pair of leaves since the planted root $0_{T}$ has degree 1 by construction. We write $Θ (T, t)$ for the orthology relation obtained from (T, t), i.e., the set of all unordered pairs $\{x, y\}$ of orthologous genes in L(T). For convenience we will not distinguish between the irreflexive, symmetric binary relation $Θ (T, t)$ and the graph with vertex set L(T) and edge set $Θ (T, t)$ . Naturally, we say that an arbitrary relation $Θ$ is an orthology relation if there is an event-labeled phylogenetic tree (T, t) such that $Θ = Θ (T, t)$ . It is important to note that the orthology relation $Θ$ explicitly depends on the event labeling. Analogously, one can also define the paralogy relation $\bar{Θ}$ by $t ({lca}_{T} (x, y)) = □$ . Both orthology and paralogy are irreflexive and symmetric but not transitive, see Fig. 3. We note that orthology $Θ$ and paralogy $\bar{Θ}$ are complementary in the graph-theoretical sense, i.e., every pair $\{x, y\}$ of distinct leaves is contained in exactly one of $Θ$ or $\bar{Θ}$ .
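As an illustration of Definition 7, the following sketch collects the orthology relation from an event-labeled tree. The nested-tuple encoding (inner vertices as `(label, children)` with `"S"` for speciation and `"D"` for duplication, leaves as strings) and the toy tree are assumptions made for this example only; the toy tree is not the tree of Fig. 3.

```python
from itertools import combinations, product


def leaves(node):
    """Return the list of leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [x for c in children for x in leaves(c)]


def orthology_relation(node):
    """Collect all unordered pairs {x, y} whose lca is labeled 'S'."""
    if isinstance(node, str):
        return set()
    label, children = node
    pairs = set()
    for c in children:
        pairs |= orthology_relation(c)
    if label == "S":
        # leaves in different child subtrees have this vertex as their lca
        for c1, c2 in combinations(children, 2):
            for x, y in product(leaves(c1), leaves(c2)):
                pairs.add(frozenset((x, y)))
    return pairs


# Hypothetical toy tree: a speciation at the root above one duplication
tree = ("S", [("D", ["a1", "b2"]), "c1"])
theta = orthology_relation(tree)
print(sorted(sorted(p) for p in theta))
```

In this toy example a1 and b2 are both orthologs of c1 but paralogs of each other, so the relation is not transitive.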

Fig. 3 — Orthology and paralogy relations are symmetric but not transitive. In this evolutionary scenario with two speciations ( $●$ ) and two duplications ( $□$ ), the genes $a_{1}$ and $b_{2}$ are both orthologs of $c_{1}$ but not of each other. The leaves of the gene tree on the l.h.s. are colored corresponding to the three species A, B, and C. The orthology graph $Θ$ and its complement, the paralogy graph $\bar{Θ}$ , are shown on the r.h.s (color figure online)

Based on the work of Böcker and Dress (1998) it has been shown by Hellmuth et al. (2013) that valid orthology relations are exactly cographs:

Proposition 2

An irreflexive, symmetric relation $Θ$ on L is an orthology relation if and only if it is a cograph. In this case, every cotree T of $Θ$ with an event labeling t assigning $●$ to join operations and $□$ to disjoint union operations satisfies $Θ = Θ (T, t)$ .

There is a unique discriminating cotree $(T_{Θ}, t_{Θ})$ for an orthology relation $Θ$ , which is obtained from every other (non-discriminating) cotree (T, t) for $Θ$ by contracting the inner edges uv of T if and only if $t (u) = t (v)$ (Böcker and Dress 1998; Hellmuth et al. 2013).
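Cographs are exactly the graphs without an induced path on four vertices ($P_{4}$), so Proposition 2 can be checked on small examples by brute force. The sketch below assumes an edge-list encoding and a hypothetical helper name `has_induced_p4`; it runs in $O(n^{4})$ time and is meant for illustration only, not as an efficient cograph recognition algorithm.

```python
from itertools import combinations, permutations


def has_induced_p4(vertices, edges):
    """Return True if the graph contains an induced path a-b-c-d."""
    edge_set = {frozenset(e) for e in edges}

    def adj(u, v):
        return frozenset((u, v)) in edge_set

    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            path = adj(a, b) and adj(b, c) and adj(c, d)
            chords = adj(a, c) or adj(a, d) or adj(b, d)
            if path and not chords:
                return True
    return False


def is_cograph(vertices, edges):
    return not has_induced_p4(vertices, edges)


# Hypothetical examples: a path on four vertices and a star with three leaves
print(is_cograph("abcd", [("a", "b"), ("b", "c"), ("c", "d")]))
print(is_cograph("abcd", [("a", "b"), ("a", "c"), ("a", "d")]))
```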

It is natural then to ask under which conditions a given orthology relation $Θ$ is consistent with a leaf-labeled tree $(T, σ)$ in the sense that there is a reconciliation map $μ$ from $(T, σ)$ to some species tree such that $Θ = Θ (T, t_{μ})$ . We first consider the special case $T = T_{Θ}$ . As shown by Hellmuth and Wieseke (2016), it is possible to obtain the set of informative triples $S (T_{Θ}, t_{Θ}, σ)$ directly from $Θ$ using the following rule:

$σ (a) σ (b) | σ (c) \in S (T, t, σ)$ if and only if $σ (a), σ (b)$ , and $σ (c)$ are pairwise different species and either

$(a, c), (b, c) \in Θ$ and $(a, b) \notin Θ$ or
$(a, c), (b, c), (a, b) \in Θ$ and there is a vertex $d \neq a, b, c$ with $(c, d) \in Θ$ and $(a, d), (b, d) \notin Θ$ .
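The two cases of this rule translate directly into a brute-force enumeration. The sketch below is only an illustration: the function name `informative_triples`, the encoding of $Θ$ as a set of `frozenset` pairs, and the encoding of a species triple AB|C as `(frozenset({A, B}), C)` are assumptions made for this example.

```python
from itertools import combinations


def informative_triples(theta, sigma):
    """theta: set of frozenset gene pairs; sigma: dict gene -> species."""
    genes = list(sigma)

    def orth(x, y):
        return frozenset((x, y)) in theta

    triples = set()
    for trio in combinations(genes, 3):
        if len({sigma[g] for g in trio}) < 3:
            continue  # the three species must be pairwise different
        for c in trio:
            a, b = [g for g in trio if g != c]
            if not (orth(a, c) and orth(b, c)):
                continue
            # first case: a and b are not orthologs
            case1 = not orth(a, b)
            # second case: a, b, c pairwise orthologs plus a witness d
            case2 = orth(a, b) and any(
                orth(c, d) and not orth(a, d) and not orth(b, d)
                for d in genes if d not in trio
            )
            if case1 or case2:
                triples.add((frozenset((sigma[a], sigma[b])), sigma[c]))
    return triples


# Hypothetical toy relation: a1 and b2 are both orthologs of c1 only
sigma = {"a1": "A", "b2": "B", "c1": "C"}
theta = {frozenset(("a1", "c1")), frozenset(("b2", "c1"))}
print(informative_triples(theta, sigma))
```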

Theorem 1

Let $Θ$ be a cograph with vertex set L and associated cotree $(T_{Θ}, t_{Θ})$ with leaf set L and let $σ$ be a leaf coloring. Then there exists a reconciliation map $μ$ from $(T_{Θ}, t_{Θ}, σ)$ to some species tree S if and only if (i) $S (T_{Θ}, t_{Θ}, σ)$ is compatible and (ii) the cograph $(Θ, σ)$ is properly colored, i.e., for all $x y \in E (Θ)$ we have $σ (x) \neq σ (y)$ .

Proof

By Proposition 1, it is necessary and sufficient that (i) the set of informative triples is compatible and (ii) the event map $t_{Θ}$ is well-formed. Since $t_{Θ}$ is the event labeling of the cotree, Condition (ii) amounts to requiring that the leaf sets $L (T (v_{i}))$ have pairwise disjoint sets of colors $σ (L (T (v_{i})))$ for all children $v_{i} \in child (u)$ of every join node u. Since the join $Θ_{i} ⋈ Θ_{j}$ of the two cographs associated with $T (v_{i})$ and $T (v_{j})$ introduces an edge xy for all $x \in L (T (v_{i}))$ and all $y \in L (T (v_{j}))$ , the resulting graph can only be properly colored if $σ (L (T (v_{i}))) \cap σ (L (T (v_{j}))) = \emptyset$ . On the other hand, every edge in $Θ$ is the result of a join operation, thus $(Θ, σ)$ can only be properly colored if joins only appear between induced subgraphs with disjoint color sets. Thus $t_{Θ}$ is well-formed if and only if $σ$ is a proper vertex coloring for $Θ$ . $□$

Under the assumption that a reconciliation map $μ$ exists for $(T, σ)$ to some species tree, the next result shows that the orthology relation $Θ (T, t_{μ})$ is always a subgraph of the orthology relation $Θ (T, {\hat{t}}_{T})$ implied by $(T, σ)$ and its extremal labeling ${\hat{t}}_{T}$ .

Lemma 4

Let $(T, σ)$ be a leaf-labeled tree and $μ$ a reconciliation map from $(T, σ)$ to some species tree S. Then $Θ (T, t_{μ}) \subseteq Θ (T, {\hat{t}}_{T})$ .

Proof

Let $u = {lca}_{T} (x, y)$ and suppose $x y \in Θ (T, t_{μ})$ . Then, $t_{μ} (u) = ●$ by definition of $Θ (T, t_{μ})$ , i.e., $μ (u) \in V^{0} (S)$ . Therefore, Lemma 2 implies $σ (L (T (v))) \cap σ (L (T (v^{'}))) = \emptyset$ for all distinct $v, v^{'} \in {child}_{T} (u)$ . Hence, ${\hat{t}}_{T} (u) = ●$ by definition of the extremal event labeling and thus $x y \in Θ (T, {\hat{t}}_{T})$ . $□$

The converse of Lemma 4 is generally not true, see Fig. 2 for an example. For later reference, we note the following result which is an immediate consequence of Lemma 4 due to the fact that orthology and paralogy relations are complementary.

Corollary 1

Let $(T, σ)$ be a leaf-labeled tree and $μ$ a reconciliation map from $(T, σ)$ to some species tree S. Then $\bar{Θ} (T, {\hat{t}}_{T}) \subseteq \bar{Θ} (T, t_{μ})$ .

Lemma 4, in particular, implies that none of the labelings $t_{μ}$ (provided by any reconciliation map $μ$ ) can yield more speciation events in T than the extremal labeling ${\hat{t}}_{T}$ . Moreover, it is easy to see that $t_{μ} (v) = ●$ always implies ${\hat{t}}_{T} (v) = ●$ , while ${\hat{t}}_{T} (v) = □$ implies $t_{μ} (v) = □$ .
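The extremal labeling depends only on $(T, σ)$: as used in the proof of Lemma 4, an inner vertex is a speciation exactly when the color sets of its child subtrees are pairwise disjoint, and a duplication otherwise. A minimal sketch, assuming an unlabeled tree encoded as nested lists with leaves as strings; the names `colors` and `extremal_labeling` and the labels `"S"` and `"D"` are illustrative.

```python
from itertools import combinations


def colors(node, sigma):
    """Set of species (colors) found at the leaves below a node."""
    if isinstance(node, str):
        return {sigma[node]}
    return set().union(*(colors(c, sigma) for c in node))


def extremal_labeling(node, sigma):
    """Return the tree with every inner vertex rewritten as (label, children)."""
    if isinstance(node, str):
        return node
    overlap = any(
        colors(c1, sigma) & colors(c2, sigma)
        for c1, c2 in combinations(node, 2)
    )
    label = "D" if overlap else "S"  # duplication if two children share a color
    return (label, [extremal_labeling(c, sigma) for c in node])


# Hypothetical toy tree on species A, B, C
sigma = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
tree = [["a1", "b1"], ["a2", "c1"]]
print(extremal_labeling(tree, sigma))
```

Here the root is labeled as a duplication because both child subtrees contain a gene from species A, while the two inner children are speciations.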

We briefly compare the formalism introduced here with the literature on maximum parsimony reconciliations. There, one considers reconciliation maps $η : V (T) \to V (S)$ that map duplication events in T also to vertices of S. The mapping $η$ is then interpreted in such a way that the duplication event u took place along an edge in S that is ancestral to $η (u)$ . The map $η$ in this setting does not completely determine the event labeling. The least common ancestor map

$$\hat{η} (v) := {lca}_{S} (σ (L (T (v))))$$

corresponds to one of the “most parsimonious reconciliations” (Górecki and Tiuryn 2006; Doyon et al. 2009) and can be obtained in polynomial time. A closely related reconciliation map can be defined in our setting. The LCA-reconciliation map introduced by Hellmuth (2017) satisfies the additional axiom

(LCA) $μ (u) = v {lca}_{S} (σ (L (T (u)))) \in E (S)$ for all $u \in V (T)$ with $t (u) = □$ , where v denotes the unique parent of ${lca}_{S} (σ (L (T (u)))) \in V (S)$ in S.

The Axiom (LCA) is the analog of Eq. (2) for duplication vertices in T, which in our formalism are necessarily mapped to edges. For speciation events, the corresponding condition is expressed by (R3.i). Hellmuth (2017) showed that the existence of a reconciliation map from $(T, t, σ)$ implies also the existence of an LCA-reconciliation map. Figure 2 shows that an LCA-reconciliation map does not necessarily have ${\hat{t}}_{T}$ as its event labeling. Even if $t_{μ} = {\hat{t}}_{T}$ , then $μ$ is not necessarily an LCA-reconciliation map, see Fig. 4.
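The least common ancestor map $\hat{η}$ can be evaluated by intersecting root paths in S. The sketch below assumes that the species tree is given as a child-to-parent dictionary; the names `ancestors` and `lca` and the toy tree are hypothetical, and no attempt is made at the efficient lca data structures used in practice.

```python
def ancestors(parent, x):
    """Path from x up to the root, with x itself first."""
    path = [x]
    while x in parent:
        x = parent[x]
        path.append(x)
    return path


def lca(parent, species):
    """Lowest vertex of S that is an ancestor of every given species."""
    species = list(species)
    common = ancestors(parent, species[0])
    for s in species[1:]:
        anc = set(ancestors(parent, s))
        common = [v for v in common if v in anc]
    return common[0]  # paths are ordered bottom-up, so the first hit is lowest


# Hypothetical species tree ((A, B)u, C)r given as child -> parent
parent = {"A": "u", "B": "u", "u": "r", "C": "r"}
print(lca(parent, ["A", "B"]), lca(parent, ["A", "C"]), lca(parent, ["A"]))
```

For a gene tree vertex v, $\hat{η} (v)$ is obtained by calling `lca` on the species set $σ (L (T (v)))$.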

Fig. 4 — Reconciliation map $μ$ from $(T, σ)$ to the (tube-like) species tree S. The map $μ$ is given implicitly by drawing $(T, σ)$ into S. The map $μ$ is not an LCA-reconciliation map since $μ (u)$ does not map u to the edge $v {lca}_{S} (A, B) \in E (S)$ where v denotes the unique parent of ${lca}_{S} (A, B)$ in S. However, $t_{μ}$ and the extremal map ${\hat{t}}_{T}$ coincide (color figure online)

Orthology and reciprocal best matches

In this section, we further clarify the relationship between the orthology relation and (reciprocal) best matches. As a main result, we find that the reciprocal best match graph contains any possible orthology relation.

Lemma 5

If $(T, σ)$ with leaf set L explains the RBMG $(G, σ)$ and ${\hat{t}}_{T}$ is the extremal event labeling of $(T, σ)$ , then $Θ (T, {\hat{t}}_{T})$ is a subgraph of the RBMG $G (T, σ)$ .

Proof

Consider a vertex $u \in V^{0} (T)$ with $child (u) = {u_{1}, \dots, u_{k}}$ . If ${\hat{t}}_{T} (u) = □$ , then none of the edges xy in G with $x \in L (T (u_{i}))$ and $y \in L (T (u_{j}))$ , $1 \leq i < j \leq k$ is contained in $Θ (T, {\hat{t}}_{T})$ .

Now suppose ${\hat{t}}_{T} (u) = ●$ . For $x \in L (T (u_{i}))$ and $y \in L (T (u_{j}))$ with $1 \leq i < j \leq k$ , we have $x y \in Θ (T, {\hat{t}}_{T})$ and, by construction of ${\hat{t}}_{T}$ , $σ (x) \neq σ (y)$ . In particular, ${\hat{t}}_{T} (u) = ●$ implies that all distinct children $u_{i}, u_{j} \in child (u)$ satisfy $σ (L (T (u_{i}))) \cap σ (L (T (u_{j}))) = \emptyset$ . Thus, ${lca}_{T} (x, y) = u ⪯_{T} {lca}_{T} (x^{'}, y)$ for all $x^{'} \neq x$ with $σ (x^{'}) = σ (x)$ and ${lca}_{T} (x, y) = u ⪯_{T} {lca}_{T} (x, y^{'})$ for all $y^{'} \neq y$ with $σ (y^{'}) = σ (y)$ , i.e., x and y are reciprocal best matches. Hence, $x y \in E (G)$ and thus $Θ (T, {\hat{t}}_{T}) \subseteq G (T, σ)$ . $□$

Lemmas 4 and 5 immediately imply the following result.

Theorem 2

Let T and S be planted trees, $σ : L (T) \to L (S)$ a surjective map, and $μ$ a reconciliation map from $(T, σ)$ to S. If $x y \in Θ (T, t_{μ})$ , then x and y are reciprocal best matches in $(T, σ)$ .

Observation 1

Reciprocal best matches therefore cannot produce false negative orthology assignments as long as the evolution of a gene family proceeds via duplications, losses, and speciations only.
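To make the notion of reciprocal best matches in a tree $(T, σ)$ concrete, the sketch below computes them by brute force, using the criterion that y is a best match of x if ${lca}_{T} (x, y) ⪯_{T} {lca}_{T} (x, y^{'})$ for all $y^{'}$ with $σ (y^{'}) = σ (y)$. The child-to-parent encoding, the function names, and the toy tree are assumptions made for illustration.

```python
def ancestors(parent, x):
    """Path from x up to the root, with x itself first."""
    path = [x]
    while x in parent:
        x = parent[x]
        path.append(x)
    return path


def lca_height(parent, x, y):
    """Position of lca(x, y) on the path from x to the root (0 = x itself)."""
    anc_y = set(ancestors(parent, y))
    for i, v in enumerate(ancestors(parent, x)):
        if v in anc_y:
            return i


def reciprocal_best_matches(parent, sigma):
    genes = list(sigma)

    def is_best(x, y):
        # y is a best match of x if no gene of the same color has a lower lca with x
        rivals = [g for g in genes if sigma[g] == sigma[y]]
        return lca_height(parent, x, y) == min(lca_height(parent, x, g) for g in rivals)

    return {
        frozenset((x, y))
        for x in genes for y in genes
        if sigma[x] != sigma[y] and is_best(x, y) and is_best(y, x)
    }


# Hypothetical toy tree ((a1, b1)u, (a2, c1)v)r
parent = {"a1": "u", "b1": "u", "a2": "v", "c1": "v", "u": "r", "v": "r"}
sigma = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
rbmg = reciprocal_best_matches(parent, sigma)
print(sorted(sorted(e) for e in rbmg))
```

In this toy tree the pair b1, c1 is a reciprocal best match although its last common ancestor r has two child subtrees that share species A, so the extremal labeling marks r as a duplication. The edge b1c1 is therefore a false positive, while no orthologous pair is missed.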

The “false positive” edges in the RBMG compared to the orthology relation are the consequence of a particular class of duplication events:

Theorem 3

Let $(T, t, σ)$ be a leaf- and event-labeled gene tree, $G (T, σ)$ and $Θ (T, t)$ its corresponding RBMG and orthology relation, respectively. Moreover, let $a, b \in L (T)$ , $v : = {lca}_{T} (a, b)$ , and $v_{a}, v_{b} \in {child}_{T} (v)$ such that $a ⪯ v_{a} ≺ v$ , $b ⪯ v_{b} ≺ v$ . Then, $a b \in E (G (T, σ)) \setminus E (Θ (T, t))$ if and only if $t (v) = □$ and $σ (a), σ (b) \in σ (L (T (v_{a}))) ▵ σ (L (T (v_{b})))$ , where “ $▵$ ” denotes the usual symmetric set difference.

Proof

Suppose first $a b \in E (G (T, σ)) \setminus E (Θ (T, t))$ . By definition of $Θ (T, t)$ , we immediately find $t (v) = □$ . Since $a b \in E (G (T, σ))$ , i.e., a and b are reciprocal best matches, it must hold $v ⪯_{T} {lca}_{T} (a, b^{'})$ for any $b^{'}$ of color $σ (b)$ . Hence, $σ (b) \notin σ (L (T (v_{a})))$ . Analogously, we conclude $σ (a) \notin σ (L (T (v_{b})))$ and thus, $σ (a), σ (b) \in σ (L (T (v_{a}))) ▵ σ (L (T (v_{b})))$ .

Conversely, assume $t (v) = □$ and $σ (a), σ (b) \in σ (L (T (v_{b}))) ▵ σ (L (T (v_{a})))$ . Since $t (v) = □$ , a and b cannot be orthologs, i.e., $a b \notin E (Θ (T, t))$ . Moreover, $σ (a) \in σ (L (T (v_{b}))) ▵ σ (L (T (v_{a})))$ in particular implies $σ (a) \notin σ (L (T (v_{b})))$ and therefore, $v ⪯_{T} {lca}_{T} (a, b^{'})$ for any $b^{'}$ with $σ (b^{'}) = σ (b)$ . Hence, b is a best match for a in species $σ (b)$ . One similarly concludes that a is a best match for b. Hence, a and b are reciprocal best matches, which concludes the proof. $□$
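The criterion of Theorem 3 is a plain set computation once the color sets below $v_{a}$ and $v_{b}$ are known. A minimal sketch, with a hypothetical function name and argument layout:

```python
def is_false_positive(is_duplication, colors_va, colors_vb, color_a, color_b):
    """Theorem 3 test for a pair a, b with lca v and children v_a, v_b of v."""
    sym_diff = colors_va ^ colors_vb  # symmetric set difference
    return is_duplication and color_a in sym_diff and color_b in sym_diff


# Hypothetical duplication whose children carry species {A, B} and {A, C}
print(is_false_positive(True, {"A", "B"}, {"A", "C"}, "B", "C"))
print(is_false_positive(True, {"A", "B"}, {"A", "C"}, "A", "C"))
```

Only the pair with colors B and C is a non-orthologous reciprocal best match, since species A is present below both children.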

In practical applications we usually do not know the event-labeled gene tree. It is possible, however, to compute the reciprocal best matches directly from sequence data. Therefore, it is of interest to investigate the relationship of reciprocal best match graphs and orthology relations.

Definition 8

(Geiß et al. 2019b) A tree $(T, σ)$ is least resolved (w.r.t. the RBMG $G (T, σ)$ that it explains) if the contraction of any inner edge $e \in E (T)$ implies $G (T_{e}, σ) \neq G (T, σ)$ .

Since $G (T, σ)$ is completely determined by $(T, σ)$ we can drop the reference to $G (T, σ)$ and often simply speak about a “least resolved tree”.

Lemma 6

Let $(G, σ)$ be an RBMG that is explained by $(T, σ)$ . If $(T, σ)$ is least resolved w.r.t. $(G, σ)$ , then every inner edge $e = u v \in E (T)$ satisfies $σ (L (T (v))) \cap σ (L (T (u)) \setminus L (T (v))) \neq \emptyset$ .

Proof

For contraposition, assume that there is an inner edge $e = u v \in E (T)$ with $σ (L (T (v))) \cap σ (L (T (u)) \setminus L (T (v))) = \emptyset$ . Hence, for all $x \in L (T (v))$ and $y \in L (T (u)) \setminus L (T (v))$ we have ${lca}_{T} (x, y) = u$ and $σ (x) = X \neq σ (y) = Y$ . It is easy to see that all such x and y form a reciprocal best match and thus, $x y \in E (G)$ . Clearly, x and y also form a reciprocal best match in $(T_{e}, σ)$ and thus, each edge $x y \in E (G)$ with $x \in L (T (v))$ and $y \in L (T (u)) \setminus L (T (v))$ is contained in $G (T_{e}, σ)$ . Since we have not changed the relative ordering of the lcas of the remaining vertices, all edges in E(G) are contained in $G (T_{e}, σ)$ . $□$

The converse of Lemma 6 is not necessarily true. As an example, consider an inner edge $e = u v \in E (T)$ with $σ (L (T (u))) = σ (L (T (v))) = {c}$ . It is easy to see that e can be contracted.

Lemma 6 implies that if $(T, σ)$ is least resolved w.r.t. $G (T, σ)$ and $u \in V^{0} (T)$ such that u is incident to some other inner vertex $v \in child (u)$ , then there is a child $v^{'} \neq v$ of u which satisfies $σ (L (T (v^{'}))) \cap σ (L (T (v))) \neq \emptyset$ . By construction of ${\hat{t}}_{T}$ we have ${\hat{t}}_{T} (u) = □$ . The latter observation also implies the following:

Corollary 2

Suppose that $(T, σ)$ is least resolved w.r.t. $G (T, σ)$ and let ${\hat{t}}_{T}$ be the extremal event labeling for $(T, σ)$ . Then ${\hat{t}}_{T} (u) = ●$ if and only if all children of u are leaves that are from pairwise distinct species.

Lemma 7

Let $(T, σ)$ be some least resolved tree (w.r.t. some RBMG) with extremal event map ${\hat{t}}_{T}$ and let S(W, F) be a species tree with $L (S) = σ (L (T))$ . Then there is a reconciliation map $μ : V (T) \to V (S) \cup E (S)$ such that $t_{μ} = {\hat{t}}_{T}$ .

Proof

By Cor. 2, every inner vertex u with ${\hat{t}}_{T} (u) = ●$ is only incident to leaves from pairwise distinct species. However, this implies that the set of informative species triples $S (T, {\hat{t}}_{T}, σ)$ is empty, and thus, compatible. Hence, Proposition 1 implies that there is a reconciliation map $μ$ from $(T, {\hat{t}}_{T}, σ)$ to any species tree S, defined by $μ (0_{T}) = 0_{S}$ , $μ (v) = 0_{S} ρ_{S}$ for every inner vertex $v \in V^{0} (T)$ that is incident to another inner vertex in T, and $μ (v) = x = {lca}_{S} (σ (L (T (v))))$ for any inner vertex v that is only incident to leaves that are from pairwise distinct species, and $μ (v) = σ (v)$ for all leaves of T. By construction of $μ$ , we have ${\hat{t}}_{T} (u) = t_{μ} (u)$ with $t_{μ} (u)$ specified by Def. 4 for all $u \in V (T)$ . $□$

Corollary 3

Let $(T, σ)$ be a least resolved tree explaining a co-RBMG $(G, σ)$ . Then $(Θ (T, {\hat{t}}_{T}), σ)$ is a disjoint union of cliques.

Proof

By Cor. 2 all children of a speciation node u w.r.t. ${\hat{t}}_{T}$ are leaves from pairwise distinct species. Thus the leaves L(T(u)) form a complete subgraph in $(Θ (T, {\hat{t}}_{T}), σ)$ . On the other hand, no ancestor of u is a speciation, i.e., there is no edge ab with $a \in L (T (u))$ and $b \notin L (T (u))$ . Thus $(Θ (T, {\hat{t}}_{T}), σ)$ is a disjoint union of the cliques formed by the L(T(u)) with ${\hat{t}}_{T} (u) = ●$ , possibly together with isolated vertices that are not children of any speciation node in $(T, {\hat{t}}_{T})$ . $□$

Suppose that we know the orthology relation $Θ (T, {\hat{t}}_{T})$ that is obtained from a least resolved tree $(T, σ)$ that explains the RBMG $(G, σ)$ . Lemma 7 implies that there is always a reconciliation map $μ$ from $(T, σ)$ to any species tree S with $L (S) = σ (L (T))$ such that ${\hat{t}}_{T}$ is determined by $μ$ as in Def. 4. Now we can apply Theorem 2 to conclude that all orthologous pairs in $Θ (T, {\hat{t}}_{T})$ are reciprocal best matches. In other words, all complete subgraphs of $Θ (T, {\hat{t}}_{T})$ are also induced subgraphs of the underlying RBMG $(G, σ)$ . Hence, $Θ (T, {\hat{t}}_{T})$ is obtained from $(G, σ)$ by removing edges such that the resulting graph is the disjoint union of cliques, see the top-right tree in Fig. 5 for an example. However, Fig. 5 also shows that many edges have to be removed to obtain $Θ (T, {\hat{t}}_{T})$ .

Fig. 5 — *Top Left:* A (discriminating) hc-cotree $(T_{hc}^{G}, t_{hc}, σ)$ . Its corresponding hc-cograph $(G, σ) = (Θ (T_{hc}^{G}, t_{hc}), σ)$ is drawn below $(T_{hc}^{G}, t_{hc}, σ)$ . In fact, Prop. 3 implies that $(G, σ)$ is an RBMG. *Top Right:* A tree $(T^{*}, {\hat{t}}_{T}, σ)$ that is least resolved w.r.t. the RBMG $(G, σ)$ together with extremal labeling ${\hat{t}}_{T}$ and the resulting orthology relation $Θ (T^{*}, {\hat{t}}_{T})$ , where $(T^{*}, {\hat{t}}_{T})$ is not discriminating. *Below:* A tree $(T, {\hat{t}}_{T}, σ)$ together with extremal labeling ${\hat{t}}_{T}$ that explains the RBMG $(G, σ)$ but is not least resolved w.r.t. $(G, σ)$ . The resulting orthology relation $Θ (T, {\hat{t}}_{T})$ is drawn below $(T, {\hat{t}}_{T}, σ)$ (color figure online)

This observation establishes the precise relationship of orthology detection and clustering, since (graph) clustering can be interpreted as the graph editing problem for disjoint unions of complete graphs (Böcker et al. 2011). In many orthology prediction tools, such as OMA (Roth et al. 2008), orthologs are summarized as clusters of orthologous groups (COGs) (Tatusov et al. 1997) that are obtained from reciprocal best matches.

The results above show that the RBMGs contain the orthology relation. Equivalently, RBMGs imply constraints on the event labeling. We also observe that the RBMGs cannot provide conclusive evidence regarding edges that must correspond to orthologous pairs. In the following sections we consider the constraints implied by the detailed structure of RBMGs or BMGs in more detail.

Classification of RBMGs

The structure of RBMGs has been studied in extensive detail by Geiß et al. (2019b). Although we do not have an algorithmically useful complete characterization of RBMGs, there are partial results that can be used to identify different subclasses of RBMGs based on the structure of the connected components of the 3-colored subgraphs (Geiß et al. 2019b, Thm. 7). Let $C (G, σ)$ be the set of the connected components of the induced subgraphs on three colors of an RBMG $(G, σ)$ . Then every $(C, σ) \in C (G, σ)$ is precisely of one of the three types (Geiß et al. 2019b, Thm. 5):

Type (A) $(C, σ)$ contains a $K_{3}$ on three colors but no induced $P_{4}$ .
Type (B) $(C, σ)$ contains an induced $P_{4}$ on three colors whose endpoints have the same color, but no induced cycle $C_{n}$ on $n \geq 5$ vertices.
Type (C) $(C, σ)$ contains an induced cycle $C_{6}$ , called hexagon, such that any three consecutive vertices have pairwise distinct colors.

The graphs for which all $(C, σ) \in C (G, σ)$ are of Type (A) are exactly the RBMGs that are cographs, or co-RBMGs for short (Geiß et al. 2019b, Thm. 8 and Remark 2). Intuitively, these have a close connection to orthology graphs because orthology graphs are cographs.

Connected components of Type (B) and Type (C), on the other hand, contain induced $P_{4}$s and thus are neither cographs nor connected components of cographs. Obs. 1 implies that RBMGs that contain connected components of Type (B) and Type (C) introduce false positive edges into estimates of the orthology relation. In Sect. 6 below we will address the question to what extent and how such false-positive edges can be identified. We distinguish here co-RBMGs, (B)-RBMGs, and (C)-RBMGs depending on whether $C (G, σ)$ contains only Type (A) components, at least one Type (B) but not Type (C) component, or at least one Type (C) component.

Co-RBMGs have a convenient structure that can be readily understood in terms of hierarchically colored cographs (hc-cographs) introduced by Geiß et al. (2019b, Sect. 7).

Definition 9

An undirected colored graph $(G, σ)$ is a hierarchically colored cograph (hc-cograph) if

(K1) $(G, σ) = (K_{1}, σ)$ , i.e., a colored vertex, or
(K2) $(G, σ) = (H_{1}, σ_{H_{1}}) ⋈ (H_{2}, σ_{H_{2}})$ and $σ (V (H_{1})) \cap σ (V (H_{2})) = \emptyset$ , or
(K3) $(G, σ) = (H_{1}, σ_{H_{1}}) \mathbin{\dot{\cup}} (H_{2}, σ_{H_{2}})$ and $σ (V (H_{1})) \cap σ (V (H_{2})) \in \{σ (V (H_{1})), σ (V (H_{2}))\}$ ,

where both $(H_{1}, σ_{H_{1}})$ and $(H_{2}, σ_{H_{2}})$ are hc-cographs and $σ (x) = σ_{H_{i}} (x)$ for any $x \in V (H_{i})$ for $i \in {1, 2}$ .
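The recursive definition can be checked mechanically once a construction is given. The sketch below validates a construction expression rather than recognizing hc-cographs from a graph alone; the expression encoding (`("leaf", name, color)`, `("join", left, right)`, `("union", left, right)`) and the name `hc_colors` are assumptions made for this example.

```python
def hc_colors(expr):
    """Return the color set of a valid hc-cograph construction, else None."""
    if expr[0] == "leaf":  # a single colored vertex: ("leaf", name, color)
        return {expr[2]}
    op, left, right = expr
    c1, c2 = hc_colors(left), hc_colors(right)
    if c1 is None or c2 is None:
        return None
    common = c1 & c2
    if op == "join" and not common:  # join requires disjoint color sets
        return c1 | c2
    if op == "union" and common in (c1, c2):  # union requires nested color sets
        return c1 | c2
    return None


# Hypothetical constructions on species A, B, C
ab = ("join", ("leaf", "a1", "A"), ("leaf", "b1", "B"))
ac = ("join", ("leaf", "a2", "A"), ("leaf", "c1", "C"))
good = ("union", ab, ("leaf", "a3", "A"))
bad = ("union", ab, ac)
print(hc_colors(good), hc_colors(bad))
```

The second expression is rejected because the color sets {A, B} and {A, C} overlap without one containing the other.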

Not all properly colored cographs are hc-cographs, see e.g. Geiß et al. (2019b) for counterexamples. However, for each cograph G, there exists a coloring $σ$ (with a sufficient number of colors) such that $(G, σ)$ is an hc-cograph.

Proposition 3

(Thm. 9 in (Geiß et al. 2019b)) A graph $(G, σ)$ is a co-RBMG if and only if it is an hc-cograph.

Since orthology relations are necessarily cographs we can interpret Proposition 3 as necessary condition for an RBMG to correctly represent orthology.

The recursive construction of $(G, σ)$ in Def. 9 also defines a corresponding hc-cotree $(T_{hc}^{G}, t_{hc}, σ)$ whose leaves are the vertices of $(G, σ)$ , i.e., the $(K_{1}, σ)$ appearing in (K1). Each internal node u of $T_{hc}^{G}$ corresponds to either a join (K2) or a disjoint union (K3) and is labeled by $t_{hc}$ such that $t_{hc} (u) = ●$ if u represents a join, and $t_{hc} (u) = □$ if u corresponds to a disjoint union. Each inner vertex u of $T_{hc}^{G}$ represents the induced subgraph $(G, σ) [L (T_{hc}^{G} (u))]$ .

Proposition 4

(Thm. 10 in (Geiß et al. 2019b)) Every co-RBMG $(G, σ)$ is explained by its hc-cotree $(T_{hc}^{G}, t_{hc}, σ)$ .

Now let $(T_{hc}^{G}, t_{hc}, σ)$ be the hc-cotree of a co-RBMG $(G, σ)$ . Note that the structure of $T_{hc}^{G}$ is solely determined by the hc-cograph structure of $(G, σ)$ . Somewhat surprisingly, the mathematical structure of the hc-cotree $(T_{hc}^{G}, t_{hc}, σ)$ and, in particular, its coloring $t_{hc}$ has a simple biological interpretation. Consider $\{v^{'}, v^{''}\} = child (u)$ . If $t_{hc} (u) = ●$ in the hc-cotree, then $σ (L (T_{hc}^{G} (v^{'}))) \cap σ (L (T_{hc}^{G} (v^{''}))) = \emptyset$ in agreement with Lemma 2. On the other hand, if $t_{hc} (u) = □$ , then (K3) implies $σ (L (T_{hc}^{G} (v^{'}))) \cap σ (L (T_{hc}^{G} (v^{''}))) \neq \emptyset$ , in which case u indeed must be a duplication from the biological point of view (contraposition of Lemma 2).

The hc-cotree $(T_{hc}^{G}, t_{hc}, σ)$ of $(G, σ)$ will in general not be discriminating and it is not necessarily possible to reduce $(T_{hc}^{G}, t_{hc}, σ)$ to a discriminating hc-cotree $({\hat{T}}_{hc}^{G}, \hat{t}, σ)$ that still explains $(G, σ)$ . Although it is always possible to contract edges uv of $(T_{hc}^{G}, t_{hc}, σ)$ with $t_{hc} (u) = t_{hc} (v) = ●$ (cf. Geiß et al. 2019b, Cor. 11), there are examples where edges uv with $t_{hc} (u) = t_{hc} (v) = □$ cannot be contracted to obtain a tree that still explains $(G, σ)$ (cf. Geiß et al. 2019b, Fig. 15). We refer to Geiß et al. (2019b) for more details and a characterization of edges that are contractable. It is of interest, therefore, to ask whether there are true orthology relations $Θ$ that are not hc-cographs, or equivalently, when does a discriminating hc-cotree $(\hat{T}, \hat{t}, σ)$ that is obtained by edge-contraction from a given hc-cotree $(T_{hc}^{G}, t_{hc}, σ)$ still explain an RBMG $(G, σ)$ ? To answer this question we first provide

Definition 10

A tree $(T, t, σ)$ contains no losses, if for all $x \in V (T)$ with $t (x) = □$ we have $σ (L (T (v^{'}))) = σ (L (T (v^{''})))$ for all $v^{'}, v^{''} \in child (x)$ .
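Definition 10 amounts to comparing the color sets below the children of every duplication vertex. A minimal sketch, assuming the event-labeled tree is encoded as nested `(label, children)` tuples with `"D"` for duplication and `"S"` for speciation; the function names are illustrative.

```python
def colors(node, sigma):
    """Set of species found at the leaves below a node."""
    if isinstance(node, str):
        return {sigma[node]}
    _, children = node
    return set().union(*(colors(c, sigma) for c in children))


def has_no_losses(node, sigma):
    """True if all children of every duplication carry the same species set."""
    if isinstance(node, str):
        return True
    label, children = node
    if label == "D":
        sets = [colors(c, sigma) for c in children]
        if any(s != sets[0] for s in sets):
            return False
    return all(has_no_losses(c, sigma) for c in children)


# Hypothetical toy trees with a duplication at the root
sigma = {"a1": "A", "b1": "B", "a2": "A", "b2": "B", "c1": "C"}
complete = ("D", [("S", ["a1", "b1"]), ("S", ["a2", "b2"])])
lossy = ("D", [("S", ["a1", "b1"]), ("S", ["a2", "c1"])])
print(has_no_losses(complete, sigma), has_no_losses(lossy, sigma))
```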

Theorem 4

Let $(T, σ)$ be a leaf-labeled tree such that there is a reconciliation map $μ$ to some species tree and assume that $(T, t_{μ}, σ)$ does not contain losses. Then

The RBMG $G (T, σ)$ explained by $(T, σ)$ equals the colored cograph $(Θ (T, t_{μ}), σ)$ .
The unique discriminating cotree $(\hat{T}, \hat{t}, σ)$ of $(Θ (T, t_{μ}), σ)$ explains the RBMG $(G, σ)$ .

Proof

To simplify the notation, we set $(G, σ) = G (T, σ)$ and $(H, σ) = (Θ (T, t_{μ}), σ)$ .

We start with proving Statement (1). By Theorem 2, $(H, σ)$ is a subgraph of $(G, σ)$ and $V (H) = V (G)$ , hence it suffices to show that every edge $a b \in E (G)$ is also contained in E(H). Assume, for contradiction, that this is not the case, i.e., $a b \notin E (H)$ , and thus $t_{μ} (x) = □$ for $x : = {lca}_{T} (a, b)$ . Since $(T, t_{μ}, σ)$ has no losses, we have $σ (L (T (v^{'}))) = σ (L (T (v^{''})))$ for all $v^{'}, v^{''} \in child (x)$ , and thus $a \in L (T (v^{'}))$ and $b \in L (T (v^{''}))$ for some pair of distinct children $v^{'}, v^{''} \in child (x)$ of x. From $σ (L (T (v^{'}))) = σ (L (T (v^{''})))$ we know that there is a vertex $a^{'} \in L (T (v^{''}))$ with $σ (a^{'}) = σ (a)$ . Thus, ${lca}_{T} (a, b) = x ≻_{T} {lca}_{T} (a^{'}, b)$ for some $a^{'} \in L (T (v^{''}))$ , which implies that $a b \notin E (G)$ ; a contradiction. We conclude that $a b \in E (G)$ if and only if $a b \in E (H)$ and thus $(G, σ) = (H, σ)$ .

Let us now turn to Statement (2). In order to show that $(\hat{T}, \hat{t}, σ)$ explains the RBMG $(G, σ)$ we first note that, since $(G, σ)$ is a cograph by Statement (1), there is a unique discriminating cotree $(\hat{T}, \hat{t}, σ)$ for $(G, σ)$ . Furthermore, $(\hat{T}, \hat{t}, σ)$ is obtained from any cotree $(T, t_{μ}, σ)$ for $(G, σ)$ by contracting all edges uv in T with $t_{μ} (u) = t_{μ} (v)$ (Hellmuth et al. 2013). It remains to show that ab is an edge in $(G, σ)$ if and only if ab forms a reciprocal best match in $(\hat{T}, σ)$ .

First consider duplications. Suppose we have contracted the edge xv with $t_{μ} (x) = t_{μ} (v) = □$ . By assumption, for all children $v^{'}, v^{''}$ of v we have $σ (L (T (v^{'}))) = σ (L (T (v^{''})))$ . Moreover, since $σ (L (T (v)))$ is the union of the species sets $σ (L (T (w)))$ of its children w, we have $σ (L (T (v))) = σ (L (T (v^{'}))) = σ (L (T (v^{''})))$ . Hence, after contraction of xv, the vertices $v^{'}$ and $v^{''}$ are now children of x and still satisfy $σ (L (\hat{T} (v^{'}))) = σ (L (\hat{T} (v^{''})))$ . In particular, $σ (L (\hat{T} (v^{'}))) = σ (L (\hat{T} (w)))$ for every child w of x. By induction on the number of contracted edges, every vertex x in $\hat{T}$ with $\hat{t} (x) = □$ still satisfies $σ (L (\hat{T} (v^{'}))) = σ (L (\hat{T} (v^{''})))$ for all children $v^{'}, v^{''}$ of x in $\hat{T}$ . Thus, the same argument as in the proof of Statement (1) implies that ab cannot be a reciprocal best match in $\hat{T}$ for all $a \in L (T (v^{'}))$ and $b \in L (T (v^{''}))$ . We also have ${lca}_{\hat{T}} (a, b) = x$ for $a \in L (T (v^{'}))$ and $b \in L (T (v^{''}))$ , and thus $\hat{t} ({lca}_{\hat{T}} (a, b)) = □$ . Since $(\hat{T}, \hat{t}, σ)$ is a cotree for the cograph $(G, σ)$ , $\hat{t} ({lca}_{\hat{T}} (a, b)) = □$ implies $a b \notin E (G)$ . Therefore, $a b \notin E (G)$ unless a and b form a reciprocal best match in $(\hat{T}, σ)$ .

Let us now turn to speciation vertices. Lemma 47 in (Geiß et al. 2019b) states, in particular, that all non-discriminating edges uv with $t_{μ} (u) = t_{μ} (v) = ●$ can be contracted to obtain a tree that still explains $(G, σ)$ . Thus, if a and b are reciprocal best matches in $(\hat{T}, σ)$ , then $a b \in E (G)$ . We conclude, therefore, that $a b \in E (G)$ if and only if a and b are reciprocal best matches in $(\hat{T}, σ)$ . $□$

Prop. 3 shows that if the no loss condition of Def. 10 holds, then $(Θ (T, t_{μ}), σ) = G (T, σ)$ is a co-RBMG, an hc-cograph, and an orthology relation.

The no loss condition of Def. 10 is very restrictive, however, and thus in general will not be satisfied in real-life data. Theorem 1 shows that orthology relations correspond to properly colored cographs with compatible sets of the informative triples. The characterization of co-RBMGs in (Geiß et al. 2019b), on the other hand, shows that only hc-colorings may appear. Since the requirement that $σ$ is a proper coloring already implies disjointness of the color sets for join operations, we can interpret the hc-coloring condition as a condition on duplication vertices. The offending vertices are exactly those for which (i) $t (u) = □$ and (ii) there are two children $v^{'}, v^{''} \in child (u)$ such that both $σ (L (T (v^{'}))) \ σ (L (T (v^{''}))) \neq \emptyset$ and $σ (L (T (v^{''}))) \ σ (L (T (v^{'}))) \neq \emptyset$ . In this case, there is a pair of species such that a different “paralog group” (that is, a lineage of genes descending from a duplication) is missing in each of them. Every pair of vertices $a \in L (T (v^{'}))$ with $σ (a) \notin σ (L (T (v^{''})))$ and $b \in L (T (v^{''}))$ with $σ (b) \notin σ (L (T (v^{'})))$ forms a best match and thus a false positive orthology assignment. Since an RBMG is a cograph only if it is hierarchically colored, the presence of such duplications implies that the RBMG is not a cograph. At least in principle, therefore, it should be possible to identify the false positive edges by means of a suitable cograph-editing approach.

Before closing this section, we briefly return to the existence of reconciliation maps. Since every hc-cograph is a properly colored cograph, Theorem 1 immediately implies

Corollary 4

Let $Θ$ be an hc-cograph with vertex set L and associated hc-cotree $(T_{hc}^{Θ}, t_{hc}, σ)$ with leaf set L. Then there exists a reconciliation map $μ$ from $(T_{hc}^{Θ}, t_{hc}, σ)$ to some species tree S if and only if $S (T_{Θ}, t_{Θ}, σ)$ is compatible.

By Cor. 4, it is not necessarily possible to reconcile a (discriminating) hc-cotree with any species tree. An example is shown in Fig. 5. To be more precise, the hc-cotree $(T_{hc}^{G}, t_{hc}, σ)$ in Fig. 5 yields the conflicting species triples AB|C and AC|B. Hence, Prop. 1 implies that $(T_{hc}^{G}, t_{hc}, σ)$ cannot be reconciled with any species tree even though $(T_{hc}^{G}, σ)$ explains the RBMG $(G, σ)$ . One can contract edges of $(T_{hc}^{G}, σ)$ to obtain a least resolved tree $(T^{*}, σ)$ that still explains $(G, σ)$ , see Fig. 5 (top right). In agreement with Lemma 7, $S (T^{*}, t_{μ}, σ) = \emptyset$ and thus, there is always a reconciliation map $μ$ from $(T^{*}, t_{μ}, σ)$ to any species tree S with $L (S) = σ (L (T))$ . Moreover, in agreement with Theorem 2, all orthologous pairs in $Θ (T^{*}, {\hat{t}}_{T}, σ)$ are best matches. Although $(T^{*}, σ)$ explains $(G, σ)$ , the two graphs $(G, σ) = (Θ (T_{hc}^{G}, t_{hc}), σ)$ and $(Θ (T^{*}, {\hat{t}}_{T}), σ)$ are very different. In particular, by Corollary 3, $Θ (T^{*}, {\hat{t}}_{T})$ is the disjoint union of cliques.

Observation 2

In general it is not necessary to edit $(G, σ)$ to a disjoint union of cliques to obtain a valid orthology relation.

An example is provided by the tree $(T, {\hat{t}}_{T}, σ)$ in Fig. 5. Obviously, $Θ (T, {\hat{t}}_{T})$ is not the disjoint union of cliques. Moreover, AB|C is the only informative triple displayed by $(T, {\hat{t}}_{T}, σ)$ where A, B, and C correspond to the red, blue and green species, respectively. Prop. 1 implies that $(T, {\hat{t}}_{T}, σ)$ can be reconciled with any species tree that displays AB|C. In other words, $Θ (T, {\hat{t}}_{T})$ is already “biologically feasible” and there is no need to remove further edges from $Θ (T, {\hat{t}}_{T})$ .

Non-orthologous reciprocal best matches

In this section we investigate to what extent false positive orthology assignments in the reciprocal best match graph can be identified. Since the orthology relation $Θ$ must be a cograph, it is natural to consider the smallest obstructions, i.e., induced $P_{4}$ s in more detail. First we note that every induced $P_{4}$ in an RBMG contains either three or four distinct colors (Geiß et al. 2019b, Sect. E). Each $P_{4}$ in an RBMG $(G, σ)$ spans an induced subgraph of every BMG $(\vec{G}, σ)$ that contains $(G, σ)$ as its symmetric part. These induced subgraphs of a BMG $(\vec{G}, σ)$ with four vertices are known as quartets. With respect to a fixed BMG, every induced $P_{4}$ belongs to one of three distinct types which are defined in terms of its coloring and the quartet in which it resides. An induced $P_{4}$ with edges ab, bc, and cd is denoted by $⟨ a b c d ⟩$ or, equivalently, $⟨ d c b a ⟩$ .

Definition 11

Let $(\vec{G}, σ)$ be a BMG explained by the tree $(T, σ)$ , with symmetric part $(G, σ)$ and let $Q : = \{x, x^{'}, y, z\} \subseteq L (T)$ with $σ (x) = σ (x^{'})$ and pairwise distinct colors $σ (x)$ , $σ (y)$ , and $σ (z)$ . The set Q, resp., the induced subgraph $({\vec{G}}_{| Q}, σ_{| Q})$ is

a good quartet if (i) $⟨ x y z x^{'} ⟩$ is an induced $P_{4}$ in $(G, σ)$ and (ii) $(x, z), (x^{'}, y) \in E (\vec{G})$ and $(z, x), (y, x^{'}) \notin E (\vec{G})$ ,
a bad quartet if (i) $⟨ x y z x^{'} ⟩$ is an induced $P_{4}$ in $(G, σ)$ and (ii) $(z, x), (y, x^{'}) \in E (\vec{G})$ and $(x, z), (x^{'}, y) \notin E (\vec{G})$ , and
an ugly quartet if $⟨ x y x^{'} z ⟩$ is an induced $P_{4}$ in $(G, σ)$ .

If Q is a good, bad, or ugly quartet we will refer to the underlying induced $P_{4}$ as a good, bad, or ugly quartet, respectively. Lemma 32 of Geiß et al. (2019b) states that every quartet Q in an RBMG $(G, σ)$ that is contained in a BMG $(\vec{G}, σ)$ is either good, bad, or ugly. An example of an RBMG containing good, bad, and ugly quartets is shown in Fig. 6. Note that good, bad, and ugly quartets cannot appear in RBMGs of Type (A). These are cographs and thus by definition do not contain induced $P_{4}$s.
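The case distinction of Definition 11 can be sketched in code. The following is a minimal illustration, assuming the BMG is given as a set of directed arcs and the coloring as a dictionary; the function names `symmetric_edge`, `is_induced_p4`, and `classify_quartet` are hypothetical and not taken from the paper. The example arcs are the best matches of the tree ((a1, b1), (a2, c1)) with colors A, B, C.

```python
def symmetric_edge(u, v, arcs):
    """True if u and v are reciprocal best matches, i.e., uv is an RBMG edge."""
    return (u, v) in arcs and (v, u) in arcs


def is_induced_p4(path, arcs):
    """Check that path = (a, b, c, d) is an induced P4 in the symmetric part."""
    a, b, c, d = path
    edges = [(a, b), (b, c), (c, d)]
    non_edges = [(a, c), (a, d), (b, d)]
    return (len(set(path)) == 4
            and all(symmetric_edge(u, v, arcs) for u, v in edges)
            and not any(symmetric_edge(u, v, arcs) for u, v in non_edges))


def classify_quartet(path, arcs, color):
    """Return 'good', 'bad', 'ugly', or None for a path given in path order."""
    if not is_induced_p4(path, arcs):
        return None
    if len({color[v] for v in path}) != 3:
        return None
    a, b, c, d = path
    if color[a] == color[d]:
        # Path has the form <x y z x'> with the two endpoints sharing a color.
        x, y, z, x2 = a, b, c, d
        if ((x, z) in arcs and (x2, y) in arcs
                and (z, x) not in arcs and (y, x2) not in arcs):
            return "good"
        if ((z, x) in arcs and (y, x2) in arcs
                and (x, z) not in arcs and (x2, y) not in arcs):
            return "bad"
        return None
    if color[a] == color[c] or color[b] == color[d]:
        # Path has the form <x y x' z> (read in either direction).
        return "ugly"
    return None


color = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
arcs = {("a1", "b1"), ("b1", "a1"), ("b1", "c1"), ("c1", "b1"),
        ("c1", "a2"), ("a2", "c1"), ("a1", "c1"), ("a2", "b1")}
print(classify_quartet(("a1", "b1", "c1", "a2"), arcs, color))
```

Because the classification depends on the one-directional arcs, the same undirected path can be good with respect to one BMG and bad with respect to another, as in Fig. 6.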

Fig. 6 — The 3-RBMG $(G, σ)$ is explained by two trees $(T_{1}, σ)$ and $(T_{2}, σ)$ . These induce distinct BMGs $\vec{G} (T_{1}, σ)$ and $\vec{G} (T_{2}, σ)$ . In $\vec{G} (T_{1}, σ)$ , $P^{1} = ⟨ a_{1} b_{1} c_{1} a_{2} ⟩$ defines a good quartet, while $P^{2} = ⟨ a_{1} c_{2} b_{2} a_{2} ⟩$ induces a bad quartet. In $\vec{G} (T_{2}, σ)$ the situation is reversed. The good quartets in $\vec{G} (T_{1}, σ)$ and $\vec{G} (T_{2}, σ)$ are indicated by red edges. The induced paths $⟨ a_{1} b_{1} c_{1} b_{2} ⟩$ and $⟨ a_{2} c_{1} b_{1} c_{2} ⟩$ are examples of ugly quartets. Figure reused from (Geiß et al. 2019b), ©Springer (color figure online)

The location of good quartets (in contrast to bad and ugly quartets) turns out to be strictly constrained. This fact can be used to show that the “middle” edge of any good quartet must be a false positive orthology assignment:

Lemma 8

Let $(T, σ)$ be some leaf-labeled tree and ${\hat{t}}_{T}$ the extremal event labeling for $(T, σ)$ . If $⟨ x y z x^{'} ⟩$ is a good quartet in the BMG $\vec{G} (T, σ)$ , then ${\hat{t}}_{T} (v) = □$ for $v : = lca (x, x^{'}, y, z)$ .

Proof

Lemma 36 of Geiß et al. (2019b) implies that for a good quartet $⟨ x y z x^{'} ⟩$ in $\vec{G} (T, σ)$ with $v : = lca (x, x^{'}, y, z)$ there are two distinct children $v_{1}, v_{2} \in child (v)$ such that $x, y ⪯_{T} v_{1}$ and $x^{'}, z ⪯_{T} v_{2}$ . Thus, in particular, $v_{1}$ and $v_{2}$ must be inner vertices in $(T, σ)$ . Since $σ (x) = σ (x^{'})$ by definition of a good quartet, we have $σ (L (T (v_{1}))) \cap σ (L (T (v_{2}))) \neq \emptyset$ . Hence, ${\hat{t}}_{T} (v) = □$ by definition of ${\hat{t}}_{T}$ (cf. Definition 6). $□$
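The argument of the proof can be illustrated with a small sketch. It assumes that the extremal labeling marks an inner vertex as a duplication exactly when two of its child subtrees share a color, and as a speciation otherwise; this reading is inferred from the disjointness criterion used above and is not a verbatim transcription of Definition 6. The function name and the dictionary-based tree encoding are hypothetical.

```python
def extremal_labeling(children, color, root):
    """Label inner vertices bottom-up by comparing the color sets of children.

    children : dict mapping each inner vertex to the list of its children
    color    : dict mapping each leaf to its color (species)
    Returns a dict inner vertex -> 'duplication' or 'speciation'.
    """
    colors_below = {}
    label = {}

    def visit(v):
        kids = children.get(v, [])
        if not kids:                      # v is a leaf
            colors_below[v] = {color[v]}
            return
        seen = set()
        overlap = False
        for child in kids:
            visit(child)
            if seen & colors_below[child]:
                overlap = True            # two child subtrees share a color
            seen |= colors_below[child]
        colors_below[v] = seen
        label[v] = "duplication" if overlap else "speciation"

    visit(root)
    return label


# Tree ((a1, b1), (a2, c1)): <a1 b1 c1 a2> is a good quartet in its BMG.
children = {"r": ["v1", "v2"], "v1": ["a1", "b1"], "v2": ["a2", "c1"]}
color = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
print(extremal_labeling(children, color, "r"))
```

In this example the root is the last common ancestor of the good quartet, and both of its child subtrees contain color A, so the root is labeled as a duplication, in line with Lemma 8.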

As an immediate consequence of Lemma 8 and Cor. 1, an analogous statement is true for event labelings $t_{μ}$ for a given reconciliation map:

Corollary 5

Let T and S be planted trees, $σ : L (T) \to L (S)$ a surjective map, and $μ$ a reconciliation map from $(T, σ)$ to S. If $⟨ x y z x^{'} ⟩$ is a good quartet in the BMG $\vec{G} (T, σ)$ , then $t_{μ} (v) = □$ for $v : = lca (x, x^{'}, y, z)$ .

Given an RBMG $(G, σ)$ that contains a good quartet $⟨ x y z x^{'} ⟩$ (w.r.t. the underlying BMG $(\vec{G}, σ)$ ), the edge yz therefore always corresponds to a false positive orthology assignment, i.e., it is not contained in the true orthology relation $Θ$ .

Not all false positives can be identified in this way from good quartets, however. The RBMG $G (T_{1}, σ)$ in Fig. 7, for instance, contains only one good quartet, that is $⟨ a_{1} c_{2} b_{2} a_{2} ⟩$ . After removal of the false positive edge $c_{2} b_{2}$ , the remaining undirected graph still contains the bad quartet $⟨ a_{1} b_{1} c_{1} a_{2} ⟩$ , hence, in particular, it still contains an induced $P_{4}$ and is, therefore, not an orthology relation.

Neither bad nor ugly quartets can be used to unambiguously identify false positive edges. For an example, consider Fig. 7. The two 3-RBMGs $G (T_{1}, σ)$ and $G (T_{2}, σ)$ both contain the bad quartet $⟨ a_{1} b_{1} c_{1} a_{2} ⟩$ . As a consequence of Lemma 2, neither the root of $T_{1}$ nor the root of $T_{2}$ can be labeled by a speciation event. Hence, as $a_{1}, b_{1}, c_{1}, a_{2}$ all reside in different subtrees below the root of $T_{1}$ , all edges $a_{1} b_{1}, b_{1} c_{1}, c_{1} a_{2}$ in $G (T_{1}, σ)$ correspond to false positive orthology assignments. On the other hand, the vertices $b_{1}$ and $c_{1}$ reside within the same 2-colored subtree below the root of $T_{2}$ and are adjacent to the same parent in $T_{2}$ . Therefore, one easily checks that there exist reconciliation scenarios where $b_{1}$ and $c_{1}$ are orthologous, hence the edge $b_{1} c_{1}$ must indeed be contained in the orthology relation. Similarly, $⟨ a_{1} b_{1} c_{1} b_{2} ⟩$ and $⟨ a_{1} b_{1} a_{3} c_{2} ⟩$ are ugly quartets in $G (T_{1}, σ)$ and $G (T_{2}, σ)$ , respectively. By the same argument as before, the edges $a_{1} b_{1}$ , $b_{1} c_{1}$ , and $c_{1} b_{2}$ are false positives in $G (T_{1}, σ)$ . For $(T_{2}, σ)$ , however, there exist reconciliation scenarios where $a_{3}$ and $c_{2}$ are orthologs.

Cor. 9 of Geiß et al. (2019b), finally, implies that every (B)-RBMG and every (C)-RBMG contains at least one good quartet. In particular, therefore, there is at least one false positive orthology assignment that can be identified with the help of good quartets. We shall see in Sect. 7.2, using simulated data, that in practice the overwhelming majority of false positive orthology assignments is already identified by good quartets.

From a theoretical point of view it is interesting nevertheless that it is possible to identify even more false positive orthology assignments starting from Lemma 2. It implies that $t (lca (x, y)) = □$ whenever x and y are located in two distinct leaf sets defined for the same connected component of an induced 3-RBMG of Type (B) or (C). Details can be found in (Geiß et al. 2019b, Lemma 25) and the Supplemental Material. At least in our simulation data, scenarios of this type that are not already covered by a good quartet seem to be exceedingly rare, and hence of little practical relevance.

Simulations

Although the edges in the RBMG cannot identify orthologous pairs with certainty (as a consequence of Lemma 3), there is a close resemblance in practice, i.e., for empirically determined scenarios. In order to explore this connection in more detail, we consider simulated evolutionary scenarios $(T, S, μ)$ . These uniquely determine both the (reciprocal) best match graph $\vec{G} (T, σ)$ and $G (T, σ)$ , resp., and the orthology graph $Θ$ , thus allowing a direct comparison of these graphs. Since we only analyze scenarios $(T, S, μ)$ , we did not use simulation tools such as ALF (Dalquén et al. 2011) that are designed to simulate sequence data.

Simulation methods

In order to simulate evolutionary scenarios $(T, S, μ)$ we employ a stepwise procedure:

(1) Construction of the species tree S. We regard S as an ultrametric tree, i.e., its branch lengths are interpreted as real time. Given a user-defined number of species N we generate S under the innovations model as described by Keller-Schmidt and Klemm (2012). The binary trees generated by this model have similar depth and imbalances as those of real phylogenetic trees from databases.
(2) Construction of the true gene tree $\tilde{T}$ . Traversing the species tree S top-down, one gene tree $\tilde{T}$ is generated with user-defined rates $r_{D}$ for duplications, $r_{L}$ for losses, and $r_{H}$ for horizontal transfer events. The number of events along each edge of the species tree, of each type of event, is drawn from a Poisson distribution with parameter $λ = ℓ r_{e}$ , where $ℓ$ is the length of the edge e and $r_{e}$ is the rate of the event type. Duplication and horizontal transfer events duplicate an active lineage and occur only inside edges of S. For duplications, both offspring lineages remain inside the same edge of the species tree as the parental gene. In contrast, one of the two offspring of an HGT event is transferred to another, randomly selected, branch of the species tree at the same time. At speciation nodes all branches of the gene tree are copied into each offspring. Loss events terminate branches of $\tilde{T}$ . Loss events may occur only within edges of the species tree that harbor more than one branch of the gene tree. Thus every leaf of S is reached by at least one branch of the gene tree $\tilde{T}$ . All vertices v of $\tilde{T}$ are labeled with their event type t(v); in particular, there are different leaf labels for extant genes and lost genes. The simulation explicitly records the reconciliation map, i.e., the assignment of each vertex of $\tilde{T}$ to a vertex or edge of S.
(3) Construction of the observable gene tree T from $\tilde{T}$ . The leaves of $\tilde{T}$ are either observable extant genes or unobservable losses. As described by Hernandez-Rosales et al. (2012), we prune $\tilde{T}$ in bottom-up order by removing all loss events and omitting all inner vertices with only a single remaining child.
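The pruning in the last step can be sketched as follows. This is a minimal illustration under the assumption that the true gene tree is stored as a dictionary from inner vertices to child lists and that loss leaves are given as a set; the function name `prune_losses` is hypothetical and this is not the authors' simulation code. Note that the sketch also suppresses a root that is left with a single child, which may differ from the treatment of planted roots in the paper.

```python
def prune_losses(children, root, losses):
    """Remove loss leaves and suppress inner vertices with one remaining child.

    children : dict mapping each inner vertex to the list of its children
    root     : root vertex of the true gene tree
    losses   : set of leaves that represent loss events
    Returns (pruned_children, new_root); new_root is None if nothing survives.
    """
    pruned = {}

    def prune(v):
        kids = children.get(v, [])
        if not kids:                       # leaf: keep only extant genes
            return None if v in losses else v
        kept = [w for w in (prune(c) for c in kids) if w is not None]
        if not kept:                       # the whole subtree was lost
            return None
        if len(kept) == 1:                 # suppress vertex with a single child
            return kept[0]
        pruned[v] = kept
        return v

    new_root = prune(root)
    return pruned, new_root


# True gene tree with two loss leaves x1 and x2.
children = {"r": ["u", "w"], "u": ["a", "x1"], "w": ["b", "v"], "v": ["c", "x2"]}
print(prune_losses(children, "r", {"x1", "x2"}))
```

The recursion processes children before their parent, which corresponds to the bottom-up order mentioned in the text.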

Using steps (1) and (2), we simulated 10,000 scenarios for species trees with 3 to 100 species (=leaves) and an additional 4000 scenarios for species trees with 3 to 50 leaves, drawn from a uniform distribution. For each of these species trees, exactly one gene tree was simulated as described above. The rate parameters were varied between 0.65 and 0.99 in steps of 0.01 for duplication and loss events. For HGTs, a rate in the range between 0.1 and 0.24, again in steps of 0.01, was used. A detailed list of all simulated scenarios can be found in the Supplemental Material. For each of the 14,000 true gene trees $\tilde{T}$ the total number $S_{n}$ of speciation events, $L_{n}$ of losses, $D_{n}$ of duplications, and $H_{n}$ of HGTs was determined. Summary statistics of the simulated scenarios are compiled in the Supplemental Material.

From each true gene tree $\tilde{T}$ we extracted the observable gene tree T as described in Step (3). For all retained vertices the reconciliation map $μ$ and thus the event labeling $t = t_{μ}$ remains unchanged. Since ${lca}_{T} (x, y) = {lca}_{\tilde{T}} (x, y)$ for all extant genes $x, y \in L (T)$ , it suffices to consider T. The leaf coloring map $σ : L (T) \to L (S)$ is obtained from its definition, i.e., setting $σ (v) = μ (v)$ for all $v \in L (T)$ . We can now extract the orthology relation and reciprocal best match relation from each scenario.
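As an illustration of how the best match relation can be read off a leaf-colored tree, the sketch below uses the standard definition that y is a best match of x if $σ (y) \neq σ (x)$ and ${lca}_{T} (x, y) ⪯_{T} {lca}_{T} (x, y^{'})$ for all $y^{'}$ with $σ (y^{'}) = σ (y)$ . The tree encoding and the function names are assumptions made for this example, and the procedure is a straightforward walk towards the root rather than the efficient method of the Supplemental Material.

```python
def best_match_arcs(children, color, root):
    """Return the set of arcs (x, y) such that y is a best match of x."""
    parent = {}
    leaves = {}

    def collect(v):
        kids = children.get(v, [])
        if not kids:
            leaves[v] = [v]
            return
        leaves[v] = []
        for child in kids:
            parent[child] = v
            collect(child)
            leaves[v] += leaves[child]

    collect(root)
    arcs = set()
    for x in leaves[root]:
        resolved = {color[x]}              # colors whose best matches are known
        v = x
        while v in parent:                 # climb from x towards the root
            v = parent[v]
            found = set()
            for y in leaves[v]:
                if color[y] not in resolved:
                    arcs.add((x, y))       # lowest ancestor containing this color
                    found.add(color[y])
            resolved |= found
    return arcs


def reciprocal_edges(arcs):
    """Symmetric part of the BMG, i.e., the edge set of the RBMG."""
    return {frozenset((x, y)) for (x, y) in arcs if (y, x) in arcs}


children = {"r": ["v1", "v2"], "v1": ["a1", "b1"], "v2": ["a2", "c1"]}
color = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
arcs = best_match_arcs(children, color, "r")
print(sorted(tuple(sorted(e)) for e in reciprocal_edges(arcs)))
```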

The orthology relation $Θ (T, t)$ is easily constructed from the event labeled gene tree (T, t), since $x y \in Θ (T, t)$ if and only if ${lca}_{T} (x, y)$ is labeled as a speciation event. An efficient way to compute $Θ (T, t)$ and the RBMG $(G, σ)$ that avoids the explicit evaluation of ${lca}_{T} ()$ is described in the Supplemental Material. For each reconciliation scenario $(T, S, μ)$ , we also identify all good quartets in the BMG $(\vec{G}, σ)$ and then delete the middle edge of the corresponding $P_{4}$ from the RBMG $(G, σ)$ . The resulting graph will be referred to as $(G_{4}, σ_{4})$ .
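One simple way to obtain the orthology relation without querying last common ancestors pair by pair is a single bottom-up pass: at every speciation vertex, all pairs of leaves from distinct child subtrees become orthologs. The sketch below illustrates this idea under assumed data structures (child dictionary, label dictionary); it is not claimed to be the procedure described in the Supplemental Material, and the function name is hypothetical.

```python
from itertools import combinations


def orthology_edges(children, label, root):
    """Edges xy such that the last common ancestor of x and y is a speciation.

    children : dict mapping each inner vertex to the list of its children
    label    : dict mapping each inner vertex to 'speciation' or 'duplication'
    """
    edges = set()

    def leaves_below(v):
        kids = children.get(v, [])
        if not kids:
            return [v]
        parts = [leaves_below(child) for child in kids]
        if label[v] == "speciation":
            # v is the lca of every pair taken from two distinct child subtrees
            for left, right in combinations(parts, 2):
                edges.update(frozenset((x, y)) for x in left for y in right)
        return [x for part in parts for x in part]

    leaves_below(root)
    return edges


children = {"r": ["v1", "v2"], "v1": ["a1", "b1"], "v2": ["a2", "c1"]}
label = {"r": "duplication", "v1": "speciation", "v2": "speciation"}
print(sorted(tuple(sorted(e)) for e in orthology_edges(children, label, "r")))
```

For this tree the RBMG additionally contains the edge b1c1, which is the middle edge of the good quartet $⟨ a_{1} b_{1} c_{1} a_{2} ⟩$ ; deleting it leaves exactly the two orthology edges computed here.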

Simulation results for duplication/loss scenarios

In order to assess the practical relevance of co-RBMGs we measured the abundance of non-cograph components in the simulated RBMGs. More precisely, we determined for each simulated RBMG the connected components of its restrictions to any three distinct colors and determined whether these components are cographs, graphs of Type (B), or graphs of Type (C). In order to identify these graph types, we used algorithms of Hoàng et al. (2013) to first identify an induced $P_{4}$ belonging to a good quartet. If one exists, we check for the existence of an induced $P_{5}$ and then test whether its endpoints are connected, thus forming a hexagon characteristic of a Type (C) graph. Otherwise, the presence of the $P_{4}$ implies Type (B), while the absence of induced $P_{4}$ s guarantees that the component is a cograph.
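The cograph test at the end of this procedure can be illustrated with a brute-force search for an induced $P_{4}$ . The sketch below is only meant to make the criterion concrete for small components: it enumerates ordered 4-tuples and therefore takes time proportional to $n^{4}$ , it is not the algorithm of Hoàng et al. (2013), and it does not distinguish Type (B) from Type (C). The function names are hypothetical.

```python
from itertools import permutations


def find_induced_p4(vertices, edges):
    """Return some induced path (a, b, c, d) on four vertices, or None.

    edges is a set of frozensets, one per undirected edge.
    """
    def adjacent(u, v):
        return frozenset((u, v)) in edges

    for a, b, c, d in permutations(vertices, 4):
        if (adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
                and not adjacent(a, c) and not adjacent(a, d)
                and not adjacent(b, d)):
            return (a, b, c, d)
    return None


def is_cograph(vertices, edges):
    """A graph is a cograph if and only if it has no induced P4."""
    return find_induced_p4(vertices, edges) is None


path = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d")]}
cycle = path | {frozenset(("d", "a"))}
print(is_cograph("abcd", path), is_cograph("abcd", cycle))
```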

We did not encounter a single Type (C) component in 14,000 simulated scenarios. As we shall see, this is a consequence of the fact that all simulated trees are binary. To see this, we consider the structure of connected 3-RBMGs of Type (C) in some more detail, generalizing some technical results by Geiß et al. (2019b):

Lemma 9

Let $(G, σ)$ be a connected 3-RBMG containing the induced $C_{6}$ $⟨ x_{1} y_{1} z_{1} x_{2} y_{2} z_{2} ⟩$ with three distinct colors r, s, and t such that $σ (x_{1}) = σ (x_{2}) = r$ , $σ (y_{1}) = σ (y_{2}) = s$ , and $σ (z_{1}) = σ (z_{2}) = t$ . Then, every tree $(T, σ)$ that explains $(G, σ)$ must satisfy the following property: There exist distinct $v_{1}, v_{2}, v_{3} \in child (v)$ where $v : = {lca}_{T} (x_{1}, x_{2}, y_{1}, y_{2}, z_{1}, z_{2})$ such that either $x_{1}, y_{1} ⪯_{T} v_{1}$ , $x_{2}, z_{1} ⪯_{T} v_{2}$ , $y_{2}, z_{2} ⪯_{T} v_{3}$ or $y_{1}, z_{1} ⪯_{T} v_{1}$ , $x_{2}, y_{2} ⪯_{T} v_{2}$ , $x_{1}, z_{2} ⪯_{T} v_{3}$ .

Proof

If $| V (G) | > 6$ , then, due to the connectedness of $\vec{G}$ , at least one of the six vertices of the induced $C_{6}$ is adjacent to more than one vertex of one of the colors r, s, t, hence the first statement immediately follows from Lemma 39(iii) in Geiß et al. (2019b). Now consider the special case $| V (G) | = 6$ . By Cor. 9 of Geiß et al. (2019b), $\vec{G} (T, σ)$ contains a good quartet. W.l.o.g. let $⟨ x_{1} y_{1} z_{1} x_{2} ⟩$ be a good quartet, thus $(x_{1}, z_{1}), (x_{2}, y_{1}) \in E (\vec{G})$ and $(z_{1}, x_{1}), (y_{1}, x_{2}) \notin E (\vec{G})$ . This, in particular, implies ${lca}_{T} (x_{2}, z_{1}) ≺_{T} {lca}_{T} (x_{1}, z_{1})$ , thus there are distinct children $v_{1}, v_{2} \in child (v)$ such that $x_{1} ⪯_{T} v_{1}$ and $x_{2}, z_{1} ⪯_{T} v_{2}$ . Moreover, as $x_{1} y_{1} \in E (G)$ and $(y_{1}, x_{2}) \notin E (\vec{G})$ , we have ${lca}_{T} (x_{1}, y_{1}) ≺_{T} {lca}_{T} (x_{2}, y_{1})$ , hence $y_{1} ⪯_{T} v_{1}$ . Now consider $y_{2}$ . Since $x_{1} y_{2} \notin E (G)$ and $x_{2} y_{2} \in E (G)$ , it must hold that ${lca}_{T} (x_{2}, y_{2}) ⪯_{T} {lca}_{T} (x_{1}, y_{2})$ , hence $y_{2} \notin L (T (v_{1}))$ . Assume, for contradiction, that $y_{2} ⪯_{T} v_{2}$ . Then, as $y_{2} z_{2} \in E (G)$ and ${lca}_{T} (y_{2}, z_{1}) ⪯_{T} v_{2}$ , we clearly have $z_{2} ⪯_{T} v_{2}$ . However, this implies ${lca}_{T} (x_{2}, z_{2}) ≺_{T} {lca}_{T} (x_{1}, z_{2})$ , contradicting $x_{1} z_{2} \in E (G)$ . We therefore conclude that there must exist a vertex $v_{3} \in child (v) \setminus \{v_{1}, v_{2}\}$ such that $y_{2} ⪯_{T} v_{3}$ . One easily checks that this implies $z_{2} ⪯_{T} v_{3}$ , which completes the proof. $□$

Theorem 5

If $(T, σ)$ is a binary leaf-labeled tree, then $G (T, σ)$ does not contain a connected component of Type (C).

Proof

By Obs. 6 of Geiß et al. (2019b), the restriction $(T_{rst}, σ_{rst})$ of $(T, σ)$ explains the subgraph $(G_{rst}, σ_{rst})$ of $G (T, σ)$ that is induced by vertices with color r, s, or t. Thm. 2 of Geiß et al. (2019b) shows, furthermore, that every connected component of $(G_{rst}, σ_{rst})$ is explained by the restriction $(T^{'}, σ^{'})$ of $(T_{rst}, σ_{rst})$ to the corresponding vertices. Now suppose $(T, σ)$ is binary. Then both $(T_{rst}, σ_{rst})$ and $(T^{'}, σ^{'})$ are also binary. By contraposition of Lemma 9, no $C_{6}$ as specified in Lemma 9 can be explained by $(T^{'}, σ^{'})$ , and thus $G (T, σ)$ cannot contain a connected component of Type (C). $□$

Although events that generate more than two offspring lineages are logically possible in real data, most multifurcations in phylogenetic trees are considered to be “soft polytomies”, arising from data that are insufficient to produce a fully resolved binary tree (Purvis and Garland Jr. 1993; Kuhn et al. 2011; Sayyari and Mirarab 2018). Type (C) 3-RBMGs thus should be very unlikely under biologically plausible assumptions on the model of evolution. Here we only consider the abundance of Type (B) components relative to all Type (A) and (B) components. We denote their ratio by $η$ . The results are summarized in Fig. 8. We find that $η$ is usually below 20% and increases with the number of loss and HGT events. More precisely, 83.47% of the 14,000 scenarios have at least one Type (B) component and 16.53% do not have Type (B) components at all. Among all 3-colored connected components taken from the restrictions to any three colors, 94.41% are of Type (A) and 5.59% are of Type (B).

Fig. 8 — Relative abundance $η = \frac{B}{B + A}$ of (B)-RBMGs in the simulation data. Panel a shows the dependence on the number of edges in the BMG in every simulated scenario, and its average depicted by the line in darker blue. Scatter plots b show the dependence of $η$ on the number of duplications and losses, and HGTs and losses, respectively (color figure online)

A graph G is called $P_{4}$ -sparse if every induced subgraph on five vertices contains at most one induced $P_{4}$ (Jamison and Olariu 1992). The interest in $P_{4}$ -sparse graphs derives from the fact that the cograph editing problem is solvable in linear time for $P_{4}$ -sparse graphs (Liu et al. 2012). It is of immediate practical interest, therefore, to determine the abundance of $P_{4}$ -sparse RBMGs that are not cographs. Among the 14,000 simulated scenarios, we found that about 20.9% of the 3-colored Type (B) components are $P_{4}$ -sparse, while the majority contains “overlapping” $P_{4}$ s. We then investigated the corresponding $S$ -thin graphs. An undirected colored graph $(G, σ)$ is called $S$ -thin if no two distinct vertices are in relation $S$ . Two vertices a and b are in relation $S$ if $N (a) = N (b)$ and $σ (a) = σ (b)$ . Somewhat surprisingly, this yields a reversed situation, where more than two thirds of the $S$ -thin 3-colored Type (B) components are now $P_{4}$ -sparse, while only a minority of 31.32% is not $P_{4}$ -sparse. An example of an undirected colored graph $(G, σ)$ and its corresponding $S$ -thin version $(G / S, σ_{/ S})$ , which we found during our simulations, is shown in Panel (B) of Fig. 9.
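The passage from a colored graph to its $S$ -thin version can be sketched as a quotient by the relation $S$ : vertices with the same color and the same neighborhood are merged into one representative. The following minimal example assumes open neighborhoods, an edge list of vertex pairs, and a hypothetical function name `s_thin_quotient`; it is an illustration, not the code used for the simulations.

```python
def s_thin_quotient(vertices, edges, color):
    """Merge vertices that have the same color and the same neighborhood.

    vertices : list of vertices
    edges    : list of pairs (u, v) of an undirected, properly colored graph
    color    : dict mapping each vertex to its color
    Returns (representatives, quotient_edges) with edges as frozensets.
    """
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    first_seen = {}        # (color, neighborhood) -> representative vertex
    representative = {}
    for v in vertices:
        key = (color[v], frozenset(neighbors[v]))
        representative[v] = first_seen.setdefault(key, v)

    reps = list(dict.fromkeys(representative.values()))
    quotient_edges = {frozenset((representative[u], representative[v]))
                      for u, v in edges}
    return reps, quotient_edges


vertices = ["a1", "a2", "b1", "c1"]
edges = [("a1", "b1"), ("a2", "b1"), ("b1", "c1")]
color = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
reps, q_edges = s_thin_quotient(vertices, edges, color)
print(reps, sorted(tuple(sorted(e)) for e in q_edges))
```

Two vertices in relation $S$ are never adjacent, since a vertex does not belong to its own open neighborhood, so every edge of the original graph connects two different classes.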

Fig. 9 — *Top:* Among our 14,000 simulated scenarios we found that a majority of 79.12% of the (not necessarily $S$ -thin) 3-colored Type (B) components are not $P_{4}$ -sparse. For the corresponding $S$ -thin versions of those 3-colored components only 31.32% are not $P_{4}$ -sparse while 68.68% are $P_{4}$ -sparse. *Below:* One of the simulated 3-colored Type (B) components $(G, σ)$ , which is not $S$ -thin, and its corresponding $S$ -thin version $(G / S, σ_{/ S})$ (color figure online)

Next we investigated the relationship of the RBMG $G (T, σ)$ and the orthology graph $Θ$ (see Fig. 10). We empirically confirmed that $E (Θ) \subseteq E (G (T, σ))$ in the absence of HGT (not shown). Also following our expectations, the fraction $| E (G (T, σ)) \setminus E (Θ) | / | E (G (T, σ)) |$ of false-positive orthology predictions in an RBMG is small as long as duplications and losses remain moderate (l.h.s. panel in Fig. 10). Most of the false positive orthology calls are associated with large numbers of losses for a given number of duplications.
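For concreteness, the false-positive fraction used here can be computed directly from the two edge sets. The helper below is a small sketch with a hypothetical name; returning 0.0 for an RBMG without edges is a convention chosen for this example only.

```python
def false_positive_fraction(rbmg_edges, orthology_edges):
    """Fraction of RBMG edges that are not contained in the orthology graph."""
    if not rbmg_edges:
        return 0.0          # convention for an empty RBMG (assumption)
    return len(rbmg_edges - orthology_edges) / len(rbmg_edges)


rbmg = {frozenset(("a1", "b1")), frozenset(("b1", "c1")), frozenset(("a2", "c1"))}
theta = {frozenset(("a1", "b1")), frozenset(("a2", "c1"))}
print(false_positive_fraction(rbmg, theta))
```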

We find that good quartets eliminate nearly all false positive edges from the RBMG and leave a nearly perfect orthology graph (r.h.s. panel in Fig. 10). As we have seen so far, reciprocal best matches indeed form an excellent approximation of orthology in duplication-loss scenarios. In particular, the good quartets identify nearly all false positive edges, making it easy to remove the few remaining $P_{4}$s using a generic cograph editing algorithm (Liu et al. 2012).

Outlook: evolutionary scenarios with horizontal gene transfer

The benign results above raise the question of how robust they are under HGT. Gene family histories with HGT have been a topic of intense study in recent years (Doyon et al. 2010; Tofigh et al. 2011; Bansal et al. 2012; Nøjgaard et al. 2018). Following the so-called DTL-scenarios as proposed, e.g., by Tofigh et al. (2011) and Bansal et al. (2012), we relax the notion of reconciliation maps, since ancestry is no longer preserved. We replace Axiom (R2) by

(R2w)
Weak Ancestor Preservation.

If $x ≺_{T} y$ , then either $μ (x) ⪯_{S} μ (y)$ or $μ (x)$ and $μ (y)$ are incomparable w.r.t. $≺_{S}$ .

and add the following constraints

(R3.iii)
Addition to the Speciation Constraint.

If $μ (x) \in W^{0}$ , then $μ (v) ⪯_{S} μ (x)$ for all $v \in child (x)$ .
(R4)
HGT Constraint.

If x has a child y such that $μ (x)$ and $μ (y)$ are incomparable, then x also has a child $y^{'}$ with $μ (y^{'}) ⪯_{S} μ (x)$ .

Property (R2w) equivalently states that if $x ≺_{T} y$ , then we must not have $μ (y) ≺_{S} μ (x)$ , which would invert the temporal order. Property (R3.iii) (which follows from (R2) but not from (R2w)) ensures that the children of speciation events are still mapped to positions that are comparable to the image of the speciation node. Condition (R4), finally, requires that every horizontal transfer event also has a vertically inherited offspring. Note that condition (R4) is void if (R2) holds. In summary, the axioms (R0), (R1), (R2w), (R3.i), (R3.ii), (R3.iii), and (R4) are a proper generalization of Def. 3. We note that these axioms are not sufficient to ensure time consistency, however. We refer to Nøjgaard et al. (2018) for details. Our choice of axioms also rules out some scenarios that may appear in reality (or simulations), but which are not observable when only evolutionary divergence is available as measurement. For example, Condition (R4) excludes scenarios in which HGT events have no surviving vertically inherited offspring.

We furthermore extend the event map t for a gene tree T to include HGT as an additional event type denoted by the symbol $▵$ . We define the extended map t such that $t (u) = ▵$ if and only if u has a child v such that $μ (u)$ and $μ (v)$ are incomparable. Since the offspring of an HGT event are not equivalent, it is useful to introduce an edge labeling $λ : E (T) \to \{0, 1\}$ such that $λ (u v) = 1$ if $μ (u)$ and $μ (v)$ are incomparable w.r.t. $≺_{S}$ . This edge labeling is investigated in detail by Geiß et al. (2018) as the basis of Fitch’s xenology relation. Alternatively, the asymmetry can be handled by enforcing an ordering of the vertices, see Hellmuth et al. (2017).
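The edge labeling $λ$ and the resulting set of HGT vertices can be sketched as follows. For simplicity the example assumes that $μ$ maps gene tree vertices to vertices of S only (images on edges of S are not modeled), that the species tree is given as a child-to-parent dictionary, and that $λ (u v) = 0$ whenever the images are comparable. All function names are hypothetical.

```python
def ancestors_or_self(v, parent):
    """List v together with all of its ancestors in the species tree."""
    chain = [v]
    while v in parent:
        v = parent[v]
        chain.append(v)
    return chain


def comparable(a, b, parent):
    """True if one of a, b is an ancestor of (or equal to) the other."""
    return a in ancestors_or_self(b, parent) or b in ancestors_or_self(a, parent)


def transfer_labels(gene_edges, mu, species_parent):
    """Return (lam, hgt_vertices) for a reconciled gene tree.

    gene_edges     : list of gene tree edges (u, v) with u the parent of v
    mu             : dict mapping gene tree vertices to species tree vertices
    species_parent : dict mapping each non-root species vertex to its parent
    """
    lam = {(u, v): 0 if comparable(mu[u], mu[v], species_parent) else 1
           for u, v in gene_edges}
    hgt_vertices = {u for (u, v), flag in lam.items() if flag == 1}
    return lam, hgt_vertices


# Species tree ((A, B)P, C)R; the gene at u in P transfers one copy to C.
species_parent = {"A": "P", "B": "P", "P": "R", "C": "R"}
mu = {"u": "P", "w": "A", "v": "C"}
print(transfer_labels([("u", "w"), ("u", "v")], mu, species_parent))
```

In the example the edge from u to w is vertical, so u also has a vertically inherited child as required by (R4).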

Evolutionary scenarios with horizontal transfer may lead to a situation where two genes x, y in the same species, i.e., with $σ (x) = σ (y)$ , derive from a speciation, i.e., ${lca}_{T} (x, y)$ is labeled as a speciation event. This is the case when the two lineages underwent an HGT event that transferred a copy back into the lineage in which the other gene has been vertically transmitted. We call such genes xeno-orthologs and exclude them from the orthology relation, see Fig. 11. This choice is motivated (1) by the fact that, by definition, genes of the same species cannot be recognized as reciprocal best matches, and (2) by the fact that, from a biological perspective, they behave rather like paralogs. In scenarios with HGT we therefore modify the definition of the orthology graph such that $E (G_{1} ⋈ G_{2})$ is replaced by

\begin{matrix} E (G_{1} \tilde{⋈} G_{2}) : = E (G_{1}) \cup E (G_{2}) \cup \{u v ∣ u \in V (G_{1}), v \in V (G_{2}) \text{ and } σ (u) \neq σ (v)\} . \end{matrix}
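The modified join can be written down directly. The sketch below assumes that each graph is represented as a pair of a vertex set and a set of frozenset edges; the function name `xeno_join` is hypothetical.

```python
def xeno_join(g1, g2, color):
    """Join two colored graphs, omitting edges between same-colored vertices.

    g1, g2 : pairs (vertex_set, edge_set), edges given as frozensets
    color  : dict mapping each vertex to its color (species)
    """
    (v1, e1), (v2, e2) = g1, g2
    cross = {frozenset((u, v)) for u in v1 for v in v2 if color[u] != color[v]}
    return v1 | v2, e1 | e2 | cross


color = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
g1 = ({"a1", "b1"}, {frozenset(("a1", "b1"))})
g2 = ({"a2", "c1"}, {frozenset(("a2", "c1"))})
vertices, edges = xeno_join(g1, g2, color)
print(sorted(tuple(sorted(e)) for e in edges))
```

In this example the pair a1, a2 belongs to the same species and is therefore left unconnected, whereas the ordinary join would add all four edges between the two vertex sets.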

Fig. 11 — A gene tree $(T, t, λ, σ)$ reconciled with a species tree S. Here, we have two transfer edges uv and $v b^{'}$ with $t (u) = t (v) = ▵$ . For the two children w and v of u it holds $σ (L (T (w))) \cap σ (L (T (v))) \neq \emptyset$ , a property that is shared with duplication vertices. For the two children $b^{'}$ and $c^{'}$ of v it holds $σ (L (T (b^{'}))) \cap σ (L (T (c^{'}))) = \emptyset$ , a property that is shared with speciation vertices. In this example, c and $c^{'}$ are xeno-orthologs and the pairs $(c, c^{'}), (c^{'}, c)$ will be excluded from the resulting orthology relation (color figure online)

The extremal map ${\hat{t}}_{T}$ as in Def. 6 cannot easily be extended to include HGT, as the speciation and duplication ( $□$ ) labels on some vertex u are defined solely by two mutually exclusive cases: the sets $σ (L (T (u_{1})))$ and $σ (L (T (u_{2})))$ are either disjoint or not for $u_{1}, u_{2} \in child (u)$ . Both cases, however, can also appear when we have HGT (see Fig. 11 for an example). That is, the fact that $σ (L (T (u_{1})))$ and $σ (L (T (u_{2})))$ are disjoint or not does not help to unambiguously identify the event types in the presence of HGT.

Prop. 1 can be generalized to the case that $(T, t, λ, σ)$ contains HGT events. The existence of reconciliation maps from an event-labeled tree $(T, t, λ, σ)$ to an unknown species tree can be characterized in terms of species triples $σ (a) σ (b) | σ (c)$ that can be derived from $(T, t, λ, σ)$ as follows: Denote by $E : = \{e \in E (T, t, λ, σ) ∣ λ (e) = 1\}$ the set of all transfer edges in the labeled gene tree and let $(T_{\bar{E}}, t, σ)$ be the forest obtained from $(T, t, λ, σ)$ by removing all transfer edges. By definition, $μ (x)$ and $μ (y)$ are incomparable for every transfer edge xy in T. The set $S (T, t, λ, σ)$ is the set of triples $σ (a) σ (b) | σ (c)$ where $σ (a)$ , $σ (b)$ , $σ (c)$ are pairwise distinct and either

ab|c is a triple displayed by a connected component $T^{'}$ of $T_{\bar{E}}$ such that the root of the triple is a speciation event,
or $a, b \in L (T_{\bar{E}} (x))$ and $c \in L (T_{\bar{E}} (y))$ for some transfer edge xy or yx of T.

Proposition 5

(Hellmuth 2017) Given an event-labeled, leaf-labeled tree $(T, t, σ)$ . Then, there is a reconciliation map $μ : V (T) \to V (S) \cup E (S)$ to some species tree S if and only if $S (T, t, σ)$ is compatible. In this case, $(T, t, σ)$ can be reconciled with every species tree S that displays the triples in $S (T, t, σ)$ .

Here, we have not added additional constraints on reconciliation maps that ensure that the map is also “time-consistent”, that is, genes do not travel “back” in the species tree, see (Nøjgaard et al. 2018) for further discussion on this. However, Prop. 5 gives at least a necessary condition for the existence of time-consistent reconciliation maps. A simple proof of Prop. 5 for the case that T is binary and does not contain HGT events can be found in (Hernandez-Rosales et al. 2012). Moreover, generalizations of reconciling event-labeled gene trees with species networks have been established by Hellmuth et al. (2019).

In contrast to pure DL scenarios, it is no longer guaranteed that all true orthology relationships are also reciprocal best matches. Figure 12 gives counterexamples. In three of these scenarios the RBMG contains an induced $P_{4}$ that mimics a good quartet. Removal of the middle edge of good quartets therefore not only reduces false positives in DL scenarios but also introduces additional false negatives in the presence of HGT (Fig. 13).

Fig. 12 — Scenarios with four genes, three species, and a single HGT event for which RBMG $G (T, σ)$ and orthology relation $Θ (T, t)$ differ. The BMG is shown for each scenario. In the first two cases (a) and (b), $G (T, σ)$ contains an induced $P_{4}$ in the RBMG, which might serve as indication for HGT events. In the remaining cases, the $G (T, σ)$ is a cograph, which does not represent the correct orthology relation, however. In scenario (c), the graph $G (T, σ)$ is a triangle with an attached edge, while the orthology relation is given by $Θ (T, t) = K_{4} - e - f$ with the missing edges $e = a_{1} a_{2}$ and $f = a_{1} b_{1}$ , where the latter results from the xenologous pair $a_{2}, b_{1}$ . In the remaining three cases (d)–(f), the RBMG is compared to the orthology relation $Θ (T, t) = K_{4} - e$ , where the edge e again corresponds to the edge between genes of the same species (color figure online)

Fig. 13 — Dependence of the fraction of false positive and false negative orthology assignments in RBMGs in the presence of different levels of HGT, measured as percentage of HGT events among all events in the simulated true gene trees $\tilde{T}$ . As in Fig. 10, data are shown as functions of the number of duplication and loss events in the scenario. While the number of false positives seems to depend very little on even high levels of HGT, the fraction of false negatives is rapidly increasing. Since HGT introduces good quartets that comprise only true orthology edges, their removal further increases the false negative rate (last column) (color figure online)

Discussion

In the theoretical part of this contribution we have clarified the relationships between (reciprocal) best match graphs (RBMGs), orthology, reconciliation map, gene tree, species tree, and event map for the case of duplication/loss scenarios.

The orthology graph $Θ$ is necessarily a subgraph of the RBMG. In the absence of HGT, RBMGs therefore produce only false positive but no false negative orthology assignments. Using not only reciprocal best matches but all best matches, furthermore, shows that good quartets identify almost all false positive edges. Removing the central edge of all good quartets in $(\vec{G}, σ)$ yields nearly perfect orthology estimates. This, however, implies that orthology inference is not solely based on reciprocal best matches. Instead, it is necessary to also include certain directional best matches, namely those that identify good quartets.

We observed that a small number of HGT events can cause large deviations between the RBMG $(G, σ)$ and the orthology graph $Θ$ . However, we have considered here the worst-case scenario, where HGT events occur between relatively closely related organisms. While this is of utmost relevance in some cases, for instance for toxin and virulence genes in bacteria, it is of little concern e.g. for the evolution of animals. In the latter case, xenologs almost always originate from bacteria or viruses, i.e., from outgroups. The xenologs then form their own group of co-orthologs and behave as if they had been lost in the species outside the subtree that received the horizontally transferred gene.

From a more theoretical point of view, our empirical findings in the HGT case raise two questions: (1) Are there local features in the (R)BMG that make it possible to unambiguously identify HGT, at least in some cases? (2) What kind of additional information can be integrated to distinguish good quartets arising from duplication/loss events that can be safely removed from those that are introduced by HGT and should be “repaired” in a different manner? Most obviously, one may ask whether the Fitch relation is sufficient (we conjecture that this is the case) (Geiß et al. 2018; Hellmuth and Seemann 2019), or whether it suffices to know that a leaf is a (recent) result of transfer (we conjecture that this is not enough in general).

The identification of edges in the RBMG that should or should not be removed has important implications for orthology detection approaches that enforce the cograph structure of the predicted orthology relation by means of cograph editing. While cograph editing is NP-complete in general (Liu et al. 2012), the complexity of the colored version, i.e., editing a properly colored graph to the nearest hc-cograph, remains open. The removal of false positive edges identified by good quartets empirically reduces the number of induced $P_{4}$s drastically. This observation also suggests considering hc-cograph editing with a given best match relation. We suspect that the additional knowledge of the directed edges makes the problem tractable since it already implies a unique least resolved tree that captures much of the cograph structure.

Cograph editing would be fully content with hc-cographs, i.e., co-RBMGs. These are not necessarily “biologically feasible” in the sense that they can be reconciled with a species tree. It will therefore be of interest to consider the problem of editing an hc-cograph to another hc-cograph that is reconcilable with some species tree or with a given one, a problem that has already been considered for orthology relations (Lafond et al. 2016; Lafond and El-Mabrouk 2014). Since the obstructions are conflicting triples with a speciation at their top node, the offending data are conflicting orthology assignments. It therefore seems natural to phrase the problem not as an arbitrary editing problem but instead to ask for a maximal induced sub-hc-cograph that implies a compatible triple set. If it is indeed true that triples necessarily displayed by the species tree can be extracted directly from the c(R)BMG, it will be of practical use to consider the corresponding edge deletion problem for c(R)BMGs. In particular, it would be interesting to know whether the latter problem is the same as asking for the maximal compatible subset of triples implied by the c(R)BMG or co-BMG.
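The notion of a compatible triple set can be illustrated with a short sketch. It assumes that the rooted triples have already been extracted and are given as tuples `(a, b, c)` meaning $ab|c$; how to obtain them from a c(R)BMG is not shown. Compatibility is tested with the classical BUILD idea of Aho et al. (connect `a` and `b` for every triple $ab|c$ and recurse on the connected components), and a simple greedy pass then returns an inclusion-maximal compatible subset. The greedy result depends on the input order and is in general not a maximum-cardinality subset; both function names are hypothetical.

```python
def is_compatible(leaves, triples):
    """BUILD-style test: True iff some rooted tree displays all triples.

    A triple (a, b, c) stands for ab|c, i.e. a and b are more closely
    related to each other than either is to c.
    """
    leaves = set(leaves)
    if len(leaves) <= 2:
        return True
    # Only triples with all three leaves in the current set matter here.
    local = [t for t in triples if set(t) <= leaves]

    # Union-find over the leaves: join a and b for every triple ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in local:
        parent[find(a)] = find(b)

    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)

    if len(components) == 1:
        return False  # no admissible split at the root: triples conflict
    return all(is_compatible(comp, local) for comp in components.values())


def greedy_maximal_compatible(leaves, triples):
    """Keep each triple, in input order, if the kept set stays compatible."""
    kept = []
    for t in triples:
        if is_compatible(leaves, kept + [t]):
            kept.append(t)
    return kept


# Hypothetical example: ab|c and ac|b contradict each other.
leaves = {"a", "b", "c", "d"}
triples = [("a", "b", "c"), ("a", "c", "b"), ("c", "d", "a")]
print(greedy_maximal_compatible(leaves, triples))
```

Here the second triple is rejected because it conflicts with the first, while the third is kept. Whether such a triple-based formulation coincides with the edge deletion problem for c(R)BMGs is exactly the open question raised above.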

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF, 215 KB)

Acknowledgements

Open Access funding provided by Projekt DEAL. This work was supported in part by the German Federal Ministry of Education and Research (BMBF, project no. 031A538A, de.NBI-RBC) and the Mexican Consejo Nacional de Ciencia y Tecnología (CONACyT, 278966 FONCICYT 2).

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Manuela Geiß, Email: manuela@bioinf.uni-leipzig.de.

Marcos E. González Laffitte, Email: marcoslaffitte@gmail.com.

Alitzel López Sánchez, Email: lopez.alitzel@gmail.com.

Dulce I. Valdivia, Email: dulce.i.valdivia@gmail.com.

Marc Hellmuth, Email: mhellmuth@mailbox.org.

Maribel Hernández Rosales, Email: maribel@im.unam.mx.

Peter F. Stadler, Email: studla@bioinf.uni-leipzig.de.

References

Altenhoff AM, Boeckmann B, Capella-Gutierrez S, Dalquen DA, DeLuca T, Forslund K, Huerta-Cepas J, Linard B, Pereira C, Pryszcz LP, Schreiber F, da Silva AS, Szklarczyk D, Train CM, Bork P, Lecompte O, von Mering C, Xenarios I, Sjölander K, Jensen LJ, Martin MJ, Muffato M, Gabaldón T, Lewis SE, Thomas PD, Sonnhammer E, Dessimoz C. Standardized benchmarking in the quest for orthologs. Nat Methods. 2016;13:425–430. doi: 10.1038/nmeth.3830.
Altenhoff AM, Studer RA, Robinson-Rechavi M, Dessimoz C. Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Comp Biol. 2012;8:e1002514. doi: 10.1371/journal.pcbi.1002514.
Bansal M, Alm E, Kellis M. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics. 2012;28:i283–i291. doi: 10.1093/bioinformatics/bts225.
Böcker S, Briesemeister S, Klau GW. Exact algorithms for cluster editing: evaluation and experiments. Algorithmica. 2011;60:316–334.
Böcker S, Dress AWM. Recovering symbolically dated, rooted trees from symbolic ultrametrics. Adv Math. 1998;138:105–125.
Corneil DG, Lerchs H, Steward Burlingham L. Complement reducible graphs. Discr Appl Math. 1981;3:163–174.
Dalquén DA, Anisimova M, Gonnet GH, Dessimoz C. ALF—A simulation framework for genome evolution. Mol Biol Evol. 2011;29:1115–1123. doi: 10.1093/molbev/msr268.
Datta RS, Meacham C, Samad B, Neyer C, Sjölander K. Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 2009;37:W84–W89. doi: 10.1093/nar/gkp373.
Dondi R, Lafond M, El-Mabrouk N. Approximating the correction of weighted and unweighted orthology and paralogy relations. Algorithms Mol Biol. 2017;12:4. doi: 10.1186/s13015-017-0096-x.
Doyon JP, Chauve C, Hamel S. Space of gene/species trees reconciliations and parsimonious models. J Comp Biol. 2009;16:1399–1418. doi: 10.1089/cmb.2009.0095.
Doyon JP, Ranwez V, Daubin V, Berry V. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011;12:392–400. doi: 10.1093/bib/bbr045.
Doyon JP, Scornavacca C, Gorbunov KY, Szöllősi GJ, Ranwez V, Berry V. An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. In: Tannier E, editor. Comparative genomics: international workshop, RECOMB-CG 2010. Berlin: Springer; 2010. pp. 93–108.
Dufayard JF, Duret L, Penel S, Gouy M, Rechenmann F, Perriere G. Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics. 2005;21:2596–2603. doi: 10.1093/bioinformatics/bti325.
Ehrenfeucht A, Rozenberg G. Theory of 2-structures, part I: clans, basic subclasses, and morphisms. Theor Comp Sci. 1990;70:277–303.
Ehrenfeucht A, Rozenberg G. Theory of 2-structures, part II: representation through labeled tree families. Theor Comp Sci. 1990;70:305–342.
Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970;19:99–113.
Fitch WM. Homology: a personal view on some of the problems. Trends Genet. 2000;16:227–231. doi: 10.1016/s0168-9525(00)02005-9.
Gabaldón T, Koonin EV. Functional and evolutionary implications of gene orthology. Nat Rev Genet. 2013;14:360–366. doi: 10.1038/nrg3456.
Geiß M, Anders J, Stadler PF, Wieseke N, Hellmuth M. Reconstructing gene trees from Fitch’s xenology relation. J Math Biol. 2018;77:1459–1491. doi: 10.1007/s00285-018-1260-8.
Geiß M, Chávez E, González Laffitte M, López Sánchez A, Stadler BMR, Valdivia DI, Hellmuth M, Hernández Rosales M, Stadler PF. Best match graphs. J Math Biol. 2019;78:2015–2057. doi: 10.1007/s00285-019-01332-9.
Geiß M, Stadler PF, Hellmuth M. Reciprocal best match graphs. J Math Biol. 2019;80:865–953. doi: 10.1007/s00285-019-01444-2.
Górecki P, Tiuryn J. DLS-trees: a model of evolutionary scenarios. Theor Comp Sci. 2006;359:378–399.
Guigó R, Muchnik I, Smith TF. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996;6:189–213. doi: 10.1006/mpev.1996.0071.
Hellmuth M. Biologically feasible gene trees, reconciliation maps and informative triples. Alg Mol Biol. 2017;12:23. doi: 10.1186/s13015-017-0114-z.
Hellmuth M, Hernandez-Rosales M, Huber KT, Moulton V, Stadler PF, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013;66:399–420. doi: 10.1007/s00285-012-0525-x.
Hellmuth M, Huber K, Moulton V. Reconciling event-labeled gene trees with MUL-trees and species networks. J Math Biol. 2019;79:1885–1925. doi: 10.1007/s00285-019-01414-8.
Hellmuth M, Seemann CR. Alternative characterizations of Fitch’s xenology relation. J Math Biol. 2019;79:969–986. doi: 10.1007/s00285-019-01384-x.
Hellmuth M, Stadler PF, Wieseke N. The mathematics of xenology: Di-cographs, symbolic ultrametrics, 2-structures and tree-representable systems of binary relations. J Math Biol. 2017;75:199–237. doi: 10.1007/s00285-016-1084-3.
Hellmuth M, Wieseke N. From sequence data including orthologs, paralogs, and xenologs to gene and species trees. In: Evolutionary biology. Cham: Springer International Publishing; 2016. pp. 373–392.
Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. Proc Natl Acad Sci USA. 2015;112:2058–2063. doi: 10.1073/pnas.1412770112.
Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler PF. From event-labeled gene trees to species trees. BMC Bioinform. 2012;13:S6.
Hoàng CT, Kamiński M, Sawada J, Sritharan R. Finding and listing induced paths and cycles. Discr Appl Math. 2013;161:633–641.
Innan H, Kondrashov F. The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet. 2010;11:97–108. doi: 10.1038/nrg2689.
Jamison B, Olariu S. Recognizing $P_4$-sparse graphs in linear time. SIAM J Comput. 1992;21:381–406.
Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 2008;36:D250–D254. doi: 10.1093/nar/gkm796.
Keller-Schmidt S, Klemm K. A model of macroevolution as a branching process based on innovations. Adv Complex Syst. 2012;15:1250043.
Koonin E. Orthologs, paralogs, and evolutionary genomics. Ann Rev Genet. 2005;39:309–338. doi: 10.1146/annurev.genet.39.073003.114725.
Kuhn TS, Mooers AØ, Thomas GH. A simple polytomy resolver for dated phylogenies. Methods Ecol Evo. 2011;2:427–436.
Lafond M, Dondi R, El-Mabrouk N. The link between orthology relations and gene trees: a correction perspective. Algorithms Mol Biol. 2016;11:4. doi: 10.1186/s13015-016-0067-7.
Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genom. 2014;15:S12. doi: 10.1186/1471-2164-15-S6-S12.
Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, Hartmann RK, Prohaska SJ, Stadler PF. Orthology detection combining clustering and synteny for very large datasets. PLoS ONE. 2014;9:e105015. doi: 10.1371/journal.pone.0105015.
Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503.
Liu Y, Wang J, Guo J, Chen J. Complexity and parameterized algorithms for cograph editing. Theor Comp Sci. 2012;461:45–54.
Nichio BTL, Marchaukoski JN, Raittz RT. New tools in orthology analysis: a brief review of promising perspectives. Front Genet. 2017;8:165. doi: 10.3389/fgene.2017.00165.
Nøjgaard N, Geiß M, Merkle D, Stadler PF, Wieseke N, Hellmuth M. Time-consistent reconciliation maps and forbidden time travel. Alg Mol Biol. 2018;13:2. doi: 10.1186/s13015-018-0121-8.
Page RDM, Charleston MA. Reconciled trees and incongruent gene and species trees. DIMACS Ser Discrete Mathematics and Theor Comput Sci. 1997;37:57–70.
Purvis A, Garland T Jr. Polytomies in comparative analyses of continuous characters. Syst Biol. 1993;42:569–575.
Roth ACJ, Gonnet GH, Dessimoz C. Algorithm of OMA for large-scale orthology inference. BMC Bioinform. 2008;9:518. doi: 10.1186/1471-2105-9-518.
Rusin LY, Lyubetskaya E, Gorbunov KY, Lyubetsky V. Reconciliation of gene and species trees. BioMed Res Int. 2014;2014:642089. doi: 10.1155/2014/642089.
Sayyari E, Mirarab S. Testing for polytomies in phylogenetic species trees using quartet frequencies. Genes. 2018;9:E132. doi: 10.3390/genes9030132.
Sonnhammer ELL, Gabaldon T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas PD, Dessimoz C. Big data and other challenges in the quest for orthologs. Bioinformatics. 2014;30:2993–2998. doi: 10.1093/bioinformatics/btu492.
Stadler PF, Geiß M, Schaller D, López A, Gonzalez Laffitte M, Valdivia D, Hellmuth M, Hernandez Rosales M. From best hits to best matches. Tech Rep 2001.00958, arXiv; 2020.
Storm CE, Sonnhammer EL. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics. 2002;18:92–99. doi: 10.1093/bioinformatics/18.1.92.
Studer RA, Robinson-Rechavi M. How confident can we be that orthologs are similar, but paralogs differ? Trends Genet. 2009;25:210–216. doi: 10.1016/j.tig.2009.03.004.
Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637. doi: 10.1126/science.278.5338.631.
Tofigh A, Hallett M, Lagergren J. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinform. 2011;8:517–535. doi: 10.1109/TCBB.2010.14.
Vernot B, Stolzer M, Goldman A, Durand D. Reconciliation with non-binary species trees. J Comput Biol. 2008;15:981–1006. doi: 10.1089/cmb.2008.0092.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–335. doi: 10.1101/gr.073585.107.
Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional annotations of paralogs: a blessing and a curse. Life. 2016;6:39. doi: 10.3390/life6030039.

