. 2021 Jul 3;83(1):10. doi: 10.1007/s00285-021-01631-0

Indirect identification of horizontal gene transfer

David Schaller 1,2,3, Manuel Lafond 4, Peter F Stadler 2,3,6,7,8,9,10,11,12, Nicolas Wieseke 5, Marc Hellmuth 13
PMCID: PMC8254804  PMID: 34218334

Abstract

Several implicit methods to infer horizontal gene transfer (HGT) focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of a graph, the later-divergence-time (LDT) graph, whose vertices correspond to genes colored by their species. We investigate these graphs in the setting of relaxed scenarios, i.e., evolutionary scenarios that encompass all commonly used variants of duplication-transfer-loss scenarios in the literature. We characterize LDT graphs as a subclass of properly vertex-colored cographs, and provide a polynomial-time recognition algorithm as well as an algorithm to construct a relaxed scenario that explains a given LDT. An edge in an LDT graph implies that the two corresponding genes are separated by at least one HGT event. The converse is not true, however. We show that the complete xenology relation is described by an rs-Fitch graph, i.e., a complete multipartite graph satisfying constraints on the vertex coloring. This class of vertex-colored graphs is also recognizable in polynomial time. We finally address the question “how much information about all HGT events is contained in LDT graphs” with the help of simulations of evolutionary scenarios with a wide range of duplication, loss, and HGT events. In particular, we show that a simple greedy graph editing scheme can be used to efficiently detect HGT events that are implicitly contained in LDT graphs.

Keywords: Gene families, Xenology, Binary relation, Indirect phylogenetic methods, Horizontal gene transfer, Fitch graph, Later-divergence-time, Polynomial-time recognition algorithm

Introduction

Horizontal gene transfer (HGT) laterally introduces foreign genetic material into a genome. The phenomenon is particularly frequent in prokaryotes (Soucy et al. 2015; Nelson-Sathi et al. 2015) but also contributed to shaping eukaryotic genomes (Keeling and Palmer 2008; Husnik and McCutcheon 2018; Acuña et al. 2012; Li et al. 2014; Moran and Jarvik 2010; Schönknecht et al. 2013). HGT may be additive, in which case its effect is similar to gene duplications, or lead to the replacement of a vertically inherited homolog. From a phylogenetic perspective, HGT leads to an incongruence of gene trees and species trees, thus complicating the analysis of gene family histories.

A broad spectrum of computational methods has been developed to identify horizontally transferred genes and/or HGT events, recently reviewed by Ravenhall et al. (2015). Parametric methods use genomic signatures, i.e., sequence features specific to a (group of) species, to identify horizontally inserted material. Genomic signatures include e.g. GC content, k-mer distributions, sequence autocorrelation, or DNA deformability (Dufraigne et al. 2005; Becq et al. 2010). Direct (or “explicit”) phylogenetic methods start from a given gene tree T and species tree S and compute a reconciliation, i.e., a mapping of the gene tree into the species tree. This problem first arose in the context of host/parasite assemblages (Page 1994; Charleston 1998) considering the equivalent problem of mapping a parasite tree T to a host phylogeny S such that the number of events such as host-switches, i.e., horizontal transfers, is minimized. For a review of the early literature we refer to Charleston and Perkins (2006). A major difficulty is to enforce time consistency in the presence of multiple horizontal transfer events, which renders the problem of finding optimal reconciliations NP-hard (Hallett and Lagergren 2001; Ovadia et al. 2011; Tofigh et al. 2011; Hasić and Tannier 2019). Nevertheless, several practical approaches have become available, see e.g. Tofigh et al. (2011), Chen et al. (2012) and Ma et al. (2018).

Indirect (or “implicit”) phylogenetic methods forego the reconstruction of trees and start from sequence similarity or evolutionary distances and use unexpectedly small or large distances between genes as indicators of HGT. While indirect methods have been used successfully in the past, reviewed by Ravenhall et al. (2015), they have received very little attention from a more formal point of view. In this contribution, we focus on a particular type of implicit phylogenetic information, following the ideas of Novichkov et al. (2004). The basic idea is that the evolutionary distance between orthologous genes is approximately proportional to the distances between their species. Xenologous gene pairs as well as duplicate genes thus appear as outliers (Lawrence and Hartl 1992; Clarke et al. 2002; Novichkov et al. 2004; Dessimoz et al. 2008). More precisely, consider a family of homologous genes in a set of species and plot the phylogenetic distance of pairs of most similar homologs as a function of the phylogenetic distances between the species in which they reside. Since distances between orthologous genes can be expected to be approximately proportional to the distances between the species, orthologous pairs fall onto a regression line that defines equal divergence time for the last common ancestor of corresponding gene and species pairs. The gene pairs with “later divergence times”, i.e., those that are more closely related than expected from their species, fall below the regression line (Novichkov et al. 2004). Kanhere and Vingron (2009) complemented this idea with a statistical test based on the Cook distance to identify xenologous pairs in a statistically sound manner. For the mathematical analysis we assume that we can perfectly identify all pairs of genes a and b that are more closely related than expected from the phylogenetic distance of their respective genomes. 
Naturally, this defines a graph (G,σ), whose vertices x (the genes) are colored by the species σ(x) in which they appear. Here, we are interested in two questions:

  1. What are the mathematical properties that characterize these “later-divergence-time” (LDT) graphs?

  2. What kind of information about HGT events, the gene and species tree, and the reconciliation map between them is contained implicitly in an LDT graph?

In Sect. 6 we will briefly consider the situation that later-divergence-time information is fraught with experimental errors.

These questions are motivated by a series of recent publications that characterized the mathematical structure of orthology (Hellmuth et al. 2013; Lafond and El-Mabrouk 2014), the xenology relation sensu Fitch (Geiß et al. 2018; Hellmuth et al. 2018; Hellmuth and Seemann 2019), and the (reciprocal) best match relation (Geiß et al. 2019, 2020b; Schaller et al. 2021a, b). Each of these relations satisfies stringent mathematical conditions that, at least in principle, can be used to correct empirical estimates and thus serve as a potential means of noise reduction (Hellmuth et al. 2015; Stadler et al. 2020). This approach has also led to efficient algorithms to extract gene trees, species trees, and reconciliations from the relation data. Although the resulting representations of gene family histories are usually not fully resolved, they can provide important constraints for subsequent refinements. The advantage of the relation-based approach is primarily robustness. While the inference of phylogenetic trees relies on detailed probability models or the additivity of distance metrics, our approach starts from yes/no answers to simple, pairwise comparisons. These data can therefore be represented as edges in a graph, possibly augmented by a measure of confidence. Noise and inaccuracies in the initial estimates then translate into violations of the required mathematical properties of the graphs in question. Graph editing approaches can therefore be harnessed as a means of noise reduction (Hellmuth et al. 2015; Dondi et al. 2017; Lafond and El-Mabrouk 2014; Lafond et al. 2016; Hellmuth et al. 2020b, a; Schaller et al. 2021c).

Previous work following this paradigm has largely been confined to duplication-loss (DL) scenarios, excluding horizontal transfer. As shown in Hellmuth (2017), it is possible to partition a gene set into HGT-free classes separated by HGTs. Within each class, the reconstruction problems then simplify to the much easier DL scenarios. It is of utmost interest, therefore, to find robust methods to infer this partition directly from (dis)similarity data. Here, we explore the usefulness and limitations of LDT graphs for this purpose.

This contribution is organized as follows. After introducing the necessary notation, we introduce relaxed scenarios, a very general framework to describe evolutionary scenarios that emphasizes time consistency of reconciliation rather than particular types of evolutionary events. In Sect. 4, LDT graphs are defined formally and characterized as those properly colored cographs for which a set of accompanying rooted triples is consistent (Theorem 3). The proof is constructive and provides a method (Algorithm 1) to compute a relaxed scenario for a given LDT graph. Section 5 defines HGT events, shows that every edge in an LDT graph corresponds to an HGT event, and characterizes those LDT graphs that already capture all HGT events. In addition, we provide a characterization of “rs-Fitch graphs” (general vertex-colored graphs that capture all HGT events) in terms of their coloring. These properties can be verified in polynomial time. Since LDT graphs do not usually capture all HGT events, we discuss in “Appendix C” several ways to obtain a plausible set of HGT candidates from LDT graphs. In Sect. 7, we address the question “how much information about all HGT events is contained in LDT graphs” with the help of simulations of evolutionary scenarios with a wide range of duplication, loss, and HGT events. We find that LDT graphs cover roughly a third of xenologous pairs, while a simple greedy graph editing scheme can more than double the recall at moderate false positive rates. This greedy approach already yields a median accuracy of 89%, and in 99.8% of the cases produces biologically feasible solutions in the sense that the inferred graphs are rs-Fitch graphs. We close with a discussion of several open problems and directions for future research in Sect. 8.

The material of this contribution is extensive and contains several lengthy, very technical proofs. We therefore divided the presentation into a Narrative Part that contains only those mathematical results that contribute to our main conclusions, and a Technical Part providing additional results and all proofs. To facilitate cross-referencing between the two parts, the same numbering of Definitions, Lemmas, Theorems, etc., is used. Appendices A, B, and C contain the technical material corresponding to Sects. 4, 5, and 6, respectively.

Notation

Graphs. We consider undirected graphs $G=(V,E)$ with vertex set $V(G):=V$ and edge set $E(G):=E$, and denote edges connecting vertices $x,y\in V$ by $xy$. The graphs $K_1$ and $K_2$ denote the complete graphs on one and two vertices, respectively. The graph $K_2+K_1$ is the disjoint union of a $K_2$ and a $K_1$.

The join $G\triangledown H$ of two graphs $G=(V,E)$ and $H=(W,F)$ is the graph with vertex set $V\mathbin{\dot\cup}W$ and edge set $E\mathbin{\dot\cup}F\mathbin{\dot\cup}\{xy\mid x\in V,\ y\in W\}$. We write $H\subseteq G$ if $V(H)\subseteq V(G)$ and $E(H)\subseteq E(G)$, in which case $H$ is called a subgraph of $G$. Given a graph $G=(V,E)$, we write $G[W]$ for the graph induced by $W\subseteq V$. A connected component $C$ of $G$ is an inclusion-maximal vertex set such that $G[C]$ is connected. A (maximal) clique $C$ in an undirected graph $G$ is an (inclusion-maximal) vertex set such that, for all vertices $x,y\in C$, it holds that $xy\in E(G)$, i.e., $G[C]$ is complete. A subset $W\subseteq V$ is a (maximal) independent set if $G[W]$ is edgeless (and $W$ is maximal w.r.t. inclusion). A graph $G=(V,E)$ is complete multipartite if $V$ consists of $k\ge 1$ pairwise disjoint independent sets $I_1,\dots,I_k$ and $xy\in E$ if and only if $x\in I_i$ and $y\in I_j$ with $i\neq j$.
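To make two of these notions concrete, the following Python sketch builds the join of two graphs and tests whether a graph is complete multipartite. It is an illustration only: the representation (vertex sets, edges as `frozenset` pairs) and the function names `join` and `is_complete_multipartite` are assumptions made here. The test uses the observation that a graph is complete multipartite exactly when "equal or non-adjacent" is an equivalence relation on the vertices.

```python
def join(V, E, W, F):
    """Join of G=(V,E) and H=(W,F); V and W are assumed to be disjoint."""
    cross = {frozenset((x, y)) for x in V for y in W}
    return set(V) | set(W), set(E) | set(F) | cross


def is_complete_multipartite(V, E):
    """True iff 'equal or non-adjacent' is an equivalence relation on V."""
    # closed non-neighbourhood: v together with all vertices not adjacent to v
    non_nbrs = {
        v: frozenset(u for u in V if u == v or frozenset((u, v)) not in E)
        for v in V
    }
    # the independent sets I_1,...,I_k exist iff each such set is shared
    # by all of its members
    return all(non_nbrs[u] == non_nbrs[v] for v in V for u in non_nbrs[v])


# Joining the edgeless graph on {a, b} with the single vertex c
V, E = join({"a", "b"}, set(), {"c"}, set())
print(sorted(sorted(e) for e in E), is_complete_multipartite(V, E))
```

The induced $K_2+K_1$ (one edge plus an isolated vertex) is the smallest graph rejected by this test.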

A graph $G$ together with a vertex coloring $\sigma$, denoted by $(G,\sigma)$, is properly colored if $uv\in E(G)$ implies $\sigma(u)\neq\sigma(v)$. For a coloring $\sigma\colon V\to M$ and a subset $W\subseteq V$, we write $\sigma(W):=\{\sigma(w)\mid w\in W\}$ for the set of colors that appear on the vertices in $W$. Throughout, we will need restrictions of the coloring map $\sigma$.

Definition 1

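Let $\sigma\colon L\to M$ be a map, $L'\subseteq L$ and $\sigma(L')\subseteq M'\subseteq M$. Then, the map $\sigma_{|L',M'}\colon L'\to M'$ is defined by putting $\sigma_{|L',M'}(v)=\sigma(v)$ for all $v\in L'$. If we only restrict the domain of $\sigma$, we just write $\sigma_{|L'}$ instead of $\sigma_{|L',M}$.

We assume neither that $\sigma$ nor that its restriction $\sigma_{|L',M'}$ is surjective.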

Rooted trees. All trees appearing in this contribution are rooted in one of their vertices. We write $x\preceq_T y$ if $y$ lies on the unique path from the root to $x$, in which case $y$ is called an ancestor of $x$, and $x$ is called a descendant of $y$. We may also write $y\succeq_T x$ instead of $x\preceq_T y$. We use $x\prec_T y$ for $x\preceq_T y$ and $x\neq y$. In the latter case, $y$ is a strict ancestor of $x$. If $x\preceq_T y$ or $y\preceq_T x$, the vertices $x$ and $y$ are comparable and, otherwise, incomparable. We write $L(T)$ for the set of leaves of the tree $T$, i.e., the $\preceq_T$-minimal vertices, and say that $T$ is a tree on $L(T)$. We write $T(u)$ for the subtree of $T$ rooted in $u$. The last common ancestor of a vertex set $W\subseteq V(T)$ is the $\preceq_T$-minimal vertex $u:=\operatorname{lca}_T(W)$ for which $w\preceq_T u$ for all $w\in W$. For brevity we write $\operatorname{lca}_T(x,y)=\operatorname{lca}_T(\{x,y\})$.

We employ the convention that edges $(x,y)$ in a tree are always written such that $y\prec_T x$ is satisfied. If $(x,y)$ is an edge in $T$, then $\operatorname{par}(y):=x$ is the parent of $y$, and $y$ the child of $x$. We denote with $\operatorname{child}_T(x)$ the set of all children of $x$ in $T$. It will be convenient for the discussion below to extend the ancestor relation $\preceq_T$ on $V$ to the union of the edge and vertex sets of $T$. More precisely, for a vertex $x\in V(T)$ and an edge $e=(u,v)\in E(T)$ we put $x\prec_T e$ if and only if $x\preceq_T v$; and $e\prec_T x$ if and only if $u\preceq_T x$. In addition, for edges $e=(u,v)$ and $f=(a,b)$ in $T$ we put $e\preceq_T f$ if and only if $v\preceq_T b$.

A rooted tree is phylogenetic if all vertices that are adjacent to at least two vertices have at least two children. A rooted tree $T$ is planted if its root has degree 1. In this case, we denote the “planted root” by $0_T$. In planted phylogenetic trees there is a unique “planted edge” $(0_T,\rho_T)$ where $\rho_T:=\operatorname{lca}_T(L(T))$. Note that by definition $0_T\notin L(T)$.

Throughout, we will assume that all trees are rooted and phylogenetic unless explicitly stated otherwise. Whenever there is no danger of confusion, we will refer also to planted phylogenetic trees simply as trees.

The set of inner vertices is given by $V^0(T):=V(T)\setminus(L(T)\cup\{0_T\})$. An edge $(u,v)$ is an inner edge if both vertices $u$ and $v$ are inner vertices and, otherwise, an outer edge. The restriction of $T$ to a subset $L'\subseteq L(T)$ of leaves, denoted by $T_{|L'}$, is obtained by identifying the (unique) minimal subtree of $T$ that connects all leaves in $L'$, and suppressing all vertices with degree two except possibly the root $\rho_{T_{|L'}}=\operatorname{lca}_T(L')$. $T$ displays a tree $T'$, in symbols $T'\le T$, if $T'$ can be obtained from a restriction $T_{|L'}$ of $T$ by a series of inner edge contractions (Bryant and Steel 1995). If, in addition, $L(T)=L(T')$, then $T$ is a refinement of $T'$. Throughout this contribution, we will consider leaf-colored trees $(T,\sigma)$ with $\sigma$ being defined for $L(T)$ only.

Rooted triples. A rooted triple is a tree $T$ on three leaves and two internal vertices. We write $ab|c$ for the triple with $\operatorname{lca}_T(a,b)\prec_T\operatorname{lca}_T(a,c)=\operatorname{lca}_T(b,c)$. For a set $R$ of triples we write $L(R):=\bigcup_{t\in R}L(t)$. The set $R$ is compatible if there is a tree $T$ with $L(R)\subseteq L(T)$ that displays every triple $t\in R$. The construction of such a tree $T$ from a triple set $R$ on $L$ makes use of an auxiliary graph that will play a prominent role in this contribution.

Definition 2

(Aho et al. 1981) Let $R$ be a set of rooted triples on the vertex set $L$. The Aho graph $[R,L]$ has vertex set $L$ and edge set $\{xy\mid \exists\, z\in L\colon xy|z\in R\}$.

The algorithm BUILD (Aho et al. 1981) uses Aho graphs in a top-down recursion starting from a given set of triples R and returns for compatible triple sets R on L an unambiguously defined tree Aho(R,L) on L, which is known as the Aho tree. BUILD runs in polynomial time. The key property of the Aho graph that ensures the correctness of BUILD can be stated as follows:

Proposition 1

(Aho et al. 1981; Bryant and Steel 1995) A set of triples $R$ is compatible if and only if for each subset $L\subseteq L(R)$ with $|L|>1$ the graph $[R,L]$ is disconnected.
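The top-down recursion behind BUILD can be sketched in a few lines of Python. This is a minimal illustration under assumptions made here, not the implementation of Aho et al.: a triple $xy|z$ is encoded as the tuple `(x, y, z)`, leaves are sortable labels, the resulting tree is returned as nested tuples, and the names `aho_components` and `build` are hypothetical. In line with Proposition 1, the recursion fails as soon as some Aho graph on more than one vertex is connected.

```python
def aho_components(triples, leaves):
    """Connected components of the Aho graph [R, L]."""
    leaves = set(leaves)
    adj = {x: set() for x in leaves}
    for x, y, z in triples:  # (x, y, z) encodes the triple xy|z
        if x in leaves and y in leaves and z in leaves:
            adj[x].add(y)
            adj[y].add(x)
    comps, seen = [], set()
    for s in sorted(leaves):
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def build(triples, leaves):
    """Nested-tuple tree displaying all triples, or None if incompatible."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    comps = aho_components(triples, leaves)
    if len(comps) == 1:  # connected Aho graph: no tree exists
        return None
    children = [build(triples, c) for c in comps]
    if any(c is None for c in children):
        return None
    return tuple(children)


print(build([("a", "b", "c")], "abc"))
print(build([("a", "b", "c"), ("b", "c", "a")], "abc"))
```

The second call uses two distinct triples on the same three leaves; their Aho graph is connected, so no tree displays both.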

Cographs are recursively defined as undirected graphs that can be generated as joins or disjoint unions of cographs, starting from single-vertex graphs $K_1$. The recursive construction defines a rooted tree $(T,t)$, called cotree, whose leaves are the vertices of the cograph $G$, i.e., the $K_1$s, while each of its inner vertices $u$ represents a join or disjoint union operation, labeled as $t(u)=1$ and $t(u)=0$, respectively. Hence, for a given cograph $G$ and its cotree $(T,t)$, we have $xy\in E(G)$ if and only if $t(\operatorname{lca}_T(x,y))=1$. Contraction of all tree edges $(u,v)\in E(T)$ with $t(u)=t(v)$ results in the discriminating cotree $(T_G,\hat t)$ of $G$ with cotree-labeling $\hat t$ such that $\hat t(u)\neq\hat t(v)$ for any two adjacent interior vertices of $T_G$. The discriminating cotree $(T_G,\hat t)$ is uniquely determined by $G$ (Corneil et al. 1981a). Cographs have a large number of equivalent characterizations. In this contribution, we will need the following classical results:

Proposition 2

(Corneil et al. 1981a) Given an undirected graph G, the following statements are equivalent:

  1. G is a cograph.

  2. G does not contain a P4, i.e., a path on four vertices, as an induced subgraph.

  3. $\operatorname{diam}(H)\le 2$ for all connected induced subgraphs $H$ of $G$.

  4. Every induced subgraph H of G is a cograph.
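Item 2 of Proposition 2 translates directly into a brute-force test, sketched below in Python. The representation (edges as `frozenset` pairs) and the names `has_induced_p4` and `is_cograph` are assumptions made for this illustration. The search takes $O(|V|^4)$ time and is meant only to make the characterization tangible; it is not one of the efficient cograph recognition algorithms from the literature.

```python
from itertools import permutations


def has_induced_p4(V, E):
    """True iff some four vertices a-b-c-d induce a path."""
    def adj(x, y):
        return frozenset((x, y)) in E

    for a, b, c, d in permutations(V, 4):
        if (adj(a, b) and adj(b, c) and adj(c, d)
                and not adj(a, c) and not adj(a, d) and not adj(b, d)):
            return True
    return False


def is_cograph(V, E):
    """Cographs are exactly the graphs without an induced P4."""
    return not has_induced_p4(V, E)


path = {frozenset("ab"), frozenset("bc"), frozenset("cd")}
cycle = path | {frozenset("da")}
print(is_cograph("abcd", path), is_cograph("abcd", cycle))
```

The path on four vertices is rejected, whereas the cycle on four vertices, being the join of two edgeless graphs on two vertices each, is accepted.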

Relaxed reconciliation maps and relaxed scenarios

Tofigh et al. (2011) and Bansal et al. (2012) define “Duplication-Transfer-Loss” (DTL) scenarios in terms of a vertex-only map $\gamma\colon V(T)\to V(S)$. The H-trees introduced by Górecki (2010) and Górecki and Tiuryn (2012) formalize the same concept in a very different manner. A definition of a DTL-like class of scenarios in terms of a reconciliation map $\mu\colon V(T)\to V(S)\cup E(S)$ was analyzed by Nøjgaard et al. (2018). For binary trees, the two definitions are equivalent; for non-binary trees, however, the DTL-scenarios are a proper subset, see Nøjgaard et al. (2018, Fig. 1) for an example. Several other mathematical frameworks have been used in the literature to specify evolutionary scenarios. Examples include the DLS-trees of Górecki and Tiuryn (2006), which can be seen as event-labeled gene trees with leaves denoting both surviving genes and loss-events, maps $g\colon V(S')\to 2^{V(T)}$ from a suitable subdivision $S'$ of the species tree $S$ to the gene tree as used by Hallett and Lagergren (2001), and associations of edges, i.e., subsets of $E(T)\times E(S)$ (Wieseke et al. 2013).

In the presence of HGT, the relationships of gene trees and species are not only constrained by local conditions corresponding to the admissible local evolutionary events (duplication, speciation, gene loss, and HGT) but also by the global condition that the HGT events within each lineage admit a temporal order (Merkle and Middendorf 2005; Gorbunov and Lyubetsky 2009; Tofigh et al. 2011). In order to capture time consistency from the outset and to establish the mathematical framework, we consider here trees with explicit timing information (Merkle and Middendorf 2005).

Definition 3

(Time Map) The map $\tau_T\colon V(T)\to\mathbb{R}$ is a time map for a tree $T$ if $x\prec_T y$ implies $\tau_T(x)<\tau_T(y)$ for all $x,y\in V(T)$.

It is important to note that only qualitative, relative timing information will be used in practice, i.e., we will never need the actual value of time maps but only information on whether an event pre-dates, post-dates, or is concurrent with another. Definition 3 ensures that the ancestor relation $\preceq_T$ and the timing of the vertices are not in conflict. For later reference, we provide the following simple result.

Lemma 1

Given a tree $T$, a time map $\tau_T$ for $T$ satisfying $\tau_T(x)=\tau_0(x)$ with arbitrary choices of $\tau_0(x)$ for all $x\in L(T)$ can be constructed in linear time.

Proof

We traverse $T$ in postorder. If $x$ is a leaf, we set $\tau_T(x)=\tau_0(x)$, and otherwise compute $t:=\max_{u\in\operatorname{child}(x)}\tau_T(u)$ and set $\tau_T(x)=t'$ with an arbitrary value $t'>t$. Clearly the total effort is $O(|V(T)|+|E(T)|)$, and thus also linear in the number of leaves $|L(T)|$.

Lemma 1 will be useful for the construction of time maps as it, in particular, allows us to put $\tau_T(x)=\tau_T(y)$ for all $x,y\in L(T)$.
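The construction in the proof of Lemma 1 can be sketched as follows. The tree representation (a dictionary mapping each vertex to the list of its children), the function name `time_map`, and the choice $t'=t+1$ are assumptions made for this illustration; any value $t'>t$ would do.

```python
def time_map(children, root, leaf_time):
    """Postorder construction of a time map with prescribed leaf times."""
    tau = {}
    # iterative postorder: a vertex is finalized after all of its children
    stack = [(root, False)]
    while stack:
        x, expanded = stack.pop()
        kids = children.get(x, [])
        if not kids:
            tau[x] = leaf_time[x]
        elif expanded:
            tau[x] = max(tau[u] for u in kids) + 1  # any value > max works
        else:
            stack.append((x, True))
            stack.extend((u, False) for u in kids)
    return tau


# planted tree 0 -> r -> {u, c}, u -> {a, b}; all leaves get time 0
children = {"0": ["r"], "r": ["u", "c"], "u": ["a", "b"]}
tau = time_map(children, "0", {"a": 0, "b": 0, "c": 0})
print(tau)
```

Every vertex is pushed at most twice, so the effort is linear in the size of the tree.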

Definition 4

(Time consistency) Let $T$ and $S$ be two trees. A map $\mu\colon V(T)\to V(S)\cup E(S)$ is called time-consistent if there are time maps $\tau_T$ for $T$ and $\tau_S$ for $S$ satisfying the following conditions for all $u\in V(T)$:

  (C1) If $\mu(u)\in V(S)$, then $\tau_T(u)=\tau_S(\mu(u))$.

  (C2) Else, if $\mu(u)=(x,y)\in E(S)$, then $\tau_S(y)<\tau_T(u)<\tau_S(x)$.
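For given time maps, the two conditions can be checked vertex by vertex. The Python sketch below is an illustration under assumptions made here: $\mu$, $\tau_T$ and $\tau_S$ are dictionaries, an edge $(x,y)$ of $S$ is a 2-tuple, vertex labels of $S$ are not tuples, and the name `satisfies_c1_c2` is hypothetical. It checks the conditions for the supplied time maps only; it does not decide whether suitable time maps exist, nor whether the supplied maps are valid time maps.

```python
def satisfies_c1_c2(mu, tau_T, tau_S):
    """Check (C1) and (C2) for every vertex u in the domain of mu."""
    for u, image in mu.items():
        if isinstance(image, tuple):  # mu(u) is an edge (x, y) of S
            x, y = image
            if not (tau_S[y] < tau_T[u] < tau_S[x]):
                return False
        elif tau_T[u] != tau_S[image]:  # mu(u) is a vertex of S
            return False
    return True


# species tree 0S -> rho -> {A, B}; gene tree 0T -> v -> {a, b}
tau_S = {"0S": 3, "rho": 2, "A": 0, "B": 0}
tau_T = {"0T": 3, "v": 2.5, "a": 0, "b": 0}
mu = {"0T": "0S", "v": ("0S", "rho"), "a": "A", "b": "B"}
print(satisfies_c1_c2(mu, tau_T, tau_S))
```

Here the gene-tree vertex `v` lies on the planted edge of the species tree, whose time range contains its time stamp 2.5.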

Conditions (C1) and (C2) ensure that the reconciliation map $\mu$ preserves time in the following sense: If vertex $u$ of the gene tree is mapped to a vertex $\mu(u)=v$ in the species tree, then $u$ and $v$ receive the same time stamp by Condition (C1). If $u$ is mapped to an edge $\mu(u)=(x,y)$, then by Condition (C2) the time stamp of $u$ falls strictly between the time stamps $\tau_S(y)$ and $\tau_S(x)$ of the end points of the edge $(x,y)$ in the species tree. The following definition of reconciliation is designed (1) to be general enough to encompass the notions of reconciliation that have been studied in the literature, and (2) to separate the mapping between gene tree and species tree from specific types of events. Event types such as duplication or horizontal transfer therefore are considered here as a matter of interpreting scenarios, not as part of their definition.

Definition 5

(Relaxed reconciliation map) Let $T$ and $S$ be two planted trees with leaf sets $L(T)$ and $L(S)$, respectively, and let $\sigma\colon L(T)\to L(S)$ be a map. A map $\mu\colon V(T)\to V(S)\cup E(S)$ is a relaxed reconciliation map for $(T,S,\sigma)$ if the following conditions are satisfied:

  • (G0) Root Constraint. $\mu(x)=0_S$ if and only if $x=0_T$.

  • (G1) Leaf Constraint. $\mu(x)=\sigma(x)$ if and only if $x\in L(T)$.

  • (G2) Time Consistency Constraint. The map $\mu$ is time-consistent for some time maps $\tau_T$ for $T$ and $\tau_S$ for $S$.

Condition (G0) is used to map the respective planted roots. (G1) ensures that genes are mapped to the species in which they reside. (G2) enforces time consistency. The reconciliation maps most commonly used in the literature, see e.g. (Tofigh et al. 2011; Bansal et al. 2012), usually not only satisfy (G0)–(G2) but also impose additional conditions. We therefore call the map μ defined here “relaxed”.

Definition 6

(Relaxed scenario) The 6-tuple $\mathcal{S}=(T,S,\sigma,\mu,\tau_T,\tau_S)$ is a relaxed scenario if $\mu$ is a relaxed reconciliation map for $(T,S,\sigma)$ that satisfies (G2) w.r.t. the time maps $\tau_T$ and $\tau_S$.

By definition, relaxed reconciliation maps are time-consistent. Moreover, $\tau_T(x)=\tau_S(\sigma(x))$ for all $x\in L(T)$ by Definitions 4(C1) and 5(G1,G2). In the following we will refer to the map $\sigma\colon L(T)\to L(S)$ as the coloring of $\mathcal{S}$.

Later-divergence-time graphs

LDT graphs and μ-free scenarios

In the absence of horizontal gene transfer, the last common ancestor of two species A and B should mark the latest possible time point at which two genes a and b residing in σ(a)=A and σ(b)=B, respectively, may have diverged. Situations in which this constraint is violated are therefore indicative of HGT. To address this issue in some more detail, we next define “μ-free scenarios” that eventually will lead us to the class of “LDT graphs” that contain all information about genes that diverged after the species in which they reside.

Definition 7

(μ-free scenario) Let $T$ and $S$ be planted trees, $\sigma\colon L(T)\to L(S)$ be a map, and $\tau_T$ and $\tau_S$ be time maps of $T$ and $S$, respectively, such that $\tau_T(x)=\tau_S(\sigma(x))$ for all $x\in L(T)$. Then, $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$ is called a μ-free scenario.

This definition of a scenario without a reconciliation map $\mu$ is mainly a technical convenience that simplifies the arguments in various proofs by avoiding the construction of a reconciliation map. It is motivated by the observation that the “later-divergence-time” of two genes in comparison with their species is independent from any such $\mu$. Every relaxed scenario $\mathcal{S}=(T,S,\sigma,\mu,\tau_T,\tau_S)$ implies an underlying μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$. Statements proved for μ-free scenarios therefore also hold for relaxed scenarios. Note that, by Lemma 1, given the time map $\tau_S$, one can easily construct a time map $\tau_T$ such that $\tau_T(x)=\tau_S(\sigma(x))$ for all $x\in L(T)$. In particular, when constructing relaxed scenarios explicitly, we may simply choose $\tau_T(u)=0$ and $\tau_S(x)=0$ as common time for all leaves $u\in L(T)$ and $x\in L(S)$. Although not all μ-free scenarios admit a reconciliation map and thus can be turned into relaxed scenarios, Lemma 2 below implies that for every μ-free scenario $\mathcal{T}$ there is a relaxed scenario with possibly slightly distorted time maps that encodes the same LDT graph as $\mathcal{T}$.

Definition 8

(LDT graph) For a μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$, we define $G_<(\mathcal{T})=G_<(T,S,\sigma,\tau_T,\tau_S)=(V,E)$ as the graph with vertex set $V:=L(T)$ and edge set

$$E:=\{ab\mid a,b\in L(T),\ \tau_T(\operatorname{lca}_T(a,b))<\tau_S(\operatorname{lca}_S(\sigma(a),\sigma(b)))\}.$$

A vertex-colored graph $(G,\sigma)$ is a later-divergence-time graph (LDT graph) if there is a μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$ such that $G=G_<(\mathcal{T})$. In this case, we say that $\mathcal{T}$ explains $(G,\sigma)$.

It is easy to see that the edge set of $G_<(\mathcal{T})$ defines an undirected graph and that two genes $a$ and $b$ form an edge if the divergence time of $a$ and $b$ is strictly less than the divergence time of the underlying species $\sigma(a)$ and $\sigma(b)$. Moreover, there are no edges of the form $aa$, since $\tau_T(\operatorname{lca}_T(a,a))=\tau_T(a)=\tau_S(\sigma(a))=\tau_S(\operatorname{lca}_S(\sigma(a),\sigma(a)))$. Hence $G_<(\mathcal{T})$ is a simple graph.

By definition, every relaxed scenario $\mathcal{S}=(T,S,\sigma,\mu,\tau_T,\tau_S)$ satisfies $\tau_T(x)=\tau_S(\sigma(x))$ for all $x\in L(T)$. Therefore, removing $\mu$ from $\mathcal{S}$ yields a μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$. Thus, we will use the following simplified notation.
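The definition of the edge set can be evaluated directly once both trees and their time maps are given. The following Python sketch does so by brute force over all pairs of genes. The representation (trees as child-to-parent dictionaries, time maps and coloring as dictionaries) and the names `lca` and `ldt_edges` are assumptions made for this illustration.

```python
from itertools import combinations


def lca(parent, x, y):
    """Last common ancestor in a rooted tree given as child -> parent dict."""
    ancestors = set()
    while x is not None:
        ancestors.add(x)
        x = parent.get(x)  # None once the root has been passed
    while y not in ancestors:
        y = parent[y]
    return y


def ldt_edges(genes, parent_T, parent_S, sigma, tau_T, tau_S):
    """Pairs of genes that diverged strictly later than their species."""
    return {
        frozenset((a, b))
        for a, b in combinations(genes, 2)
        if tau_T[lca(parent_T, a, b)]
        < tau_S[lca(parent_S, sigma[a], sigma[b])]
    }


# species tree 0S -> rS -> {A, w}, w -> {B, C}
parent_S = {"rS": "0S", "A": "rS", "w": "rS", "B": "w", "C": "w"}
tau_S = {"0S": 3, "rS": 2, "w": 1, "A": 0, "B": 0, "C": 0}
# gene tree 0T -> rT -> {a, v}, v -> {b, c}
parent_T = {"rT": "0T", "a": "rT", "v": "rT", "b": "v", "c": "v"}
tau_T = {"0T": 3, "rT": 2.5, "v": 0.5, "a": 0, "b": 0, "c": 0}
sigma = {"a": "A", "b": "B", "c": "C"}
print(ldt_edges("abc", parent_T, parent_S, sigma, tau_T, tau_S))
```

In this toy scenario only the genes `b` and `c` diverge (at time 0.5) after their species `B` and `C` (at time 1), so `bc` is the only edge.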

Definition 9

We put $G_<(\mathcal{S}):=G_<(T,S,\sigma,\tau_T,\tau_S)$ for a given relaxed scenario $\mathcal{S}=(T,S,\sigma,\mu,\tau_T,\tau_S)$ and the underlying μ-free scenario $(T,S,\sigma,\tau_T,\tau_S)$ and say, by slight abuse of notation, that $\mathcal{S}$ explains $(G_<(\mathcal{S}),\sigma)$.

The next two results show that the existence of a reconciliation map μ does not impose additional constraints on LDT graphs.

Lemma 2

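For every μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$, there is a relaxed scenario $\mathcal{S}=(T,S,\sigma,\mu,\tilde\tau_T,\tilde\tau_S)$ for $T$, $S$ and $\sigma$ such that $(G_<(\mathcal{T}),\sigma)=(G_<(\mathcal{S}),\sigma)$.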

Theorem 1

$(G,\sigma)$ is an LDT graph if and only if there is a relaxed scenario $\mathcal{S}=(T,S,\sigma,\mu,\tau_T,\tau_S)$ such that $(G,\sigma)=(G_<(\mathcal{S}),\sigma)$.

Remark 1

From here on, we omit the explicit reference to Lemma 2 and Theorem 1 and assume that the reader is aware of the fact that every LDT graph is explained by some relaxed scenario $\mathcal{S}$ and that for every μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$, there is a relaxed scenario $\mathcal{S}$ for $T$, $S$ and $\sigma$ such that $(G_<(\mathcal{T}),\sigma)=(G_<(\mathcal{S}),\sigma)$.

Fig. 1.


Top row: A relaxed scenario S=(T,S,σ,μ,τT,τS) (left) with its LDT graph (G<(S),σ) (right). The reconciliation map μ is shown implicitly by the embedding of the gene tree T into the species tree S. The times τT and τS are indicated by the position on the vertical axis, i.e., if a vertex x is drawn higher than a vertex y, this implies τT(y)<τT(x). In subsequent figures we will not show the time maps explicitly. Bottom row: Another relaxed scenario S=(T,S,σ,μ,τT,τS) with a connected LDT graph (G<(S),σ). As we shall see, connectedness of an LDT graph depends on the relative timing of the roots of the gene and species tree (cf. Lemma 11)

Properties of LDT graphs

We continue by deriving several interesting characteristics of LDT graphs.

Proposition 3

Every LDT graph (G,σ) is properly colored.

As we shall see below, LDT graphs $(G,\sigma)$ contain detailed information about both the underlying gene trees $T$ and species trees $S$ for all μ-free scenarios that explain $(G,\sigma)$, and thus by Lemma 2 and Theorem 1 also about every relaxed scenario $\mathcal{S}$ satisfying $G=G_<(\mathcal{S})$. This information is encoded in the form of certain rooted triples that can be retrieved directly from local features in the colored graphs $(G,\sigma)$.

Definition 10

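For a graph $G=(L,E)$, we define the set of triples on $L$ as

$$\mathfrak{T}(G):=\{xy|z \colon x,y,z\in L \text{ are pairwise distinct},\ xy\in E,\ xz,yz\notin E\}.$$

If $G$ is endowed with a coloring $\sigma\colon L\to M$ we also define a set of color triples

$$\mathfrak{S}(G,\sigma):=\{\sigma(x)\sigma(y)|\sigma(z) \colon x,y,z\in L,\ \sigma(x),\sigma(y),\sigma(z) \text{ are pairwise distinct},\ xz,yz\in E,\ xy\notin E\}.$$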
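Both triple sets of Definition 10 can be read off a colored graph by inspecting all vertex triples. The Python sketch below is an illustration under assumptions made here: edges are `frozenset` pairs, a triple $xy|z$ is stored as the tuple `(x, y, z)` with the first two entries in sorted order, and the names `gene_triples` and `color_triples` are hypothetical.

```python
from itertools import combinations


def gene_triples(V, E):
    """xy|z for every induced K2+K1 with edge xy and isolated vertex z."""
    def adj(u, v):
        return frozenset((u, v)) in E

    out = set()
    for x, y in combinations(sorted(V), 2):
        if adj(x, y):
            for z in V:
                if z not in (x, y) and not adj(x, z) and not adj(y, z):
                    out.add((x, y, z))
    return out


def color_triples(V, E, sigma):
    """Color triples from induced paths x - z - y on three distinct colors."""
    def adj(u, v):
        return frozenset((u, v)) in E

    out = set()
    for x, y in combinations(sorted(V), 2):
        if adj(x, y):
            continue
        for z in V:
            if z in (x, y) or not (adj(x, z) and adj(y, z)):
                continue
            X, Y, Z = sigma[x], sigma[y], sigma[z]
            if len({X, Y, Z}) == 3:
                out.add((min(X, Y), max(X, Y), Z))
    return out


# path a - b - c plus an isolated vertex d
E = {frozenset("ab"), frozenset("bc")}
sigma = {"a": "A", "b": "B", "c": "C", "d": "A"}
print(gene_triples("abcd", E), color_triples("abcd", E, sigma))
```

In this example the path a - b - c contributes the color triple AC|B, while the isolated vertex d gives rise to the triples ab|d and bc|d.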

Lemma 6

If a graph $(G,\sigma)$ is an LDT graph, then $\mathfrak{S}(G,\sigma)$ is compatible and $S$ displays $\mathfrak{S}(G,\sigma)$ for every μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$ that explains $(G,\sigma)$.

The next lemma shows that induced K2+K1 subgraphs in LDT graphs imply triples that must be displayed by the gene tree T.

Lemma 7

If $(G,\sigma)$ is an LDT graph, then $\mathfrak{T}(G)$ is compatible and $T$ displays $\mathfrak{T}(G)$ for every μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$ that explains $(G,\sigma)$.

The next result shows that LDT graphs cannot contain induced $P_4$s.

Lemma 8

Every LDT graph (G,σ) is a properly colored cograph.

The converse of Lemma 8 is not true in general. To see this, consider the properly-colored cograph $(G,\sigma)$ with vertex set $V(G)=\{a,a',b,b',c,c'\}$, edges $ab$, $bc$, $a'b'$, $a'c'$ and coloring $\sigma(a)=\sigma(a')=A$, $\sigma(b)=\sigma(b')=B$, and $\sigma(c)=\sigma(c')=C$ with $A$, $B$, $C$ being pairwise distinct. In this case, $\mathfrak{S}(G,\sigma)$ contains the triples $AC|B$ and $BC|A$. By Lemma 6, the tree $S$ in every μ-free scenario $\mathcal{T}=(T,S,\sigma,\tau_T,\tau_S)$ or relaxed scenario $\mathcal{S}=(T,S,\sigma,\mu,\tau_T,\tau_S)$ explaining $(G,\sigma)$ displays $AC|B$ and $BC|A$. No tree can display two distinct triples on the same three leaves. Hence no such scenario can exist, and $(G,\sigma)$ is not an LDT graph.

Recognition and characterization of LDT graphs

In order to design an algorithm for the recognition of LDT graphs, we will consider partitions of the vertex set of a given input graph $(G=(L,E),\sigma)$. To construct suitable partitions, we start with the connected components of $G$. The coloring $\sigma\colon L\to M$ imposes additional constraints. We capture these with the help of binary relations that are defined in terms of partitions $\mathcal{C}$ of the color set $M$ and employ them to further refine the partition of $G$.

Definition 12

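Let $(G=(L,E),\sigma)$ be a graph with coloring $\sigma\colon L\to M$. Let $\mathcal{C}$ be a partition of $M$, and $\mathfrak{C}$ be the set of connected components of $G$. We define the following binary relation $R(G,\sigma,\mathcal{C})$ by setting

$$(x,y)\in R(G,\sigma,\mathcal{C}) \iff x,y\in L,\ \sigma(x),\sigma(y)\in C' \text{ for some } C'\in\mathcal{C},\ \text{and } x,y\in C \text{ for some } C\in\mathfrak{C}.$$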

By construction, two vertices $x,y\in L$ are in relation $R(G,\sigma,\mathcal{C})$ whenever they are in the same connected component of $G$ and their colors $\sigma(x),\sigma(y)$ are contained in the same set of the partition $\mathcal{C}$ of $M$. As shown in Lemma 9 in the Technical Part, the relation $R:=R(G,\sigma,\mathcal{C})$ is an equivalence relation and every equivalence class of $R$ is contained in some connected component of $G$. In particular, each connected component of $G$ is the disjoint union of $R$-classes.
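Since two vertices are related exactly when they share both a connected component and a block of the color partition, the equivalence classes can be obtained by grouping vertices by the pair (component, block). The Python sketch below illustrates this under assumptions made here: edges are `frozenset` pairs, the color partition is a list of sets, vertices are sortable labels, and the name `r_classes` is hypothetical.

```python
def r_classes(V, E, sigma, color_partition):
    """Equivalence classes of the relation R for a partition of the colors."""
    adj = {v: set() for v in V}
    for e in E:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    # index of the partition block that contains each color
    block_of = {c: i for i, block in enumerate(color_partition) for c in block}
    classes, seen, comp_id = {}, set(), 0
    for s in sorted(V):
        if s in seen:
            continue
        comp, stack = set(), [s]  # depth-first search of one component
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        for v in comp:
            classes.setdefault((comp_id, block_of[sigma[v]]), set()).add(v)
        comp_id += 1
    return sorted(classes.values(), key=sorted)


# path a - b - c plus isolated d; colors partitioned into {A, B} and {C}
E = {frozenset("ab"), frozenset("bc")}
sigma = {"a": "A", "b": "B", "c": "C", "d": "A"}
print(r_classes("abcd", E, sigma, [{"A", "B"}, {"C"}]))
```

In the example, the component {a, b, c} splits into the classes {a, b} and {c}, while d forms a class of its own although it shares its color with a.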

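The following partition of the leaf sets of subtrees of a tree $S$ rooted at some vertex $u\in V(S)$ will be useful:

$$\text{If } u \text{ is not a leaf, then } \mathcal{C}_S(u):=\{L(S(v))\mid v\in\operatorname{child}_S(u)\} \text{ and, otherwise, } \mathcal{C}_S(u):=\{\{u\}\}.$$

One easily verifies that, in both cases, $\mathcal{C}_S(u)$ yields a valid partition of the leaf set $L(S(u))$. Recall that $\sigma_{|L',M'}\colon L'\to M'$ was defined as the “submap” of $\sigma$ with $L'\subseteq L$ and $\sigma(L')\subseteq M'\subseteq M$.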

Lemma 10

Let $(G=(L,E),\sigma)$ be a properly colored cograph. Suppose that the triple set $\mathfrak{S}(G,\sigma)$ is compatible and let $S$ be a tree on $M$ that displays $\mathfrak{S}(G,\sigma)$. Moreover, let $L'\subseteq L$ and $u\in V(S)$ such that $\sigma(L')\subseteq L(S(u))$. Finally, set $R:=R(G[L'],\sigma_{|L',L(S(u))},\mathcal{C}_S(u))$.

Then, for all distinct $R$-classes $K$ and $K'$, either $xy\in E$ for all $x\in K$ and $y\in K'$, or $xy\notin E$ for all $x\in K$ and $y\in K'$. In particular, for $x\in K$ and $y\in K'$, it holds that

$$xy\in E \iff K,K' \text{ are contained in the same connected component of } G[L'].$$

Lemma 10 suggests a recursive strategy to construct a relaxed scenario $\mathcal{S}=(T,S,\sigma,\mu,\tau_T,\tau_S)$ for a given properly-colored cograph $(G,\sigma)$, which is illustrated in Fig. 2. The starting point is a species tree $S$ displaying all the triples in $\mathfrak{S}(G,\sigma)$ that are required by Lemma 6. We show below that there are no further constraints on $S$ and thus we may choose $S=\operatorname{Aho}(\mathfrak{S}(G,\sigma),L)$ and endow it with an arbitrary time map $\tau_S$. Given $(S,\tau_S)$, we construct $(T,\tau_T)$ in top-down order. In order to reduce the complexity of the presentation and to make the algorithm more compact and readable, we will not distinguish the cases in which $(G,\sigma)$ is connected or disconnected, nor whether a connected component is a superset of one or more $R$-classes. The tree $T$ therefore will not be phylogenetic in general. We shall see, however, that this issue can be alleviated by simply suppressing all inner vertices with a single child.

Fig. 2.

Visualization of Algorithm 1. A The case uS is a leaf (cf. Line 8). B–E The case uS is an inner vertex (cf. Line 12). B The subgraph of (G,σ) induced by L′. C The local topology of the species tree S yields CS(uS)={{A,B,…},{C,D,…}}. Note that L(S(uS)) may contain colors that are not present in σ(L′) (not shown). D The equivalence classes of R:=R(G[L′],σ|L′,L(S(u)),CS(uS)). E The vertex uT and the vertices vT are created in this recursion step. The vertices wK corresponding to the R-classes K are created in the next-deeper steps. Note that some vertices have only a single child, and thus get suppressed in Line 25

The root uT is placed above ρS to ensure that no two vertices from distinct connected components of G will be connected by an edge in G<(S). The vertices vT representing the connected components C of G are each placed within an edge of S below ρS. W.l.o.g., the edges (ρS,vS) are chosen such that the colors of the corresponding connected component C and the colors in L(S(vS)) overlap. Next we compute the relation R:=R(G,σ,CS(ρS)) and determine, for each connected component C, the R-classes K that are a subset of C. For each of them, a child wK is appended to the tree vertex vT. The subtree T(wK) will have leaf set L(T(wK))=K. Since R is defined on CS(ρS) in this first step, G<(S) will have all edges between vertices that are in the same connected component C but in distinct R-classes (cf. Lemma 10). The definition of R also implies that we always find a vertex vS∈childS(ρS) such that σ(K)⊆L(S(vS)) (more detailed arguments for this are given in the proof of Claim 4 in the proof of Theorem 2 below). Thus we can place wK into this edge (ρS,vS), and proceed recursively on the R-classes L′:=K, the induced subgraphs G[L′] and their corresponding vertices vS∈V(S), which then serve as the root of the species trees. More precisely, we identify wK with the root uT created in the “next-deeper” recursion step. Since we alternate between vertices uT for which no edges between vertices of distinct subtrees exist, and vertices vT for which all such edges exist, we can label the vertices uT with “0” and the vertices vT with “1” and obtain a cotree for the cograph G.

This recursive procedure is described more formally in Algorithm 1, which also describes the construction of an appropriate time map τT for T and a reconciliation map μ. We note that we find it convenient to use as trivial case in the recursion the situation in which the current root uS of the species tree is a leaf rather than the condition |L′|=1. In this manner we avoid the distinction between the cases uS∈L(S) and uS∉L(S) in the else-condition starting in Line 12. This results in a shorter presentation at the expense of more inner vertices that need to be suppressed at the end in order to obtain the final tree T. We proceed by proving the correctness of Algorithm 1.

Theorem 2

Let (G,σ) be a properly colored cograph, and assume that the triple set S(G,σ) is compatible. Then Algorithm 1 returns, in polynomial time, a relaxed scenario S=(T,S,σ,μ,τT,τS) such that G<(S)=G.

As a consequence of Lemmas 6 and 8, and the fact that Algorithm 1 returns a relaxed scenario S for a given properly colored cograph with compatible triple set S(G,σ), we obtain

Theorem 3

A graph (G,σ) is an LDT graph if and only if it is a properly colored cograph and S(G,σ) is compatible.

Theorem 3 has two consequences that are of immediate interest:

Corollary 2

LDT graphs can be recognized in polynomial time.
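The two conditions of Theorem 3 can be checked with a short Python sketch. It is not the authors' implementation: the cograph test uses the standard recursive decomposition into connected components of the graph or of its complement, and triple compatibility is tested with a plain BUILD-style recursion on the Aho graph. The triple set S(G,σ) is passed in as an argument (tuples (a, b, c) standing for ab|c) because its definition is given earlier in the paper and is not restated here; all function names are hypothetical.

```python
def components(vertices, adj):
    """Connected components of the subgraph induced by `vertices`."""
    vertices, seen, comps = set(vertices), set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            w = stack.pop()
            for n in adj[w] & vertices:
                if n not in comp:
                    comp.add(n)
                    stack.append(n)
        seen |= comp
        comps.append(comp)
    return comps


def is_cograph(vertices, adj):
    """A graph is a cograph iff every induced subgraph on more than one
    vertex is disconnected or has a disconnected complement."""
    vertices = set(vertices)
    if len(vertices) <= 1:
        return True
    parts = components(vertices, adj)
    if len(parts) == 1:
        co_adj = {v: vertices - adj[v] - {v} for v in vertices}
        parts = components(vertices, co_adj)
        if len(parts) == 1:
            return False
    return all(is_cograph(p, adj) for p in parts)


def triples_compatible(triples, leaves):
    """BUILD-style test: recurse on the components of the Aho graph."""
    leaves = set(leaves)
    if len(leaves) <= 2:
        return True
    relevant = [t for t in triples if set(t) <= leaves]
    aho = {x: set() for x in leaves}
    for a, b, _ in relevant:  # triple ab|c contributes the edge ab
        aho[a].add(b)
        aho[b].add(a)
    parts = components(leaves, aho)
    if len(parts) == 1:
        return False
    return all(triples_compatible(relevant, p) for p in parts)


def is_ldt_graph(vertices, edges, sigma, triples):
    """Check the two conditions of Theorem 3 for a colored graph."""
    adj = {v: set() for v in vertices}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    properly_colored = all(sigma[x] != sigma[y] for x, y in edges)
    colors = set(sigma.values())
    return (properly_colored and is_cograph(vertices, adj)
            and triples_compatible(triples, colors))
```

The sketch favors readability over speed; both tests run in polynomial time, but faster cograph recognition and triple-consistency algorithms exist.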

Corollary 3

The property of being an LDT graph is hereditary, that is, if (G,σ) is an LDT graph then each of its vertex-induced subgraphs is an LDT graph.

The relaxed scenarios S explaining an LDT graph (G,σ) are far from being unique. In fact, we can choose from a large set of trees (S,τS) that is determined only by the triple set S(G,σ):

Corollary 4

If (G=(L,E),σ) is an LDT graph with coloring σ:L→M, then for all planted trees S on M that display S(G,σ) there is a relaxed scenario S=(T,S,σ,μ,τT,τS) that contains σ and S and that explains (G,σ).

As shown in the Technical Part, for every LDT graph (G,σ) there is a relaxed scenario S=(T,S,σ,μ,τT,τS) explaining (G,σ) such that T displays the discriminating cotree TG of G (cf. Corollary 5 in the Technical Part). However, this property is not satisfied by all relaxed scenarios that explain (G,σ). Nevertheless, the latter results enable us to relate connectedness of LDT graphs to properties of the relaxed scenarios by which they can be explained (cf. Lemma 11 in the Technical Part).

Least resolved trees for LDT graphs

As we have seen, e.g., in Corollary 4, there are in general many trees S and T forming relaxed scenarios S that explain a given LDT graph (G,σ). This raises the question to what extent these trees are determined by “representatives”. For S, we have seen that S always displays S(G,σ), which suggests considering the role of S=Aho(S(G,σ),M), where M is the codomain of σ. This tree is least resolved in the sense that there is no relaxed scenario explaining the LDT graph (G,σ) with a tree S′ that is obtained from S by edge contractions. The latter is due to the fact that any edge contraction in Aho(S(G,σ),M) yields a tree S′ that does not display S(G,σ) any more (Jansson et al. 2012). By Proposition 6, none of the relaxed scenarios containing S′ explain the LDT graph (G,σ).

Definition 13

Let S=(T,S,σ,μ,τT,τS) be a relaxed scenario explaining the LDT graph (G,σ). The planted tree T is least resolved for (G,σ) if no relaxed scenario with a gene tree T′<T explains (G,σ).

In other words, T is least resolved for (G,σ) if no relaxed scenario with a gene tree T′ obtained from T by a series of edge contractions explains (G,σ).

The examples in Fig. 3 show that LDT graphs are in general not accompanied by unique least resolved trees. In the top row, relaxed scenarios with different least resolved gene trees T and the same least resolved species tree S explain the LDT graph (G,σ). In the example below, two distinct least resolved species trees exist for a given least-resolved gene tree.

Fig. 3.

Examples of LDT graphs (G,σ) with multiple least resolved trees. Top row: No unique least resolved gene tree. For both trees, contraction of the single inner edge leads to a loss of the gene triple ab|c∈T(G) (cf. Lemma 7). The species tree is also least resolved since contraction of its single inner edge leads to loss of the species triples σ(a)σ(c)|σ(d), σ(b)σ(c)|σ(d)∈S(G,σ) (cf. Lemma 6). Bottom row: No unique least resolved species tree. Both trees display the two necessary triples AB|E, CD|E∈S(G,σ), and are again least resolved w.r.t. these triples. The gene trees are also least resolved since contraction of either of their two inner edges leads, e.g., to loss of one of the triples ae|c, ce|a∈T(G)

The example in Fig. 4 shows, furthermore, that the unique discriminating cotree TG of an LDT graph (G,σ) is not always “sufficiently resolved”. To see this, assume that the graph (G,σ) in the example can be explained by a relaxed scenario S=(T,S,σ,μ,τT,τS) such that T=TG. First consider the connected component consisting of a, b, c, and d. Since lcaT(a,b)≻T lcaT(c,d), ab∈E(G) and cd∉E(G), we have τS(lcaS(σ(a),σ(b))) > τT(lcaT(a,b)) > τT(lcaT(c,d)) ≥ τS(lcaS(σ(c),σ(d))). By similar arguments, the second connected component implies τS(lcaS(σ(c),σ(d))) > τS(lcaS(σ(a),σ(b))); a contradiction. These examples emphasize that LDT graphs constrain the relaxed scenarios, but are far from determining them.

Fig. 4.

Example of an LDT graph (G,σ) in B that is explained by the relaxed scenario shown in A. Here, (G,σ) cannot be explained by a relaxed scenario S=(T,S,σ,μ,τT,τS) such that T is the unique discriminating cotree (shown in C) for the cograph G, see D and the text for further explanations

Horizontal gene transfer and Fitch graphs

HGT-labeled trees and rs-Fitch graphs

As alluded to in the introduction, the LDT graphs are intimately related with horizontal gene transfer. To formalize this connection we first define transfer edges. These will then be used to encode Walter Fitch’s concept of xenologous gene pairs (Fitch 2000; Darby et al. 2017) as a binary relation, and thus, the edge set of a graph.

Definition 14

Let S=(T,S,σ,μ,τT,τS) be a relaxed scenario. An edge (u,v) in T is a transfer edge if μ(u) and μ(v) are incomparable in S. The HGT-labeling of T in S is the edge labeling λS:E(T)→{0,1} with λS(e)=1 if and only if e is a transfer edge.

The vertex u in T thus corresponds to an HGT event, with v denoting the subsequent event, which now takes place in the “recipient” branch of the species tree. Note that λS is completely determined by S. In general, for a given gene tree T, HGT events correspond to a labeling or coloring of the edges of T.

Definition 15

(Fitch graph) Let (T,λ) be a tree T together with a map λ:E(T)→{0,1}. The Fitch graph ϝ(T,λ)=(V,E) has vertex set V:=L(T) and edge set

E := {xy | x,y∈L, the unique path connecting x and y in T contains an edge e with λ(e)=1}.

By definition, Fitch graphs of 0/1-edge-labeled trees are loopless and undirected. We call edges e of (T,λ) with label λ(e)=1 also 1-edges and, otherwise, 0-edges.
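Definition 15 translates directly into code. The following Python sketch assumes a rooted tree encoded by a child-to-parent dictionary and stores the label of each edge (parent, child) under its child; this encoding and the function name are illustrative choices, not part of the paper.

```python
from itertools import combinations


def fitch_graph(parent, label):
    """Fitch graph of a 0/1-edge-labeled rooted tree.

    parent: maps every non-root vertex to its parent.
    label:  maps every non-root vertex v to the label of the edge (parent[v], v).
    Returns the leaf list and a set of edges, each a frozenset {x, y}.
    """
    inner = set(parent.values())
    leaves = [v for v in parent if v not in inner]

    def path_to_root(v):
        path = [v]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    edges = set()
    for x, y in combinations(leaves, 2):
        px, py = path_to_root(x), path_to_root(y)
        ancestors_y = set(py)
        lca = next(v for v in px if v in ancestors_y)
        # Vertices strictly below the last common ancestor carry the
        # labels of exactly the edges on the path between x and y.
        below = px[:px.index(lca)] + py[:py.index(lca)]
        if any(label[v] == 1 for v in below):
            edges.add(frozenset((x, y)))
    return leaves, edges


# Toy tree (hypothetical): r -> {u, c}, u -> {a, b}; only edge (u, b) is a 1-edge.
parent = {"u": "r", "c": "r", "a": "u", "b": "u"}
label = {"u": 0, "c": 0, "a": 0, "b": 1}
leaves, edges = fitch_graph(parent, label)
```

In the toy tree, b is separated from both a and c by the 1-edge, while a and c are connected by a path of 0-edges, so the Fitch graph has the edges ab and bc only.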

Remark 2

Fitch graphs as defined here have been termed undirected Fitch graphs (Hellmuth et al. 2018), in contrast to the notion of the directed Fitch graphs of 0/1-edge-labeled trees studied e.g. in Geiß et al. (2018) and Hellmuth and Seemann (2019).

Proposition 5

(Hellmuth et al. 2018; Zverovich 1999) The following statements are equivalent.

  1. G is the Fitch graph of a 0/1-edge-labeled tree.

  2. G is a complete multipartite graph.

  3. G does not contain K2+K1 as an induced subgraph.
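The equivalence of statements 2 and 3 suggests a simple test, sketched below in Python under the assumption that the graph is given as vertex and edge lists. A graph is complete multipartite exactly if "equal or non-adjacent" is an equivalence relation on its vertices; an induced K2+K1 is precisely a violation of transitivity. The function name is hypothetical.

```python
def independent_sets_if_complete_multipartite(vertices, edges):
    """Return the independent sets (parts) of a complete multipartite graph,
    or None if the graph is not complete multipartite."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    # Closed non-neighborhood of v: v itself and all vertices not adjacent to v.
    non_nbrs = {v: frozenset(vertices - adj[v]) for v in vertices}
    for v in vertices:
        # All members of one part must share the same closed non-neighborhood.
        if any(non_nbrs[u] != non_nbrs[v] for u in non_nbrs[v]):
            return None
    return set(non_nbrs.values())


# K2+K1 (edge ab plus isolated c) fails; the path a - b - c is K_{1,2}.
forbidden = independent_sets_if_complete_multipartite("abc", [("a", "b")])
star = independent_sets_if_complete_multipartite("abc", [("a", "b"), ("b", "c")])
```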

Definition 16

(rs-Fitch graph) Let S=(T,S,σ,μ,τT,τS) be a relaxed scenario with HGT-labeling λS. We call the vertex colored graph (ϝ(S),σ):=(ϝ(T,λS),σ) the Fitch graph of the scenario S.

A vertex colored graph (G,σ) is a relaxed scenario Fitch graph (rs-Fitch graph) if there is a relaxed scenario S=(T,S,σ,μ,τT,τS) such that G=ϝ(S).

Figure 5 shows that rs-Fitch graphs are not necessarily properly colored. A subtle difficulty arises from the fact that Fitch graphs of 0/1-edge-labeled trees are defined without a reference to the vertex coloring σ, while the rs-Fitch graph is vertex colored. This together with Proposition 5 implies

Fig. 5.

A The relaxed scenario S=(T,S,σ,μ,τT,τS) as already shown in Fig. 1. B A 0/1-edge-labeled tree (T,λ) satisfying λ=λS. C The corresponding Fitch graph ϝ(T,λ) drawn in a layout that emphasizes the property that ϝ(T,λ) is a complete multipartite graph. Independent sets are circled. D An alternative layout as in Fig. 1 (top row) that emphasizes the relationship G<(S)⊆ϝ(S)=ϝ(T,λ) (cf. Theorem 4 below). Edges that are not present in G<(S) are drawn as dashed lines

Observation 1

If (G,σ) is an rs-Fitch graph then G is a complete multipartite graph.

The “converse” of Observation 1 is not true in general, as we shall see in Theorem 6 below. If, however, the coloring σ can be chosen arbitrarily, then every complete multipartite graph G can be turned into an rs-Fitch graph (G,σ) as shown in Proposition 6.

Proposition 6

If G is a complete multipartite graph, then there exists a relaxed scenario S=(T,S,σ,μ,τT,τS) such that (G,σ) is an rs-Fitch graph.

Although every complete multipartite graph can be colored in such a way that it becomes an rs-Fitch graph (cf. Proposition 6), there are colored, complete multipartite graphs (G,σ) that are not rs-Fitch graphs, i.e., that do not derive from a relaxed scenario (cf. Theorem 6). We summarize this discussion in the following

Observation 2

There are (planted) 0/1-edge-labeled trees (T,λ) and colorings σ:L(T)→M such that there is no relaxed scenario S=(T,S,σ,μ,τT,τS) with λ=λS.

A subtle but important observation is that trees (T,λ) with coloring σ for which Observation 2 applies may still encode an rs-Fitch graph (ϝ(T,λ),σ), see Example 1 and Fig. 6. The latter is due to the fact that ϝ(T,λ)=ϝ(T′,λ′) may be possible for a different tree (T′,λ′) for which there is a relaxed scenario S=(T′,S,σ,μ,τT,τS) with λ′=λS. In this case, (ϝ(T,λ),σ)=(ϝ(S),σ) is an rs-Fitch graph. We shall briefly return to these issues in the discussion Sect. 8.

Fig. 6.

0/1-edge-labeled tree (T,λ) for which no relaxed scenario exists such that (T,λ)=(T,λS) (see Example 1). Red edges indicate 1-labeled edges. Nevertheless, for ϝ:=ϝ(T,λ) there is an alternative tree (T′,λ′) for which a relaxed scenario S=(T′,S,σ,μ,τT,τS) exists (right) such that ϝ=ϝ(T′,λ′)=ϝ(S)

Example 1

Consider the planted edge-labeled tree (T,λ) shown in Fig. 6 with leaf set L={a,b,b′,c,d}, together with a coloring σ where σ(b)=σ(b′) and σ(a),σ(b),σ(c),σ(d) are pairwise distinct.

Assume, for contradiction, that there is a relaxed scenario S=(T,S,σ,μ,τT,τS) with (T,λ)=(T,λS). Hence, μ(v) and μ(b)=σ(b) as well as μ(u) and μ(b′)=σ(b′) must be comparable in S. Therefore, μ(u) and μ(v) must both be comparable to σ(b)=σ(b′) and thus, they are located on the path from ρS to σ(b). But this implies that μ(u) and μ(v) are comparable in S; a contradiction, since then λS(u,v)=0≠λ(u,v)=1.

LDT graphs and rs-Fitch graphs

We proceed to investigate to what extent an LDT graph provides information about an rs-Fitch graph. As we shall see in Theorem 5 there is indeed a close connection between rs-Fitch graphs and LDT graphs. We start with a useful relation between the edges of rs-Fitch graphs and the reconciliation maps μ of their scenarios.

Lemma 13

Let ϝ(S) be an rs-Fitch graph for some relaxed scenario S. Then, ab∉E(ϝ(S)) implies that lcaS(σ(a),σ(b))⪯S μ(lcaT(a,b)).

The next result shows that a subset of transfer edges can be inferred immediately from LDT graphs:

Theorem 4

If (G,σ) is an LDT graph, then G⊆ϝ(S) for all relaxed scenarios S that explain (G,σ).

Since we only have that xy is an edge in ϝ(S) if the path connecting x and y in the tree T of S contains a transfer edge, Theorem 4 immediately implies

Corollary 6

For every relaxed scenario S=(T,S,σ,μ,τT,τS) without transfer edges, it holds that E(G<(S))=∅.

Theorem 4 provides the formal justification for indirect phylogenetic approaches to HGT inference that are based on the work of Lawrence and Hartl (1992), Clarke et al. (2002), and Novichkov et al. (2004) by showing that (x,y)∈E(G<(S)) can be explained only by HGT, irrespective of how complex the true biological scenario might have been. However, it does not cover all HGT events. Figure 7 shows that there are relaxed scenarios S for which G<(S)≠ϝ(S) even though ϝ(S) is properly colored. Moreover, it is possible that an rs-Fitch graph (G,σ) contains edges xy∈E(G) with σ(x)=σ(y). In particular, therefore, an rs-Fitch graph is not always an LDT graph.

Fig. 7.

Two relaxed scenarios S1 and S2 with the same rs-Fitch graph ϝ=ϝ(S1)=ϝ(S2) (right) and different LDT graphs G<(S1)≠ϝ and G<(S2)=ϝ

It is natural, therefore, to ask whether for every properly colored Fitch graph there is a relaxed scenario S such that G<(S)=ϝ(S). An affirmative answer is provided by

Theorem 5

The following statements are equivalent.

  1. (G,σ) is a properly colored complete multipartite graph.

  2. There is a relaxed scenario S=(T,S,σ,μ,τT,τS) with coloring σ such that G=G<(S)=ϝ(S).

  3. (G,σ) is complete multipartite and an LDT graph.

  4. (G,σ) is properly colored and an rs-Fitch graph.

In particular, for every properly colored complete multipartite graph (G,σ) the triple set S(G,σ) is compatible.

Relaxed scenarios for which (ϝ(S),σ) is properly colored do not admit two members of the same gene family that are separated by an HGT event. While restrictive, such models are not altogether unrealistic. Proper coloring of (ϝ(S),σ) is, in particular, the case if every horizontal transfer is replacing, i.e., if the original copy is effectively overwritten by homologous recombination (Thomas and Nielsen 2005), see also (Choi et al. 2012) for a detailed case study in Streptococcus. As a consequence of Theorem 5, LDT graphs are sufficient to describe replacing HGT. However, the incidence rate of replacing HGT decreases exponentially with phylogenetic distance between source and target (Williams et al. 2012), and additive HGT becomes the dominant mechanism between phylogenetically distant organisms. Still, replacing HGTs may also be the result of additive HGT followed by a loss of the (functionally redundant) vertically inherited gene.

rs-Fitch graphs with general colorings

In scenarios with additive HGT, the rs-Fitch graph is no longer properly colored and no longer coincides with the LDT graph. Since not every vertex-colored complete multipartite graph (G,σ) is an rs-Fitch graph (cf. Theorem 6), we ask whether an LDT graph (G,σ) that is not itself already an rs-Fitch graph imposes constraints on the rs-Fitch graphs (ϝ(S),σ) that derive from relaxed scenarios S that explain (G,σ). As a first step towards this goal, we aim to characterize rs-Fitch graphs, i.e., to understand the conditions imposed by the existence of an underlying scenario S on the compatibility of the collection of independent sets ℐ of G and the coloring σ. As we shall see, these conditions can be explained in terms of an auxiliary graph that we introduce in a very general setting:

Definition 17

Let L be a set, σ:L→M a map and ℐ={I1,…,Ik} a set of subsets of L. Then the graph Aϝ(σ,ℐ) has vertex set M and edges xy if and only if x≠y and x,y∈σ(I′) for some I′∈ℐ.

By construction, Aϝ(σ,ℐ′) is a subgraph of Aϝ(σ,ℐ) whenever ℐ′⊆ℐ. An extended version of Definition 17 that also contains an edge-labeling of Aϝ(σ,ℐ) can be found in the Technical Part; this technical detail is not needed here. As it turns out, rs-Fitch graphs are characterized by the structure of their auxiliary graphs Aϝ as shown in the next

Theorem 6

A graph (G,σ) is an rs-Fitch graph if and only if (i) it is complete multipartite with independent sets ℐ={I1,…,Ik}, and (ii) if k>1, there is an independent set I′∈ℐ such that Aϝ(σ,ℐ∖{I′}) is disconnected.

As a consequence of Theorem 6, we obtain

Corollary 9

rs-Fitch graphs can be recognized in polynomial time.
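A direct, unoptimized reading of Theorem 6 is sketched below in Python. The sketch assumes the graph is given as vertex and edge lists, σ as a dictionary, and the color set M explicitly (the codomain matters, since the auxiliary graph has vertex set M). All function names are hypothetical.

```python
from itertools import combinations


def independent_sets(vertices, edges):
    """Parts of a complete multipartite graph, or None if it is not one."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    non_nbrs = {v: frozenset(vertices - adj[v]) for v in vertices}
    for v in vertices:
        if any(non_nbrs[u] != non_nbrs[v] for u in non_nbrs[v]):
            return None
    return set(non_nbrs.values())


def aux_graph_edges(sigma, ind_sets):
    """Edges of the auxiliary graph: two distinct colors are adjacent
    if both occur in the same independent set."""
    edges = set()
    for ind in ind_sets:
        colors = {sigma[v] for v in ind}
        edges.update(combinations(colors, 2))
    return edges


def is_connected(nodes, edges):
    """Connectivity test; a single-vertex graph counts as connected."""
    nodes = set(nodes)
    adj = {c: set() for c in nodes}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for n in adj[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen == nodes


def is_rs_fitch(vertices, edges, sigma, colors):
    """Conditions (i) and (ii) of Theorem 6; `colors` is the color set M."""
    parts = independent_sets(vertices, edges)
    if parts is None:
        return False
    if len(parts) <= 1:
        return True
    return any(not is_connected(colors, aux_graph_edges(sigma, parts - {p}))
               for p in parts)
```

Each of the k candidate sets requires one connectivity test on a graph with |M| vertices, so the check is clearly polynomial; no attempt is made here to match the best possible running time.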

As for LDT graphs, the property of being an rs-Fitch graph is hereditary.

Corollary 14

If (G=(L,E),σ) is an rs-Fitch graph, then the colored vertex-induced subgraph (G[W],σ|W) is an rs-Fitch graph for all non-empty subsets W⊆L.

Note, however, that Corollary 14 is not satisfied if we restrict the codomain of σ to the observable part of colors, i.e., if we consider σ|W,σ(W):W→σ(W) instead of σ|W:W→M, even if σ is surjective. To see this consider the vertex colored graph (G,σ) with V(G)={a,a′,b}, E(G)={aa′,ab,a′b} and σ:V(G)→M={A,B} where σ(a)=σ(a′)=A≠σ(b)=B. A possible relaxed scenario S for (G,σ) is shown in Fig. 8A. The deletion of b yields W=V(G)∖{b}={a,a′} and the graph (G[W],σ|W) for which S′ with HGT-labeling λS′ as in Fig. 8B is a relaxed scenario that satisfies G[W]=ϝ(T′,λS′). However, if we restrict the codomain of σ to obtain σ|W,{A}:{a,a′}→σ(W)={A}, then there is no relaxed scenario S″ for which G[W]=ϝ(T″,λS″), since there is only a single species tree S on L(S)={A} (Fig. 8C) that consists of the single edge (0T,A) and thus, μ(v) and μ(a) as well as μ(v) and μ(a′) must be comparable in this scenario.

Fig. 8.

Shown are three distinct relaxed scenarios S, S′ and S″ with corresponding rs-Fitch graphs. Here σ′=σ|{a,a′} and σ″=σ|{a,a′},{A} (cf. Definition 1). Putting (G,σ)=(ϝ(S),σ), one can observe that (G[{a,a′}],σ′)=(ϝ(S′),σ′) is an rs-Fitch graph. In contrast, σ″ is restricted to the “observable” part of species (consisting of A alone), and (G[{a,a′}],σ″) is not an rs-Fitch graph, see text for further details

Least resolved trees for Fitch graphs

It is important to note that the characterization of rs-Fitch graphs in Theorem 6 does not provide us with a characterization of rs-Fitch graphs that share a common relaxed scenario with a given LDT graph. As a potential avenue to address this problem we investigate the structure of least-resolved trees for Fitch graphs as possible source of additional constraints.

Definition 18

The edge-labeled tree (T,λ) is Fitch-least-resolved w.r.t. ϝ(T,λ), if for all trees T′≠T that are displayed by T and every labeling λ′ of T′ it holds that ϝ(T,λ)≠ϝ(T′,λ′).

As shown in the Technical Part (Theorem 7), Fitch-least-resolved trees can be characterized in terms of their edge-labeling, a result that is very similar to the results for “directed” Fitch graphs of 0/1-edge-labeled trees in Geiß et al. (2018). As a consequence of this characterization, Fitch-least-resolved trees can be constructed in polynomial time. However, Fitch-least-resolved trees are far from being unique. In particular, Fitch-least-resolved trees are only of very limited use for the construction of relaxed scenarios S=(T,S,σ,μ,τT,τS) from an underlying Fitch graph. In fact, even though (G,σ) is an rs-Fitch graph, Example 3 in the Technical Part shows that it is possible that there is no relaxed scenario S=(T,S,σ,μ,τT,τS) with HGT-labeling λS such that (T,λ)=(T,λS) for any of its Fitch-least-resolved trees (T,λ).

Editing problems

Editing colored graphs to LDT graphs and Fitch graphs

Empirical estimates of LDT graphs from sequence data are expected to suffer from noise and hence to violate the conditions of Theorem 3. It is of interest, therefore, to consider the problem of correcting an empirical estimate (G,σ) to the closest LDT graph. We therefore briefly investigate the usual three edge modification problems for graphs: completion only considers the insertion of edges, for deletion edges may only be removed, while solutions to the editing problem allow both insertions and deletions, see e.g. Burzyn et al. (2006).

Problem 1

(LDT-Graph-Modification (LDT-M))

Input:

A colored graph (G=(V,E),σ) and an integer k.

Question:

Is there a subset F⊆E such that |F|≤k and (G′=(V,E⋆F),σ) is an LDT graph where ⋆∈{∖,∪,Δ}?

We write LDT-E, LDT-C, LDT-D for the editing, completion, and deletion version of LDT-M. By virtue of Theorem 3, LDT-M is closely related to the problem of finding a compatible subset R⊆S(GR,σ) with maximum cardinality. The corresponding decision problem, MaxRTC, is known to be NP-complete (Jansson 2001, Thm. 1). In the Technical Part we prove

Theorem 9

LDT-M is NP-complete.

Even though at present it remains unclear whether rs-Fitch graphs can be estimated directly, the corresponding graph modification problems are at least of theoretical interest.

Problem 2

(rs-Fitch Graph-Modification (rsF-M))

Input:

A colored graph (G=(V,E),σ) and an integer k.

Question:

Is there a subset F⊆E such that |F|≤k and (G′=(V,E⋆F),σ) is an rs-Fitch graph where ⋆∈{∖,∪,Δ}?

As above, we write rsF-E, rsF-C, rsF-D for the editing, completion, and deletion version of rsF-M. Since rs-Fitch graphs are complete multipartite, their complements are disjoint unions of complete graphs. The problems rsF-M are thus closely related to the cluster graph modification problems. Both Cluster Deletion and Cluster Editing are NP-complete, while Cluster Completion is polynomial (by completing each connected component to a clique, i.e., computing the transitive closure) (Shamir et al. 2004). We obtain

Theorem 10

rsF-C and rsF-E are NP-complete.

rsF-D remains open since the complement of the transitive closure of the complement of a colored graph (G,σ) is not necessarily an rs-Fitch graph. This is in particular the case if (G,σ) is complete multipartite but not an rs-Fitch graph.

Editing LDT graphs to Fitch graphs

Putative LDT graphs (G,σ) can be estimated directly from sequence (dis)similarity data. The most direct approach was introduced by Novichkov et al. (2004), where, for (reciprocally) most similar genes x and y from two distinct species σ(x)=A and σ(y)=B, dissimilarities δ(x,y) between genes and dissimilarities Δ(A,B) of the underlying species are compared under the assumption of a (gene family specific) clock-rate r, i.e., the expectation that orthologous gene pairs satisfy δ(x,y)≈rΔ(A,B). In this setting, xy∈E(G) if δ(x,y)<rΔ(A,B) at some level of statistical significance. The rate assumption can be relaxed to consider rank-order statistics. For fixed x, differences in the orders of δ(x,y) and Δ(σ(x),σ(y)) assessed by rank-order correlation measures have been used to identify x as an HGT candidate, e.g., by Lawrence and Hartl (1992) and Clarke et al. (2002). An interesting variation on the theme is described by Sevillya et al. (2020), who use relative synteny rather than sequence similarity for the same purpose. A more detailed account on estimating (G,σ) will be given elsewhere.
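The thresholding rule δ(x,y)<rΔ(A,B) can be sketched in a few lines of Python. This is only an illustration of the inequality, not the procedure of Novichkov et al. (2004): the statistical significance test is replaced by a hypothetical fixed `margin` parameter, the restriction to reciprocally most similar genes is omitted, and the data layout (dictionaries keyed by gene pairs and by species pairs) is an assumption.

```python
def estimate_ldt_edges(delta, species_dist, sigma, r, margin=0.0):
    """Return gene pairs whose dissimilarity is smaller than expected.

    delta:        {(x, y): gene dissimilarity}
    species_dist: {frozenset({A, B}): species dissimilarity}
    sigma:        {gene: species}
    r:            assumed gene-family-specific clock rate
    margin:       hypothetical stand-in for a significance criterion
    """
    edges = set()
    for (x, y), d in delta.items():
        a, b = sigma[x], sigma[y]
        if a == b:
            continue  # LDT graphs are properly colored
        if d < r * species_dist[frozenset((a, b))] - margin:
            edges.add(frozenset((x, y)))
    return edges


# Toy data (made up for illustration only).
sigma = {"x": "A", "y": "B", "z": "C"}
delta = {("x", "y"): 0.2, ("x", "z"): 0.9}
species_dist = {frozenset(("A", "B")): 0.5, frozenset(("A", "C")): 0.8}
ldt_edges = estimate_ldt_edges(delta, species_dist, sigma, r=1.0)
```

Here x and y are more similar than their species divergence would suggest and are joined by an edge, whereas x and z are not.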

In contrast, it seems much more difficult to infer a Fitch graph (ϝ,σ) directly from data. To our knowledge, no method for this purpose has been proposed in the literature. However, (ϝ,σ) is of much more direct practical interest because the independent sets of ϝ determine the maximal HGT-free subsets of genes, which could be analyzed separately by better-understood techniques. In this section, we therefore focus on the aspects of (ϝ,σ) that are not captured by LDT graphs (G,σ). In the light of the previous section, these are in particular non-replacing HGTs, i.e., HGTs that result in genes x and y in the same species σ(x)=σ(y). In this case, (ϝ,σ) is no longer properly colored and thus G≠ϝ. To get a better intuition on this case consider three genes a, a′, and b with σ(a)=σ(a′)≠σ(b), ab∈E(G), and a′b∉E(G). By Lemma 7, the gene tree T of any explaining relaxed scenario displays the triple ab|a′. Fig. 9 shows two relaxed scenarios with a single HGT that explain this situation: in one of them aa′∈E(ϝ), while in the other aa′∉E(ϝ). Neither scenario is a priori less plausible than the other. Although the frequency of true homologous replacement via crossover decreases exponentially with the phylogenetic distance of donor and acceptor species (Williams et al. 2012), additive HGT with subsequent loss of one copy is an entirely plausible scenario.

Fig. 9.

Two relaxed scenarios with T displaying the triple ab|a′ and explaining the same graph (G,σ)

A pragmatic approach to approximate (ϝ,σ) is therefore to consider the step from an LDT graph (G,σ) to (ϝ,σ) as a graph modification problem. First we note that Algorithm 1 explicitly produces a relaxed scenario S and thus implies a corresponding gene tree TS with HGT-labeling λS, and thus an rs-Fitch graph (ϝ(S),σ). However, Algorithm 1 was designed primarily as a proof device. It produces neither a unique relaxed scenario nor necessarily the most plausible or a most parsimonious one. Furthermore, both the LDT graph (G,σ) and the desired rs-Fitch graph (ϝ,σ) are consistent with a potentially very large number of scenarios. It thus appears preferable to altogether avoid the explicit construction of scenarios at this stage.

Since every LDT graph (G,σ) is explained by some S, it is also a spanning subgraph of the corresponding rs-Fitch graph (ϝ(S),σ). The step from an LDT graph (G,σ) to an rs-Fitch graph (ϝ,σ) can therefore be viewed as an edge-completion problem. The simplest variation of the problem is

Problem 3

(Fitch graph completion) Given an LDT graph (G,σ), find a minimum cardinality set Q of possible edges such that ((V(G),E(G)∪Q),σ) is a complete multipartite graph.

A close inspection of Problem 3 shows that the coloring is irrelevant in this version, and the actual problem to be solved is the problem Complete Multipartite Graph Completion with a cograph as input. We next show that this task can be performed in linear time. The key idea is to consider the complementary problem, i.e., the problem of deleting a minimum set of edges from the complementary cograph Ḡ such that the end result is a disjoint union of complete graphs. This is known as the Cluster Deletion problem (Shamir et al. 2004), and is known to have a greedy solution for cographs (Gao et al. 2013).

Lemma 18

There is a linear-time algorithm to solve Problem 3 for every cograph G.

All maximum clique partitions of a cograph G have the same sequence of cluster sizes (Gao et al. 2013, Thm. 1). However, they are not unique as partitions of the vertex set V(G). Thus the minimal editing set Q that needs to be inserted into a cograph to reach a complete multipartite graph will not be unique in general. In the Technical Part, we briefly sketch a recursive algorithm operating on the cotree of Ḡ.
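The complement-based strategy can be sketched in Python as follows. The sketch assumes that the greedy solution referred to above amounts to repeatedly removing a maximum clique from the complement cograph; it computes maximum cliques by recursive decomposition into components and co-components instead of using a precomputed cotree, so it is not linear-time. All function names are hypothetical.

```python
from itertools import combinations


def components(vertices, adj):
    """Connected components of the subgraph induced by `vertices`."""
    vertices, seen, comps = set(vertices), set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            w = stack.pop()
            for n in adj[w] & vertices:
                if n not in comp:
                    comp.add(n)
                    stack.append(n)
        seen |= comp
        comps.append(comp)
    return comps


def max_clique(vertices, adj):
    """Maximum clique of the cograph induced by `vertices`."""
    vertices = set(vertices)
    if len(vertices) == 1:
        return set(vertices)
    comps = components(vertices, adj)
    if len(comps) > 1:  # disjoint union: best clique of any component
        return max((max_clique(c, adj) for c in comps), key=len)
    co_adj = {v: vertices - adj[v] - {v} for v in vertices}
    co_comps = components(vertices, co_adj)
    if len(co_comps) == 1:
        raise ValueError("input is not a cograph")
    clique = set()  # join: union of the best cliques of all co-components
    for c in co_comps:
        clique |= max_clique(c, adj)
    return clique


def complete_multipartite_completion(vertices, edges):
    """Greedy clique partition of the complement; returns (parts, added edges)."""
    vertices = set(vertices)
    edge_set = {frozenset(e) for e in edges}
    co_adj = {v: {u for u in vertices
                  if u != v and frozenset((u, v)) not in edge_set}
              for v in vertices}
    remaining, parts = set(vertices), []
    while remaining:
        clique = max_clique(remaining, co_adj)
        parts.append(clique)
        remaining -= clique
    part_of = {v: i for i, p in enumerate(parts) for v in p}
    added = {frozenset((u, v)) for u, v in combinations(vertices, 2)
             if part_of[u] != part_of[v] and frozenset((u, v)) not in edge_set}
    return parts, added


# Two disjoint edges ab and cd are completed to K_{2,2} by adding two edges.
parts, added = complete_multipartite_completion("abcd", [("a", "b"), ("c", "d")])
```

Which two edges are added depends on the tie-breaking among maximum cliques, which reflects the non-uniqueness of the editing set noted above.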

However, an optimal solution to Problem 3 with input (G,σ) does not necessarily yield an rs-Fitch graph or an rs-Fitch graph (ϝ(S),σ) such that G=G<(S), see Fig. 10. In particular, there are LDT graphs (G,σ) for which more edges need to be added to obtain an rs-Fitch graph than the minimum required to obtain a complete multipartite graph, see Fig. 11.

Fig. 10.

Upper panel: A relaxed scenario S with LDT graph (G<(S),σ) and rs-Fitch graph (ϝ(S),σ). There are two minimum edge completion sets that yield the complete multipartite graphs (ϝ1,σ) and (ϝ2,σ) (lower part). By Theorem 6, (ϝ2,σ) is not an rs-Fitch graph. The graph (ϝ1,σ) is an rs-Fitch graph for the relaxed scenario S′. However, G<(S′)≠G<(S) for all scenarios S′ with (ϝ(S′),σ)=(ϝ1,σ). To see this, note that the gene tree T=((a,b),(a′,b′)) in S is uniquely determined by application of Lemmas 5 and 7. Assume that there is any edge-labeling λ such that ϝ(T,λ)=ϝ1. The non-edges in ϝ1 imply that along the two paths from a to a′ and b to b′ there is no transfer edge, that is, there cannot be any transfer edge in T; a contradiction

Fig. 11.

The LDT graph (G<(S),σ) for the relaxed scenario S has a unique minimum edge completion set (as determined by full enumeration), resulting in the complete multipartite graph (ϝ1,σ). However, Theorem 6 implies that (ϝ1,σ) is not an rs-Fitch graph. An edge completion set with more edges must be used to obtain an rs-Fitch graph, for instance (ϝ2,σ), which is explained by the scenario S

A more relevant problem for our purposes, therefore, is

Problem 4

(rs-Fitch graph completion) Given an LDT graph (G,σ), find a minimum cardinality set Q of possible edges such that ((V(G),E(G)∪Q),σ) is an rs-Fitch graph.

The following, stronger version is what we ideally would like to solve:

Problem 5

(strong rs-Fitch graph completion) Given an LDT graph (G,σ), find a minimum cardinality set Q of possible edges such that ϝ=((V(G),E(G)∪Q),σ) is an rs-Fitch graph and there is a common relaxed scenario S, that is, S satisfies G=G<(S) and ϝ=ϝ(S).

The computational complexity of Problems 4 and 5 is unknown. We conjecture, however, that both are NP-hard. In contrast to the application of graph modification problems to correct possible errors in the originally estimated data, the minimization of inserted edges into an LDT graph lacks a direct biological interpretation. Instead, most-parsimonious solutions in terms of evolutionary events are usually of interest in biology. In our framework, this translates to

Problem 6

(Min transfer completion) Let (G,σ) be an LDT graph and consider the set of all relaxed scenarios S with G=G<(S). Find a relaxed scenario S in this set that has a minimal number of transfer edges among all its elements, and the corresponding rs-Fitch graph ϝ(S).

One way to address this problem might be as follows: Find edge-completion sets for the given LDT graph (G,σ) that minimize the number of independent sets in the resulting rs-Fitch graph ϝ=((V(G),E(G)∪Q),σ). The intuition behind this idea is that, in this case, the number of pairs within the individual independent sets is maximized and thus, we get a maximized set of gene pairs without transfer along their connecting path in the gene tree. It remains an open question whether this idea always yields a solution for Problem 6.

Simulation results

Evolutionary scenarios covering a wide range of HGT frequencies were generated with the simulation library AsymmeTree (Stadler et al. 2020). The tool generates a planted species tree S with time map τS. A constant-rate birth-death process then generates a gene tree (T~,τT~) with additional branching events producing copies at inner vertex u of S propagating to each descendant lineage of u. To model HGT events, a recipient branch of S is selected at random. The simulation is event-based in the sense that each node of the “true” gene tree other than the planted root is one of speciation, gene duplication, horizontal gene transfer, gene loss, or a surviving gene. Here, the lost as well as the surviving genes form the leaf set of T~.

We used the following parameter settings for AsymmeTree: Planted species trees with a number of leaves between 10 and 50 (randomly drawn in each scenario) were generated using the Innovation Model (Keller-Schmidt and Klemm 2012) and equipped with a time map as described in Stadler et al. (2020). Multifurcations were introduced into the species tree by contraction of inner edges with a common probability p=0.2 per edge. Gene trees therefore are also not binary in general. We used multifurcations to model the effects of limited phylogenetic resolution. Duplication and HGT events, however, always result in bifurcations in the gene tree T~. We considered different combinations of duplication, loss, and HGT event rates (indicated on the horizontal axis in Figs. 12, 13 and 14). For each combination of event rates, we simulated 1000 scenarios. Figure 12 summarizes basic statistics of the simulated data sets.

Fig. 12.

Top panel: Distribution of the numbers of species (i.e. species tree leaves), species thereof that contain at least one surviving gene, surviving genes in total (non-loss leaves in the gene trees), loss events (loss leaves), and horizontal transfer events (inner vertices that are HGT events). Bottom panel: Mean and standard deviation of these quantities. The numbers in the legend indicate the mean and standard deviation taken over all event rate combinations. The tuples on the horizontal axis give the rates for duplication, loss, and horizontal transfer

Fig. 13.

Left: Fraction of “visible” transfer edges among the “true” transfer edges in T in the simulated scenarios, i.e., the edges that correspond to a path in T~ containing at least one transfer edge w.r.t. S~ (see also the explanation in the text). The tuples on the horizontal axis give the rates for duplication, loss, and horizontal transfer. Since E:=E(ϝ(S))⊆E~:=E(ϝ(S~)[L(T)]), we also show the ratio |E|/|E~|. Right: A relaxed scenario S=(T,S,σ,μ,τT,τS) with an “invisible” transfer edge (u,a) (as determined by the knowledge of S~=(T~,S,σ,μ~,τT~,τS)). In this example we have ϝ(S~)[L(T)={a,a′}]≠ϝ(S)

Fig. 14.

Xenologs inferred from LDT graphs. Only observable scenarios S whose LDT graph (G<(S),σ) contains at least one edge are included (82.3% of all scenarios). The tuples on the horizontal axis give the rates for duplication, loss, and horizontal transfer. Top panel: Recall. Fraction of edges in ϝ(S) represented in G<(S) (light blue). As an alternative, the fraction of edges in a “minimum edge completion” (m.e.c.) to the “closest” complete multipartite graph is shown in dark blue. We observe a substantial increase in the fraction of inferred edges. The Fitch graph ϝ(S) obtained from the scenario S produced by Algorithm 1 with input (G<(S),σ) yields an even better recall (light green). Second panel: Increase in the number of correctly inferred edges relative to the LDT graph G<(S). Third panel: Precision. In contrast to LDT graphs, which by Theorem 4 cannot contain false positive edges, this is not the case for the estimated Fitch graphs obtained as m.e.c. and by Algorithm 1. While false positive edges are typically rare, occasionally very poor estimates are observed. Bottom panel: Accuracy

The simulation also determines the set of surviving genes L⊆L(T~), the reconciliation map μ~:V(T~)→V(S)∪E(S) and the coloring σ:L→L(S) representing the species in which each surviving gene resides. From the true tree T~, the observable gene tree T=T~|L is obtained by recursively removing leaves that correspond to loss events, i.e. L(T~)\L, and suppressing inner vertices with a single child and setting τT(x)=τT~(x) and μ(x)=μ~(x) for all x∈V(T). This defines a relaxed scenario S=(T,S,σ,μ,τT,τS). From the scenario S, we can immediately determine the associated HGT map λS, the Fitch graph ϝ(S), and the LDT graph G<(S). We also consider S~=(T~,S,σ,μ~,τT~,τS) which, from a formal point of view, is not a relaxed scenario, see Fig. 13. In this example, the gene-species association σ:L→L(S) is not a map for the entire leaf set L(T~). Still, we can define the true LDT graph G<(S~) and the true Fitch graph ϝ(S~) of S~ in the same way as LDT graphs using Definitions 8, 9, and 16, respectively. Note that this does not guarantee that every true Fitch graph is also an rs-Fitch graph. The example in Fig. 13 shows, furthermore, that ϝ(S~)[L]≠ϝ(S) is possible. For the LDT graphs, on the other hand, we have G<(S)=G<(S~) because S~ and S are based on the same time maps.
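
The restriction T=T~|L described above can be illustrated with a minimal sketch. It only handles the tree topology and assumes a hypothetical encoding that is not taken from AsymmeTree: a leaf is a string, an inner vertex is a tuple of child subtrees, and the planted root, the time maps and the reconciliation map are ignored.

```python
def restrict(tree, surviving):
    """Restrict a nested-tuple tree to the leaves in `surviving`.

    Returns None if no leaf below `tree` survives. A leaf is a string,
    an inner vertex is a tuple of child subtrees (illustrative encoding).
    """
    if isinstance(tree, str):
        return tree if tree in surviving else None
    kept = [sub for sub in (restrict(child, surviving) for child in tree)
            if sub is not None]
    if not kept:
        return None      # the whole subtree consists of losses
    if len(kept) == 1:
        return kept[0]   # suppress an inner vertex with a single child
    return tuple(kept)


# Toy "true" gene tree with two loss leaves.
true_tree = (("a1", "loss1"), (("b1", "loss2"), "c1"))
observable = restrict(true_tree, {"a1", "b1", "c1"})
print(observable)  # ('a1', ('b1', 'c1'))
```

In this toy example both loss leaves are removed, and their parents, which are left with a single child, are suppressed.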

The distinction between the true graph ϝ(S~)[L] and the rs-Fitch graph ϝ(S) is closely related to the definition of transfer edges. So far, we only took into account transfer edges (uv) in the (observable) gene trees T, for which u and v are mapped to incomparable vertices or edges of the species trees S (cf. Definition 14). Thus, given the knowledge of the relaxed scenario S=(T,S,σ,μ,τT,τS), these transfer edges are in that sense “visible”. However, given S~=(T~,S,σ,μ~,τT~,τS), which still contains all loss branches, it is possible that a non-transfer edge in T corresponds to a path in T~ which contains a transfer edge w.r.t. S~, i.e., some edge (u,v)∈E(T~) such that μ~(u) and μ~(v) are incomparable in S. In particular, this is the case whenever a gene is transferred into some recipient branch followed by a back-transfer into the original branch and a loss in the recipient branch (see Fig. 13, right). Figure 13 shows that, in the majority of the simulated scenarios, the HGT information is preserved in the observable data. In fact, ϝ(S)=ϝ(S~) in 86.7% of simulated scenarios. Occasionally, however, we also encounter scenarios in which large fractions of the xenologous pairs are hidden from inference by the LDT-based approach.

In the following, we will only be concerned with estimating a Fitch graph ϝ(S), i.e., the graph resulting from the “visible” transfer edges. These were edgeless in about 17.7% of the observable scenarios S (all parameter combinations taken into account). In these cases the LDT and thus also the inferred Fitch graphs are edgeless. These scenarios were excluded from further analysis.

We first ask how well the LDT graph G<(S) approximates the Fitch graph ϝ(S). As shown in Fig. 14, the recall is limited. Over a broad range of parameters, the LDT graph contains about a third of the xenologous pairs. This raises the question whether the solution of the editing Problem 3, obtained using the exact recursive algorithm detailed in Sect. C in the Technical Part, leads to a substantial improvement. We find that recall indeed increases substantially, at very moderate levels of false positives. With a median precision of well above 90% in most cases and a median recall of at least 60%, the editing approach provides results that are at the very least encouraging. We find that minimal edge completion (Problem 3) already yields an rs-Fitch graph in the vast majority of cases (99.8%, scenarios of all parameter combinations taken into account), even if we restrict the color set to M:=σ(L) (instead of L(S)) and thus force surjectivity of the coloring σ. We note that the original LDT graph and the minimal edge completion may not always be explained by a common scenario. This suggests that it will be worthwhile to consider the more difficult editing problems for rs-Fitch graphs with a relaxed scenario S that at the same time explains the LDT graph.

Algorithm 1 provides a means to obtain an rs-Fitch graph satisfying the latter constraint but without giving any guarantees for optimality in terms of a minimal edge completion. An implementation is available in the current release of the AsymmeTree package. For the rs-Fitch graphs ϝ(S) of the scenarios S constructed by Algorithm 1 with (G<(S),σ) as input, we observe another moderate increase of recall when compared with the minimal edge completion results. This comes, however, at the expense of a loss in precision. This is not surprising, since ϝ(S) by construction contains at least as many edges as any minimal edge completion of G<(S). Therefore, the number of both true positive and false positive edges in ϝ(S) can be expected to be higher, resulting in a higher recall and lower precision, respectively.

The recall is given by TP/(TP+FN), and |E(ϝ(S))|=TP+FN in terms of true positives TP and false negatives FN. Moreover, G<(S) is a subgraph of the Fitch graphs ϝm.e.c. and ϝ(S) inferred with editing or with Algorithm 1, respectively. The ratio |E(ϝ(S))∩E(ϝ)|/|E(ϝ(S))∩E(G<(S))| with ϝ∈{ϝm.e.c.,ϝ(S)} therefore directly measures the increase in the number of correctly predicted xenologous pairs relative to the LDT. It is equivalent to the ratio of the respective recalls. By construction, the ratio is always at least 1. This is summarized as the second panel in Fig. 14.
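
The quantities shown in Fig. 14 can be computed from three edge sets, as in the following sketch. The function names and the toy edge lists are illustrative and not part of the published pipeline; the sketch assumes that the LDT graph has at least one edge, as in the scenarios retained for the analysis.

```python
def edge_set(pairs):
    """Normalize an iterable of vertex pairs to a set of unordered edges."""
    return {frozenset(p) for p in pairs}


def evaluate(true_edges, inferred_edges, ldt_edges):
    """Return (recall, precision, gain) of an inferred Fitch graph.

    `gain` is the number of correctly inferred edges divided by the number
    of true edges already present in the LDT graph.
    """
    true_e, inf_e, ldt_e = map(edge_set, (true_edges, inferred_edges, ldt_edges))
    tp = len(true_e & inf_e)   # true positives
    fn = len(true_e - inf_e)   # false negatives
    fp = len(inf_e - true_e)   # false positives
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    gain = tp / len(true_e & ldt_e)
    return recall, precision, gain


# Toy data: four true xenologous pairs, one of them visible in the LDT graph.
true_fitch = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")]
ldt = [("a", "b")]
inferred = [("a", "b"), ("a", "c"), ("c", "d")]
recall, precision, gain = evaluate(true_fitch, inferred, ldt)
print(recall, gain)  # 0.5 2.0
```

Here the inferred graph recovers two of the four true edges, introduces one false positive, and doubles the number of correctly predicted pairs relative to the LDT graph.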

Discussion and future directions

In this contribution, we have introduced later-divergence-time (LDT) graphs as a model capturing the subset of horizontal transfer detectable through the pairs of genes that have diverged later than their respective species. Within the setting of relaxed scenarios, LDT graphs (G,σ) are exactly the properly colored cographs with a consistent triple set S(G,σ). We further showed that LDT graphs describe a sufficient set of HGT events if and only if they are complete multipartite graphs. This corresponds to scenarios in which all HGT events are replacing. Otherwise, additional HGT events exist that separate genes from the same species. To better understand these, we investigated scenario-derived rs-Fitch graphs and characterized them as those complete multipartite graphs that satisfy an additional constraint on the coloring (expressed in terms of an auxiliary graph). Although the information contained in LDT graphs is not sufficient to unambiguously determine the missing HGT edges, we arrive at an efficiently solvable graph editing problem from which a “best guess” can be obtained. To our knowledge, this is the first detailed mathematical investigation into the power and limitation of an implicit phylogenetic method for HGT inference.

From a data analysis point of view, LDT graphs appear to be an attractive avenue to infer HGT in practice. While existing methods to estimate them from (dis)similarity data certainly can be improved, it is possible to use their cograph structure to correct the initial estimate in the same way as orthology data (Hellmuth et al. 2015). Although the LDT modification problems are NP-complete (Theorem 9), it does not appear too difficult to modify efficient cograph editing heuristics (Crespelle 2019; Hellmuth et al. 2020a) to accommodate the additional coloring constraints.

LDT graphs by themselves clearly do not contain sufficient information to completely determine a relaxed scenario. Additional information, e.g. a best match graph (Geiß et al. 2019, 2020a) will certainly be required. The most direct practical use of LDT information is to infer the Fitch graph, whose independent sets correspond to maximal HGT-free subsets of genes. These subsets can be analyzed separately (Hellmuth 2017) using recent results to infer gene family histories, including orthology relations from best match data (Geiß et al. 2020a; Schaller et al. 2021b). The main remaining unresolved question is whether the resulting HGT-free subtrees can be combined into a complete scenario using only relational information such as best match data. One way to attack this is to employ the techniques used by Lafond and Hellmuth (2020) to characterize the conditions under which a fully event-labeled gene tree can be reconciled with unknown species trees. These not only resulted in a polynomial-time algorithm but also established additional constraints on the HGT-free subtrees. An alternative, albeit mathematically less appealing, approach is to adapt classical phylogenetic methods to accommodate the HGT-free subtrees as constraints. We suspect that best match data can supply further, stringent constraints for this task. We will pursue this avenue elsewhere.

Several alternative routes can be followed to obtain Fitch graphs from LDT graphs. The most straightforward approach is to elaborate on the editing problems briefly discussed in Sect. 6. A natural question arising in this context is whether there are non-LDT edges that are shared by all minimal completion sets Q, and whether these “obligatory Fitch-edges” can be determined efficiently. A natural alternative is to modify Algorithm 1 to incorporate some form of cost function to favor the construction of biologically plausible scenarios. In a very different approach, one might also consider using LDT graphs as constraints in probabilistic models to reconstruct scenarios, see e.g. Sjöstrand et al. (2014) and Khan et al. (2016).

Although we have obtained characterizations of both LDT graphs and rs-Fitch graphs, many open questions and avenues for future research remain.

Reconciliation maps The notion of relaxed reconciliation maps used here appears to be at least as general as alternatives that have been explored in the literature. It avoids the concurrent definition of event types and thus allows situations that may be excluded in a more restrictive setting. For example, relaxed scenarios may have two or more vertically inherited genes x and y in the same species with u:=lcaT(x,y) mapping to a vertex of the species tree. In the usual interpretation, u corresponds to a speciation event (by virtue of μ(u)∈V0(S)); on the other hand, the descendants x and y constitute paralogs in most interpretations. Such scenarios are explicitly excluded e.g. in Stadler et al. (2020). Lemma 3 suggests that relaxed scenarios are sufficiently flexible to make it possible to replace a scenario S that is “forbidden” in response to such inconsistent interpretations of events by an “allowed” scenario S′ with the same σ such that G<(S)=G<(S′). Whether this is indeed true, or whether a more restrictive definition of reconciliation imposes additional constraints on LDT graphs will of course need to be checked in each case.

The restriction of a μ-free scenario to a subset L of leaves of T and to a subset M of leaves of S is well defined as long as σ(L)⊆M. One can also define a corresponding restriction of the reconciliation map μ. Most importantly, the deletion of some leaves of T may leave inner vertices in T with only a single child, which are then suppressed to recover a phylogenetic tree. This replaces paths in T by single edges and thus affects the definition of the HGT map λS since a path in T that contains two adjacent vertices u1, u2 with incomparable images μ(u1) and μ(u2) may be replaced by an edge with comparable end points in the restricted scenario S′. This means that HGT events may become invisible, and thus ϝ(S′) is not necessarily an induced subgraph of ϝ(S), but a subgraph that may lack additional edges. Note that this is in contrast to the assumptions made in the analysis of (directed) Fitch graphs of 0/1-edge-labeled graphs (Geiß et al. 2018; Hellmuth and Seemann 2019), where the information on horizontal transfers is inherited upon restriction of (T,λ).

Observability The latter issue is a special case of the more general problem with observability of events. Conceptually, we assume that evolution followed a true scenario comprising discrete events (speciations, duplications, horizontal transfer, gene losses, and possibly other events such as hybridization which are not considered here). In computer simulations, of course, we know this true scenario, as well as all event types. Gene loss not only renders some leaves invisible but also erases the evidence of all subtrees without surviving leaves. Removal of these vertices in general results in a non-phylogenetic gene tree that contains inner vertices with a single child. In the absence of horizontal transfer, this causes few problems and the unobservable vertices can be removed as described in the previous paragraph, see e.g. Hernández-Rosales et al. (2012). The situation is more complicated with HGT. In Nøjgaard et al. (2018), an HGT-vertex is deemed observable if it has both a horizontally and a vertically inherited descendant. In our present setting, the scenario retains an HGT-edge by virtue of consecutive vertices in T with incomparable μ-images, irrespective of whether an HGT-vertex is retained. This type of “vertex-centered” notion of xenology is explored further in Hellmuth et al. (2017). We suspect that these different points of view can be unified only when gene losses are represented explicitly or when gene and species trees are not required to be phylogenetic (with single-child vertices implicating losses). Either extension of the theory, however, requires a more systematic understanding of which losses need to be represented and what evidence can be acquired to “observe” them.

Impact of orthology Pragmatically, one would define two genes x and y to be orthologs if μ(lcaT(x,y))∈V0(S), i.e., if x and y are the product of a speciation event. Lemma 3 implies that there is always a scenario without any orthologs that explains a given LDT graph (G,σ). In particular, therefore, (G,σ) makes no implications on orthology. Conversely, however, if orthology information is available, additional information on HGT might become available. In a situation akin to Fig. 9 (with the ancestral duplication moved down to the speciation), knowing that a and b are orthologs in the more restrictive sense that μ(lcaT(a,b))=lcaS(σ(a),σ(b)) excludes the r.h.s. scenario and implies that a is the horizontally inherited child, and therefore also that a and a′ are xenologs. This connection of orthology and xenology will be explored elsewhere.

Other types of implicit phylogenetic information LDT graphs are not the only conceivable type of accessible xenology information. A large class of methods is designed to assess whether a single gene is a xenolog, i.e., whether there is evidence that it has been horizontally inserted into the genome of the recipient species. The main subclasses evaluate nucleotide composition patterns, the phyletic distribution of best-matching genes, or combinations thereof. A recent overview can be found e.g. in Sánchez-Soto et al. (2020). It remains an open question how this information can be utilized in conjunction with other types of HGT information, such as LDT graphs. It seems reasonable to expect that it can provide not only additional constraints to infer rs-Fitch graphs but also directional information that may help to infer the directed Fitch graphs studied by Geiß et al. (2018) and Hellmuth and Seemann (2019). Complementarily, we may ask whether it is possible to gain direct information on HGT edges between pairs of genes in the same genome, and if so, what needs to be measured to extract this information efficiently.

We also have to leave open several mathematical questions. Regarding 0/1-edge labeled trees (T,λ), it would be of interest to know whether there is always a relaxed scenario S=(T,S,σ,μ,τT,τS) such that (T,λ)=(T,λS) for a suitable choice of σ. Elaborating on Theorem 5, it would be interesting to characterize the leaf colorings σ for (T,λ) such that there is a relaxed scenario S with ϝ(T,λ)=ϝ(S).

Acknowledgements

We thank the three anonymous referees for their valuable comments that helped to significantly improve the paper. This work was funded in part by the Deutsche Forschungsgemeinschaft (proj. CO1 within CRG 1423, No. 421152132, Proj. STA850/49-1 and MI439/14-2), and by the Natural Sciences and Engineering Research Council of Canada (NSERC, Grant RGPIN-2019-05817).

Technical part

Later-divergence-time graphs

LDT graphs and evolutionary scenarios

In the absence of horizontal gene transfer, the last common ancestor of two species A and B should mark the latest possible time point at which two genes a and b residing in σ(a)=A and σ(b)=B, respectively, may have diverged. Situations in which this constraint is violated are therefore indicative of HGT.

Definition 7

(μ-free scenario) Let T and S be planted trees, σ:L(T)→L(S) be a map and τT and τS be time maps of T and S, respectively, such that τT(x)=τS(σ(x)) for all x∈L(T). Then, T=(T,S,σ,τT,τS) is called a μ-free scenario.

The condition that τT(x)=τS(σ(x)) for all x∈L(T) is mostly a technical convenience that makes μ-free scenarios easier to interpret. Nevertheless, by Lemma 1, given the time map τS, one can easily construct a time map τT such that τT(x)=τS(σ(x)) for all x∈L(T). In particular, when constructing relaxed scenarios explicitly, we may simply choose τT(u)=0 and τS(x)=0 as common time for all leaves u∈L(T) and x∈L(S).

Definition 8

(LDT graph) For a μ-free scenario T=(T,S,σ,τT,τS), we define G<(T)=G<(T,S,σ,τT,τS)=(V,E) as the graph with vertex set V:=L(T) and edge set

$$E := \{ab \mid a,b \in L(T),\ \tau_T(\operatorname{lca}_T(a,b)) < \tau_S(\operatorname{lca}_S(\sigma(a),\sigma(b)))\}.$$

A vertex-colored graph (G,σ) is a later-divergence-time graph (LDT graph), if there is a μ-free scenario T=(T,S,σ,τT,τS) such that G=G<(T). In this case, we say that T explains (G,σ).

It is easy to see that the edge set of G<(T) defines an undirected graph and that there are no edges of the form aa, since τT(lcaT(a,a))=τT(a)=τS(σ(a))=τS(lcaS(σ(a),σ(a))). Hence G<(T) is a simple graph.
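
Definition 8 translates directly into a short procedure. The following sketch assumes an illustrative encoding that is not used in the paper: each tree is a child-to-parent dictionary, time maps are dictionaries from vertices to numbers, and σ is a dictionary from gene leaves to species leaves. All vertex names in the toy scenario are hypothetical.

```python
from itertools import combinations


def ancestors(parent, v):
    """Vertices on the path from v up to the root, starting with v itself."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, a, b):
    """Last common ancestor of a and b in a child -> parent encoded tree."""
    above_a = set(ancestors(parent, a))
    return next(v for v in ancestors(parent, b) if v in above_a)


def ldt_graph(parent_T, parent_S, sigma, tau_T, tau_S):
    """Edge set of the LDT graph: pairs of genes that diverged later than their species."""
    edges = set()
    for a, b in combinations(sorted(sigma), 2):
        t_genes = tau_T[lca(parent_T, a, b)]
        t_species = tau_S[lca(parent_S, sigma[a], sigma[b])]
        if t_genes < t_species:
            edges.add(frozenset((a, b)))
    return edges


# Toy scenario: species tree 0S -> rS -> {A, B}, gene tree 0T -> r -> {u, b2}, u -> {a1, b1}.
parent_S = {"A": "rS", "B": "rS", "rS": "0S"}
tau_S = {"A": 0, "B": 0, "rS": 2, "0S": 4}
parent_T = {"a1": "u", "b1": "u", "u": "r", "b2": "r", "r": "0T"}
tau_T = {"a1": 0, "b1": 0, "b2": 0, "u": 1, "r": 3, "0T": 4}
sigma = {"a1": "A", "b1": "B", "b2": "B"}

edges = ldt_graph(parent_T, parent_S, sigma, tau_T, tau_S)
print(edges)  # {frozenset({'a1', 'b1'})} (element order inside the frozenset may vary)
```

In this toy scenario only a1 and b1 diverged (at time 1) after the divergence of their species A and B (at time 2), so a1b1 is the only edge. The pair b1, b2 resides in the same species and is therefore never adjacent.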

By definition, every relaxed scenario S=(T,S,σ,μ,τT,τS) satisfies τT(x)=τS(σ(x)) for all x∈L(T). Therefore, removing μ from S yields a μ-free scenario T=(T,S,σ,τT,τS). Thus, we will use the following simplified notation.

Definition 9

We put G<(S):=G<(T,S,σ,τT,τS) for a given relaxed scenario S=(T,S,σ,μ,τT,τS) and the underlying μ-free scenario (T,S,σ,τT,τS) and say, by slight abuse of notation, that S explains (G<(S),σ).

Lemma 2

For every μ-free scenario T=(T,S,σ,τT,τS), there is a relaxed scenario S=(T,S,σ,μ,τT~,τS~) for T, S and σ such that (G<(T),σ)=(G<(S),σ).

Proof

Let T=(T,S,σ,τT,τS) be a μ-free scenario. In order to construct a relaxed scenario S=(T,S,σ,μ,τT~,τS~) that satisfies G<(S)=G<(T), we start with a time map τT~ for T satisfying τT~(0T)=max(τT(0T),τS(0S)) and τT~(v)=τT(v) for all v∈V(T)\{0T}. Correspondingly, we introduce a time map τS~ for S such that τS~(0S)=max(τT(0T),τS(0S)) and τS~(v)=τS(v) for all v∈V(S)\{0S}. By construction, we have tmax,T:=max{τT(v)∣v∈V(T)}=τT(0T)=τS(0S). Moreover, we have tmin,S:=min{τS(v)∣v∈V(S)}≤min{τT(v)∣v∈V(T)}=:tmin,T. To see this, we can choose x∈V(T) such that τT(x)=tmin,T. By the definition of time maps and minimality of τT(x), the vertex x must be a leaf. Hence, since T is a μ-free scenario, we have τT(x)=τS(σ(x)) with X:=σ(x)∈L(S)⊆V(S). Therefore, it must hold that tmin,S≤tmin,T. We now define P:={p∈V(S)∪E(S)∣X ⪯S p}, i.e., the set of all vertices and edges on the unique path in S from 0S to the leaf X. Since τS(X)=tmin,T<tmax,T=τS(0S), we find, for each v∈V(T), either a vertex u∈P such that τT(v)=τS(u) or an edge (u,w)∈P such that τS(w)<τT(v)<τS(u). Hence, we can specify the reconciliation map μ by defining, for every v∈V(T),

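$$\mu(v) := \begin{cases}
0_S & \text{if } v = 0_T,\\
\sigma(v) & \text{if } v \in L(T),\\
u & \text{if there is some vertex } u \in P \text{ with } \tau_T(v) = \tau_S(u),\\
(u,w) & \text{if there is some edge } (u,w) \in P \text{ with } \tau_S(w) < \tau_T(v) < \tau_S(u).
\end{cases}$$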

For each v∈V0(T), exactly one of the two alternatives for P applies, hence μ is well-defined. It is now an easy task to verify that all conditions in Definitions 4 and 5 are satisfied for S=(T,S,σ,μ,τT~,τS~) by construction. Hence, by Definition 6, S is a relaxed scenario.

It remains to show that G<(T)=G<(S). Let a,b∈L(T) be arbitrary. Clearly, neither lcaT(a,b) nor lcaS(σ(a),σ(b)) equals the planted root 0T or 0S, respectively. Since we have only changed the timing of the roots 0T and 0S, we obtain ab∈E(G<(S)) if and only if τT~(lcaT(a,b))=τT(lcaT(a,b))<τS~(lcaS(σ(a),σ(b)))=τS(lcaS(σ(a),σ(b))) if and only if ab∈E(G<(T)), which completes the proof.

Theorem 1

(G,σ) is an LDT graph if and only if there is a relaxed scenario S=(T,S,σ,μ,τT,τS) such that (G,σ)=(G<(S),σ).

Proof

By definition, (G,σ) is an LDT graph for every relaxed scenario S with coloring σ that satisfies (G,σ)=(G<(S),σ). Now suppose that (G,σ) is an LDT graph. By definition, there is a μ-free scenario T=(T,S,σ,τT,τS) with coloring σ such that (G,σ)=(G<(T),σ). By Lemma 2, there is a relaxed scenario S=(T,S,σ,μ,τT~,τS~) for T, S and σ such that (G,σ)=(G<(S),σ).

Remark 3

From here on, we omit the explicit reference to Lemma 2 and Theorem 1 and assume that the reader is aware of the fact that every LDT graph is explained by some relaxed scenario S and that for every μ-free scenario T=(T,S,σ,τT,τS), there is a relaxed scenario S for T, S and σ such that (G<(T),σ)=(G<(S),σ).

We now derive some simple properties of μ-free and relaxed scenarios. It may be surprising at first glance that “the speciation nodes”, i.e., vertices u∈V0(T) with μ(u)∈V(S), do not play a special role in determining LDT graphs.

Lemma 3

For every relaxed scenario S=(T,S,σ,μ,τT,τS) there exists a relaxed scenario S~=(T,S,σ,μ~,τT~,τS) such that G<(S~)=G<(S) and, for all distinct x,y∈L(T) with xy∉E(G<(S)), it holds that τT~(lcaT(x,y))>τS(lcaS(σ(x),σ(y))).

Proof

For the relaxed scenario S=(T,S,σ,μ,τT,τS) we write V0(S):=V(S)\(L(S)∪{0S}) and define

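$$\begin{aligned}
D_S &:= \{|\tau_S(y)-\tau_S(x)| \colon x,y \in V(S),\ \tau_S(x) \neq \tau_S(y)\},\\
D_T &:= \{|\tau_T(y)-\tau_T(x)| \colon x,y \in V(T),\ \tau_T(x) \neq \tau_T(y)\}, \text{ and}\\
D_{TS} &:= \{|\tau_T(x)-\tau_S(y)| \colon x \in V(T),\ y \in V(S),\ \tau_T(x) \neq \tau_S(y)\}.
\end{aligned}$$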

We have DS≠∅ and DT≠∅ since we do not consider empty trees, and thus, at least the “planted” edges 0SρS and 0TρT always exist. By construction, all values in DT, DS, and DTS are strictly positive. Now define

$$\epsilon := \tfrac{1}{2}\min(D_{TS} \cup D_S \cup D_T).$$

Since DS and DT are not empty, ϵ is well-defined and, by construction, ϵ>0. Next we set, for all v∈V(T),

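$$\widetilde{\tau_T}(v) := \begin{cases}
\tau_T(v)+\epsilon, & \text{if } v \in V^0(T)\\
\tau_T(v), & \text{otherwise,}
\end{cases}
\qquad
\widetilde{\mu}(v) := \begin{cases}
(\operatorname{par}(x),x), & \text{if } \mu(v) = x \in V^0(S)\\
\mu(v), & \text{otherwise.}
\end{cases}$$
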
Claim 1

S~:=(T,S,σ,μ~,τT~,τS) is a relaxed scenario.

Proof

By construction, if μ(v)∈L(S)∪{0S} and thus, μ(v)∉V0(S), then μ(v) and μ~(v) coincide. Therefore, (G0) and (G1) are trivially satisfied for μ~. In order to show (G2), we first note that τT~(v)=τT(v)=τS(σ(v)) holds for all v∈L(T) by Definition 4.

We next argue that τT~ is a time map. To this end, let x,y∈V(T) with x≺Ty. Hence, τT(x)<τT(y) and, in particular, τT(y)-τT(x)≥2ϵ. Assume for contradiction that τT~(x)≥τT~(y). This implies τT~(x)=τT(x)+ϵ and τT~(y)=τT(y), since τT(x)<τT(y) and ϵ>0 always implies τT(x)+ϵ<τT(y)+ϵ and τT(x)<τT(y)+ϵ. Therefore, τT~(y)-τT~(x)=τT(y)-(τT(x)+ϵ)≥ϵ>0 and thus, τT~(y)>τT~(x); a contradiction.

We continue with showing that the two time maps τT~ and τS are time-consistent w.r.t. S~. To see that Condition (C1) is satisfied, observe that, by construction, μ~(v)∈V(S) holds only in case μ(v)∉E(S)∪V0(S) and thus, μ(v)∈L(S)∪{0S}. In this case, μ~(v)=μ(v) and since μ(v) satisfies (G1) we have v∈L(T)∪{0T}. Thus, v∉V0(T) and, therefore, τT~(v)=τT(v)=τS(μ(v)). Therefore, Condition (C1) is satisfied.

Now consider Condition (C2). As argued above, μ~(v)∈E(S) holds for all v∈V0(T)=V(T)\(L(T)∪{0T}). By construction, τT~(v)=τT(v)+ϵ. There are two cases: μ(v)=x∈V0(S), or μ(v)=(y,x)∈E(S) with y=par(x). The following arguments hold for both cases: We have μ~(v)=(y,x)∈E(S). Moreover, τS(x)≤τT(v)<τT~(v) since τT and τS satisfy (C1) and (C2). Furthermore, τT(v)<τS(y) and, by construction, τS(y)-τT(v)≥2ϵ. This immediately implies that τS(y)≥τT(v)+2ϵ=τT~(v)+ϵ>τT~(v). In summary, τS(x)<τT~(v)<τS(y) whenever μ~(v)=(y,x)∈E(S). Therefore, Condition (C2) is satisfied for S~.

Claim 2

E(G<(S))⊆E(G<(S~)).

Proof

Let xy be an edge in G<(S) and thus x≠y, and set vT:=lcaT(x,y) and vS:=lcaS(σ(x),σ(y)). By definition, we have τT(vT)<τS(vS). Therefore, we have τS(vS)-τT(vT)∈DTS and, hence, τS(vS)-τT(vT)≥2ϵ. Since x≠y, vT=lcaT(x,y) is an inner vertex of T. By construction, therefore, τT~(vT)=τT(vT)+ϵ. The latter arguments together with the fact that τS remains unchanged imply that τS(vS)-τT~(vT)≥ϵ>0, and thus, τT~(vT)<τS(vS). Therefore, we conclude that xy is an edge in G<(S~).

It remains to show

Claim 3

For all distinct x,y∈L(T) with xy∉E(G<(S)), we have τT~(lcaT(x,y))>τS(lcaS(σ(x),σ(y))).

Proof

Suppose xy∉E(G<(S)) for two distinct x,y∈L(T), and set vT:=lcaT(x,y) and vS:=lcaS(σ(x),σ(y)). By definition, this implies τT(vT)≥τS(vS). Since x≠y, we clearly have that vT=lcaT(x,y) is an inner vertex of T, and hence, τT~(vT)=τT(vT)+ϵ. The latter two arguments together with ϵ>0 and the fact that τS remains unchanged imply that τT~(vT)>τS(vS).

In particular, therefore, xy∉E(G<(S)) implies that xy∉E(G<(S~)) and therefore, E(G<(S~))⊆E(G<(S)). Together with Claim 2 and the fact that both G<(S) and G<(S~) have vertex set L(T), we conclude that G<(S)=G<(S~), which completes the proof.
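
The shift used in the proof of Lemma 3 can be made concrete with a small numerical sketch. It assumes time maps stored as dictionaries and a hypothetical set `inner_T` playing the role of V0(T); the toy values are illustrative and not taken from the simulations.

```python
def positive_gaps(xs, ys):
    """All strictly positive differences |x - y| with x from xs and y from ys."""
    return {abs(x - y) for x in xs for y in ys if x != y}


def shift_inner_vertices(tau_T, tau_S, inner_T):
    """Return (eps, shifted gene-tree time map) following the proof of Lemma 3."""
    t_vals, s_vals = list(tau_T.values()), list(tau_S.values())
    d_S = positive_gaps(s_vals, s_vals)
    d_T = positive_gaps(t_vals, t_vals)
    d_TS = positive_gaps(t_vals, s_vals)
    eps = min(d_TS | d_S | d_T) / 2
    # Only inner vertices of T are moved; leaves and the planted root keep their time.
    shifted = {v: t + eps if v in inner_T else t for v, t in tau_T.items()}
    return eps, shifted


tau_S = {"A": 0, "B": 0, "rS": 2, "0S": 4}
tau_T = {"a1": 0, "b1": 0, "b2": 0, "u": 1, "r": 3, "0T": 4}
eps, tau_T_shifted = shift_inner_vertices(tau_T, tau_S, inner_T={"u", "r"})
print(eps, tau_T_shifted["u"], tau_T_shifted["r"])  # 0.5 1.5 3.5
```

Since every positive gap between two time stamps is at least 2ϵ, adding ϵ to the inner vertices of T cannot reverse any strict inequality, but it turns every equality between a gene-tree time and a species-tree time into a strict inequality.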

Since the relaxed scenario S~=(T,S,σ,μ~,τT~,τS) as constructed in the proof of Lemma 3 satisfies μ~(v)∉V0(S), we obtain

Corollary 1

For every relaxed scenario S=(T,S,σ,μ,τT,τS) there exists a relaxed scenario S~=(T,S,σ,μ~,τT~,τS) such that G<(S~)=G<(S) and μ~(v)∉V0(S) for all v∈V(T).

Lemma 3, however, does not imply that one can always find a relaxed scenario with a reconciliation map μ~ for given trees T and S satisfying μ~(lcaT(x,y)) ⪰S lcaS(σ(x),σ(y)) for all distinct x,y∈L(T) with xy∉E(G<(S)), as shown in Example 2.

Example 2

Consider the LDT graph (G<(S),σ) with corresponding relaxed scenario S as shown in Fig. 15. Note first that v=lcaT(a,b)=lcaT(c,d) and ab,cd∉E(G<). To satisfy both μ~(v) ⪰S lcaS(σ(a),σ(b)) and μ~(v) ⪰S lcaS(σ(c),σ(d)), we clearly need that μ~(v) ⪰S ρS, and thus τT~(v)≥τS(ρS). However, ad∈E(G<) and lcaT(a,d)=u imply that τT~(u)<τS(lcaS(σ(a),σ(d)))=τS(ρS). Hence, we obtain τT~(u)<τS(ρS)≤τT~(v); a contradiction to (u,v)∈E(T) and τT~ being a time map for T. Therefore, there is no relaxed scenario S~=(T,S,σ,μ~,τT~,τS) such that G<(S~)=G<(S) and such that μ~(lcaT(x,y)) ⪰S lcaS(σ(x),σ(y)) for all distinct x,y∈L(T) with xy∉E(G<(S)).

Fig. 15.

Left: a relaxed scenario S=(T,S,σ,μ,τT,τS) with corresponding graph (G<(S),σ) (right). For (G<(S),σ) there is no relaxed scenario S~=(T,S,σ,μ~,τT~,τS) such that G<(S~)=G<(S) and for all distinct x,y∈L(T) with xy∉E(G<(S)) it holds that μ~(lcaT(x,y)) ⪰S lcaS(σ(x),σ(y)), see Example 2

For the special case that the graph under consideration has no edges we have

Lemma 4

For an edgeless graph G and for any choice of T and S with L(T)=V(G) and σ(L(T))=L(S) there is a relaxed scenario S=(T,S,σ,μ,τT,τS) that satisfies G=G<(S).

Proof

Given T and S we construct a relaxed scenario as follows. Let τS be an arbitrary time map on S. Then we can choose τT such that τS(ρS)<τT(u)<τS(0S) for all u∈V0(T). Each leaf u∈L(T) then has a parent in T located above the last common ancestor ρS of all species, in which case G<(S) is edgeless.

Lemma 4 is reminiscent of the fact that for DL-only scenarios any given gene tree T can be reconciled with an arbitrary species tree as long as σ(L(T))=L(S) (Guigó et al. 1996; Geiß et al. 2020a).

Properties of LDT graphs

Proposition 3

Every LDT graph (G,σ) is properly colored.

Proof

Let T=(T,S,σ,τT,τS) be a μ-free scenario such that (G,σ)=(G<(T),σ) and recall that every μ-free scenario satisfies τT(x)=τS(σ(x)) for all x∈L(T) with σ(x)∈L(S). Let a,b∈L(T) be distinct and suppose that σ(a)=σ(b)=A. Since a and b are distinct we have a,b≺TlcaT(a,b) and hence, by Definition 3, τT(a)<τT(lcaT(a,b)). This implies that τT(a)=τS(A)=τS(lcaS(A,A))<τT(lcaT(a,b)). Therefore, ab∉E(G). Consequently, ab∈E(G) implies σ(a)≠σ(b), which completes the proof.

Extending earlier work of Dekker (1986), Bryant and Steel (1995) derived conditions under which two triples r1,r2 imply a third triple r3 that must be displayed by any tree that displays r1,r2. In particular, we make frequent use of the following

Lemma 5

If a tree T displays xy|z and zw|y then T displays xy|w and zw|x. In particular T|{x,y,z,w}=((x,y),(z,w)) (in Newick format).
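
Lemma 5 can be checked on small examples with the following sketch. It assumes an illustrative nested-tuple encoding of rooted trees (a leaf is a string, an inner vertex is a tuple of subtrees) and uses the fact that a tree displays xy|z exactly if some vertex has x and y, but not z, among the leaves below it.

```python
def clusters(tree):
    """Leaf sets below every vertex of a nested-tuple tree; the last entry belongs to the root."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result = []
    leaves = frozenset()
    for child in tree:
        child_clusters = clusters(child)
        result.extend(child_clusters)
        leaves |= child_clusters[-1]  # last entry is the child's full leaf set
    result.append(leaves)
    return result


def displays(tree, x, y, z):
    """True if the tree displays the triple xy|z."""
    return any(x in c and y in c and z not in c for c in clusters(tree))


balanced = (("x", "y"), ("z", "w"))
caterpillar = ((("x", "y"), "z"), "w")

# The balanced tree displays both premises and both implied triples.
print(all(displays(balanced, *t) for t in ["xyz", "zwy", "xyw", "zwx"]))  # True
# The caterpillar displays xy|z but not zw|y, so the premise of the lemma fails.
print(displays(caterpillar, "x", "y", "z"), displays(caterpillar, "z", "w", "y"))  # True False
```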

Definition 10

For every graph G=(L,E), we define the set of triples on L

$$T(G) := \{xy|z \colon x,y,z \in L \text{ are pairwise distinct},\ xy \in E,\ xz, yz \notin E\}.$$

If G is endowed with a coloring σ:LM we also define a set of color triples

$$S(G,\sigma) := \{\sigma(x)\sigma(y)|\sigma(z) \colon x,y,z \in L,\ \sigma(x),\sigma(y),\sigma(z) \text{ are pairwise distinct},\ xz, yz \in E,\ xy \notin E\}.$$

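
Both triple sets of Definition 10 can be enumerated by brute force over all ordered vertex triples. The sketch below uses an illustrative encoding: a triple xy|z is stored as the pair (frozenset({x, y}), z), and the toy graph is hypothetical.

```python
from itertools import permutations


def triple_sets(vertices, edges, sigma):
    """Return (T(G), S(G, sigma)) with a triple xy|z encoded as (frozenset({x, y}), z)."""
    edge_set = {frozenset(e) for e in edges}

    def adjacent(u, v):
        return frozenset((u, v)) in edge_set

    gene_triples, color_triples = set(), set()
    for x, y, z in permutations(vertices, 3):
        # Induced K2 + K1: xy is an edge, z is adjacent to neither x nor y.
        if adjacent(x, y) and not adjacent(x, z) and not adjacent(y, z):
            gene_triples.add((frozenset((x, y)), z))
        # Induced path x - z - y on three distinct colors.
        three_colors = len({sigma[x], sigma[y], sigma[z]}) == 3
        if three_colors and adjacent(x, z) and adjacent(y, z) and not adjacent(x, y):
            color_triples.add((frozenset((sigma[x], sigma[y])), sigma[z]))
    return gene_triples, color_triples


vertices = ["a", "b", "c", "d"]
sigma = {"a": "A", "b": "B", "c": "C", "d": "A"}
edges = [("a", "c"), ("b", "c")]
gene_triples, color_triples = triple_sets(vertices, edges, sigma)
print(len(gene_triples), len(color_triples))  # 2 1
```

For this graph, the isolated vertex d yields the gene triples ac|d and bc|d, while the induced path a, c, b yields the single color triple AB|C.
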
Lemma 6

If a graph (G,σ) is an LDT graph then S(G,σ) is compatible and S displays S(G,σ) for every μ-free scenario T=(T,S,σ,τT,τS) that explains (G,σ).

Proof

Suppose that (G=(L,E),σ) is an LDT graph and let T=(T,S,σ,τT,τS) be a μ-free scenario that explains (G,σ). In order to show that S(G,σ) is compatible it suffices to show that S displays every triple in S(G,σ).

Let AB|C ∈ S(G,σ). By definition, A, B, C are pairwise distinct and there must be vertices a,b,c ∈ L with σ(a)=A, σ(b)=B, and σ(c)=C such that ab ∉ E and bc,ac ∈ E. First, ab ∉ E and bc,ac ∈ E imply τT(lcaT(a,b)) ≥ τS(lcaS(A,B)), τT(lcaT(b,c))<τS(lcaS(B,C)), and τT(lcaT(a,c))<τS(lcaS(A,C)). Moreover, for any three vertices a, b, c in T it holds that 1 ≤ |{lcaT(a,b),lcaT(a,c),lcaT(b,c)}| ≤ 2.

Therefore we have to consider the following four cases: (1) u:=lcaT(a,b)=lcaT(b,c)=lcaT(a,c), (2) u:=lcaT(a,b)=lcaT(a,c) ≠ lcaT(b,c), (3) u:=lcaT(a,b)=lcaT(b,c) ≠ lcaT(a,c), and (4) lcaT(a,b) ≠ u:=lcaT(b,c)=lcaT(a,c). Note, for any three vertices x, y, z in T, lcaT(x,y) ≠ lcaT(x,z)=lcaT(y,z) implies that lcaT(x,y) ≺T lcaT(x,z)=lcaT(y,z). In Cases (1) and (2), we find τS(lcaS(A,C)) > τT(u) ≥ τS(lcaS(A,B)). Together with the fact that lcaS(A,C) and lcaS(A,B) are comparable in S, this implies that AB|C is displayed by S. In Case (3), we obtain τS(lcaS(B,C)) > τT(u) ≥ τS(lcaS(A,B)) and, by analogous arguments, AB|C is displayed by S. Finally, in Case (4), the tree T displays the triple ab|c. Thus, τS(lcaS(A,B)) ≤ τT(lcaT(a,b)) < τT(u) < τS(lcaS(A,C)). Again, AB|C is displayed by S.

The next lemma shows that induced K2+K1 subgraphs in LDT graphs imply triples that must be displayed by T.

Lemma 7

If (G,σ) is an LDT graph, then T(G) is compatible and T displays T(G) for every μ-free scenario T=(T,S,σ,τT,τS) that explains (G,σ).

Proof

Suppose that (G=(L,E),σ) is an LDT graph and let T=(T,S,σ,τT,τS) be a μ-free scenario that explains (G,σ). In order to show that T(G) is compatible it suffices to show that T displays every triple in T(G).

Let ab|c ∈ T(G). By definition, a,b,c ∈ L(T) are distinct, and ab ∈ E and ac,bc ∉ E. Since ab ∈ E, we have A:=σ(a) ≠ σ(b)=:B by Proposition 3.

There are two cases, either σ(c) ∈ {A,B} or not. Suppose first that w.l.o.g. σ(c)=A. In this case, ab ∈ E and bc ∉ E together imply τT(lcaT(a,b)) < τS(lcaS(A,B)) ≤ τT(lcaT(b,c)). This and the fact that lcaT(a,b) and lcaT(b,c) are comparable in T implies that T displays ab|c.

Suppose now that σ(c)=C ∉ {A,B}. We now consider the four possible topologies of S′=S|ABC: (1) S′ is a star, (2) S′=AB|C, (3) S′=AC|B, and (4) S′=BC|A.

In Cases (1), (2) and (4), we have τS(lcaS(A,B)) ≤ τS(lcaS(A,C)), where equality holds only in Cases (1) and (4). This together with ab ∈ E and ac ∉ E implies τT(lcaT(a,b)) < τS(lcaS(A,B)) ≤ τS(lcaS(A,C)) ≤ τT(lcaT(a,c)). This and the fact that lcaT(a,b) and lcaT(a,c) are comparable in T implies that T displays ab|c. In Case (3), ab ∈ E and bc ∉ E imply τT(lcaT(a,b)) < τS(lcaS(A,B)) = τS(lcaS(B,C)) ≤ τT(lcaT(b,c)). By analogous arguments as before, T displays ab|c.

We note, finally, that the Aho graph of the triple set [T(G),L] in a sense recapitulates G. More precisely, we have:

Proposition 4

Let (G=(L,E),σ) be a vertex-colored graph. If for all edges xy ∈ E there is a vertex z such that xz,yz ∉ E (and thus, in particular, in case that G is disconnected), then [T(G),L]=G.

Proof

Clearly, the vertex sets of [T(G),L] and G are the same, that is, L. Let xy ∈ E and thus, we have x ≠ y. There is a vertex z ≠ x,y in G with xz,yz ∉ E if and only if xy|z ∈ T(G) and thus, if and only if xy is an edge in [T(G),L]=G.

Definition 11

For a vertex-colored graph (G,σ), we will use the shorter notation x1-x2-⋯-xn and X1-X2-⋯-Xn for a path Pn that is induced by the vertices {xi ∣ 1 ≤ i ≤ n} with colors σ(xi)=Xi, 1 ≤ i ≤ n, and edges xixi+1, 1 ≤ i ≤ n-1.

Lemma 8

Every LDT graph (G,σ) is a properly colored cograph.

Proof

Let T=(T,S,σ,τT,τS) be a μ-free scenario that explains (G,σ). By Proposition 3, (G,σ) is properly colored. To show that G=(L,E) is a cograph it suffices to show that G does not contain an induced path on four vertices (cf. Proposition 2). Hence, assume for contradiction that G contains an induced P4.

First we observe that for each edge ab in this P4 it holds that σ(a) ≠ σ(b) since, otherwise, by Proposition 3, ab ∉ E. Based on possible colorings of the P4 w.r.t. σ and up to symmetry, we have to consider four cases: (1) A-B-C-D, (2) A-B-C-A, (3) A-B-A-C and (4) A-B-A-B.

In Case (1) the P4 is of the form a-b-c-d with σ(a)=A, σ(b)=B, σ(c)=C, σ(d)=D. By Lemma 6, the species tree S must display both AC|B and BD|C. Hence, by Lemma 5, S|ABCD=((A,C),(B,D)) in Newick format. Let x:=lcaS(A,B,C,D)=ρS|ABCD. Note, x “separates” A and C from B and D. Now, ab ∈ E and ad ∉ E implies that τT(lcaT(a,b)) < τS(x) ≤ τT(lcaT(a,d)). This and the fact that lcaT(a,b) and lcaT(a,d) are comparable in T implies that T displays ab|d. Similarly, cd ∈ E and ad ∉ E implies that cd|a is displayed by T. By Lemma 5, T|abcd=((a,b),(c,d)). Let y:=lcaT(a,b,c,d)=ρT|abcd. Now, bc ∈ E, lcaT(b,c)=y, and lcaS(B,C)=x implies τT(y)<τS(x). This and lcaT(a,d)=y and lcaS(A,D)=x imply that ad ∈ E, and thus a, b, c, d do not induce a P4 in G; a contradiction.

Case (2) can be directly excluded, since Lemma 6 implies that, in this case, S must display AC|B and AB|C; a contradiction.

Now consider Case (3), that is, the P4 is of the form a-b-a′-c with σ(a)=σ(a′)=A, σ(b)=B and σ(c)=C. By Lemma 6, the species tree S must display BC|A and thus x:=lcaS(A,B)=lcaS(A,C). Since ab ∈ E and ac ∉ E we observe τT(lcaT(a,b)) < τS(x) ≤ τT(lcaT(a,c)) and, as in Case (1), we infer that T displays ab|c. By similar arguments, a′c ∈ E and ac ∉ E implies that T displays a′c|a. By Lemma 5, T|aba′c=((a,b),(a′,c)) and thus, y:=lcaT(a′,b)=lcaT(a,c) and a′b ∈ E implies that τT(y)<τS(x). Since y=lcaT(a,c) and τT(y)<τS(x)=τS(lcaS(A,C)), we can conclude that ac ∈ E. Hence, a, b, a′, c do not induce a P4 in G; a contradiction.

In Case (4) the P4 is of the form a-b-a′-b′ with σ(a)=σ(a′)=A and σ(b)=σ(b′)=B. Now, ab,a′b′ ∈ E and ab′ ∉ E imply that τT(lcaT(a,b)),τT(lcaT(a′,b′)) < τS(lcaS(A,B)) ≤ τT(lcaT(a,b′)). Hence, by similar arguments as above, T must display ab|b′ and a′b′|a. By Lemma 5, T|aba′b′=((a,b),(a′,b′)) and thus, y:=lcaT(a′,b)=lcaT(a,b′). However, a′b ∈ E implies that τT(y)<τS(lcaS(A,B)); a contradiction to τS(lcaS(A,B)) ≤ τT(lcaT(a,b′)).

The converse of Lemma 8 is not true in general. To see this, consider the properly-colored cograph (G,σ) with vertex set V(G)={a,a′,b,b′,c,c′}, edges ab,bc,a′b′,a′c′ and coloring σ(a)=σ(a′)=A, σ(b)=σ(b′)=B, σ(c)=σ(c′)=C with A, B, C being pairwise distinct. In this case, S(G,σ) contains the triples AC|B and BC|A. By Lemma 6, the tree S in every μ-free scenario T=(T,S,σ,τT,τS) or relaxed scenario S=(T,S,σ,μ,τT,τS) explaining (G,σ) displays AC|B and BC|A. Since no such scenario can exist, (G,σ) is not an LDT graph.

Recognition and characterization of LDT graphs

Definition 12

Let (G=(L,E),σ) be a graph with coloring σ: L → M. Let 𝒞 be a partition of M, and 𝒞′ be the set of connected components of G. We define the following binary relation R(G,σ,𝒞) by setting

$$(x,y) \in R(G,\sigma,\mathcal{C}) \iff x,y \in L,\ \sigma(x),\sigma(y) \in C \text{ for some } C \in \mathcal{C}, \text{ and } x,y \in C' \text{ for some } C' \in \mathcal{C}'.$$

In words, two vertices x,y ∈ L are in relation R(G,σ,𝒞) whenever they are in the same connected component of G and their colors σ(x),σ(y) are contained in the same set of the partition of M.
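
The relation of Definition 12 can be computed by labeling each vertex with the pair (connected component, partition block of its color) and grouping vertices that share a label. The sketch below is illustrative only; the name `r_classes`, the list-of-sets encoding of the partition, and the toy input are assumptions, not notation from the paper.

```python
def r_classes(vertices, edges, color, partition):
    """Equivalence classes of the relation R from Definition 12.

    `partition` is a list of disjoint color sets covering every color in use.
    """
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Label every vertex with the index of its connected component (iterative DFS).
    component, n_components = {}, 0
    for start in vertices:
        if start in component:
            continue
        component[start] = n_components
        stack = [start]
        while stack:
            u = stack.pop()
            for w in neighbors[u]:
                if w not in component:
                    component[w] = n_components
                    stack.append(w)
        n_components += 1

    # Index of the partition block that contains each color.
    block = {c: i for i, part in enumerate(partition) for c in part}

    # Two vertices are related iff they share component and color block.
    classes = {}
    for v in vertices:
        classes.setdefault((component[v], block[color[v]]), set()).add(v)
    return list(classes.values())

# Hypothetical toy input: the path a - b - c plus an isolated vertex d.
V = ["a", "b", "c", "d"]
E = [("a", "b"), ("b", "c")]
sigma = {"a": "A", "b": "B", "c": "C", "d": "A"}
classes = r_classes(V, E, sigma, [{"A", "B"}, {"C"}])
```

With the partition {{A,B},{C}}, the classes are {a,b}, {c} and {d}: the vertices a and d share a color block but lie in different connected components.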

Lemma 9

Let (G=(L,E),σ) be a graph with coloring σ: L → M and 𝒞 be a partition of M. Then, R:=R(G,σ,𝒞) is an equivalence relation and every equivalence class of R, or short R-class, is contained in some connected component of G. In particular, each connected component of G is the disjoint union of R-classes.

Proof

It is easy to see that R is reflexive and symmetric. Moreover, xy,yz ∈ R implies that σ(x),σ(y),σ(z) must be contained in the same set of the partition 𝒞, and x, y, z must be contained in the same connected component of G. Therefore, xz ∈ R and thus, R is transitive. In summary, R is an equivalence relation.

We continue with showing that every R-class K is entirely contained in some connected component of G. Clearly, there is a connected component C of G such that C ∩ K ≠ ∅. Assume, for contradiction, that K ⊈ C. Hence, G must be disconnected and, in particular, there is a second connected component C′ ≠ C of G such that C′ ∩ K ≠ ∅. Hence, there is a pair x,y ∈ K such that x ∈ C ∩ K and y ∈ C′ ∩ K. But then x and y are in different connected components of G, violating the definition of R; a contradiction. Hence, every R-class is entirely contained in some connected component of G. This and the fact that the R-classes are disjoint implies that each connected component of G is the disjoint union of R-classes.

The following partition of the leaf sets of subtrees of a tree S rooted at some vertex u ∈ V(S) will be useful:

$$\text{If } u \text{ is not a leaf, then } C_S(u) := \{\, L(S(v)) \mid v \in \mathrm{child}_S(u) \,\} \text{ and, otherwise, } C_S(u) := \{\{u\}\}.$$

One easily verifies that, in both cases, CS(u) yields a valid partition of the leaf set L(S(u)). Recall that σ|L′,M′: L′ → M′ was defined as the “submap” of σ with L′ ⊆ L and σ(L′) ⊆ M′ ⊆ M.
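
As a small illustration of the partition CS(u), the sketch below computes it for a rooted tree stored as a dictionary of child lists. The representation, the function names `leaves_below` and `leaf_partition`, and the example tree are assumptions chosen for illustration.

```python
def leaves_below(children, u):
    """Leaf set L(S(u)) of the subtree rooted at u."""
    if not children.get(u):
        return {u}
    result = set()
    for v in children[u]:
        result |= leaves_below(children, v)
    return result

def leaf_partition(children, u):
    """The partition C_S(u): one block per child of u, or {{u}} if u is a leaf."""
    if not children.get(u):
        return [{u}]
    return [leaves_below(children, v) for v in children[u]]

# Hypothetical species tree ((A,B),C) with inner vertices "r" (root) and "v".
S_children = {"r": ["v", "C"], "v": ["A", "B"]}
root_blocks = leaf_partition(S_children, "r")
```

For the root of this tree the blocks are {A,B} and {C}; for the leaf A the partition is {{A}}. In both cases the blocks are disjoint and their union is L(S(u)).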

Lemma 10

Let (G=(L,E),σ) be a properly colored cograph. Suppose that the triple set S(G,σ) is compatible and let S be a tree on M that displays S(G,σ). Moreover, let L′ ⊆ L and u ∈ V(S) such that σ(L′) ⊆ L(S(u)). Finally, set R:=R(G[L′],σ|L′,L(S(u)),CS(u)).

Then, for all distinct R-classes K and K′, either xy ∈ E for all x ∈ K and y ∈ K′, or xy ∉ E for all x ∈ K and y ∈ K′. In particular, for x ∈ K and y ∈ K′, it holds that

$$xy \in E \iff K, K' \text{ are contained in the same connected component of } G[L'].$$
Proof

Let σ: L → M and put 𝔖:=S(G,σ). Since 𝔖 is a compatible triple set on M, there is a tree S on M that displays 𝔖. Moreover, the condition σ(L′) ⊆ L(S(u)) ⊆ M together with the fact that CS(u) is a partition of L(S(u)) ensures that R is well-defined.

Now suppose that K and K′ are distinct R-classes. As a consequence of Lemma 9, we have exactly the two cases: either (i) K and K′ are contained in the same connected component C of G[L′] or (ii) K ⊆ C and K′ ⊆ C′ for distinct components C and C′ of G[L′].

Case (i). Assume, for contradiction, that there are two vertices x ∈ K and y ∈ K′ with xy ∉ E. Note that C ⊆ L′ and thus, G[C] is an induced subgraph of G[L′]. By Proposition 2, both induced subgraphs G[L′] and G[C] are cographs. Now we can again apply Proposition 2 to conclude that diam(G[C]) ≤ 2. Hence, there is a vertex z ∈ C such that xz,zy ∈ E. Since x and y are in distinct classes of R but in the same connected component C of G[L′], σ(x) and σ(y) must lie in distinct sets of CS(u). In particular, it must hold that σ(x) ≠ σ(y). The fact that G[L′] is properly colored together with xz,yz ∈ E implies that σ(z) ≠ σ(x),σ(y). By definition and since G[L′] is an induced subgraph of G, we obtain that σ(x)σ(y)|σ(z) ∈ S(G,σ). In particular, σ(x)σ(y)|σ(z) is displayed by S. Since σ(x) and σ(y) lie in distinct sets of CS(u), u must be an inner vertex, and we have σ(x) ∈ L(S(v)) and σ(y) ∈ L(S(v′)) for distinct v,v′ ∈ childS(u). In particular, it must hold that lcaS(σ(x),σ(y))=u. Moreover, z ∈ C ⊆ L′ and σ(L′) ⊆ L(S(u)) imply that σ(z) ∈ L(S(u)). Taken together, the latter two arguments imply that S cannot display the triple σ(x)σ(y)|σ(z); a contradiction.

Case (ii). By assumption, the R-classes K and K′ are in distinct connected components of G[L′], which immediately implies xy ∉ E for all x ∈ K, y ∈ K′.

In summary, either xy ∈ E for all x ∈ K and y ∈ K′, or xy ∉ E for all x ∈ K and y ∈ K′. Moreover, Case (i) establishes the if-direction and Case (ii) establishes, by means of contraposition, the only-if-direction of the final statement.

Lemma 10 suggests a recursive strategy to construct a relaxed scenario S=(T,S,σ,μ,τT,τS) for a given properly-colored cograph (G,σ), which is outlined in the main part of this paper and described more formally in Algorithm 1. We proceed by proving the correctness of Algorithm 1.

Theorem 2

Let (G,σ) be a properly colored cograph, and assume that the triple set S(G,σ) is compatible. Then Algorithm 1 returns a relaxed scenario S=(T,S,σ,μ,τT,τS) such that G<(S)=G in polynomial time.

Proof

Let σ: L → M and put 𝔖:=S(G,σ). By a slight abuse of notation, we will simply write μ and τT also for restrictions to subsets of V(T). Observe first that due to Line 1, the algorithm continues only if (G,σ) is a properly colored cograph and 𝔖 is compatible, and returns a tuple S=(T,S,σ,μ,τT,τS) in this case. In particular, a tree S on M that displays 𝔖 exists, and can e.g. be constructed using BUILD (Line 1). By Lemma 1, we can always construct a time map τS for S satisfying τS(x)=0 for all x ∈ L(S) (Line 2). By definition, τS(y)>τS(x) must hold for every edge (y,x) ∈ E(S), and thus, we obtain ϵ>0 in Line 3. Moreover, the recursive function BuildGeneTree maintains the following invariant:

Claim 4

In every recursion step of the function BuildGeneTree, we have σ(L′) ⊆ L(S(uS)).

Proof

Since S (with root ρS) is a tree on M by construction and thus L(S(ρS))=M, the statement holds for the top-level recursion step on L and ρS. Now assume that the statement holds for an arbitrary step on L′ and uS. If uS is a leaf, there are no deeper recursion steps. Thus assume that uS is an inner vertex. Recall that CS(uS) is a partition of L(S(uS)) (by construction), and that R=R(G[L′],σ|L′,L(S(uS)),CS(uS)) is an equivalence relation (by Lemma 9). This together with the definition of R and σ(L′) ⊆ L(S(uS)) implies that, for every R-class K, there is a child vS ∈ childS(uS) such that σ(K) ⊆ L(S(vS)). In particular, therefore, the statement is true for all recursive calls on K and vS in Line 21. Repeating this argument top-down along the recursion hierarchy proves the claim.

Note that we are in the else-condition in Line 13 only if uS is not a leaf. Therefore and as a consequence of Claim 4 and by similar arguments as in its proof, there is a vertex v*S ∈ childS(uS) such that σ(C) ∩ L(S(v*S)) ≠ ∅ for every connected component C of G[L′] in Line 17, and a vertex vS ∈ childS(uS) such that σ(K) ⊆ L(S(vS)) for every R-class K in Line 20. Moreover, parS(uS) is always defined since we have uS=ρS and thus parS(uS)=0S in the top-level recursion step, and recursively call the function BuildGeneTree on vertices vS such that vS ≺S uS.

In summary, all assignments are well-defined in every recursion step. It is easy to verify that the algorithm terminates since, in each recursion step, we either have that uS is a leaf, or we recurse on vertices vS that lie strictly below uS. We argue that the resulting tree T′ is a not necessarily phylogenetic tree on L by observing that, in each step, each x ∈ L′ is either attached to the tree as a leaf if uS is a leaf, or, since R forms a partition of L′ by Lemma 9, passed down to a recursion step on K for some R-class K. Nevertheless, T′ is turned into a phylogenetic tree T by suppression of degree-two vertices in Line 25. Finally, μ(x) and τT(x) are assigned for all vertices x ∈ L(T′)=L in Line 11, and for all newly created inner vertices in Lines 7 and 18.

Recall that τS is a valid time map satisfying τS(x)=0 for all x ∈ L(S) by construction. Before we continue to show that S is a relaxed scenario, we first show that the conditions for time maps and time consistency are satisfied for (T′,τT,S,τS,μ):

Claim 5

For all x,y ∈ V(T′) with x ≺T′ y, we have τT(x)<τT(y). Moreover, for all x ∈ V(T′), the following statements are true:

  • (i) if μ(x) ∈ V(S), then τT(x)=τS(μ(x)), and

  • (ii) if μ(x)=(a,b) ∈ E(S), then τS(b)<τT(x)<τS(a).

Proof

Recall that we always write an edge (u,v) of a tree T such that v ≺T u. For the first part of the statement, it suffices to show that τT(x)<τT(y) holds for every edge (y,x) ∈ E(T′), and thus to consider all vertices x ≠ ρT′ in T′ and their unique parent, which will be denoted by y in the following. Likewise, we have to consider all vertices x ∈ V(T′) including the root to show the second statement. The root ρT′ of T′ corresponds to the vertex uT created in Line 6 in the top-level recursion step on L and ρS. Hence, we have μ(ρT′)=(parS(ρS)=0S,ρS) ∈ E(S) and τT(ρT′)=τS(ρS)+ϵ (cf. Line 7). Therefore, we have to show (ii). Since ϵ>0, it holds that τS(ρS)<τT(ρT′). Moreover, τS(0S)-τS(ρS) ≥ 3ϵ holds by construction, and thus τS(0S)-(τT(ρT′)-ϵ) ≥ 3ϵ and τS(0S)-τT(ρT′) ≥ 2ϵ, which together with ϵ>0 implies τT(ρT′)<τS(0S).

We now consider the remaining vertices x ∈ V(T′)\{ρT′}. Every such vertex x is introduced into T′ in some recursion step on L′ and uS in one of the Lines 6, 10, 15 or 21. There are exactly the following three cases: (a) x ∈ L(T′) is a leaf attached to some inner vertex uT in Line 10, (b) x=vT as created in Line 15, and (c) x=wT as assigned in Line 21. Note that if x=uT as created in Line 6, then uT is either the root of T′, or equals a vertex wT as assigned in Line 21 in the “parental” recursion step.

In Case (a), we have that x ∈ L(T′) is a leaf and attached to some inner vertex y=uT. Since uS must be a leaf in this case, and thus τS(uS)=0, we have τT(y)=0+ϵ=ϵ and τT(x)=0 (cf. Lines 7 and 11). Since ϵ>0, this implies τT(x)<τT(y). Moreover, we have μ(x)=σ(x) ∈ L(S) ⊂ V(S) (cf. Line 11), and thus have to show Subcase (i). Since uS is a leaf and σ(L′) ⊆ L(S(uS)), we conclude σ(x)=uS. Thus we obtain τT(x)=0=τS(uS)=τS(μ(x)).

In Case (b), we have x=vT as created in Line 15, and x is attached as a child to some vertex y=uT created in the same recursion step. Thus, we have τT(y)=τS(uS)+ϵ and τT(x)=τS(uS)-ϵ (cf. Lines 7 and 18). Therefore and since ϵ>0, it holds τT(x)<τT(y). Moreover, we have μ(x)=(uS,v*S) ∈ E(S) for some v*S ∈ childS(uS). Hence, we have to show Subcase (ii). By a similar calculation as before, ϵ>0, τS(uS)-τS(v*S) ≥ 3ϵ and τT(x)=τS(uS)-ϵ imply τS(v*S)<τT(x)<τS(uS).

In Case (c), x=wT as assigned in Line 21 is equal to uT as created in Line 6 in some next-deeper recursion step with u′S ∈ childS(uS). Thus, we have τT(x)=τS(u′S)+ϵ and μ(x)=(uS,u′S) ∈ E(S) (cf. Line 7). Moreover, x is attached as a child of some vertex y=vT as created in Line 15. Thus, we have τT(y)=τS(uS)-ϵ. By construction and since (uS,u′S) ∈ E(S), we have τS(uS)-τS(u′S) ≥ 3ϵ. Therefore, (τT(y)+ϵ)-(τT(x)-ϵ) ≥ 3ϵ and thus τT(y)-τT(x) ≥ ϵ. This together with ϵ>0 implies τT(x)<τT(y). Moreover, since μ(x)=(uS,u′S) ∈ E(S) for some u′S ∈ childS(uS), we have to show Subcase (ii). By a similar calculation as before, ϵ>0, τS(uS)-τS(u′S) ≥ 3ϵ and τT(x)=τS(u′S)+ϵ imply τS(u′S)<τT(x)<τS(uS).
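
The calculations in this proof all rest on the same gap condition. The following worked chain collects them in one place, under the assumption (used in the proof as holding "by construction") that τS(a)-τS(b) ≥ 3ϵ for every edge (a,b) ∈ E(S) and ϵ>0. The generic edge name (a,b) is introduced here for illustration only.

```latex
% Assumption: \tau_S(a) - \tau_S(b) \ge 3\epsilon for every edge (a,b) \in E(S), and \epsilon > 0.
\begin{align*}
&\text{vertex } u_T \text{ with } \mu(u_T)=(a,b),\ \tau_T(u_T)=\tau_S(b)+\epsilon: \\
&\qquad \tau_S(b) \;<\; \tau_S(b)+\epsilon \;\le\; \tau_S(a)-2\epsilon \;<\; \tau_S(a), \\[4pt]
&\text{vertex } v_T \text{ with } \mu(v_T)=(a,b),\ \tau_T(v_T)=\tau_S(a)-\epsilon: \\
&\qquad \tau_S(b) \;\le\; \tau_S(a)-3\epsilon \;<\; \tau_S(a)-\epsilon \;<\; \tau_S(a), \\[4pt]
&\text{parent } v_T \text{ and child } u_T \text{ as in Case (c), with } u'_S \in \mathrm{child}_S(u_S): \\
&\qquad \tau_T(v_T)-\tau_T(u_T) \;=\; \bigl(\tau_S(u_S)-\epsilon\bigr)-\bigl(\tau_S(u'_S)+\epsilon\bigr)
   \;=\; \tau_S(u_S)-\tau_S(u'_S)-2\epsilon \;\ge\; \epsilon \;>\; 0.
\end{align*}
```

The first two lines give Subcase (ii) for both kinds of inner vertices, and the third line gives the strict decrease of τT along the edge from vT to its child.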

Claim 6

S=(T,S,σ,μ,τT,τS) is a relaxed scenario.

Proof

The tree T is obtained from T′ by first adding a planted root 0T (and connecting it to the original root) and then suppressing all inner vertices except 0T that have only a single child in Line 25. In particular, T is a planted phylogenetic tree by construction. The root constraint (G0) μ(x)=0S if and only if x=0T also holds by construction (cf. Line 26). Since we clearly have not contracted any outer edges (y,x), i.e. with x ∈ L(T′), we conclude that L(T)=L(T′)=L. As argued before, we have τT(x)=0 and μ(x)=σ(x) whenever x ∈ L(T′)=L(T) (cf. Line 11). Since all other vertices are either 0T or mapped by μ to some edge of S (cf. Lines 26, 7 and 18), the leaf constraint (G1) μ(x)=σ(x) is satisfied if and only if x ∈ L(T).

By construction, we have V(T)\{0T} ⊆ V(T′). Moreover, suppression of vertices clearly preserves the ≺-relation between all vertices x,y ∈ V(T)\{0T}. Together with Claim 5, this implies τT(x)<τT(y) for all vertices x,y ∈ V(T)\{0T} with x ≺T y. For the single child ρT of 0T in T, we have τT(ρT) ≤ τS(ρS)+ϵ where equality holds if the root of T′ was not suppressed and thus is equal to ρT. Moreover, τT(0T)=τS(0S) and τS(0S)-τS(ρS) ≥ 3ϵ hold by construction. Taken together, the latter two arguments imply that τT(ρT)<τT(0T). In particular, we obtain τT(x)<τT(y) for all vertices x,y ∈ V(T) with x ≺T y. Hence, τT is a time map for T, which, moreover, satisfies τT(x)=0 for all x ∈ L(T).

To show that S=(T,S,σ,μ,τT,τS) is a relaxed scenario, it remains to show that μ is time-consistent with the time maps τT and τS. In case x ∈ L(T) ⊂ V(T), we have μ(x)=σ(x) ∈ L(S) ⊂ V(S) and thus τT(x)=0=τS(σ(x))=τS(μ(x)). For 0T, we have τT(0T)=τS(0S)=τS(μ(0T)). The latter two arguments imply that all vertices x ∈ L(T) ∪ {0T} satisfy (C1) in Definition 4. The remaining vertices of T are all vertices of T′ as well. In particular, they are all inner vertices that are mapped to some edge of S (cf. Lines 7 and 18). The latter two arguments together with Claim 5 imply that, for all vertices x ∈ V(T)\(L(T) ∪ {0T}), we have μ(x)=(a,b) ∈ E(S) and τS(b)<τT(x)<τS(a). Therefore, every such vertex satisfies (C2) in Definition 4. It follows that the time consistency constraint (G2) is also satisfied, and thus S is a relaxed scenario.

Claim 7

Every vertex v ∈ V0(T) was either created in Line 6 or in Line 15. In particular, it holds for all x,y ∈ L(T) with lcaT(x,y)=v:

  1. If v was created in Line 6, then xy ∉ E(G) and xy ∉ E(G<(S)).

  2. If v was created in Line 15, then xy ∈ E(G) and xy ∈ E(G<(S)).

Furthermore, G is a cograph with cotree (T,t) where t(v)=0 if v was created in Line 6 and t(v)=1, otherwise.

Proof

Since T is phylogenetic, every vertex v ∈ V0(T) is the last common ancestor of two leaves x,y ∈ L:=L(T). Let v ∈ V0(T) be arbitrary and choose arbitrary leaves x,y ∈ L such that lcaT(x,y)=v. Since v ∈ V0(T), the leaves x and y must be distinct.

Note that v ∉ L(T) ∪ {0T}, and thus, v is also an inner vertex in T′. Therefore, we have exactly the two cases (1) v=uT is created in Line 6, and (2) v=vT is created in Line 15. Similarly as before, the case that v=wT is assigned in Line 21 is covered by Case (1), since, in this case, wT is created in a deeper recursion step.

We consider the recursion step on L′ and uS, in which v was created. Clearly, it must hold that x,y ∈ L′. Before we continue, set R:=R(G[L′],σ|L′,L(S(uS)),CS(uS)) as in Line 13. Note, since S is a relaxed scenario, the graph (G<(S),σ) is well-defined.

For Statement (1), suppose that v=uT was created in Line 6. Hence, we have the two cases (i) the vertex uS of S in this recursion step is a leaf, and (ii) uS is an inner vertex. In Case (i), we have L(S(uS))={uS}. Together with Claim 4 and σ(x),σ(y) ∈ σ(L′), this implies σ(x)=σ(y)=uS. By assumption, (G,σ) is properly colored. By Proposition 3, (G<(S),σ) must be properly colored as well. Hence, we conclude that xy ∉ E(G) and xy ∉ E(G<(S)), respectively. In Case (ii), uS is not a leaf. Therefore, lcaT(x,y)=v=uT is only possible if x and y lie in distinct connected components of G[L′]. This immediately implies xy ∉ E(G). Moreover, we have σ(x),σ(y) ∈ L(S(uS)) and thus lcaS(σ(x),σ(y)) ⪯S uS. Since τS is a time map for S, it follows that τS(lcaS(σ(x),σ(y))) ≤ τS(uS). Together with τT(uT)=τS(uS)+ϵ (cf. Line 7) and ϵ>0, this implies τS(lcaS(σ(x),σ(y))) < τT(v)=τT(lcaT(x,y)). Hence, xy ∉ E(G<(S)).

For Statement (2), suppose that v=vT was created in Line 15. Therefore, lcaT(x,y)=v=vT is only possible if x and y lie in the same connected component of G[L′] but in distinct R-classes. Now, we can apply Lemma 10 to conclude that xy ∈ E(G). Moreover, the fact that x and y lie in the same connected component of G[L′] but in distinct R-classes implies that σ(x) and σ(y) lie in distinct sets of CS(uS). Hence, there are distinct vS,v′S ∈ childS(uS) such that σ(x) ⪯S vS and σ(y) ⪯S v′S. In particular, lcaS(σ(x),σ(y))=uS. In Line 18, we assign τT(lcaT(x,y))=τT(vT)=τS(uS)-ϵ. Together with ϵ>0, the latter two arguments imply τT(lcaT(x,y))<τS(uS)=τS(lcaS(σ(x),σ(y))). Therefore, we have xy ∈ E(G<(S)).

By the latter arguments, the cotree (T,t) as defined above is well-defined and, for all v ∈ V0(T), we have t(v)=1 if and only if xy ∈ E(G) for all x,y ∈ L with lcaT(x,y)=v. Hence, (T,t) is a cotree for G.
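
The correspondence between a cotree (T,t) and its cograph used in Claim 7 can be made concrete: two leaves are adjacent exactly when their last common ancestor carries label 1. The sketch below rebuilds the edge set from a labeled tree; the child-list representation, the vertex names, and the function name `cograph_from_cotree` are assumptions for illustration.

```python
from itertools import combinations

def cograph_from_cotree(children, label, root):
    """Edge set {x, y} over all leaf pairs whose last common ancestor has label 1."""
    edges = set()

    def collect(u):
        kids = children.get(u, [])
        if not kids:
            return [u]
        leaf_sets = [collect(v) for v in kids]
        if label[u] == 1:
            # u is the lca of any two leaves taken from different child subtrees
            for first, second in combinations(leaf_sets, 2):
                edges.update(frozenset((x, y)) for x in first for y in second)
        return [x for leaves in leaf_sets for x in leaves]

    collect(root)
    return edges

# Hypothetical cotree of the path a - b - c plus an isolated vertex d:
# root n_root (label 0) -> {d, n_join}; n_join (label 1) -> {b, n_par};
# n_par (label 0) -> {a, c}.
children = {"n_root": ["d", "n_join"], "n_join": ["b", "n_par"], "n_par": ["a", "c"]}
label = {"n_root": 0, "n_join": 1, "n_par": 0}
G_edges = cograph_from_cotree(children, label, "n_root")
```

Here the only label-1 vertex joins b with the leaves a and c, which gives exactly the edges ab and bc.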

Claim 8

The relaxed scenario S satisfies G<(S)=G.

Proof

Since L(T)=L, the two undirected graphs G<(S) and G have the same vertex set. By Claim 7, we have, for all distinct x,y ∈ L, either xy ∉ E(G) and xy ∉ E(G<(S)), or xy ∈ E(G) and xy ∈ E(G<(S)).

Together, Claims 6 and 8 imply that Algorithm 1 returns a relaxed scenario S=(T,S,σ,μ,τT,τS) with coloring σ such that G<(S)=G.

To see that Algorithm 1 runs in polynomial time, we first note that the function BuildGeneTree() operates in polynomial time. This is clear for the setup and the if part. The construction of R in the else part involves the computation of connected components and the evaluation of Definition 12, both of which can be achieved in polynomial time. This is also true for the comparisons of color classes required to identify v*S and vS. Since the sets K in recursive calls of BuildGeneTree() form a partition of L′, and the vS are children of uS in S and the depth of the recursion is bounded by O(|L(S)|), the total effort remains polynomial.

Theorem 3

A graph (G,σ) is an LDT graph if and only if it is a properly colored cograph and S(G,σ) is compatible.

Proof

By Lemmas 6 and 8, if (G,σ) is an LDT graph then it is a properly colored cograph and S(G,σ) is compatible. Now suppose that (G,σ) is a properly colored cograph and S(G,σ) is compatible. Then, by Theorem 2, Algorithm 1 outputs a relaxed scenario S=(T,S,σ,μ,τT,τS) such that G<(S)=G. By definition, this in particular implies that (G,σ) is an LDT graph.

Corollary 2

LDT graphs can be recognized in polynomial time.

Proof

Cographs can be recognized in linear time (Corneil et al. 1981b), the proper coloring can be verified in linear time, the triple set S(G,σ) contains not more than |V(G)|·|E(G)| triples and can be constructed in O(|V(G)|·|E(G)|) time, and compatibility of S(G,σ) can be checked in O(min(|S(G,σ)| log² |V(G)|, |S(G,σ)| + |V(G)|² ln |V(G)|)) time (Jansson et al. 2005).
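
The characterization in Theorem 3 translates into a three-step test. The sketch below is a deliberately naive illustration and does not reach the running times cited above: it searches for an induced P4 by brute force instead of using a linear-time cograph recognizer, and it checks compatibility with a plain BUILD-style recursion on Aho graphs. All function names are assumptions; the first input is the six-vertex example discussed after Lemma 8, with a′, b′, c′ written as a2, b2, c2.

```python
from itertools import combinations, permutations

def has_induced_p4(vertices, adj):
    """Brute-force search for an induced path on four vertices."""
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            if (adj(a, b) and adj(b, c) and adj(c, d)
                    and not adj(a, c) and not adj(a, d) and not adj(b, d)):
                return True
    return False

def color_triples(vertices, adj, color):
    """S(G, sigma), each triple AB|C stored as (frozenset({A, B}), C)."""
    S = set()
    for x, y in combinations(vertices, 2):
        if adj(x, y):
            continue
        for z in vertices:
            if z not in (x, y) and adj(x, z) and adj(y, z):
                if len({color[x], color[y], color[z]}) == 3:
                    S.add((frozenset((color[x], color[y])), color[z]))
    return S

def is_compatible(triples, leaves):
    """BUILD-style test: recurse on the connected components of the Aho graph."""
    leaves = set(leaves)
    if len(leaves) <= 2:
        return True
    # Keep only triples whose three leaves all lie in the current leaf set.
    local = [(pair, z) for pair, z in triples if pair <= leaves and z in leaves]
    parent = {v: v for v in leaves}  # union-find over the Aho graph

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for pair, _ in local:
        x, y = tuple(pair)
        parent[find(x)] = find(y)
    components = {}
    for v in leaves:
        components.setdefault(find(v), set()).add(v)
    if len(components) == 1:
        return False  # connected Aho graph on at least three leaves
    return all(is_compatible(local, comp) for comp in components.values())

def is_ldt_graph(vertices, edges, color):
    """Theorem 3: properly colored cograph with compatible color triples."""
    E = {frozenset(e) for e in edges}

    def adj(u, v):
        return frozenset((u, v)) in E

    if any(color[u] == color[v] for u, v in edges):
        return False
    if has_induced_p4(vertices, adj):
        return False
    return is_compatible(color_triples(vertices, adj, color),
                         {color[v] for v in vertices})

V = ["a", "a2", "b", "b2", "c", "c2"]
E = [("a", "b"), ("b", "c"), ("a2", "b2"), ("a2", "c2")]
sigma = {"a": "A", "a2": "A", "b": "B", "b2": "B", "c": "C", "c2": "C"}
example_is_ldt = is_ldt_graph(V, E, sigma)
path_is_ldt = is_ldt_graph(["a", "b", "c"], E[:2], sigma)
```

The six-vertex example is rejected because its color triples AC|B and BC|A make the Aho graph on {A,B,C} connected, whereas the single path a-b-c passes all three checks.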

Corollary 3

The property of being an LDT graph is hereditary, that is, if (G,σ) is an LDT graph then each of its vertex induced subgraphs is an LDT graph.

Proof

Let (G=(V,E),σ) be an LDT graph. It suffices to show that (G-x,σ|V\{x}) is an LDT graph, where G-x is obtained from G by removing x ∈ V and all its incident edges. By Proposition 2, G-x is a cograph that clearly remains properly colored. Moreover, every induced path on three vertices in G-x is also an induced path on three vertices in G. This implies that if AB|C ∈ S′:=S(G-x,σ|V\{x}), then AB|C ∈ S(G,σ). Hence, S′ ⊆ S(G,σ). By Theorem 3, S(G,σ) is compatible. Hence, any tree that displays all triples in S(G,σ), in particular, displays all triples in S′. Therefore, S′ is compatible. In summary, (G-x,σ|V\{x}) is a properly colored cograph and S′ is compatible. By Theorem 3 it is an LDT graph.

The relaxed scenarios S explaining an LDT graph (G,σ) are far from being unique. In fact, we can choose from a large set of trees (S,τS) that is determined only by the triple set S(G,σ):

Corollary 4

If (G=(L,E),σ) is an LDT graph with coloring σ: L → M, then for all planted trees S on M that display S(G,σ) there is a relaxed scenario S=(T,S,σ,μ,τT,τS) that contains σ and S and that explains (G,σ).

Proof

If (G,σ) is an LDT graph, then the species tree S assigned in Line 1 in Algorithm 1 is an arbitrary tree on M displaying S(G,σ).

Corollary 5

If (G,σ) is an LDT graph, then there exists a relaxed scenario S=(T,S,σ,μ,τT,τS) explaining (G,σ) such that T displays the discriminating cotree TG of G.

Proof

Suppose that (G,σ) is an LDT graph. By Theorem 3, (G,σ) must be a properly colored cograph and S(G,σ) is compatible. Hence, Theorem 2 implies that Algorithm 1 constructs a relaxed scenario S=(T,S,σ,μ,τT,τS) explaining (G,σ). In particular, the tree T together with labeling t as specified in Claim 7 is a cotree for G. Since the unique discriminating cotree (TG,t̂) of G is obtained from any other cotree by contraction of edges in T, the tree T must display TG.

Although Corollary 5 implies that there is always a relaxed scenario S where the tree T displays the discriminating cotree TG of G=G<(S), this is not true for all relaxed scenarios S with G=G<(S). Figure 16 shows a relaxed scenario S=(T,S,σ,μ,τT,τS) with G=G<(S) for which T does not display TG.

Fig. 16.

A relaxed scenario S (A) with gene tree T (B) and its associated graph (G<(S),σ) (C). The discriminating cotree TG<(S) (D) is not displayed by T

Corollary 5 enables us to relate connectedness of LDT graphs to properties of the relaxed scenarios by which they can be explained.

Lemma 11

An LDT graph (G=(L,E),σ) with |L|>1 is connected if and only if for every relaxed scenario S=(T,S,σ,μ,τT,τS) that explains (G,σ), we have τT(ρT)<τS(lcaS(σ(L))).

Proof

By contraposition, suppose first that there is a relaxed scenario S=(T,S,σ,μ,τT,τS) that explains (G,σ) such that τT(ρT) ≥ τS(lcaS(σ(L))). Since |L(T)|=|L|>1, the root ρT is not a leaf. To show that G is disconnected we consider two distinct children v,w ∈ child(ρT) of the root and leaves x ∈ L(T(v)) and y ∈ L(T(w)) and verify that x and y cannot be adjacent in G. If σ(x)=σ(y), then xy ∉ E since (G,σ) is properly colored (cf. Lemma 8). Hence, suppose that σ(x) ≠ σ(y). By construction, lcaT(x,y)=ρT and thus, by assumption, τT(lcaT(x,y))=τT(ρT) ≥ τS(lcaS(σ(L))). Now lcaS(σ(L)) ⪰S lcaS(σ(x),σ(y)) implies that τS(lcaS(σ(L))) ≥ τS(lcaS(σ(x),σ(y))) and thus, τT(lcaT(x,y)) ≥ τS(lcaS(σ(x),σ(y))). Hence, xy ∉ E. Consequently, for all distinct children v,w ∈ child(ρT), none of the vertices in L(T(v)) are adjacent to any of the vertices in L(T(w)) and thus, G is disconnected.

Conversely, suppose that G is disconnected. We consider Algorithm 1 with input (G,σ). By Theorems 2 and 3, the algorithm constructs a relaxed scenario S=(T,S,σ,μ,τT,τS) that explains (G,σ). Consider the top-level recursion step on L and ρS. Since G is disconnected, the vertex uT created in Line 6 of this step equals the root ρT of the final tree T. To see this, assume first that ρS is a leaf. Then, we attach the |L|>1 elements in L as leaves to uT (cf. Line 10). Now assume that ρS is not a leaf. Since G[L]=G has at least two components, we attach at least two vertices vT created in Line 15 to uT. Hence uT is not suppressed in Line 25 and thus ρT=uT. By construction, therefore, we have τT(ρT)=τT(uT)=τS(uS)+ϵ=τS(ρS)+ϵ for some ϵ>0. From ρS ⪰S lcaS(σ(L)) and the definition of time maps, we obtain τS(ρS) ≥ τS(lcaS(σ(L))). Therefore, we have τT(ρT) ≥ τS(lcaS(σ(L)))+ϵ > τS(lcaS(σ(L))). We have thus shown that if all relaxed scenarios S=(T,S,σ,μ,τT,τS) that explain (G,σ) satisfy τT(ρT) ≤ τS(lcaS(σ(L))), then (G,σ) must be connected. However, τT(ρT)=τS(lcaS(σ(L))) cannot occur, since we can reuse the same arguments as in the beginning of this proof to show that, in this case, G is disconnected. This completes the proof.

Least resolved trees for LDT graphs

As we have seen e.g. in Corollary 4, there are in general many trees S and T forming relaxed scenarios S that explain a given LDT graph (G,σ). This raises the question to what extent these trees are determined by “representatives”. For S, we have seen that S always displays S(G,σ), which suggests considering the role of S=Aho(S(G,σ),M). This tree is least resolved in the sense that there is no relaxed scenario explaining the LDT graph (G,σ) with a tree S′ that is obtained from S by edge-contractions. The latter is due to the fact that any edge contraction in Aho(S(G,σ),M) yields a tree S′ that does not display S(G,σ) any more (Jansson et al. 2012). By Proposition 6, none of the relaxed scenarios containing S′ explain the LDT graph (G,σ).

Definition 13

Let S=(T,S,σ,μ,τT,τS) be a relaxed scenario explaining the LDT graph (G,σ). The planted tree T is least resolved for (G,σ) if no relaxed scenario (T′,S′,σ,μ′,τ′T,τ′S) with T′<T explains (G,σ).

In other words, T is least resolved for (G,σ) if no scenario with a gene tree T′ obtained from T by a series of edge contractions explains (G,σ).

As outlined in the main part of this paper, the examples in Fig. 3 show that LDT graphs are in general not accompanied by unique least resolved trees and the example in Fig. 4 shows that the unique discriminating cotree TG of an LDT graph (G,σ) is not always “sufficiently resolved”.

Horizontal gene transfer and Fitch graphs

HGT-labeled trees and rs-Fitch graphs

As alluded to in the introduction, LDT graphs are closely related to horizontal gene transfer. To formalize this connection we first define transfer edges. These will then be used to encode Walter Fitch’s concept of xenologous gene pairs (Fitch 2000; Darby et al. 2017) as a binary relation, and thus, the edge set of a graph.

Definition 14

Let S=(T,S,σ,μ,τT,τS) be a relaxed scenario. An edge (u,v) in T is a transfer edge if μ(u) and μ(v) are incomparable in S. The HGT-labeling of T in S is the edge labeling λS: E(T) → {0,1} with λS(e)=1 if and only if e is a transfer edge.

The vertex u in T thus corresponds to an HGT event, with v denoting the subsequent event, which now takes place in the “recipient” branch of the species tree. Note that λS is completely determined by S. In general, for a given gene tree T, HGT events correspond to a labeling or coloring of the edges of T.
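
Definition 14 can be turned into a short labeling routine once comparability in S is fixed. The sketch below assumes that an edge (a,b) of S is placed strictly between a and its child b in the ancestor order, so that it is comparable to exactly those vertices and edges that are comparable to b; this convention, the dictionary encoding of S and μ, and all names are assumptions made for illustration. Vertices of S must not themselves be tuples, since tuples denote edges.

```python
def is_ancestor_or_equal(parent, anc, v):
    """True if `anc` lies on the path from v up to the root of S, or anc == v."""
    while v is not None:
        if v == anc:
            return True
        v = parent.get(v)
    return False

def comparable(parent, p, q):
    """Comparability of two images of mu, each a vertex or an edge (a, b) of S."""
    def lower(e):
        # Assumption: an edge (a, b) behaves like a point just above its child b.
        return e[1] if isinstance(e, tuple) else e

    p, q = lower(p), lower(q)
    return is_ancestor_or_equal(parent, p, q) or is_ancestor_or_equal(parent, q, p)

def hgt_labeling(gene_edges, mu, species_parent):
    """Label each gene-tree edge (u, v) with 1 iff mu(u) and mu(v) are incomparable."""
    return {(u, v): 0 if comparable(species_parent, mu[u], mu[v]) else 1
            for (u, v) in gene_edges}

# Hypothetical scenario: species tree with planted root "0", root "r", leaves A and B.
species_parent = {"r": "0", "A": "r", "B": "r"}
# Gene tree edges g -> x, g -> h, h -> y; g lives on the branch above A,
# h on the branch above B, and the leaves x, y reside in A and B.
gene_edges = [("g", "x"), ("g", "h"), ("h", "y")]
mu = {"g": ("r", "A"), "h": ("r", "B"), "x": "A", "y": "B"}
labels = hgt_labeling(gene_edges, mu, species_parent)
```

In this toy scenario only the edge from g to h is a transfer edge, because the two sibling branches (r,A) and (r,B) are incomparable in S.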

Definition 15

(Fitch graph) Let (T,λ) be a tree T together with a map λ:E(T)→{0,1}. The Fitch graph ϝ(T,λ)=(V,E) has vertex set V:=L(T) and edge set

$$E := \{\, xy \mid x,y\in L(T),\ \text{the unique path connecting } x \text{ and } y \text{ in } T \text{ contains an edge } e \text{ with } \lambda(e)=1 \,\}.$$

By definition, Fitch graphs of 0/1-edge-labeled trees are loop-less and undirected. We call edges e of (T,λ) with label λ(e)=1 also 1-edges and, otherwise, 0-edges.
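Definition 15 translates directly into code. The sketch below is a minimal illustration, assuming a tree stored as a child-to-parent dictionary and edge labels stored in a dictionary keyed by (parent, child); these data structures and the function name are our choices for illustration. It uses the fact that the edges on the path between two leaves are exactly those lying on one, but not both, of the two leaf-to-root paths.

```python
from itertools import combinations

def fitch_graph(parent, label):
    """Edge set of the Fitch graph of a 0/1-edge-labeled rooted tree."""
    def up(v):
        # Edges on the path from v up to the root.
        out = []
        while v in parent:
            out.append((parent[v], v))
            v = parent[v]
        return out

    leaves = sorted(set(parent) - set(parent.values()))
    # Path edges between x and y = symmetric difference of the two root paths.
    return {frozenset((x, y)) for x, y in combinations(leaves, 2)
            if any(label[e] for e in set(up(x)) ^ set(up(y)))}

# Example tree: root r with children u and c; u has children a and b.
parent = {"u": "r", "c": "r", "a": "u", "b": "u"}
label = {("r", "u"): 1, ("r", "c"): 0, ("u", "a"): 0, ("u", "b"): 0}
edges = fitch_graph(parent, label)
```

Here the only 1-edge separates {a,b} from c, so the resulting graph is the complete bipartite graph with independent sets {a,b} and {c}.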

Remark 4

Fitch graphs as defined here have been termed undirected Fitch graphs (Hellmuth et al. 2018), in contrast to the notion of the directed Fitch graphs of 0/1-edge-labeled trees studied e.g. in Geiß et al. (2018) and Hellmuth and Seemann (2019).

Proposition 5

(Hellmuth et al. 2018; Zverovich 1999) The following statements are equivalent.

  1. G is the Fitch graph of a 0/1-edge-labeled tree.

  2. G is a complete multipartite graph.

  3. G does not contain K2+K1 as an induced subgraph.
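The equivalence of Statements 2 and 3 can be checked by brute force on small graphs. The following sketch is illustrative only (the function names are hypothetical). It tests for an induced K2+K1, i.e., three vertices with exactly one edge among them, and independently tries to recover the independent sets of a complete multipartite graph as the classes "v together with its non-neighbors".

```python
from itertools import combinations

def has_induced_k2_plus_k1(vertices, edges):
    """True if some vertex triple induces exactly one edge."""
    E = {frozenset(e) for e in edges}
    return any(sum(frozenset(p) in E for p in combinations(trio, 2)) == 1
               for trio in combinations(vertices, 3))

def independent_sets(vertices, edges):
    """Independent sets of a complete multipartite graph, or None otherwise."""
    E = {frozenset(e) for e in edges}
    classes = {frozenset(u for u in vertices
                         if u == v or frozenset((u, v)) not in E)
               for v in vertices}
    # The classes partition the vertex set iff non-adjacency is transitive.
    if sum(len(c) for c in classes) != len(vertices):
        return None
    return sorted(classes, key=sorted)

V = ["a", "b", "c"]
k2_plus_k1 = [("a", "b")]               # one edge plus an isolated vertex
k_2_1 = [("a", "c"), ("b", "c")]        # complete bipartite graph
```

On these two toy graphs both tests agree: the first graph is rejected by each of them, the second is accepted with independent sets {a,b} and {c}.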

A natural connection between LDT graphs and complete multipartite graphs is suggested by the definition of triple sets T(G), since each forbidden induced subgraph K2+K1 of a complete multipartite graph corresponds to a triple in an LDT graph. More precisely, we have:

Lemma 12

(G,σ) is a properly colored complete multipartite graph if and only if it is properly colored and T(G)=∅.

Proof

The equivalence between the statements can be seen by observing that G is a complete multipartite graph if and only if G does not contain an induced K2+K1 (cf. Proposition 5). By definition of T(G), this is the case if and only if T(G)=∅.

Definition 16

(rs-Fitch graph) Let S=(T,S,σ,μ,τT,τS) be a relaxed scenario with HGT-labeling λS. We call the vertex colored graph (ϝ(S),σ):=(ϝ(T,λS),σ) the Fitch graph of the scenario S.

A vertex colored graph (G,σ) is a relaxed scenario Fitch graph (rs-Fitch graph) if there is a relaxed scenario S=(T,S,σ,μ,τT,τS) such that G=ϝ(S).

Figure 5 shows that rs-Fitch graphs are not necessarily properly colored. A subtle difficulty arises from the fact that Fitch graphs of 0/1-edge-labeled trees are defined without a reference to the vertex coloring σ, while the rs-Fitch graph is vertex colored.

Observation 1

If (G,σ) is an rs-Fitch graph then G is a complete multipartite graph.

The “converse” of Observation 1 is not true in general, as we shall see in Theorem 6 below. If, however, the coloring σ can be chosen arbitrarily, then every complete multipartite graph G can be turned into an rs-Fitch graph (G,σ) as shown in Proposition 6.

Proposition 6

If G is a complete multipartite graph, then there exists a relaxed scenario S=(T,S,σ,μ,τT,τS) such that (G,σ) is an rs-Fitch graph.

Proof

Let G be a complete multipartite graph and set L:=V(G) and R:=E(G). If R=∅, then the relaxed scenario S constructed in the proof of Lemma 4 shows that E(G)=E(ϝ(S))=∅. Hence, we assume that R≠∅ and explicitly construct a relaxed scenario S=(T,S,σ,μ,τT,τS) such that (G,σ) is an rs-Fitch graph.

We start by specifying the coloring σ:L→M. Since G is a complete multipartite graph, it is determined by its independent sets I1,…,Ik, which form a partition of L. We set M:={1,2,…,k} and color every x∈Ij with color σ(x)=j, 1≤j≤k. By construction, (G,σ) is properly colored, and σ(x)=σ(y) whenever xy∉R, i.e., whenever x and y lie in the same independent set. Therefore, we have S(G,σ)=∅. Let S be the planted star tree with leaf set L(S)={1,…,k}=M and childS(ρS)=M. Since R≠∅, we have k≥2, and thus, ρS has at least two children and is, therefore, phylogenetic. We choose the time map τS by putting τS(0S)=2, τS(ρS)=1 and τS(x)=0 for all x∈L(S).

Finally, we construct the planted phylogenetic tree T with planted root 0T and root ρT as follows: Vertex ρT has k children u1,…,uk. If Ij={xj} consists of a single element, then we put uj:=xj as a leaf of T, and otherwise, vertex uj has exactly |Ij| children where child(uj)=Ij. Now label, for all i∈{2,…,k}, the edge (ρT,ui) with “1”, and all other edges with “0”. Since k≥2, the tree T is also phylogenetic by construction.

We specify the time map τT and the reconciliation map μ by defining, for every vV(T),

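$$\tau_T(v) := \begin{cases} 2=\tau_S(0_S) & \text{if } v=0_T,\\ 0 & \text{if } v\in L(T),\\ 1/2 & \text{if } v=\rho_T,\\ 1/4 & \text{if } v=u_i\notin L(T), \end{cases} \qquad \mu(v) := \begin{cases} 0_S & \text{if } v=0_T,\\ \sigma(v) & \text{if } v\in L(T),\\ (\rho_S,1) & \text{if } v=\rho_T, \text{ and}\\ (\rho_S,i) & \text{if } v=u_i\notin L(T),\ 1\le i\le k. \end{cases}$$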

With the help of Fig. 17, it is now easy to verify that (i) τT is a time map for T, (ii) the reconciliation map μ is time-consistent, and (iii) λS=λ. In summary, S=(T,S,σ,μ,τT,τS) is a relaxed scenario, and (G,σ)=(ϝ(S),σ) is an rs-Fitch graph.
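The combinatorial core of this construction, i.e., the tree T with its 0/1-labeling, can be replayed on a small input. The sketch below is an illustration under our own conventions: trees are child-to-parent dictionaries, the planted root 0T is omitted because it does not lie on any path between leaves, and the names `prop6_tree` and `fitch_graph` are hypothetical. It checks only that the Fitch graph of the labeled tree is the given complete multipartite graph; the time maps and μ are not modeled.

```python
from itertools import combinations

def fitch_graph(parent, label):
    """Edge set of the Fitch graph of a 0/1-edge-labeled rooted tree."""
    def up(v):
        out = []
        while v in parent:
            out.append((parent[v], v))
            v = parent[v]
        return out
    leaves = sorted(set(parent) - set(parent.values()))
    return {frozenset((x, y)) for x, y in combinations(leaves, 2)
            if any(label[e] for e in set(up(x)) ^ set(up(y)))}

def prop6_tree(indep_sets):
    """Tree with one child u_j of the root per independent set I_j;
    the edges (rho_T, u_i) with i >= 2 are labeled 1, all others 0."""
    parent, label = {}, {}
    for j, I in enumerate(indep_sets, start=1):
        members = sorted(I)
        # A singleton set is attached directly as a leaf.
        u = members[0] if len(members) == 1 else f"u{j}"
        parent[u] = "rho_T"
        label[("rho_T", u)] = 0 if j == 1 else 1
        if len(members) > 1:
            for x in members:
                parent[x] = u
                label[(u, x)] = 0
    return parent, label

sets = [{"a", "b"}, {"c"}, {"d", "e"}]
parent, label = prop6_tree(sets)
expected = {frozenset((x, y)) for I, J in combinations(sets, 2)
            for x in I for y in J}
```

For this input, `fitch_graph(parent, label)` coincides with `expected`, the edge set of the complete multipartite graph with the three given independent sets.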

Fig. 17. Construction in the proof of Proposition 6

Although every complete multipartite graph can be colored in such a way that it becomes an rs-Fitch graph (cf. Proposition 6), there are colored, complete multipartite graphs (G,σ) that are not rs-Fitch graphs, i.e., that do not derive from a relaxed scenario (cf. Theorem 6). We summarize this discussion in the following

Observation 2

There are (planted) 0/1-edge labeled trees (T,λ) and colorings σ:L(T)→M such that there is no relaxed scenario S=(T,S,σ,μ,τT,τS) with λ=λS.

A subtle but important observation is that trees (T,λ) with coloring σ for which Observation 2 applies may still encode an rs-Fitch graph (ϝ(T,λ),σ), see Example 1 and Fig. 6. The latter is due to the fact that ϝ(T,λ)=ϝ(T′,λ′) may be possible for a different tree (T′,λ′) for which there is a relaxed scenario S=(T′,S,σ,μ,τT′,τS) with λ′=λS. In this case, (ϝ(T,λ),σ)=(ϝ(S),σ) is an rs-Fitch graph. We shall briefly return to these issues in the discussion in Sect. 8.

LDT graphs and rs-Fitch graphs

We proceed to investigate to what extent an LDT graph provides information about an rs-Fitch graph. As we shall see in Theorem 5 there is indeed a close connection between rs-Fitch graphs and LDT graphs. We start with a useful relation between the edges of rs-Fitch graphs and the reconciliation maps μ of their scenarios.

Lemma 13

Let ϝ(S) be an rs-Fitch graph for some relaxed scenario S. Then, ab∉E(ϝ(S)) implies that lcaS(σ(a),σ(b)) ⪯S μ(lcaT(a,b)).

Proof

Assume that ab∉E(ϝ(S)) and denote by Pxy the unique path in T that connects the two vertices x and y. Clearly, u:=lcaT(a,b) is contained in Pab, and this path Pab can be subdivided into the two paths Pu,a and Pu,b that have only vertex u in common. Since ab∉E(ϝ(S)), none of the edges (v,w) along the path Pab in T is a transfer edge, and thus, the images μ(v) and μ(w) are comparable in S. This implies that the images of any two vertices along the path Pu,a as well as the images of any two vertices along Pu,b are comparable. In particular, therefore, μ(u) is comparable with both μ(a)=σ(a)=:A and μ(b)=σ(b)=:B, where we may have A=B. Together with the fact that A and B are leaves in S, this implies that μ(u) is an ancestor of A and B. Since lcaS(A,B) is the “last” vertex that is an ancestor of both A and B, we have lcaS(A,B) ⪯S μ(u).

The next result shows that a subset of transfer edges can be inferred immediately from LDT graphs:

Theorem 4

If (G,σ) is an LDT graph, then G⊆ϝ(S) for all relaxed scenarios S that explain (G,σ).

Proof

Let S=(T,S,σ,μ,τT,τS) be a relaxed scenario that explains (G,σ), i.e., G=G<(S). By definition, V(G)=V(ϝ(S))=L(T). Hence it remains to show that E(G)⊆E(ϝ(S)). To this end, consider ab∈E(G) and assume, for contradiction, that ab∉E(ϝ(S)). Let A:=σ(a) and B:=σ(b). By Lemma 13, lcaS(A,B) ⪯S μ(lcaT(a,b)). But then, by Definitions 3 and 4, τS(lcaS(A,B))≤τT(lcaT(a,b)), implying ab∉E(G), a contradiction.

Since we only have that xy is an edge in ϝ(S) if the path connecting x and y in the tree T of S contains a transfer edge, Theorem 4 immediately implies

Corollary 6

For every relaxed scenario S=(T,S,σ,μ,τT,τS) without transfer edges, it holds that E(G<(S))=∅.

Theorem 4 provides the formal justification for indirect phylogenetic approaches to HGT inference that are based on the work of Lawrence and Hartl (1992), Clarke et al. (2002), and Novichkov et al. (2004) by showing that xy∈E(G<(S)) can be explained only by HGT, irrespective of how complex the true biological scenario might have been. However, it does not cover all HGT events. Figure 7 shows that there are relaxed scenarios S for which G<(S)≠ϝ(S) even though ϝ(S) is properly colored. Moreover, it is possible that an rs-Fitch graph (G,σ) contains edges xy∈E(G) with σ(x)=σ(y). In particular, therefore, an rs-Fitch graph is not always an LDT graph.

It is natural, therefore, to ask whether for every properly colored Fitch graph there is a relaxed scenario S such that G<(S)=ϝ(S). An affirmative answer is provided by

Theorem 5

The following statements are equivalent.

  1. (G,σ) is a properly colored complete multipartite graph.

  2. There is a relaxed scenario S=(T,S,σ,μ,τT,τS) with coloring σ such that G=G<(S)=ϝ(S).

  3. (G,σ) is complete multipartite and an LDT graph.

  4. (G,σ) is properly colored and an rs-Fitch graph.

In particular, for every properly colored complete multipartite graph (G,σ) the triple set S(G,σ) is compatible.

Proof

(1) implies (2). We assume that (G,σ) is a properly colored complete multipartite graph and set L:=V(G) and E:=E(G). If E=∅, then the relaxed scenario S constructed in the proof of Lemma 4 satisfies G=G<(S)=ϝ(S), i.e., the graphs are edgeless. Hence, we assume that E≠∅ and explicitly construct a relaxed scenario S=(T,S,σ,μ,τT,τS) with coloring σ such that G=G<(S)=ϝ(S).

The graph (G,σ) is properly colored and complete multipartite by assumption. Let I1,…,Ik denote the independent sets of G. Since E≠∅, we have k>1. Since all x∈Ii are adjacent to all y∈Ij, i≠j, and (G,σ) is properly colored, it must hold that σ(Ii)∩σ(Ij)=∅. For a fixed i let vi1,…,vi|Ii| denote the elements in Ii.

We start with the construction of the species tree S. First we add a planted root 0S with child ρS. Vertex ρS has children w1,…,wk where each wj corresponds to one Ij. Note, σ:L→M may not be surjective, in which case we would add one additional child x to ρS for each color x∈M\σ(L).

If |σ(Ij)|=1, then we identify the single color x∈σ(Ij) with wj. Otherwise, i.e., if |σ(Ij)|>1, vertex wj has as children the set childS(wj)=σ(Ij), which are leaves in S. See Fig. 18 for an illustrative example. Now we can choose the time map τS for S such that τS(0S)=3, τS(ρS)=2, τS(x)=0 for all x∈L(S) and τS(x)=1 for all x∈V0(S)\{ρS}.

Fig. 18. Construction of the relaxed scenario S in the proof of Theorem 5

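We now construct T as follows. The tree T has planted root 0T with child ρT. Vertex ρT has k children u1,…,uk where each uj corresponds to one Ij. Vertex uj is a leaf if |Ij|=1, and, otherwise, has exactly |Ij| children that are uniquely identified with the elements in Ij.

We now define the time map τT and reconciliation map μ for v∈V(T):

$$\tau_T(v) := \begin{cases} 3=\tau_S(0_S) & \text{if } v=0_T,\\ 0 & \text{if } v\in L(T),\\ 1.5 & \text{if } v=\rho_T,\\ 1.25 & \text{if } v=u_i\notin L(T), \end{cases} \qquad \mu(v) := \begin{cases} 0_S & \text{if } v=0_T,\\ \sigma(v) & \text{if } v\in L(T),\\ (\rho_S,w_1) & \text{if } v=\rho_T, \text{ and}\\ (\rho_S,w_i) & \text{if } v=u_i\notin L(T),\ 1\le i\le k. \end{cases}$$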

With the help of Fig. 18 it is now easy to verify that (i) τT is a time map for T, and that (ii) the reconciliation map μ is time-consistent. In summary the constructed S=(T,S,σ,μ,τT,τS) is a relaxed scenario.

We continue with showing that E=E(G<(S))=E(ϝ(S)). To this end, let a,b∈L be two vertices. Note, ab∈E if and only if a∈Ii and b∈Ij for distinct i,j∈[k]:={1,2,…,k}.

First assume that ab∈E and thus, a∈Ii and b∈Ij for distinct i,j∈[k]. By construction, a ⪯T ui and b ⪯T uj with lcaT(ui,uj)=ρT. In particular, we have parT(ui)=parT(uj)=ρT and the path from a to b contains the two edges (ρT,ui) and (ρT,uj). By construction, we have μ(ρT)=(ρS,w1), and for all 1≤l≤k, μ(ul)=σ(ul)=wl if ul is a leaf, and μ(ul)=(ρS,wl) otherwise. These two arguments imply that μ(ρT) and μ(ul) are comparable if and only if ul=u1. Now, since ui≠uj, they cannot both be equal to u1 and thus, at least one of the edges (ρT,ui) and (ρT,uj) is a transfer edge. Hence, ab∈E(ϝ(S)). By construction, ab∈E implies lcaT(a,b)=ρT. Hence, we have μ(lcaT(a,b))=μ(ρT)=(ρS,w1) ≺S ρS=lcaS(σ(a),σ(b)), and thus ab∈E(G<(S)).

Now assume that ab∉E, and thus, a,b∈Ii for some i∈[k]. It clearly suffices to consider the case a≠b, and thus, a,b∈childT(ui) and ui∉L(T) holds by construction. In particular, the path between a and b only consists of the edges (ui,a) and (ui,b). Moreover, we have σ(a),σ(b) ⪯S wi and μ(ui)=(ρS,wi). Hence, none of the edges (ui,a) and (ui,b) is a transfer edge, and ab∉E(ϝ(S)). We have μ(lcaT(a,b))=(ρS,wi) ≻S wi ⪰S lcaS(σ(a),σ(b)), and thus τT(lcaT(a,b))>τS(lcaS(σ(a),σ(b))). Hence, ab∉E(G<(S)).

In summary, ab∈E if and only if ab∈E(ϝ(S)) if and only if ab∈E(G<(S)), and consequently, G=G<(S)=ϝ(S).

(2) implies (1). Thus, suppose that there is a relaxed scenario S=(T,S,σ,μ,τT,τS) with coloring σ such that G=G<(S)=ϝ(S). Proposition 3 implies that (G,σ)=(G<(S),σ) is properly colored. Moreover, (G,σ)=(ϝ(S),σ) is an rs-Fitch graph and thus, by Observation 1, G is complete multipartite.

Statements (1) and (2) together with Proposition 5 imply (3). Conversely, if (3) is satisfied then Proposition 3 implies that (G,σ) is properly colored. This and the fact that G is complete multipartite implies (1). Therefore, Statements (1), (2) and (3) are equivalent.

Furthermore, (4) implies (1) by Observation 1. Conversely, (G,σ) in Statement (2) is an rs-Fitch graph and an LDT graph. Hence it is properly colored by Proposition 3. Thus (2) implies (4).

Statement (3), in particular, implies that every properly colored complete multipartite (G,σ) is an LDT graph and, thus, there is a relaxed scenario S such that G=G<(S). Now, we can apply Lemma 6 to conclude that S(G,σ) is compatible, which completes the proof.

Corollary 7

A colored graph (G,σ) is an LDT graph and an rs-Fitch graph if and only if (G,σ) is a properly colored complete multipartite graph (and thus, a properly colored Fitch graph for some 0/1-edge-labeled tree).

Proof

If (G,σ) is an rs-Fitch graph then, by Observation 1, G is a complete multipartite graph. Moreover, since (G,σ) is an LDT graph, (G,σ) is properly colored (cf. Proposition 3). Conversely, if (G,σ) is a properly colored complete multipartite graph it is, by Theorem 5(2), an rs-Fitch graph and an LDT graph. Now the equivalence between Statements (1) and (3) in Theorem 5 shows that (G,σ) is an LDT graph.

Corollary 8

Let (G,σ) be a vertex-colored graph. If T(G)=∅ and S(G,σ) is incompatible, then G is a complete multipartite graph (and thus, a Fitch graph for some 0/1-edge-labeled tree), but σ is not a proper vertex coloring of G.

Proof

By definition, if T(G)=∅, then G cannot contain an induced K2+K1. By Proposition 5, G is a Fitch graph. Contraposition of the last statement in Theorem 5 and G being a Fitch graph for some (T,λ) implies that σ is not a proper vertex coloring of G.

As outlined in the main part of this paper, LDT graphs are sufficient to describe replacing HGT. They fail, however, to describe additive HGT in full detail.

rs-Fitch graphs with general colorings

In scenarios with additive HGT, the rs-Fitch graph is no longer properly colored and no longer coincides with the LDT graph. Since not every vertex-colored complete multipartite graph (G,σ) is an rs-Fitch graph (cf. Theorem 6), we ask whether an LDT graph (G,σ) that is not itself already an rs-Fitch graph imposes constraints on the rs-Fitch graphs (ϝ(S),σ) that derive from relaxed scenarios S that explain (G,σ). As a first step towards this goal, we aim to characterize rs-Fitch graphs, i.e., to understand the conditions imposed by the existence of an underlying scenario S on the compatibility of the collection ℐ of independent sets of G and the coloring σ. As we shall see, these conditions can be explained in terms of an auxiliary graph that we introduce in a very general setting:

Definition 17

Let L be a set, σ:L→M a map and ℐ={I1,…,Ik} a set of subsets of L. Then the graph Aϝ(σ,ℐ) has vertex set M and edges xy if and only if x≠y and x,y∈σ(I′) for some I′∈ℐ. We define an edge labeling ℓ:E(Aϝ)→2^ℐ such that ℓ(e):={I∈ℐ ∣ ∃x,y∈I s.t. σ(x)σ(y)=e}.

By construction, Aϝ(σ,ℐ′) is a subgraph of Aϝ(σ,ℐ) whenever ℐ′⊆ℐ. The labeling of an edge e records the sets I∈ℐ that imply the presence of the edge.
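As an illustration of Definition 17, the auxiliary graph and its edge labeling can be computed as follows. This is a minimal sketch with hypothetical names: the coloring is a dictionary, each subset of L is referred to by its index in the input list, and the returned dictionary maps every edge of the auxiliary graph (a two-element set of colors) to the indices of the subsets that give rise to it. The vertex set M is not needed to list the edges and is therefore not passed in.

```python
from itertools import combinations

def aux_graph(sigma, subsets):
    """Edges of the auxiliary graph together with their labels.

    Returns {frozenset({A, B}): set of indices j such that subsets[j]
    contains a vertex of color A and a vertex of color B}.
    """
    label = {}
    for j, I in enumerate(subsets):
        colors = sorted({sigma[x] for x in I})
        for A, B in combinations(colors, 2):
            label.setdefault(frozenset((A, B)), set()).add(j)
    return label

sigma = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 1}
subsets = [{"a", "b"}, {"c", "d"}, {"e", "c"}]
labels = aux_graph(sigma, subsets)
```

In this example the color pair {1,2} is witnessed by the first and the third subset, so removing either one of them alone does not remove the corresponding edge.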

Theorem 6

A graph (G,σ) is an rs-Fitch graph if and only if (i) it is complete multipartite with independent sets ℐ={I1,…,Ik}, and (ii) if k>1, there is an independent set I′∈ℐ such that Aϝ(σ,ℐ\{I′}) is disconnected.

Proof

Let G=(L,E) be a graph with coloring σ:L→M. Suppose first that G satisfies (i) and (ii). To show that (G,σ) is an rs-Fitch graph, we will construct a relaxed scenario S=(T,S,σ,μ,τT,τS) such that G=ϝ(S). If k=1, or equivalently E=∅, then the relaxed scenario S constructed in the proof of Lemma 4 satisfies G=ϝ(S), i.e., both graphs are edgeless. Now assume that k>1 and thus, E≠∅. Hence, we can choose an independent set I′∈ℐ such that Aϝ:=Aϝ(σ,ℐ\{I′}) is disconnected. Note that ℐ\{I′} is non-empty since k>1. Moreover, since Aϝ is a disconnected graph on the color set M, there is a connected component C of Aϝ such that (M\C)∩σ(I′)≠∅. Hence M1:=M\C and M2:=C form a bipartition of M such that neither M1 nor M2 is empty.

We continue by showing that every I∈ℐ\{I′} satisfies either σ(I)⊆M1 or σ(I)⊆M2. To see this, assume, for contradiction, that there are colors A∈σ(I)∩M1 and B∈σ(I)∩M2 for some I∈ℐ\{I′}. Thus, B∈C and, by definition, AB∈E(Aϝ). Therefore, A and B must lie in the connected component C; a contradiction. Therefore, we can partition ℐ\{I′} into ℐ1:={I∈ℐ\{I′} ∣ σ(I)⊆M1} and ℐ2:={I∈ℐ\{I′} ∣ σ(I)⊆M2}. Note that one of the sets ℐ1 and ℐ2, but not both of them, may be empty. This may be the case, for instance, if σ is not surjective.

Now, we construct a relaxed scenario S=(T,S,σ,μ,τT,τS) with coloring σ such that G=ϝ(S). We first define the species tree S as the planted tree where ρS (i.e., the single child of 0S) has two children w1 and w2. If |M1|=1, we identify w1 with the single element in M1, and otherwise, we set childS(w1)=L(S(w1)):=M1. We proceed analogously for w2 and M2. Thus, S is phylogenetic by construction. We choose the time map τS by putting τS(0S)=2, τS(ρS)=1, τS(w1)=τS(w2)=0.5 and τS(x)=0 for all x∈L(S). This completes the construction of S and τS.

We proceed with the construction of the gene tree T, its time map τT and the reconciliation map μ. This tree T has leaf set L, planted root 0T, and root ρT. We set μ(0T)=0S and τT(0T)=τS(0S)=2, and moreover μ(x)=σ(x) and τT(x)=0 for all x∈L.

For each Ij∈ℐ\{I′}, we add a vertex uj. We will later specify how these vertices are connected (via paths) to ρT. If |Ij|=1, uj becomes a leaf of T that is identified with the unique element in Ij. Otherwise, we add exactly |Ij| children to uj, each of which is identified with one of the elements in Ij. If uj is a leaf, we already defined μ(uj)=σ(uj) and τT(uj)=0.

Otherwise, we set τT(uj)=0.6 and μ(uj)=(ρS,w1) if Ij∈ℐ1 and μ(uj)=(ρS,w2) if Ij∈ℐ2. Recall that M1∩σ(I′)≠∅. However, both M2∩σ(I′)≠∅ and M2∩σ(I′)=∅ are possible. The latter case appears, e.g., whenever Aϝ(σ,ℐ) was already disconnected. To connect the vertices uj to ρT, we distinguish three mutually exclusive cases:

Case (a): M2∩σ(I′)=∅ and ℐ1≠∅.

We set μ(ρT)=(ρS,w2) and τT(ρT)=0.9. We attach all uj that correspond to elements Ij∈ℐ1 as children of ρT. If |I′|>1 or ℐ2≠∅, we create a vertex u′ to which all elements in I′ and all uj such that Ij∈ℐ2 are attached as children, attach u′ as a child of ρT, and set μ(u′)=(ρS,w1) and τT(u′)=0.75. Otherwise, we simply attach the single element x in I′ as a child of ρT. Clearly, the tree T constructed in this way is phylogenetic. Note that the edges (ρT,uj) with Ij∈ℐ1 as well as the edges (u′,uj) with Ij∈ℐ2 are transfer edges. Together with (ρT,u′) or (ρT,x), respectively, these are the only transfer edges.

Case (b): M2∩σ(I′)=∅ and ℐ1=∅.

By the arguments above, the latter implies ℐ2≠∅. Hence, we can set μ(ρT)=(ρS,w1) and τT(ρT)=0.9 and attach all elements of I′ as well as the vertices uj corresponding to the independent sets Ij∈ℐ2=ℐ\{I′} as children of ρT. Since |I′|≥1 and |ℐ2|≥1, the tree T obtained in this manner is again phylogenetic. Moreover, note that the transfer edges are exactly the edges (ρT,uj).

Case (c): M2∩σ(I′)≠∅.

In this case, the sets I′1:={x∈I′ ∣ σ(x)∈M1} and I′2:={x∈I′ ∣ σ(x)∈M2} must be non-empty. We set μ(ρT)=(0S,ρS) and τT(ρT)=1.5. If |I′1|>1 or ℐ2≠∅, we create a vertex u′ to which all elements in I′1 and all uj such that Ij∈ℐ2 are attached as children, and set μ(u′)=(ρS,w1) and τT(u′)=0.75. Otherwise, we simply attach the single element in I′1 as a child of ρT. For the “other side”, we proceed analogously: If |I′2|>1 or ℐ1≠∅, we create a vertex u″ to which all elements in I′2 and all uj such that Ij∈ℐ1 are attached as children, and set μ(u″)=(ρS,w2) and τT(u″)=0.75. Otherwise, we simply attach the single element in I′2 as a child of ρT. By construction, the resulting tree is again phylogenetic. Moreover, the transfer edges are exactly the edges (u′,uj) and (u″,uj).

Using Fig. 19, one can easily verify that, in all three Cases (a)-(c), the reconciliation map μ is time-consistent with τT and τS. Thus, S is a relaxed scenario. Moreover, Fig. 19 together with the fact that σ(I)⊆M1 holds for all I∈ℐ1, and σ(I)⊆M2 holds for all I∈ℐ2, shows that G=ϝ(S) in all three cases. Hence, (G,σ) is an rs-Fitch graph.

Fig. 19. Illustration of the relaxed scenario constructed in the if-direction of the proof of Theorem 6. For Cases (a) and (c), only the situation in which a vertex u′ and u″, resp., is necessary is shown. Otherwise, the single element in I′, I′1 or I′2 would be a child of the root ρT. Moreover, the vertices uj are drawn under the assumption that |Ij|>1. Otherwise, they are identified with the single leaf in Ij

For the only-if-direction, assume that (G=(V,E),σ) is an rs-Fitch graph. Hence, there exists a relaxed scenario S=(T,S,σ,μ,τT,τS) such that G=ϝ(S). By Observation 1 and Proposition 5, (G,σ) is a complete multipartite graph that is determined by its set of independent sets ℐ={I1,…,Ik}. Hence, Condition (i) is satisfied.

Now assume, for contradiction, that Condition (ii) is violated. Thus k≥2 and there is no independent set I′∈ℐ such that Aϝ(σ,ℐ\{I′}) is disconnected. If |M|=1, then the species tree S only consists of the planted root 0S and the root ρS, which in this case is identified with the single element in M. Clearly, all vertices and edges are comparable in such a tree S, and hence, there are no transfer edges in S, implying E=∅ and thus |ℐ|=1; a contradiction to k≥2.

Thus we have |M|≥2 and the root ρS of the species tree S has at least two children. Since Aϝ(σ,ℐ\{I′}) is connected for every I′∈ℐ, the graph Aϝ(σ,ℐ) is also connected. Since each color appears at most once as a leaf of S, σ(L(S(v1)))∩σ(L(S(v2)))=∅ holds for any two distinct children v1,v2∈childS(ρS). These three assertions, together with the definition of the auxiliary graph Aϝ(σ,ℐ), imply that there are two distinct colors A,B∈M such that AB is an edge in Aϝ(σ,ℐ), A ⪯S v1 and B ⪯S v2 for distinct children v1,v2∈childS(ρS). By definition of Aϝ(σ,ℐ), there is an independent set I′∈ℐ containing a vertex a∈I′ with σ(a)=A and a vertex b∈I′ with σ(b)=B. Since a and b lie in the same independent set, we have ab∉E. By Lemma 13, μ(lcaT(a,b)) ⪰S lcaS(A,B)=ρS. Since, by assumption, Aϝ(σ,ℐ\{I′}) is also connected, we find two distinct colors C and D (not necessarily distinct from A and B) such that CD is an edge in Aϝ(σ,ℐ\{I′}), C ⪯S v3 and D ⪯S v4 for distinct children v3,v4∈childS(ρS) (but not necessarily distinct from v1 and v2), and in particular, an independent set I″∈ℐ\{I′} containing a vertex c∈I″ with σ(c)=C and a vertex d∈I″ with σ(d)=D. By construction, I′≠I″, and thus, all edges between I′ and I″ exist in G, in particular the edges ac, ad, bc and bd. Since c,d∈I″, we have cd∉E and thus, by Lemma 13, μ(lcaT(c,d)) ⪰S lcaS(C,D)=ρS.

We now consider the unique path P in T that connects lcaT(a,b) and lcaT(c,d). Since μ is time-consistent and μ(lcaT(a,b)),μ(lcaT(c,d)) ⪰S ρS, we conclude that, for every edge uv along this path P, we have μ(u),μ(v) ⪰S ρS and thus μ(u),μ(v)∈{ρS,(0S,ρS)}. But then, μ(u) and μ(v) are comparable in S. Therefore, P does not contain any transfer edge. Since ab∉E, the path connecting a and lcaT(a,b) does not contain any transfer edges. Likewise, cd∉E implies that the path connecting c and lcaT(c,d) does not contain any transfer edges. Thus, the path connecting a and c also does not contain any transfer edge, which implies that ac∉E(ϝ(S))=E; a contradiction since a and c belong to two distinct independent sets.

Hence, we conclude that for k>1 there exists an independent set I′∈ℐ such that Aϝ(σ,ℐ\{I′}) is disconnected.

Corollary 9

rs-Fitch graphs can be recognized in polynomial time.

Proof

Every rs-Fitch graph (G,σ) must be complete multipartite, which can be verified in polynomial time. In this case, the set of independent sets ℐ={I1,…,Ik} of G can also be determined and the graph Aϝ(σ,ℐ) can be constructed in polynomial time. Finally, we need to find an independent set I′∈ℐ such that Aϝ(σ,ℐ\{I′}) is disconnected. Clearly, checking whether Aϝ(σ,ℐ\{I′}) is disconnected can be done in polynomial time and, since there are at most |V(G)| independent sets in ℐ, finding an independent set I′ such that Aϝ(σ,ℐ\{I′}) is disconnected (if one exists) can be done in polynomial time as well.
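The recognition procedure outlined in this proof can be sketched as follows. The code is a straightforward, unoptimized illustration of the two conditions of Theorem 6 and not the authors' implementation; the function names and the input format (vertex list, edge list, coloring as a dictionary, and the full color set M passed explicitly) are assumptions made for this example.

```python
from itertools import combinations

def independent_sets(vertices, edges):
    """Independent sets of a complete multipartite graph, or None otherwise."""
    E = {frozenset(e) for e in edges}
    classes = {frozenset(u for u in vertices
                         if u == v or frozenset((u, v)) not in E)
               for v in vertices}
    if sum(len(c) for c in classes) != len(vertices):
        return None
    return sorted(classes, key=sorted)

def is_connected(nodes, edges):
    """Depth-first search connectivity test on the given node set."""
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for A, B in edges:
        adj[A].add(B)
        adj[B].add(A)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(nodes)

def aux_edges(sigma, sets):
    """Edges of the auxiliary graph for the given collection of sets."""
    return {p for I in sets
            for p in combinations(sorted({sigma[x] for x in I}), 2)}

def is_rs_fitch(vertices, edges, sigma, colors):
    sets = independent_sets(vertices, edges)
    if sets is None:                      # Condition (i) fails
        return False
    if len(sets) == 1:                    # k = 1: edgeless graph
        return True
    # Condition (ii): removing some independent set disconnects the graph on M.
    return any(not is_connected(colors, aux_edges(sigma, sets[:i] + sets[i + 1:]))
               for i in range(len(sets)))

K2 = (["a", "b"], [("a", "b")])
K22 = (["a1", "a2", "b1", "b2"],
       [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")])
```

For instance, a single edge whose endpoints share the only available color is rejected, whereas the same graph is accepted as soon as M contains a second, unused color. The complete bipartite graph `K22` with both colors present on each side is rejected as well.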

Corollary 10

Let (G,σ) be a complete multipartite graph with coloring σ:V(G)→M and set of independent sets ℐ. Then, (G,σ) is an rs-Fitch graph if and only if Aϝ(σ,ℐ) is disconnected or there is a cut Q⊆E(Aϝ(σ,ℐ)) such that all edges e∈Q have the same label ℓ(e)={I} for some I∈ℐ.

Proof

If Aϝ(σ,ℐ) is disconnected, then Aϝ(σ,ℐ\{I}) remains disconnected for all I∈ℐ and, by Theorem 6, (G,σ) is an rs-Fitch graph.

If there is a cut Q⊆E(Aϝ(σ,ℐ)) such that all edges e∈Q have the same label ℓ(e)={I} for some I∈ℐ, then, by definition, E(Aϝ(σ,ℐ\{I}))⊆E′:=E(Aϝ(σ,ℐ))\Q. Since Q is a cut in Aϝ(σ,ℐ), the resulting graph A′ϝ=(M,E′) is disconnected. By the latter arguments, Aϝ(σ,ℐ\{I}) is a subgraph of A′ϝ, and thus, disconnected as well. By Theorem 6, (G,σ) is an rs-Fitch graph.

Conversely, if (G,σ) is an rs-Fitch graph, then Theorem 6 implies that Aϝ(σ,ℐ\{I}) is disconnected for some I∈ℐ. If Aϝ(σ,ℐ) was already disconnected, then there is nothing to show. Hence assume that Aϝ(σ,ℐ)=(M,E) is connected and let Aϝ(σ,ℐ\{I})=(M,E′). Moreover, let F⊆E be the subset of edges e∈E with I∈ℓ(e). Note, F contains all edges of E that have potentially been removed from E to obtain E′. However, all edges e=xy in F with |ℓ(e)|>1 must remain in Aϝ(σ,ℐ\{I}), since there is another independent set I′∈ℓ(e)\{I} such that x,y∈σ(I′). Hence, only those edges e in F for which |ℓ(e)|=1 are removed from E. Hence, there is a cut Q⊆F⊆E such that all edges e∈Q have the same label ℓ(e)={I} for some I∈ℐ.

Corollary 11

If (G,σ) with coloring σ:V(G)→M is an rs-Fitch graph, then there are no two disjoint independent sets I and I′ of G with σ(I)=σ(I′)=M.

Proof

Let ℐ be the set of independent sets of G. If |ℐ|=1, there is nothing to show and thus, we assume that |ℐ|>1. Assume, for contradiction, that there are two distinct independent sets I,I′∈ℐ such that σ(I)=σ(I′)=M. For every I″∈ℐ, the set ℐ\{I″} clearly contains at least one of the two sets I and I′, both of which contain all colors in M. Therefore, Aϝ(σ,ℐ\{I″}) is the complete graph by construction and, thus, connected for every I″∈ℐ. This together with Theorem 6 implies that (G,σ) is not an rs-Fitch graph; a contradiction.

Corollary 12

Every complete multipartite graph (G,σ) with a vertex coloring σ:V(G)→M that is not surjective is an rs-Fitch graph.

Proof

If σ:V(G)→M is not surjective, then Aϝ(σ,ℐ) is disconnected, where ℐ denotes the set of independent sets of G. Hence, if |ℐ|>1, then Aϝ(σ,ℐ\{I}) remains disconnected for all I∈ℐ. By Theorem 6, (G,σ) is an rs-Fitch graph.

Corollary 12 may seem surprising since it implies that the property of being an rs-Fitch graph can depend on species (colors M) for which we have no genes L in the data. The reason is that an additional lineage in the species tree provides a place to “park” interior vertices in the gene tree from which HGT-edges can emanate that could not always be accommodated within lineages that have survivors—where they may force additional HGT edges.

Corollary 13

Every Fitch graph (G,σ) that contains an independent set I and a vertex x∈I with σ(x)∉σ(I′) for all other independent sets I′≠I is an rs-Fitch graph.

Proof

Let ℐ denote the set of independent sets of G. If there is an independent set I∈ℐ that contains a vertex x∈I with σ(x)∉σ(I′) for all other independent sets I′≠I, then the vertex σ(x) in Aϝ(σ,ℐ\{I}) is an isolated vertex and thus, Aϝ(σ,ℐ\{I}) is disconnected. By Theorem 6, (G,σ) is an rs-Fitch graph.

As for LDT graphs, the property of being an rs-Fitch graph is hereditary.

Corollary 14

If (G=(L,E),σ) is an rs-Fitch graph, then the colored vertex induced subgraph (G[W],σ|W) is an rs-Fitch graph for all non-empty subsets W⊆L.

Proof

It suffices to show the statement for W=L\{x} for an arbitrary vertex x∈L. If G=(L,E) is edgeless, then G[W] is edgeless and thus, by Theorem 6, an rs-Fitch graph.

Thus, assume that E≠∅ and thus, for the set ℐ of independent sets of G it holds that |ℐ|>1. Since G does not contain an induced K2+K1, it is easy to see that G[W] cannot contain an induced K2+K1 and thus, G[W] is a complete multipartite graph. Hence, Theorem 6(i) is satisfied. Moreover, if for the set ℐ′ of independent sets of G[W] it holds that |ℐ′|=1, then Theorem 6 already shows that (G[W],σ|W) is an rs-Fitch graph.

Thus, assume that |ℐ′|>1. Now compare the labeling ℓ of the edges in Aϝ=Aϝ(σ,ℐ) and the labeling ℓ′ of the edges in A′ϝ=Aϝ(σ|W,ℐ′). Note, Aϝ and A′ϝ still have the same vertex set M. Let I∈ℐ with x∈I. For all vertices y∈I with σ(x)≠σ(y), we have an edge e=σ(x)σ(y) in Aϝ and I∈ℓ(e). Consequently, for all edges e of Aϝ that are present in A′ϝ we have ℓ′(e)⊆ℓ(e). In particular, A′ϝ cannot have edges that are not present in Aϝ, since we reduced the size of one independent set by one. Therefore, A′ϝ is a subgraph of Aϝ.

By Theorem 6, there is an independent set I′∈ℐ, not necessarily distinct from I, such that Aϝ(σ,ℐ\{I′}) is disconnected. If I={x}, then ℐ′=ℐ\{I} and A′ϝ=Aϝ must be disconnected as well. Otherwise, A′ϝ⊆Aϝ and similar arguments as above show that Aϝ(σ|W,ℐ′\{I′})⊆Aϝ(σ,ℐ\{I′}). Therefore, in both of the latter cases, Aϝ(σ|W,ℐ′\{I′}) is disconnected and Theorem 6 implies that (G[W],σ|W) is an rs-Fitch graph.

As outlined in the main part of this paper, Corollary 14 is usually not satisfied if we restrict the codomain of σ to the observable part of colors, even if σ is surjective.

Least resolved trees for Fitch graphs

It is important to note that the characterization of rs-Fitch graphs in Theorem 6 does not provide us with a characterization of rs-Fitch graphs that share a common relaxed scenario with a given LDT graph. As a potential avenue to address this problem, we investigate the structure of least-resolved trees for Fitch graphs as a possible source of additional constraints.

All trees considered in this Appendix B.4 are rooted and phylogenetic but not planted unless stated differently. This is no loss of generality, since we are interested in Fitch-least-resolved trees, which are never planted because the edge incident with the planted root can be contracted without affecting the paths between the leaves.

Definition 18

The edge-labeled tree (T,λ) is Fitch-least-resolved w.r.t. ϝ(T,λ), if for all trees T′≠T that are displayed by T and every labeling λ′ of T′ it holds that ϝ(T,λ)≠ϝ(T′,λ′).

Definition 19

Let (T,λ) be an edge-labeled tree and let e=(x,y)∈E(T) be an inner edge. The tree (T/e,λ/e) with L(T/e)=L(T) is obtained by contraction of the edge e in T and by keeping the edge labels of all non-contracted edges.

Note, if e is an inner edge of a phylogenetic tree T, then the tree T/e is again phylogenetic.

Definition 20

An edge e in (T,λ) is relevantly-labeled in (T,λ) if, for the tree (T,λ) with λ(f)=λ(f) for all fE(T)\{e} and λ(e)λ(e), it holds that ϝ(T,λ)ϝ(T,λ).

Lemma 14

An outer 0-edge e=(v,x) in (T,λ) is relevantly-labeled in (T,λ) if and only if zxE(ϝ(T,λ)) for some zL(T)\{x}.

Proof

Assume that e=(v,x) is a relevantly-labeled outer 0-edge. Hence, for (T,λ) with λ(f)=λ(f) for all fE(T)\{e} and λ(e)=1, it holds that ϝ(T,λ)ϝ(T,λ). Since we only changed the label of the outer edge (vx), it still holds that yyE(ϝ(T,λ)) if and only if yyE(ϝ(T,λ)) for all distinct y,yL(T)\{x}. Moreover, since λ(e)=1 and e=(v,x) is an outer edge, we have xzE(ϝ(T,λ)) for all zL(T)\{x}. Thus, ϝ(T,λ)ϝ(T,λ) implies that xzE(ϝ(T,λ)) for at least one zL(T)\{x}.

Now, suppose that zx∉E(ϝ(T,λ)) for some z∈L(T)\{x}. Clearly, this implies that the outer edges e=(v,x) and f=(w,z) must be 0-edges and changing one of them to a 1-edge would imply that xz becomes an edge in the Fitch graph. Hence, e is relevantly-labeled in (T,λ).
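The criterion of Lemma 14 can be checked mechanically on small examples. The following Python sketch is illustrative only: the child-to-parent dictionary encoding of the tree and the function names `fitch_graph` and `is_relevantly_labeled` are assumptions made here, not notation from the paper. It computes the Fitch graph by scanning the path between each pair of leaves and tests an edge by flipping its label, as in Definition 20.

```python
from itertools import combinations

def fitch_graph(parent, label):
    """Leaves x, y are adjacent iff the path between them contains a 1-edge.
    parent maps each non-root vertex to its parent; label maps each
    non-root vertex v to the 0/1 label of the edge (parent[v], v)."""
    inner = set(parent.values())
    leaves = sorted(v for v in parent if v not in inner)

    def path_to_root(v):
        path = [v]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    edges = set()
    for x, y in combinations(leaves, 2):
        px, py = path_to_root(x), path_to_root(y)
        py_set = set(py)
        lca = next(v for v in px if v in py_set)
        # every vertex strictly below the lca stands for the edge to its parent
        on_path = px[:px.index(lca)] + py[:py.index(lca)]
        if any(label[v] == 1 for v in on_path):
            edges.add(frozenset((x, y)))
    return leaves, edges

def is_relevantly_labeled(parent, label, v):
    """Definition 20: flipping the label of (parent[v], v) changes the graph."""
    flipped = dict(label)
    flipped[v] = 1 - label[v]
    return fitch_graph(parent, label)[1] != fitch_graph(parent, flipped)[1]

# Hypothetical example: root r with children u and z; u has children x and y.
parent = {"u": "r", "z": "r", "x": "u", "y": "u"}
label = {"u": 1, "z": 0, "x": 0, "y": 0}
leaves, edges = fitch_graph(parent, label)
for x in leaves:
    if label[x] == 0:  # outer 0-edge (parent[x], x)
        has_non_neighbor = any(frozenset((x, z)) not in edges
                               for z in leaves if z != x)
        assert is_relevantly_labeled(parent, label, x) == has_non_neighbor
```

In this example the outer 0-edges at x and y are relevantly-labeled because x and y are non-adjacent, whereas the outer 0-edge at z is not, since z is already adjacent to every other leaf.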

Lemma 15

For every tree (T,λ) and every inner 0-edge e of T, it holds ϝ(T,λ)=ϝ(T/e,λ/e).

Proof

Suppose that (T,λ) contains an inner 0-edge e=(u,v). The contraction of this edge does not change the number of 1-edges along the paths connecting any two leaves. It affects the least common ancestor of two leaves x and y only if lca_T(x,y)=u or lca_T(x,y)=v. In either case, however, the number of 1-edges between lca_T(x,y) and the leaves x and y remains unchanged. Hence, we have ϝ(T,λ)=ϝ(T/e,λ/e).

Lemma 16

If (T,λ) is a Fitch-least-resolved tree w.r.t. ϝ(T,λ), then it contains neither inner 0-edges nor inner 1-edges that are not relevantly-labeled.

Proof

Suppose first, by contraposition, that (T,λ) contains an inner 0-edge e=(u,v). By Lemma 15, ϝ(T,λ)=ϝ(T/e,λ/e), and thus, (T,λ) is not Fitch-least-resolved.

Assume now, by contraposition, that (T,λ) contains an inner 1-edge e that is not relevantly-labeled. Hence, we can put λ′(e)=0 and λ′(f)=λ(f) for all f∈E(T)\{e} and obtain ϝ(T,λ)=ϝ(T,λ′). Since (T,λ′) contains an inner 0-edge, it cannot be Fitch-least-resolved. Therefore and by definition, (T,λ) cannot be Fitch-least-resolved either.

The converse of Lemma 16 is, however, not always satisfied. To see this, consider the Fitch graph G≃K_3 with vertices x, y and z. Now, consider the tree (T,λ) where T is the triple xy|z, the two outer edges incident to y and z are 0-edges while the remaining two edges in T are 1-edges. It is easy to verify that G=ϝ(T,λ). In particular, the inner edge e is relevantly-labeled, since if λ(e)=0 we would have yz∉E(ϝ(T,λ)). However, (T,λ) is not Fitch-least-resolved w.r.t. G, since the star tree T′ on the three leaves x, y, z is displayed by T, and the labeling λ′ with λ′(e)=1 for all e∈E(T′) provides a tree (T′,λ′) with G=ϝ(T′,λ′).
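This counterexample can be verified with a few lines of Python. The sketch below is an illustration under assumed conventions (trees as child-to-parent dictionaries, the inner vertex named u, the root named r, and the hypothetical helper `zero_components`). It uses the observation that two leaves are non-adjacent in the Fitch graph exactly when no 1-edge separates them, i.e., when they lie in the same component after all 1-edges are deleted.

```python
def zero_components(parent, label):
    """Group the leaves by the components that remain after deleting all
    1-edges. Leaves in the same group are the non-adjacent pairs of the
    Fitch graph, so the groups are its independent sets."""
    def top(v):
        # climb 0-edges only; the topmost vertex identifies the component
        while v in parent and label[v] == 0:
            v = parent[v]
        return v

    inner = set(parent.values())
    groups = {}
    for leaf in (v for v in parent if v not in inner):
        groups.setdefault(top(leaf), set()).add(leaf)
    return sorted(sorted(g) for g in groups.values())

# T is the triple xy|z: 0-edges to y and z, 1-edges to x and on the inner edge.
triple = {"u": "r", "z": "r", "x": "u", "y": "u"}
lam = {"u": 1, "x": 1, "y": 0, "z": 0}

# Star tree T' on the same leaves with all edges labeled 1.
star = {"x": "r", "y": "r", "z": "r"}
lam_star = {"x": 1, "y": 1, "z": 1}

singletons = [["x"], ["y"], ["z"]]          # all parts of size one: K_3
assert zero_components(triple, lam) == singletons
assert zero_components(star, lam_star) == singletons

# Relabeling the inner edge to 0 merges y and z into one independent set,
# so the inner edge is relevantly-labeled.
assert zero_components(triple, {**lam, "u": 0}) == [["x"], ["y", "z"]]
```

Both trees explain the same graph, so the more resolved tree cannot be Fitch-least-resolved even though all of its inner edges are relevantly-labeled 1-edges.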

Lemma 17

A tree (T,λ) is a Fitch-least-resolved tree w.r.t. ϝ(T,λ) if and only if ϝ(T,λ)≠ϝ(T/e,λ′) holds for all labelings λ′ of T/e and all inner edges e in T.

Proof

Let (T,λ) be an edge-labeled tree. Suppose first that (T,λ) is Fitch-least-resolved w.r.t. ϝ(T,λ). For every inner edge e in T, the tree T/e≠T is displayed by T. By definition of Fitch-least-resolved trees, we have ϝ(T,λ)≠ϝ(T/e,λ′) for every labeling λ′ of T/e.

For the converse, assume, for contraposition, that (T,λ) is not Fitch-least-resolved w.r.t. ϝ(T,λ). Hence, there is a tree (T′,λ′) such that T′≠T is displayed by T and ϝ(T,λ)=ϝ(T′,λ′). Clearly, T and T′ must have the same leaf set. Therefore and since T′<T, the tree T′ can be obtained from T by a sequence of contractions of inner edges e_1,…,e_ℓ (in this order) where ℓ≥1. If ℓ=1, then we have T′=T/e_1 and, by assumption, ϝ(T,λ)=ϝ(T/e_1,λ′). Thus, we are done. Now assume ℓ≥2. We consider the tree (T/e_1,λ″) where λ″(f)=λ′(f) if f∈E(T′) and λ″(f)=0 otherwise. Hence, (T′,λ′) can be obtained from (T/e_1,λ″) by stepwise contraction of the 0-edges e_2,…,e_ℓ, and by keeping the labeling of λ″ for the remaining edges in each step. Hence, we can repeatedly apply Lemma 15 to conclude that ϝ(T/e_1,λ″)=ϝ(T′,λ′). Together with ϝ(T,λ)=ϝ(T′,λ′), we obtain ϝ(T,λ)=ϝ(T/e_1,λ″), which completes the proof.

As a consequence of Lemma 17, it suffices to show that ϝ(T,λ)=ϝ(T/e,λ′) for some inner edge e∈E(T) and some labeling λ′ for T/e to show that (T,λ) is not a Fitch-least-resolved tree w.r.t. ϝ(T,λ). The next result characterizes Fitch-least-resolved trees and is very similar to the results for “directed” Fitch graphs of 0/1-edge-labeled trees (cf. Lemma 11(1,3) in Geiß et al. 2018). However, we note that we defined Fitch-least-resolved in terms of all possible labelings λ′ for trees T′ displayed by T, whereas Geiß et al. (2018) call (T,λ) least-resolved whenever (T/e,λ/e) results in a (directed) Fitch graph that differs from the one provided by (T,λ) for every e∈E(T).

Theorem 7

Let G be a Fitch graph, and (T,λ) be a tree such that G=ϝ(T,λ). If all independent sets of G are of size one (except possibly for one independent set), then (T,λ) is Fitch-least-resolved for G if and only if it is a star tree.

If G has at least two independent sets of size at least two, then (T,λ) is Fitch-least-resolved for G if and only if

  (a) every inner edge of (T,λ) is a 1-edge,

  (b) for every inner vertex v∈V^0(T) there are (at least) two relevantly-labeled outer 0-edges (v,x), (v,y) in (T,λ).

In particular, if distinct x,y∈L(T) are in the same independent set of G, then they have the same parent in T and (par(x),x), (par(x),y) are relevantly-labeled outer 0-edges.

Proof

Suppose that every independent set of G is of size one (except possibly for one). Let (T,λ) be the star tree where λ((ρ_T,v))=1 if and only if v is the single element in an independent set of size one. It is now a simple exercise to verify that G=ϝ(T,λ). Since (T,λ) is a star tree, it is clearly Fitch-least-resolved. The converse follows immediately from this construction together with the fact that the star tree is displayed by all trees with leaf set V(G). In the following we assume that G contains at least two independent sets of size at least two.

First suppose that (T,λ) is Fitch-least-resolved w.r.t. ϝ(T,λ). By Lemma 16, Condition (a) is satisfied. We continue with showing that Condition (b) is satisfied. In particular, we show first that every inner vertex v∈V^0(T) is incident to at least one relevantly-labeled outer 0-edge. To this end, assume, for contradiction, that (T,λ) contains an inner vertex v∈V^0(T) for which this property is not satisfied.

That is, v is either (i) incident to 1-edges only (incl. λ((par_T(v),v))=1 in case v≠ρ_T by Condition (a)) or (ii) there is an outer 0-edge (v,x) that is not relevantly-labeled. In Case (i), we put λ′=λ. In Case (ii), we obtain a new labeling λ′ by changing the label of every outer 0-edge (v,x) with x∈child_T(v)∩L(T) to “1” while keeping the labels of all other edges. This does not affect the Fitch graph, since every such 0-edge is not relevantly-labeled, and thus, zx∈E(ϝ(T,λ)) for all z∈L(T)\{x} by Lemma 14. Hence, for both Cases (i) and (ii), for the labeling λ′ all outer edges (v,x) with x∈child(v)∩L(T) are labeled as 1-edges, v is incident to 1-edges only (by Condition (a)) and ϝ(T,λ)=ϝ(T,λ′). We thus have xy∈E(ϝ(T,λ′))=E(ϝ(T,λ)) for all x∈L(T(v)) and y∈L(T)\L(T(v)). Now, if v≠ρ_T let e=(u:=par_T(v),v). Otherwise, if v=ρ_T then let e=(v,u) for some inner vertex u∈child_T(v). Note, such an inner edge (ρ_T,u) exists since G contains at least two independent sets of size at least two and T is not a star tree as shown above. Now consider the tree (T/e,λ′/e), and denote by w the vertex obtained by contraction of the inner edge e. By construction, every path in T/e connecting any x∈L(T(v)) and y∈L(T)\L(T(v)) must contain some 1-edge (w,w′) with w′∈child_{T/e}(w)=child_T(v) implying xy∈E(ϝ(T/e,λ′/e)). Moreover, the edge contraction does not affect whether or not the path between any vertices within L(T(v)) or within L(T)\L(T(v)) contains a 1-edge. Hence, ϝ(T,λ)=ϝ(T,λ′)=ϝ(T/e,λ′/e), and (T,λ) is not Fitch-least-resolved; a contradiction. In summary, every inner vertex v must be incident to at least one relevantly-labeled outer 0-edge (v,x). By Lemma 14, (v,x) is a relevantly-labeled outer 0-edge if and only if there is a vertex z∈L(T)\{x} such that zx∉E(ϝ(T,λ)). By Condition (a), all inner edges in (T,λ) are 1-edges, and thus, there is only one place where the leaf z can be located in T, namely as a leaf adjacent to v. In particular, the outer edge (v,z) is a relevantly-labeled 0-edge, since zx∉E(ϝ(T,λ)). Therefore, Condition (b) is satisfied for every inner vertex v of T.

The latter arguments also show that all distinct vertices x,y∈L(T) that are contained in the same independent set must have the same parent. Clearly, (par(x),x), (par(x),y) must be outer 0-edges, since otherwise xy∈E(ϝ(T,λ)). Hence, the final statement of the theorem is satisfied.

Now let (T,λ) be such that Conditions (a) and (b) are satisfied. First observe that none of the outer edges can be contracted without changing L(T). Now let e=(u,v) be an inner edge. By Condition (a), e is a 1-edge. Moreover, by Condition (b), the vertices u and v are both incident to at least two relevantly-labeled outer 0-edges. Hence, there are outer 0-edges (u,x),(u,x′),(v,y),(v,y′) with pairwise distinct leaves x,x′,y,y′ in T. Since (u,v) is a 1-edge, we have xy,xy′,x′y,x′y′∈E(ϝ(T,λ)). Moreover, we have xx′,yy′∉E(ϝ(T,λ)). Now consider the tree (T/e,λ′) with an arbitrary labeling λ′ and denote by w the vertex obtained by contraction of the inner edge (u,v). In this tree, x,x′,y,y′ all have the same parent w. If λ′((w,x))=1 or λ′((w,y))=1, we have xx′∈E(ϝ(T/e,λ′)) or yy′∈E(ϝ(T/e,λ′)), respectively. If λ′((w,x))=0 and λ′((w,y))=0, we have xy∉E(ϝ(T/e,λ′)). Hence, it holds ϝ(T/e,λ′)≠ϝ(T,λ) in both cases. Since the inner edge e and λ′ were chosen arbitrarily, we can apply Lemma 17 to conclude that (T,λ) is Fitch-least-resolved.

As a consequence of Theorem 7, Fitch-least-resolved trees can be constructed in polynomial time. To be more precise, if a Fitch graph G contains only independent sets of size one (except possibly for one), we can construct a star tree T with edge labeling λ as specified in the proof of Theorem 7 to obtain the 0/1-edge labeled tree (T,λ) that is Fitch-least-resolved w.r.t. G. This construction can be done in O(|V(G)|) time.

Now, assume that G has at least two independent sets of size at least two. Let ℐ be the set of independent sets of G and I_1,…,I_k∈ℐ, k≥2, be all independent sets of size at least two. We now construct a tree (T,λ) with root ρ_T as follows: First we add k vertices v_1=ρ_T and v_2,…,v_k, and add inner edges e_i=(v_i,v_{i+1}) with label λ(e_i)=1, 1≤i≤k-1. Each vertex v_i gets as children the leaves in I_i, 1≤i≤k, and all these additional outer edges obtain label “0”. Finally, all remaining independent sets in ℐ\{I_1,…,I_k} are of size one, and their elements are connected as leaves via outer 1-edges to the root v_1=ρ_T. It is an easy exercise to verify that T is a phylogenetic tree and that ϝ(T,λ)=G. In particular, Theorem 7 implies that (T,λ) is Fitch-least-resolved w.r.t. G. This construction can be done in O(|V(G)|) time. We summarize this discussion as

Proposition 7

For a given Fitch graph G, a Fitch-least-resolved tree can be constructed in O(|V(G)|) time.
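The construction behind Proposition 7 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that the independent sets of G are already given as a list of disjoint sets, that the tree is returned as child-to-parent and edge-label dictionaries, and that the inner vertex names "v1", "v2", … (hypothetical) do not clash with leaf names.

```python
def least_resolved_tree(parts):
    """parts: the independent sets of a Fitch graph G (disjoint vertex sets).
    Returns (parent, label), where label[v] is the 0/1 label of the edge
    (parent[v], v) and "v1" is the root."""
    big = [sorted(p) for p in parts if len(p) >= 2]
    small = [v for p in parts if len(p) == 1 for v in p]
    parent, label = {}, {}

    # Inner vertices v1 (root), v2, ..., vk form a path of 1-edges; vi gets
    # the leaves of the i-th large independent set via outer 0-edges.
    # With at most one large set this loop yields the star tree of Theorem 7.
    for i, p in enumerate(big, start=1):
        if i > 1:
            parent[f"v{i}"], label[f"v{i}"] = f"v{i - 1}", 1
        for v in p:
            parent[v], label[v] = f"v{i}", 0

    # Elements of independent sets of size one hang off the root via 1-edges.
    for v in small:
        parent[v], label[v] = "v1", 1
    return parent, label

# The graph of Example 3 has the independent sets {a, b'} and {b, c}.
parent, label = least_resolved_tree([{"a", "b'"}, {"b", "c"}])
assert parent == {"a": "v1", "b'": "v1", "v2": "v1", "b": "v2", "c": "v2"}
assert label == {"a": 0, "b'": 0, "v2": 1, "b": 0, "c": 0}
```

Each vertex of G is touched once, which matches the linear bound stated in the proposition. Listing the large sets in the opposite order gives the second Fitch-least-resolved tree of Example 3.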

Fitch-least-resolved trees, however, are only of very limited use for the construction of relaxed scenarios 𝒮=(T,S,σ,μ,τ_T,τ_S) from an underlying Fitch graph. First note that we would need to consider planted versions of Fitch-least-resolved trees, i.e., Fitch-least-resolved trees to which a planted root is added, since otherwise, such trees cannot be part of an explaining scenario, which is defined in terms of planted trees. Even though (G,σ) is an rs-Fitch graph, Example 3 shows that it is possible that there is no relaxed scenario 𝒮=(T,S,σ,μ,τ_T,τ_S) with HGT-labeling λ_𝒮 such that (T,λ)=(T,λ_𝒮) for the planted version (T,λ) of any of its Fitch-least-resolved trees.

Example 3

Consider the rs-Fitch graph (G,σ) with V(G)={a,b,b′,c}, E(G)={ab,ac,bb′,b′c} and surjective coloring σ such that σ(a)=A, σ(b)=σ(b′)=B, σ(c)=C and A, B, C are pairwise distinct. The rs-Fitch graph (G,σ), a Fitch tree (T,λ) and relaxed scenario 𝒮 with (T,λ)=(T,λ_𝒮) as well as the planted versions (T_1,λ_1) and (T_2,λ_2) of its two Fitch-least-resolved trees are shown in Fig. 20.

Fig. 20.

An rs-Fitch graph (G,σ) and a possible relaxed scenario 𝒮=(T,S,σ,μ,τ_T,τ_S) with G=ϝ(T,λ_𝒮). For the planted versions (T_1,λ_1) and (T_2,λ_2) of the Fitch-least-resolved trees of (G,σ) there is no relaxed scenario 𝒮 such that (T_i,λ_i)=(T_i,λ_𝒮), i∈{1,2}. Red edges indicate 1-labeled (i.e., transfer) edges. See Example 3 for further details

Fitch-least-resolved trees for (G,σ) must contain an inner 1-edge, since G has two independent sets of size two and by Theorem 7. Thus, it is easy to verify that there are no other Fitch-least-resolved trees for (G,σ).

By Lemma 13, we obtain lca_S(A,B)⪯_S μ(lca_{T_i}(a,b′)) and lca_S(B,C)⪯_S μ(lca_{T_i}(b,c)), i∈{1,2}, for both (planted versions of the) Fitch-least-resolved trees. However, for all of the possible species trees on three leaves A, B, C, this implies that the images μ(lca_{T_i}(a,b′)) and μ(lca_{T_i}(b,c)) are the single inner edge or the edge (0_S,ρ_S) in S. Therefore, μ(lca_{T_i}(a,b′)) and μ(lca_{T_i}(b,c)) are always comparable in S. Hence, for all possible relaxed scenarios 𝒮, we have λ_𝒮(e)=0 for the single inner edge e, whereas λ_i(e)=1 in T_i, i∈{1,2}. This implies that there is no relaxed scenario 𝒮 with (T_i,λ_i)=(T_i,λ_𝒮), i∈{1,2}.

Editing problems

Editing colored graphs to LDT graphs and Fitch graphs

We consider the following two edge modification problems for completion, deletion, and editing.

Problem 7

(LDT-Graph-Modification (LDT-M))

Input:

A colored graph (G=(V,E),σ) and an integer k.

Question:

Is there a subset F⊆E such that |F|≤k and (G′=(V,E⋆F),σ) is an LDT graph where ⋆∈{\,∪,Δ}?

Problem 8

(rs-Fitch Graph-Completion/Editing (rsF-C/E))

Input:

A colored graph (G=(V,E),σ) and an integer k.

Question:

Is there a subset F⊆E such that |F|≤k and (G′=(V,E⋆F),σ) is an rs-Fitch graph where ⋆∈{\,∪,Δ}?

NP-completeness of LDT-M can be shown by reduction from

Problem 9

(Maximum Rooted Triple Compatibility (MaxRTC))

Input:

A set of (rooted) triples R and an integer k.

Question:

Is there a compatible subset R′⊆R such that |R′|≥|R|-k?

Theorem 8

(Jansson 2001, Thm. 1) MaxRTC is NP-complete.

Theorem 9

LDT-M is NP-complete.

Proof

Since LDT graphs can be recognized in polynomial time (cf. Corollary 2), a given solution can be verified in polynomial time. Thus, LDT-M is contained in NP.

We now show NP-hardness by reduction from MaxRTC. Let (R,k) be an instance of this problem, i.e., R is a set of triples and k is a non-negative integer. We construct a colored graph (G_R=(L,E),σ) as follows: For each triple r_i=xy|z∈R, we add three vertices x_i, y_i, z_i, two edges x_i z_i and y_i z_i, and put σ(x_i)=x, σ(y_i)=y and σ(z_i)=z. Hence, (G_R,σ) is properly colored and the disjoint union of paths on three vertices P_3. In particular, therefore, (G_R,σ) does not contain an induced P_4, and is therefore a properly colored cograph (cf. Proposition 2). By definition and construction, we have R=S(G_R,σ).

First assume that MaxRTC with input (R,k) has a yes-answer. In this case let R′⊆R be a compatible subset such that |R′|≥|R|-k. For each of the triples r_i=xy|z∈R\R′, we add the edge x_i y_i to G_R or remove the edge x_i z_i from G_R for LDT-E/C and LDT-D, respectively, to obtain the graph G. In both cases, we eliminate the corresponding triple xy|z from S(G,σ). By construction, therefore, we observe that S(G,σ)=R′ is compatible. Moreover, since we have never added edges between distinct P_3s, all connected components of G are of size at most three. Therefore, G does not contain an induced P_4, and thus remains a cograph. By Theorem 3, the latter arguments imply that (G,σ) is an LDT graph. Since (G,σ) was obtained from (G_R,σ) by using |R\R′|≤k edge modifications, we conclude that LDT-M with input (G_R,σ,k) has a yes-answer.

For the converse, suppose that LDT-M with input (G_R,σ,k) has a yes-answer with a solution (G=(L,E⋆F),σ), i.e., (G,σ) is an LDT graph and |F|≤k. By Theorem 3, S(G,σ) is compatible. Let R′ be the subset of R=S(G_R,σ) containing all triples of R for which the corresponding induced P_3 in G_R remains unmodified and thus, is still an induced P_3 in G. By construction, we have R′⊆S(G,σ). Hence, R′ is compatible. Moreover, since |F|≤k, at most k of the vertex-disjoint P_3s have been modified. Therefore, we conclude that |R′|≥|R|-k.

In summary, LDT-M is NP-hard.
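The instance built in this reduction is easy to generate explicitly. The Python sketch below is only an illustration: the encoding of a triple xy|z as a tuple (x, y, z), the vertex names (x, i), and both function names are assumptions made here. The helper `triples_of` further assumes that S(G,σ) collects one triple xy|z for every induced path x′ z′ y′ on three distinct colors x, z, y, which is how the set is used in the proof above; its formal definition is given elsewhere in the paper.

```python
from itertools import combinations

def triples_to_colored_graph(triples):
    """For each triple xy|z (given as (x, y, z)) add a path x_i - z_i - y_i.
    Returns (vertices, edges, sigma) for the disjoint union of these P_3s."""
    vertices, edges, sigma = [], set(), {}
    for i, (x, y, z) in enumerate(triples, start=1):
        xi, yi, zi = (x, i), (y, i), (z, i)
        vertices += [xi, yi, zi]
        edges |= {frozenset((xi, zi)), frozenset((yi, zi))}
        sigma.update({xi: x, yi: y, zi: z})
    return vertices, edges, sigma

def triples_of(vertices, edges, sigma):
    """Triples read off induced P_3s with three distinct colors (assumed
    definition of S(G, sigma)); xy|z is encoded as (frozenset({x, y}), z)."""
    found = set()
    for z in vertices:
        nbrs = [v for v in vertices if frozenset((v, z)) in edges]
        for a, b in combinations(nbrs, 2):
            colors = {sigma[a], sigma[b], sigma[z]}
            if frozenset((a, b)) not in edges and len(colors) == 3:
                found.add((frozenset((sigma[a], sigma[b])), sigma[z]))
    return found

# Two triples ab|c and bc|d give two vertex-disjoint paths on three vertices.
R = [("a", "b", "c"), ("b", "c", "d")]
V, E, sigma = triples_to_colored_graph(R)
assert len(V) == 6 and len(E) == 4
assert triples_of(V, E, sigma) == {(frozenset("ab"), "c"), (frozenset("bc"), "d")}

# Adding the edge x_1 y_1 (the completion/editing step) removes ab|c.
E_mod = E | {frozenset((("a", 1), ("b", 1)))}
assert triples_of(V, E_mod, sigma) == {(frozenset("bc"), "d")}
```

One edge modification per discarded triple suffices, which is why the budget k carries over unchanged between the two problems.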

Theorem 10

rsF-C and rsF-E are NP-complete.

Proof

Since rs-Fitch graphs can be recognized in polynomial time, a given solution can be verified as being a yes- or no-answer in polynomial time. Thus, rsF-C/E∈NP.

Consider an arbitrary graph G and an integer k. We construct an instance (G,σ,k) of rsF-C/E by coloring all vertices distinctly. Then condition (ii) in Theorem 6 is always satisfied. To see this, we note that for k>1 there are no edges between colors in the auxiliary graph A_ϝ(σ,ℐ) such that their corresponding unique vertices are in distinct independent sets I,I′∈ℐ. The problem therefore reduces to completion/editing of (G,σ) to a complete multipartite graph, which is equivalent to a complementary deletion/editing of the complement of G, with the same bound k, to a disjoint union of cliques, i.e., a cluster graph. Both Cluster Deletion and Cluster Editing are NP-hard (Shamir et al. 2004).

Although Cluster Completion is polynomial (it is solved by computing the transitive closure), rsF-D remains open: Consider a colored complete multipartite graph (G,σ) that is not an rs-Fitch graph. Then solving Cluster Completion on the complement returns (G,σ), which by construction is not a solution to rsF-D.

Editing LDT graphs to Fitch graphs

Lemma 18

There is a linear-time algorithm to solve Problem 3 for every cograph G.

Proof

Instead of inserting in the cograph G the minimum number of edges necessary to reach a complete multipartite graph, we consider the equivalent problem of deleting a minimal set Q of edges from its complement G¯, which is also a cograph, to obtain the complement of a complete multipartite graph, i.e., the disjoint union of complete graphs. This problem is known as the Cluster Deletion problem (Shamir et al. 2004), which is known to have a polynomial-time solution for cographs (Gao et al. 2013): A greedy maximum clique partition of G is obtained by recursively removing a maximum clique K from G, see also Dessmark et al. (2007). For cographs, the greedy maximum clique partitions are the solutions of the Cluster Deletion problem (Gao et al. 2013, Thm. 1). The Maximum Clique problem on cographs can be solved in linear time using the co-tree of G (Corneil et al. 1981a), which can also be obtained in linear time (Corneil et al. 1981a).

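An efficient algorithm to solve the Cluster Deletion problem for cographs can be devised by making use of the recursive construction of a cograph along its discriminating cotree (T,t), i.e., a disjoint union at vertices with t(u)=0 and a join at vertices with t(u)=1. For all u∈V(T), we have

\[
G[L(T(u))] =
\begin{cases}
\dot{\bigcup}_{v\in\mathrm{child}(u)} G[L(T(v))] & \text{if } t(u)=0,\\
\bigtriangledown_{v\in\mathrm{child}(u)} G[L(T(v))] & \text{if } t(u)=1,\\
(\{u\},\emptyset) & \text{if } u \text{ is a leaf.}
\end{cases}
\]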

Denote by P(u) the optimal clique partition of the cograph implied by the subtree T(u) of the discriminating cotree (T,t). We think of P(u):=[Q_1(u),Q_2(u),…] as an ordered list, such that |Q_i(u)|≥|Q_j(u)| if i<j. It will be convenient to assume that the list contains an arbitrary number of empty sets acting as an identity element for the join and disjoint union operation. With this convention, the optimal clique partitions P(u) satisfy the recursion:

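\[
P(u) =
\begin{cases}
\bigcup_{v\in\mathrm{child}(u)} P(v) & \text{if } t(u)=0,\\
\left[\,\bigcup_{v\in\mathrm{child}(u)} Q_i(v) \;\middle|\; i=1,2,\dots\right] & \text{if } t(u)=1,\\
[\{u\},\emptyset,\dots] & \text{if } u \text{ is a leaf.}
\end{cases}
\]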

In the first case, where t(u)=0, we assume that the union operation to obtain P(u)=[Q_1(u),Q_2(u),…] maintains the property |Q_i(u)|≥|Q_j(u)| if i<j. In an implementation, this can e.g. be achieved using k-way merging where k=|child(u)|.

To see that the recursion is correct, it suffices to recall that the greedy clique partition is optimal for cographs as input (Gao et al. 2013) and to observe the following simple properties of cliques in cographs (Corneil et al. 1981a): (i) a largest clique in a disjoint union of graphs is also a largest clique in one of its components. The optimal clique partition of a disjoint union of graphs is, therefore, the union of the optimal clique partitions of the constituent connected components. (ii) For a join of two or more graphs G_i, each maximum size clique Q is the join of maximum size cliques Q_i, one from each constituent G_i. The next largest clique disjoint from Q is, thus, the join of largest cliques disjoint from Q_i in each constituent graph G_i. Thus a greedy clique partition of G is obtained by size ordering the clique partitions of G_i and joining the k-th largest cliques from each.

The recursive construction of P(ρ_T) operates directly on the discriminating cotree (T,t) of the cograph G. For each node u, the effort is proportional to |L(T(u))| log(deg(u)) for the deg(u)-wise merge sort step if t(u)=0 and proportional to |L(T(u))| for the merging of the k-th largest clusters for t(u)=1. Using ∑_u deg(u)|L(T(u))| ≤ |L(T)| ∑_u deg(u) ≤ |L(T)|·2|E(T)| together with |E(T)|=|V(T)|-1 and |V(T)|≤2|L(T)|-1 (cf. Hellmuth et al. 2015, Lemma 1), we obtain ∑_u deg(u)|L(T(u))| ∈ O(|L(T)|^2)=O(|V(G)|^2), that is, a quadratic upper bound on the running time.
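The recursion for P(u) translates almost literally into code. The following Python sketch is a minimal illustration under assumptions made here: a cotree is encoded as nested tuples (t, children) with vertex names as leaves, and the function name `clique_partition` is hypothetical. The k-way merge for t(u)=0 uses `heapq.merge`, and the padding with empty sets for t(u)=1 uses `itertools.zip_longest`.

```python
from heapq import merge
from itertools import zip_longest

def clique_partition(node):
    """Greedy clique partition of the cograph encoded by a cotree.
    A leaf is a vertex name; an inner node is a tuple (t, children) with
    t = 0 for disjoint union and t = 1 for join.
    Returns the cliques as a list of sets, largest first."""
    if not isinstance(node, tuple):
        return [{node}]                      # leaf: a single one-vertex clique
    t, children = node
    parts = [clique_partition(c) for c in children]
    if t == 0:
        # disjoint union: k-way merge of the size-sorted lists
        return list(merge(*parts, key=len, reverse=True))
    # join: unite the i-th largest cliques of all children; shorter lists
    # are padded with empty sets, the identity element of the convention
    return [set().union(*group)
            for group in zip_longest(*parts, fillvalue=frozenset())]

# Disjoint union of a triangle and an edge: the partition is the two cliques.
cotree = (0, [(1, ["a", "b", "c"]), (1, ["d", "e"])])
assert clique_partition(cotree) == [{"a", "b", "c"}, {"d", "e"}]

# Join of two edgeless graphs (K_{2,3}): two cliques of size 2, one of size 1.
bipartite = (1, [(0, ["a", "b"]), (0, ["c", "d", "e"])])
assert [len(q) for q in clique_partition(bipartite)] == [2, 2, 1]
```

The cliques of the returned partition are the clusters that remain after Cluster Deletion; the deleted edges are exactly those running between different cliques.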

Funding

Open access funding provided by Stockholm University.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

David Schaller, Email: sdavid@bioinf.uni-leipzig.de.

Manuel Lafond, Email: manuel.lafond@USherbrooke.ca.

Peter F. Stadler, Email: studla@bioinf.uni-leipzig.de

Nicolas Wieseke, Email: wieseke@informatik.uni-leipzig.de.

Marc Hellmuth, Email: marc.hellmuth@math.su.se.

References

  1. Acuña R, Padilla BE, Flórez-Ramos CP, Rubio JD, Herrera JC, Benavides P, Lee SJ, Yeats TH, Egan AN, Doyle JJ, Rose JKC. Adaptive horizontal transfer of a bacterial gene to an invasive insect pest of coffee. Proc Natl Acad Sci USA. 2012;109(11):4197–4202. doi: 10.1073/pnas.1121190109.
  2. Aho A, Sagiv Y, Szymanski T, Ullman J. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981;10:405–421. doi: 10.1137/0210030.
  3. Bansal MS, Alm EJ, Kellis M (2012) Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics 28:i283–i291. 10.1093/bioinformatics/bts225
  4. Becq J, Churlaud C, Deschavanne P. A benchmark of parametric methods for horizontal transfers detection. PLoS ONE. 2010;5:e9989. doi: 10.1371/journal.pone.0009989.
  5. Bryant D, Steel M. Extension operations on sets of leaf-labelled trees. Adv Appl Math. 1995;16(4):425–453. doi: 10.1006/aama.1995.1020.
  6. Burzyn P, Bonomo F, Durán G. NP-completeness results for edge modification problems. Discrete Appl Math. 2006;154:1824–1844. doi: 10.1016/j.dam.2006.03.031.
  7. Charleston MA. Jungles: a new solution to the host-parasite phylogeny reconciliation problem. Math Biosci. 1998;149:191–223. doi: 10.1016/S0025-5564(97)10012-8.
  8. Charleston MA, Perkins SL. Traversing the tangle: algorithms and applications for cophylogenetic studies. J Biomed Inform. 2006;39:62–71. doi: 10.1016/j.jbi.2005.08.006.
  9. Chen ZZ, Deng F, Wang L. Simultaneous identification of duplications losses and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinform. 2012 doi: 10.1109/TCBB.2012.79.
  10. Choi SC, Rasmussen MD, Hubisz MJ, Gronau I, Stanhope MJ, Siepel A. Replacing and additive horizontal gene transfer in Streptococcus. Mol Biol Evol. 2012;29:3309–3320. doi: 10.1093/molbev/mss138.
  11. Clarke GDP, Beiko RG, Ragan MA, Charlebois RL. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol. 2002;184:2072–2080. doi: 10.1128/JB.184.8.2072-2080.2002.
  12. Corneil DG, Lerchs H, Steward Burlingham L. Complement reducible graphs. Discrete Appl Math. 1981;3:163–174. doi: 10.1016/0166-218X(81)90013-5.
  13. Corneil DG, Perl Y, Stewart KL. A linear recognition algorithm for cographs. SIAM J Comput. 1981;14:926–934. doi: 10.1137/0214065.
  14. Crespelle C (2019) Linear-time minimal cograph editing. http://perso.ens-lyon.fr/christophe.crespelle/publications/SUB_minimal-cograph-editing.pdf
  15. Darby CA, Stolzer M, Ropp PJ, Barker D, Durand D. Xenolog classification. Bioinformatics. 2017;33:640–649. doi: 10.1093/bioinformatics/btw686.
  16. Dekker MCH (1986) Reconstruction methods for derivation trees. Master’s thesis, Vrije Universiteit, Amsterdam, NL
  17. Dessimoz C, Margadant D, Gonnet GH (2008) DLIGHT—lateral gene transfer detection using pairwise evolutionary distances in a statistical framework. In: RECOMB 2008: research in computational molecular biology, vol 4955. Springer, Heidelberg, pp 315–330. 10.1007/978-3-540-78839-3_27
  18. Dessmark A, Lingas A, Lundell EM, Persson M, Jansson J. On the approximability of maximum and minimum edge clique partition problems. Int J Found Comput Sci. 2007;18:217–226. doi: 10.1142/S0129054107004656.
  19. Dondi R, Lafond M, El-Mabrouk N. Approximating the correction of weighted and unweighted orthology and paralogy relations. Algorithm Mol Biol. 2017;12(1):4. doi: 10.1186/s13015-017-0096-x.
  20. Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res. 2005;33:e6. doi: 10.1093/nar/gni004.
  21. Fitch WM. Homology: a personal view on some of the problems. Trends Genet. 2000;16:227–231. doi: 10.1016/S0168-9525(00)02005-9.
  22. Gao Y, Hare DR, Nastos J. The cluster deletion problem for cographs. Discrete Math. 2013;313(23):2763–2771. doi: 10.1016/j.disc.2013.08.017.
  23. Geiß M, Anders J, Stadler PF, Wieseke N, Hellmuth M. Reconstructing gene trees from Fitch’s xenology relation. J Math Biol. 2018;77:1459–1491. doi: 10.1007/s00285-018-1260-8.
  24. Geiß M, Chávez E, González Laffitte M, López Sánchez A, Stadler BMR, Valdivia DI, Hellmuth M, Hernández Rosales M, Stadler PF. Best match graphs. J Math Biol. 2019;78:2015–2057. doi: 10.1007/s00285-019-01332-9.
  25. Geiß M, González Laffitte ME, López Sánchez A, Valdivia DI, Hellmuth M, Hernández Rosales M, Stadler PF. Best match graphs and reconciliation of gene trees with species trees. J Math Biol. 2020;80:1459–1495. doi: 10.1007/s00285-020-01469-y.
  26. Geiß M, Stadler PF, Hellmuth M. Reciprocal best match graphs. J Math Biol. 2020;80:865–953. doi: 10.1007/s00285-019-01444-2.
  27. Gorbunov KY, Lyubetsky VA. Reconstructing the evolution of genes along the species tree. Mol Biol. 2009;43:881–893. doi: 10.1134/S0026893309050197.
  28. Górecki P. H-trees: a model of evolutionary scenarios with horizontal gene transfer. Fund Inform. 2010;103:105–128. doi: 10.3233/FI-2010-321.
  29. Górecki P, Tiuryn J. DLS-trees: a model of evolutionary scenarios. Theor Comput Sci. 2006;359:378–399. doi: 10.1016/j.tcs.2006.05.019.
  30. Górecki P, Tiuryn J (2012) Inferring evolutionary scenarios in the duplication, loss and horizontal gene transfer model. In: Constable RL, Silva A (eds) Logic and program semantics, lecture notes computer science, vol 7230. Springer, Berlin, Heidelberg, pp 83–105. 10.1007/978-3-642-29485-3_7
  31. Guigó R, Muchnik I, Smith TF. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996;6:189–213. doi: 10.1006/mpev.1996.0071.
  32. Hallett MT, Lagergren J (2001) Efficient algorithms for lateral gene transfer problems. In: RECOMB ’01: proceedings of the fifth annual international conference on computational biology. Association for Computing Machinery, New York, NY, pp 149–156. 10.1145/369133.369188
  33. Hasić D, Tannier E. Gene tree reconciliation including transfers with replacement is NP-hard and FPT. J Comb Optim. 2019;38:502–544. doi: 10.1007/s10878-019-00396-z.
  34. Hellmuth M. Biologically feasible gene trees, reconciliation maps and informative triples. Algorithms Mol Biol. 2017;12:23. doi: 10.1186/s13015-017-0114-z.
  35. Hellmuth M, Seemann CR. Alternative characterizations of Fitch’s xenology relation. J Math Biol. 2019;79:969–986. doi: 10.1007/s00285-019-01384-x.
  36. Hellmuth M, Hernández-Rosales M, Huber KT, Moulton V, Stadler PF, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013;66:399–420. doi: 10.1007/s00285-012-0525-x.
  37. Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. Proc Natl Acad Sci USA. 2015;112:2058–2063. doi: 10.1073/pnas.1412770112.
  38. Hellmuth M, Stadler PF, Wieseke N. The mathematics of xenology: di-cographs, symbolic ultrametrics, 2-structures and tree-representable systems of binary relations. J Math Biol. 2017;75:199–237. doi: 10.1007/s00285-016-1084-3.
  39. Hellmuth M, Long Y, Geiß M, Stadler PF. A short note on undirected Fitch graphs. Art Discrete Appl Math. 2018;1:P1.08. doi: 10.26493/2590-9770.1245.98c.
  40. Hellmuth M, Fritz A, Wieseke N, Stadler PF (2020a) Techniques for the cograph editing problem: Module merge is equivalent to edit P4’s. Art Discrete Appl Math 3:#P2.01. 10.26493/2590-9770.1252.e71
  41. Hellmuth M, Geiß M, Stadler PF. Complexity of modification problems for reciprocal best match graphs. Theor Comput Sci. 2020;809:384–393. doi: 10.1016/j.tcs.2019.12.033.
  42. Hernández-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler PF. From event-labeled gene trees to species trees. BMC Bioinform. 2012;13(Suppl. 19):S6. doi: 10.1186/1471-2105-13-S19-S6.
  43. Husnik F, McCutcheon JP. Functional horizontal gene transfer from bacteria to eukaryotes. Nat Rev Microbiol. 2018;16:67–79. doi: 10.1038/nrmicro.2017.137.
  44. Jansson J. On the complexity of inferring rooted evolutionary trees. Electron Notes Discrete Math. 2001;7:50–53. doi: 10.1016/S1571-0653(04)00222-7.
  45. Jansson J, Ng JH, Sadakane K, Sung WK. Rooted maximum agreement supertrees. Algorithmica. 2005;43:293–307. doi: 10.1007/s00453-004-1147-5.
  46. Jansson J, Lemence RS, Lingas A. The complexity of inferring a minimally resolved phylogenetic supertree. SIAM J Comput. 2012;41:272–291. doi: 10.1137/100811489.
  47. Kanhere A, Vingron M. Horizontal gene transfers in prokaryotes show differential preferences for metabolic and translational genes. BMC Evol Biol. 2009;9:9. doi: 10.1186/1471-2148-9-9.
  48. Keeling PJ, Palmer JD. Horizontal gene transfer in eukaryotic evolution. Nat Rev Genet. 2008;9:605–618. doi: 10.1038/nrg2386.
  49. Keller-Schmidt S, Klemm K. A model of macroevolution as a branching process based on innovations. Adv Complex Syst. 2012;15:1250043. doi: 10.1142/S0219525912500439.
  50. Khan MA, Mahmudi O, Ullah I, Arvestad L, Lagergren J. Probabilistic inference of lateral gene transfer events. BMC Bioinform. 2016;17:431. doi: 10.1186/s12859-016-1268-2.
  51. Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics. 2014;15:S12. doi: 10.1186/1471-2164-15-S6-S12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Lafond M, Hellmuth M. Reconstruction of time-consistent species trees. Algorithms Mol Biol. 2020;15:16. doi: 10.1186/s13015-020-00175-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Lafond M, Dondi RD, El-Mabrouk N. The link between orthology relations and gene trees: a correction perspective. Algorithms Mol Biol. 2016;11:4. doi: 10.1186/s13015-016-0067-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Lawrence JG, Hartl DL. Inference of horizontal genetic transfer from molecular data: an approach using the bootstrap. Genetics. 1992;131:753–760. doi: 10.1093/genetics/131.3.753. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Li FW, Villarreal JC, Kelly S, Rothfels CJ, Melkonian M, Frangedakis E, Ruhsam M, Sigel EM, Der JP, Pittermann J, Burge DO, Pokorny L, Larsson A, Chen T, Weststrand S, Thomas P, Carpenter E, Zhang Y, Tian Z, Chen L, Yan Z, Zhu Y, Sun X, Wang J, Stevenson DW, Crandall-Stotler BJ, Shaw AJ, Deyholos MK, Soltis DE, Graham SW, Windham MD, Langdale JA, Wong GKS, Mathews S, Pryer KM. Horizontal transfer of an adaptive chimeric photoreceptor from bryophytes to ferns. Proc Natl Acad Sci USA. 2014;111(18):6672–6677. doi: 10.1073/pnas.1319929111.
  56. Ma W, Smirnov D, Forman J, Schweickart A, Slocum C, Srinivasan S, Libeskind-Hadas R. DTL-RnB: algorithms and tools for summarizing the space of DTL reconciliations. IEEE/ACM Trans Comput Biol Bioinform. 2018;15:411–421. doi: 10.1109/TCBB.2016.2537319.
  57. Merkle D, Middendorf M. Reconstruction of the cophylogenetic history of related phylogenetic trees with divergence timing information. Theory Biosci. 2005;123:277–299. doi: 10.1016/j.thbio.2005.01.003.
  58. Moran NA, Jarvik T. Lateral transfer of genes from fungi underlies carotenoid production in aphids. Science. 2010;328(5978):624–627. doi: 10.1126/science.1187113.
  59. Nelson-Sathi S, Sousa FL, Roettger M, Lozada-Chávez N, Thiergart T, Janssen A, Bryant D, Landan G, Schönheit P, Siebers B, McInerney JO, Martin WF. Origins of major archaeal clades correspond to gene acquisitions from bacteria. Nature. 2015;517:77–80. doi: 10.1038/nature13805.
  60. Nøjgaard N, Geiß M, Merkle D, Stadler PF, Wieseke N, Hellmuth M. Time-consistent reconciliation maps and forbidden time travel. Algorithms Mol Biol. 2018;13:2. doi: 10.1186/s13015-018-0121-8.
  61. Novichkov PS, Omelchenko MV, Gelfand MS, Mironov AA, Wolf YI, Koonin EV. Genome-wide molecular clock and horizontal gene transfer in bacterial evolution. J Bacteriol. 2004;186:6575–6585. doi: 10.1128/JB.186.19.6575-6585.2004.
  62. Ovadia Y, Fielder D, Conow C, Libeskind-Hadas R. The cophylogeny reconstruction problem is NP-complete. J Comput Biol. 2011;18:59–65. doi: 10.1089/cmb.2009.0240.
  63. Page RDM. Parallel phylogenies: reconstructing the history of host-parasite assemblages. Cladistics. 1994;10:155–173. doi: 10.1111/j.1096-0031.1994.tb00170.x.
  64. Ravenhall M, Škunca N, Lassalle F, Dessimoz C. Inferring horizontal gene transfer. PLoS Comput Biol. 2015;11:e1004095. doi: 10.1371/journal.pcbi.1004095.
  65. Sánchez-Soto D, Agüero-Chapin G, Armijos-Jaramillo V, Perez-Castillo Y, Tejera E, Antunes A, Sánchez-Rodríguez A. ShadowCaster: compositional methods under the shadow of phylogenetic models to detect horizontal gene transfers in prokaryotes. Genes. 2020;11:756. doi: 10.3390/genes11070756.
  66. Schaller D, Geiß M, Chávez E, González Laffitte M, López Sánchez A, Stadler BMR, Valdivia DI, Hellmuth M, Hernández Rosales M, Stadler PF. Corrigendum to “Best Match Graphs”. J Math Biol. 2021. doi: 10.1007/s00285-021-01601-6.
  67. Schaller D, Geiß M, Stadler PF, Hellmuth M. Complete characterization of incorrect orthology assignments in best match graphs. J Math Biol. 2021;82:20. doi: 10.1007/s00285-021-01564-8.
  68. Schaller D, Stadler PF, Hellmuth M. Complexity of modification problems for best match graphs. Theor Comput Sci. 2021;865:63–84. doi: 10.1016/j.tcs.2021.02.037.
  69. Schönknecht G, Chen WH, Ternes CM, Barbier GG, Shrestha RP, Stanke M, Bräutigam A, Baker BJ, Banfield JF, Garavito RM, Carr K, Wilkerson C, Rensing SA, Gagneul D, Dickenson NE, Oesterhelt C, Lercher MJ, Weber APM. Gene transfer from bacteria and archaea facilitated evolution of an extremophilic eukaryote. Science. 2013;339(6124):1207–1210. doi: 10.1126/science.1231707.
  70. Sevillya G, Adato O, Snir S. Detecting horizontal gene transfer: a probabilistic approach. BMC Genomics. 2020;21:106. doi: 10.1186/s12864-019-6395-5.
  71. Shamir R, Sharan R, Tsur D. Cluster graph modification problems. Discrete Appl Math. 2004;144(1–2):173–182. doi: 10.1016/j.dam.2004.01.007.
  72. Sjöstrand J, Tofigh A, Daubin V, Arvestad L, Sennblad B, Lagergren J. A Bayesian method for analyzing lateral gene transfer. Syst Biol. 2014;63:409–420. doi: 10.1093/sysbio/syu007.
  73. Soucy SM, Huang J, Gogarten JP. Horizontal gene transfer: building the web of life. Nat Rev Genet. 2015;16:472–482. doi: 10.1038/nrg3962.
  74. Stadler PF, Geiß M, Schaller D, López A, Gonzalez Laffitte M, Valdivia D, Hellmuth M, Hernández Rosales M. From pairs of most similar sequences to phylogenetic best matches. Algorithms Mol Biol. 2020;15:5. doi: 10.1186/s13015-020-00165-2.
  75. Thomas CM, Nielsen KM. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat Rev Microbiol. 2005;3:711–721. doi: 10.1038/nrmicro1234.
  76. Tofigh A, Hallett M, Lagergren J. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(2):517–535. doi: 10.1109/TCBB.2010.14.
  77. Wieseke N, Bernt M, Middendorf M. Unifying parsimonious tree reconciliation. In: Darling A, Stoye J, editors. Algorithms in bioinformatics. WABI 2013. Lecture notes in computer science, vol 8126. Berlin, Heidelberg: Springer; 2013. doi: 10.1007/978-3-642-40453-5_16.
  78. Williams D, Gogarten JP, Papke RT. Quantifying homologous replacement of loci between haloarchaeal species. Genome Biol Evol. 2012;4:1223–1244. doi: 10.1093/gbe/evs098.
  79. Zverovich IE. Near-complete multipartite graphs and forbidden induced subgraphs. Discrete Math. 1999;207:257–262. doi: 10.1016/S0012-365X(99)00050-3.

Articles from Journal of Mathematical Biology are provided here courtesy of Springer