Bioinformatics. 2021 Jul 12;37(Suppl 1):i120–i132. doi: 10.1093/bioinformatics/btab328

Gene tree and species tree reconciliation with endosymbiotic gene transfer

Yoann Anselmetti 1, Nadia El-Mabrouk 2, Manuel Lafond 1, Aïda Ouangraoua 1
PMCID: PMC8312264  PMID: 34252921

Abstract

Motivation

It is largely established that all extant mitochondria originated from a unique endosymbiotic event integrating an α-proteobacterial genome into a eukaryotic cell. Subsequently, eukaryote evolution has been marked by episodes of gene transfer, mainly from the mitochondria to the nucleus, resulting in a significant reduction of the mitochondrial genome, which has eventually disappeared completely in some lineages. However, in other lineages such as land plants, a high variability in gene repertoire distribution, including genes encoded in both the nuclear and mitochondrial genomes, indicates an ongoing process of Endosymbiotic Gene Transfer (EGT). Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is expected to shed light on a number of open questions regarding the evolution of eukaryotes, including the rooting of the eukaryotic tree.

Results

We address the problem of inferring the evolution of a gene family through duplication, loss and EGT events, the latter considered as a special case of horizontal gene transfer occurring between the mitochondrial and nuclear genomes of the same species (in one direction or the other). We consider both EGT events that maintain the gene copy in the source genome (EGTcopy) and those that remove it (EGTcut). We present a linear-time algorithm for computing the DLE (Duplication, Loss and EGT) distance, as well as an optimal reconciled tree, for the unitary cost, and a dynamic programming algorithm that outputs all optimal reconciliations for an arbitrary cost of operations. We illustrate the application of our EndoRex software on a plant dataset, analyze different cost settings and discuss the resulting reconciled trees.

Availability and implementation

EndoRex implementation and supporting data are available on the GitHub repository via https://github.com/AEVO-lab/EndoRex.

1 Introduction

Genomics and cell biology investigations have revealed that all known eukaryotes descend from a common ancestral mitochondrial-containing cell that originated from the integration of an endosymbiotic α-proteobacterium into a host cell (Dyall and Johnson, 2000). After this early event, eukaryotic gene contents have been shaped by duplications, losses and Horizontal Gene Transfers (HGT) from one species to another, but also by Endosymbiotic Gene Transfers (EGT), mainly from the mitochondrion to the nucleus, in some cases leading to the total disappearance of the mitochondrion (Roger et al., 2017; Sloan et al., 2018).

Many questions regarding the ancestral mitochondrial proteome and gene content evolution remain open (Lang and Burger, 2012). One of the reasons is that, to date, comparative genomics studies have largely focused on multicellular eukaryotes, mainly animals and plants. While imprints of global evolutionary events at the genomic level are hardly visible in multicellular eukaryotes that have diverged too much from the Last Eukaryotic Common Ancestor (LECA), protists, known to have emerged close to the eukaryotic origin, are better candidates for such a comprehensive evolutionary study. Interestingly, a recent sequencing effort on jakobid (Gray et al., 2020) and malawimonad (Derelle et al., 2015) protist genomes has been undertaken by a consortium of protistologists (DeepEuk), suggesting that enough data will soon be available to allow further investigations on early eukaryotic evolution.

In addition to having the appropriate datasets, understanding the concerted evolution of the eukaryotic mitochondrial and nuclear genomes also requires having the appropriate algorithmic tools. This problem can be seen as related to the host-parasite coevolution inference problem (Charleston and Perkins, 2006). Given a host tree and a parasite tree, cophylogenetic analysis consists in inferring a history of codivergence, parasite duplication, host switch or extinction events explaining the coevolution of hosts and parasites. However, nuclear and mitochondrial genomes can hardly be treated by the same kind of approach, as they evolve, through a different evolutionary model, together in the same species, and thus are related through the same species tree. Rather, inferring an endosymbiotic evolutionary history requires focusing on gene families and studying the movement of genes between the mitochondrial and nuclear genomes.

Inferring the evolution of gene families is the purpose of the gene-tree-species-tree reconciliation field, which seeks a most parsimonious (El-Mabrouk and Noutahi, 2019; Goodman et al., 1979), or a most probable (Akerborg et al., 2009; Szöllősi et al., 2015) evolutionary scenario of gene gain and loss explaining the incongruence between a gene tree and a species tree. A most parsimonious reconciliation minimizing the number of Duplications (the D-distance) or the number of Duplications and Losses (the DL-distance) can be found in linear time using the LCA (Last Common Ancestor) mapping (Chen, 2000; Zhang, 1997; Zmasek and Eddy, 2001). Such an algorithm can actually be used to solve the cophylogenetic problem if operations are restricted to coevolution, duplication and extinction. Including HGT events (i.e. finding the DTL-distance) leads to an NP-hard problem if time-consistency is required, and remains polynomial otherwise (Bansal et al., 2012; Tofigh et al., 2011).

In this article, we introduce the reconciliation model accounting for EGT events, i.e. the special case of HGT events where genes are exchanged only between the mitochondrial and nuclear genomes of the same species. Although integration of the mitochondrial content into the nucleus is the most frequent event in the course of evolution of eukaryotes, the transfer from the nucleus to the mitochondrion has also been observed (Adams and Palmer, 2003). Here, we consider the exchange of genes in both directions. Moreover, we consider EGT events resulting in maintaining a gene copy in the source genome (EGTcopy), as well as those resulting in the removal or loss of function of the gene in the source genome (EGTcut).

Formally, given a gene tree for a gene family with a known mitochondrial or nuclear location for each gene copy, we seek a most parsimonious sequence of Duplication, Loss and EGT (DLE) events explaining the tree given a known species tree. First, based on the DL-distance and on the Fitch algorithm for weighted parsimony, we present, in Section 3, a linear-time algorithm for computing the DLE-Distance, as well as an optimal reconciled tree for the unitary cost. We then develop, in Section 4, a general dynamic programming algorithm that can be used to output all optimal reconciliations, for an arbitrary cost of operations, including possibly a different cost for an EGT from the mitochondrion to the nucleus, or conversely. This algorithm is linear in the size of the gene tree. It can be seen as an adaptation of the quadratic-time DTL algorithm for dated trees (Doyon et al., 2010), which allows transfers between any co-existing species. We finally illustrate, in Section 5, the application of our EndoRex software on clusters of orthologous mitochondrial protein-coding genes (MitoCOGs) (Kannan et al., 2014) of plants, analyze different cost settings and discuss the obtained reconciled trees.

For space reasons, some of the proofs are given in the Appendix.

2 Preliminaries

All trees are considered rooted. Given a tree T, we denote by r(T) its root, by V(T) its set of nodes and by $\mathcal{L}(T) \subseteq V(T)$ its leafset. A node $x'$ is a descendant of x if $x'$ is on the path from x to a leaf of T, and an ancestor of x if $x'$ is on the path from r(T) to x; $x'$ is a strict descendant (respectively strict ancestor) of x if it is a descendant (respectively ancestor) of x different from x. Moreover, x is the parent of $x' \neq r(T)$ if it directly precedes $x'$ on the path from $x'$ to r(T). In this latter case, $x'$ is a child of x. We denote by E(T) the set of edges of T, where an edge is represented by its two terminal nodes $(x, x')$, with x being the parent of $x'$. An internal node (a node which is not a leaf) is said to be unary if it has a single child and binary if it has two children. If not stated differently, the children of a binary node x are denoted $x_l$ and $x_r$. Given a node x of T, the subtree of T rooted at x is denoted T[x].

A binary tree is a tree in which all internal nodes are binary. If internal nodes have one or two children, then the tree is said to be partially binary.

The lowest common ancestor (LCA) in T of a subset L of $\mathcal{L}(T)$, denoted $\mathrm{lca}_T(L)$, is the ancestor common to all the nodes in L that is the most distant from the root.
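The LCA definition can be illustrated with a minimal sketch. It assumes a tree stored as a child-to-parent dictionary; the function name `lca` and the toy species tree are hypothetical and chosen only for illustration. This naive version is not the constant-time LCA structure that linear-time reconciliation algorithms rely on.

```python
def lca(parent, nodes):
    """Lowest common ancestor of `nodes` in a rooted tree.

    `parent` maps each non-root node to its parent (assumed representation).
    A node counts as its own ancestor, as in the definitions above.
    """
    def ancestors(v):
        # Path from v up to the root, v included, deepest node first.
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path

    nodes = list(nodes)
    common = ancestors(nodes[0])
    for v in nodes[1:]:
        keep = set(ancestors(v))
        # Preserve the deepest-first order while intersecting.
        common = [a for a in common if a in keep]
    return common[0]  # the common ancestor most distant from the root


# Hypothetical species tree ((A,B)AB,C)root.
parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
```

For instance, `lca(parent, ["A", "B"])` returns the node `AB`, while adding `C` to the set moves the result up to `root`.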

A tree R is an extension of a tree T if it is obtained from T by grafting unary or binary nodes in T, where grafting a unary node x on an edge (u, v) consists in creating a new node x, removing the edge (u, v) and creating two edges (u, x) and (x, v), and in the case of grafting a binary node, also creating a new leaf y and an edge (x, y). In the latter case, we say that y is a grafted leaf.

Species and gene trees: The species tree S for a set Σ of species represents a partially ordered set of speciation events that have led to Σ. In this article, we consider that each species $\sigma \in \Sigma$ has two genomes: $\sigma_0$ corresponding to its mitochondrial genome and $\sigma_1$ corresponding to its nuclear genome.

A gene family is a set Γ of genes where each gene x belongs to a given species s(x) of Σ. A tree T is a gene tree for a gene family Γ if its leafset is in bijection with Γ. We will make no distinction between a leaf of T and the gene of Γ it corresponds to. We call s(x) the species labeling of the leaf x. For a subset $G \subseteq \Gamma$ of genes, we write $s(G) = \{s(g) : g \in G\}$ for the set of species containing the genes of G.

Moreover, we assign to each gene x of Γ a Boolean value corresponding to the genome it belongs to. More precisely, b(x) = 0 if x belongs to $s(x)_0$ and b(x) = 1 if x belongs to $s(x)_1$. In this article, we assume that the mitochondrial or nuclear location of each extant gene is known. We call b(x) the genome labeling of the leaf representing x.

An evolutionary history is represented by an event labeled tree, where the event label $\tilde{e}(x)$ of an internal node x is its corresponding event. The event labeling of the internal nodes of a gene tree is obtained through reconciliation.

2.1 Reconciliation

Inside the species’ genomes, genes undergo Speciation (Spe) when the species to which they belong do, but also Duplication (Dup) i.e. the creation of a new gene copy, Loss of a gene copy and Horizontal Gene Transfer (HGT) when a gene is transmitted from a source to a target genome. In this article, we consider special cases of HGTs, called EGTs, only allowing the transmission of genes from the mitochondrial genome to the nuclear genome of the same species, or vice-versa. Moreover, we consider two types of EGTs: EGTcopy and EGTcut defined as follows (see Fig. 1):

Fig. 1.


The effect of an event on a node x of a gene tree representing the gene a belonging to the genome $s_i$ (denoted $x(s_i)$), where s is a species and $i \in \{0,1\}$ (for a species s, $s_0$ is the mitochondrial genome and $s_1$ the nuclear genome of s). The tree S shown at the top right is the species tree, where u and v are the two species arising from the speciation of s. (Spe): Gives rise to a copy $a_u$ in $u_i$ and $a_v$ in $v_i$; (Dup): Preserves the copy a in $s_i$ and gives rise to a new copy b in $s_i$; (EGTcopy): Represents a transfer event from $s_i$ to $s_j$, where $j \in \{0,1\}$ and $j \neq i$, preserving the copy a in $s_i$ and giving rise to a new copy $a_j$ in $s_j$; (EGTcut): Represents a transposition event from $s_i$ to $s_j$, removing the copy a in $s_i$ and creating a copy $a_j$ in $s_j$.

  • A gene x belonging to $\sigma_i$ is copied (or transferred) by an EGTcopy event to $\sigma_j$ for $\{i,j\} = \{0,1\}$ if it is copied from $\sigma_i$ and inserted in $\sigma_j$.

  • A gene x belonging to $\sigma_i$ is transposed by an EGTcut event to $\sigma_j$ for $\{i,j\} = \{0,1\}$ if it is cut from $\sigma_i$ and inserted in $\sigma_j$.

Thus, in this article, the set of considered events is:

DLE={Spe,Dup,Loss,EGTcopy,EGTcut}

Notice that we do not consider general HGT events. To define a DLE-Reconciliation, assume that we are given a species tree S, a gene tree T, a mapping s from $\mathcal{L}(T)$ to $\mathcal{L}(S)$ and a mapping b from $\mathcal{L}(T)$ to {0, 1}. We need to define how to extend s and b to the internal nodes of T. Given an extension R of T (R can be equal to T), an extension of s is a function $\tilde{s}$ from V(R) to V(S) such that, for each leaf x of T, $\tilde{s}(x) = s(x)$. Moreover, an extension of b is a function $\tilde{b}$ from V(R) to {0, 1} such that, for each leaf x of T, $\tilde{b}(x) = b(x)$.

Definition 1

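(DLE-Reconciliation). Let Γ be a gene family where each $x \in \Gamma$ belongs to the genome b(x) of a species s(x) of Σ. Let T be a rooted binary gene tree for Γ and S be a rooted binary species tree for Σ. A DLE-Reconciliation is a quadruplet $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ where R is a partially binary extension of T, $\tilde{s}$ is an extension of s and $\tilde{b}$ is an extension of b such that:

  • a. $\tilde{s}(x_l)$ and $\tilde{s}(x_r)$ are the two children of $\tilde{s}(x)$ in S and $\tilde{b}(x_l) = \tilde{b}(x_r) = \tilde{b}(x)$, in which case $\tilde{e}(x) = Spe$;

  • b. $\tilde{s}(x_l) = \tilde{s}(x_r) = \tilde{s}(x) = \sigma$ and $\tilde{b}(x_l) = \tilde{b}(x_r) = \tilde{b}(x)$, in which case $\tilde{e}(x) = Dup$, representing a duplication in $\sigma_{\tilde{b}(x)}$;

  • c. $\tilde{s}(x_l) = \tilde{s}(x_r) = \tilde{s}(x) = \sigma$ and $\tilde{b}(x_l) \neq \tilde{b}(x_r)$, in which case $\tilde{e}(x) = EGTcopy$; let y be the element of $\{x_l, x_r\}$ such that $\tilde{b}(x) \neq \tilde{b}(y)$, then $\tilde{e}(x)$ is a transfer with source genome $\sigma_{\tilde{b}(x)}$ and target genome $\sigma_{\tilde{b}(y)}$.

A grafted leaf on a newly created node x corresponds to a loss in $\tilde{s}(x)$.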
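Cases a to c of Definition 1 can be read as a small decision procedure for a binary node. The sketch below is only an illustration under assumed conventions: the function name `classify`, its argument order and the `children_of` dictionary (species node to its two children in S) are hypothetical and do not come from the EndoRex code.

```python
def classify(sx, sl, sr, bx, bl, br, children_of):
    """Event label of a binary node x from its species labels (sx, sl, sr)
    and genome labels (bx, bl, br), following cases a-c of Definition 1.

    Returns 'Spe', 'Dup', 'EGTcopy', or None when no case applies.
    """
    if bl == br == bx:
        # Case a: the children map to the two children of s~(x) in S.
        if {sl, sr} == set(children_of.get(sx, ())):
            return "Spe"
        # Case b: everything stays in the same genome of the same species.
        if sl == sr == sx:
            return "Dup"
    elif sl == sr == sx and bl != br:
        # Case c: same species, the two children sit in different genomes,
        # so exactly one child differs from x (labels are binary).
        return "EGTcopy"
    return None


# Hypothetical species tree with a single speciation AB -> (A, B).
children_of = {"AB": ("A", "B")}
```

A configuration where both children carry the same genome label but differ from x returns `None`: in a reconciliation it has to be explained by grafting additional nodes on the child branches rather than by a single binary event.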

As R is an extension of T, each node in T has a corresponding node in R. In other words, we can consider that $V(T) \subseteq V(R)$. In particular, the species labeling on R induces a species labeling on T.

Given a cost function c on DLE and a reconciliation $\mathcal{R} = \langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$, the cost $c(\mathcal{R})$ is the sum of costs of the induced events. In this article, we assume a 0 cost for speciations and positive costs for all the other events.

We are now ready to formally define the considered optimization problem.

DLE-Reconciliation Problem:

Input: A species tree S for a set of species Σ, a gene family Γ on Σ, a gene tree T for Γ, a species labeling s and a genome labeling b of $\mathcal{L}(T)$, and a cost function c on DLE.

Output: A most parsimonious DLE-Reconciliation, i.e. a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of minimum cost.

In the next section, we first consider the case of a unitary cost, thus reducing the problem to minimizing the number of operations induced by a reconciliation. The cost DLE(T, S) of the most parsimonious DLE-Reconciliation for T and S in the case of a unitary cost c is called the DLE-Distance. We then extend the algorithmic developments to arbitrary costs, which in particular allows an EGTcopy or EGTcut event moving a gene from the mitochondrion to the nucleus to be costed differently from a similar event moving a gene from the nucleus to the mitochondrion.

In the following section, we will refer to the DL-Reconciliation of T and S. Recall that it is a triplet $\langle R_{DL}, \tilde{s}, \tilde{e} \rangle$ defined by only considering the cases of speciations, duplications and losses in Definition 1, and ignoring the binary assignment of genes. We denote by DL(T, S) the DL-Distance, i.e. the minimum number of duplications and losses induced by a DL-Reconciliation. The DL-Reconciliation $\langle R_{DL}, \tilde{s}, \tilde{e} \rangle$ of cost DL(T, S) is unique and verifies, for any internal node x of $V(R_{DL}) \cap V(T)$:

  1. $\tilde{s}(x) = \mathrm{lca}_S(s(\mathcal{L}(T[x])))$;

  2. if $\tilde{s}(x) \neq \tilde{s}(x_l)$ and $\tilde{s}(x) \neq \tilde{s}(x_r)$ then x is a Speciation; otherwise x is a Duplication.
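The two conditions above translate directly into a bottom-up labeling. The following is a minimal sketch, assuming a gene tree given as nested pairs whose leaves are species names and a species tree given as a child-to-parent dictionary; `dl_label`, `lca2` and the toy trees are hypothetical names introduced here. It recomputes ancestor paths at every node, so it does not achieve the linear running time of the cited LCA-mapping algorithms.

```python
def ancestors(parent, v):
    """Path from v up to the root of the species tree, v included."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca2(parent, u, v):
    """Lowest common ancestor of two species-tree nodes."""
    anc_v = set(ancestors(parent, v))
    return next(a for a in ancestors(parent, u) if a in anc_v)


def dl_label(tree, parent, out):
    """Return the species label of the root of `tree`.

    `tree` is a species name (leaf) or a pair (left, right).
    For each internal node, (subtree, species label, event) is appended
    to `out` in bottom-up order.
    """
    if isinstance(tree, str):
        return tree
    sl = dl_label(tree[0], parent, out)
    sr = dl_label(tree[1], parent, out)
    # The LCA of the leaf species below x is the LCA of the children's labels.
    sx = lca2(parent, sl, sr)
    event = "Spe" if sx != sl and sx != sr else "Dup"
    out.append((tree, sx, event))
    return sx


# Hypothetical species tree ((A,B)AB,C)root and gene tree ((A,B),(A,C)).
parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
events = []
root_species = dl_label((("A", "B"), ("A", "C")), parent, events)
```

In this toy example both cherries are speciations, while the gene tree root maps to the same species node as its right child and is therefore labeled as a duplication.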

We finally need to make the link between the species labeling $\tilde{s}$ of an optimal reconciliation and the well-known LCA-Mapping. This is formally stated in the following lemma.

Lemma 1

(LCA-Mapping). Let $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ be a DLE-Reconciliation of minimum cost between T and S. Then, for each $x \in V(T) \cap V(R)$, $\tilde{s}(x) = \mathrm{lca}_S(s(\mathcal{L}(T[x])))$.

Note that in the above statement, $V(T) \cap V(R) = V(T)$, and thus the intersection is redundant. We write it this way to emphasize that x is a vertex of R (which happens to also be in T), i.e. the LCA-Mapping here applies to the reconciled trees, not to the original gene tree T.

3 A linear-time algorithm for the DLE-distance

In this section, we consider a unitary cost c on DLE.

Consider a given extension $\tilde{b}_T$ of b to the internal nodes of T. We first present an algorithm for computing a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of minimum cost, under the condition that $\tilde{b}(x) = \tilde{b}_T(x)$ for each $x \in V(T) \cap V(R)$. We will then show how a $\tilde{b}_T$ minimizing the DLE-Distance can be obtained.

Algorithm 1 computes the DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ from the DL-Reconciliation $\langle R_{DL}, \tilde{s}_{DL}, \tilde{e}_{DL} \rangle$ (see Fig. 2 for an example).

[Algorithm 1: pseudocode given as an inline graphic in the original; not reproduced here.]

Fig. 2.


The tree $R_{DL}$ (top left), together with its node labeling, is the optimal DL-Reconciliation for the gene tree T, represented by the plain edges of $R_{DL}$, and the species tree S (top right). The two bottom trees are obtained by Algorithm 1 for two different labelings $\tilde{b}$ of internal nodes: the left labeling is obtained by the Fitch algorithm for weighted parsimony, while the right labeling is obtained by applying Algorithm 2. The left labeling gives rise to a non-optimal reconciliation with seven operations (two losses, one duplication, two EGTcopy and two EGTcut), while the right labeling gives rise to the DLE-Distance, which is equal to six (two losses, three EGTcopy and one EGTcut). Rectangles represent duplications; triangles represent either EGTcopy or EGTcut events, depending on whether the labeled node is binary or unary; dotted lines represent losses. A leaf $x_i$ represents a gene x belonging to genome i (0 for mitochondrial and 1 for nuclear) of species X.

Lemma 2

(Optimality of Algorithm 1). Given a binary assignment $\tilde{b}_T$ of the nodes of T, Algorithm 1 outputs a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of minimum cost with the constraint that $\tilde{b}(x) = \tilde{b}_T(x)$ for $x \in V(R) \cap V(T)$.

It follows from Lemma 2 that if $\tilde{b}$ is known in advance for the nodes of T, a DLE-Reconciliation of minimum cost is obtained from Algorithm 1 with $\tilde{b}$ as input. We now focus on finding such a labeling $\tilde{b}$.

Lemma 3

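(Necessary condition for $\tilde{b}$). There exists a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of minimum cost DLE(T, S) such that, for any node x of T and its children $x_l$ and $x_r$ in T, $\tilde{b}(x) = \tilde{b}(x_l)$ or $\tilde{b}(x) = \tilde{b}(x_r)$.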

Proof.

Assume $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ is a most parsimonious DLE-Reconciliation with a lowest node x not satisfying condition (1): $\tilde{b}(x) = \tilde{b}(x_l)$ or $\tilde{b}(x) = \tilde{b}(x_r)$. Thus we must have $\tilde{b}(x) \neq \tilde{b}(x_l) = \tilde{b}(x_r)$. Note that an EGTcut event must be present on at least one of the $(x, x_l)$ or $(x, x_r)$ branches. A reconciliation of lower or equal cost can be obtained by assigning $\tilde{b}(x) = \tilde{b}(x_l) = \tilde{b}(x_r)$ and removing this EGTcut event, reducing the cost by one. Let $p_x$ be the parent of x in R (note that if x is the root, $p_x$ might not exist, in which case there is nothing else to do). If $\tilde{b}(x)$ is now different from $\tilde{b}(p_x)$, we add an EGTcut event between $p_x$ and x, yielding an alternate reconciliation of equal or lower cost.

We can reproduce the same transformation iteratively in a bottom-up fashion until condition (1) is satisfied for every node.

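For a node $x \in V(T)$, define d(x) = 1 if x is a duplication in the DL-Reconciliation of minimum cost, and d(x) = 0 otherwise. Let $\tilde{b}$ be a binary labeling of V(T). For any node x of T, denote $\Delta_{\tilde{b}}(x) = 0$ if $x \in \mathcal{L}(T)$, otherwise

$$\Delta_{\tilde{b}}(x) = \max\left(0,\ |\tilde{b}(x) - \tilde{b}(x_l)| + |\tilde{b}(x) - \tilde{b}(x_r)| - d(x)\right)$$

and define:

$$\mathrm{cost}(T, S, \tilde{b}) = \sum_{x \in V(T)} \Delta_{\tilde{b}}(x)$$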

Roughly speaking, $\Delta_{\tilde{b}}(x)$ reflects the number of label changes between x and its children $x_l$ and $x_r$ in T, with the exception that a duplication is allowed a ‘free’ change since it can be turned into an EGTcopy node. For example, in Figure 2, $\mathrm{cost}(T, S, \tilde{b}) = 2$ for the labeling $\tilde{b}$ of T consistent with that of the left tree R (Algo1+Fitch), and $\mathrm{cost}(T, S, \tilde{b}) = 1$ for the labeling $\tilde{b}$ of T consistent with that of the right tree R (Algo1+Algo2), reflecting, for each one, the number of required EGTcut events.
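The quantities Δ and cost can be computed mechanically once the duplication indicator d and a labeling are fixed. The sketch below assumes a gene tree stored as a dictionary from each internal node to its pair of children; the names `delta`, `cost`, `min_cost` and the toy tree are hypothetical. The exhaustive `min_cost` is exponential in the number of internal nodes and serves only to check small examples, not to replace Algorithm 2.

```python
from itertools import product


def delta(x, children, b, d):
    """Delta_b(x): label changes below x, minus one free change if d(x) = 1."""
    if x not in children:  # leaves contribute nothing
        return 0
    xl, xr = children[x]
    return max(0, abs(b[x] - b[xl]) + abs(b[x] - b[xr]) - d[x])


def cost(children, b, d):
    """cost(T, S, b): sum of Delta_b(x) over all nodes labeled in b."""
    return sum(delta(x, children, b, d) for x in b)


def min_cost(children, leaf_b, d):
    """Brute-force minimum of cost over all labelings of internal nodes."""
    internal = list(children)
    best = None
    for labels in product((0, 1), repeat=len(internal)):
        b = dict(leaf_b)
        b.update(zip(internal, labels))
        c = cost(children, b, d)
        best = c if best is None else min(best, c)
    return best


# Hypothetical gene tree ((a,b)u,c)r with genome labels a=0, b=1, c=1,
# where u is a duplication and r a speciation in the DL-Reconciliation.
children = {"r": ("u", "c"), "u": ("a", "b")}
leaf_b = {"a": 0, "b": 1, "c": 1}
d = {"r": 0, "u": 1}
```

In this toy tree, labeling both internal nodes with 1 gives cost 0 because the single label change below the duplication u is free, whereas the same labeling costs 1 if u is treated as a speciation.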

Lemma 4.

The minimum cost of a DLE-Reconciliation between a gene tree T and a species tree S is

$$\mathrm{DLE}(T, S) = \mathrm{DL}(T, S) + \min_{\tilde{b}} \mathrm{cost}(T, S, \tilde{b})$$

Proof. By Lemma 2, Algorithm 1 correctly infers a minimum cost DLE-Reconciliation for a given $\tilde{b}$. Note that this DLE-Reconciliation is obtained from a DL-Reconciliation by turning some duplication nodes into EGTcopy nodes (which does not change the cost), and by grafting some EGTcut nodes. Thus, the latter are responsible for any possible change in cost from DL(T, S) to DLE(T, S). It follows that the cost of the returned DLE-Reconciliation is DL(T, S), plus the number of grafted EGTcut nodes.

Let $\tilde{b}$ be a binary assignment of T that minimizes DLE(T, S) when $\tilde{b}$ is passed to Algorithm 1. By Lemma 3, we may assume that for any node x and its children $x_l$ and $x_r$, $\tilde{b}(x) = \tilde{b}(x_l)$ or $\tilde{b}(x) = \tilde{b}(x_r)$. Thus $\Delta_{\tilde{b}}(x) \in \{0,1\}$ for every x. Furthermore, $\Delta_{\tilde{b}}(x) = 1$ if and only if x is a speciation node and an EGTcut node is grafted on the edge $(x, x_l)$ (if $\tilde{b}(x) \neq \tilde{b}(x_l)$) or on the edge $(x, x_r)$ (if $\tilde{b}(x) \neq \tilde{b}(x_r)$). In consequence, $\mathrm{cost}(T, S, \tilde{b})$ counts exactly the number of graftings of EGTcut nodes. □

Since the most-parsimonious DL-Reconciliation is unique, the DL(T, S) term in the above lemma is an invariant. Our goal is therefore to find the labeling $\tilde{b}$ that minimizes $\mathrm{cost}(T, S, \tilde{b})$.

This can be achieved by a slight modification of the Fitch algorithm (Fitch, 1971), which computes, for a given tree with leaf labels, all possible label assignments of internal nodes minimizing the number of label changes along the edges of the tree. We first need to recall some concepts on parsimony. Given a tree T whose leaves are labeled from a set L of residues (generally nucleotides or amino acids, but in this article $L = \{0,1\}$, corresponding to the possible $\tilde{b}$ labels), the weighted parsimony problem consists in assigning a residue $\tilde{b}(u) \in L$ to each internal node u of T in a way that minimizes the total weight of the tree. More precisely, given a cost matrix M on residues, the weight of T is the sum of the weights $M(\tilde{b}(u), \tilde{b}(v))$ over all $(u,v) \in E(T)$. An assignment of T refers to the assignment of a residue to each internal node of T.

The Sankoff and Cedergren (1983) algorithm computes, in quadratic time, the minimum cost min(T) of an assignment of T. Moreover, it finds all the assignments $\tilde{T}$ of T leading to min(T). When $M(a,a) = 0$ for all $a \in L$ and $M(a,b) = 1$ for $a \neq b$, weighted parsimony can be computed in linear time using the Fitch algorithm.

The Fitch algorithm consists of two phases. The first phase is recursive and reconstructs possible ancestral labels L(x) for each node x of T and the overall minimum number of label changes required as follows: For each node x of T in a bottom-up traversal, (1) if x is a leaf, then $L(x) = \{\tilde{b}(x)\}$ and cost(T[x]) = 0. (2) Else, let $x_l$ and $x_r$ be the children of x. If $L(x_l) \cap L(x_r) = \emptyset$, then $L(x) = L(x_l) \cup L(x_r)$ and $\mathrm{cost}(T[x]) = \mathrm{cost}(T[x_l]) + \mathrm{cost}(T[x_r]) + 1$; else $L(x) = L(x_l) \cap L(x_r)$ and $\mathrm{cost}(T[x]) = \mathrm{cost}(T[x_l]) + \mathrm{cost}(T[x_r])$. The second phase of the algorithm reconstructs an assignment $\tilde{b}$ of T that has a minimum cost, by computing $\tilde{b}(x)$ as follows: For each node x of T in a top-down traversal, (1) if x is the root, assign $\tilde{b}(x)$ to any label in L(x). (2) Else, let $x_p$ be the parent of x. If $\tilde{b}(x_p) \in L(x)$, then assign $\tilde{b}(x) = \tilde{b}(x_p)$, else assign $\tilde{b}(x)$ to any label in L(x).
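The two phases described above can be sketched as follows for the unmodified Fitch algorithm. The tree encoding (a dictionary from internal nodes to child pairs), the function name `fitch` and the tie-breaking rule (taking the smallest label whenever "any label" is allowed) are assumptions made for this illustration; the recursion is only suitable for small trees.

```python
def fitch(children, leaf_label, root):
    """Return (minimum number of label changes, one optimal labeling)."""
    L, cost = {}, {}

    def up(x):
        # Phase 1: bottom-up computation of candidate label sets and costs.
        if x not in children:
            L[x] = {leaf_label[x]}
            cost[x] = 0
            return
        xl, xr = children[x]
        up(xl)
        up(xr)
        common = L[xl] & L[xr]
        if common:
            L[x] = common
            cost[x] = cost[xl] + cost[xr]
        else:
            L[x] = L[xl] | L[xr]
            cost[x] = cost[xl] + cost[xr] + 1

    b = {}

    def down(x, parent_label):
        # Phase 2: top-down choice, keeping the parent's label when allowed.
        if parent_label in L[x]:
            b[x] = parent_label
        else:
            b[x] = min(L[x])  # "any label in L(x)"; min is an arbitrary choice
        for c in children.get(x, ()):
            down(c, b[x])

    up(root)
    down(root, None)
    return cost[root], b


# Hypothetical gene tree ((a,b)u,c)r with genome labels a=0, b=1, c=1.
children = {"r": ("u", "c"), "u": ("a", "b")}
changes, labeling = fitch(children, {"a": 0, "b": 1, "c": 1}, "r")
```

On this toy tree a single label change is needed: the root and u both receive label 1, and the change is placed on the edge leading to leaf a.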

The Fitch algorithm does not always find an optimal $\tilde{b}$ assignment because of duplications that can be turned into EGTcopy events. Algorithm 2 modifies the first phase of the Fitch algorithm to compute the DLE-Distance and an assignment $\tilde{b}$ of T that leads to the DLE-Distance. The modification reflects the fact that a duplication node is allowed a ‘free’ change since it can be turned into an EGTcopy node (see Fig. 2 for an illustration).

[Algorithm 2: pseudocode given as an inline graphic in the original; not reproduced here.]

Lemma 5.

Algorithm 2 outputs, in linear time, the DLE-Distance DLE(T, S) and a binary assignment $\tilde{b}$ of T that leads to a most parsimonious DLE-Reconciliation.

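Proof. It suffices to prove that the following statement holds for any node x of T: for any label β in L(x), there exists a binary assignment $\tilde{b}$ of T[x] such that $\tilde{b}(x) = \beta$ and $\tilde{b}$ minimizes $\mathrm{cost}(T[x], S, \tilde{b})$.

  1. If x is a duplication node in the DL-reconciliation.

     a. If $\beta \in L(x_l) \cap L(x_r)$, then $\tilde{b}(x_l) = \tilde{b}(x_r) = \tilde{b}(x) = \beta$, and $\Delta_{\tilde{b}}(x) = 0$. Thus $\mathrm{cost}(T[x], S, \tilde{b}) = \mathrm{cost}(T[x_l], S, \tilde{b}_l) + \mathrm{cost}(T[x_r], S, \tilde{b}_r)$, without any increment.

     b. If $\beta \notin L(x_l) \cap L(x_r)$, then $\beta \in L(x_l)$ or $\beta \in L(x_r)$, and $\tilde{b}(x_l) = \tilde{b}(x) = \beta$ or $\tilde{b}(x_r) = \tilde{b}(x) = \beta$, and $\Delta_{\tilde{b}}(x) = 0$. Thus $\mathrm{cost}(T[x], S, \tilde{b}) = \mathrm{cost}(T[x_l], S, \tilde{b}_l) + \mathrm{cost}(T[x_r], S, \tilde{b}_r)$, without any increment.

     In both cases, Algorithm 1 computes a DLE-Reconciliation with minimum cost $\mathrm{DLE}(T[x_l], S) + \mathrm{DLE}(T[x_r], S) + 1$, with a minimum increment of 1 for a Dup node in case (a), or by making x an EGTcopy node in case (b), but no additional EGTcut node is required.

  2. If x is a speciation node in the DL-reconciliation.

     a. If $L(x) \neq L(x_l) \cap L(x_r)$, then $L(x_l) \cap L(x_r) = \emptyset$, and $\beta \in L(x_l)$ or $\beta \in L(x_r)$. So $\tilde{b}(x_l) = \tilde{b}(x) = \beta$ or $\tilde{b}(x_r) = \tilde{b}(x) = \beta$, and $\Delta_{\tilde{b}}(x) = 1$. Thus $\mathrm{cost}(T[x], S, \tilde{b}) = \mathrm{cost}(T[x_l], S, \tilde{b}_l) + \mathrm{cost}(T[x_r], S, \tilde{b}_r) + 1$, with a minimum increment of 1, obtained by grafting an EGTcut node on one of the $(x, x_l)$ or $(x, x_r)$ branches. In this case, Algorithm 1 computes a DLE-Reconciliation with minimum cost $\mathrm{DLE}(T[x_l], S) + \mathrm{DLE}(T[x_r], S) + 1$.

     b. If $L(x) = L(x_l) \cap L(x_r)$, then $\beta \in L(x_l)$ and $\beta \in L(x_r)$. So $\tilde{b}(x_l) = \tilde{b}(x_r) = \tilde{b}(x) = \beta$, and $\Delta_{\tilde{b}}(x) = 0$. Thus $\mathrm{cost}(T[x], S, \tilde{b}) = \mathrm{cost}(T[x_l], S, \tilde{b}_l) + \mathrm{cost}(T[x_r], S, \tilde{b}_r)$ without any additional cost. Algorithm 1 computes a DLE-Reconciliation with minimum cost $\mathrm{DLE}(T[x_l], S) + \mathrm{DLE}(T[x_r], S)$ when given $\tilde{b}$.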

It is easy to see that both the first and the second phases of the algorithm have linear time complexity, thus the overall algorithm has a linear time complexity.

As with the Fitch algorithm, Algorithm 2 does not output all the solutions of the DLE-Reconciliation problem leading to the DLE-Distance. This could be achieved by adapting the dynamic programming algorithm of Sankoff and Cedergren. Instead, we choose to introduce, in the next section, a more general dynamic programming algorithm that outputs all optimal solutions for an arbitrary cost of the DLE events, not only for the unitary cost.

4 Solving the DLE-reconciliation problem with arbitrary DLE costs

We now introduce a dynamic programming algorithm for general costs. We use δ and λ to denote the cost of a duplication and a loss, respectively. We use $\rho_0$ (respectively $\tau_0$) for the cost of an EGTcut (respectively EGTcopy) from the mitochondrial genome to the nuclear genome, and $\rho_1$ (respectively $\tau_1$) for the cost of an EGTcut (respectively EGTcopy) from the nuclear genome to the mitochondrial genome. Note that the subscripts of the EGT costs indicate the source of the switch. Also denote

$$\rho_0^* = \min(\rho_0,\ \tau_0 + \lambda) \qquad \rho_1^* = \min(\rho_1,\ \tau_1 + \lambda)$$

Roughly speaking, $\rho_0^*$ represents the minimum cost required to switch from mitochondrial to nuclear genome inside a branch of T, and $\rho_1^*$ the minimum cost required in the other direction. The purpose of $\rho_0^*$ and $\rho_1^*$ is that a switch can be accomplished by an EGTcut event, but also by an EGTcopy event followed by a loss.
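As a purely illustrative instance, take hypothetical costs (chosen here for the arithmetic, not taken from the experiments) with an expensive EGTcut from the mitochondrion, ρ0 = 3, a cheap EGTcopy, τ0 = 1, and a loss cost λ = 1:

$$\rho_0^* = \min(\rho_0,\ \tau_0 + \lambda) = \min(3,\ 1 + 1) = 2$$

Under these assumed values, a switch from the mitochondrial to the nuclear genome along a branch is explained more cheaply by an EGTcopy followed by a loss than by a single EGTcut. With unitary costs the same formula gives min(1, 1 + 1) = 1, so a single EGTcut is never worse than the two-event alternative.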

Let $x \in V(T)$. Note that $\tilde{s}(x)$ does not need to be inferred, since by Lemma 1, we can assume that $\tilde{s}(x) = \mathrm{lca}_S(s(\mathcal{L}(T[x])))$. Our dynamic programming table only needs to store the optimal cost on T[x] for each possible $\tilde{b}(x) \in \{0,1\}$. This requires testing each of three possible events $\tilde{e}(x)$ at x, and the number of scenarios to consider at x is therefore constant [this is the main reason for the gain in time compared to the algorithm of Doyon et al. (2010), which requires adding a dimension to the table corresponding to all possible species at x]. Let $b_x \in \{0,1\}$. We denote by $D[x, b_x]$ the minimum cost of a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of T[x] with S in which $\tilde{b}(x) = b_x$ (or $\infty$ if no such reconciliation exists). Trivially, if x is a leaf of T, we have

$$D[x, b_x] = \begin{cases} 0 & \text{if } b_x = b(x) \\ \infty & \text{otherwise} \end{cases}$$

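Assume now that x is an internal node of T. Let $x_l$, $x_r$ be the children of x. For $s_1, s_2 \in V(S)$, let $\mathrm{path}(s_1, s_2)$ denote the number of vertices on the path between $s_1$ and $s_2$ in S, including $s_1$ and $s_2$. Then define

$$l_x = \mathrm{path}(\tilde{s}(x), \tilde{s}(x_l)) + \mathrm{path}(\tilde{s}(x), \tilde{s}(x_r))$$

from which the number of mandatory losses on the child branches of a node x of T is obtained, after subtracting a constant that depends on the event at x.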

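To compute $D[x, b_x]$, we use three auxiliary values $D[x, b_x, e_x]$, where $e_x \in \{Spe, Dup, EGTcopy\}$ represents the event label of x (note that $e_x$ cannot be an EGTcut event, since x has two children).

If $\tilde{s}(x) = \tilde{s}(x_l)$ or $\tilde{s}(x) = \tilde{s}(x_r)$, then $D[x, b_x, Spe] = \infty$. Assuming this check has been performed, we have

$$D[x, b_x, Spe] = \lambda (l_x - 4) + \sum_{x' \in \{x_l, x_r\}} \min\left(D[x', b_x],\ \rho_{b_x}^* + D[x', 1 - b_x]\right)$$

$$D[x, b_x, Dup] = \delta + \lambda (l_x - 2) + \sum_{x' \in \{x_l, x_r\}} \min\left(D[x', b_x],\ \rho_{b_x}^* + D[x', 1 - b_x]\right)$$

$$D[x, b_x, EGTcopy] = \tau_{b_x} + \lambda (l_x - 2) + \min \left\{ \begin{array}{l} D[x_l, b_x] + D[x_r, 1 - b_x] \\ D[x_l, 1 - b_x] + D[x_r, b_x] \\ \rho_{1 - b_x}^* + D[x_l, b_x] + D[x_r, b_x] \\ \rho_{b_x}^* + D[x_l, 1 - b_x] + D[x_r, 1 - b_x] \end{array} \right.$$

Put $D[x, b_x] = \min(D[x, b_x, Spe],\ D[x, b_x, Dup],\ D[x, b_x, EGTcopy])$. The value of interest is $\min(D[r(T), 0],\ D[r(T), 1])$.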
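The recurrences above can be sketched as a short recursive procedure. This is an illustration under stated assumptions, not the EndoRex implementation: the function name `dle_cost`, the parameter names (`delta`, `lam`, and the pairs `rho` and `tau` indexed by source genome), and the tree encodings (gene tree as nested pairs with `(species, genome)` leaves, species tree as a child-to-parent dictionary) are all hypothetical. The sketch returns only the optimal cost, performs no backtracking, and recomputes ancestor paths at every node, so it does not reach the time bound stated in Theorem 1.

```python
import math


def ancestors(parent, v):
    """Path from v up to the root of the species tree, v included."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca2(parent, u, v):
    anc_v = set(ancestors(parent, v))
    return next(a for a in ancestors(parent, u) if a in anc_v)


def path_len(parent, anc, desc):
    """Number of vertices on the path from `anc` down to `desc`, both included."""
    return ancestors(parent, desc).index(anc) + 1


def dle_cost(tree, sp_parent, delta, lam, rho, tau):
    """Minimum DLE-Reconciliation cost, min(D[root, 0], D[root, 1])."""
    rho_star = [min(rho[i], tau[i] + lam) for i in (0, 1)]

    def solve(x):
        """Return (species label of x, [D[x, 0], D[x, 1]])."""
        if isinstance(x[0], str):  # leaf encoded as (species, genome)
            sp, b = x
            return sp, [0 if bx == b else math.inf for bx in (0, 1)]
        (sl, Dl), (sr, Dr) = solve(x[0]), solve(x[1])
        sx = lca2(sp_parent, sl, sr)  # LCA-Mapping, as allowed by Lemma 1
        lx = path_len(sp_parent, sx, sl) + path_len(sp_parent, sx, sr)
        D = []
        for bx in (0, 1):
            # Each child either stays in genome bx or switches at cost rho*_bx.
            keep = sum(min(Dc[bx], rho_star[bx] + Dc[1 - bx]) for Dc in (Dl, Dr))
            spe = math.inf if sx in (sl, sr) else lam * (lx - 4) + keep
            dup = delta + lam * (lx - 2) + keep
            egt = tau[bx] + lam * (lx - 2) + min(
                Dl[bx] + Dr[1 - bx],
                Dl[1 - bx] + Dr[bx],
                rho_star[1 - bx] + Dl[bx] + Dr[bx],
                rho_star[bx] + Dl[1 - bx] + Dr[1 - bx],
            )
            D.append(min(spe, dup, egt))
        return sx, D

    return min(solve(tree)[1])


# Hypothetical species tree ((A,B)AB,C)root.
sp = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
```

For example, with unitary costs a mitochondrial and a nuclear copy in the same species A are explained by one EGTcopy, and a mitochondrial copy in A with a nuclear copy in B by a speciation plus one EGTcut; raising the EGTcut costs to 5 makes the second scenario cost 2, since the switch is then realized by an EGTcopy followed by a loss.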

Theorem 1.

For any $x \in V(T)$ and $b_x \in \{0,1\}$, the value of $D[x, b_x]$, as defined above, is equal to the minimum cost of a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of T[x] with S satisfying $\tilde{b}(x) = b_x$.

Moreover, the minimum cost $\min(D[r(T), 0],\ D[r(T), 1])$ of a reconciliation of T with S can be computed in time $O(|V(T)| + |V(S)|)$.

Let us note that once the D table is computed, a standard backtracking procedure allows every optimal DLE-Reconciliation to be reconstructed.

5 Experimental results

We implemented the above dynamic programming procedure in Python in a software tool called EndoRex, which takes arbitrary costs as input and returns a reconciled gene tree in Newick format. The Python source can be accessed at https://github.com/AEVO-lab/EndoRex. We then performed a variety of experiments on a dataset obtained from Kannan et al. (2014), as described below.

5.1 Kannan et al. (2014) dataset

For the reconstruction of evolutionary histories with EGT events, we used a dataset from Kannan et al. (2014) available at ftp://ftp.ncbi.nih.gov/pub/koonin/MitoCOGs. The dataset consists of 140 MitoCOGs extended with paralogs and nuclear protein-coding homologs from 2486 eukaryotes with complete mitochondrial genomes. MitoCOGs are clusters of orthologous genes for mitochondrial-encoded proteins generated using COG construction (Makarova et al., 2007; Yutin et al., 2009). A full description of the MitoCOG generation procedure is given in Kannan et al. (2014). Among the 140 MitoCOGs, 73 correspond to protein-coding gene families, 49 are hypothetical proteins and 18 are clusters for which the protein function is identified but not the gene name. Among these 73 MitoCOGs, 13 are core-mitochondrial proteins that are shared by most of the 2486 mitochondrial genomes. Statistics on MitoCOGs of the Kannan et al. dataset are given in Table 1.

Table 1.

Statistics on the Kannan et al. (2014) dataset

| Gene set | Nb of MitoCOGs | Nb of species | Nb of genes |
|---|---|---|---|
| Mitochondrial-encoded | 140 | 2486 | 34 755 |
| Nuclear-encoded | 45 | 52 | 1317 |
| Whole set | 140 | 2486 | 36 072 |

Note: MitoCOGs were designed for mitochondrial-encoded genes, and nuclear-encoded genes were included later. This explains why all nuclear-encoded MitoCOGs, and the corresponding species, are included in the mitochondrial-encoded sets of MitoCOGs and species.

5.2 Dataset preprocessing

Among the 140 MitoCOGs of the initial Kannan et al. dataset, we first selected the 45 clusters involving nuclear-encoded protein sequences. Within these MitoCOGs, 52 eukaryotes are represented including 28 Opisthokonta (10 Fungi, 17 Metazoa and 1 Choanoflagellata), 9 Viridiplantae, 1 Rhodophyta, 1 Glaucophyta, 5 Alveolata, 1 Amoebozoa, 2 Euglenozoa, 1 Heterolobosea, 1 Rhizaria and 3 Stramenopiles. Based on Figure 1 in Kannan et al. (2014) and the analysis of the dataset, for the EGT evolutionary history inference with EndoRex, we selected the 11 plant species, including the 9 Viridiplantae, Cyanidioschyzon merolae (Rhodophyta) and Cyanophora paradoxa (Glaucophyta), as gene-content location is more diversified among this species group.

The 11 plant species are represented in 68 MitoCOGs with mitochondrial-encoded proteins and 41 MitoCOGs with nuclear-encoded proteins. We selected the clusters containing both mitochondrial-encoded and nuclear-encoded genes, yielding 28 MitoCOGs containing 326 protein-coding genes, including 184 encoded in the mitochondria and 142 in the nucleus. All 28 MitoCOGs correspond to gene names that are present in the mitochondrial gene content review of Sloan et al. (2018).

Table 2 gives information about the 28 MitoCOGs of the 11 plants dataset specifying the gene name, the protein metabolic pathway and the number of genes and species for each MitoCOG.

Table 2.

Statistics on the 28 MitoCOGs of the 11 plants dataset

| MitoCOG ID | Gene name | Metabolic pathway | Nb of genes (mito + nuc) | Nb of species |
|---|---|---|---|---|
| MitoCOG0006 | nad3 | Complex I | 11 (10 + 1) | 11 |
| MitoCOG0007 | nad4L | Complex I | 13 (12 + 1) | 11 |
| MitoCOG0031 | nad7 | Complex I | 11 (9 + 2) | 11 |
| MitoCOG0043 | nad9 | Complex I | 11 (9 + 2) | 11 |
| MitoCOG0029 | nad10 | Complex I | 13 (1 + 12) | 10 |
| MitoCOG0052 | sdh2 | Complex II | 22 (1 + 21) | 10 |
| MitoCOG0051 | sdh3 | Complex II | 8 (3 + 5) | 6 |
| MitoCOG0075 | sdh4 | Complex II | 9 (4 + 5) | 9 |
| MitoCOG0003 | cox2 | Complex IV | 13 (10 + 3) | 11 |
| MitoCOG0005 | cox3 | Complex IV | 13 (10 + 3) | 11 |
| MitoCOG0059 | atp1 | Complex V | 9 (7 + 2) | 8 |
| MitoCOG0076 | atp4 | Complex V | 12 (11 + 1) | 10 |
| MitoCOG0004 | atp6 | Complex V | 13 (12 + 1) | 11 |
| MitoCOG0014 | atp9 | Complex V | 13 (10 + 3) | 11 |
| MitoCOG0027 | rpl2 | Translation | 14 (5 + 9) | 10 |
| MitoCOG0053 | rpl6 | Translation | 10 (4 + 6) | 8 |
| MitoCOG0092 | rpl10 | Translation | 5 (2 + 3) | 5 |
| MitoCOG0048 | rpl14 | Translation | 15 (5 + 10) | 11 |
| MitoCOG0039 | rpl16 | Translation | 12 (8 + 4) | 11 |
| MitoCOG0070 | rpl20 | Translation | 11 (2 + 9) | 8 |
| MitoCOG0080 | rps2 | Translation | 9 (5 + 4) | 9 |
| MitoCOG0067 | rps4 | Translation | 8 (7 + 1) | 7 |
| MitoCOG0061 | rps7 | Translation | 12 (8 + 4) | 11 |
| MitoCOG0072 | rps10 | Translation | 12 (3 + 9) | 8 |
| MitoCOG0054 | rps11 | Translation | 12 (6 + 6) | 10 |
| MitoCOG0064 | rps13 | Translation | 10 (7 + 3) | 10 |
| MitoCOG0055 | rps14 | Translation | 9 (5 + 4) | 8 |
| MitoCOG0026 | rps19 | Translation | 16 (8 + 8) | 8 |

Note: In the ‘Nb of genes’ column, the numbers of mitochondria-encoded (mito) and nucleus-encoded (nuc) genes are specified in parentheses.
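As a quick consistency check, the per-family counts of Table 2 can be summed and compared with the totals quoted in the text (28 MitoCOGs, 184 mitochondrial-encoded and 142 nuclear-encoded genes, 326 genes overall). The following Python sketch is only an illustration; the dictionary `TABLE2` and the helper `totals` are names introduced here and are not part of EndoRex.

```python
# (mito, nuc) gene counts per gene family, copied from Table 2.
TABLE2 = {
    "nad3": (10, 1), "nad4L": (12, 1), "nad7": (9, 2), "nad9": (9, 2),
    "nad10": (1, 12), "sdh2": (1, 21), "sdh3": (3, 5), "sdh4": (4, 5),
    "cox2": (10, 3), "cox3": (10, 3), "atp1": (7, 2), "atp4": (11, 1),
    "atp6": (12, 1), "atp9": (10, 3), "rpl2": (5, 9), "rpl6": (4, 6),
    "rpl10": (2, 3), "rpl14": (5, 10), "rpl16": (8, 4), "rpl20": (2, 9),
    "rps2": (5, 4), "rps4": (7, 1), "rps7": (8, 4), "rps10": (3, 9),
    "rps11": (6, 6), "rps13": (7, 3), "rps14": (5, 4), "rps19": (8, 8),
}


def totals(table):
    """Return (number of families, mito genes, nuc genes, all genes)."""
    mito = sum(m for m, _ in table.values())
    nuc = sum(n for _, n in table.values())
    return len(table), mito, nuc, mito + nuc


print(totals(TABLE2))  # (28, 184, 142, 326)
```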

For each MitoCOG, we applied a pipeline to infer the evolutionary history of EGTs with DLE-Reconciliation along the species tree of the 11 plants. The topology of the species tree was taken from Kannan et al. (2014). We added the species Micromonas sp. RCC299 as the sister species of Ostreococcus tauri, as only these 2 among the 11 plant species belong to the Mamiellophyceae class. We also swapped the positions of P. patens and S. moellendorffii according to Puttick et al. (2018) (Fig. 3).

Fig. 3.

Species tree of the 11 plants considered in our experimental analysis. Topology of the tree is based on (Kannan et al., 2014)

To construct the gene trees, the first step of the pipeline was to align the protein sequences with MUSCLE (Edgar, 2004). In the second step, a maximum likelihood protein tree was inferred using RAxML (v8.2.4) with the PROTGAMMAGTRX evolutionary model (Stamatakis et al., 2014). NOTUNG (v.2.9.1.5) was then used to root the trees by minimizing the cost of a duplication-loss reconciliation with default parameters (loss cost: 1.0 and duplication cost: 1.5) (Stolzer et al., 2012).

The rooted protein trees obtained with this pipeline and the 11 plants species tree were given as input of the EndoRex software to infer a most parsimonious DLE-Reconciliation allowing for arbitrary costs for duplications, losses and EGTs.

5.3 EndoRex evolutionary events cost setting

As a reminder, we consider six parameters corresponding to the different evolutionary event costs: $\delta$ and $\lambda$, the costs of a gene duplication and a gene loss, respectively; $\rho_0$ (respectively $\tau_0$), the cost of an EGTcut (respectively EGTcopy) from the mitochondrial genome to the nuclear genome; and $\rho_1$ (respectively $\tau_1$), the cost of an EGTcut (respectively EGTcopy) from the nuclear genome to the mitochondrial genome.

We test five different cost settings for the application of EndoRex on the 11 plants dataset. Setting S1 corresponds to the default parameter values, with a unitary cost for all evolutionary events (which yields the DLE-Distance). For setting S2, the gene loss and duplication costs are those used in NOTUNG for rooting the protein trees, and the EGTcopy and EGTcut costs are set higher to reflect the fact that these evolutionary events are less frequent than gene duplications: $\lambda=1.0$, $\delta=1.5$ and $\rho_0=\rho_1=\tau_0=\tau_1=2.0$. In setting S3, we consider EGTcopy as less likely than EGTcut: $\lambda=1.0$, $\delta=1.5$, $\rho_0=\rho_1=2.0$ and $\tau_0=\tau_1=3.0$. For setting S4, we differentiate the cost of a gene move from the mitochondria to the nucleus from the cost of a move from the nucleus to the mitochondria, to account for the fact that, during the evolution of eukaryotes, mitochondrial genes are integrated into the nuclear genome, while the reverse is extremely rare: $\lambda=1.0$, $\delta=1.5$, $\rho_0=2.0$, $\rho_1=3.0$, $\tau_0=3.0$ and $\tau_1=4.0$. Finally, setting S5 is the same as setting S4, except that we make no difference between the costs of EGTcopy and EGTcut events: $\lambda=1.0$, $\delta=1.5$, $\rho_0=2.0$, $\rho_1=3.0$, $\tau_0=2.0$ and $\tau_1=3.0$.
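The five settings can be written down as a small lookup table, which makes it easy to see how the same event counts are scored differently. The sketch below is illustrative only: the `Costs` class, its field names and the `scenario_cost` helper are assumptions introduced here (they are not EndoRex options), and the example event counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    loss: float      # lambda
    dup: float       # delta
    cut_m2n: float   # rho_0: EGTcut, mitochondrion -> nucleus
    cut_n2m: float   # rho_1: EGTcut, nucleus -> mitochondrion
    copy_m2n: float  # tau_0: EGTcopy, mitochondrion -> nucleus
    copy_n2m: float  # tau_1: EGTcopy, nucleus -> mitochondrion


# Values as stated in the text for settings S1 to S5.
SETTINGS = {
    "S1": Costs(1.0, 1.0, 1.0, 1.0, 1.0, 1.0),
    "S2": Costs(1.0, 1.5, 2.0, 2.0, 2.0, 2.0),
    "S3": Costs(1.0, 1.5, 2.0, 2.0, 3.0, 3.0),
    "S4": Costs(1.0, 1.5, 2.0, 3.0, 3.0, 4.0),
    "S5": Costs(1.0, 1.5, 2.0, 3.0, 2.0, 3.0),
}


def scenario_cost(counts, costs):
    """Total cost of a scenario given event counts keyed by Costs field name."""
    return sum(n * getattr(costs, event) for event, n in counts.items())


# Hypothetical scenario: one duplication, two losses, one mito -> nuc EGTcopy.
example = {"dup": 1, "loss": 2, "copy_m2n": 1}
for name, costs in SETTINGS.items():
    print(name, scenario_cost(example, costs))
```

Under S1 this hypothetical scenario costs 4.0, whereas under S4 it costs 6.5, which illustrates why a scenario that is optimal under unit costs may no longer be optimal under the other settings.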

Applied to the 28 MitoCOGs trees, EndoRex infers the same DLE-Reconciliation with the five different settings for 21 of the 28 MitoCOGs.

Each of the seven MitoCOGs with more than one inferred DLE-Reconciliation, depending on the considered setting, leads to exactly two different DLE-Reconciliations: for MitoCOG0014, MitoCOG0051 and MitoCOG0053, setting S1 gives a DLE-Reconciliation different from the other settings; for MitoCOG0027, it is setting S3 that gives a different DLE-Reconciliation; for MitoCOG0005 and MitoCOG0039, it is setting S4; and finally for MitoCOG0072, settings S4 and S5 give a DLE-Reconciliation different from S1, S2 and S3. We analyzed the two DLE-Reconciliations of MitoCOG0014 (atp9), MitoCOG0027 (rpl2), MitoCOG0039 (rpl16) and MitoCOG0072 (rps10) to illustrate the effect of the cost settings (see Fig. 4).

Fig. 4.

DLE-Reconciliations obtained for MitoCOG0014, MitoCOG0027, MitoCOG0039 and MitoCOG0072 with the EndoRex cost settings S1, S2, S3, S4 and S5. The blue part of the tree indicates that the genetic material is located in the mitochondrion, while the red part indicates location in the nucleus. The shape of an internal node represents its associated event, as represented in Figure 1 (circle for a speciation, rectangle for a duplication and triangle for an EGT event). Loss events are not represented. Genes are formatted as follows: [species name]__[gene-encoding location]__[gene id], where a location of 0 indicates the mitochondrion and a location of 1 indicates the nucleus
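The leaf label format described in the caption can be decoded with a few lines of Python. This is a minimal sketch under the assumption that the three fields are separated by a double underscore and that the species name itself contains no double underscore; the function name and the example label are hypothetical.

```python
from typing import NamedTuple


class GeneLabel(NamedTuple):
    species: str
    location: str
    gene_id: str


# Location codes as given in the figure caption.
LOCATION = {"0": "mitochondrion", "1": "nucleus"}


def parse_label(label):
    """Split '[species]__[location]__[gene id]' into its three fields."""
    species, code, gene_id = label.split("__")
    return GeneLabel(species, LOCATION[code], gene_id)


# Hypothetical label, for illustration only.
print(parse_label("Some_species__1__g42"))
```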

According to these case studies, setting S1 seems inappropriate, as it leads to the prediction of a higher number of EGTs, which are rare evolutionary events (see MitoCOG0014 in Fig. 4, and MitoCOG0051 and MitoCOG0053 in Appendix Fig. A1). For MitoCOG0027, setting S3 leads to the prediction of numerous EGTs from the nucleus to the mitochondria, which is very unrealistic, as very few gene movements from the nucleus to the mitochondria have been described in the literature. The DLE-Reconciliations predicted with setting S4 are the scenarios most in line with the literature, as this setting only infers EGTs from the mitochondria to the nucleus (except for MitoCOG0072), with transpositions located close to the leaves of the tree, indicating an ongoing process of endosymbiotic gene transfer in plants for these gene families (see MitoCOG0039 and MitoCOG0072 in Fig. 4, and MitoCOG0005 in Appendix Fig. A1).

6 Conclusion

Investigating the origin, evolution and characteristics of gene coding capacity of eukaryotes has been among the central themes in the Life Sciences. In this context, the endosymbiotic origin of mitochondrial genomes and the gradual integration of the mitochondrial gene content to the nucleus are important evolutionary parameters expected to shed light on features of eukaryotic gene evolution and function.

From a computational point of view, detecting the footprint of endosymbiosis in the gene repertoires of the mitochondrial and nuclear genomes of eukaryotes requires new evolutionary prediction methods. This article is a first effort toward developing the appropriate algorithmic tools for analyzing the movement of genes inside a gene family between the mitochondrial and nuclear genome of the same species. We presented a linear-time algorithm computing a most parsimonious history of Duplication, Loss and EGT (DLE) events explaining a gene tree with leaves identified as mitochondrial or nuclear genes. We also presented a general dynamic programming algorithm, implemented in the EndoRex software, to compute all optimal DLE-Reconciliations for any arbitrary cost scheme of operations.

By applying EndoRex to a plant dataset, we showed that it is well suited to inferring the evolutionary histories of EGT events under a variety of cost settings. Some reconciled trees (not shown) of the 11 plants dataset produced evolutionary histories that could be considered unrealistic, as they lead to an unexpectedly high number of gene duplications and losses. As our algorithm is exact and thus guaranteed to infer a minimum-cost set of events given a gene tree, this is likely due to errors in protein sequence alignment and/or gene tree inference, leading to erroneous gene trees (Hahn, 2007). A better gene tree inference pipeline should be designed in the future to obtain more accurate gene trees. In particular, gene trees have been rooted according to the DL-distance with the default NOTUNG parameters. Instead, we could have rooted the trees according to our DLE model, with the five considered cost settings. In addition, the RAxML binary gene trees obtained contain many weakly supported edges. Those edges may be contracted, and a polytomy resolution tool such as PolytomySolver (Lafond et al., 2016) may be used to better resolve multifurcations. Simulation studies should also be conducted in the future to better evaluate the quality of the solutions obtained.

In fact, our method relies on a deterministic parsimony approach to compute all optimal DLE-Reconciliations given a cost scheme for DLE events. This model has many limitations. In particular, parsimony does not allow modeling multiple state changes along a branch of the phylogeny, or uncertainty in phylogenetic reconstructions. An alternative is to rely on approaches using stochastic state mapping models such as the mutational mapping approach (Bollback, 2006; Huelsenbeck et al., 2003). Since our method outputs all optimal DLE-Reconciliations, it can also be used to compute the probabilities of all possible events over all optimal solutions.
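One simple way to turn the set of optimal solutions into event probabilities is to report, for each event, the fraction of optimal reconciliations that contain it. The sketch below assumes that every optimal reconciliation is weighted equally and that each one is available as a collection of (gene tree node, event) pairs; this representation and the function name are assumptions made for illustration and do not reflect the EndoRex output format.

```python
from collections import Counter


def event_frequencies(reconciliations):
    """Fraction of the given reconciliations in which each (node, event) pair occurs.

    Assumes a uniform weight over the optimal reconciliations.
    """
    counts = Counter()
    for rec in reconciliations:
        counts.update(set(rec))  # count each pair at most once per solution
    n = len(reconciliations)
    return {pair: c / n for pair, c in counts.items()}


# Four hypothetical optimal reconciliations of the same gene tree.
optimal = [
    [("x1", "Spe"), ("x2", "EGTcopy")],
    [("x1", "Spe"), ("x2", "EGTcopy")],
    [("x1", "Spe"), ("x2", "Dup")],
    [("x1", "Spe"), ("x2", "Dup"), ("x3", "EGTcut")],
]
print(event_frequencies(optimal))
```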

Future algorithmic extensions of the optimization problem considered in this article may concern extending the model to account for both EGT and HGT events, toward inferring a Duplication, HGT, loss and EGT (DTLE) evolutionary scenario for a gene family. Another direction would be to infer common episodes of EGT events for a set of gene families. This may be handled by generalizing the Super-Reconciliation (Delabre et al., 2020) model to account for segmental DLE events.

Future developments will define an EGT simulation model to provide EGT evolutionary histories for assessing the accuracy of our algorithm. Some efforts have been made to provide EGT simulation models. Brandvain and Wade (2009) provide a model to explore the influence of population-genetic parameters (such as selection, dominance, mutation rates and population size with a rate of self-fertilization) on the rate and probability of functional gene transfer from the mitochondrial genome (haploid) to the nuclear genome (diploid). Kelly (2020) defines an EGT simulation model based on the ATP biosynthesis cost of encoding a mitochondrial/chloroplast gene in the nuclear genome and importing the resulting protein into the organelle. These prior works provide useful insights for designing a model for the simulation of EGT evolutionary histories, which would be strongly inspired by existing models for the simulation of HGT evolutionary histories.

Future applications will also concern a thorough analysis of protein-coding genes involved in common metabolic pathways. As an example, oxidative phosphorylation (OXPHOS) relies on a series of protein complexes (I, II, III, IV and V) that generate an electrochemical proton gradient activating the ATP synthase (complex V), which produces ATP. The protein-coding genes involved in OXPHOS are expected to share common mitochondrial-nuclear movements, as the nucleus and the mitochondria are two compartments with different biological dynamics.

Finally, the recent sequencing effort directed at the genomes of jakobid and malawimonad protists, which are known to have emerged close to the eukaryotic origin, will provide a valuable dataset that can be analyzed with the newly developed algorithms, helping to shed light on a number of important biological questions, among them resolving the root of the eukaryote tree. In fact, as EGTs are rare events, candidate topologies for which DLE-Reconciliations infer the lowest number of EGT events may provide evidence for a correct rooting.

Financial Support: Natural Sciences and Engineering Research Council of Canada; Fonds de recherche Nature et Technologie, Québec.

Conflict of Interest: none declared.

Acknowledgements

The authors thank B. Franz Lang (Biochemistry Department, University of Montreal) for his insights and valuable advice on the algorithmic needs and open questions regarding the evolution of eukaryotes.

Appendix A

Proof of Lemma 1

Let $\langle R,\tilde{s},\tilde{b},\tilde{e}\rangle$ be a DLE-Reconciliation of minimum cost between $T$ and $S$. Let $\lambda$ be the cost of a loss event. Let us first make an observation. Let $v\in V(R)$ and let $l\in\mathcal{L}(R[v])\cap\mathcal{L}(T)$, assuming that $l$ exists. Let $P=(v=p_1,p_2,\ldots,p_k=l)$ be the path from $v$ to $l$ in $R$. It is easy to see from the definition of reconciliation that $\tilde{s}(v)=\tilde{s}(p_1),\tilde{s}(p_2),\ldots,\tilde{s}(p_k)=\tilde{s}(l)$ is a path of $S$, but with some vertices possibly being repeated (i.e. $\tilde{s}(p_i)=\tilde{s}(p_{i+1})$ is possible, but otherwise $\tilde{s}(p_{i+1})$ is a child of $\tilde{s}(p_i)$). It follows that $\tilde{s}(v)$ must be an ancestor of $s(l)$. Since $v$ and $l$ were chosen arbitrarily, we have that for any $v\in V(R)$, $\tilde{s}(v)$ is an ancestor of $s(l)$ for every leaf $l\in\mathcal{L}(R[v])\cap\mathcal{L}(T)$.

Now suppose that, for some $x\in V(R)\cap V(T)$, $\tilde{s}(x)\neq lca_S(s(\mathcal{L}(T[x])))$. Moreover, choose $x$ as a lowest node of $V(R)\cap V(T)$ with this property (i.e. $\tilde{s}(x')=lca_S(s(\mathcal{L}(T[x'])))$ for all descendants $x'\in V(R)\cap V(T)$ of $x$ in $R$). Note that $x$ is an internal node of $T$ since $\tilde{s}(x)=s(x)$ for every leaf $x$ of $T$.

As we argued, $\tilde{s}(x)$ is an ancestor of $s(l)$ for every leaf $l\in\mathcal{L}(T[x])$. Since $\tilde{s}(x)\neq lca_S(s(\mathcal{L}(T[x])))$, it follows that $\tilde{s}(x)$ is a strict ancestor of $lca_S(s(\mathcal{L}(T[x])))$. We first argue that $x$ cannot be a speciation. Assume this is the case and let $x'_l,x'_r$ be the children of $x$ in $R$ (but not necessarily in $T$). We use $x_l$ and $x_r$ to denote the children of $x$ in $T$. By the definition of speciation, $\tilde{s}(x'_l)$ and $\tilde{s}(x'_r)$ are the two children of $\tilde{s}(x)$. Because $\tilde{s}(x)$ is a strict ancestor of $lca_S(s(\mathcal{L}(T[x])))$, only one of $\tilde{s}(x'_l)$ or $\tilde{s}(x'_r)$ has descendants in $\{s(l):l\in\mathcal{L}(T[x])\}$. Assume without loss of generality that only $\tilde{s}(x'_l)$ has such descendants. But then, $\tilde{s}(x'_r)$ is not an ancestor of any member of $s(\mathcal{L}(T[x]))$. In particular, $\tilde{s}(x'_r)$ is not an ancestor of any member of $s(\mathcal{L}(R[x'_r])\cap\mathcal{L}(T))$, and the latter is easily seen to be non-empty (this is because $x'_r$ is an ancestor of $x_r$ and $T[x_r]$ has leaves from $T$). As we argued before, this is not possible, since there should be a path from $\tilde{s}(x'_r)$ to any $s(l)$ with $l\in\mathcal{L}(T[x_r])\cap\mathcal{L}(T)$.

Assume that $x$ is a duplication or EGTcopy event ($x$ cannot be an EGTcut event because it is binary). As before, let $x_l$ and $x_r$ be the children of $x$ in $T$ (but not necessarily in $R$). By the choice of $x$, $\tilde{s}(x_l)=lca_S(s(\mathcal{L}(T[x_l])))$ and $\tilde{s}(x_r)=lca_S(s(\mathcal{L}(T[x_r])))$. Thus $\tilde{s}(x)$ must be a strict ancestor of both $\tilde{s}(x_l)$ and $\tilde{s}(x_r)$. Let $s'$ be the child of $\tilde{s}(x)$ that is on the path from $\tilde{s}(x)$ to $lca_S(s(\mathcal{L}(T[x])))$. We obtain an alternate reconciliation by modifying $R$ to obtain another extension $R'$ of $T$. We do not change any event labeling. We map $x$ to $s'$ and graft a loss in $\tilde{s}(x)$ on the edge between $x$ and its parent in $R$ (if any). In that manner, the parent of $x$ in $R$ still has a child mapped to $\tilde{s}(x)$ in $R'$. This increases the cost by $\lambda$, the cost of one loss.

Now let $x_1,x_2,\ldots,x_k$ be the nodes on the path from $x$ to $x_l$ in $R$ (excluding $x$ and $x_l$). Note that since $x$ is a duplication or EGTcopy, $\tilde{s}(x)=\tilde{s}(x_1)$. Moreover, at most one node among $x_1,\ldots,x_k$ can be an EGTcopy or an EGTcut, since there is no point in making more than one switch within an edge.

If present, we may assume without loss of generality that such an event occurs at $x_k$, the parent of $x_l$ in $R$, since the timing of the switch does not affect the reconciliation cost. In this case, $\tilde{s}(x_k)=\tilde{s}(x_l)=lca_S(s(\mathcal{L}(T[x_l])))$. On the other hand, $\tilde{s}(x_1)=\tilde{s}(x)\neq lca_S(s(\mathcal{L}(T[x])))$. This implies that $x_1\neq x_k$, and thus $x_1$ is not an EGTcopy or an EGTcut. It follows that $x_1$ is a node inserted because of a grafted loss, and $\tilde{s}(x_2)=s'$. In $R'$, we can remove $x_1$ and its loss leaf, and by doing so, the left child of $x$ becomes $x_2$. This preserves all properties of a valid reconciliation because both $x$ and $x_2$ are mapped to $s'$. We can apply the same procedure on the path from $x$ to $x_r$.

In $R'$, we have created one loss above $x$, but have removed two losses, one on each side of $x$. No other event labeling has changed. Since we assume that losses have a non-zero cost, $R'$ has a strictly lower cost than $R$, a contradiction.

Proof of Lemma 2

We first show that the reconciliation $\langle R,\tilde{s},\tilde{b},\tilde{e}\rangle$ obtained from Algorithm 1 is a valid DLE-Reconciliation. Note that the tree $R$ returned by the algorithm is the same as $R_{DL}$, but with some grafted unary nodes for EGTcut events where needed. Consider some $x\in V(R_{DL})$. In $R$, we put $\tilde{e}(x)=Spe$ if $\tilde{e}_{DL}(x)=Spe$, and $\tilde{e}(x)\in\{Dup,EGTcopy\}$ if $\tilde{e}_{DL}(x)=Dup$. If no additional node was grafted as a new child of $x$, all properties of reconciliation would be preserved since we keep $\tilde{s}$ as in $\tilde{s}_{DL}$. If some node $x'$ was grafted as a new child of $x$, we ensure that $\tilde{s}(x')$ is the same as that of the previous child of $x$, which ensures that we satisfy the properties of reconciliation. Therefore, we only need to check whether the tree $R_{DL}$ is modified in an appropriate way in the case of a different $\tilde{b}$ value for a node $x$ of $T$ and one of its two children $x_l$ or $x_r$.

Lines 2–8 first ensure that the starting tree $R$ is such that, for each node $x$ of $T$, $\tilde{b}(x)=\tilde{b}_T(x)$, and for any edge $(x,y)$ in $T$ such that $\tilde{b}_T(x)\neq\tilde{b}_T(y)$, the corresponding path $(x,v_1,v_2,\ldots,v_n,y)$ in $R$ is such that for all $i$, $\tilde{b}(v_i)=\tilde{b}(y)$. Subsequently, in the case of a different $\tilde{b}$ value for a node $x$ of $T$ and its child $y$, the node $x$ is either modified to an EGTcopy node, ensuring that the switch between $\tilde{b}(x)$ and $\tilde{b}(v_1)$ is correctly explained by this EGTcopy, or a new EGTcut node $v$ is grafted on the edge $(x,v_1)$, also correctly explaining the switch between $\tilde{b}(x)$ and $\tilde{b}(v_1)$.

We now show that the DLE-Reconciliation output by Algorithm 1 is of minimum cost. First note that, from the initialization done in Line 8, for each leaf $x$ which is in $R_{DL}$ but not in $T$ (lost gene), the algorithm ensures that $\tilde{b}(x)=\tilde{b}(p_x)$, where $p_x$ is the parent of $x$. Thus, grafted loss leaves never require an extra EGTcopy event on an ‘inserted edge’ of $R_{DL}$.

Assume another reconciliation $\langle R',\tilde{s}',\tilde{b}',\tilde{e}'\rangle$ has a strictly lower cost than the reconciliation $\langle R,\tilde{s},\tilde{b},\tilde{e}\rangle$ output by Algorithm 1. We first show that, for any node of $T$, the corresponding nodes in $R$ and $R'$ have the same event label. Assume this is not the case. Let $x$ be the lowest node of $T$ such that $\tilde{e}(x)\neq\tilde{e}'(x)$. Let $x_l$ and $x_r$ be its two children in $T$, and let $v_l$ and $v_r$ be the two non-unary descendants of $x$ in $R'$ that are closest to $x$. Note that $x_l$ and $x_r$ do not necessarily correspond to $v_l$ and $v_r$ in $R'$. Rather, they may be strict descendants of these nodes in $R'$.

1. If $\tilde{e}_{DL}(x)=Dup$, then from Algorithm 1, $\tilde{e}(x)=Dup$ if $\tilde{b}(x_l)=\tilde{b}(x)$ and $\tilde{b}(x_r)=\tilde{b}(x)$, and $\tilde{e}(x)=EGTcopy$ otherwise. As $\tilde{e}(x)\neq\tilde{e}'(x)$, we should have $\tilde{e}'(x)\in\{Spe,EGTcopy\}$ in the first case, or $\tilde{e}'(x)\in\{Spe,Dup\}$ in the second case.

Assume $\tilde{e}'(x)=Spe$. From Lemma 1, as $\langle R',\tilde{s}',\tilde{b}',\tilde{e}'\rangle$ is a reconciliation of minimum cost, $\tilde{s}'(x)=lca_S(s(\mathcal{L}(T[x])))$, and as $x$ is a speciation node in $R'$, one of $v_l$ and $v_r$ should be mapped to $\tilde{s}'(x)_l$ and the other to $\tilde{s}'(x)_r$. Assume w.l.o.g. that $\tilde{s}'(v_l)=\tilde{s}'(x)_l$ and $\tilde{s}'(v_r)=\tilde{s}'(x)_r$. Now, as $x$ is a duplication node in $R_{DL}$, $\tilde{s}(x_l)=\tilde{s}(x)$ or $\tilde{s}(x_r)=\tilde{s}(x)$. Assume w.l.o.g. that $\tilde{s}(x_l)=\tilde{s}(x)$. As $x_l$ is a node of the subtree of $R'$ rooted at $v_l$, by definition of a reconciliation, $\tilde{s}'(x_l)$ should be a descendant of $\tilde{s}'(v_l)$, which is not the case as $\tilde{s}'(v_l)=\tilde{s}'(x)_l$ is rather a strict descendant of $\tilde{s}'(x)=\tilde{s}(x_l)=\tilde{s}'(x_l)$. Therefore, $x$ cannot be a speciation node in $\langle R',\tilde{s}',\tilde{b}',\tilde{e}'\rangle$. We deduce that $\tilde{e}'(x)\in\{Dup,EGTcopy\}$.

Now assume that $\tilde{b}(x_l)\neq\tilde{b}(x)$ or $\tilde{b}(x_r)\neq\tilde{b}(x)$. In this case, the algorithm puts $\tilde{e}(x)=EGTcopy$ and, as $x$ is not a speciation, it should be a duplication node in $\langle R',\tilde{s}',\tilde{b}',\tilde{e}'\rangle$. But then a unary EGTcut node $v$ should be present in one of the two paths from $x$ to $x_l$ or from $x$ to $x_r$ in $R'$, contradicting the fact that $\langle R',\tilde{s}',\tilde{b}',\tilde{e}'\rangle$ is a reconciliation of minimum cost, since labeling $x$ as an EGTcopy node and removing $v$ would reduce the cost of the reconciliation by one.

Finally, assume that $\tilde{b}(x_l)=\tilde{b}(x)$ and $\tilde{b}(x_r)=\tilde{b}(x)$. In this case, the algorithm puts $\tilde{e}(x)=Dup$ and, as $x$ is not a speciation, it should be an EGTcopy node in $\langle R',\tilde{s}',\tilde{b}',\tilde{e}'\rangle$, which implies, by definition of an EGTcopy event, that one of the two children $y$ of $x$ in $R'$ is such that $\tilde{b}'(y)\neq\tilde{b}'(x)$. Now, as $\tilde{b}(x)=\tilde{b}(x_l)=\tilde{b}(x_r)$, one unary EGTcut node $v$ should change the $\tilde{b}'$ labeling of $y$ to the $\tilde{b}'$ labeling of its descendant in $\{x_l,x_r\}$. But then relabeling $x$ as a duplication node would allow removing $v$ and thus reducing the cost of the reconciliation by one, contradicting the fact that $\langle R',\tilde{s}',\tilde{b}',\tilde{e}'\rangle$ is a reconciliation of minimum cost.

2. If $\tilde{e}_{DL}(x)=Spe$, then from the properties of a DL-Reconciliation, we should have $\tilde{s}(x_l)\neq\tilde{s}(x)$ and $\tilde{s}(x_r)\neq\tilde{s}(x)$. From Algorithm 1, $x$ remains a speciation node in $\langle R,\tilde{s},\tilde{b},\tilde{e}\rangle$.

As $\tilde{e}(x)\neq\tilde{e}'(x)$, we should have $\tilde{e}'(x)=Dup$ or $\tilde{e}'(x)=EGTcopy$. In both cases, $\tilde{s}'(v_l)=\tilde{s}'(v_r)=s(x)$. This implies that $x_l\neq v_l$ and $x_r\neq v_r$, and thus $v_l$ and $v_r$ are grafted because of losses. Since $R'$ uses the LCA-mapping by Lemma 1, we can remove $v_l$, $v_r$ and their corresponding grafted loss leaves and make $x$ a speciation, while preserving a valid reconciliation. This saves a cost of three (two losses and a Dup or EGTcopy event). In the worst case, we had $\tilde{e}'(x)=EGTcopy$, in which case we can add an EGTcut event on the appropriate branch to enforce the same switch.

Thus replacing the Dup or EGTcopy label of $x$ by a speciation reduces the cost of $R'$ by at least two, contradicting the fact that $R'$ is a reconciliation of minimum cost.

Since $R$ has the same number of Dup and EGTcopy events as $R'$, it remains to show that we cannot graft fewer nodes than those induced by Algorithm 1. The grafted nodes are either binary nodes corresponding to losses, or unary EGTcut nodes. Suppose $R'$ has fewer grafted nodes than $R$. Then there is an edge $(x,y)$ in $T$ such that the corresponding path $P'_{x,y}=(x,v'_1,v'_2,\ldots,v'_{n'},y)$ in $R'$ is shorter than the corresponding path $P_{x,y}=(x,v_1,v_2,\ldots,v_n,y)$ in $R$. We consider a lowest edge $(x,y)$ of $T$ verifying this condition, and we assume, without loss of generality, that $y=x_l$. Recall that by Lemma 1, $\tilde{s}(x)=\tilde{s}'(x)$ and $\tilde{s}(y)=\tilde{s}'(y)$.

  • If $\tilde{e}_{DL}(x)=Dup$, then $x$ is a duplication or an EGTcopy node in both $R$ and $R'$. Then, by definition of a reconciliation, $\tilde{s}(v_1)=\tilde{s}(x)$. Moreover, from the fact that $R$ is obtained from $R_{DL}$, Algorithm 1 leads to a path $P_{x,y}$ with as many nodes as the path from $\tilde{s}(x)$ to $\tilde{s}(x_l)$ in $S$ if $x$ is a duplication node, and an additional EGTcut node if $b_T(x)\neq b_T(x_l)=b_T(x_r)$. Moreover, it is easy to see that the number of losses grafted on $(x,y)$ must be equal to the number of nodes on the path from $\tilde{s}(x)$ to $\tilde{s}(y)$, excluding $\tilde{s}(y)$, in both $R$ and $R'$, and that the EGTcut event added by the algorithm cannot be avoided. Thus, the path $P'_{x,y}$ should be at least as long as $P_{x,y}$, contradicting the hypothesis that $P'_{x,y}$ is shorter than $P_{x,y}$.

  • If $\tilde{e}_{DL}(x)=Spe$, then $x$ is a speciation node in both $R$ and $R'$. Then, by definition of a reconciliation, $\tilde{s}(v_1)=\tilde{s}'(v'_1)=\tilde{s}(x)_l$. Thus, from the fact that $R$ is obtained from $R_{DL}$, Algorithm 1 leads to a path $P_{v_1,y}$ with as many nodes as the path from $\tilde{s}(x)_l$ to $\tilde{s}(x_l)$ in $S$, with an additional EGTcut node if $\tilde{b}(x)\neq\tilde{b}(x_l)$. Moreover, it is easy to see that no other operation (Spe, Dup, EGTcopy or EGTcut) can allow making fewer losses or avoiding the EGTcut event. Thus, the path $P'_{v'_1,y}$ should be at least as long as $P_{v_1,y}$, contradicting the hypothesis that $P'_{x,y}$ is shorter than $P_{x,y}$.

Proof of Theorem 1

Let us first argue on the complexity of computing $D[x,b_x]$ for every $x\in V(T)$ and every $b_x\in\{0,1\}$ (including $D[r(T),0]$ and $D[r(T),1]$, our values of interest). The LCA-mapping $\tilde{s}$ can be computed in time $O(|V(T)|+|V(S)|)$ using classical approaches from DL-reconciliation. We can compute $D[x,0]$ and $D[x,1]$ for every $x\in V(T)$ in a post-order traversal of $T$ (because their value only depends on $x_l$ and $x_r$), and thus there are $O(|V(T)|)$ values to compute. If we assume that we have access to $l_x$ for each $x$, it is clear from the recurrences that $D[x,b_x,Spe]$, $D[x,b_x,Dup]$ and $D[x,b_x,EGTcopy]$ can be computed in $O(1)$ time. To access $l_x$ in time $O(1)$ for any $x$, we can preprocess $S$ by labeling each $v\in V(S)$ by its depth (i.e. its distance to the root). Then, the length of $path(\tilde{s}(x),\tilde{s}(x_l))$ is simply given by the difference in depth between $\tilde{s}(x)$ and $\tilde{s}(x_l)$ (because $\tilde{s}(x_l)$ must be a descendant of $\tilde{s}(x)$). This difference can be obtained in constant time, and it follows that $l_x$ can be obtained in $O(1)$. Therefore, each $D[x,b_x]$ entry takes $O(1)$ time to compute. Including the time to compute the preprocessing and the LCA-mapping, the total time of the algorithm is $O(|V(T)|+|V(S)|)$.
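The depth preprocessing can be sketched as follows. The code assumes that the species tree is given as a child-to-parent dictionary and that $l_x$ counts the nodes of both paths $path(\tilde{s}(x),\tilde{s}(x_l))$ and $path(\tilde{s}(x),\tilde{s}(x_r))$, endpoints included; this reading of $l_x$ is inferred from Claim 1.2 and should be checked against the definition in the main text. The LCA routine shown here simply walks up the tree and is not the constant-time structure needed for the linear bound; all function names are illustrative.

```python
def depths(parent):
    """Depth (distance to the root) of each node of a tree given as child -> parent."""
    depth = {}

    def d(v):
        if v not in depth:
            depth[v] = 0 if parent[v] is None else d(parent[v]) + 1
        return depth[v]

    for v in parent:
        d(v)
    return depth


def lca(u, v, parent, depth):
    """Naive lowest common ancestor: repeatedly lift the deeper node."""
    while u != v:
        if depth[u] >= depth[v]:
            u = parent[u]
        else:
            v = parent[v]
    return u


def l_x(sx, sl, sr, depth):
    """Nodes on the two paths sx -> sl and sx -> sr, endpoints included (assumed definition)."""
    return (depth[sl] - depth[sx] + 1) + (depth[sr] - depth[sx] + 1)


# Toy species tree: R has children A and B; A has children a1 and a2.
parent = {"R": None, "A": "R", "B": "R", "a1": "A", "a2": "A"}
depth = depths(parent)
sx = lca("a1", "B", parent, depth)
print(sx, l_x(sx, "a1", "B", depth))  # R 5
```

With this convention, a speciation whose children map directly to the two children of $\tilde{s}(x)$ has $l_x=4$ and therefore needs no loss, in agreement with the $l_x-4$ term of Claim 1.2.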

Let us now argue that the algorithm is correct. Let $x\in V(T)$, let $b_x\in\{0,1\}$, and let $\mathcal{R}=\langle R,\tilde{s},\tilde{b},\tilde{e}\rangle$ be a DLE-Reconciliation of minimum cost between $T[x]$ and $S$ that satisfies $\tilde{b}(x)=b_x$. The proof is by induction on the height of $T[x]$. If $x$ is a leaf, it is easy to see that $D[x,b_x]$ is correct. Assume that $x$ is an internal node with children $x_l$ and $x_r$. We may inductively assume that $D[x_l,b_l]$ and $D[x_r,b_r]$ are computed correctly for $b_l,b_r\in\{0,1\}$.

In what follows, let $\mathcal{R}_l=\langle R_l,\tilde{s}_l,\tilde{b}_l,\tilde{e}_l\rangle$ be the reconciliation between $T[x_l]$ and $S$ obtained by taking $R[x_l]$, and restricting $\tilde{s}$, $\tilde{b}$ and $\tilde{e}$ to $V(R[x_l])$. Similarly, let $\mathcal{R}_r$ be the reconciliation of $T[x_r]$ with $S$ obtained by taking $R[x_r]$ and restricting $\tilde{s}$, $\tilde{b}$ and $\tilde{e}$ to $V(R[x_r])$.

We show two useful claims, the first being that these sub-reconciliations must be optimal with respect to their subtrees.

Claim 1.1. $c(\mathcal{R}_l)=D[x_l,\tilde{b}(x_l)]$ and $c(\mathcal{R}_r)=D[x_r,\tilde{b}(x_r)]$.

Proof. By induction and by the definition of $D$, we have $D[x_l,\tilde{b}(x_l)]\le c(\mathcal{R}_l)$. Moreover, let $\mathcal{R}'_l=\langle R'_l,\tilde{s}'_l,\tilde{b}'_l,\tilde{e}'_l\rangle$ be a reconciliation of $T[x_l]$ of cost $D[x_l,\tilde{b}(x_l)]$. In $\mathcal{R}$ we may replace the $R[x_l]$ subtree by $\mathcal{R}'_l$ (more precisely, replace $R[x_l]$ by $R'_l$, and use $\tilde{s}'_l$, $\tilde{b}'_l$ and $\tilde{e}'_l$ for the vertices of $R'_l$). Since $\tilde{s}'_l(x_l)=\tilde{s}(x_l)$ and $\tilde{b}'_l(x_l)=\tilde{b}(x_l)$, all conditions of a valid reconciliation are met after such a replacement. Furthermore, no additional loss, EGTcopy or EGTcut is required on the path from $x$ to $x_l$. If $D[x_l,\tilde{b}(x_l)]<c(\mathcal{R}_l)$ held, this transformation would yield a lower cost reconciliation and contradict the optimality of $\mathcal{R}$. Therefore, $D[x_l,\tilde{b}(x_l)]\ge c(\mathcal{R}_l)$. It follows that $D[x_l,\tilde{b}(x_l)]=c(\mathcal{R}_l)$. By a symmetric argument, $D[x_r,\tilde{b}(x_r)]=c(\mathcal{R}_r)$.

Claim 1.2. If $\tilde{e}(x)=Spe$, then there are at least $l_x-4$ losses grafted on the $(x,x_l)$ and $(x,x_r)$ branches, and otherwise, there are at least $l_x-2$ such grafted losses.

Proof. If $\tilde{e}(x)=Spe$, in $R$ there must be a loss grafted on the $(x,x_l)$ (respectively $(x,x_r)$) branch for each node of $path(\tilde{s}(x),\tilde{s}(x_l))$ (respectively $path(\tilde{s}(x),\tilde{s}(x_r))$), excluding $\tilde{s}(x)$ and $\tilde{s}(x_l)$ (respectively $\tilde{s}(x_r)$). The number of such losses is $l_x-4$, which induces a cost of $\lambda(l_x-4)$. If $\tilde{e}(x)\in\{Dup,EGTcopy\}$, the required losses are the same, except that we do not exclude $\tilde{s}(x)$ from both paths, and thus $l_x-2$ losses are required for a cost of $\lambda(l_x-2)$.

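We now argue that $D[x,b_x]\le c(\mathcal{R})$. First assume that $\tilde{e}(x)\in\{Spe,Dup\}$. We then consider the four possible $\tilde{b}$ labelings of $x_l$ and $x_r$.

  • If $\tilde{b}(x)=\tilde{b}(x_l)=\tilde{b}(x_r)$, then no cost other than the losses is required on the $(x,x_l)$ and $(x,x_r)$ branches. Thus, using claims 1.1 and 1.2,

$$c(\mathcal{R})\ge\begin{cases}\lambda(l_x-4)+c(\mathcal{R}_l)+c(\mathcal{R}_r) & \text{if } \tilde{e}(x)=Spe\\ \delta+\lambda(l_x-2)+c(\mathcal{R}_l)+c(\mathcal{R}_r) & \text{if } \tilde{e}(x)=Dup\end{cases}$$

$$=\begin{cases}\lambda(l_x-4)+D[x_l,b_x]+D[x_r,b_x] & \text{if } \tilde{e}(x)=Spe\\ \delta+\lambda(l_x-2)+D[x_l,b_x]+D[x_r,b_x] & \text{if } \tilde{e}(x)=Dup\end{cases}$$

Since, for both $\tilde{e}(x)\in\{Spe,Dup\}$, $D[x,b_x,\tilde{e}(x)]$ adds the losses, plus the minimum of $D[x',b_x]$ and $\rho^*_{b_x}+D[x',1-b_x]$ for each child $x'\in\{x_l,x_r\}$, we see that $D[x,b_x]\le D[x,b_x,\tilde{e}(x)]\le c(\mathcal{R})$.

  • If $\tilde{b}(x)=\tilde{b}(x_l)$ and $\tilde{b}(x)=1-\tilde{b}(x_r)$, then no additional cost is required on the $(x,x_l)$ branch, but a switch is required on $(x,x_r)$. The minimum possible cost of such a switch is $\rho^*_{b_x}$, and thus, using the two claims as in the previous case (we omit the step replacing $c(\mathcal{R}_l)$ by $D[x_l,b_x]$ and $c(\mathcal{R}_r)$ by $D[x_r,1-b_x]$, which is implicit by claim 1.1), if $\tilde{e}(x)=Spe$, we have

$$c(\mathcal{R})\ge\lambda(l_x-4)+D[x_l,b_x]+\rho^*_{b_x}+D[x_r,1-b_x]$$

and if $\tilde{e}(x)=Dup$, we have

$$c(\mathcal{R})\ge\delta+\lambda(l_x-2)+D[x_l,b_x]+\rho^*_{b_x}+D[x_r,1-b_x]$$

Again, the above expressions are considered by the minimization of $D[x,b_x,\tilde{e}(x)]$, and so $D[x,b_x]\le D[x,b_x,\tilde{e}(x)]\le c(\mathcal{R})$.

  • If $\tilde{b}(x)=1-\tilde{b}(x_l)$ and $\tilde{b}(x)=\tilde{b}(x_r)$, this case is symmetric to the previous one.

  • If $\tilde{b}(x)=1-\tilde{b}(x_l)$ and $\tilde{b}(x)=1-\tilde{b}(x_r)$, then a switch with host $b_x$ is needed on both branches $(x,x_l)$ and $(x,x_r)$. Thus, if $\tilde{e}(x)=Spe$, we have

$$c(\mathcal{R})\ge\lambda(l_x-4)+\rho^*_{b_x}+D[x_l,1-b_x]+\rho^*_{b_x}+D[x_r,1-b_x]$$

and if $\tilde{e}(x)=Dup$, we have

$$c(\mathcal{R})\ge\delta+\lambda(l_x-2)+\rho^*_{b_x}+D[x_l,1-b_x]+\rho^*_{b_x}+D[x_r,1-b_x]$$

Again, these are considered in $D[x,b_x,\tilde{e}(x)]$, and we get $D[x,b_x]\le D[x,b_x,\tilde{e}(x)]\le c(\mathcal{R})$.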

In all cases, $D[x,b_x]\le c(\mathcal{R})$. It remains to show that this holds for $\tilde{e}(x)=EGTcopy$. In this case, a cost of $\tau_{b_x}$ must be counted for the $x$ node, plus the cost for $l_x-2$ losses by claim 1.2. Next, we consider all values of $\tilde{b}(x_l)$ and $\tilde{b}(x_r)$.

  • If $\tilde{b}(x_l)\neq\tilde{b}(x_r)$, then as we argued

$$c(\mathcal{R})\ge\tau_{b_x}+\lambda(l_x-2)+c(\mathcal{R}_l)+c(\mathcal{R}_r)=\tau_{b_x}+\lambda(l_x-2)+D[x_l,\tilde{b}(x_l)]+D[x_r,\tilde{b}(x_r)]$$

The latter expression is among the expressions that $D[x,b_x,EGTcopy]$ minimizes and thus $D[x,b_x]\le D[x,b_x,EGTcopy]\le c(\mathcal{R})$.

  • If $b_x=\tilde{b}(x_l)=\tilde{b}(x_r)$, then since $x$ is an EGTcopy event, one of the $(x,x_l)$ or $(x,x_r)$ branches must switch to $1-b_x$, then switch back to $b_x$, implying an EGTcut from $1-b_x$ to $b_x$ of cost $\rho^*_{1-b_x}$. In this situation,

$$c(\mathcal{R})\ge\tau_{b_x}+\lambda(l_x-2)+\rho^*_{1-b_x}+D[x_l,b_x]+D[x_r,b_x]$$

which is considered among the expressions minimized by $D[x,b_x,EGTcopy]$. Again, $D[x,b_x]\le D[x,b_x,EGTcopy]\le c(\mathcal{R})$.

  • If $b_x=1-\tilde{b}(x_l)=1-\tilde{b}(x_r)$, then one of the $(x,x_l)$ or $(x,x_r)$ branches stays in $b_x$, and thus must switch to $1-b_x$ for a cost of $\rho^*_{b_x}$. In this situation, since both children are labeled $1-b_x$,

$$c(\mathcal{R})\ge\tau_{b_x}+\lambda(l_x-2)+\rho^*_{b_x}+D[x_l,1-b_x]+D[x_r,1-b_x]$$

which is considered among the expressions minimized by $D[x,b_x,EGTcopy]$. Again, $D[x,b_x]\le D[x,b_x,EGTcopy]\le c(\mathcal{R})$.

In every possible case, $D[x,b_x]\le c(\mathcal{R})$.

We must now prove the complementary bound, i.e. that D[x,bx]c(R). Let e{Spe,Dup,EGTcopy} such that D[x,bx]=D[x,bx,e]. If e = Spe, the expression D[x,bx,Spe] corresponds to making x a speciation (which is possible since we check that neither of s˜(x)=s˜(xl) nor s˜(x)=s˜(xr) holds) and adding the minimum number of mandatory losses on (x,xl) and (x,xr). Let bl{0,1} that minimizes min(D[xl,bx],ρbx*+D[xl,1bx]), and define br for xr analogously. Thus consider the reconciliation R in which x is a speciation, on which we graft the lx4 mandatory losses on (x,xl) and (x,xr) and then, for each of bl or br that differs from bx, adds an EGTcut on the corresponding branch. Then, for T[xl] subtree, take an optimal reconciliation Rl for T[xl] and for the T[xr] subtree, take the optimal reconciliation Rr for T[xr]. By induction, Rl and Rr are of costs D[xl,bl] and D[xr,br] respectively. Since all optimal reconciliations use the LCA-mapping, such a reconciliation is valid and its cost is as defined in D[x,bx,Spe]. It follows that D[x,bx,Spe]=c(R)c(R) (the latter inequality owing to the optimality of R).

If e = Dup, the argument is exactly the same, except that to construct R, we make x a duplication and add lx2 losses instead.

Finally, assume that e = EGTcopy. It is not hard to see that each expression that D[x,bx,EGTcopy] may choose when minimizing corresponds to a valid reconciliation. Indeed, consider the reconciliation R where e˜(x)=EGTcopy for a cost of τbx. We add lx2 mandatory losses on the (x,xl) and (x,xr) branches. Then, the first two cases of the minimization in D[x,bx,EGTcopy] correspond to having no additional switch needed, and hence we can use the optimal reconciliation for T[xl] and T[xr]. The third case corresponds to having both xl and xr mapped to bx, in which case we can choose to apply the EGTcopy on (x,xl), but need to switch back for a cost of ρ1bx*. The last case corresponds to having both xl and xr mapped to 1bx, in which case the EGTcopy applies one switch, and we add an EGTcut for the other switch of cost ρbx*.

Since each possible case represents the cost of a valid reconciliation $R'$, we get $D[x,b_x,EGTcopy] = c(R') \geq c(R)$. Thus for every possible value of $e$, we have $D[x,b_x] = D[x,b_x,e] \geq c(R)$.

To conclude, the two complementary bounds show that $D[x,b_x] = c(R)$.

Fig. A1.

DLE-Reconciliations obtained for MitoCOG0005, MitoCOG0051 and MitoCOG0053 with the EndoRex score settings S1, S2, S3, S4 and S5. The blue part of the tree indicates that the genetic material is located in the mitochondrion, while the red part indicates location in the nucleus. The shape of an internal node represents its associated event, as represented in Figure 1 (circle for a speciation, rectangle for a duplication and triangle for an EGT event). Loss events are not represented. Genes are formatted as follows: [species name]__[gene-encoding location]__[gene id], where 0 indicates a location in the mitochondrion and 1 indicates a location in the nucleus.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
