Abstract
Motivation
It is widely accepted that all extant mitochondria originated from a unique endosymbiotic event integrating an α-proteobacterial genome into a eukaryotic cell. Subsequently, eukaryote evolution has been marked by episodes of gene transfer, mainly from the mitochondria to the nucleus, resulting in a significant reduction of the mitochondrial genome, which has eventually disappeared completely in some lineages. However, in other lineages such as land plants, a high variability in gene repertoire distribution, including genes encoded in both the nuclear and mitochondrial genomes, is an indication of an ongoing process of Endosymbiotic Gene Transfer (EGT). Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is expected to shed light on a number of open questions regarding the evolution of eukaryotes, including rooting of the eukaryotic tree.
Results
We address the problem of inferring the evolution of a gene family through duplication, loss and EGT events, the latter considered as a special case of horizontal gene transfer occurring between the mitochondrial and nuclear genomes of the same species (in one direction or the other). We consider EGT events that either maintain (EGTcopy) or remove (EGTcut) the gene copy in the source genome. We present a linear-time algorithm for computing the DLE (Duplication, Loss and EGT) distance, as well as an optimal reconciled tree, for the unitary cost, and a dynamic programming algorithm that outputs all optimal reconciliations for an arbitrary cost of operations. We illustrate the application of our EndoRex software, analyze different cost settings on a plant dataset and discuss the resulting reconciled trees.
Availability and implementation
EndoRex implementation and supporting data are available on the GitHub repository via https://github.com/AEVO-lab/EndoRex.
1 Introduction
Genomics and cell biology investigations have revealed that all known eukaryotes descend from a common ancestral mitochondrial-containing cell that originated from the integration of an endosymbiotic α-proteobacterium into a host cell (Dyall and Johnson, 2000). After this early event, eukaryotic gene contents have been shaped by duplications, losses and Horizontal Gene Transfers (HGT) from one species to another, but also by Endosymbiotic Gene Transfers (EGT), mainly from the mitochondrion to the nucleus, in some cases leading to the total disappearance of the mitochondrion (Roger et al., 2017; Sloan et al., 2018).
Many questions regarding the ancestral mitochondrial proteome and gene content evolution remain open (Lang and Burger, 2012). One of the reasons is that, to date, comparative genomics studies have largely focused on multicellular eukaryotes, mainly animals and plants. While imprints of global evolutionary events at the genomic level are hardly visible on multicellular eukaryotes that have diverged too much from the Last Eukaryotic Common Ancestor (LECA), protists, known to have emerged close to the eukaryotic origin, are better candidates for such a comprehensive evolutionary study. Interestingly, recent sequencing efforts on jakobid (Gray et al., 2020) and malawimonad (Derelle et al., 2015) protist genomes have been undertaken by a consortium of protistologists (DeepEuk), suggesting that enough data will soon be available to allow further investigations on early-eukaryotic evolution.
In addition to having the appropriate datasets, understanding the concerted evolution of the eukaryotic mitochondrial and nuclear genomes also requires having the appropriate algorithmic tools. This problem can be seen as related to the host-parasite coevolution inference problem (Charleston and Perkins, 2006). Given a host tree and a parasite tree, cophylogenetic analysis consists in inferring a history of codivergence, parasite duplication, host switch or extinction events explaining the coevolution of hosts and parasites. However, nuclear and mitochondrial genomes can hardly be treated by the same kind of approach, since they evolve together within the same species, albeit under different evolutionary models, and are thus related through the same species tree. Rather, inferring an endosymbiotic evolutionary history requires focusing on gene families and studying the movement of genes between the mitochondrial and nuclear genomes.
Inferring the evolution of gene families is the purpose of the gene tree-species tree reconciliation field, which seeks a most parsimonious (El-Mabrouk and Noutahi, 2019; Goodman et al., 1979) or a most probable (Akerborg et al., 2009; Szöllősi et al., 2015) evolutionary scenario of gene gain and loss explaining the incongruence between a gene tree and a species tree. A most parsimonious reconciliation minimizing the number of Duplications (the D-distance) or the number of Duplications and Losses (the DL-distance) can be found in linear time using the LCA (Last Common Ancestor) mapping (Chen, 2000; Zhang, 1997; Zmasek and Eddy, 2001). Such an algorithm can actually be used to solve the cophylogenetic problem if operations are restricted to coevolution, duplication and extinction. Including HGT events (i.e. finding the DTL-distance) leads to an NP-hard problem if time-consistency is required, and remains polynomial otherwise (Bansal et al., 2012; Tofigh et al., 2011).
In this article, we introduce the reconciliation model accounting for EGT events, i.e. the special case of HGT events where genes are exchanged only between the mitochondrial and nuclear genomes of the same species. Although integration of the mitochondrial content into the nucleus is the most frequent event in the course of evolution of eukaryotes, the transfer from the nucleus to the mitochondrion has also been observed (Adams and Palmer, 2003). Here, we consider the exchange of genes in both directions. Moreover, we consider EGT events resulting in maintaining a gene copy in the source genome (EGTcopy), as well as those resulting in the removal or loss of function of the gene in the source genome (EGTcut).
Formally, given a gene tree for a gene family with a known mitochondrial or nuclear location for each gene copy, we seek a most parsimonious sequence of Duplication, Loss and EGT (DLE) events explaining the tree given a known species tree. First, based on the DL-distance and on the Fitch algorithm for weighted parsimony, we present, in Section 3, a linear-time algorithm for computing the DLE-Distance, as well as an optimal reconciled tree for the unitary cost. We then develop, in Section 4, a general dynamic programming algorithm that can be used to output all optimal reconciliations, for an arbitrary cost of operations, including possibly a different cost for an EGT from the mitochondrion to the nucleus, or conversely. This algorithm is linear in the size of the gene tree. It can be seen as an adaptation of the quadratic-time DTL algorithm for dated trees (Doyon et al., 2010), which allows transfers between any co-existing species. We finally illustrate, in Section 5, the application of our EndoRex software on clusters of orthologous mitochondrial protein-coding genes (MitoCOGs) (Kannan et al., 2014) of plants, analyze different cost settings and discuss the obtained reconciled trees.
For space reasons, some of the proofs are given in the Appendix.
2 Preliminaries
All trees are considered rooted. Given a tree T, we denote by r(T) its root, by V(T) its set of nodes and by $\mathcal{L}(T) \subseteq V(T)$ its leafset. A node x is a descendant of $x'$ if x is on the path from $x'$ to a leaf of T, and an ancestor of $x'$ if x is on the path from r(T) to $x'$; x is a strict descendant (respectively strict ancestor) of $x'$ if it is a descendant (respectively ancestor) of $x'$ different from $x'$. Moreover, x is the parent of $x'$ if it directly precedes $x'$ on the path from $x'$ to r(T). In this latter case, $x'$ is a child of x. We denote by E(T) the set of edges of T, where an edge is represented by its two terminal nodes $(x, x')$, with x being the parent of $x'$. An internal node (a node which is not a leaf) is said to be unary if it has a single child and binary if it has two children. If not stated differently, the children of a binary node x are denoted xl and xr. Given a node x of T, the subtree of T rooted at x is denoted $T[x]$.
A binary tree is a tree with all internal nodes being binary. If internal nodes have one or two children, then the tree is said partially binary.
The lowest common ancestor (LCA) in T of a subset $L'$ of $\mathcal{L}(T)$, denoted $lca_T(L')$, is the ancestor common to all the nodes in $L'$ that is the most distant from the root.
A tree R is an extension of a tree T if it is obtained from T by grafting unary or binary nodes in T, where grafting a unary node x on an edge (u, v) consists in creating a new node x, removing the edge (u, v) and creating two edges (u, x) and (x, v), and in the case of grafting a binary node, also creating a new leaf y and an edge (x, y). In the latter case, we say that y is a grafted leaf.
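The two grafting operations can be illustrated with a minimal Python sketch. The `Node` class and the function names `graft_unary` and `graft_binary` are hypothetical helpers introduced here for illustration only; they are not taken from EndoRex.

```python
class Node:
    """Minimal rooted-tree node (hypothetical helper, not part of EndoRex)."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def graft_unary(u, v, name):
    """Replace edge (u, v) by edges (u, x) and (x, v); return the new node x."""
    x = Node(name)
    u.children[u.children.index(v)] = x  # x takes the place of v below u
    x.parent = u
    x.add_child(v)
    return x


def graft_binary(u, v, name, leaf_name):
    """Graft x on edge (u, v), then add a grafted leaf y as second child of x."""
    x = graft_unary(u, v, name)
    x.add_child(Node(leaf_name))
    return x


u = Node("u")
v = u.add_child(Node("v"))
x = graft_binary(u, v, "x", "y")
print([c.name for c in u.children], [c.name for c in x.children])
```

In a reconciliation, a grafted unary node would carry an EGTcut event and a grafted leaf would stand for a loss.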
Species and gene trees: The species tree S for a set Σ of species represents a partially ordered set of speciation events that have led to Σ. In this article, we consider that each species σ of Σ has two genomes: σ0 corresponding to its mitochondrial genome and σ1 corresponding to its nuclear genome.
A gene family is a set Γ of genes where each gene x belongs to a given species s(x) of Σ. A tree T is a gene tree for a gene family Γ if its leafset is in bijection with Γ. We will make no distinction between a leaf of T and the gene of Γ it corresponds to. We call s(x) the species labeling of the leaf x. For a subset G of genes, we write $s(G) = \{s(x) : x \in G\}$ for the set of species containing the genes of G.
Moreover, we assign to each gene x of Γ a Boolean value b(x) corresponding to the genome it belongs to. More precisely, b(x) = 0 if x belongs to the mitochondrial genome $s(x)_0$ and b(x) = 1 if x belongs to the nuclear genome $s(x)_1$. In this article, we assume that the mitochondrial or nuclear location of each extant gene is known. We call b(x) the genome labeling of the leaf representing x.
An evolutionary history is represented by an event labeled tree, where the event label of an internal node x is its corresponding event. The event labeling of the internal nodes of a gene tree is obtained through reconciliation.
2.1 Reconciliation
Inside the species’ genomes, genes undergo Speciation (Spe) when the species to which they belong do, but also Duplication (Dup) i.e. the creation of a new gene copy, Loss of a gene copy and Horizontal Gene Transfer (HGT) when a gene is transmitted from a source to a target genome. In this article, we consider special cases of HGTs, called EGTs, only allowing the transmission of genes from the mitochondrial genome to the nuclear genome of the same species, or vice-versa. Moreover, we consider two types of EGTs: EGTcopy and EGTcut defined as follows (see Fig. 1):
Fig. 1.
The effect of an event on a node x of a gene tree representing the gene a belonging to the genome si, where s is a species and $i \in \{0, 1\}$ (for a species s, s0 is the mitochondrial genome and s1 the nuclear genome of s). The tree S in the upper right is the species tree, where u and v are the two species arising from the speciation of s. (Spe): Gives rise to a copy au in ui and av in vi; (Dup): Preserves the copy a in si and gives rise to a new copy b in si; (EGTcopy): Represents a transfer event from si to sj, where $j \in \{0, 1\}$ and $j \neq i$, preserving the copy a in si and giving rise to a new copy aj in sj; (EGTcut): Represents a transposition event from si to sj removing the copy a in si and creating a copy aj in sj
A gene x belonging to σi is copied (or transferred) by an EGTcopy event to σj, for $j \neq i$, if it is copied from σi and inserted in σj.
A gene x belonging to σi is transposed by an EGTcut event to σj, for $j \neq i$, if it is cut from σi and inserted in σj.
Thus, in this article, the set of considered events is: $DLE = \{Spe, Dup, Loss, EGTcopy, EGTcut\}$.
Notice that we do not consider general HGT events. To define a DLE-Reconciliation, assume that we are given a species tree S, a gene tree T, a mapping s from $\mathcal{L}(T)$ to $\mathcal{L}(S)$ and a mapping b from $\mathcal{L}(T)$ to {0, 1}. We need to define how to extend s and b to the internal nodes of T. Given an extension R of T (R can be equal to T), an extension of s is a function $\hat{s}$ from V(R) to V(S) such that, for each leaf x of T, $\hat{s}(x) = s(x)$. Moreover, an extension of b is a function $\hat{b}$ from V(R) to {0, 1} such that, for each leaf x of T, $\hat{b}(x) = b(x)$.
Definition 1
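(DLE-Reconciliation). Let Γ be a gene family where each $x \in \Gamma$ belongs to the genome b(x) of a species s(x) of Σ. Let T be a rooted binary gene tree for Γ and S be a rooted binary species tree for Σ. A DLE-Reconciliation is a quadruplet $\langle R, \hat{s}, \hat{b}, \hat{e} \rangle$ where R is a partially binary extension of T, $\hat{s}$ is an extension of s, $\hat{b}$ is an extension of b and $\hat{e}$ is an event labeling of the internal nodes of R, such that each unary node x with child y satisfies $\hat{s}(x) = \hat{s}(y)$ and $\hat{b}(x) \neq \hat{b}(y)$, in which case $\hat{e}(x) = EGTcut$, and each binary node x with children $x_l$ and $x_r$ satisfies one of the following:
a. $\hat{s}(x_l)$ and $\hat{s}(x_r)$ are the two children of $\hat{s}(x)$ in S and $\hat{b}(x) = \hat{b}(x_l) = \hat{b}(x_r)$, in which case $\hat{e}(x) = Spe$;
b. $\hat{s}(x) = \hat{s}(x_l) = \hat{s}(x_r)$ and $\hat{b}(x) = \hat{b}(x_l) = \hat{b}(x_r)$, in which case $\hat{e}(x) = Dup$, representing a duplication in the genome $\hat{b}(x)$ of $\hat{s}(x)$;
c. $\hat{s}(x) = \hat{s}(x_l) = \hat{s}(x_r)$ and $\hat{b}(x_l) \neq \hat{b}(x_r)$, in which case $\hat{e}(x) = EGTcopy$; let y be the element of $\{x_l, x_r\}$ such that $\hat{b}(y) \neq \hat{b}(x)$, then $(x, y)$ is a transfer with source genome $\hat{b}(x)$ and target genome $\hat{b}(y)$.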
A grafted leaf on a newly created node x corresponds to a loss in .
As R is an extension of T, each node in T has a corresponding node in R. In other words, we can consider that $V(T) \subseteq V(R)$. In particular, the species labeling $\hat{s}$ on R induces a species labeling on T.
Given a cost function c on DLE and a reconciliation $\langle R, \hat{s}, \hat{b}, \hat{e} \rangle$, the cost of the reconciliation is the sum of the costs of the induced events. In this article, we assume a 0 cost for speciations and positive costs for all the other events.
We are now ready to formally define the considered optimization problem.
DLE-Reconciliation Problem:
Input: A species tree S for a set of species Σ, a gene family Γ on Σ, a gene tree T for Γ, a species labeling s and a genome labeling b of $\mathcal{L}(T)$, and a cost function c on DLE.
Output: A most parsimonious DLE-Reconciliation, i.e. a DLE-Reconciliation of minimum cost.
In the next section, we first consider the case of a unitary cost, thus reducing the problem to minimizing the number of operations induced by a reconciliation. The cost DLE(T, S) of the most parsimonious DLE-Reconciliation for T and S in the case of a unitary cost c is called the DLE-Distance. We then extend the algorithmic developments to arbitrary costs, which in particular allows an EGTcopy or EGTcut event moving a gene from the mitochondrion to the nucleus to be costed differently from a similar event moving a gene from the nucleus to the mitochondrion.
In the following section, we will refer to the DL-Reconciliation of T and S. Recall that it is a triplet $\langle R, \hat{s}, \hat{e} \rangle$ defined by only considering the cases of speciations, duplications and losses in Definition 1, and ignoring the binary assignment of genes. We denote by DL(T, S) the DL-Distance, i.e. the minimum number of duplications and losses induced by a DL-Reconciliation. The DL-Reconciliation of cost DL(T, S) is unique and verifies, for any internal node x of T with children $x_l$ and $x_r$:
if $\hat{s}(x_l) \neq \hat{s}(x)$ and $\hat{s}(x_r) \neq \hat{s}(x)$ then x is a Speciation; otherwise x is a Duplication.
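The speciation/duplication rule above can be sketched in Python as follows. This is an illustrative sketch, not the EndoRex implementation: the species tree is assumed to be given as a hypothetical child-to-parent dictionary, the gene tree as nested pairs whose leaves are species names, and the LCA is computed naively (so the sketch is not linear-time).

```python
def ancestors(parent, s):
    """Path from species s up to the root of the species tree, s included."""
    path = [s]
    while s in parent:
        s = parent[s]
        path.append(s)
    return path


def lca(parent, s1, s2):
    """Lowest common ancestor of s1 and s2 (naive, for illustration only)."""
    above_s1 = set(ancestors(parent, s1))
    for a in ancestors(parent, s2):
        if a in above_s1:
            return a
    raise ValueError("nodes are not in the same tree")


def label_dl(gene_tree, parent, events=None):
    """Return (species mapped to the root, list of (species, event) in post-order)."""
    if events is None:
        events = []
    if isinstance(gene_tree, str):          # leaf: mapped to its own species
        return gene_tree, events
    left, right = gene_tree
    s_left, _ = label_dl(left, parent, events)
    s_right, _ = label_dl(right, parent, events)
    s_x = lca(parent, s_left, s_right)      # LCA-Mapping of the internal node
    is_spe = s_x != s_left and s_x != s_right
    events.append((s_x, "Spe" if is_spe else "Dup"))
    return s_x, events


# Species tree ((A,B)AB,C)ABC as a child -> parent map (toy example).
SPECIES_PARENT = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC"}
root_species, dl_events = label_dl((("A", "B"), ("A", "C")), SPECIES_PARENT)
print(root_species, dl_events)
```

On this toy input the two inner nodes are speciations (mapped to AB and ABC) and the root is a duplication mapped to ABC, because one of its children is mapped to the same species.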
We finally need to make the link between the species labeling of an optimal reconciliation and the well-known LCA-Mapping. This is formally stated in the following lemma.
Lemma 1
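(LCA-Mapping). Let $\langle R, \hat{s}, \hat{b}, \hat{e} \rangle$ be a DLE-Reconciliation of minimum cost between T and S. Then, for each $x \in V(T) \cap V(R)$, $\hat{s}(x) = lca_S(\hat{s}(\mathcal{L}(R[x])))$.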
Note that in the above statement, $V(T) \subseteq V(R)$, and thus the intersection is redundant. We write it this way to emphasize that x is a vertex of R (which happens to also be in T), i.e. the LCA-Mapping here applies to the reconciled trees, not to the original gene tree T.
3 A linear-time algorithm for the DLE-distance
In this section, we consider a unitary cost c on DLE.
Consider a given extension $\hat{b}$ of b to the internal nodes of T. We first present an algorithm for computing a DLE-Reconciliation of minimum cost, under the condition that its genome labeling coincides with $\hat{b}$ on each node of T. We will then show how a labeling $\hat{b}$ minimizing the DLE-Distance can be obtained.
Algorithm 1 computes the DLE-Reconciliation from the DL-Reconciliation (see Fig. 2 for an example).
Fig. 2.

The tree RDL in the upper left, together with its node labeling, is the optimal DL-Reconciliation for the gene tree T represented by the plain edges of RDL and the species tree S in the upper right. The two lower trees are obtained by Algorithm 1 for two different labelings of internal nodes: the left labeling is obtained by the Fitch algorithm for weighted parsimony, while the right labeling is obtained by applying Algorithm 2. The left labeling gives rise to a non-optimal reconciliation with seven operations (two losses, one duplication, two EGTcopy and two EGTcut), while the right labeling gives rise to the DLE-Distance, which is equal to six (two losses, three EGTcopy and one EGTcut). Rectangles represent duplications; triangles represent either EGTcopy or EGTcut events depending on whether the labeled node is binary or unary; dotted lines represent losses. A leaf xi represents a gene x belonging to the genome i (0 for mitochondrial and 1 for nuclear) of species X
Lemma 2
(Optimality of Algorithm 1). Given a binary assignment $\hat{b}$ of the nodes of T, Algorithm 1 outputs a DLE-Reconciliation of minimum cost under the constraint that its genome labeling coincides with $\hat{b}$ on every node of T.
It follows from Lemma 2 that if $\hat{b}$ is known in advance for the nodes of T, a DLE-Reconciliation of minimum cost is obtained from Algorithm 1 with $\hat{b}$ as input. We now focus on finding such a labeling $\hat{b}$.
Lemma 3
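(Necessary condition for $\hat{b}$) There exists a DLE-Reconciliation $\langle R, \hat{s}, \hat{b}, \hat{e} \rangle$ of minimum cost DLE(T, S) such that, for any node x of T and its children xl and xr in T, $\hat{b}(x) = \hat{b}(x_l)$ or $\hat{b}(x) = \hat{b}(x_r)$ (1).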
Proof.
Assume $\langle R, \hat{s}, \hat{b}, \hat{e} \rangle$ is a most parsimonious DLE-Reconciliation with a lowest node x not satisfying condition (1): $\hat{b}(x) = \hat{b}(x_l)$ or $\hat{b}(x) = \hat{b}(x_r)$. Thus we should have $\hat{b}(x_l) = \hat{b}(x_r) \neq \hat{b}(x)$. Note that an EGTcut event must be present on at least one of the $(x, x_l)$ or $(x, x_r)$ branches. A reconciliation of lower or equal cost can be obtained by assigning $\hat{b}(x) = \hat{b}(x_l)$ and removing this EGTcut event, reducing the cost by one. Let px be the parent of x in R (note that if x is the root, px might not exist, in which case there is nothing else to do). If $\hat{b}(p_x)$ is now different from $\hat{b}(x)$, we add an EGTcut event between px and x, yielding an alternate reconciliation of equal or lower cost.
We can reproduce the same transformation iteratively in a bottom-up fashion until condition (1) is satisfied for every node. □
For a node , define d(x) = 1 if x is a duplication in the DL-Reconciliation of minimum cost, and d(x) = 0 otherwise. Let be a binary labeling of V(T). For any node x of T, denote if , otherwise
and define:
Roughly speaking, the contribution of a node x reflects the number of label changes between x and its children xl and xr in T, with the exception that a duplication is allowed a ‘free’ change since it can be turned into an EGTcopy node. For example, in Figure 2, the total over all nodes is 2 for the labeling of T consistent with that of the left tree R (Algo1+Fitch), and 1 for the labeling of T consistent with that of the right tree R (Algo1+Algo2), reflecting, for each one, the number of required EGTcut events.
Lemma 4.
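The minimum cost of a DLE-Reconciliation between a gene tree T and a species tree S is DL(T, S) plus the minimum, over all binary labelings $\hat{b}$ of V(T), of the total defined above.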
Proof. By Lemma 2, Algorithm 1 correctly infers a minimum cost DLE-Reconciliation for a given . Note that this DLE-Reconciliation is obtained from a DL-Reconciliation by turning some duplication nodes into EGTcopy nodes (which do not change the cost), and by grafting some EGTcut nodes. Thus, the latter are responsible for any possible change in cost from DL(T, S) to DLE(T, S). It follows that the cost of the returned DLE-Reconciliation is DL(T, S), plus the number of grafted EGTcut nodes.
Let $\hat{b}$ be a binary assignment of T that minimizes DLE(T, S) when $\hat{b}$ is passed to Algorithm 1. By Lemma 3, we may assume that for any node x and its children xl and xr, $\hat{b}(x) = \hat{b}(x_l)$ or $\hat{b}(x) = \hat{b}(x_r)$. Thus the contribution of every node x to the total is at most one. Furthermore, this contribution equals one if and only if x is a speciation node and an EGTcut node is grafted on the edge $(x, x_l)$ (if $\hat{b}(x) \neq \hat{b}(x_l)$) or on the edge $(x, x_r)$ (if $\hat{b}(x) \neq \hat{b}(x_r)$). In consequence, the total counts exactly the number of graftings of EGTcut nodes. □
Since the most parsimonious DL-Reconciliation is unique, the DL(T, S) term in the above lemma is an invariant. Our goal is therefore to find the labeling $\hat{b}$ that minimizes this total.
This can be achieved by a slight modification of the Fitch algorithm (Fitch, 1971) computing, for a given tree with leaf labels, all possible label assignments of internal nodes minimizing the number of label changes along the edges of the tree. We first need to recall some concepts on parsimony. Given a tree T on a leafset L of residues (generally nucleotides or amino acids, but in this article corresponding to the possible genome labels 0 and 1), the weighted parsimony problem consists in assigning a residue to each internal node u of T in a way that minimizes the total weight of the tree. More precisely, given a cost matrix M on residues, the weight of T is the sum of the weights $M(a_u, a_v)$ over all edges $(u, v) \in E(T)$, where $a_u$ denotes the residue assigned to u. An assignment of T refers to the assignment of a residue to each internal node of T.
The Sankoff and Cedergren (1983) algorithm computes, in quadratic time, the minimum cost of an assignment of T. Moreover, it finds all the assignments of T attaining this minimum cost. When $M(a, a) = 0$ for all residues a and $M(a, b) = 1$ for $a \neq b$, weighted parsimony can be computed in linear time using the Fitch algorithm.
The Fitch algorithm consists of two phases. The first phase is recursive and reconstructs possible ancestral labels L(x) for each node x of T and the overall minimum number of label changes $C(x)$ required in the subtree rooted at x, as follows: For each node x of T in a bottom-up traversal, (1) if x is a leaf, then $L(x) = \{b(x)\}$ and $C(x) = 0$. (2) Else, let xl and xr be the children of x. If $L(x_l) \cap L(x_r) \neq \emptyset$, then $L(x) = L(x_l) \cap L(x_r)$ and $C(x) = C(x_l) + C(x_r)$; else $L(x) = L(x_l) \cup L(x_r)$ and $C(x) = C(x_l) + C(x_r) + 1$. The second phase of the algorithm reconstructs an assignment of T that has a minimum cost, by computing $\hat{b}$ as follows: For each node x of T in a top-down traversal, (1) if x is the root, assign $\hat{b}(x)$ to any label in L(x). (2) Else, let xp be the parent of x. If $\hat{b}(x_p) \in L(x)$, then assign $\hat{b}(x) = \hat{b}(x_p)$, else assign $\hat{b}(x)$ to any label in L(x).
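The two phases can be sketched in Python for binary labels. This is the plain Fitch procedure only, not Algorithm 2 (it does not give duplication nodes a free change). The nested-tuple tree encoding and the function names are assumptions made for this illustration, and ties are broken by taking the smallest label.

```python
def fitch_bottom_up(tree):
    """Phase 1: return (candidate labels, number of changes, annotated subtree).

    A leaf is an int label (0 or 1); an internal node is a pair of subtrees.
    """
    if isinstance(tree, int):
        return {tree}, 0, tree
    (l_left, c_left, a_left), (l_right, c_right, a_right) = (
        fitch_bottom_up(child) for child in tree
    )
    common = l_left & l_right
    if common:                               # children agree: no extra change
        return common, c_left + c_right, (common, a_left, a_right)
    union = l_left | l_right                 # children disagree: one more change
    return union, c_left + c_right + 1, (union, a_left, a_right)


def fitch_top_down(annotated, parent_label=None):
    """Phase 2: pick one label per internal node, preferring the parent's label."""
    if isinstance(annotated, int):
        return annotated
    labels, left, right = annotated
    label = parent_label if parent_label in labels else min(labels)
    return (label, fitch_top_down(left, label), fitch_top_down(right, label))


root_labels, n_changes, annotated = fitch_bottom_up(((0, 1), (1, 1)))
assignment = fitch_top_down(annotated)
print(root_labels, n_changes, assignment)
```

Each node of the returned assignment is a triple (label, left subtree, right subtree). On this toy tree a single label change is needed, located on the edge leading to the leaf labeled 0.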
The Fitch algorithm does not always find an optimal assignment because of duplications that can be turned into EGTcopy events. Algorithm 2 modifies the first phase of the Fitch algorithm to compute the DLE-Distance and an assignment of T that leads to the DLE-Distance. The modification reflects the fact that a duplication node is allowed a ‘free’ change since it can be turned into an EGTcopy node (see Fig. 2 for an illustration).
Lemma 5.
Algorithm 2 outputs, in linear time, the DLE-Distance DLE(T, S) and a binary assignment $\hat{b}$ of T that leads to a most parsimonious DLE-Reconciliation.
Proof. It suffices to prove that the following statement holds for any node x of T: for any label β in L(x), there exists a binary assignment of such that and minimizes .
If , then , and . Thus , without any increment.
If , then or , and or , and . Thus , without any increment.
In both cases, Algorithm 1 computes a DLE-Reconciliation with minimum cost with a minimum increment of 1 for a Dup node in case (1), or by making x an EGTcopy node in case (2), but no additional EGTcut node is required.
If x is a speciation node in the DL-reconciliation.
If , then , and or . So or , and . Thus , with a minimum increment of 1, obtained by grafting an EGTcut node on one of the or branches. In this case, Algorithm 1 computes a DLE-Reconciliation with minimum cost .
If , then and . So , and . Thus without any additional cost. Algorithm 1 computes a DLE-Reconciliation with minimum cost when given .
It is easy to see that both the first and the second phases of the algorithm have linear time complexity, thus the overall algorithm has a linear time complexity. □
Like the Fitch algorithm, Algorithm 2 does not output all the solutions of the DLE-Reconciliation problem leading to the DLE-Distance. This could be achieved by adapting the dynamic programming algorithm of Sankoff and Cedergren. Instead, we choose to introduce, in the next section, a more general dynamic programming algorithm that outputs all optimal solutions for an arbitrary cost of the DLE events, not only for the unitary cost.
4 Solving the DLE-reconciliation problem with arbitrary DLE costs
We now introduce a dynamic programming algorithm for general costs. We use δ and λ to denote the cost of a duplication and a loss, respectively. We use ρ0 (respectively τ0) for the cost of an EGTcut (respectively EGTcopy) from the mitochondrial genome to the nuclear genome, and ρ1 (respectively τ1) for the cost of an EGTcut (respectively EGTcopy) from the nuclear genome to the mitochondrial genome. Note that the subscripts of the EGT costs indicate the source of the switch. Also denote $\rho_0^* = \min(\rho_0, \tau_0 + \lambda)$ and $\rho_1^* = \min(\rho_1, \tau_1 + \lambda)$.
Roughly speaking, $\rho_0^*$ represents the minimum cost required to switch from mitochondrial to nuclear genome inside a branch of T, and $\rho_1^*$ the minimum cost required in the other direction. The purpose of $\rho_0^*$ and $\rho_1^*$ is that a switch can be accomplished by an EGTcut event, but also by an EGTcopy event followed by a loss.
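As a small illustration of this remark, the cheapest way to switch genome inside a branch is the minimum of the two options. The function name and the numeric costs below are hypothetical and chosen only to show both cases; they are not the settings used in the experiments.

```python
def switch_cost(egt_cut_cost, egt_copy_cost, loss_cost):
    """Cheapest way to change genome inside a branch.

    Either a single EGTcut, or an EGTcopy followed by the loss of the
    copy that stayed in the source genome.
    """
    return min(egt_cut_cost, egt_copy_cost + loss_cost)


# Unitary costs: the EGTcut alone is the cheapest option.
unit = switch_cost(egt_cut_cost=1, egt_copy_cost=1, loss_cost=1)

# Illustrative costs where EGTcut is expensive: EGTcopy + loss wins.
expensive_cut = switch_cost(egt_cut_cost=3.0, egt_copy_cost=1.0, loss_cost=1.0)

print(unit, expensive_cut)
```

The same function would be called once per direction, with the mitochondrion-to-nucleus costs and with the nucleus-to-mitochondrion costs.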
Let $x \in V(T)$. Note that $\hat{s}(x)$ does not need to be inferred, since by Lemma 1, we can assume that it is given by the LCA-Mapping. Our dynamic programming table only needs to store the optimal cost on the subtree of T rooted at x for each possible $\hat{b}(x)$. This requires testing each of three possible events at x, and the number of scenarios to consider at x is therefore constant [this is the main reason for the gain in time compared to the algorithm of Doyon et al. (2010), which requires adding a dimension to the table corresponding to all possible species at x]. Let $\beta \in \{0, 1\}$. We denote by $D(x, \beta)$ the minimum cost of a DLE-Reconciliation of the subtree of T rooted at x with S in which $\hat{b}(x) = \beta$ (or $\infty$ if no such reconciliation exists). Trivially, if x is a leaf of T, we have $D(x, \beta) = 0$ if $\beta = b(x)$, and $D(x, \beta) = \infty$ otherwise.
Assume now that x is an internal node of T. Let xl, xr be the children of x. For , let denote the number of vertices on the path between s1 and s2 in S, including s1 and s2. Then define
which counts the number of mandatory losses on the child branches of a node x of T.
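The loss count can be sketched in Python under an explicit assumption: the sketch uses the usual LCA-mapping convention (a speciation expects its children one level below, a duplication or EGTcopy expects them in the same species), which may differ in form from the definition used in this section. The child-to-parent dictionary and the function names are hypothetical.

```python
def path_vertices(parent, ancestor, descendant):
    """Number of vertices on the path from ancestor down to descendant, both included."""
    count = 1
    node = descendant
    while node != ancestor:
        node = parent[node]      # raises KeyError if ancestor is not above descendant
        count += 1
    return count


def mandatory_losses(parent, s_x, s_left, s_right, event):
    """Losses forced on the two child branches of a node mapped to species s_x.

    Assumed convention: for a speciation both end points of each path are
    expected (subtract 2); otherwise the children are expected in s_x itself
    (subtract 1).
    """
    expected = 2 if event == "Spe" else 1
    return sum(path_vertices(parent, s_x, s) - expected for s in (s_left, s_right))


# Species tree ((A,B)AB,C)ABC as a child -> parent map (toy example).
PARENT = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC"}

spe_losses = mandatory_losses(PARENT, "ABC", "A", "C", "Spe")
dup_losses = mandatory_losses(PARENT, "ABC", "AB", "ABC", "Dup")
print(spe_losses, dup_losses)
```

In the first call, a speciation at ABC with children in A and C forces one loss (the lineage leading to B). In the second, a duplication at ABC with one child mapped to AB forces one loss (that copy is missing from C).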
To compute $D(x, \beta)$, we use three auxiliary values, one for each possible event label $e_x \in \{Spe, Dup, EGTcopy\}$ of x (note that $e_x$ cannot be an EGTcut event, since x has two children).
If or , then . Assuming this check has been performed, we have
Put $D(x, \beta)$ to be the minimum of the three auxiliary values. The value of interest is $\min\{D(r(T), 0), D(r(T), 1)\}$.
Theorem 1.
For any $x \in V(T)$ and $\beta \in \{0, 1\}$, the value of $D(x, \beta)$, as defined above, is equal to the minimum cost of a DLE-Reconciliation of the subtree of T rooted at x with S satisfying $\hat{b}(x) = \beta$.
Moreover, the minimum cost of a reconciliation of T with S can be computed in time linear in the size of T.
Let us note that once the D table is computed, a standard backtracking procedure allows every optimal DLE-Reconciliation to be reconstructed.
5 Experimental results
We implemented the above dynamic programming procedure in Python in a software called EndoRex, which supports arbitrary costs as input and returns a reconciled gene tree in Newick format. The Python source can be accessed at https://github.com/AEVO-lab/EndoRex. We then performed a variety of experiments on a dataset obtained from Kannan et al. (2014), as described below.
5.1 Kannan et al. (2014) dataset
For the reconstruction of evolutionary histories with EGT events, we used a dataset from Kannan et al. (2014) available at ftp://ftp.ncbi.nih.gov/pub/koonin/MitoCOGs. The dataset consists of 140 MitoCOGs extended with paralogs and nuclear protein-coding homologs from 2486 eukaryotes with complete mitochondrial genomes. MitoCOGs are clusters of orthologous genes for mitochondrial-encoded proteins generated using COG construction (Makarova et al., 2007; Yutin et al., 2009). A full description of the MitoCOG generation procedure is given in Kannan et al. (2014). Among the 140 MitoCOGs, 73 correspond to protein-coding gene families, 49 are hypothetical proteins and 18 are clusters for which the protein function is identified but not the gene name. Among the 73 protein-coding MitoCOGs, 13 are core-mitochondrial proteins that are shared by most of the 2486 mitochondrial genomes. Statistics on the MitoCOGs of the Kannan et al. dataset are given in Table 1.
Table 1.
Statistics on the Kannan et al. (2014) dataset
| Gene set | Nb of MitoCOGs | Nb of species | Nb of genes |
|---|---|---|---|
| Mitochondrial-encoded | 140 | 2486 | 34 755 |
| Nuclear-encoded | 45 | 52 | 1317 |
| Whole set | 140 | 2486 | 36 072 |
Note: MitoCOGs were designed for mitochondrial-encoded genes, and nuclear-encoded genes were included later. This explains why all nuclear-encoded MitoCOGs, and the corresponding species, are included in the mitochondrial-encoded sets of MitoCOGs and species.
5.2 Dataset preprocessing
Among the 140 MitoCOGs of the initial Kannan et al. dataset, we first selected the 45 clusters involving nuclear-encoded protein sequences. Within these MitoCOGs, 52 eukaryotes are represented including 28 Opisthokonta (10 Fungi, 17 Metazoa and 1 Choanoflagellata), 9 Viridiplantae, 1 Rhodophyta, 1 Glaucophyta, 5 Alveolata, 1 Amoebozoa, 2 Euglenozoa, 1 Heterolobosea, 1 Rhizaria and 3 Stramenopiles. Based on Figure 1 in Kannan et al. (2014) and the analysis of the dataset, for the EGT evolutionary history inference with EndoRex, we selected the 11 plant species, including the 9 Viridiplantae, Cyanidioschyzon merolae (Rhodophyta) and Cyanophora paradoxa (Glaucophyta), as gene-content location is more diversified among this species group.
The 11 plant species are represented in 68 MitoCOGs with mitochondrial-encoded proteins and 41 MitoCOGs with nuclear-encoded proteins. We selected the clusters containing both mitochondrial-encoded and nuclear-encoded genes, yielding 28 MitoCOGs containing 326 protein-coding genes, including 184 encoded in the mitochondria and 142 in the nucleus. All the 28 MitoCOGs correspond to gene names that are present in the mitochondrial gene content review of Sloan et al. (2018).
Table 2 gives information about the 28 MitoCOGs of the 11 plants dataset specifying the gene name, the protein metabolic pathway and the number of genes and species for each MitoCOG.
Table 2.
Statistics on the 28 MitoCOGs of the 11 plants dataset
| MitoCOG ID | Gene name | Metabolic pathway | Nb of genes (mito + nuc) | Nb of species |
|---|---|---|---|---|
| MitoCOG0006 | nad3 | Complex I | 11 (10 + 1) | 11 |
| MitoCOG0007 | nad4L | Complex I | 13 (12 + 1) | 11 |
| MitoCOG0031 | nad7 | Complex I | 11 (9 + 2) | 11 |
| MitoCOG0043 | nad9 | Complex I | 11 (9 + 2) | 11 |
| MitoCOG0029 | nad10 | Complex I | 13 (1 + 12) | 10 |
| MitoCOG0052 | sdh2 | Complex II | 22 (1 + 21) | 10 |
| MitoCOG0051 | sdh3 | Complex II | 8 (3 + 5) | 6 |
| MitoCOG0075 | sdh4 | Complex II | 9 (4 + 5) | 9 |
| MitoCOG0003 | cox2 | Complex IV | 13 (10 + 3) | 11 |
| MitoCOG0005 | cox3 | Complex IV | 13 (10 + 3) | 11 |
| MitoCOG0059 | atp1 | Complex V | 9 (7 + 2) | 8 |
| MitoCOG0076 | atp4 | Complex V | 12 (11 + 1) | 10 |
| MitoCOG0004 | atp6 | Complex V | 13 (12 + 1) | 11 |
| MitoCOG0014 | atp9 | Complex V | 13 (10 + 3) | 11 |
| MitoCOG0027 | rpl2 | Translation | 14 (5 + 9) | 10 |
| MitoCOG0053 | rpl6 | Translation | 10 (4 + 6) | 8 |
| MitoCOG0092 | rpl10 | Translation | 5 (2 + 3) | 5 |
| MitoCOG0048 | rpl14 | Translation | 15 (5 + 10) | 11 |
| MitoCOG0039 | rpl16 | Translation | 12 (8 + 4) | 11 |
| MitoCOG0070 | rpl20 | Translation | 11 (2 + 9) | 8 |
| MitoCOG0080 | rps2 | Translation | 9 (5 + 4) | 9 |
| MitoCOG0067 | rps4 | Translation | 8 (7 + 1) | 7 |
| MitoCOG0061 | rps7 | Translation | 12 (8 + 4) | 11 |
| MitoCOG0072 | rps10 | Translation | 12 (3 + 9) | 8 |
| MitoCOG0054 | rps11 | Translation | 12 (6 + 6) | 10 |
| MitoCOG0064 | rps13 | Translation | 10 (7 + 3) | 10 |
| MitoCOG0055 | rps14 | Translation | 9 (5 + 4) | 8 |
| MitoCOG0026 | rps19 | Translation | 16 (8 + 8) | 8 |
Note: In the ‘Nb of genes’ column, the numbers of mitochondrion-encoded (mito) and nucleus-encoded (nuc) genes are specified.
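As a consistency check, the per-MitoCOG counts of Table 2 can be summed and compared with the totals reported in the text (28 MitoCOGs, 326 genes, 184 mitochondrial and 142 nuclear). The short Python sketch below transcribes the gene counts from the table; the variable names are chosen here for illustration.

```python
# (gene name, listed total, mitochondrion-encoded, nucleus-encoded), from Table 2
TABLE2 = [
    ("nad3", 11, 10, 1), ("nad4L", 13, 12, 1), ("nad7", 11, 9, 2),
    ("nad9", 11, 9, 2), ("nad10", 13, 1, 12), ("sdh2", 22, 1, 21),
    ("sdh3", 8, 3, 5), ("sdh4", 9, 4, 5), ("cox2", 13, 10, 3),
    ("cox3", 13, 10, 3), ("atp1", 9, 7, 2), ("atp4", 12, 11, 1),
    ("atp6", 13, 12, 1), ("atp9", 13, 10, 3), ("rpl2", 14, 5, 9),
    ("rpl6", 10, 4, 6), ("rpl10", 5, 2, 3), ("rpl14", 15, 5, 10),
    ("rpl16", 12, 8, 4), ("rpl20", 11, 2, 9), ("rps2", 9, 5, 4),
    ("rps4", 8, 7, 1), ("rps7", 12, 8, 4), ("rps10", 12, 3, 9),
    ("rps11", 12, 6, 6), ("rps13", 10, 7, 3), ("rps14", 9, 5, 4),
    ("rps19", 16, 8, 8),
]

# Rows whose listed total does not equal mito + nuc (expected to be empty).
inconsistent_rows = [g for g, total, mito, nuc in TABLE2 if total != mito + nuc]

mito_total = sum(mito for _, _, mito, _ in TABLE2)
nuc_total = sum(nuc for _, _, _, nuc in TABLE2)

print(len(TABLE2), mito_total, nuc_total, mito_total + nuc_total, inconsistent_rows)
```

The sums agree with the figures given in Section 5.2.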
For each MitoCOG, we applied a pipeline to infer the evolutionary history of EGTs with DLE-Reconciliation along the species tree of the 11 plants. The topology of the species tree was taken from Kannan et al. (2014). We added the species Micromonas sp. RCC299 as the sister species of Ostreococcus tauri, as these are the only 2 of the 11 plant species that belong to the Mamiellophyceae class. We also swapped the positions of P. patens and S. moellendorffii according to Puttick et al. (2018) (Fig. 3).
Fig. 3.

Species tree of the 11 plants considered in our experimental analysis. The topology of the tree is based on Kannan et al. (2014)
To construct the gene trees, the first step of the pipeline was to align the protein sequences with MUSCLE (Edgar, 2004). In the second step, a maximum likelihood protein tree was inferred using RAxML (v8.2.4) with the PROTGAMMAGTRX evolutionary model (Stamatakis et al., 2014). NOTUNG (v.2.9.1.5) was then used to root the trees by minimizing the cost of a duplication-loss reconciliation with default parameters (loss cost: 1.0 and duplication cost: 1.5) (Stolzer et al., 2012).
The rooted protein trees obtained with this pipeline and the species tree of the 11 plants were given as input to the EndoRex software to infer a most parsimonious DLE-Reconciliation allowing for arbitrary costs for duplications, losses and EGTs.
5.3 EndoRex evolutionary events cost setting
As a reminder, we consider six parameters corresponding to the different evolutionary event costs: δ and λ the cost of, respectively, a gene duplication and loss; ρ0 (respectively τ0) the cost of an EGTcut (respectively EGTcopy) from the mitochondrial genome to the nuclear genome, and ρ1 (respectively τ1) the cost of an EGTcut (respectively EGTcopy) from the nuclear genome to the mitochondrial genome.
We test five different cost settings for the application of EndoRex on the 11 plants dataset. The setting S1 corresponds to the default values for parameters, with a unitary cost for evolutionary events (allowing to compute the DLE-Distance). For setting S2, the gene loss and duplication costs are those used in NOTUNG for rooting the protein trees, and EGTcopy and EGTcut costs are set higher to reflect the fact that these evolutionary events are less frequent than gene duplications: and . In setting S3, we consider EGTcopy as less likely than EGTcut: and . For setting S4, we differentiate the cost of the mitochondria to the nucleus from the nucleus to the mitochondria gene move, and account for the fact that, during the evolution of eukaryotes, mitochondrial genes are integrated into the nuclear genome, while the reverse is extremely rare: and . Finally, setting S5 is the same as setting S4 except we make no difference between the costs of EGTcopy and EGTcut events: and .
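The six cost parameters can be grouped into one small structure. The Python sketch below is illustrative only and is not taken from the EndoRex code base: the class name `CostScheme`, its field names and the `egt_cost` helper are assumptions made here. Only the unitary setting S1 is reproduced. The second scheme reuses the NOTUNG duplication and loss costs quoted earlier (1.5 and 1.0), but its four EGT costs are arbitrary placeholder values, not those of any setting used in the experiments.

```python
from dataclasses import dataclass

MITO, NUC = 0, 1  # location encoding: 0 = mitochondrion, 1 = nucleus


@dataclass(frozen=True)
class CostScheme:
    """Hypothetical container for the six DLE event costs."""
    dup: float = 1.0    # delta: gene duplication
    loss: float = 1.0   # lambda: gene loss
    rho0: float = 1.0   # EGTcut, mitochondrion -> nucleus
    tau0: float = 1.0   # EGTcopy, mitochondrion -> nucleus
    rho1: float = 1.0   # EGTcut, nucleus -> mitochondrion
    tau1: float = 1.0   # EGTcopy, nucleus -> mitochondrion

    def egt_cost(self, kind: str, source: int) -> float:
        """Cost of an EGT of the given kind ('cut' or 'copy') leaving `source`."""
        table = {
            ("cut", MITO): self.rho0,
            ("copy", MITO): self.tau0,
            ("cut", NUC): self.rho1,
            ("copy", NUC): self.tau1,
        }
        return table[(kind, source)]


# Setting S1: unitary cost for every event.
S1 = CostScheme()

# Placeholder scheme: dup/loss are the NOTUNG defaults quoted in the text;
# the four EGT costs are arbitrary illustration values (assumption).
EXAMPLE = CostScheme(dup=1.5, loss=1.0, rho0=2.0, tau0=3.0, rho1=4.0, tau1=5.0)
```

Indexing the EGT costs by source compartment mirrors the subscripts 0 and 1 of ρ and τ, so a direction-dependent setting only changes the stored values, not the lookup.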
Applied to the 28 MitoCOGs trees, EndoRex infers the same DLE-Reconciliation with the five different settings for 21 of the 28 MitoCOGs.
Each of the seven MitoCOGs with more than one inferred DLE-Reconciliation, depending on the considered setting, yields exactly two different DLE-Reconciliations: for MitoCOG0014, MitoCOG0051 and MitoCOG0053, setting S1 gives a DLE-Reconciliation different from the other settings; for MitoCOG0027, it is setting S3 that gives a different DLE-Reconciliation; for MitoCOG0005 and MitoCOG0039, it is setting S4; and finally, for MitoCOG0072, settings S4 and S5 give a DLE-Reconciliation different from S1, S2 and S3. We analyzed the two DLE-Reconciliations of MitoCOG0014 (atp9), MitoCOG0027 (rpl2), MitoCOG0039 (rpl16) and MitoCOG0072 (rps10) to illustrate the effect of the cost settings (see Fig. 4).
Fig. 4.
DLE-Reconciliations obtained for MitoCOG0014, MitoCOG0027, MitoCOG0039 and MitoCOG0072 with the EndoRex cost settings S1, S2, S3, S4 and S5. The blue part of the tree indicates that the genetic material is located in the mitochondrion, while the red part indicates location in the nucleus. The shape of an internal node represents its associated event, as represented in Figure 1 (circle for a speciation, rectangle for a duplication and triangle for an EGT event). Loss events are not represented. Genes are formatted as follows: [species name]__[gene-encoding location]__[gene id], where 0 indicates a location in the mitochondrion and 1 indicates a location in the nucleus.
According to these case studies, setting S1 seems inappropriate, as it leads to the prediction of a higher number of EGTs, which are rare evolutionary events (see MitoCOG0014 in Fig. 4, and MitoCOG0051 and MitoCOG0053 in Appendix Fig. A1). For MitoCOG0027, setting S3 leads to the prediction of numerous EGTs from the nucleus to the mitochondria, which is very unrealistic, as very few gene movements from the nucleus to the mitochondria have been described in the literature. DLE-Reconciliations predicted with setting S4 are the scenarios most in line with the literature, as this setting only infers EGTs from the mitochondria to the nucleus (except for MitoCOG0072), with transpositions located close to the leaves of the tree, indicating an ongoing process of endosymbiotic gene transfer in plants for these gene families (see MitoCOG0039 and MitoCOG0072 in Fig. 4, and MitoCOG0005 in Appendix Fig. A1).
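The leaf label format used in the figure can be decoded mechanically. The following Python sketch is a minimal parser written for illustration; the function name `parse_gene_label`, the `GeneLabel` type and the example label are hypothetical and are not part of EndoRex.

```python
from typing import NamedTuple

# 0/1 location codes as described in the figure caption.
LOCATIONS = {"0": "mitochondrion", "1": "nucleus"}


class GeneLabel(NamedTuple):
    species: str
    location: str  # "mitochondrion" or "nucleus"
    gene_id: str


def parse_gene_label(label: str) -> GeneLabel:
    """Split '[species]__[location]__[gene id]' into its three fields."""
    # maxsplit=2 lets the gene id itself contain a double underscore.
    parts = label.split("__", 2)
    if len(parts) != 3:
        raise ValueError(f"expected three '__'-separated fields: {label!r}")
    species, code, gene_id = parts
    if code not in LOCATIONS:
        raise ValueError(f"location must be 0 or 1, got {code!r}")
    return GeneLabel(species, LOCATIONS[code], gene_id)


# Hypothetical label, for illustration only.
example = parse_gene_label("Arabidopsis_thaliana__1__gene42")
```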
6 Conclusion
Investigating the origin, evolution and characteristics of gene coding capacity of eukaryotes has been among the central themes in the Life Sciences. In this context, the endosymbiotic origin of mitochondrial genomes and the gradual integration of the mitochondrial gene content to the nucleus are important evolutionary parameters expected to shed light on features of eukaryotic gene evolution and function.
From a computational point of view, detecting the footprint of endosymbiosis in the gene repertoires of the mitochondrial and nuclear genomes of eukaryotes requires new evolutionary prediction methods. This article is a first effort toward developing the appropriate algorithmic tools for analyzing the movement of genes inside a gene family between the mitochondrial and nuclear genome of the same species. We presented a linear-time algorithm computing a most parsimonious history of Duplication, Loss and EGT (DLE) events explaining a gene tree with leaves identified as mitochondrial or nuclear genes. We also presented a general dynamic programming algorithm, implemented in the EndoRex software, to compute all optimal DLE-Reconciliations for any arbitrary cost scheme of operations.
By applying EndoRex to a plant dataset, we showed that it is well-designed to infer the evolutionary histories of EGT events, considering a variety of cost settings. Some reconciled trees (not shown) of the 11-plant dataset produced evolutionary histories that could be considered unrealistic, as they lead to an unexpectedly high number of gene duplications and losses. As our algorithm is exact and thus guaranteed to infer the minimum number of events given a gene tree, this is likely due to errors in protein sequence alignment and/or gene tree inference, leading to erroneous gene trees (Hahn, 2007). A better gene tree inference pipeline should be designed in the future to get more accurate gene trees. In particular, gene trees have been rooted according to the DL-distance, relying on the default NOTUNG parameters. Instead, we could have rooted the trees according to our DLE model, with the five considered cost settings. In addition, the obtained RAxML binary gene trees contain many weakly supported edges. Those edges may be contracted, and a polytomy resolution tool such as PolytomySolver (Lafond et al., 2016) may be used to better resolve multifurcations. On the other hand, simulation studies should also be conducted in the future to better evaluate the quality of the obtained solutions.
In fact, our method relies on a deterministic parsimony approach to compute all optimal DLE-Reconciliations given a cost scheme for DLE events. This model has many limitations. In particular, parsimony does not allow modeling multiple state changes along a branch of the phylogeny, or uncertainty in phylogenetic reconstructions. An alternative is to rely on approaches using stochastic state mapping models such as the mutational mapping approach (Bollback, 2006; Huelsenbeck et al., 2003). Since our method outputs all optimal DLE-Reconciliations, it can also be used to compute the probabilities of all possible events over all optimal solutions.
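As an illustration of the edge-contraction step mentioned above, the Python sketch below collapses every internal gene tree node whose support falls below a threshold, attaching its children to its parent. The tree encoding (a `children` dictionary plus a `support` dictionary), the function name and the threshold are assumptions made for this example; this is not the procedure of PolytomySolver, which would be applied afterwards to resolve the resulting multifurcations.

```python
def contract_weak_edges(children, support, root, threshold):
    """Return a new children map in which weakly supported internal nodes
    (support < threshold) are removed and their children lifted to the parent.

    children: dict node -> list of child nodes (leaves may be absent)
    support:  dict internal node -> support value; nodes without a value are kept
    """
    new_children = {}

    def lifted(v):
        result = []
        for c in children.get(v, []):
            is_internal = bool(children.get(c))
            if is_internal and support.get(c, float("inf")) < threshold:
                # Contract the edge (v, c): c disappears, its children move up.
                result.extend(lifted(c))
            else:
                result.append(c)
                if is_internal:
                    new_children[c] = lifted(c)
        return result

    new_children[root] = lifted(root)
    return new_children


# Toy gene tree: ((a, b)y, d)x and c under the root; y is weakly supported.
toy_children = {"root": ["x", "c"], "x": ["y", "d"], "y": ["a", "b"]}
toy_support = {"x": 90, "y": 30}
contracted = contract_weak_edges(toy_children, toy_support, "root", threshold=50)
```

In the toy tree, node `y` is removed and `x` becomes a multifurcation with children `a`, `b` and `d`.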
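One simple way to summarize a set of optimal solutions is to count, for each gene tree node, how often each event label is assigned to it. The Python sketch below assumes that each optimal reconciliation is given as a dictionary from gene tree node identifiers to event labels and that all optimal solutions are weighted equally; both the data layout and the uniform weighting are assumptions of this example, not a description of the EndoRex output format.

```python
from collections import Counter, defaultdict


def event_frequencies(solutions):
    """Fraction of optimal solutions assigning each event label to each node.

    solutions: non-empty list of dicts, node id -> event label
               (for example 'Spe', 'Dup' or 'EGTcopy').
    """
    if not solutions:
        raise ValueError("at least one solution is required")
    counts = defaultdict(Counter)
    for solution in solutions:
        for node, event in solution.items():
            counts[node][event] += 1
    total = len(solutions)
    return {
        node: {event: n / total for event, n in counter.items()}
        for node, counter in counts.items()
    }


# Two hypothetical optimal solutions that disagree on node 'x'.
freqs = event_frequencies([
    {"root": "Spe", "x": "Dup"},
    {"root": "Spe", "x": "EGTcopy"},
])
```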
Future algorithmic extensions of the optimization problem considered in this article may concern extending the model to account for both EGT and HGT events, toward inferring a Duplication, HGT, loss and EGT (DTLE) evolutionary scenario for a gene family. Another direction would be to infer common episodes of EGT events for a set of gene families. This may be handled by generalizing the Super-Reconciliation (Delabre et al., 2020) model to account for segmental DLE events.
Future developments will define an EGT simulation model to provide EGT evolutionary histories for assessing the accuracy of our algorithm. Some efforts have been made toward EGT simulation models. Brandvain and Wade (2009) provide a model to explore the influence of population-genetic parameters (such as selection, dominance, mutation rates, population size and rate of self-fertilization) on the rate and probability of functional gene transfer from the mitochondrial genome (haploid) to the nuclear genome (diploid). Kelly (2020) defines an EGT simulation model based on the ATP biosynthesis cost of encoding a mitochondrial/chloroplast gene in the nuclear genome and importing the resulting protein into the organelle. These prior works provide useful insights for designing a model for the simulation of EGT evolutionary histories that would be strongly inspired by existing models for the simulation of HGT evolutionary histories.
Future applications will also concern a thorough analysis of protein-coding genes involved in common metabolic pathways. As an example, oxidative phosphorylation (OXPHOS) relies on a series of protein complexes (I, II, III, IV and V), leading to an electrochemical proton gradient that activates the ATP synthase (complex V), which produces ATP. The protein-coding genes involved in OXPHOS are expected to share common mitochondrial-nuclear movements, as the nucleus and mitochondria are two compartments with different biological dynamics.
Finally, the recent sequencing effort conducted toward the genomes of jakobid and malawimonad protists, known to have emerged close to the eukaryotic origin, will provide a valuable dataset that can be analyzed with the newly developed algorithms, helping to shed light on a number of important biological questions, among them resolving the root of the eukaryote tree. In fact, as EGTs are rare events, candidate topologies for which DLE-Reconciliations infer the lowest number of EGT events may provide evidence for a correct rooting.
Financial Support: Natural Sciences and Engineering Research Council of Canada; Fonds de recherche Nature et Technologie, Québec.
Conflict of Interest: none declared.
Acknowledgements
The authors thank B. Franz Lang (Biochemistry Department, University of Montreal) for his insights and clever advice on the algorithmic needs and open questions regarding eukaryotes’ evolution.
Appendix A
Proof of Lemma 1
Let be a DLE-Reconciliation of minimum cost between T and S. Let λ be the cost of a loss event. Let us first make an observation. Let and let , assuming that l exists. Let be the path from v to l in R. It is easy to see from the definition of reconciliation that is a path of S, but with some vertices possibly being repeated (i.e. is possible, but otherwise is a child of ). It follows that must be an ancestor of s(l). Since v and l were chosen arbitrarily, we have that for any is an ancestor of s(l) for every leaf .
Now suppose that, for some . Moreover, choose x as a lowest node of with this property (i.e. for all descendants of x in R). Note that x is an internal node of T since for every leaf x of T.
As we argued, is an ancestor of s(l) for every leaf . Since , it follows that is a strict ancestor of . We first argue that x cannot be a speciation. Assume this is the case and let be the children of x in R (but not necessarily in T). We use xl and xr to denote the children of x in T. By the definition of speciation, and are the two children of . Because is a strict ancestor of , only one of or has descendants in . Assume without loss of generality that only has such descendants. But then, is not an ancestor of any member of . In particular, is not an ancestor of any member of , and the latter is easily seen to be non-empty (this is because is an ancestor of xr and has leaves from T). As we argued before, this is not possible, since there should be a path from to any s(l) with .
Assume that x is a duplication or EGTcopy event (x cannot be an EGTcut event because it is binary). As before, let xl and xr be the children of x in T (but not necessarily in R). By the choice of x, and . Thus must be a strict ancestor of both and . Let be the child of that is on the path from to . We obtain an alternate reconciliation by modifying R to obtain another extension of T. We do not change any event labeling. We map x to and graft a loss in on the edge between x and its parent in R (if any). In that manner, the parent of x in R still has a child mapped to in . This increases the cost by λ, the cost of one loss.
Now let be the nodes on the path from x to xl in R (excluding x and xl). Note that since x is a duplication or EGTcopy, . Moreover, at most one node among can be an EGTcopy or an EGTcut, since there is no point in making more than one switch within an edge.
If present, we may assume without loss of generality that such an event occurs at xk, the parent of xl in R, since the timing of the switch does not affect the reconciliation cost. In this case, . On the other hand, . This implies that , and thus x1 is not an EGTcopy or an EGTcut. It follows that x1 is a node inserted because of a grafted loss, and . In , we can remove x1 and its loss leaf, and by doing so, the left child of x becomes x2. This preserves all properties of a valid reconciliation because both x and x2 are mapped to . We can apply the same procedure on the path from x to xr.
In , we have created one loss above x, but have removed two losses on both sides of x. No other event labeling has changed. Since we assume that losses have a non-zero cost, has a strictly lower cost than R, a contradiction.
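Lemma 1 is invoked later in the appendix to state that minimum-cost reconciliations use the LCA-mapping. As a reminder of what this mapping computes, the Python sketch below maps every gene tree node to the lowest common ancestor, in the species tree, of the species of its leaves. It is a naive illustration that climbs parent pointers and is not the linear-time procedure referred to in the article; the dictionary-based tree encoding and all function names are assumptions.

```python
def species_depths(parent):
    """Depth of every species tree node; parent maps node -> parent (root -> None)."""
    depth = {}

    def depth_of(v):
        if v not in depth:
            p = parent[v]
            depth[v] = 0 if p is None else depth_of(p) + 1
        return depth[v]

    for v in parent:
        depth_of(v)
    return depth


def lca(u, v, parent, depth):
    """Lowest common ancestor of u and v, found by climbing parent pointers."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u


def lca_mapping(children, root, leaf_species, parent):
    """Map each gene tree node to the LCA of the species of its leaves."""
    depth = species_depths(parent)
    mapping = {}

    def visit(x):
        kids = children.get(x, [])
        if not kids:
            mapping[x] = leaf_species[x]
            return
        for k in kids:
            visit(k)
        target = mapping[kids[0]]
        for k in kids[1:]:
            target = lca(target, mapping[k], parent, depth)
        mapping[x] = target

    visit(root)
    return mapping


# Toy species tree ((A, B)AB, C)R and gene tree ((a1, a2)g2, b1)g1 with c1 under g0.
sp_parent = {"R": None, "AB": "R", "C": "R", "A": "AB", "B": "AB"}
g_children = {"g0": ["g1", "c1"], "g1": ["g2", "b1"], "g2": ["a1", "a2"]}
g_species = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
s = lca_mapping(g_children, "g0", g_species, sp_parent)
```

In the toy example, `g2` has both leaves in species `A` and is therefore mapped to `A`, the same species as its children, which is the signature of a duplication in a DL-reconciliation.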
Proof of Lemma 2
We first show that the reconciliation obtained from Algorithm 1 is a valid DLE-Reconciliation. Note that the tree R returned by the algorithm is the same as RDL, but with some grafted unary nodes for EGTcut events where needed. Consider some . In R, we put if , and if . If no additional node was grafted as a new child of x, all properties of reconciliation would be preserved since we keep as in . If some node was grafted as a new child of x, we ensure that is the same as the previous child of x, which ensures that we satisfy the properties of reconciliation. Therefore, we only need to check whether the tree RDL is modified in an appropriate way in the case of a different value for a node x of T and one of its two children xl or xr.
Lines 2–8 first ensure that the starting tree R is such that, for each node x of T, , and for any edge (x, y) in T such that , the corresponding path on R is such that for all i, . Subsequently, in the case of a different value for a node x of T and its child y, the node x is either modified to an EGTcopy node, ensuring that the switch between and is correctly explained by this EGTcopy, or a new EGTcut node v is grafted on the edge , also correctly explaining the switch between and .
We now show that the DLE-Reconciliation output by Algorithm 1 is of minimum cost. First note that, from the initialization done in Line 8, for each leaf x which is on RDL but not in T (lost gene), the algorithm ensures that where px is x’s parent. Thus, grafted loss leaves never require an extra EGTcopy event on an ‘inserted edge’ of RDL.
Assume another reconciliation has a strictly lower cost than output by Algorithm 1. We first show that, for any node of T, the corresponding node in R and have the same event label. Assume this is not the case. Let x be the lowest node of T such that . Let xl and xr be its two children in T and vl and vr be the two non-unary descendants of x in the closest to x. Note that xl and xr do not necessarily correspond to vl and vr in . Rather, they may be strict descendants of these nodes in .
1. If , then from Algorithm 1, if and , and otherwise. As , we should have in the first case, or in the second case.
Assume . From Lemma 1, as is a reconciliation of minimum cost, , and as x is a speciation node in , one of vl and vr should be mapped to and the other to . Assume w.l.o.g. that and . Now, as x is a duplication node in RDL, then or . Assume w.l.o.g. that . As xl is a node of the subtree of rooted at vl, by definition of a reconciliation, should be a descendant of , which is not the case as is rather a strict descendant of . Therefore, x cannot be a speciation node in . We deduce that .
Now assume that or . In this case, the algorithm puts and, as x is not a speciation, it should be a duplication node in . But then a unary EGTcut node v should be present in one of the two paths from x to xl or from x to xr in , contradicting the fact that is a reconciliation of minimum cost, since labeling x as an EGTcopy node and removing v would reduce the cost of the reconciliation by one.
Finally, assume that and . In this case, the algorithm puts and, as x is not a speciation, it should be an EGTcopy node in , which induces, by definition of an EGTcopy event, that one of the two children y of x in is such that . Now, as , one unary EGTcut node v should change the labeling of y to the labeling of its descendant in . But then relabeling x as a duplication node would allow removing v and thus reducing the cost of the reconciliation by one, contradicting the fact that is a reconciliation of minimum cost.
2. If , then from the properties of a DL-Reconciliation, we should have and . From Algorithm 1, x remains a speciation node in .
As , we should have or . In both cases, . This implies that and , and thus vl and vr are grafted because of losses. Since uses the LCA-mapping by Lemma 1, we can remove vl, vr and their corresponding grafted loss leaves and make x a speciation, while preserving a valid reconciliation. This saves a cost of three (two losses and a Dup or EGTcopy event). In the worst case, we had , in which case we can add an EGTcut event on the appropriate branch to enforce the same switch.
Thus replacing the Dup or EGTcopy label of x by a speciation reduces the cost of by at least two, contradicting the fact that is a reconciliation of minimum cost.
Since we have the same number of Dup and EGTcopy events as , it remains to show that we cannot graft fewer nodes than those induced by Algorithm 1. The grafted nodes are either binary nodes corresponding to losses, or EGTcut unary nodes. Suppose has fewer grafted nodes than R. Then there is an edge (x, y) in T such that the corresponding path in is shorter than the corresponding path in R. We consider a lowest edge (x, y) of T verifying this condition, and we assume, without loss of generality, that y = xl. Recall that by Lemma 1, and .
If , then x is a duplication or an EGTcopy node in both R and . Then, by definition of a reconciliation, . Moreover, from the fact that R is obtained from RDL, Algorithm 1 leads to a path with as many nodes as the path from to in S if x is a duplication node, and an additional EGTcut node if . Moreover, it is easy to see that the number of losses grafted on (x, y) must be equal to the number of nodes on the path from and , excluding , either in R or , and that the EGTcut event added by the algorithm cannot be avoided. And thus, the path should be at least as long as , contradicting the hypothesis that is shorter than .
If , then x is a speciation node in both R and . Then, by definition of a reconciliation, . Thus, from the fact that R is obtained from RDL, Algorithm 1 leads to a path with as many nodes as the path from to in S, with an additional EGTcut node if . Moreover, it is easy to see that no other operation (Spe, Dup, EGTcopy or EGTcut) can allow making fewer losses or avoiding the EGTcut event. And thus, the path should be at least as long as , contradicting the hypothesis that is shorter than .
Proof of Theorem 1
Let us first argue on the complexity of computing for every and every (including and , our values of interest). The LCA-mapping can be computed in time using classical approaches from DL-reconciliation. We can compute and for every in a post-order traversal of T (because their values only depend on xl and xr), and thus there are values to compute. If we assume that we have access to lx for each x, it is clear from the recurrences that and can be computed in O(1) time. To access lx in time O(1) for any x, we can preprocess S by labeling each by its depth (i.e. its distance to the root). Then, is simply the difference in depth between and (because must be a descendant of ). This difference can be obtained in constant time, and it follows that lx can be obtained in O(1). Therefore, each entry takes O(1) time to compute. Including the time to compute the preprocessing and the LCA-mapping, the total time of the algorithm is .
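The depth-labeling argument can be sketched in a few lines of Python. One traversal of the species tree stores the depth of every node, after which the length of the path between a node and one of its descendants is a single subtraction. The tree encoding and the function names `node_depths` and `path_length` are assumptions made for this illustration.

```python
from collections import deque


def node_depths(children, root):
    """Label every node of the tree with its distance to the root (one traversal)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for c in children.get(v, []):
            depth[c] = depth[v] + 1
            queue.append(c)
    return depth


def path_length(depth, ancestor, descendant):
    """Number of edges between a node and one of its descendants, in O(1).

    The caller must guarantee that `ancestor` really is an ancestor of
    `descendant`; only the depths are compared here.
    """
    diff = depth[descendant] - depth[ancestor]
    if diff < 0:
        raise ValueError("descendant is shallower than ancestor")
    return diff


# Toy species tree ((A, B)AB, C)R.
toy_species = {"R": ["AB", "C"], "AB": ["A", "B"]}
toy_depth = node_depths(toy_species, "R")
```

The ancestor relation itself is guaranteed by the LCA-mapping in the setting of the proof, which is why the sketch only checks the sign of the difference.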
Let us now argue that the algorithm is correct. Let , let , and let be a DLE-Reconciliation of minimum cost between and S that satisfies . The proof is by induction on the height of . If x is a leaf, it is easy to see that is correct. Assume that x is an internal node with children xl and xr. We may inductively assume that and are computed correctly for .
In what follows, let be the reconciliation between and S obtained by taking , and restricting and to . Similarly, let be the reconciliation of with S obtained by taking and restricting and to .
We show two useful claims, the first being that these sub-reconciliations must be optimal with respect to their subtrees.
Claim 1.1. and .
Proof. By induction and by the definition of D, we have . Moreover, in we may replace the subtree by (more precisely, replace by Rl, and use and for the vertices of Rl). Since and , all conditions of a valid reconciliation are met after such a replacement. Furthermore, no additional loss, EGTcopy or EGTcut is required on the path between x to xl. If held, this transformation would yield a lower cost reconciliation and contradict the optimality of . Therefore, . It follows that . By a symmetric argument, . □
Claim 1.2. If , then there are at least losses grafted on the and branches, and otherwise, there are at least such grafted losses.
Proof. If , in R there must be a loss grafted on the (respectively ) branch for each node of (respectively ), excluding and (respectively ). The number of such losses is and induce a cost of . If , the required losses are the same, except that we do not exclude x from both paths, and thus losses are required for a cost of . □
We now argue that . First assume that . We then consider the four possible labelings of xl and xr.
• If , then no cost other than the losses is required on the and branches. Thus using claims 1.1 and 1.2,
Since for both adds the losses, plus the minimum of and for each child , we see that .
• If and , then no additional cost is required on the branch, but a switch is required on . The minimum possible cost of such a switch is , and thus using the two claims as the previous case (we omit the step replacing by and by , which is implicit by claim 1.1), if , we have
and if , we have
Again, the above expressions are considered by the minimization of , and so .
• If and , this case is symmetric to the previous one.
• If and , then a switch with host bx is needed on both branches and . Thus, if , we have
and if , we have
Again, these are considered in , and we get .
In all cases, . It remains to show that this holds for . In this case, a cost of must be counted for the x node, plus the cost for losses by claim 1.2. Next, we consider all values of and .
• If , then as we argued
The latter expression is among the expressions that minimizes and thus .
• If , then since x is an EGTcopy event, one of the or branches must switch to , then switch back to bx, implying an EGTcut from to bx of cost . In this situation,
which is considered among the expressions minimized by . Again, .
• If , then one of the or branches stays in bx, and thus must switch to for a cost of . In this situation,
which is considered among the expressions minimized by . Again, .
In every possible case, .
We must now prove the complementary bound, i.e. that . Let such that . If e = Spe, the expression corresponds to making x a speciation (which is possible since we check that neither of nor holds) and adding the minimum number of mandatory losses on and . Let that minimizes , and define br for xr analogously. Thus consider the reconciliation in which x is a speciation, on which we graft the mandatory losses on and and then, for each of bl or br that differs from bx, add an EGTcut on the corresponding branch. Then, for subtree, take an optimal reconciliation for and for the subtree, take the optimal reconciliation for . By induction, and are of costs and respectively. Since all optimal reconciliations use the LCA-mapping, such a reconciliation is valid and its cost is as defined in . It follows that (the latter inequality owing to the optimality of ).
If e = Dup, the argument is exactly the same, except that to construct , we make x a duplication and add losses instead.
Finally, assume that e = EGTcopy. It is not hard to see that each expression that may choose when minimizing corresponds to a valid reconciliation. Indeed, consider the reconciliation where for a cost of . We add mandatory losses on the and branches. Then, the first two cases of the minimization in correspond to having no additional switch needed, and hence we can use the optimal reconciliation for and . The third case corresponds to having both xl and xr mapped to bx, in which case we can choose to apply the EGTcopy on , but need to switch back for a cost of . The last case corresponds to having both xl and xr mapped to , in which case the EGTcopy applies one switch, and we add an EGTcut for the other switch of cost .
Since each possible case represents the cost of a valid reconciliation , we get . Thus for every possible value of e, we have .
To conclude, the two complementary bounds show that . □
Fig. A1.
DLE-Reconciliations obtained for MitoCOG0005, MitoCOG0051 and MitoCOG0053 with the EndoRex cost settings S1, S2, S3, S4 and S5. The blue part of the tree indicates that the genetic material is located in the mitochondrion, while the red part indicates location in the nucleus. The shape of an internal node represents its associated event, as represented in Figure 1 (circle for a speciation, rectangle for a duplication and triangle for an EGT event). Loss events are not represented. Genes are formatted as follows: [species name]__[gene-encoding location]__[gene id], where 0 indicates a location in the mitochondrion and 1 indicates a location in the nucleus.
References
- Adams K.L., Palmer J.D. (2003) Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Mol. Phylogenet. Evol. Plant Mol. Evol., 29, 380–395.
- Akerborg O. et al. (2009) Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc. Natl. Acad. Sci. USA, 106, 5714–5719.
- Bansal M.S. et al. (2012) Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics, 28, i283–291.
- Bollback J.P. (2006) SIMMAP: stochastic character mapping of discrete traits on phylogenies. BMC Bioinformatics, 7, 88–87.
- Brandvain Y., Wade M.J. (2009) The functional transfer of genes from the mitochondria to the nucleus: the effects of selection, mutation, population size and rate of self-fertilization. Genetics, 182, 1129–1139.
- Charleston M.A., Perkins S.L. (2006) Traversing the tangle: algorithms and applications for cophylogenetic studies. J. Biomed. Inf., 39, 62–71.
- Chen K. et al. (2000) NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol., 7, e429–e447.
- Derelle R. et al. (2015) Bacterial proteins pinpoint a single eukaryotic root. Proc. Natl. Acad. Sci. USA, 112, E693–E699.
- Delabre M. et al. (2020) Evolution through segmental duplications and losses: a Super-Reconciliation approach. Algorithms. Mol. Biol., 15, 12.
- Doyon J.P. et al. (2010) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. In: Lecture notes in computer science, Proceedings of RECOMB International Workshop on Comparative Genomics, vol. 6398, pp. 93–108.
- Dyall S.D., Johnson P.J. (2000) Origins of hydrogenosomes and mitochondria: evolution and organelle biogenesis. Curr. Opin. Microbiol., 3, 404–411.
- Edgar R. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
- El-Mabrouk N., Noutahi E. (2019) Gene family evolution-an algorithmic framework. In: Warnow T. (ed.) Bioinformatics and Phylogenetics. Computational Biology, vol 29., Springer International Publishing, pp. 87–119.
- Fitch W.A. (1971) Minimum change for a specific tree topology. Syst. Biol., 20, 406–416.
- Goodman M. et al. (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool., 28, 132–163.
- Gray M.W. et al. (2020) The draft nuclear genome sequence and predicted mitochondrial proteome of Andalucia godoyi, a protist with the most gene-rich and bacteria-like mitochondrial genome. BMC Biol., 18, 22.
- Hahn M.W. (2007) Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biology, 8, R141.
- Huelsenbeck J.P. et al. (2003) Stochastic mapping of morphological characters. Syst. Biol., 52, 131–158.
- Kelly S. (2020) The economics of endosymbiotic gene transfer and the evolution of organellar genomes. bioRxiv, doi:10.1101/2020.10.01.322487.
- Lafond M. et al. (2016) Efficient non-binary gene tree resolution with weighted reconciliation cost. In: Leibniz International Proceedings in Informatics, 27th Annual Symposium on Combinatorial Pattern Matching (CPM), num 14, p 14:1-14:12.
- Lang B.F., Burger G. (2012) Mitochondrial and eukaryotic origins: a critical review. Bot. Res., 63, 1–20.
- Kannan S. et al. (2014) MitoCOGs: clusters of orthologous genes from mitochondria and implications for the evolution of eukaryotes. BMC Evol. Biol., 14, 1–16.
- Makarova K. et al. (2007) Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biol. Direct, 2, 33.
- Puttick M. et al. (2018) The interrelationships of land plants and the nature of the ancestral embryophyte. Curr. Biol., 28, 733–745.e2.
- Roger A.J. et al. (2017) The origin and diversification of mitochondria. Curr. Biol., 27, R1177–R1192.
- Sankoff D., Cedergren R.J. (1983) Simultaneous comparison of three or more sequences related by a tree. In: Sankoff D., Kruskal J.B. (eds.) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, Chapter 9. Addison-Wesley, pp. 253–264.
- Sloan D.B. et al. (2018) Cytonuclear integration and co-evolution. Nat. Rev. Genet., 19, 635–648.
- Stamatakis A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.
- Stolzer M. et al. (2012) Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics, 28, i409–415.
- Szöllősi G.J. et al. (2015) The inference of gene trees with species trees. Syst. Biol., 64, e42–e62.
- Tofigh A. et al. (2011) Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans. Comput. Biol. Bioinf., 8, 517–535.
- Yutin N. et al. (2009) Eukaryotic large nucleo-cytoplasmic DNA viruses: clusters of orthologous genes and reconstruction of viral genome evolution. Virol. J., 6, 223.
- Zhang L. (1997) On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J. Comput. Biol., 4, 177–187.
- Zmasek C.M., Eddy S.R. (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, 17, 821–828.



