Systematic Biology. 2013 Feb 21;62(3):386–397. doi: 10.1093/sysbio/syt003

Lateral Gene Transfer from the Dead

Gergely J Szöllősi 1,2,*, Eric Tannier 1,2,3, Nicolas Lartillot 4,5, Vincent Daubin 1,2
PMCID: PMC3622898  PMID: 23355531

Abstract

In phylogenetic studies, the evolution of molecular sequences is assumed to have taken place along the phylogeny traced by the ancestors of extant species. In the presence of lateral gene transfer, however, this may not be the case, because the species lineage from which a gene was transferred may have gone extinct or not have been sampled. Because it is not feasible to specify or reconstruct the complete phylogeny of all species, we must describe the evolution of genes outside the represented phylogeny by modeling the speciation dynamics that gave rise to the complete phylogeny. We demonstrate that if the number of sampled species is small compared with the total number of existing species, the overwhelming majority of gene transfers involve speciation to and evolution along extinct or unsampled lineages. We show that the evolution of genes along extinct or unsampled lineages can to good approximation be treated as those of independently evolving lineages described by a few global parameters. Using this result, we derive an algorithm to calculate the probability of a gene tree and recover the maximum-likelihood reconciliation given the phylogeny of the sampled species. Examining 473 near-universal gene families from 36 cyanobacteria, we find that nearly a third of transfer events (28%) appear to have topological signatures of evolution along extinct species, but only approximately 6% of transfers trace their ancestry to before the common ancestor of the sampled cyanobacteria. [Gene tree reconciliation; lateral gene transfer; macroevolution; phylogeny.]


From the first growth of the tree, many a limb and branch has decayed and dropped off; and these lost branches of various sizes may represent those whole orders, families, and genera which have now no living representatives, and which are known to us only from having been found in a fossil state.

Charles Darwin, On the Origin of Species. London, 1859

Most of the diversity of life that ever existed on earth has gone extinct and can only be glimpsed from the fossil record. Although the comparative approach allows the reconstruction of some morphological and genetical characteristics of ancestral species, it is only informative for species that have founded extant lineages. Yet, the information enclosed in genome sequences is abundant and particularly meaningful for the reconstruction of the descent and evolution of their carriers (Zuckerkandl and Pauling 1965; Boussau and Daubin 2010; David and Alm 2011), so much so that it may have recorded accounts of extinct lineages. This possibility exists because the success of lateral gene transfer (LGT) as an evolutionary process implies that each gene possesses its own, unique history, which is not necessarily confined to the history of those species that have survived (Maddison 1997; Galtier and Daubin 2008; Fournier et al. 2009; Abby et al. 2012).

Several models have recently been developed to reconcile seemingly contradictory gene phylogenies with the species phylogeny by tracing the path on the species phylogeny along which they evolved as a result of a series of speciations, gene duplications, LGT, and losses (Tofigh 2009; Doyon et al. 2010; David and Alm 2011; Szöllősi and Daubin 2012; Szöllősi et al. 2012). None of these models, however, take into consideration the fact that, in the presence of LGT, gene trees record evolutionary paths along the complete species tree, including extinct and unsampled branches, and not only along the phylogeny of the species in which they reside today. This is the case because, as first noted by Maddison (1997) and later elaborated by Gogarten and coworkers (Zhaxybayeva and Gogarten 2004; Fournier et al. 2009), while LGT events imply that the donor and receiver lineages existed at the same time, the donor lineage might have subsequently become extinct, or more generally, might not have been sampled.

Here, we demonstrate that, if the number of species considered in the species phylogeny is small compared with the total number of species, the overwhelming majority of gene transfers involve speciation to and evolution along extinct or unsampled species. Furthermore, we show that, if this condition is met, the evolution of genes along the unrepresented parts of the species phylogeny can to good approximation be treated as those of independently evolving lineages, the behavior of which depends only on the global parameters of the speciation dynamics. This in turn allows us to derive the probability of observing a gene phylogeny by extending the ODT model introduced previously (Szöllősi et al. 2012). Applying our model to a data set derived from 36 cyanobacterial species, we perform a preliminary assessment of the phylogenetic signal for the evolution of transferred genes along extinct species.

A Minimal Model of Speciation and Gene Birth and Death

It is not feasible to specify, much less to reconstruct, the complete phylogeny of all species that ever existed. To describe the evolution of genes outside the represented phylogeny—along lineages that have become extinct or whose descendants have not been sampled—we must resort to modeling the speciation dynamics that gave rise to the complete phylogeny. Modeling the dynamics of speciation provides a stochastic model of the evolution of unrepresented lineages that can be used to describe gene histories given knowledge of the represented phylogeny and a few global parameters.

As a minimal model of speciation, here, we assume that the number of species N is constant, and that the dynamics of speciation is modeled by a continuous time Moran process (Moran 1962). That is, each species undergoes speciation at rate σ; at each speciation, the species gives rise to two descendants and a randomly chosen species goes extinct (cf. Fig. 1a). The central assumption we make is that, of the N species existing at present (i.e., t = 0), we sample only a small fraction, n ≪ N. In general, the validity of this assumption depends on the phylogenetic problem considered, but should almost always be met for major groups of bacteria and archaea, where the number of species that potentially exchange genes by LGT is inevitably much larger than the number of sampled species, even in large-scale studies (Ochman et al. 2000; Torsvik et al. 2002).
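The speciation dynamics described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `moran_step`, the lineage labels, and the choice to let the extinct species be drawn from all N species (including the parent itself) are assumptions made for the example.

```python
import random

def moran_step(species, sigma, rng):
    """Advance a constant-size Moran process by one speciation/extinction event.

    species: list of lineage labels, one per extant species (length N).
    Returns the waiting time to the event; `species` is updated in place.
    """
    n_total = len(species)
    # Each of the N species speciates at rate sigma, so the total rate is N * sigma.
    wait = rng.expovariate(n_total * sigma)
    parent = rng.randrange(n_total)    # species that speciates
    victim = rng.randrange(n_total)    # randomly chosen species that goes extinct
    # Assumption: the victim's slot is taken over by the new descendant of the parent.
    species[victim] = species[parent]
    return wait

rng = random.Random(1)
species = list(range(50))              # N = 50 founding lineages, labeled 0..49
elapsed = 0.0
while len(set(species)) > 1:           # run until a single founding lineage remains
    elapsed += moran_step(species, sigma=1.0, rng=rng)
print(len(species), elapsed)
```

The number of species stays at N throughout, while the set of surviving founding lineages shrinks, which is the sense in which most lineages that ever existed leave no extant descendants.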

Figure 1.


Gene trees are the result of the combination of speciation and gene birth and death. As a minimal description we consider: a) that for each of the N species at a rate σ, a speciation occurs, during which the species is succeeded by two descendants, and a random species suffers extinction; b) that at a rate δ per gene, a gene duplicates, that is, it is succeeded by two gene copies in the same genome; at a rate τ/(N – 1) per gene per host species, a gene is transferred, resulting in one copy each in the donor and host species; and finally, at a rate λ per gene, a gene is lost. The represented phylogeny c) corresponds to the tree spanned by the n sampled species. A branch of the represented tree corresponds to a series of speciation events, but only the last of these, the speciation event that gives rise to two represented lineages (filled circles, green online), is explicitly present for internal branches as the speciation node terminating the branch. The number of unrepresented species (dashed circles) is always much larger than the number of represented species (full circles).

To describe the evolution of genes within the genomes of species, we assume genes to evolve independently according to a birth-and-death process that consists of gene duplication, transfer, and loss (Tofigh 2009; Szöllősi and Daubin 2012; Szöllősi et al. 2012). As shown in Figure 1b, a gene in the genome of any of the N species can: (i) be duplicated at rate δ; (ii) be transferred from a donor species to any of the other N – 1 possible host species at a rate τ/(N – 1); or (iii) be lost at a rate λ. Gene copies can also be born and be lost as a result of the speciation dynamics: (iv) at the species level, lineages experience speciation at a rate σ, in which case they are replaced by two copies in the two new species; or (v) suffer extinction at an identical rate σ. A branch e of the represented tree S in general corresponds to a series of speciation events. However, as shown in Figure 1c, only the last one of these, the speciation event that gave rise to two represented lineages, is explicitly present for internal branches as the (green online) speciation node terminating the branch.
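The five competing events (i) to (v) for a single gene copy can be illustrated with a Gillespie-style draw. This is a hedged sketch: the function name `next_gene_event` and the event labels are hypothetical, and the transfer rate is taken as τ in total, that is, τ/(N – 1) summed over the N – 1 possible hosts.

```python
import random

def next_gene_event(delta, tau, lam, sigma, rng):
    """Draw the waiting time and type of the next event affecting one gene copy."""
    rates = {
        "duplication": delta,
        "transfer": tau,       # tau/(N - 1) summed over the N - 1 possible hosts
        "loss": lam,
        "speciation": sigma,   # the host species speciates
        "extinction": sigma,   # the host species goes extinct
    }
    total = sum(rates.values())
    wait = rng.expovariate(total)  # exponential waiting time with the total rate
    # Pick the event type with probability proportional to its rate.
    event = rng.choices(list(rates), weights=list(rates.values()), k=1)[0]
    return wait, event

rng = random.Random(0)
print(next_gene_event(delta=0.2, tau=0.2, lam=1.0, sigma=2000.0, rng=rng))
```

With σ much larger than δ, τ, and λ, as in the example call, almost every draw is a speciation or an extinction, which anticipates the argument of the next section.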

Almost All Transfers Involve Speciation

To understand what fraction of transfers involves evolution along unrepresented species, we must compare the relative rate of transfers that are direct transfers between branches of the represented phylogeny S and indirect transfers that result in a gene returning to S after exiting it via speciation or transfer to unrepresented species.

To compare the contribution of indirect transfers and direct transfers to observed gene histories, we consider first only direct transfers and indirect transfers that involve a speciation to an unrepresented species. To describe the shape of the species tree generated by the Moran process introduced above, we can use the coalescent approach. Here, under Kingman's coalescent, the time to the most recent common ancestor of the n sampled species is of the order of 2N/σ (1–1/n) ≈ 2N/σ (Kingman 1982). This implies that the expected number of unrepresented speciation events per branch of the species tree is much larger than one, being of the order of σ×2N/σ/(2n – 2) ≈ N/n ≫ 1, as there are (2n – 2) branches of S. This suggests that for any pair of coexisting branches of the represented tree, a gene that descends from one of the branches and is transferred to the other, is likely to have experienced a speciation event “away” from the represented phylogeny spanned by the n sampled species before being transferred back to it.
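The arithmetic in this paragraph can be checked directly. The sketch below only restates the two expressions from the text; the function names are illustrative. Keeping the factor (1 – 1/n) makes the per-branch count come out as exactly N/n.

```python
def expected_tmrca(N, n, sigma):
    """Expected time to the most recent common ancestor of n sampled species."""
    return 2.0 * N / sigma * (1.0 - 1.0 / n)

def unrepresented_speciations_per_branch(N, n, sigma):
    """Speciation rate times tree height, shared among the 2n - 2 branches of S."""
    return sigma * expected_tmrca(N, n, sigma) / (2 * n - 2)

# Example: one million species in total, 36 of them sampled.
print(unrepresented_speciations_per_branch(N=10**6, n=36, sigma=1.0))
```

The result does not depend on σ, because σ cancels between the rate and the tree height.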

To quantify the above argument, we can compare the expected number of transfers from branch f to branch e of the represented phylogeny, resulting from either a direct transfer or a more complex history involving a speciation event. Clearly, if the branches do not overlap in time, the expected number of direct transfers is zero. To consider overlapping branches, let us consider for simplicity that both e and f are terminal branches; similar results can be derived for any other pair of overlapping branches. The expected branch lengths are then E(t_e) = E(t_f) ≈ N/(σn), with overlap min(t_e, t_f) ≲ N/(σn). Integrating over possible transfer times, the expected number of direct transfers is then

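$$T_{\mathrm{direct}} = \int_0^{\min(t_e,\,t_f)} \frac{\tau}{N-1}\,dt \;\lesssim\; \frac{\tau}{N-1}\cdot\frac{N}{\sigma n} \;\approx\; \frac{\tau}{\sigma n} \qquad (1)$$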

To estimate the expected number of indirect transfers that are topologically indistinguishable from the above direct transfers, we can reason backwards in time as illustrated in Figure 2: (i) the rate at which a transfer occurs from each of the (N – n) unrepresented species to branch e is τ/(N – 1), (ii) the probability of this gene lineage not coalescing back to any of the n branches of the represented tree during a time interval t is exp(−σnt/N), and (iii) the rate at which it coalesces with branch f is σ/N. Integrating over possible speciation and transfer times gives:

graphic file with name syt003m2.jpg (2)

Equations (1) and (2) show that if the number of sampled species is small compared with the total number of species (n ≪ N), then the expected number of direct transfers is small compared with indistinguishable indirect ones (T_direct ≪ T_indirect), that is, the contribution of direct transfers to observed gene histories is negligible.
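The order-of-magnitude comparison can be illustrated numerically. The sketch below is built only from the verbal description of the two integrals and is not a transcription of Equations (1) and (2): the integration limits (both taken as the expected terminal branch length N/(σn)), the function names, and the example parameter values are assumptions.

```python
import math

def expected_direct(N, n, sigma, tau):
    """Rate tau/(N - 1) integrated over an overlap of length N/(sigma*n)."""
    overlap = N / (sigma * n)
    return tau / (N - 1) * overlap

def expected_indirect(N, n, sigma, tau):
    """Transfers into e from unrepresented species that descend from f.

    Assumed form: integrate over transfer times on e (length T) the total rate
    (N - n)*tau/(N - 1), weighted by the probability that the donor lineage
    coalesces with f within T, i.e. the integral of
    (sigma/N)*exp(-sigma*n*s/N) for s from 0 to T.
    """
    T = N / (sigma * n)
    from_f = (1.0 - math.exp(-sigma * n * T / N)) / n  # closed form of the inner integral
    return T * (N - n) * tau / (N - 1) * from_f

N, n, tau = 10**6, 36, 0.2
sigma = 2.0 * N                      # unit-height species tree
ratio = expected_direct(N, n, sigma, tau) / expected_indirect(N, n, sigma, tau)
print(ratio)                         # of the order n/N
```

Under these assumptions the ratio of direct to indirect transfers scales as n/N, and with σ = 2N the indirect count is nearly independent of N.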

Figure 2.


The overwhelming majority of transfers involve evolution along unrepresented species. A direct transfer (dark gray, blue online) between two terminal branches of the represented phylogeny occurs with rate τ/(N – 1) and involves a single transfer event. An indirect transfer (light gray, red online) leaves an indistinguishable record in the gene tree topology. To count indirect transfers, we trace their history backwards in time: transfers back to the host branch on the represented tree (branch e) occur with a rate τ/(N – 1) from each of the N – n unrepresented species. Of these, we are only concerned with the ones that descend from the relevant donor branch (branch f); their number can be calculated using the exponential coalescence probability and the rate of unrepresented speciations σ/N from the donor branch (branch f).

To compare the two types of possible indirect transfers back to S (those exiting via speciation and those exiting via transfer), we must contrast the rate σ at which gene copies exit branch f as a result of speciation and the rate τ/(N – 1)×(N – 1 – n) ≈ τ at which gene copies exit as a result of transfer. Estimates of τ, and more generally of gene birth and death rates, are available from several sources, all of which agree that the expected number of gene birth and death events per branch is below unity. Models that consider the dynamics of the number of homologous gene copies along a species phylogeny (referred to as phylogenetic profiles) (Csűrös and Miklós 2009) have consistently found that birth and death rates are of the same order, with an excess of loss compensated by origination of new families, in agreement with phenomenological models of gene family size distribution (Karev et al. 2002; Szöllősi and Daubin 2012). In a detailed study of 28 archaea, Csűrös and Miklós (2009) found that the expected number of birth events (duplication and gain) is 0.12 and that the expected number of losses is 0.36 per branch per gene. More recently, the ODT model, which attempts to explicitly explain the evolution of multicopy gene trees (representative of complete genomes) along an ultrametric species tree, has arrived at similar results (Szöllősi et al. 2012), finding for 36 cyanobacterial genomes δ ≈ τ ≈ 0.2, λ ≈ 1, in units corresponding to a tree with unit height. Assuming, as above, that the time to the most recent common ancestor of the sampled species is of the order of 2N/σ, the expected number of gene copies (per gene) exiting a branch of S as a result of speciation is proportional to N/n, while the number exiting as a result of transfer is less than one. Since the rate at which a gene that has exited the represented phylogeny returns to S as a result of transfer at some point in the future is independent of the mode of exit from S, we can conclude that indirect transfers are dominated by paths that include a speciation.

In summary, if the number of sampled species is small compared with the total number of species, transfers in observed gene histories are dominated by paths that include a speciation to an unrepresented species and subsequent transfer back to the represented tree.

The Probability of Observing a Gene Tree

Reconciling gene trees with the species tree requires iterating over possible paths along which a gene tree may have been generated by a series of speciations, duplications, transfers, and losses (Fig. 3). In existing methods (Tofigh 2009; Doyon et al. 2010; Szöllősi and Daubin 2012; Szöllősi et al. 2012), this is accomplished by only considering paths along the represented phylogeny and using a dynamic programming approach exploiting the independence of gene birth and death events and by extension gene lineages.

Figure 3.


Reconciling gene trees with the complete phylogeny. a) An evolutionary scenario that involves a transfer event from an unrepresented species. The represented phylogeny is shown as a solid tube with filled circles (green online) corresponding to represented speciations. The unrepresented phylogeny is indicated by dashed tubes, with white circles corresponding to unrepresented speciations (cf. Fig. 1c). The continuous line traces the gene tree spanned by genes in sampled species that is the result of a series of birth and death events along the complete phylogeny; b) a reconciliation of the gene phylogeny from (a), corresponding to the evolutionary scenario depicted in (a). In general, we do not know the evolutionary scenario that has generated the gene phylogeny. However, we can use the dynamic programming algorithm described in the text to calculate the likelihood of the gene tree by summing over all possible reconciliations, that is, all ways to draw the gene tree into the species tree using speciation, duplication, transfer, and loss events [cf. Eqs. (4)–(7) and Fig. A1 in the Appendix]. The likelihood calculation uses the rates of the different events (σ, δ, τ, and λ) together with functions describing the extinction (E_e and Ē) and the propagation (G_e and Ḡ) of gene lineages [cf. Eqs. (A.1)–(A.4)].

Although gene duplication, transfer, and loss can reasonably be modeled as independent birth and death events, speciation and extinction necessarily involve the simultaneous birth and death of many genes. Along the represented phylogeny, speciation events are fully specified and can be explicitly taken into account (Szöllősi et al. 2012). This is not the case, however, for speciation and extinction events that occur in the unrepresented part of the phylogeny, or do not correspond to speciation nodes of the represented phylogeny. Therefore, unrepresented speciations result in nonindependence of gene lineages.

Consider for instance the probability Ē_k(t) that k genes present at time t in a species not ancestral to the sample of n extant species leave no observed descendant. Conditional on the complete phylogeny ϕ, including all extinct species lineages, gene lineages are independent, and therefore Ē_k(t|ϕ) = {Ē(t|ϕ)}^k. Averaging over all complete phylogenies compatible with the phylogeny reconstructed based on the n species, however, results in ⟨Ē_k(t|ϕ)⟩ = ⟨{Ē(t|ϕ)}^k⟩ ≠ ⟨Ē(t|ϕ)⟩^k, which is not a product of k independent factors.

On the other hand, n ≪ N implies that Ē(t) ≈ 1. Introducing the notation Ē(t|ϕ) = 1 – ϵ(t|ϕ) and Ē(t) = 1 – ϵ(t), and neglecting second- and higher order terms in ϵ(t|ϕ) and ϵ(t), we have

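$$\bar{E}_k(t) = \left\langle \{1-\epsilon(t|\phi)\}^k \right\rangle \approx 1 - k\,\langle \epsilon(t|\phi) \rangle = 1 - k\,\epsilon(t) \approx \{1-\epsilon(t)\}^k = \{\bar{E}(t)\}^k \qquad (3)$$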

A similar argument can be derived for the k-gene propagator Ḡ_k (see the Appendix). Therefore, if n ≪ N, then to good approximation, the evolution of two genes observed in the same unrepresented species can be treated as independent without specifying the full phylogeny.
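The first-order argument can be illustrated with a toy calculation. The ϵ values below are invented purely for illustration (they stand for small survival probabilities under four hypothetical complete phylogenies); they are not data from the study.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical values of eps(t|phi) for four equally likely complete phylogenies.
eps_given_phi = [0.0, 0.001, 0.002, 0.005]
k = 3  # number of gene copies in the same unrepresented species

# Exact average over phylogenies: <(1 - eps)^k>
exact = mean([(1.0 - e) ** k for e in eps_given_phi])
# Independent-lineage approximation: (1 - <eps>)^k
approx = (1.0 - mean(eps_given_phi)) ** k

print(exact, approx, exact - approx)
```

The two quantities differ only at second order in ϵ, so the discrepancy shrinks rapidly as the extinction probability approaches one.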

Under the above assumption that unrepresented speciation and extinction events can be considered in a genewise independent manner, we can describe the evolution of gene copies that appear as single gene lineages when observed from the present. We can calculate: (i) the extinction probability E_e(t) that a gene seen at time t on branch e of S leaves no observed descendant, that is, no descendant exists at time t = 0 in the genome of any of the n sampled species; (ii) the extinction probability Ē(t) that a gene seen at time t in an unrepresented species leaves no observed descendant; (iii) the single gene propagation probability G_e(s,t) that all observed descendants of a gene seen at time s on branch e descend from a descendant seen at a later time t < s on branch e; and (iv) Ḡ(s,t), the probability that all observed descendants of a gene seen at time s in an unrepresented species descend from a descendant seen at time t < s in an unrepresented species. Each of the above functions can be expressed through differential equations describing evolution backwards in time by considering the set of possible events that change the relevant probability. These can be derived analogously to those in Tofigh (2009), Stadler (2011), and Szöllősi et al. (2012) and can be found in the Appendix.

Given a rooted gene tree topology G, we can now calculate the probability p(G|S,ℳ) of observing G, where ℳ denotes the parameters of the model, by summing over all possible paths along S and over all complete phylogenies compatible with the species tree spanning the n species of the sample. We can sum over all paths by recursively mapping the branches of G onto branches of S, generalizing the ODT model's algorithm (Szöllősi et al. 2012) to include evolution along unrepresented species (cf. Figs 3 and A1 in the Appendix).

A branch of G represents the evolution of a gene copy for which (i) if the branch is nonterminal, all observed descendants descend from one of the two daughter gene lineages which emerge from the gene tree node in which the branch terminates or (ii) if the branch is terminal, a gene is observed in one of the genomes mapping to a leaf of S. To describe possible paths along S that this gene copy may take before arriving at the gene tree node in which it terminates, we must consider five events: (i) single-copy evolution along branch e of S described by G_e, (ii) single-copy evolution outside S described by Ḡ, (iii) speciation from a branch of S to an unrepresented species such that only descendants of this copy are observed, (iv) transfer such that only descendants of the transferred copy are observed, and (v) speciation represented in S such that only one of the descending copies leaves an observed descendant. Each of these events leads to a single gene copy with observed descendants. The gene tree node in which the branch terminates can correspond to four possible events: (i) a duplication; (ii) a speciation represented in S; (iii) a speciation not represented in S; or (iv) a transfer. Each of these events leads to two gene copies with observed descendants.

To derive the recursion expressing the probability of G as the sum over possible paths along S, we discretize time along S, keeping track of speciation times t_i along S. Speciations represented in S define the time intervals [0, t_1), ..., [t_i, t_{i+1}), ..., [t_{n–1}, t_{n–1}) referred to as time slices (Tofigh 2009; Doyon et al. 2010) with indices 0, ..., i, ..., n. We further divide each time slice into D equal time intervals of height Δt_i = (t_{i+1} – t_i)/D.
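The discretization can be sketched as follows. The function name `time_slices` and the tuple layout are illustrative assumptions; the sketch covers only the slices bounded by two consecutive speciation times and does not treat the interval above the root.

```python
def time_slices(speciation_times, D):
    """Return (start, end, dt) for each slice between consecutive speciation times.

    speciation_times: times t_1 < t_2 < ... of the speciations represented in S,
    measured backwards from the present (t = 0).
    D: number of equal subintervals per slice, so dt = (t_{i+1} - t_i) / D.
    """
    bounds = [0.0] + sorted(speciation_times)
    return [(a, b, (b - a) / D) for a, b in zip(bounds, bounds[1:])]

# Example: a four-species ultrametric tree of unit height.
for start, end, dt in time_slices([0.25, 0.5, 1.0], D=10):
    print(start, end, dt)
```

Because Δt_i is defined per slice, short slices near the tips are resolved as finely, in relative terms, as long slices near the root.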

The probability of the gene lineage leading to node u of G being seen on branch e of S at time tt given the probabilities at time t = ti + Δti is

graphic file with name syt003m4.jpg (4)

where Inline graphic denotes the probability of the gene lineage leading to node u of G being seen in an unrepresented species at time t, and v and w descend from u in G. As shown in Figure A1a in the Appendix, the terms correspond to (i) no event with an observed descendant; (ii) birth of two gene lineages by duplication, such that both leave observed descendants; (iii) and (iv) birth of two gene lineages with observed descendants as a result of an unrepresented speciation; and finally, (v) unrepresented speciation followed by the loss of the copy in branch e such that only the copy in the unrepresented phylogeny leaves an observed descendant. In the above expression, we only consider indirect transfers that involve a speciation; see the Appendix for the full expression.

The probability of being seen in such an unrepresented species is

graphic file with name syt003m5.jpg (5)

where i(S) denotes the set of branches of S in time slice i. As shown in Figure A1b, the terms correspond to (i) no event with an observed descendant; (ii) birth of two gene lineages by speciation, duplication, or transfer, such that both leave observed descendants; (iii) and (iv) birth of two gene lineages with observed descendants as a result of transfer back to the represented phylogeny; and finally, (v) transfer back to the represented phylogeny following which the copy in the unrepresented donor lineage does not leave an observed descendant. Terms involving gene lineages v, w are zero if u is a leaf of G in both the above expressions.

At speciation times t = ti where branches f and g descend from e in S, a represented speciation takes place that may be followed by a loss:

graphic file with name syt003m6.jpg (6)

The terms (cf. Fig. A1c) correspond to (i) and (ii) represented speciation such that both resulting gene lineages lead to observed descendants; and (iii) and (iv) represented speciation such that only one of them does. Finally, at time t = 0 on each terminal branch e of S, the presence of observed genes is expressed as:

graphic file with name syt003m7.jpg (7)

As illustrated in Figures 3b and A1, each term in Equations (4)–(7) above corresponds to a series of speciation, duplication, and transfer events that recursively draw the gene phylogeny into the species tree. The recursion calculates the probability of a gene tree with m genes in O(Dn²m) steps, as there are fewer than n branches in each time slice and n time slices. Summing over roots of G can be accomplished with identical complexity using double recursion. The most likely reconciliation can be recovered by tracing back along the sum, choosing at each step the event with the highest probability.
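A single traceback step can be sketched as below. This is an illustrative simplification, not the authors' code: it assumes that, at a given step of the recursion, the additive terms of the sum have been stored in a dictionary keyed by hypothetical event labels. Passing a random number generator switches from the maximum-likelihood choice to sampling an event in proportion to its term.

```python
import random

def choose_event(terms, rng=None):
    """Pick one event at a traceback step.

    terms: dict mapping an event label to its additive contribution to the sum.
    rng: if None, return the event with the largest term (maximum-likelihood
         traceback); otherwise sample an event proportionally to its term.
    """
    if rng is None:
        return max(terms, key=terms.get)
    labels = list(terms)
    return rng.choices(labels, weights=[terms[label] for label in labels], k=1)[0]

# Hypothetical contributions at one step of the recursion.
step_terms = {"no_event": 0.10, "duplication": 0.70, "unrepresented_speciation": 0.20}
print(choose_event(step_terms))
print(choose_event(step_terms, rng=random.Random(0)))
```

Repeating the step from the root of G down to its leaves yields one complete reconciliation.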

Calculating the probability of a gene tree requires knowledge of the ultrametric species tree S, with branch lengths corresponding to time, the rate of duplication δ, transfer τ, and loss λ, as well as the parameters of the speciation dynamics, the species replacement rate σ, and the total number of species N. The number of parameters is reduced if we assume the time to the common ancestor of the sampled species to correspond to its expected value under speciation dynamics. Choosing units such that S is of unit height, this corresponds to the choice σ = 2N. Furthermore, under the present choice of parameters and time scale, the probability of a gene tree and its maximum-likelihood reconciliation depends only very weakly on N, as long as the condition n ≪ N is satisfied. This is the case because the expected number of transfers between branches of S is nearly independent of N. In particular, if we assume that a gene lineage returns at most once to S, we arrive at the result derived in Equation (2) according to which the number of transfers is independent of N.

Routes to Cyanobacterial Genomes

To carry out a preliminary analysis of the signal for evolution outside the represented phylogeny in real data, we considered a set of 473 single-copy gene families present in the genome of at least 34 of 36 cyanobacteria and used the dated species tree reconstructed in Szöllősi et al. (2012). We chose single-copy near-universal gene families as they (i) are expected to be relatively slowly evolving and hence to harbor a strong signal of homology and yield high-quality alignments, and (ii) can be assumed to be well described by a single set of uniform duplication, transfer, and loss rates, at least in contrast to more complex data sets composed of multicopy families. For each family, gene tree topologies and duplication, transfer, and loss rates that maximize the joint likelihood (Maddison 1997; Szöllősi and Daubin 2012) were inferred as described in the Appendix. Using these results, 1000 reconciliations per family were sampled by stochastic backtracking along the sum over reconciliations.

On average, we found 0 duplications, 2.15 transfers, and 2.56 losses per family. The distribution in time of transfer events and the preceding speciations to unrepresented species are shown in Figure 4a. The majority of transfers occur between branches of S that overlap in time; hence, the resulting gene tree carries no topological signature of the length of time spent evolving along unrepresented lineages. Transfers between branches that do not overlap in time, for which the gene tree topologies explicitly record evolution outside the represented tree, correspond to 27.8% of all transfers. About a fifth of these (5.9% of all transfers) branch above the root, indicating transfer from outside the sampled diversity of cyanobacteria. The median interval of time spent evolving in unrepresented lineages is 0.083 (or 222 million years, henceforth myr) for transfers between overlapping branches and 0.39 (or 1000 myr) for transfers between nonoverlapping branches. Similar values are obtained if we consider only the maximum-likelihood reconciliations, except for the median interval of time spent evolving in unrepresented lineages for transfers between overlapping branches, which is only 0.0028 (or 8.1 myr, corresponding to the minimum length allowed by time discretization). The corresponding value for transfers between nonoverlapping branches, 0.36 (or 990 myr), is nearly identical to the value above.

Figure 4.


LGT events for 36 cyanobacteria. For 473 near-universal single-copy families from 36 cyanobacterial genomes, gene trees that maximize the joint likelihood were reconstructed. For the trees obtained, 1000 reconciliations were sampled. a) The distribution of transfer events (light bars, green online) and the preceding speciation events (dark bars, blue online). The final bin summarizes all events occurring above the root of S. b) The distribution of the time spent by transferred genes evolving along unrepresented species for transfers between overlapping branches (dark bars, red online, 72.2% of transfers) and transfers between nonoverlapping branches (light bars, yellow online, 27.8% of all transfers). Both sets of bins sum to unity. Time units are chosen such that the height of the root of S is 1.0. The age of the root falls in the 3500–2700 My interval (Falcón et al. 2010; Szöllősi et al. 2012). Data are available from Dryad under doi:10.5061/dryad.27d0g.

We emphasize that an important caveat of these results is that the accuracy of our method to infer correct reconciliations and gene topologies has not been assessed. This could be accomplished by explicit simulations of gene family evolution along the complete phylogeny. Such simulations are, however, outside the scope of the current publication, as they are technically challenging due to the large number of species in the complete phylogeny, and because they must address a potentially long list of possible questions. In lieu of simulation, it is possible to examine the posterior support of individual transfer events, which, as described in the Appendix, can be calculated as the fraction of times we find a given transfer event among the sampled reconciliations for each family. Using this measure, we find that transfers are well supported with 66.8% of transfer events having support over 0.95.
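The support measure described above, the fraction of sampled reconciliations in which a given transfer appears, can be computed as in the following sketch. Representing a transfer as a hypothetical (donor branch, recipient branch) pair and a reconciliation as a collection of such pairs is an assumption made for illustration.

```python
from collections import Counter

def transfer_support(sampled_reconciliations):
    """Fraction of sampled reconciliations that contain each transfer event.

    sampled_reconciliations: list with one collection of transfer events per
    sampled reconciliation of a gene family.
    """
    counts = Counter()
    for events in sampled_reconciliations:
        counts.update(set(events))   # count each event at most once per sample
    total = len(sampled_reconciliations)
    return {event: count / total for event, count in counts.items()}

# Four hypothetical sampled reconciliations of one family.
samples = [
    {("f", "e")},
    {("f", "e"), ("g", "e")},
    set(),
    {("f", "e")},
]
print(transfer_support(samples))
```

In this toy input the transfer from f to e has support 0.75, which would fall below the 0.95 threshold quoted in the text.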

It is also important to discuss to what extent we can expect observed transfers between nonoverlapping branches to be robust to increasing the number of sampled species. Consider the extreme case that all N extant species are sampled. It is clear that transfers between overlapping branches of S (red in Fig. 4b) may correspond to transfer between nonoverlapping branches of the full phylogeny spanned by all N extant species. To ascertain how often we expect the opposite to occur, to have a transfer between nonoverlapping branches of S correspond to transfers between overlapping branches of the full phylogeny spanned by all N extant species, we need to estimate how often we expect to sample an extant descendant of the unrepresented donor lineage involved in a transfer between nonoverlapping branches of S (light bars, yellow online, in Fig. 4b). Assuming a tree with unit height, the total branch length of the full phylogeny under Kingman's coalescent is of the order of log(N), while the total branch length including extinct species is of the order N. Thus, we expect that only a vanishing fraction of the order log(N)/N of donor lineages have left extant descendants. This implies that not only do most transfers involve speciation to, and evolution along branches of the complete phylogeny, but also the majority of these donor lineages have gone extinct. Consequently, most transfers between nonoverlapping branches of S correspond to transfers between nonoverlapping branches of the full phylogeny where the donor lineage has gone extinct.
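To give a feel for how quickly the fraction log(N)/N vanishes, the scaling can be evaluated for a few values of N. This is an order-of-magnitude illustration only: constant factors are ignored, and the function name and the chosen values of N are arbitrary.

```python
import math

def extant_donor_fraction(N):
    """Order-of-magnitude fraction log(N)/N of donor lineages with extant descendants."""
    return math.log(N) / N

for N in (10**3, 10**6, 10**9):
    print(N, extant_donor_fraction(N))
```

Already for a million species the fraction is of the order of one in a hundred thousand.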

In summary, we find that nearly a third (27.8%) of transfers evolve, on average, for a billion years along lineages unrepresented in the phylogeny (most often, in fact, along extinct lineages), and that only a moderate fraction of transfers originates from outside the cyanobacteria. Furthermore, both of these estimates are conservative, as increasing the number of sampled species is expected to lead to an increase in the ratio of transfers between nonoverlapping branches, and to a decrease in the fraction of transfers from outside of cyanobacteria. The first of the above results, however, applies only to transfers between branches of S, that is, transfers observed for the n = 36 cyanobacteria considered. For the complete set of transfers between branches of the full phylogeny, the fraction of transfers evolving along extinct lineages is potentially different; for example, a macroscopic fraction of transfers is expected to correspond to direct transfers between its branches.

Discussion

The results developed above are conditional on two crucial assumptions: (i) that the number of sampled species is small compared with the total number of species, and (ii) that the evolution of gene lineages can be treated as independent, both in the represented and in the unrepresented part of the phylogeny. As we argue above, if genes are duplicated, transferred, and lost independently, the former assumption (i.e., n ≪ N) implies that the evolution of genes outside the represented phylogeny can also be treated as independent, even if the complete phylogeny is not specified.

We also make the assumptions that (iii) transfer occurs with identical rate between any two species and (iv) that the time to the last common ancestor of the sampled species corresponds to its expected value under the speciation dynamics. These conditions serve to simplify the development of the above arguments and can be relaxed without affecting our conclusion that the majority of transfers involve evolution along extinct or unsampled species. Relaxing Condition (iv) is straightforward. Concerning Assumption (iii), if, for example, transfer occurs preferentially between species that are more closely related (Andam and Gogarten 2011), the scenarios shown in Figure 2 are affected to an identical extent because the last common ancestor of branch e and either branch f (the donor lineage for dark gray paths, blue online) or any extinct species that descends from an unrepresented speciation along f (a donor lineage along light gray paths, red online) is the same. Conversely, there are known cases, for example, the transfer of thermostable enzymes from thermophilic archaea to thermophilic bacteria (Nelson et al. 1999; Nesbo et al. 2001; Brochier-Armanet and Forterre 2007), of preferential transfer between distantly related taxa due to shared ecology. In this second case, we expect genes preferentially transferred from phylogenetically distant taxa to lead to an excess of transfers descending from above the root of the sampled species, for which topologically equivalent direct transfers do not exist. On more practical grounds, however, relaxing the assumption of homogeneous rates of transfer between lineages might seriously complicate the computation of the likelihood, as it would require modeling the distribution of the rates of transfers from and to unrepresented lineages.

More importantly, as long as these conditions are met, it is possible to extend the above results to more general models of speciation. Modeling variation in N, the total number of species, over geological times, could be of particular interest. Indeed, a corollary of the observation that LGT events record evolutionary paths along the complete species tree is that the phylogenies of genes from a limited sample of extant species carry information about extinct lineages, and therefore about the size and dynamics of ancient biodiversity. In fact, patterns of gene transfer may be even more informative about past biodiversity than the species tree itself. Drawing an analogy with population genetics, inferring biodiversity dynamics based on species trees (Nee 2001; Morlon et al. 2010; Stadler 2011) is similar to inferring past demography based on single-locus data. Single-locus inference is limited by the intrinsic stochasticity of Kingman's coalescent, in particular in the deep part of the genealogy. LGTs, on the other hand, are analogous to multiple loci (Heled and Drummond 2008), and as such, have the potential to increase the statistical power for inferring past biodiversity.

Supplementary Material

Data files related to this paper have been deposited at Dryad under doi:10.5061/dryad.27d0g.

Funding

This work was supported by the Marie Curie Fellowship 253642 “Geneforest” (to G.J.Sz.); the Institut National de Physique Nucléaire et de Physique des Particules' (IN2P3) computing centre; and the French Agence Nationale de la Recherche (ANR) through grant ANR-10-BINF-01-01 “Ancestrome”.

Acknowledgments

We thank B. Boussau and all the members of the Bioinformatics and Evolutionary Genomics Group for discussions of the results and comments on the article.

Appendix

The dynamic programming algorithm described in Equations (4)–(7) calculates the likelihood of a gene tree given the species tree S and the rates of speciation, duplication, transfer, and loss. As illustrated in Figure 3, the likelihood is calculated in a piecewise independent manner, from the evolution of gene copies that appear as single gene lineages when observed from the present. Because, in contrast to Figure 3, we do not know the exact evolutionary scenario, we must sum over all reconciliations. This process can be represented as summing over all possible ways to draw the gene tree into the species tree using the set of events shown in Figure A1. The diagrams in Figure A1, and the corresponding terms in Equations (4)–(7), are expressed using two types of functions describing the evolution of single gene lineages: (i) the extinction probabilities Ee and Ē, which give the probability that a gene present on, respectively, branch e of S or an unrepresented species has no descendant at time t = 0 in the genome of any of the n sampled species; (ii) the single gene propagators Ge(s,t) and Ḡ(s,t), which correspond to the probability that all sampled descendants of the gene seen at time s, respectively, on branch e of S or in an unrepresented species, descend from the gene present at a later time t in the same species.

Below we provide the expressions for each of these functions that can be derived using the theory of birth-and-death processes. We also discuss the independence assumption in relation to the single gene propagators, write down the complete form of Equation (4) and describe the details of the data analysis presented in the main text.

Figure A1.

Diagrams corresponding to reconciliation events. Each diagram corresponds to a term in Equations (4)–(7), with diagrams following each other in the same order as terms in the indicated equation. a) Depicts events that start with a gene lineage u in represented branch e of S at time t; b) events that start with a gene lineage u in an unrepresented species at time t; and finally, c) corresponds to represented speciation events in S. To illustrate the correspondence between terms and equations, consider the third diagram in the top row (a) depicting an unrepresented speciation and the corresponding (third) term in Equation (4). This term describes the probability that gene lineage u seen at time t is succeeded, as a result of an unrepresented speciation, by two gene lineages (v and w), one of which (w) is present in the same branch e as u while the other (v) resides in an unrepresented species.

Evolution of Single Genes

The forward Kolmogorov equations describing single gene extinction and propagation can be derived analogously to Tofigh (2009), Stadler (2011), and Szöllősi et al. (2012). The main difference is that here we also consider the speciation dynamics.

The extinction probability for branch e of S:

graphic file with name syt003m8.jpg (A.1)

where i(S) denotes the set of branches of S in time slice i, and ni their number. The terms correspond to (i) loss, (ii) duplications, and (iii) transfers to represented hosts, both conditional on survival, and finally, (iv) unrepresented speciations and transfers to unrepresented hosts, again conditional on survival. The initial conditions specify that at the end of branch e the probability of extinction is 0 if we are at time t = 0, that is, if e is a terminal branch of S, and the product of the extinction probabilities of the descendants of e in S otherwise.

The extinction probability in an unrepresented species:

graphic file with name syt003m9.jpg (A.2)

Note that the term corresponding to transfer back to S acts as an inhomogeneity in the absence of which the only solution is Ē(t) = 1.

The single observed lineage propagator along branch e of S:

graphic file with name syt003m10.jpg (A.3)

The single observed lineage propagator in an unrepresented species:

graphic file with name syt003m11.jpg (A.4)

Note that if we set Ē(t) = 1 and neglect gene birth and death, which are much slower than the speciation dynamics, that is, δ + τ + λ ≪ σ, we recover the exponential probability of coalescence with the represented tree assumed in Equation (2).

The propagator describing the evolution of k gene copies can be expressed using the single gene copy propagator. Consider the expression for Inline graphic, the probability that k genes seen at time s in an unrepresented species all leave a single descendant descending from the copy seen at time t < s in an unrepresented species:

graphic file with name syt003m12.jpg (A.5)

Since Equation (3) implies Inline graphic, and neglecting second- and higher-order terms in 1 − Ē gives 2(1 − Ē^k) ≃ 2k(1 − Ē), the above can be written as

graphic file with name syt003m13.jpg (A.6)

which has the solution Inline graphic. Analogous reasoning can be used to show that Ee(t) and Ge(s,t) can be used to factor the respective functions describing the evolution of multiple gene copies.

The Probability of a Gene Tree

The expressions for Pe(u,t) and P̄(u,t) are derived under the approximation that unrepresented speciation and extinction are independent per gene, as discussed above. The full expression for Pe(u,t + Δti), including terms corresponding to direct transfers and to indirect transfers that depart S via a preceding transfer event, which are neglected in Equation (4), is:

graphic file with name syt003m14.jpg (A.7)

Routes to Cyanobacterial Genomes

The data set was constructed for near-universal single-copy genes from all 36 cyanobacterial genomes found in Version 5 of the HOGENOM database (Penel et al. 2009). Amino acid sequences were extracted for each family that had a single copy in at least 34 of the 36 cyanobacterial genomes. For each family, sequences were aligned using MUSCLE (Edgar 2004) with default parameters. The multiple alignment was subsequently cleaned using GBLOCKS (Talavera and Castresana 2007) with the options:

graphic file with name syt003m15.jpg (A.8)

Subsequently, we inferred a gene topology G that maximizes the joint likelihood

graphic file with name syt003m16.jpg (A.9)

where the first term corresponds to the likelihood of observing the unrooted gene tree topology G according to the exODT model developed above [Eqs. (4) and (5)], while the second term corresponds to the classic Felsenstein likelihood (Felsenstein 1981) of the alignment. For the exODT model, we fixed the parameter values N = 10⁶ and σ = 2N, used the dated phylogeny from Szöllősi et al. (2012), and estimated global gene birth and death rates as described below. To calculate the Felsenstein likelihood, we used the Bio++ library (Dutheil et al. 2006) with an LG+Γ4+I model. Alignments and reconstructed gene trees are available from Dryad under doi:10.5061/dryad.27d0g.

Gene tree inference was performed in two steps:

Initial estimate

  1. Using the DTL rates δ = 1.0 × 10⁻², τ = 1.0 × 10⁻², λ = 2.0 × 10⁻² for each family, the joint likelihood was calculated for all nearest neighbor interchanges (NNIs) (Felsenstein 2004), and a move was accepted if it improved the joint likelihood.

  2. For the set of trees obtained, global DTL parameters were estimated that maximize the product of the joint likelihoods of all 473 gene families.

Final estimate

  1. Using the obtained DTL rates for each family, the joint likelihood was calculated for all NNIs (Felsenstein 2004), and a move was accepted if it improved the joint likelihood.

  2. For the set of trees obtained, global DTL parameters were again estimated, with the results δ = 1.010 × 10⁻⁵, τ = 4.438 × 10⁻³, and λ = 1.015 × 10⁻¹.
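The tree search in step 1 of each estimate can be pictured as a greedy hill climb. The sketch below is a generic illustration, not the authors' implementation: `hill_climb`, `neighbours`, and `log_likelihood` are hypothetical names, the first-improvement acceptance rule is an assumption (the text states only that improving moves were accepted), and the toy usage replaces gene trees and NNIs with integers so that the example is self-contained. In the setting described above, `neighbours` would enumerate the NNI rearrangements of a gene tree and `log_likelihood` would return the logarithm of the joint likelihood, that is, the sum of the exODT and Felsenstein log-likelihoods.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def hill_climb(start: T,
               neighbours: Callable[[T], Iterable[T]],
               log_likelihood: Callable[[T], float]) -> T:
    """Greedy search: move to the first neighbour that improves the
    score, and stop when no neighbour of the current state does."""
    current, current_score = start, log_likelihood(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(current):
            score = log_likelihood(candidate)
            if score > current_score:
                # Accept the improving move and restart from the new state.
                current, current_score = candidate, score
                improved = True
                break
    return current


if __name__ == "__main__":
    # Toy stand-in: states are integers, "moves" are +/-1, and the
    # score has a single maximum at 3.
    best = hill_climb(0,
                      neighbours=lambda x: (x - 1, x + 1),
                      log_likelihood=lambda x: -(x - 3) ** 2)
    print(best)
```

Like any greedy search, this procedure can stop at a local optimum, so the choice of starting topology matters.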

Before performing NNIs, starting gene tree topologies were estimated using an amalgamation approach (David and Alm 2011) wherein the Felsenstein likelihood was approximated using conditional clade probabilities (Höhna and Drummond 2012) based on a posterior sample of 10 000 tree topologies obtained using PhyloBayes with an LG+Γ4+I substitution model.
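As a rough illustration of conditional clade probabilities, the sketch below estimates the probability of a topology as the product, over its internal nodes, of the observed frequency of each split given its parent clade. Several simplifications are assumptions made for brevity: trees are rooted and binary, encoded as nested 2-tuples of leaf names, whereas the analysis above concerns unrooted topologies; all function names are hypothetical; and the four-tree sample is invented for the example.

```python
from collections import Counter
from math import prod


def clades_and_splits(tree):
    """Return (leaf set, [(clade, split), ...]) for a rooted binary tree
    given as nested 2-tuples of leaf-name strings."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    left, right = tree
    left_leaves, left_splits = clades_and_splits(left)
    right_leaves, right_splits = clades_and_splits(right)
    clade = left_leaves | right_leaves
    split = frozenset([left_leaves, right_leaves])
    return clade, left_splits + right_splits + [(clade, split)]


def count_sample(trees):
    """Count how often each clade, and each split of a clade, is observed."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in trees:
        for clade, split in clades_and_splits(tree)[1]:
            clade_counts[clade] += 1
            split_counts[(clade, split)] += 1
    return clade_counts, split_counts


def ccp_probability(tree, clade_counts, split_counts):
    """Product over internal nodes of P(split | clade); zero if the
    tree contains a clade or split never seen in the sample."""
    return prod(
        split_counts[(clade, split)] / clade_counts[clade]
        if clade_counts[clade] else 0.0
        for clade, split in clades_and_splits(tree)[1]
    )


if __name__ == "__main__":
    t1 = (("A", "B"), ("C", "D"))
    t2 = ((("A", "B"), "C"), "D")
    t3 = ((("A", "C"), "B"), "D")
    clade_counts, split_counts = count_sample([t1, t1, t2, t3])
    for tree in (t1, t2, t3):
        print(tree, ccp_probability(tree, clade_counts, split_counts))
```

Because the estimate is assembled from per-clade frequencies, it can be evaluated for topologies that combine sampled clades in ways not present in the sample itself, which is what makes it usable as a fast stand-in for the sequence likelihood during amalgamation.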

The support of transfer events was measured based on a posterior sample of 1000 reconciliations per family. For each family, we assessed the support of all transfer events in the reconciliation that was seen the largest number of times. Two transfers were considered equivalent if they involved the transfer of the same gene lineage between identical branches of the species tree.
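The support calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: each sampled reconciliation is reduced to the set of its transfer events, each event is a hypothetical `(gene_lineage, donor_branch, recipient_branch)` tuple so that equality of tuples encodes the equivalence criterion, and the four-reconciliation sample is invented for the example.

```python
from collections import Counter


def transfer_support(sampled_reconciliations):
    """For every transfer in the most frequently sampled reconciliation,
    return the fraction of sampled reconciliations that contain it.

    Each reconciliation is a frozenset of transfer events; an event is a
    (gene_lineage, donor_branch, recipient_branch) tuple.
    """
    n_samples = len(sampled_reconciliations)
    most_frequent, _ = Counter(sampled_reconciliations).most_common(1)[0]
    return {
        event: sum(event in rec for rec in sampled_reconciliations) / n_samples
        for event in most_frequent
    }


if __name__ == "__main__":
    r1 = frozenset({("g1", "a", "b"), ("g2", "c", "d")})
    r2 = frozenset({("g1", "a", "b")})
    support = transfer_support([r1, r1, r2, r1])
    for event, value in sorted(support.items()):
        print(event, value)
```

A real reconciliation also records duplications, losses, and speciations, so two sampled reconciliations with identical transfer sets need not be identical; collapsing them here is a simplification of the sketch.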

References

  1. Abby S.S., Tannier E., Gouy M., Daubin V. Lateral gene transfer as a support for the tree of life. Proc. Natl Acad. Sci. USA. 2012;109:4962–4967. doi: 10.1073/pnas.1116871109.
  2. Andam C.P., Gogarten J.P. Biased gene transfer in microbial evolution. Nat. Rev. Microbiol. 2011;9:543–555. doi: 10.1038/nrmicro2593.
  3. Boussau B., Daubin V. Genomes as documents of evolutionary history. Trends Ecol. Evol. 2010;25:224–232. doi: 10.1016/j.tree.2009.09.007.
  4. Brochier-Armanet C., Forterre P. Widespread distribution of archaeal reverse gyrase in thermophilic bacteria suggests a complex history of vertical inheritance and lateral gene transfers. Archaea. 2007;2:83–93. doi: 10.1155/2006/582916.
  5. Csűrös M., Miklós I. Streamlining and large ancestral genomes in archaea inferred with a phylogenetic birth-and-death model. Mol. Biol. Evol. 2009;26:2087–2095. doi: 10.1093/molbev/msp123.
  6. David L.A., Alm E.J. Rapid evolutionary innovation during an archaean genetic expansion. Nature. 2011;469:93–96. doi: 10.1038/nature09649.
  7. Doyon J.-P., Scornavacca C., Gorbunov K., Szöllősi G.J., Ranwez V., Berry V. Berlin (Germany): Springer; 2010. An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers; pp. 93–108. (Comparative genomics; vol. 6398)
  8. Dutheil J., Gaillard S., Bazin E., Glémin S., Ranwez V., Galtier N., Belkhir K. Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinform. 2006;7:188. doi: 10.1186/1471-2105-7-188.
  9. Edgar R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
  10. Falcón L.I., Magallón S., Castillo A. Dating the cyanobacterial ancestor of the chloroplast. ISME J. 2010;4:777–783. doi: 10.1038/ismej.2010.2.
  11. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 1981;17:368–376. doi: 10.1007/BF01734359.
  12. Felsenstein J. Sunderland (MA): Sinauer Associates; 2004. Inferring phylogenies.
  13. Fournier G.P., Huang J., Gogarten J.P. Horizontal gene transfer from extinct and extant lineages: biological innovation and the coral of life. Phil. Trans. R. Soc. Lond. B Biol. Sci. 2009;364:2229–2239. doi: 10.1098/rstb.2009.0033.
  14. Galtier N., Daubin V. Dealing with incongruence in phylogenomic analyses. Phil. Trans. R. Soc. Lond. B Biol. Sci. 2008;363:4023–4029. doi: 10.1098/rstb.2008.0144.
  15. Heled J., Drummond A.J. Bayesian inference of population size history from multiple loci. BMC Evol. Biol. 2008;8:289. doi: 10.1186/1471-2148-8-289.
  16. Höhna S., Drummond A.J. Guided tree topology proposals for Bayesian phylogenetic inference. Syst. Biol. 2012;61:1–11. doi: 10.1093/sysbio/syr074.
  17. Karev G.P., Wolf Y.I., Rzhetsky A.Y., Berezovskaya F.S., Koonin E.V. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2002;2:18. doi: 10.1186/1471-2148-2-18.
  18. Kingman J. On the genealogy of large populations. J. Appl. Probab. 1982;19:27–43.
  19. Maddison W.P. Gene trees in species trees. Syst. Biol. 1997;46:523–536.
  20. Moran P.A.P. Oxford: Clarendon Press; 1962. The statistical processes of evolutionary theory.
  21. Morlon H., Potts M.D., Plotkin J.B. Inferring the dynamics of diversification: a coalescent approach. PLoS Biol. 2010;8:e1000493. doi: 10.1371/journal.pbio.1000493.
  22. Nee S. Inferring speciation rates from phylogenies. Evolution. 2001;55:661–668. doi: 10.1554/0014-3820(2001)055[0661:isrfp]2.0.co;2.
  23. Nelson K.E., Clayton R.A., Gill S.R., Gwinn M.L., Dodson R.J., Haft D.H., Hickey E.K., Peterson J.D., Nelson W.C., Ketchum K.A., McDonald L., Utterback T.R., Malek J.A., Linher K.D., Garrett M.M., Stewart A.M., Cotton M.D., Pratt M.S., Phillips C.A., Richardson D., Heidelberg J., Sutton G.G., Fleischmann R.D., Eisen J.A., White O., Salzberg S.L., Smith H.O., Venter J.C., Fraser C.M. Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature. 1999;399:323–329. doi: 10.1038/20601.
  24. Nesbo C.L., L'Haridon S., Stetter K.O., Doolittle W.F. Phylogenetic analyses of two “archaeal” genes in Thermotoga maritima reveal multiple transfers between archaea and bacteria. Mol. Biol. Evol. 2001;18:362–375. doi: 10.1093/oxfordjournals.molbev.a003812.
  25. Ochman H., Lawrence J.G., Groisman E.A. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304. doi: 10.1038/35012500.
  26. Penel S., Arigon A.-M., Dufayard J.-F., Sertier A.-S., Daubin V., Duret L., Gouy M., Perrière G. Databases of homologous gene families for comparative genomics. BMC Bioinform. 2009;10(Suppl 6):S3. doi: 10.1186/1471-2105-10-S6-S3.
  27. Stadler T. Mammalian phylogeny reveals recent diversification rate shifts. Proc. Natl Acad. Sci. USA. 2011;108:6187–6192. doi: 10.1073/pnas.1016876108.
  28. Szöllősi G.J., Daubin V. Modeling gene family evolution and reconciling phylogenetic discord. Methods Mol. Biol. 2012;856:29–51. doi: 10.1007/978-1-61779-585-5_2.
  29. Szöllősi G.J., Boussau B., Abby S.S., Tannier E., Daubin V. Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations. Proc. Natl Acad. Sci. USA. 2012;109:17513–17518. doi: 10.1073/pnas.1202997109.
  30. Talavera G., Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 2007;56:564–577. doi: 10.1080/10635150701472164.
  31. Tofigh A. [Stockholm (Sweden)]: KTH, School of Computer Science and Communication; 2009. Using trees to capture reticulate evolution: lateral gene transfers and cancer progression [dissertation]
  32. Torsvik V., Øvreås L., Thingstad T.F. Prokaryotic diversity–magnitude, dynamics, and controlling factors. Science. 2002;296:1064–1066. doi: 10.1126/science.1071698.
  33. Zhaxybayeva O., Gogarten J.P. Cladogenesis, coalescence and the evolution of the three domains of life. Trends Genet. 2004;20:182–187. doi: 10.1016/j.tig.2004.02.004.
  34. Zuckerkandl E., Pauling L. Molecules as documents of evolutionary history. J. Theor. Biol. 1965;8:357–366. doi: 10.1016/0022-5193(65)90083-4.

Articles from Systematic Biology are provided here courtesy of Oxford University Press
