A Maximum Likelihood Method for Reconstruction of the Evolution of Eukaryotic Gene Structure

Liran Carmel; Igor B Rogozin; Yuri I Wolf; Eugene V Koonin

doi:10.1007/978-1-59745-243-4_16

Author manuscript; available in PMC: 2012 Aug 2.

Published in final edited form as: Methods Mol Biol. 2009;541:357–371. doi: 10.1007/978-1-59745-243-4_16

PMCID: PMC3410445 NIHMSID: NIHMS385225 PMID: 19381540

Abstract

Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.

Keywords: Maximum likelihood, expectation-maximization, intron evolution, ancestral reconstruction, eukaryotic gene structure

1. Introduction

In eukaryotes, many protein-coding genes have their coding sequence broken into pieces – the exons – separated by the non-coding spliceosomal introns. These introns are removed from the nascent pre-mRNA and the exons are spliced together to form the intronless mRNA by the spliceosome, a large and elaborate macromolecular complex comprising several small RNA molecules and numerous proteins. No spliceosomal introns have ever been found in prokaryotes, and there are no eukaryotes with a completely sequenced genome, not even the very basal ones, that lack introns (1–3) and the accompanying splicing machinery (4).

Despite the introns being such a remarkable idiosyncrasy of eukaryotic genomes, their origin and evolution are not thoroughly understood (5, 6). It is generally accepted that introns can be regarded as units of evolution and that their presence/absence pattern is a result of stochastic processes of loss and gain. However, the nature of these processes is vigorously debated. Recent large-scale attempts to study these processes using extant eukaryotic genomes led to incongruent conclusions.

In a study on reconstruction of intron evolution, Rogozin et al. (7) analyzed ~700 sets of intron-bearing orthologous genes from eight eukaryotic species. The multiple alignment of the orthologs within each set was computed, and the intron positions were projected on the alignments to form presence/absence maps. Using Dollo parsimony to infer ancestral states, these authors observed a diverse repertoire of behaviors. Some lineages endured extensive losses, while others experienced mostly gain events. Early ancestors, such as the last common ancestor of multicellular life, were shown to be relatively intron-rich. This work suggested that both gain and loss of introns played significant roles in shaping the modern eukaryotic gene structure. However, as these inferences rely upon the Dollo parsimony reconstruction, the number of gains in terminal branches (leaves of the phylogenetic tree) is overestimated, resulting in a potentially significant underestimation of the number of introns in ancient lineages.

The same data set was analyzed by Roy and Gilbert (8, 9) using a different methodology. They adopted a simple evolutionary model, according to which different lineages are associated with different loss and gain probabilities. Using a variation on maximum likelihood estimation, they obtained considerably higher estimates for the number of introns in early eukaryotes and a correspondingly lower level of gains in all lineages, i.e., a clear dominance of loss events in the evolution of eukaryotic genes. Roy and Gilbert have substantially simplified the mathematics involved in the estimation procedure, at the expense of introducing into the computation considerations of parsimony, which yielded an inference technique that is a hybrid between parsimony and maximum likelihood. This hybrid, however, excludes from consideration different evolutionary scenarios, resulting in inflated estimates of the number of introns in early eukaryotes (10).

The model of Roy and Gilbert is branch-specific, i.e., it assumes that the gain and loss rates depend only on the branch, thus tacitly presuming that all genes behave identically with respect to intron gain and loss. Exactly the inverse approach was adopted by Qiu et al. (11). These authors developed a gene-specific model, whereby different gene families are characterized by different rates of intron gain and loss, but for a particular gene these rates are constant across the entire phylogenetic tree. They used a different data set combined with a Bayesian estimation technique and concluded that almost all extant introns were gained during the eukaryotic evolution. This suggests evolution is dominated by intron gain events with few losses. However, the validity of a gene-specific model is disputable as it is hard to reconcile with the accumulating evidence on large differences between lineages (12–15).

Recently, two maximum likelihood estimation techniques have been developed for essentially the same branch-specific evolutionary model as the one of Roy and Gilbert. Csuros (10) used a direct approach, while Nguyen et al. (16) developed an expectation-maximization algorithm. Both methods encountered the same problem of estimating the number of unobserved intronless sites. Each employed a technically different but conceptually similar method to evaluate this number. Both techniques were applied to the eight-species data of Rogozin et al. (7), yielding very close estimates. As expected, these methods predict intron occupancy level of ancient lineages higher than those predicted by Dollo parsimony and lower than those predicted by the hybrid technique of Roy and Gilbert. Notably, these estimates are generally closer to those obtained using Dollo parsimony, and they imply an evolutionary landscape comprising both losses and gains, with some excess of gains.

While the Dollo parsimony (7) and the hybrid technique of Roy and Gilbert (8, 9) showed some methodological biases, the other analyses of intron evolution (10, 11, 16) used well-established estimation techniques. Nevertheless, these studies kept yielding widely diverging inferences. The reason seems to be the differences in the underlying evolutionary models, neither being sufficient to describe the complex reality of intron evolution. The branch-specific model fails to account for important differences between genes, whereas the gene-specific model ignores the sharp differences between lineages. Additionally, rate variability between sites, known to be an important factor in other fields of molecular evolution (17, 18), should be taken into account also in the evolution of gene structure. This is particularly important for intron gain in light of the accumulating evidence in favor of the proto-splice model, according to which new introns are preferentially inserted inside certain sequence motifs (19–21). This means that sites could dramatically differ in their gain rate depending on their position relative to a proto-splice site.

Here we describe a model of evolution that takes into consideration all of the above factors. In order to efficiently estimate the model parameters by maximum likelihood, we have developed an expectation-maximization algorithm. We also compiled a data set that is considerably larger than previously used ones, consisting of 400 sets of orthologous genes from 19 eukaryotic species. Applying our algorithm to this data set, we obtained high-precision estimates, revealing a fascinating evolutionary history of gene structure in which both losses and gains played significant roles, although the contribution of losses was somewhat greater. Moreover, we identified novel properties of intron evolution: (i) All eukaryotic lineages share a common, universal mode of intron evolution, whereby the loss and gain processes are positively correlated. This suggests that the mechanisms of intron gain and loss share common mechanistic components. In some lineages, additional forces come into play, resulting either in elevated loss rate or in elevated gain rate. Lineages exhibiting an increased loss rate are dispersed throughout the entire phylogenetic tree. In contrast, lineages with excessive gains are much rarer, and all of them are ancient. (ii) Intron loss rates of individual genes show no correlation with any other genomic property. By contrast, intron gain rates of individual genes show several remarkable relationships, not always easily explained. In brief, intron gain rate is positively correlated with expression level, negatively correlated with sequence evolution rate, and negatively correlated with the gene length. Moreover, genes of apparent bacterial origin have significantly lower rates of intron gain than genes of archaeal origin. (iii) We showed that the remarkable conservation of intron positions is mainly (~90%) due to shared ancestry, and only in a minority of cases (~10%) due to parallel gain at the same location. (iv) We determined that the density of potential intron insertion sites is about 1 site per 7 nucleotides.

2. Materials

The algorithm learns the parameters of the model by comparing the structure of orthologous genes in extant species. To carry out this comparison, it requires two sets of input data, to be described in this section. The first is a phylogenetic tree, defining topological relationships between a set of eukaryotic species. The second is a collection of genes, for which one can identify orthologs in at least a subset of the species above.

2.1. Multiple Alignments

Suppose that we have G sets of aligned orthologous genes from S species. To represent the gene structure, we transform these alignments into intron presence–absence maps by substituting for each nucleotide (or amino acid) 1 or 0, depending on whether an intron is present or absent in the respective position. We allow for missing data by using a third symbol (*), and consequently a gene might be included in the input data even if it is missing in some of the species. Every site (column) in an alignment, called a pattern, is a vector of length S over the alphabet {0, 1, *}. Let Ω be the total number of unique patterns in the entire set of G alignments, denoted ω₁, . . . , ω_Ω, and let n_gp count the number of times pattern ω_p is found in the multiple alignment of gene g. Assuming that the sites evolve independently, the set M_g = (n_g1, . . . , n_gΩ) fully characterizes the multiple alignment of the gth gene. Thus, all the relevant information about the multiple alignments is captured by the list of unique patterns ω₁, . . . , ω_Ω, and the list of vectors M₁, . . . , M_G.
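The reduction of alignments to unique patterns and counts can be illustrated with a minimal Python sketch. The function name `build_pattern_counts` and the input layout (one string over 0, 1, * per species, all of equal length) are assumptions made for this illustration and are not part of the original method description.

```python
from collections import Counter

def build_pattern_counts(alignments):
    """Collapse presence-absence maps into unique patterns and counts n_gp.

    alignments: list of G alignments; each alignment is a list of S strings
    of equal length over the symbols '0', '1' and '*', one string per species
    (hypothetical input layout).
    Returns (patterns, counts) with counts[g][p] = n_gp.
    """
    per_gene = []
    for rows in alignments:
        if len({len(r) for r in rows}) != 1:
            raise ValueError("all rows of an alignment must have equal length")
        # zip(*rows) walks over alignment columns; each column is one pattern
        per_gene.append(Counter("".join(col) for col in zip(*rows)))
    # Unique patterns over all genes, in a fixed (sorted) order
    patterns = sorted(set().union(*per_gene))
    # A Counter returns 0 for patterns that never occur in a gene
    counts = [[c[p] for p in patterns] for c in per_gene]
    return patterns, counts

gene1 = ["0100", "0101", "0*00"]   # three species, four sites
gene2 = ["10", "10", "10"]         # three species, two sites
patterns, counts = build_pattern_counts([gene1, gene2])
```

Each row `counts[g]` plays the role of the vector M_g, so downstream computations only need to visit each unique pattern once per gene.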

2.2. Phylogenetic Tree

Let T be a rooted bifurcating phylogenetic tree with S leaves (terminal nodes) corresponding to the S species above. The total number of nodes in T is N = 2S – 1, and we index them by t = 0, 1, . . . , N – 1, with the convention that zero is the root node. The state of node t is described by the random variable q_t, which can take the values 0 and 1 (and * in leaves). We use V_t for the set of all leaves such that node t is among their ancestors. The entire collection of leaves is, obviously, V₀. The parent node of t is denoted P(t). We use the special notations $q_{t}^{P}$ and $V_{t}^{P}$ for q_P(t) and V_P(t), respectively. Analogously, the two direct descendants of node t are denoted L(t) and R(t), and we use the special notations $q_{t}^{L}, q_{t}^{R}, V_{t}^{L}$ , and $V_{t}^{R}$ for q_L(t), q_R(t), V_L(t), and V_R(t), respectively. We index branches by the node into which they are leading, and use Δ_t to denote the length (in time units) of the tth branch. We assume that the tree topology, as well as all the branch lengths Δ₁, . . . , Δ_N–1, are known.

3. Methods

3.1. The Probabilistic Model

A graphical model is a mathematical graph whose nodes symbolize random variables, and whose branches describe dependence relationships between them (22). A bifurcating phylogenetic tree, when viewed as a graphical model, depicts the probabilistic model

\Pr (q_{0}) \prod_{t = 1}^{N - 1} \Pr (q_{t} ∣ q_{t}^{P}) .

[1]

We use the notation π_i = Pr(q₀ = i) to describe the prior probability of the root, and $A_{i j} (g, t) = \Pr (q_{t} = j ∣ q_{t}^{P} = i, g)$ to describe the transition probability for gene g along branch t. In our model, we assume that the transition probability depends on both the gene and the branch, and that it takes the explicit form

A (g, t) = (\begin{matrix} 1 - ξ_{t} (1 - e^{- η_{g} Δ_{t}}) & ξ_{t} (1 - e^{- η_{g} Δ_{t}}) \\ 1 - (1 - ϕ_{t}) e^{- θ_{g} Δ_{t}} & (1 - ϕ_{t}) e^{- θ_{g} Δ_{t}} \end{matrix}) .

[2]

Here, η_g and θ_g are nonnegative parameters, determining the intron gain and loss rates, respectively, of gene g. Complementarily, ξ_t and φ_t determine the intron gain and loss coefficients of branch t, respectively, and are bound to the range 0 ≤ ξ_t, φ_t ≤ 1. The probability that an intron present in gene g at the beginning of branch t is retained along the branch is $(1 - ϕ_{t}) e^{- θ_{g} Δ_{t}}$ , that is, it is retained only if the branch does not lose it (with probability 1 – φ_t), and also the gene does not lose it (with probability $e^{- θ_{g} Δ_{t}}$ ). This reflects a reality where strong forces to strip a gene of its introns will be practically unaffected by the particular lineage, and, conversely, strong forces to strip a lineage of its introns will be practically unaffected by the particular gene. In the same spirit, the probability that an intron is gained in gene g along branch t is $ξ_{t} (1 - e^{- η_{g} Δ_{t}})$ , that is, it is gained only if both the branch “approves” it (with probability ξ_t) and the gene “approves” it (with probability $1 - e^{- η_{g} Δ_{t}}$ ).
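As an illustration of Eq. [2], the following Python sketch builds the 2 × 2 transition matrix for one gene and one branch. The function name `transition_matrix` and its argument names are hypothetical, chosen here only to mirror the symbols ξ_t, φ_t, η_g, θ_g, and Δ_t.

```python
import math

def transition_matrix(xi_t, phi_t, eta_g, theta_g, delta_t):
    """Return A(g, t) of Eq. [2] as a nested list A[i][j].

    Row index i is the parent state, column index j the child state
    (0 = intron absent, 1 = intron present).
    """
    if not (0.0 <= xi_t <= 1.0 and 0.0 <= phi_t <= 1.0):
        raise ValueError("xi_t and phi_t must lie in [0, 1]")
    if eta_g < 0.0 or theta_g < 0.0:
        raise ValueError("eta_g and theta_g must be nonnegative")
    # Gain requires both the branch and the gene to 'approve' it
    gain = xi_t * (1.0 - math.exp(-eta_g * delta_t))
    # Retention requires that neither the branch nor the gene loses the intron
    retain = (1.0 - phi_t) * math.exp(-theta_g * delta_t)
    return [[1.0 - gain, gain],
            [1.0 - retain, retain]]

A = transition_matrix(xi_t=0.3, phi_t=0.1, eta_g=0.5, theta_g=2.0, delta_t=0.4)
```

Site-specific rate variability would enter such a function simply by passing the scaled rates in place of η_g and θ_g.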

In other fields of molecular evolution, it was long realized that analysis precision improves if one allows for rate variability across sites (17, 18). Typically, such rate variability is modeled by introducing a rate variable, r, which scales, for each site, the time units of the phylogenetic tree, Δ_t ← r · Δ_t. This rate variable is a random variable, distributed according to a distribution function with nonnegative domain and unit mean, typically the unit-mean gamma distribution. The rate variability reflects the idea that sites differ in their rate of evolution. Specifically, there are fast-evolving sites (r ≫ 1), as well as slow-evolving ones (r ≪ 1). In our model of intron evolution we extend this idea by assuming that the gain and loss processes are subject to rate variability, independently of each other. Hence, a site can have any combination of gain and loss rates. To accommodate this idea, we use two independent rate variables, r^η and r^θ, that are used to scale, for each site, the gene-specific gain rate, $η_{g} \leftarrow r^{η} \cdot η_{g}$ , and the gene-specific loss rate, $θ_{g} \leftarrow r^{θ} \cdot θ_{g}$ . We further assume that the distributions of these rate variables are independent of the genes, and are explicitly given by

\begin{matrix} r^{η} & \sim ν δ (η) + (1 - ν) Γ (η; λ_{η}) \\ r^{θ} & \sim Γ (θ; λ_{θ}) . \end{matrix}

[3]

Here, Γ(x; λ) is the unit-mean gamma distribution of variable x with shape parameter λ, δ(x) is the Dirac delta-function, and ν is the fraction of sites that are assumed to have zero gain rate. These latter sites, denoted invariant sites, represent sites that are not proto-splice sites (19–21). Intron loss does not have an invariant counterpart, as the assumption is that once an intron is gained, it can always be lost. Therefore, the loss rate variable is assumed to be distributed according to a gamma distribution, which is by far the most popular in describing rate variability (17, 18, 23).

In practice, the rate distributions in Eq. [3] are rendered discrete (24). We assume that the gain rate variable can take K_η discrete values $r_{1}^{η} = 0, r_{2}^{η}, \dots, r_{K_{η}}^{η}$ with probabilities $f_{1}^{η} = ν, f_{2}^{η} \dots, f_{K_{η}}^{η}$ such that $\sum_{k = 1}^{K_{η}} f_{k}^{η} = 1$ . Analogously, we assume that the loss rate variable can take K_θ discrete values $r_{1}^{θ}, \dots, r_{K_{θ}}^{θ}$ with probabilities $f_{1}^{θ}, \dots, f_{K_{θ}}^{θ}$ such that $\sum_{k = 1}^{K_{θ}} f_{k}^{θ} = 1$ . For a particular gain rate value $r_{k}^{η}$ , we denote the actual gain rate $r_{k}^{η} \cdot η_{g}$ by $η_{k g}$ . Similarly, for a particular loss rate value $r_{k}^{θ}$ , we denote the actual loss rate $r_{k}^{θ} \cdot θ_{g}$ by θ_kg.

For notational clarity, we aggregate the model parameters into a small number of sets. To this end, let Ξ_t = {ξ_t, φ_t} be the set of parameters that are specific for branch t, and let Ξ = (Ξ₁, . . . , Ξ_{N – 1}) be the set of all branch-specific parameters. Similarly, let $Ψ_{g} = (η_{g}, θ_{g})$ be the set of parameters that are specific for gene g, and let Ψ = (Ψ₁ . . . ,Ψ_G) be the set of all gene-specific parameters. Additionally, we denote by Λ = (ν, λ_η, λ_θ) the parameters that determine the rate variability. When the distinction between the different sets of parameters is irrelevant, we shall use Θ = (Ξ, Ψ, Λ) as the set of all the model's parameters. We achieve further succinctness in notations by denoting the actual gene-specific rate values for particular values $r_{k}^{η}$ and $r_{k^{'}}^{θ}$ of the rate variables as $Ψ_{k k^{'} g} = (η_{k g}, θ_{k^{'} g})$ .

3.2. The EM Algorithm

For each site, the S leaves form a set of observed random variables, their states being described by the corresponding pattern ω_p. The states of all the internal nodes, denoted σ, form a set of hidden random variables, that is, random variables whose state is not observed. In order to account for rate variability across sites, we associate with each pattern two hidden random variables, $ρ_{p}^{η}$ and $ρ_{p}^{θ}$ , that determine the value of the rate variables in that site. To sum up, the observed random variables are ω_p, and the hidden random variables are $(σ, ρ_{p}^{η}, ρ_{p}^{θ})$ .

We assume that sites within a gene, as well as the genes themselves, evolve independently. Therefore, the total likelihood can be decomposed as

L (M_{1}, \dots, M_{G} ∣ ϴ) = \prod_{g = 1}^{G} L (M_{g} ∣ Ξ, Ψ_{g}, Λ) = \prod_{g = 1}^{G} \prod_{p = 1}^{Ω} L {(ω_{p} ∣ Ξ, Ψ_{g}, Λ)}^{n_{g p}} .

and so

log L (M_{1}, \dots, M_{G} ∣ ϴ) = \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} n_{g p} log L (ω_{p} ∣ Ξ, Ψ_{g}, Λ) .

[4]

According to the well-known EM paradigm (25), log L(M₁, . . . , M_G|Θ) is guaranteed not to decrease as long as we maximize, with respect to Θ and for fixed current estimates Θ⁰, the auxiliary function

Q (ϴ, ϴ^{0}) = \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} n_{g p} Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}),

[5]

where

Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \sum_{σ, ρ_{p}^{η}, ρ_{p}^{θ}} \Pr (σ, ρ_{p}^{η}, ρ_{p}^{θ} ∣ ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) log \Pr (ω_{p}, σ, ρ_{p}^{η}, ρ_{p}^{θ} ∣ Ξ, Ψ_{g}, Λ) .

[6]

Using some manipulations (see Note 1), this can be written as

Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} [\Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'} ∣ ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0})] \cdot [\sum_{σ} \Pr (σ ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0}) \cdot {log f_{k}^{η} + log f_{k^{'}}^{θ} + log \Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}})}] .

Denoting by $w_{g p k k^{'}}$ and $Q_{g p k k^{'}}$ the first and second square brackets, respectively, this expression becomes

Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} w_{g p k k^{'}} Q_{g p k k^{'}},

[7]

and consequently

Q (ϴ, ϴ^{0}) = \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} n_{g p} w_{g p k k^{'}} Q_{g p k k^{'}} .

[8]

3.2.1. The E-Step

In this step we compute the function $Q (ϴ, ϴ^{0})$ , or, equivalently, the set of coefficients $w_{g p k k^{'}}$ and $Q_{g p k k^{'}}$ . We accomplish this with the aid of an inward–outward recursion on the tree.

3.2.1.1. The Inward (γ) Recursion

Here we propose a variation on the well-known Felsenstein's pruning algorithm (26). Let us associate with each node t (except for the root) a vector $γ_{i}^{g p k k^{'}} (t) = \Pr (V_{t} ∣ q_{t}^{P} = i, Ξ^{0}, Ψ_{g k k^{'}}^{0})$ . In words, $γ_{i}^{g p k k^{'}}$ (t) is the probability of observing the nodes V_t (which are a subset of the pattern ω_p) for a gene g, when the gain and loss rate variables are $r_{k}^{η}$ and $r_{k^{'}}^{θ}$ , respectively, and when the parent node of t is known to be in state i. By definition, this function is initialized at all leaves (t ∈ V₀) by

γ (t \in V_{0}) = {\begin{matrix} (\begin{matrix} 1 - ξ_{t} (1 - e^{- η_{g k} Δ_{t}}) \\ 1 - (1 - ϕ_{t}) e^{- θ_{g k^{'}} Δ_{t}} \end{matrix}) & q_{t} = 0 \\ (\begin{matrix} ξ_{t} (1 - e^{- η_{g k} Δ_{t}}) \\ (1 - ϕ_{t}) e^{- θ_{g k^{'}} Δ_{t}} \end{matrix}) & q_{t} = 1 . \end{matrix}

[9]

Here, and in the derivations to follow, we omit the superscript from γ. For all internal nodes (except for the root), γ is computed using the recursion

γ_{i} (t) = \sum_{j = 0}^{1} A_{i j} (g, t) {\tilde{γ}}_{j} (t),

[10]

where ${\tilde{γ}}_{j} (t)$ is defined as γ_j[L(t)]γ_j[R(t)] (see Note 2).

The γ-recursion allows for computing the likelihood of any observed pattern ω_p, given the values of the rate variables:

\Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0}) = \Pr (V_{0} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0}) = \Pr (V_{0}^{L}, V_{0}^{R} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0}) = \sum_{i = 0}^{1} \Pr (V_{0}^{L}, V_{0}^{R}, q_{0} = i ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0}) = \sum_{i = 0}^{1} \Pr (q_{0} = i ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0}) \cdot \Pr (V_{0}^{L} ∣ q_{0} = i, Ξ^{0}, Ψ_{g k k^{'}}^{0}) \cdot \Pr (V_{0}^{R} ∣ V_{0}^{L}, q_{0} = i, Ξ^{0}, Ψ_{g k k^{'}}^{0}) .

Given q₀, $V_{0}^{R}$ is independent of $V_{0}^{L}$ , and so

\Pr (V_{0}^{R} ∣ V_{0}^{L}, q_{0} = i, Ξ^{0}, Ψ_{g k k^{'}}^{0}) = \Pr (V_{0}^{R} ∣ q_{0} = i, Ξ^{0}, Ψ_{g k k^{'}}^{0}),

and

\Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0}) = \sum_{i = 0}^{1} π_{i} {\tilde{γ}}_{i} (0) .

[11]

This γ-recursion can be easily modified to incorporate missing data (see Note 3).
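The inward recursion of Eqs. [9]–[11] can be sketched in a few lines of Python. This is only an illustrative sketch under assumed data structures: `children` maps each internal node to its pair of direct descendants, `leaf_state` maps each leaf to '0', '1', or '*', and `A[t]` holds the 2 × 2 transition matrix of branch t for the gene and rate categories at hand. As a convenience that is not in the text, the sketch defines γ̃ at a leaf as an indicator vector, so that Eq. [10] applied to a leaf reproduces Eq. [9], and a vector of ones for missing data, which reproduces the option described in Note 3.

```python
def gamma_tilde(t, children, leaf_state, A):
    """Return [Pr(V_t | q_t = 0), Pr(V_t | q_t = 1)]."""
    if t in leaf_state:
        s = leaf_state[t]
        if s == "*":
            return [1.0, 1.0]          # missing data: no constraint
        return [1.0 if s == "0" else 0.0, 1.0 if s == "1" else 0.0]
    left, right = children[t]
    g_left = gamma(left, children, leaf_state, A)
    g_right = gamma(right, children, leaf_state, A)
    return [g_left[j] * g_right[j] for j in (0, 1)]

def gamma(t, children, leaf_state, A):
    """Eq. [10]: gamma_i(t) = sum_j A_ij(g, t) * gamma_tilde_j(t), for t != 0."""
    gt = gamma_tilde(t, children, leaf_state, A)
    return [sum(A[t][i][j] * gt[j] for j in (0, 1)) for i in (0, 1)]

def pattern_likelihood(children, leaf_state, A, pi):
    """Eq. [11]: Pr(pattern) = sum_i pi_i * gamma_tilde_i(0)."""
    gt0 = gamma_tilde(0, children, leaf_state, A)
    return sum(pi[i] * gt0[i] for i in (0, 1))

# Toy tree: root 0 with children 1 and 2; node 1 has leaf children 3 and 4.
children = {0: (1, 2), 1: (3, 4)}
A = {1: [[0.9, 0.1], [0.3, 0.7]], 2: [[0.8, 0.2], [0.5, 0.5]],
     3: [[0.9, 0.1], [0.3, 0.7]], 4: [[0.7, 0.3], [0.2, 0.8]]}
pi = [0.6, 0.4]
lik = pattern_likelihood(children, {2: "1", 3: "0", 4: "1"}, A, pi)
```

The recursive calls recompute nothing twice on a bifurcating tree, since each node is visited exactly once per pattern.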

3.2.1.2. The Outward (α) Recursion

Once the γ-recursion is computed, we can use it to compute a second, complementary, recursion. To this end, let us associate with each node t (except for the root node) a matrix $α_{i j}^{g p k k^{'}} (t) = \Pr (q_{t} = j, q_{t}^{P} = i ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0})$ . It is beneficial to define for each node t (except for the root node) a vector $β_{j}^{g p k k^{'}} (t) = \sum_{i = 0}^{1} α_{i j}^{g p k k^{'}} (t) = \Pr (q_{t} = j ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0})$ . Upon the computation of α, β is readily computed too. Again, omitting the superscripts, α can be initialized from its definition on the two direct descendants of the root,

α (D (0)) = \frac{1}{\Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0})} \cdot \begin{cases} \begin{pmatrix} π_{0} γ_{0} (\bar{D} (0)) A_{00} (g, D (0)) & 0 \\ π_{1} γ_{1} (\bar{D} (0)) A_{10} (g, D (0)) & 0 \end{pmatrix} & D (0) \in V_{0}, q_{D (0)} = 0 \\ \begin{pmatrix} 0 & π_{0} γ_{0} (\bar{D} (0)) A_{01} (g, D (0)) \\ 0 & π_{1} γ_{1} (\bar{D} (0)) A_{11} (g, D (0)) \end{pmatrix} & D (0) \in V_{0}, q_{D (0)} = 1 \\ \begin{pmatrix} π_{0} γ_{0} (\bar{D} (0)) {\tilde{γ}}_{0} (D (0)) A_{00} (g, D (0)) & π_{0} γ_{0} (\bar{D} (0)) {\tilde{γ}}_{1} (D (0)) A_{01} (g, D (0)) \\ π_{1} γ_{1} (\bar{D} (0)) {\tilde{γ}}_{0} (D (0)) A_{10} (g, D (0)) & π_{1} γ_{1} (\bar{D} (0)) {\tilde{γ}}_{1} (D (0)) A_{11} (g, D (0)) \end{pmatrix} & D (0) \notin V_{0} . \end{cases}

[12]

Here, D(0) stands for any one of the direct descendants of the root, $\bar{D} (0)$ is its sibling, and $q_{D (0)}$ is the observed state of D(0) when it is a leaf. For any other internal node, α is computed using the outward-recursion

α (t) = (\begin{matrix} β_{0} (P (t)) {\tilde{γ}}_{0} (t) A_{00} (g, t) ∕ γ_{0} (t) & β_{0} (P (t)) {\tilde{γ}}_{1} (t) A_{01} (g, t) ∕ γ_{0} (t) \\ β_{1} (P (t)) {\tilde{γ}}_{0} (t) A_{10} (g, t) ∕ γ_{1} (t) & β_{1} (P (t)) {\tilde{γ}}_{1} (t) A_{11} (g, t) ∕ γ_{1} (t) \end{matrix})

[13]

(see Note 4).

Finally, for each leaf that is not a direct descendant of the root,

α (t) = {\begin{matrix} (\begin{matrix} β_{0} (P (t)) & 0 \\ β_{1} (P (t)) & 0 \end{matrix}) & q_{t} = 0 \\ (\begin{matrix} 0 & β_{0} (P (t)) \\ 0 & β_{1} (P (t)) \end{matrix}) & q_{t} = 1 . \end{matrix} t \in V_{0}, P (t) \neq 0

[14]

Again, this recursion can be straightforwardly modified when missing data are present (see Note 5).

These inward–outward recursions are the phylogenetic equivalent of the backward–forward recursions known from hidden Markov models, and other versions of it have already been developed (27, 28). The version that we developed here can be shown to be the realization of the junction tree algorithm (29) on rooted bifurcating trees (see Note 6).
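A compact Python sketch of the combined inward–outward pass is given below. It is a simplification rather than a transcription of Eqs. [12]–[14]: it sets β_i(0) = π_i γ̃_i(0)/Pr(ω_p) at the root and then applies the single formula of Eq. [13] to every non-root node, which is algebraically equivalent to the three separate cases whenever γ_i(t) > 0. The data structures (`children`, `leaf_state`, `A`, `pi`) and the indicator-vector convention for γ̃ at leaves are assumptions of this sketch; for a leaf with missing data it simply falls back on the same formula.

```python
def inward(children, leaf_state, A):
    """Return dicts g_tilde[t] = Pr(V_t | q_t = j) and g[t] = Pr(V_t | q_P(t) = i)."""
    g_tilde, g = {}, {}

    def visit(t):
        if t in leaf_state:
            s = leaf_state[t]
            g_tilde[t] = [1.0, 1.0] if s == "*" else [float(s == "0"), float(s == "1")]
        else:
            left, right = children[t]
            visit(left)
            visit(right)
            g_tilde[t] = [g[left][j] * g[right][j] for j in (0, 1)]
        if t != 0:  # the root has no incoming branch
            g[t] = [sum(A[t][i][j] * g_tilde[t][j] for j in (0, 1)) for i in (0, 1)]

    visit(0)
    return g_tilde, g

def outward(children, leaf_state, A, pi):
    """Return (alpha, beta, likelihood) for one pattern."""
    g_tilde, g = inward(children, leaf_state, A)
    lik = sum(pi[i] * g_tilde[0][i] for i in (0, 1))
    beta = {0: [pi[i] * g_tilde[0][i] / lik for i in (0, 1)]}
    alpha = {}
    parent = {c: p for p, kids in children.items() for c in kids}
    stack = list(children[0])
    while stack:  # parents are always processed before their children
        t = stack.pop()
        b = beta[parent[t]]
        alpha[t] = [[b[i] * A[t][i][j] * g_tilde[t][j] / g[t][i] if g[t][i] > 0 else 0.0
                     for j in (0, 1)] for i in (0, 1)]
        beta[t] = [alpha[t][0][j] + alpha[t][1][j] for j in (0, 1)]
        stack.extend(children.get(t, ()))
    return alpha, beta, lik

children = {0: (1, 2), 1: (3, 4)}
A = {1: [[0.9, 0.1], [0.3, 0.7]], 2: [[0.8, 0.2], [0.5, 0.5]],
     3: [[0.9, 0.1], [0.3, 0.7]], 4: [[0.7, 0.3], [0.2, 0.8]]}
pi = [0.6, 0.4]
alpha, beta, lik = outward(children, {2: "1", 3: "0", 4: "1"}, A, pi)
```

On this toy tree, `alpha[1]` is the posterior joint distribution of the root and its internal child, which can be checked against a direct enumeration of the two hidden states.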

3.2.1.3. Computing the Coefficients w_gpkk′

Here we show that the γ-recursion is sufficient to compute the coefficients w_gpkk′. From the definition, $w_{g p k k^{'}} = \Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'} ∣ ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0})$ . Using the Bayes formula Pr(x, y|z) = Pr(x, y, z)/Σ_x,y Pr(x, y, z), we can rewrite it as

\begin{matrix} w_{g p k k^{'}} & = \frac{\Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'}, ω_{p} ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0})}{\sum_{h, h^{'}} \Pr (ρ_{p}^{η} = h, ρ_{p}^{θ} = h^{'}, ω_{p} ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0})} \\ & = \frac{\Pr (ρ_{p}^{η} = k ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ρ_{p}^{θ} = k^{'} ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0})}{\sum_{h, h^{'}} \Pr (ρ_{p}^{η} = h ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ρ_{p}^{θ} = h^{'} ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g h h^{'}}^{0})} . \end{matrix}

But $\Pr (ρ_{p}^{η} = k ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0})$ is just the current estimate of the probability of the gain rate variable to have the value $r_{k}^{η}$ , namely ${(f_{k}^{η})}^{0}$ . Similarly, $\Pr (ρ_{p}^{θ} = k^{'} ∣ Ξ^{0}, Ψ_{g}^{0}, Λ^{0})$ is just ${(f_{k^{'}}^{θ})}^{0}$ . Therefore, the expression for the coefficients $w_{g p k k^{'}}$ is reduced to

w_{g p k k^{'}} = \frac{{(f_{k}^{η})}^{0} {(f_{k^{'}}^{θ})}^{0} \Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0})}{\sum_{h, h^{'}} {(f_{h}^{η})}^{0} {(f_{h^{'}}^{θ})}^{0} \Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g h h^{'}}^{0})} .

[15]

The function $\Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0})$ is the likelihood of observing pattern ω_p for gain and loss rate variables $r_{k}^{η}$ and $r_{k^{'}}^{θ}$ , respectively. This is readily computed upon completion of the γ-recursion, using Eq. [11].
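Eq. [15] amounts to normalizing a small table of weighted likelihoods, as the following Python sketch illustrates. The function name `rate_posteriors` and the layout of its inputs (`lik[k][k2]` holding the pattern likelihood of Eq. [11] for gain category k and loss category k2) are assumptions of this illustration.

```python
def rate_posteriors(f_eta, f_theta, lik):
    """Compute the coefficients w_gpkk' of Eq. [15] for one gene and one pattern.

    f_eta, f_theta: current category probabilities of the gain and loss
    rate variables; lik[k][k2]: pattern likelihood given categories (k, k2).
    Returns (w, site_likelihood), where site_likelihood is the denominator
    of Eq. [15], i.e. the pattern likelihood averaged over rate categories.
    """
    joint = [[f_eta[k] * f_theta[k2] * lik[k][k2] for k2 in range(len(f_theta))]
             for k in range(len(f_eta))]
    total = sum(map(sum, joint))
    if total <= 0.0:
        raise ValueError("pattern has zero likelihood under the current parameters")
    w = [[x / total for x in row] for row in joint]
    return w, total

w, site_lik = rate_posteriors([0.5, 0.5], [1.0], [[0.2], [0.6]])
```

The denominator returned as `site_lik` is the per-pattern likelihood that enters the total log-likelihood, so it can also be used to monitor convergence of the iterations.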

3.2.1.4. Computing the Coefficients Q_gpkk′

Here we show that these coefficients require the α, β-recursion. By definition,

Q_{g p k k^{'}} = \sum_{σ} \Pr (σ ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0}) \cdot [log f_{k}^{η} + log f_{k^{'}}^{θ} + log \Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}})] .

The probability $\Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}})$ is just the likelihood of a particular realization of the tree, thus from Eq. [1]

log \Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}}) = \sum_{i = 0}^{1} δ (q_{0}, i) \cdot log π_{i} + \sum_{i, j = 0}^{1} \sum_{t = 1}^{N - 1} δ (q_{t}, j) δ (q_{t}^{P}, i) \cdot log A_{i j} (g, t) .

[16]

Here, δ(a, b) is the Kronecker delta function, which is 1 for a = b and 0 otherwise. Denote the expectation over $\Pr (σ ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0})$ by E_σ. Applying it to Eq. [16], we get

E_{σ} [log \Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}})] = \sum_{i = 0}^{1} log π_{i} \cdot E_{σ} [δ (q_{0}, i)] + \sum_{i, j = 0}^{1} \sum_{t = 1}^{N - 1} log A_{i j} (g, t) \cdot E_{σ} [δ (q_{t}, j) δ (q_{t}^{P}, i)] .

But $E_{σ} [δ (q_{0}, i)] = \Pr (q_{0} = i ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0}) = β_{i} (0)$ , and similarly $E_{σ} [δ (q_{t}, j) δ (q_{t}^{P}, i)] = α_{i j} (t)$ . Hence, $Q_{g p k k^{'}}$ is given by

\begin{matrix} Q_{g p k k^{'}} & = \sum_{σ} \Pr (σ ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0}) [log f_{k}^{η} + log f_{k^{'}}^{θ} + log \Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}})] \\ = log f_{k}^{η} + log f_{k^{'}}^{θ} + \sum_{i = 0}^{1} β_{i} (0) log π_{i} + \sum_{i, j = 0}^{1} \sum_{t = 1}^{N - 1} α_{i j} (t) log A_{i j} (g, t) . \end{matrix}

[17]

3.2.2. The M-Step

Substituting Eq. [17] in Eq. [8], we obtain an explicit form of the function whose maximization guarantees stepping up-hill in the likelihood landscape,

\begin{matrix} Q = & \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} n_{g p} w_{g p k k^{'}} (log f_{k}^{η} + log f_{k^{'}}^{θ}) + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} n_{g p} w_{g p k k^{'}} [β_{0}^{g p k k^{'}} (0) log π_{0} + β_{1}^{g p k k^{'}} (0) log π_{1}] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k^{'}} α_{00}^{g p k k^{'}} (t) log [1 - ξ_{t} (1 - e^{- η_{g k} Δ_{t}})] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k^{'}} α_{01}^{g p k k^{'}} (t) [log ξ_{t} + log (1 - e^{- η_{g k} Δ_{t}})] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k^{'}} α_{10}^{g p k k^{'}} (t) log [1 - (1 - ϕ_{t}) e^{- θ_{g k^{'}} Δ_{t}}] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k^{'}} α_{11}^{g p k k^{'}} (t) [log (1 - ϕ_{t}) - θ_{g k^{'}} Δ_{t}] . \end{matrix}

[18]

Actually, any increase in $Q$ is sufficient to guarantee an increase in the likelihood, suggesting that a precise maximization of $Q$ is not very important. Therefore, we speed computations by performing low-tolerance maximization with respect to each of the parameters individually. Except for the parameters λ_η and λ_θ, it is easy to differentiate $Q$ twice with respect to any parameter. This lends itself to the use of simple zero-finding algorithms applied to the first derivative; we chose the Newton-Raphson algorithm (30). Maximizing $Q$ with respect to the shape parameters λ_η and λ_θ is more involved, as $Q$ depends on these parameters only through the discrete approximation of the rate variability distributions, Eq. [3] (see Note 7).
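To make the single-parameter updates concrete, consider ξ_t. Collecting the terms of Eq. [18] that depend on ξ_t gives a function of the form f(ξ) = Σ_m a_m log(1 − ξ c_m) + b log ξ, where each a_m is a weight n_gp w_gpkk′ α₀₀(t), c_m is the corresponding 1 − e^{−η_gk Δ_t}, and b is the sum of the weights n_gp w_gpkk′ α₀₁(t). The Python sketch below applies Newton-Raphson iterations to the derivative of f. The function name, the clamping of ξ to the open interval (0, 1), and the default tolerance are assumptions of this illustration, not details taken from the original implementation.

```python
def update_xi(terms, b, xi0=0.5, tol=1e-6, max_iter=50):
    """Maximize f(xi) = sum_m a_m*log(1 - xi*c_m) + b*log(xi) over 0 < xi < 1.

    terms: list of (a_m, c_m) pairs with a_m >= 0 and 0 <= c_m <= 1.
    b: nonnegative total weight of gain events on the branch.
    """
    eps = 1e-9
    xi = xi0
    for _ in range(max_iter):
        # First and second derivatives of f with respect to xi
        d1 = b / xi - sum(a * c / (1.0 - xi * c) for a, c in terms)
        d2 = -b / xi ** 2 - sum(a * c * c / (1.0 - xi * c) ** 2 for a, c in terms)
        if d2 == 0.0:
            break                      # f is flat: nothing to update
        # Newton-Raphson step, kept strictly inside (0, 1)
        new_xi = min(1.0 - eps, max(eps, xi - d1 / d2))
        if abs(new_xi - xi) < tol:
            return new_xi
        xi = new_xi
    return xi

xi_simple = update_xi([(3.0, 1.0)], b=1.0)          # closed form: b / (a + b)
xi_mixed = update_xi([(2.0, 0.5), (1.0, 0.8)], b=1.5)
```

Because f is concave in ξ, a few iterations usually suffice, which fits the low-tolerance strategy described above; analogous updates for φ_t, η_g, and θ_g would follow from differentiating their own terms in Eq. [18].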

Footnotes

¹

If we replace the formal summing over all states of $ρ_{p}^{η}$ and $ρ_{p}^{θ}$ in Eq. [6] by a direct sum, we get

Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \sum_{k = 1}^{K_{η}} \sum_{k^{'} = 1}^{K_{θ}} \sum_{σ} \Pr (σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'} ∣ ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) log \Pr (ω_{p}, σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'} ∣ Ξ, Ψ_{g}, Λ) .

[19]

Using our notational conventions, we can write the first term in Eq. [19] as

\Pr (σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'} ∣ ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'} ∣ ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (σ ∣ ω_{p}, Ξ^{0}, Ψ_{g k k^{'}}^{0}),

[20]

and the second term as

log \Pr (ω_{p}, σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k^{'} ∣ Ξ, Ψ_{g}, Λ) = log \Pr (ρ_{p}^{η} = k ∣ Ξ, Ψ_{g}, Λ) + log \Pr (ρ_{p}^{θ} = k^{'} ∣ Ξ, Ψ_{g}, Λ) + log \Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}}) = log f_{k}^{η} + log f_{k^{'}}^{θ} + log \Pr (ω_{p}, σ ∣ Ξ, Ψ_{g k k^{'}}) .

[21]

Substituting Eqs. [20] and [21] back in Eq. [19] gives the desired result.

²

We expand

\begin{matrix} γ_{i} (t) & = \Pr (V_{t} ∣ q_{t}^{P} = i) = \Pr (V_{t}^{L}, V_{t}^{R} ∣ q_{t}^{P} = i) \\ = \sum_{j = 0}^{1} \Pr (V_{t}^{L}, V_{t}^{R}, q_{t} = j ∣ q_{t}^{P} = i) = \\ = \sum_{j = 0}^{1} \Pr (q_{t} = j ∣ q_{t}^{P} = i) \cdot \Pr (V_{t}^{L} ∣ q_{t} = j, q_{t}^{P} = i) \cdot \Pr (V_{t}^{R} ∣ V_{t}^{L}, q_{t} = j, q_{t}^{P} = i) . \end{matrix}

[22]

The first term is simply the definition of $A_{i j} (g, t)$ . Given $q_{t}$ , $V_{t}^{L}$ is independent of $q_{t}^{P}$ , thus the second term is just $\Pr (V_{t}^{L} ∣ q_{t} = j) = γ_{j} (t^{L})$ . By similar arguments, the third term is just $\Pr (V_{t}^{R} ∣ q_{t} = j) = γ_{j} (t^{R})$ . Substituting these results in Eq. [22], we recover the recursion formula, Eq. [10].

One of the appealing features of this recursion is that it allows missing data to be treated fairly easily. Only a single option has to be added to the initialization phase, Eq. [9],

γ (t \in V_{0}) = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad q_{t} = * .
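The effect of this extra initialization option can be checked numerically. The sketch below is an illustration under assumed values: the four-node tree (root with children t and c; t with leaf children a and b), the root prior `pi`, and the transition matrices are all hypothetical, and the helper names are not from the original implementation. It shows that setting γ = (1, 1) for a missing leaf gives the same likelihood as summing over both possible states of that leaf.

```python
def leaf_gamma(A, state):
    """gamma_i(t) = Pr(V_t | parent state i) for a leaf t.

    A[i][j] = Pr(q_t = j | q_t^P = i); state is 0, 1, or "*" (missing).
    """
    if state == "*":
        return [1.0, 1.0]  # a missing leaf carries no information
    return [A[0][state], A[1][state]]

def internal_gamma(A, child_gammas):
    """Inward recursion for an internal node from its children's gammas."""
    gamma_tilde = [1.0, 1.0]  # Pr(V_t | q_t = j)
    for g in child_gammas:
        gamma_tilde = [gamma_tilde[j] * g[j] for j in range(2)]
    return [sum(A[i][j] * gamma_tilde[j] for j in range(2)) for i in range(2)]

def likelihood(pi, A, obs):
    """Likelihood of the leaf pattern on the tree root -> (t -> (a, b), c)."""
    g_t = internal_gamma(A["t"], [leaf_gamma(A["a"], obs["a"]),
                                  leaf_gamma(A["b"], obs["b"])])
    g_c = leaf_gamma(A["c"], obs["c"])
    return sum(pi[i] * g_t[i] * g_c[i] for i in range(2))

# Hypothetical root prior and per-branch transition matrices (rows sum to 1).
pi = [0.7, 0.3]
A = {
    "t": [[0.9, 0.1], [0.2, 0.8]],
    "a": [[0.95, 0.05], [0.3, 0.7]],
    "b": [[0.85, 0.15], [0.4, 0.6]],
    "c": [[0.8, 0.2], [0.1, 0.9]],
}

lik_missing = likelihood(pi, A, {"a": 1, "b": 0, "c": "*"})
lik_c0 = likelihood(pi, A, {"a": 1, "b": 0, "c": 0})
lik_c1 = likelihood(pi, A, {"a": 1, "b": 0, "c": 1})
print(lik_missing, lik_c0 + lik_c1)
```

The two printed values agree because each row of a transition matrix sums to one, so marginalizing over the unobserved leaf state leaves a factor of 1 for either parent state.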

⁴

To prove this recursion, let us start with the definition of α,

\begin{matrix} α_{i j} (t) & = \Pr (q_{t} = j, q_{t}^{P} = i ∣ ω_{p}) = \Pr (q_{t} = j, q_{t}^{P} = i ∣ V_{0}) \\ & = \Pr (q_{t}^{P} = i ∣ V_{0}) \cdot \Pr (q_{t} = j ∣ q_{t}^{P} = i, V_{0}) \\ & = β_{i} (P (t)) \cdot \Pr (q_{t} = j ∣ q_{t}^{P} = i, V_{0}) . \end{matrix}

[23]

Let us make the decomposition V₀ = V_t + V̄_t, with V̄_t being the set of all leaves such that node t is not among their ancestors. But, given $q_{t}^{P}$ , the state of node t is independent of V̄_t, and therefore Eq. [23] becomes

α_{i j} (t) = β_{i} (P (t)) \cdot \Pr (q_{t} = j ∣ q_{t}^{P} = i, V_{t}) .

[24]

From Bayes formula,

\begin{matrix} \Pr (q_{t} = j ∣ q_{t}^{P} = i, V_{t}) & = \frac{\Pr (q_{t} = j, V_{t} ∣ q_{t}^{P} = i)}{\Pr (V_{t} ∣ q_{t}^{P} = i)} \\ & = \frac{\Pr (q_{t} = j ∣ q_{t}^{P} = i) \cdot \Pr (V_{t} ∣ q_{t} = j, q_{t}^{P} = i)}{γ_{i} (t)} \\ & = \frac{A_{i j} (g, t)}{γ_{i} (t)} \cdot \Pr (V_{t} ∣ q_{t} = j, q_{t}^{P} = i) . \end{matrix}

[25]

But given $q_{t}$ , $V_{t}$ is independent of $q_{t}^{P}$ , the state of P(t), and therefore

\Pr (V_{t} ∣ q_{t} = j, q_{t}^{P} = i) = \Pr (V_{t} ∣ q_{t} = j) = {\tilde{γ}}_{j} (t) .

[26]

Combining Eqs. [25] and [26] in Eq. [24], we get

α_{i j} (t) = \frac{{\tilde{γ}}_{j} (t) β_{i} (P (t))}{γ_{i} (t)} A_{i j} (g, t),

which is just another form of writing Eq. [13].
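As a sanity check on this result, the closed-form α can be compared with a brute-force enumeration of the hidden states on a small tree. The sketch below is illustrative only: the tree (root with children t and c; t with leaf children a and b), the prior `pi`, the transition matrices, and the observed leaf states are hypothetical values, and the root posterior β is obtained from Bayes' rule as π_i γ̃_i(root) / Pr(ω), which is an assumption about how the root is initialized.

```python
# Hypothetical root prior and transition matrices, A[i][j] = Pr(child=j | parent=i).
pi = [0.6, 0.4]
A_t = [[0.9, 0.1], [0.2, 0.8]]
A_a = [[0.95, 0.05], [0.3, 0.7]]
A_b = [[0.85, 0.15], [0.4, 0.6]]
A_c = [[0.8, 0.2], [0.1, 0.9]]
obs_a, obs_b, obs_c = 1, 1, 0  # observed leaf states

S = range(2)

# Inward pass: gamma_i(x) = Pr(V_x | parent state i), gamma_tilde_j(x) = Pr(V_x | q_x = j).
g_a = [A_a[i][obs_a] for i in S]
g_b = [A_b[i][obs_b] for i in S]
g_c = [A_c[i][obs_c] for i in S]
gt_t = [g_a[j] * g_b[j] for j in S]
g_t = [sum(A_t[i][j] * gt_t[j] for j in S) for i in S]
gt_root = [g_t[i] * g_c[i] for i in S]

# Likelihood and posterior of the root state (Bayes' rule).
lik = sum(pi[i] * gt_root[i] for i in S)
beta_root = [pi[i] * gt_root[i] / lik for i in S]

# Closed form: alpha_ij(t) = gamma_tilde_j(t) * beta_i(P(t)) * A_ij(t) / gamma_i(t).
alpha = [[gt_t[j] * beta_root[i] * A_t[i][j] / g_t[i] for j in S] for i in S]

# Brute force: joint probability of (root = i, t = j) and the observed leaves.
joint = [[pi[i] * A_t[i][j] * A_a[j][obs_a] * A_b[j][obs_b] * A_c[i][obs_c]
          for j in S] for i in S]
total = sum(joint[i][j] for i in S for j in S)
alpha_bf = [[joint[i][j] / total for j in S] for i in S]

print(alpha)
print(alpha_bf)
```

The two matrices coincide entry by entry, and each sums to one, as expected for a joint posterior over the states of a node and its parent.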

⁵

When missing data are present, two simple modifications are required. First, we have to add to the initialization phase Eq. [12] an option

α (D (0)) = \frac{1}{\Pr (ω_{p} ∣ Ξ^{0}, Ψ_{g k k^{'}}^{0})} \begin{pmatrix} π_{0} γ_{0} [\bar{D} (0)] A_{00} [g, D (0)] & π_{0} γ_{0} [\bar{D} (0)] A_{01} [g, D (0)] \\ π_{1} γ_{1} [\bar{D} (0)] A_{10} [g, D (0)] & π_{1} γ_{1} [\bar{D} (0)] A_{11} [g, D (0)] \end{pmatrix}, \quad D (0) \in V_{0}, \; q_{0}^{D} = * .

Second, we have to add to the finalization phase Eq. [14] an option

α (t) = \begin{pmatrix} β_{0} [P (t)] A_{00} (g, t) & β_{0} [P (t)] A_{01} (g, t) \\ β_{1} [P (t)] A_{10} (g, t) & β_{1} [P (t)] A_{11} (g, t) \end{pmatrix}, \quad q_{t} = * .

⁶

The junction tree algorithm is a scheme to compute marginal probabilities of maximal cliques on graphs by means of belief propagation on a modified junction tree. Indeed, the matrix α computes marginal probabilities of pairs (t, P(t)), but such pairs are nothing but maximal cliques on rooted bifurcating trees.

⁷

In our implementation, we used Yang's quantile method (24) to compute the discrete levels of the gamma distributions such that the levels have equal probability (apart from the first level of the η distribution, which has probability ν). Formally, $f_{1}^{η} = ν, f_{k}^{η} = (1 - ν) ∕ (K_{η} - 1)$ for k = 2, . . . , K_η, and $f_{k}^{θ} = 1 ∕ K_{θ}$ for k = 1, . . . , K_θ. To perform the maximization in this case, we used Brent's maximization algorithm, which does not require derivatives (30).
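An equal-probability discretization of a gamma distribution can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes a gamma distribution with shape λ and mean 1, represents each category by its conditional mean (the text above does not say whether the mean or the median of each category was used), and uses a plain power series and bisection instead of a library routine. The function names are hypothetical.

```python
import math

def reg_lower_gamma(a, x, eps=1e-14):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > eps * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_quantile(p, shape, rate):
    """Quantile of Gamma(shape, rate) by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, rate * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, rate * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_means(lam, K):
    """Means of K equal-probability categories of a mean-1 gamma with shape lam."""
    cuts = [gamma_quantile(k / K, lam, lam) for k in range(1, K)]
    # Partial expectations use P(lam + 1, lam * x); the last boundary is infinity.
    cum = [0.0] + [reg_lower_gamma(lam + 1.0, lam * c) for c in cuts] + [1.0]
    return [K * (cum[k + 1] - cum[k]) for k in range(K)]

means_exp = discrete_gamma_means(1.0, 2)   # exponential case, two categories
means_half = discrete_gamma_means(0.5, 4)  # strongly skewed rates
print(means_exp)
print(means_half)
```

Because the category boundaries depend on λ only through quantiles, a function built on these levels is awkward to differentiate with respect to λ, which is consistent with the use of a derivative-free maximizer for the shape parameters.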

References

1. Nixon JE, Wang A, Morrison HG, McArthur AG, Sogin ML, Loftus BJ, Samuelson J. A spliceosomal intron in Giardia lamblia. Proc Natl Acad Sci U S A. 2002;99:3359–3361. doi: 10.1073/pnas.042700299.
2. Vanacova S, Yan W, Carlton JM, Johnson PJ. Spliceosomal introns in the deep-branching eukaryote Trichomonas vaginalis. Proc Natl Acad Sci U S A. 2005;102:4430–4435. doi: 10.1073/pnas.0407500102.
3. Simpson AG, MacQuarrie EK, Roger AJ. Early origin of canonical introns. Nature. 2002;419:270. doi: 10.1038/419270a.
4. Collins L, Penny D. Complex spliceosomal organization ancestral to extant eukaryotes. Mol Biol Evol. 2005;22:1053–1066. doi: 10.1093/molbev/msi091.
5. Lynch M, Richardson AO. The evolution of spliceosomal introns. Curr Opin Genet Dev. 2002;12:701–710. doi: 10.1016/s0959-437x(02)00360-x.
6. Roy SW, Gilbert W. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet. 2006;7:211–221. doi: 10.1038/nrg1807.
7. Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV. Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Curr Biol. 2003;13:1512–1517. doi: 10.1016/s0960-9822(03)00558-x.
8. Roy SW, Gilbert W. Complex early genes. Proc Natl Acad Sci U S A. 2005;102:1986–1991. doi: 10.1073/pnas.0408355101.
9. Roy SW, Gilbert W. Rates of intron loss and gain: implications for early eukaryotic evolution. Proc Natl Acad Sci U S A. 2005;102:5773–5778. doi: 10.1073/pnas.0500383102.
10. Csuros M. McLysaght A, Huson D, editors. Likely scenarios of intron evolution, Lecture Notes in Bioinformatics. Proc. RECOMB 2005 Comparative Genomics International Workshop (RCG 2005) 2005;3678:47–60.
11. Qiu WG, Schisler N, Stoltzfus A. The evolutionary gain of spliceosomal introns: sequence and phase preferences. Mol Biol Evol. 2004;21:1252–1263. doi: 10.1093/molbev/msh120.
12. Fedorov A, Roy SW, Fedorova L, Gilbert W. Mystery of intron gain. Genome Res. 2003;13:2236–2241. doi: 10.1101/gr.1029803.
13. Cho S, Jin SW, Cohen A, Ellis RE. A phylogeny of Caenorhabditis reveals frequent loss of introns during nematode evolution. Genome Res. 2004;14:1207–1220. doi: 10.1101/gr.2639304.
14. Roy SW, Hartl DL. Very little intron loss/gain in Plasmodium: intron loss/gain mutation rates and intron number. Genome Res. 2006;16:750–756. doi: 10.1101/gr.4845406.
15. Jeffares DC, Mourier T, Penny D. The biology of intron gain and loss. Trends Genet. 2006;22:16–22. doi: 10.1016/j.tig.2005.10.006.
16. Nguyen HD, Yoshihama M, Kenmochi N. New maximum likelihood estimators for eukaryotic intron evolution. PLoS Comput Biol. 2005;1:e79. doi: 10.1371/journal.pcbi.0010079.
17. Nei M, Chakraborty R, Fuerst PA. Infinite allele model with varying mutation rate. Proc Natl Acad Sci U S A. 1976;73:4164–4168. doi: 10.1073/pnas.73.11.4164.
18. Uzzell T, Corbin KW. Fitting discrete probability distributions to evolutionary events. Science. 1971;172:1089–1096. doi: 10.1126/science.172.3988.1089.
19. Dibb NJ. Proto-splice site model of intron origin. J Theor Biol. 1991;151:405–416. doi: 10.1016/s0022-5193(05)80388-1.
20. Dibb NJ, Newman AJ. Evidence that introns arose at proto-splice sites. EMBO J. 1989;8:2015–2021. doi: 10.1002/j.1460-2075.1989.tb03609.x.
21. Sverdlov AV, Rogozin IB, Babenko VN, Koonin EV. Reconstruction of ancestral protosplice sites. Curr Biol. 2004;14:1505–1508. doi: 10.1016/j.cub.2004.08.027.
22. Jordan IM, editor. Learning in Graphical Models. Kluwer Academic Publishers; Boston, MA: 1998.
23. Jin L, Nei M. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol Biol Evol. 1990;7:82–102. doi: 10.1093/oxfordjournals.molbev.a040588.
24. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994;39:306–314. doi: 10.1007/BF00160154.
25. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Statist Soc B. 1977;39:1–38.
26. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376. doi: 10.1007/BF01734359.
27. Friedman N, Ninio M, Pe'er I, Pupko T. A structural EM algorithm for phylogenetic inference. J Comput Biol. 2002;9:331–353. doi: 10.1089/10665270252935494.
28. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004;21:468–488. doi: 10.1093/molbev/msh039.
29. Castillo E, Gutierrez JM, Hadi AS. Expert systems and probabilistic network models (Monographs in Computer Science) Springer; New York: 1996.
30. Press WH, Flannery BP, Teukolsky SA, Vetterling WT. Numerical recipes in C: The art of scientific computing. 2nd ed. Cambridge University Press; New York: 1992.
