A polynomial time algorithm for calculating the probability of a ranked gene tree given a species tree

Tanja Stadler; James H Degnan

doi:10.1186/1748-7188-7-7

2012 Apr 30;7:7. doi: 10.1186/1748-7188-7-7

PMCID: PMC3637458 PMID: 22546066

Abstract

Background

The ancestries of genes form gene trees which do not necessarily have the same topology as the species tree due to incomplete lineage sorting. Available algorithms determining the probability of a gene tree given a species tree require exponential computational runtime.

Results

In this paper, we provide a polynomial time algorithm to calculate the probability of a ranked gene tree topology for a given species tree, where a ranked tree topology is a tree topology with the internal vertices being ordered. The probability of a gene tree topology can thus be calculated in polynomial time if the number of orderings of the internal vertices is a polynomial number. However, the complexity of calculating the probability of a gene tree topology with an exponential number of rankings for a given species tree remains unknown.

Conclusions

Polynomial algorithms for calculating ranked gene tree probabilities may become useful in developing methodology to infer species trees based on a collection of gene trees, leading to a more accurate reconstruction of ancestral species relationships.

Keywords: Incomplete lineage sorting, Coalescent history, Anomalous gene tree, Dynamic programming

Background

Phylogenetic reconstruction methods aim to infer the species phylogeny which gave rise to a group of extant species. Typically, this species phylogeny is obtained based on genetic data from representative individuals of each extant species. The ancestries of genes at different loci form gene trees which do not necessarily have the same topology as the species tree. Gene tree topologies and species tree topologies might be different due to such phenomena as incomplete lineage sorting, gene duplication, recombination within gene loci, and horizontal gene transfer [1]. In this paper, we focus on incomplete lineage sorting as the mechanism for incongruence of gene tree and species tree topologies, in which two gene lineages do not coalesce in the most recent population ancestral to the individuals from which the genes were sampled. As an example, the lineages sampled from species A and B in Figure 1b do not coalesce until the population ancestral to species A, B, and C, thus allowing the B and C lineages in the gene tree to have a more recent common ancestor than lineages A and B.

Figure 1. In (a)–(d) the ranked species tree topology is (((A,B)₄,C)₂,(D,E)₃)₁. (a) The ranked gene tree matches the ranked species tree. (b) The (ranked or unranked) gene tree does not match the species tree, and there is an incomplete lineage sorting event (a deep coalescence) because the lineages from species A and B fail to coalesce more recently than s₂. (c) The gene tree and species tree have the same unranked topology but have different ranked topologies, as D and E coalesce in the gene tree more recently than A and B, while A and B is the most recent divergence in the species tree. The gene tree in (c) has ranked topology (((A,B)₃,C)₂,(D,E)₄)₁. In (c), there are no incomplete lineage sorting events (no deep coalescences); however, there is an extra lineage at time s₃ which leads to the gene tree and species tree having different rankings. In (c), all coalescences occur in the most recent possible interval consistent with the ranked gene tree, and we have ℓ₁ = 2, ℓ₂ = 3, ℓ₃ = 5, ℓ₄ = 5, and g₁ = 2, g₂ = 3, g₃ = 5, g₄ = 5. (d) A gene tree with the same ranked topology as the gene tree in (c) but with coalescences occurring in different intervals.

Given a fixed species tree, and assuming the gene tree evolved under the multi-species coalescent [1], the most probable gene tree topology can have a different topology from that of the species tree. Such a gene tree topology is called an anomalous gene tree. In fact, for every species tree topology with at least 5 leaves, we can choose edge lengths in the species tree topology such that anomalous gene trees exist [2]. This implies that the gene tree topology appearing most often across different genes might not agree with the species tree topology. We therefore cannot infer the species tree from a collection of gene trees with a simple majority-rule heuristic and need statistical tools instead.

Current methods for inferring species trees from gene trees in this setting can be divided into topology-based and genealogy-based methods, according to whether the reconstruction algorithm takes as input gene tree topologies or genealogies, i.e., gene trees with branch lengths (coalescence times). Topology-based methods include Minimize Deep Coalescence (MDC) [3,4], STAR [5], STELLS [6], rooted triple consensus [7] and other consensus and supertree methods [8,9]. Genealogy-based methods include Bayesian and likelihood methods such as BEST, *BEAST, and STEM [10-12] and clustering and distance-based methods [5,13-15]. Possible pros and cons of the two approaches are that topology-based methods can be computationally faster and less sensitive to errors in estimating gene trees (and gene tree branch lengths) from sequence data [16], while methods that use coalescence times, particularly using Bayesian modelling, can be the most accurate when model assumptions are correct [17].

Another possibility that has been so far unexplored in methods for inferring species trees from gene trees is to use ranked gene trees, in which the temporal order of the nodes of the gene tree (the coalescence times) is used, but not the continuous-valued branch lengths. This approach might therefore be intermediate between purely topology-based methods and genealogy-based methods. By preserving more of the temporal information in the gene tree nodes, the hope is to develop methods that are more powerful than purely topology-based methods and that are still computationally efficient and robust to errors in estimating gene trees and gene tree branch lengths from sequence data.

In [18], a first step toward developing methods that use ranked gene trees for inferring species trees was taken by providing formulae to calculate the probability of a ranked gene tree given a species tree. The previous work, however, was based on an exponential enumeration of what were called ranked coalescent histories and did not provide an algorithm for computing some of the key terms in the probability of individual ranked histories. In this paper, we improve this previous (computationally inefficient) approach, by providing a method for computing probabilities of ranked gene trees given species trees which is polynomial in the number of leaves using a dynamic programming approach.

Methods for computing probabilities of ranked gene trees efficiently may also be of interest in the context of computing probabilities of unranked gene trees, particularly because no polynomial time algorithm has been found for calculating the probability of a gene tree topology given a species tree under the multispecies coalescent [6,19-21]. The probability of an unranked gene tree topology can be obtained by summing the probabilities of all ranked gene tree topologies with the same unranked topology. Thus, for unranked gene trees with particular shapes where the number of rankings grows polynomially in the number of leaves, using ranked gene trees can potentially increase the speed of computing probabilities of unranked gene trees as well. We note that a completely unbalanced gene tree has only one ranking, while the number of rankings can be exponential in the number of leaves when gene trees become more balanced. Thus, our approach for calculating unranked gene tree probabilities will be most useful for less balanced gene trees.

The bulk of the paper consists of the derivation of the polynomial time method for computing ranked gene tree probabilities. The algorithm is summarized in section ‘An algorithm’. This is followed by a discussion of applications to computing probabilities of unranked gene tree topologies and to inferring ranked species trees under maximum likelihood and a modification to the MDC criterion.

Calculating the probability of a ranked gene tree topology

In the following, we will derive the probability of a ranked gene tree topology given a species tree, $P [G | T]$ . Equations (1, 2, 3, 4, 8, 10) allow the calculation of $P [G | T]$ in time O(n⁵). The model giving rise to the gene tree is the multi-species coalescent with constant population sizes [1]. Each species consists of a population of constant size where lineages merge according to the coalescent. Thus, lineages from two different species may coalesce any time previous to the split of the two species.

We begin with some notation, which is also summarized in Table 1. Let time be 0 today and increasing going into the past. Let $T$ be a species tree with n species, and thus n − 1 speciation events (denoted by 1,…,n − 1) occurring at times s₁>⋯> s_n−1. Denote the interval between speciation event i − 1 and speciation event i by τ_i, see Figure 1.

Table 1.

Notation used in the paper

Symbol	Meaning
$T$	Species tree with real-valued divergence times
$G$	Ranked gene tree (real-valued coalescence times not specified)
n	The number of leaves of $T$ and $G$
s_i	Speciation times, with s₁>⋯> s_n−1, let s₀= ∞
τ_i	Intervals between speciation times, τ_i= [s_i,s_i−1)
ℓ_i	The number of gene tree lineages at time s_i
m_i	The number of coalescence events in interval τ_i
$G_{i, ℓ_{i}}$	The ranked gene tree observed from time 0 to time s_i
g_i	The minimum number of gene tree lineages at time s_i
y_i,z	Population z in interval τ_i in beaded tree
u_i	Internal node (coalescence) with rank i in the gene tree, u₁ is most ancient, u_n−1 is the most recent
k_i,j,z	The number of lineages available for coalescence in population y_i,z just after the jth coalescence (considered forward in time) in interval τ_i; k_i,0,z is the number of lineages “exiting” at time s_i−1
δ(y),δ(u)	The set of leaves descended from a node of the species tree or gene tree, respectively
lca(u)	For a node u of the gene tree, the node y of the species tree with largest rank such that δ(u) ⊂ δ(y)
τ(y)	For a node y with rank i on the species tree, we denote τ(y) = τ_i (the interval immediately above y)
λ_i,j	The overall coalescence rate in interval τ_i immediately preceding (backwards in time) the jth coalescence
H_k	Number of sequences of coalescences above the root of the species tree starting with k lineages
f_i	The joint density of coalescence times in interval τ_i

Let $G$ be a ranked gene tree topology. It is convenient to use the same labels for the leaves of $G$ and of $T$ . This is a slight abuse of notation, as leaf A of $T$ refers to a population (or species), and A of $G$ refers to a gene sampled from population A. We denote the nodes of $G$ (which are coalescence events) by u₁,…,u_n−1, where node u_j has rank j, and where higher rank indicates a more recent coalescence. A ranked tree topology can be notated similarly to Newick notation, putting the rank as a subscript for each node, see also Figure 1.

Let $G_{i, ℓ_{i}}$ be part of a ranked gene tree evolving on a species tree between time s_i and time 0 (i.e. the present). $G_{i, ℓ_{i}}$ consists of ℓ_i gene tree lineages at speciation time s_i and the coalescent history of $G_{i, ℓ_{i}}$ in time interval (0,s_i) is consistent with the ranked gene tree $G$ . Let g_i be the minimum number of lineages required in the ranked gene tree at time s_i such that $G$ can be embedded into the species tree $T$ . Note that n ≥ ℓ_i≥ g_i> i. Next we provide a dynamic programming approach for calculating the probability of a ranked gene tree given a species tree. An efficient way to determine the required quantities g₁,…,g_n−1 is provided in Section ‘Calculation of g_i and k_i,j,z’.

Essentially, in our approach, we traverse the intervals between speciation events going back in time, τ_n−1,…,τ₂ (formalized in Theorem 2), and calculate the probability of the appropriate coalescent events occurring in interval τ_i based on how many coalescent events happened in the later intervals τ_i+1,…,τ_n−1 (Theorem 3). Finally with Theorem 1, we account for the most ancestral time interval τ₁.

Theorem 1

The probability of a ranked gene tree given a species tree is,

P [G | T] = \sum_{ℓ_{1} = g_{1}}^{n} P [G_{1, ℓ_{1}} | T] / H_{ℓ_{1}}

(1)

where

H_{ℓ_{1}} = ℓ_{1}! (ℓ_{1} - 1)! / 2^{ℓ_{1} - 1}

(2)

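is the number of possible sequences of coalescences above the root starting from ℓ₁ lineages, so that $1 / H_{ℓ_{1}}$ is the probability for the coalescences above the root appearing in the right order [22].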

For precalculated $P [G_{1, ℓ_{1}} | T]$ (ℓ₁= 2,…,n) the complexity of calculating $P [G | T]$ is thus O(n). Next, we will provide a recursive way to calculate $P [G_{1, ℓ_{1}} | T]$ for ℓ₁= 2,…,n in polynomial time, thus $P [G | T]$ can be calculated in polynomial time.
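As an illustration of Equations (1) and (2), the following minimal Python sketch evaluates H_k and the final sum over ℓ₁. The function names and the dictionary input format are assumptions made for this example, not part of the original method description; the values $P [G_{1, ℓ_{1}} | T]$ are taken as already computed.

```python
from math import factorial


def num_root_sequences(k):
    """H_k = k! (k-1)! / 2^(k-1), Equation (2).

    This equals the product of C(j, 2) for j = 2..k, so the division is exact.
    """
    return factorial(k) * factorial(k - 1) // 2 ** (k - 1)


def ranked_gene_tree_prob(prefix_probs):
    """Equation (1): sum over l1 of P[G_{1,l1} | T] / H_{l1}.

    prefix_probs is a hypothetical input format: a dict mapping the number
    of lineages l1 at time s_1 to the precalculated P[G_{1,l1} | T].
    """
    return sum(p / num_root_sequences(l1) for l1, p in prefix_probs.items())


# Toy input (illustrative numbers only, not taken from the paper).
example = ranked_gene_tree_prob({2: 0.6, 3: 0.3})
```

Given the precalculated terms, the sum has at most n − 1 entries, in line with the O(n) cost stated above.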

Theorem 2

The probability $P [G_{i, ℓ_{i}} | T]$ can be calculated for all i recursively (with ℓ_i ≥ g_i),

\begin{array}{l} P [G_{i, ℓ_{i}} | T] \\ = \sum_{ℓ_{i + 1} = max (ℓ_{i}, g_{i + 1})}^{n} P [G_{i, ℓ_{i}} | G_{i + 1, ℓ_{i + 1}}, T] P [G_{i + 1, ℓ_{i + 1}} | T] \end{array}

(3)

with

P [G_{n - 1, n} | T] = 1 .

The complexity of calculating $P [G_{1, ℓ_{1}} | T]$ for ℓ₁= 2,…,n is O(n³), given we know $P [G_{i, ℓ_{i}} | G_{i + 1, ℓ_{i + 1}}, T]$ for all i,ℓ_i,ℓ_i+1.

Proof

At the time of the most recent speciation event, s_n−1, we have n lineages with probability 1, which is the initial value of the recursion. Calculating $P [G_{i, ℓ_{i}} | T]$ for i < n − 1 can be done in the following way,

\begin{align} P [G_{i, ℓ_{i}} | T] \\ = \sum_{ℓ_{i + 1} = max (ℓ_{i}, g_{i + 1})}^{n} P [G_{i, ℓ_{i}}, G_{i + 1, ℓ_{i + 1}} | T] \\ = \sum_{ℓ_{i + 1} = max (ℓ_{i}, g_{i + 1})}^{n} P [G_{i, ℓ_{i}} | G_{i + 1, ℓ_{i + 1}}, T] P [G_{i + 1, ℓ_{i + 1}} | T] . \end{align}

Suppose $P [G_{i, ℓ_{i}} | G_{i + 1, ℓ_{i + 1}}, T]$ is known. Given we calculated the probability $P [G_{i + 1, ℓ_{i + 1}} | T]$ for ℓ_i+1= i + 2,…,n, then calculating $P [G_{i, ℓ_{i}} | T]$ for ℓ_i= i + 1,…,n requires $O (\sum_{j = 1}^{n - i} j) = O ((\binom{n - i + 1}{2}))$ calculations. Summing up over i = 1,…,n − 1 yields a complexity of $O (\sum_{i = 2}^{n} (\binom{i}{2})) = O ((\binom{n + 1}{3})) = O (n^{3})$ . □
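The recursion in Equation (3) can be sketched as a short dynamic program. In this sketch the conditional probabilities are supplied through a callback `transition(i, l_i, l_next)`, which is a hypothetical interface chosen for illustration; the values g_i are passed as a dictionary. The small usage example assumes a three-species tree ((A,B),C) with a matching gene tree and an internal branch of length t in coalescent units, for which the two conditional probabilities are the standard single-population values 1 − e^(−t) and e^(−t).

```python
from math import exp


def prefix_probabilities(n, g, transition):
    """Return {l1: P[G_{1,l1} | T]} following Equation (3).

    n          -- number of leaves
    g          -- dict i -> g_i for i = 1..n-1 (minimum lineage counts)
    transition -- hypothetical callback giving
                  P[G_{i,l_i} | G_{i+1,l_next}, T]
    """
    probs = {n: 1.0}  # initial value P[G_{n-1,n} | T] = 1
    for i in range(n - 2, 0, -1):
        updated = {}
        for l_i in range(g[i], n + 1):
            # The keys of probs already satisfy l_next >= g_{i+1}, so the
            # lower summation limit max(l_i, g_{i+1}) reduces to l_i here.
            updated[l_i] = sum(
                transition(i, l_i, l_next) * p
                for l_next, p in probs.items()
                if l_next >= l_i
            )
        probs = updated
    return probs


# Three-taxon illustration with internal branch length t (an assumed value).
t = 1.0


def three_taxon_transition(i, l_i, l_next):
    # One coalescence in the population ancestral to A and B, or none.
    return 1.0 - exp(-t) if l_i == 2 else exp(-t)


three_taxon = prefix_probabilities(3, {1: 2, 2: 3}, three_taxon_transition)
```

The three nested loops over i, ℓ_i and ℓ_i+1 mirror the O(n³) count in the proof.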

It remains to determine $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ . Note that during the interval τ_i, we have i branches in the species tree. Let m_i be the number of coalescent events in τ_i, so m_i= ℓ_i− ℓ_i−1. Let the number of lineages on branch z just after the jth coalescent event (going forward in time) in τ_i be k_i,j,z. Calculation of k_i,j,z can be done efficiently as shown in Section ‘Calculation of g_i and k_i,j,z’.

Theorem 3

We have,

P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T] = \sum_{j = 0}^{m_{i}} \frac{e^{- λ_{i, j} (s_{i - 1} - s_{i})}}{\prod_{k = 0, k \neq j}^{m_{i}} (λ_{i, k} - λ_{i, j})}

(4)

where $λ_{i, j} = \sum_{z = 1}^{i} \binom{k_{i, j, z}}{2}$ and $\binom{1}{2} := 0$ .

Proof

The density for the coalescence events in interval τ_i can be obtained by considering the waiting time to the “next” coalescent event (going backwards in time) as being due to competing exponentials in the different branches, where the coalescence rate within branch z is $(\binom{k_{i, j, z}}{2})$ . Thus, the waiting time until the next coalescent event has rate $λ_{i, j} = \sum_{z = 1}^{i} (\binom{k_{i, j, z}}{2})$ .

We denote the time between the jth and (j + 1)st coalescent event as v_j, where v₀ is the time between s_i−1 and the first (least recent) coalescent event in τ_i and with $v_{m_{i}}$ being the time between s_i and coalescent event m_i.

The density for the coalescent events in the interval τ_i is [18],

\begin{align} f_{i} (v_{0}, v_{1}, \dots, v_{m_{i}}) & = e^{- \sum_{j = 0}^{m_{i}} \sum_{z = 1}^{i} \binom{k_{i, j, z}}{2} v_{j}} \\ & = e^{- \sum_{j = 0}^{m_{i}} λ_{i, j} v_{j}} . \end{align}

It remains to integrate over v, for which we distinguish between case (i) λ_i,0= 0, and case (ii) λ_i,0> 0.

Case (i): If λ_i,0= 0 (which occurs if ℓ_i−1= i, i.e., all lineages within each population coalesce), then we rewrite f_i as,

f_{i} (v_{0}, v_{1}, \dots, v_{m_{i}}) = \frac{\prod_{j = 1}^{m_{i}} λ_{i, j} e^{- λ_{i, j} v_{j}}}{\prod_{j = 1}^{m_{i}} λ_{i, j}} .

(5)

Using the fact that the integral of the numerator of Equation (5) is a hypoexponential distribution based on the sum of m_i exponential random variables [23] (with density functions $λ_{i, j} e^{- λ_{i, j} v_{j}}$ , j = 1,…,m_i), the probability of the coalescent events in the interval is the cumulative distribution function of the hypoexponential distribution evaluated at $s_{i - 1} - s_{i} = \sum_{j = 0}^{m_{i}} v_{j}$ . Thus, with λ_i,j< λ_i,j+1,

\begin{align} P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T] & = \frac{1}{\prod_{j = 1}^{m_{i}} λ_{i, j}} - \sum_{j = 1}^{m_{i}} \frac{e^{- λ_{i, j} (s_{i - 1} - s_{i})}}{λ_{i, j} \prod_{k = 1, k \neq j}^{m_{i}} (λ_{i, k} - λ_{i, j})} \\ & = \frac{1}{\prod_{j = 1}^{m_{i}} λ_{i, j}} + \sum_{j = 1}^{m_{i}} \frac{e^{- λ_{i, j} (s_{i - 1} - s_{i})}}{\prod_{k = 0, k \neq j}^{m_{i}} (λ_{i, k} - λ_{i, j})} \\ & = \sum_{j = 0}^{m_{i}} \frac{e^{- λ_{i, j} (s_{i - 1} - s_{i})}}{\prod_{k = 0, k \neq j}^{m_{i}} (λ_{i, k} - λ_{i, j})} \end{align}

(6)

where the second line follows because −λ_i,j= λ_i,0− λ_i,j.

Case (ii): If λ_i,0> 0, then we rewrite f_i as,

f_{i} (v_{0}, v_{1}, \dots, v_{m_{i}}) = \frac{\prod_{j = 0}^{m_{i}} λ_{i, j} e^{- λ_{i, j} v_{j}}}{\prod_{j = 0}^{m_{i}} λ_{i, j}}

(7)

For integrating f_i, we use the fact that the integral of the numerator in Equation (7) is the convolution of m_i+ 1 exponential random variables with parameters $λ_{i, 0}, \dots, λ_{i, m_{i}}$ , which is the hypoexponential distribution. Now, since λ_i,j< λ_i,j+1, we observe, using the probability density function of the hypoexponential distribution,

\begin{align} P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T] & = \int_{v} f_{i} (v_{0}, v_{1}, \dots, v_{m_{i}}) dv \\ & = \sum_{j = 0}^{m_{i}} \frac{e^{- λ_{i, j} (s_{i - 1} - s_{i})}}{\prod_{k = 0, k \neq j}^{m_{i}} (λ_{i, k} - λ_{i, j})}, \end{align}

which is the same expression as for the λ_i,0= 0 case (6). Note that for case (i) we made use of the cumulative distribution function of the hypoexponential distribution, while for case (ii) we made use of the density function of the hypoexponential distribution. Both cases yield the same final expression for $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ , which establishes the proof. □

Corollary 4

The probabilities $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ for all possible i, m_i and ℓ_i (recall that m_i = ℓ_i − ℓ_i−1) are calculated in O(n⁵), given all λ_i,j.

Proof

For a fixed i, m_i and ℓ_i, we require $O (m_{i}^{2})$ calculations to evaluate $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ . We need to determine $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ for all possible i, m_i and ℓ_i. First, we observe that i ≤ ℓ_i−1≤ n, and thus for a fixed ℓ_i, we have, 0 ≤ m_i≤ ℓ_i− i. Second, i < ℓ_i≤ n. And third, 2 ≤ i ≤ n − 1. Thus, the number of calculations needed to calculate $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ for all possible i, m_i and ℓ_i is,

\begin{align} O (\sum_{i = 2}^{n - 1} \sum_{ℓ_{i} = i + 1}^{n} \sum_{m_{i} = 0}^{ℓ_{i} - i} m_{i}^{2}) & = O (\sum_{i = 2}^{n - 1} \sum_{ℓ_{i} = i + 1}^{n} {(ℓ_{i} - i)}^{3}) \\ = O (\sum_{i = 2}^{n - 1} {(n - i)}^{4}) \\ = O (n^{5}) . \end{align}

□

Corollary 5

The quantities λ_i,j can be calculated for all possible i, m_i, ℓ_i and j in O(n⁵), given all k_i,j,z.

Proof

For a fixed i, m_i, ℓ_i and j, we require O(i) calculations to evaluate λ_i,j. As j = 0,…,m_i, with the same arguments as in Corollary 4, we obtain,

\begin{align} O (\sum_{i = 2}^{n - 1} \sum_{ℓ_{i} = i + 1}^{n} \sum_{m_{i} = 0}^{ℓ_{i} - i} \sum_{j = 0}^{m_{i}} i) & = O (\sum_{i = 2}^{n - 1} i \sum_{ℓ_{i} = i + 1}^{n} \sum_{m_{i} = 0}^{ℓ_{i} - i} m_{i}) \\ & = O (\sum_{i = 2}^{n - 1} i \sum_{ℓ_{i} = i + 1}^{n} {(ℓ_{i} - i)}^{2}) \\ & = O (\sum_{i = 2}^{n - 1} i {(n - i)}^{3}) \\ & = O (n^{5}) . \end{align}

□

We note that the terms $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ are analogous to the functions g_i,j defined in [24],[25], which give the probability that i lineages coalesce into j within time t in a single population and are used extensively in computing probabilities related to unranked gene trees [6],[19],[26,27]. In particular, if only one population, say z^∗, has coalescence events, then we have

\begin{align} P [G_{i, ℓ_{i}} | G_{i + 1, ℓ_{i + 1}}, T] \\ = \frac{g_{ℓ_{i + 1}, ℓ_{i}} (s_{i} - s_{i + 1}) \prod_{z \neq z^{*}} g_{k_{i, 0, z}, k_{i, 0, z}} (s_{i} - s_{i + 1})}{\prod_{k = 1}^{ℓ_{i + 1} - ℓ_{i}} (\binom{ℓ_{i + 1} - k + 1}{2})}, \end{align}

a product of g_i,j functions with the denominator counting the number of sequences in which m_i coalescences could have occurred. The terms $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ allow for the coalescences to occur in separate populations, however, and are constrained by the ranking of the gene tree. For example, in interval τ₃ of Figure 1c, there are two coalescences which occur in different populations. If the ranking of the gene tree were not important, the branches could be considered independent, and the probability of this event would be g_2,1(s₂− s₃)g_2,1(s₂− s₃). However, the gene tree ranking constrains the coalescence of A and B to be less recent than that of D and E, so the probability for events in this interval is,

P [G_{2, 3} | G_{3, 5}, T] = {[g_{2, 1} (s_{2} - s_{3})]}^{2} / 2 .

We illustrate that we get the same result from Theorem 3: there are two coalescence events in interval τ₃, so we use j = 0,1,2, and calculate

\begin{align} λ_{3, 0} = (\binom{1}{2}) + (\binom{1}{2}) + (\binom{1}{2}) = 0, \\ λ_{3, 1} = (\binom{2}{2}) + (\binom{1}{2}) + (\binom{1}{2}) = 1, \\ λ_{3, 2} = (\binom{2}{2}) + (\binom{1}{2}) + (\binom{2}{2}) = 2 . \end{align}

Thus, Equation (4) from Theorem 3 evaluates to

\begin{align} \frac{e^{- 0 (s_{2} - s_{3})}}{(2 - 0) (1 - 0)} + \frac{e^{- 1 (s_{2} - s_{3})}}{(0 - 1) (2 - 1)} + \frac{e^{- 2 (s_{2} - s_{3})}}{(0 - 2) (1 - 2)} \\ = \frac{1}{2} - e^{- (s_{2} - s_{3})} + \frac{1}{2} e^{- 2 (s_{2} - s_{3})} \\ = \frac{1}{2} {(1 - e^{- (s_{2} - s_{3})})}^{2} \\ = {[g_{2, 1} (s_{2} - s_{3})]}^{2} / 2 . \end{align}
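The worked example above can be checked numerically. The sketch below evaluates the rates λ_i,j from per-branch lineage counts and then Equation (4); the function names, the tuple-based input and the interval length 0.7 are assumptions made for illustration. As in the derivation, the rates are assumed to be pairwise distinct (otherwise a denominator would vanish).

```python
from math import comb, exp, prod


def coalescence_rate(lineages):
    """lambda_{i,j} = sum over branches z of C(k_{i,j,z}, 2).

    math.comb(1, 2) evaluates to 0, matching the convention used in the text.
    """
    return sum(comb(k, 2) for k in lineages)


def interval_probability(rates, length):
    """Equation (4) for one interval.

    rates  -- [lambda_{i,0}, ..., lambda_{i,m_i}], assumed pairwise distinct
    length -- s_{i-1} - s_i
    """
    total = 0.0
    for j, lam_j in enumerate(rates):
        # Product over all k != j; an empty product (m_i = 0) equals 1.
        denom = prod(lam_k - lam_j for k, lam_k in enumerate(rates) if k != j)
        total += exp(-lam_j * length) / denom
    return total


# Interval tau_3 of Figure 1c: lineage counts per branch for j = 0, 1, 2.
counts_by_j = [(1, 1, 1), (2, 1, 1), (2, 1, 2)]
rates = [coalescence_rate(k) for k in counts_by_j]
length = 0.7  # arbitrary interval length chosen for the check
value = interval_probability(rates, length)
closed_form = 0.5 * (1.0 - exp(-length)) ** 2
```

Each evaluation costs O(m_i²) operations, which is the per-term cost used in Corollary 4.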

Remark 6

The probability of a gene tree topology is the sum of the probabilities of each ranked gene tree with the given topology. A given tree topology has $(n - 1)! / \prod_{i = 1}^{n - 1} (c_{i} - 1)$ rankings, where c_i is the number of descendant leaves of interior vertex i. A proof can be found in [28]. For a completely balanced tree on n = 2^k leaves, the number of rankings grows faster than polynomial: the numerator can be approximated by,

n! \approx \sqrt{2 \pi n} {(n / e)}^{n},

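and the denominator can be bounded by,

\prod_{i = 1}^{n - 1} (c_{i} - 1) = \prod_{i = 1}^{k} {(2^{i} - 1)}^{n / 2^{i}} < \prod_{i = 1}^{k} 2^{i n / 2^{i}} < 2^{2 n} = 4^{n},

since the balanced tree has n/2^i interior vertices with 2^i descendant leaves and $\sum_{i \geq 1} i / 2^{i} = 2$ , showing that the ratio grows faster than polynomial in n.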
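The counting formula in Remark 6 is easy to evaluate for small trees. The following sketch assumes a tree encoded as nested 2-tuples with string leaves, which is an encoding chosen here for illustration only.

```python
from math import factorial


def count_rankings(tree):
    """Number of rankings (n-1)! / prod(c_i - 1) of a rooted binary tree.

    tree is a nested 2-tuple such as (("A", "B"), "C"); anything that is
    not a tuple is treated as a leaf (hypothetical encoding).
    """
    def visit(node):
        # Returns (number of leaves, product of (c_i - 1) over interior vertices).
        if not isinstance(node, tuple):
            return 1, 1
        left_leaves, left_prod = visit(node[0])
        right_leaves, right_prod = visit(node[1])
        c = left_leaves + right_leaves
        return c, left_prod * right_prod * (c - 1)

    n, denominator = visit(tree)
    return factorial(n - 1) // denominator


caterpillar = (((("A", "B"), "C"), "D"), "E")
balanced4 = (("A", "B"), ("C", "D"))
balanced8 = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
figure1_topology = ((("A", "B"), "C"), ("D", "E"))
```

For these inputs the function gives 1 ranking for the caterpillar, 2 for the balanced tree on four leaves, 80 for the balanced tree on eight leaves, and 3 for the unranked topology of Figure 1.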

Calculation of g_i and k_i,j,z

Calculation of g_i

If $T$ and $G$ have the same ranked topology, then g_i= i + 1. In general, to compute g_i, we let lca (u_j) be the least common ancestor node on the species tree for a node u_j on the ranked gene tree – i.e., the node with the largest rank on the species tree which is ancestral to all species represented in u_j. For a node y on the species tree, let τ(y) be the interval immediately above y. For example, in Figure 1c, τ(lca(u₄)) = τ₃ where u₄ is the gene tree node with rank 4 — the node ancestral to D and E only. In order to compute g_i, we count the number of gene tree nodes which may occur closer to the present than s_i. These are precisely all gene tree nodes u_j where lca (u_j) is in any of the intervals τ_i+1,…,τ_n−1. Since at the present, n lineages are able to coalesce, we can express g_i as,

g_{i} = n - \sum_{j = i + 1}^{n - 1} \prod_{k = j}^{n - 1} I (τ (lca (u_{k})) > τ_{i})

(8)

where τ_j< τ_i iff j < i, and where I(·) is an indicator function taking the value 1 if the condition holds and otherwise 0. Assuming each lca() operation is O(1) [29,30], preprocessing allows all lca terms to be computed in O(n) time. Thus, calculating g₁,…,g_n−1 can be done, based on Equation 8, in O(n³).
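Equation (8) can be sketched in a few lines once the lca terms are available. In the sketch below, `lca_rank[k]` is an assumed input giving the rank i of the species tree node lca(u_k), so that τ(lca(u_k)) = τ_i. The product of indicators in Equation (8) equals 1 exactly when every node u_k with k ≥ j qualifies, so the inner sum is the length of the qualifying run of most recent gene tree nodes, and the loop can stop at the first node that fails.

```python
def min_lineages(n, lca_rank):
    """Return {i: g_i} for i = 1..n-1 following Equation (8).

    n        -- number of leaves
    lca_rank -- hypothetical input: dict k -> rank of lca(u_k) on the
                species tree, for gene tree ranks k = 1..n-1
    """
    g = {}
    for i in range(1, n):
        count = 0
        # Walk from the most recent gene tree node towards rank i + 1.
        for j in range(n - 1, i, -1):
            if lca_rank[j] <= i:
                break  # tau(lca(u_j)) is not more recent than tau_i
            count += 1
        g[i] = n - count
    return g


# Figure 1c: gene tree (((A,B)_3,C)_2,(D,E)_4)_1 on species tree
# (((A,B)_4,C)_2,(D,E)_3)_1.
figure_1c = min_lineages(5, {1: 1, 2: 2, 3: 4, 4: 3})

# Figure 1a: gene tree and species tree share the same ranked topology.
figure_1a = min_lineages(5, {1: 1, 2: 2, 3: 3, 4: 4})
```

For Figure 1c this reproduces the values g₁ = 2, g₂ = 3, g₃ = 5, g₄ = 5 given in the caption of Figure 1, and for Figure 1a it gives g_i = i + 1. This particular loop uses O(n²) steps, which is within the O(n³) bound stated above.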

Calculation of k_i,j,z

We let y_i,j be the jth population (read left to right) in interval τ_i (equivalently, the jth branch or jth node subtending the branch). In order to label every population before and after a speciation time s_i uniquely, extra nodes can be added to the species tree to form a beaded species tree (Figure 2), so that there are i nodes at time s_i, $i = 1, \dots, n - 1$ . For each i ∈ {1,…,n−1}, there is one node of outdegree 2, and i − 1 nodes of outdegree 1. Thus, population y_i,j corresponds to a branch (equivalently, a node) in the beaded species tree. We denote the outdegree of a node y by outdeg(y).

Figure 2. The beaded version of the species tree topology in Figure 1a–d.

In the remainder of this section, we compute the values k_i,j,z, i.e. the number of lineages on branch y_i,z of the beaded species tree during the interval immediately after the jth coalescence event (going forward in time), with k_i,0,z being the number of lineages “exiting” the branch at time s_i−1. For example, in Figure 1b, we have

\begin{array}{l} k_{2, 0, 1} = 1, & k_{2, 1, 1} = 2, & k_{2, 2, 1} = 3, \\ k_{2, 0, 2} = 1, & k_{2, 1, 2} = 1, & k_{2, 2, 2} = 1, \end{array}

The value of k_i,j,z depends on the number of lineages entering branch i, ℓ_i, as well as the number of lineages exiting the branch, and not just on the number of coalescence events in the interval. For example, in Figure 1c, k_2,0,1= 1 and k_2,1,1= 2, while in Figure 1d, k_2,0,1= 2 and k_2,1,1= 3, although the two gene trees have the same ranked topology and m₂= 1 for both cases.

To determine the terms k_i,j,z we note that the number of coalescences that have occurred more recently than interval τ_i is n − ℓ_i. In a given interval τ_i, we let z⁽¹⁾ and z⁽²⁾ be the left and right children, respectively, of population z of outdegree 2, and let z⁽¹⁾= z⁽²⁾ be the only child of a node z of outdegree 1.

The number of lineages available to coalesce in population z of interval τ_i is

k_{i, m_{i}, z} = \sum_{j = 1}^{outdeg (y_{i, z})} k_{i + 1, 0, z^{(j)}}

(9)

where the z^(j) are the daughter populations (one or two) of z. Further, k_n,0,z = 1 for all z, since a single gene lineage is sampled from each of the n present-day populations. Since the beaded species tree has n²/2 nodes, precalculating outdeg(y_i,z) requires O(n²). For 0 ≤ j < m_i, we have

k_{i, j, z} = \begin{cases} k_{i, j + 1, z} - 1 & \text{$j$th coalescence on branch $z$} \\ k_{i, j + 1, z} & \text{otherwise} \end{cases}

(10)

Consequently, determining a particular k_i,j,z is O(1). Thus determining k_i,j,z for all possible i, m_i and ℓ_i is (see also Corollary 4),

O (\sum_{i = 2}^{n - 1} \sum_{ℓ_{i} = i + 1}^{n} \sum_{m_{i} = 0}^{ℓ_{i} - i} \sum_{j = 0}^{m_{i}} 1) = O (n^{4}) .

Note that taking the sum over all z is not necessary, as in all but one branch the k_i,j,z equals the k_i,j+1,z.
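The update in Equation (10) can be illustrated for a single interval. The sketch below assumes that the lineage counts entering the interval at time s_i (the values k_i,m_i,z from Equation (9)) are known, and that the branch of each coalescence in the interval is listed in order going backwards in time; both the input format and the function name are hypothetical.

```python
def lineage_counts(entering, coalescence_branches):
    """Return rows with rows[j][z - 1] = k_{i,j,z} for j = 0..m_i.

    entering             -- [k_{i,m_i,1}, ..., k_{i,m_i,i}], lineages per
                            branch at time s_i
    coalescence_branches -- 1-based branch index of each coalescence in the
                            interval, most recent first (assumed input)
    """
    m = len(coalescence_branches)
    rows = [None] * (m + 1)
    current = list(entering)
    rows[m] = list(current)
    for step, z in enumerate(coalescence_branches):
        if current[z - 1] < 2:
            raise ValueError("a coalescence needs at least two lineages")
        # Only the branch carrying the coalescence changes; all others copy.
        current[z - 1] -= 1
        rows[m - 1 - step] = list(current)
    return rows


# Figure 1b, interval tau_2: three lineages enter branch 1, one enters
# branch 2, and both coalescences occur on branch 1.
figure_1b_tau2 = lineage_counts([3, 1], [1, 1])

# Figure 1c, interval tau_3: D and E coalesce first (branch 3), then A and B.
figure_1c_tau3 = lineage_counts([2, 1, 2], [3, 1])
```

The first call reproduces the values k_2,0,1 = 1, k_2,1,1 = 2, k_2,2,1 = 3 and k_2,j,2 = 1 listed above for Figure 1b.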

An algorithm

In summary, we derived an algorithm with runtime O(n⁵) for calculating the probability of a ranked gene tree given a species tree on n tips:

1. Calculate g₁,…,g_n−1 using Equation (8).

2. Calculate k_i,j,z (for i,j = 1,…,n; z = 1,…,i), using Equations (9) and (10).

3. Calculate $λ_{i, j} = \sum_{z = 1}^{i} (\binom{k_{i, j, z}}{2})$ (for i,j = 1,…,n).

4. Calculate $P [G_{i - 1, ℓ_{i - 1}} | G_{i, ℓ_{i}}, T]$ (for i = 2,…,n; ℓ_i−1= g_i−1,…,n; ℓ_i= g_i,…,n), using Theorem 3.

5. Calculate $P [G_{1, ℓ_{1}} | T]$ using Theorem 2.

6. Calculate $P [G | T]$ using Theorem 1.

Conclusions

In this paper, we provide a polynomial-time algorithm (O(n⁵) where n is the number of species) to calculate the probability of a ranked gene tree topology given a species tree, summarized in Section ‘An algorithm’. We now discuss applying these results to computing probabilities of unranked gene tree topologies and to inferring ranked species trees.

Computing probabilities of unranked gene tree topologies

Previous work on computing probabilities of unranked gene tree topologies used the concept of coalescent histories, which specify the branches in the species tree in which each node of the gene tree occurs. An unranked gene tree probability can then be computed by enumerating all coalescent histories and computing the probability of each. The number of coalescent histories grows at least exponentially when the (unranked) gene tree matches the species tree, making this approach computationally intensive. Coalescent histories can be enumerated either recursively (e.g., in PHYLONET [31] or [20]) or nonrecursively (COAL [19]).

A much faster approach using dynamic programming similar to that used in this paper is implemented in STELLS [6], which conditions on the ancestral configuration in each branch rather than the number of lineages. Here an ancestral configuration keeps track not only of the number of lineages in a branch in the species tree, but also the particular nodes of the gene tree. Different ancestral configurations can potentially have the same number of lineages within a population. Enumerating ancestral configurations turns out to have exponential running time for arbitrarily shaped trees, but the number of ancestral configurations is still much smaller than the number of coalescent histories. When computing probabilities of ranked gene tree topologies, however, the ranking specifies the sequence of coalescence events, leading to a unique ancestral configuration given the number of lineages in a time interval. This fortuitously enables probabilities of ranked gene tree topologies to be computed in polynomial time.

We note that although the number of rankings for a gene tree is not polynomial in the number of leaves in general, the number of rankings can be small for certain tree shapes. For example, if the gene tree has a caterpillar shape, in which each internal node has a leaf as a descendant, then there is only one ranking, and thus computing the ranked and unranked gene tree probabilities is equivalent. For a pseudo-caterpillar, a tree made by replacing the subtree with four leaves of a caterpillar with a balanced tree on four leaves [20], there are only two rankings possible, and for a bicaterpillar [20], for which the left subtree is a caterpillar with n_L leaves and the right subtree is a caterpillar with n − n_L leaves, there are $\binom{n - 2}{n_{L} - 1}$ rankings. Thus computing unranked gene tree probabilities by summing ranked gene tree probabilities can be done in polynomial time for some tree shapes. We note that for the approach used by STELLS, some tree shapes can also be computed in polynomial time, including the cases we mentioned with a polynomial number of rankings (caterpillar and pseudo-caterpillar). An open question is whether there are any classes of unranked gene trees which have a polynomial number of rankings but an exponential number of ancestral configurations, or vice versa.

Inferring species trees from ranked gene trees

Our fast calculation of the probability of ranked gene tree topologies can be used to determine the maximum likelihood species tree from a collection of known gene trees. Assume we have observed N ranked gene trees (i.e., N loci). Now the maximum likelihood species tree $T_{ML}$ (with branch lengths on internal branches) is

T_{ML} = \underset{T}{argmax} P [G_{1}, \dots, G_{N} | T]

where

P [G_{1}, \dots, G_{N} | T] = \prod_{k = 1}^{N} P [G_{k} | T] = \prod_{i = 1}^{H_{n}} P {[G^{(i)} | T]}^{n_{i}}

(11)

is a multinomial likelihood. Here $P [G_{k} | T]$ can be determined with our polynomial-time algorithm, we let $G^{(i)}$ denote the ith ranked topology, and n_i is the number of times ranked topology i is observed, with $\sum_{i = 1}^{H_{n}} n_{i} = N$ . Note in particular that the ranked topology of $T_{ML}$ might differ from the most frequent ranked gene tree topology [18].
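In practice, Equation (11) would typically be evaluated on the log scale to avoid numerical underflow for large N. The sketch below assumes that the probabilities $P [G^{(i)} | T]$ of the observed ranked topologies have already been computed; the dictionary-based interface and the toy numbers are illustrative assumptions, and the search over species trees is not shown.

```python
from math import log


def log_likelihood(counts, probs):
    """Logarithm of Equation (11): sum over i of n_i * log P[G^(i) | T].

    counts -- dict mapping a ranked topology label to its count n_i
    probs  -- dict mapping the same labels to P[G^(i) | T] for a candidate
              species tree T (assumed precomputed)
    """
    total = 0.0
    for topology, n_i in counts.items():
        if n_i == 0:
            continue  # unobserved topologies contribute a factor of 1
        p = probs[topology]
        if p <= 0.0:
            return float("-inf")  # an observed topology is impossible under T
        total += n_i * log(p)
    return total


# Toy data with hypothetical labels and probabilities (not from the paper).
toy_counts = {"ranking_1": 2, "ranking_2": 1, "ranking_3": 0}
toy_probs = {"ranking_1": 0.5, "ranking_2": 0.25, "ranking_3": 0.25}
toy_value = log_likelihood(toy_counts, toy_probs)
```

Only topologies with n_i > 0 need to be evaluated, so the cost per candidate species tree is governed by the number of distinct observed ranked gene trees rather than by H_n.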

Our derivation of the ranked gene tree probability also suggests a way to infer a ranked species tree topology from ranked gene tree topologies with a similar flavor as the MDC criterion. In MDC, for an input gene tree and candidate species tree, the number of extra lineages (lineages which necessarily fail to coalesce due to topological differences between gene and species trees) on each edge of the species tree is counted. For MDC, whether the edge of the species tree is long or short does not affect the deep coalescence cost. In working with ranked gene trees, however, we can keep track of the minimum number of extra lineages within each time interval τ_i. The total number of extra lineages in this sense is

\sum_{i = 1}^{n - 1} \left[ g_{i} - (i + 1) \right]

(12)

Minimizing (12) as a criterion for the ranked species tree will tend to penalize long edges of the species tree which have multiple lineages persisting through multiple species divergence events. As an example, in Figure 1b, the gene tree has an MDC cost of 1 since there are two lineages exiting the population immediately ancestral to A and B; however, the cost according to (12) is 2 because there are two edges on the beaded version of the species tree (Figure 2) that each have an extra lineage. In Figure 1c, the gene tree has an MDC cost of 0 for the species tree since it has the matching unranked topology; however, the number of extra lineages from equation (12) is 1. We note that in Figure 1c, interval τ₃, incomplete lineage sorting (and deep coalescence) have not occurred as these concepts are normally used. To capture the idea that coalescence has nevertheless occurred in a more ancient time interval than allowed, we might refer to the coalescence of A and B in Figure 1c as an “ancient lineage sorting” event (rather than an incomplete lineage sorting event) or an ancient coalescence rather than a deep coalescence. We could therefore refer to minimizing equation (12) as the Minimize Ancient Coalescence (MAC) criterion, which would provide an interesting comparison to the usual topology-based MDC criterion.
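Once the values g_i are available, the cost in equation (12) is a single sum. The sketch below assumes that g_1, …, g_{n−1} have already been computed as described earlier in the paper and are passed as a list, and that i + 1 is the smallest value g_i can take; the function name and the example values are hypothetical and are not taken from Figure 1.

```python
def ancient_coalescence_cost(g):
    """Equation (12): sum over i = 1..n-1 of g_i - (i + 1).

    g: list [g_1, ..., g_{n-1}] of lineage counts, one per interval.
    """
    cost = 0
    for i, g_i in enumerate(g, start=1):
        extra = g_i - (i + 1)  # lineages beyond the minimum for interval i
        if extra < 0:
            raise ValueError(f"g_{i} = {g_i} is below the minimum {i + 1}")
        cost += extra
    return cost


if __name__ == "__main__":
    print(ancient_coalescence_cost([2, 3, 4]))  # no extra lineages
    print(ancient_coalescence_cost([3, 4, 4]))  # one extra lineage in two intervals
```

A candidate ranked species tree would be scored by summing this cost over all loci.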

In practice, a method of inferring a species tree from ranked gene trees would require estimating the ranked gene trees. This would require clock-like gene trees, or trees with times estimated for nodes, which can also be inferred under relaxed clock models in BEAST [32]. To account for the uncertainty in the gene trees, the counts for different ranked gene trees could be weighted by their posterior probabilities obtained from Bayesian estimation of the gene trees [33]. Thus, in equation (11), we would let n_ik be the posterior probability of ranked topology i at locus k, and use $n_{i} = \sum_{k = 1}^{N} n_{ik}$, where the sum runs over the N loci, as the estimated number of times that ranked topology i was observed. Similarly, for equation (12), the coalescence cost at a locus could be distributed over multiple topologies weighted by their posterior probabilities.
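The posterior weighting described above can be sketched as follows. The code assumes that each locus contributes a dictionary of posterior probabilities over ranked topologies (for instance summarized from a Bayesian posterior sample); the function names, topology labels and numbers are hypothetical. The weighted counts n_i are obtained by summing n_ik over loci and are then used in place of the integer counts in the log of equation (11).

```python
import math
from collections import defaultdict


def weighted_counts(posteriors):
    """Return n_i = sum over loci k of n_ik.

    posteriors: list with one dict per locus, mapping a ranked-topology
                label to its posterior probability at that locus.
    """
    counts = defaultdict(float)
    for locus in posteriors:
        for topology, weight in locus.items():
            counts[topology] += weight
    return dict(counts)


def weighted_log_likelihood(posteriors, prob):
    """Log of equation (11) with posterior-weighted counts n_i.

    prob: dict mapping label -> P[G^(i) | T]; every topology with positive
          weight is assumed to have positive probability under T.
    """
    counts = weighted_counts(posteriors)
    return sum(n_i * math.log(prob[g]) for g, n_i in counts.items())


if __name__ == "__main__":
    # Hypothetical posteriors for three loci.
    posteriors = [
        {"g1": 0.9, "g2": 0.1},
        {"g1": 0.4, "g3": 0.6},
        {"g2": 1.0},
    ]
    print(weighted_counts(posteriors))
    print(weighted_log_likelihood(posteriors, {"g1": 0.6, "g2": 0.3, "g3": 0.1}))
```

Because the posterior probabilities at each locus sum to one, the weighted counts still sum to the number of loci.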

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Both authors contributed equally to all parts of this work. Both authors read and approved the final manuscript.

Contributor Information

Tanja Stadler, Email: tanja.stadler@env.ethz.ch.

James H Degnan, Email: j.degnan@math.canterbury.ac.nz.

Acknowledgements

We thank David Bryant for suggesting the dynamic programming approach to this problem and two anonymous referees for valuable comments, particularly on calculating g_i and k_i,j,z. JHD was funded by the New Zealand Marsden fund and by a Sabbatical Fellowship at the National Institute for Mathematical and Biological Synthesis, an Institute sponsored by the National Science Foundation, the U.S. Department of Homeland Security, and the U.S. Department of Agriculture through NSF Award #EF-0832858, with additional support from The University of Tennessee, Knoxville. TS was funded by the Swiss National Science Foundation.

References

1. Degnan JH, Rosenberg NA. Gene tree discordance, phylogenetic inference, and the multispecies coalescent. Trends Ecol Evol. 2009;24:332–340. doi: 10.1016/j.tree.2009.01.009.
2. Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006;2:762–768. doi: 10.1371/journal.pgen.0020068.
3. Maddison WP, Knowles LL. Inferring phylogeny despite incomplete lineage sorting. Syst Biol. 2006;55:21–30. doi: 10.1080/10635150500354928.
4. Than C, Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Comput Biol. 2009;5:e1000501. doi: 10.1371/journal.pcbi.1000501.
5. Liu L, Yu L, Pearl DK, Edwards SV. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009;58:468–477. doi: 10.1093/sysbio/syp031.
6. Wu Y. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution. 2011.
7. Ewing GB, Ebersberger I, Schmidt HA, von Haeseler A. Rooted triple consensus and anomalous gene trees. BMC Evol Biol. 2008;8:118. doi: 10.1186/1471-2148-8-118.
8. Degnan JH, DeGiorgio M, Bryant D, Rosenberg NA. Properties of consensus methods for inferring species trees from gene trees. Syst Biol. 2009;58:35–54. doi: 10.1093/sysbio/syp008.
9. Wang Y, Degnan JH. Performance of matrix representation with parsimony for inferring species from gene trees. Stat Appl Genet Mol Biol. 2011;10:21.
10. Heled J, Drummond AJ. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 2010;27:570–580. doi: 10.1093/molbev/msp274.
11. Kubatko LS, Carstens BC, Knowles LL. STEM: Species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009;25:971–973. doi: 10.1093/bioinformatics/btp079.
12. Liu L, Pearl DK. Species trees from gene trees: Reconstructing bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 2007;56:504–514. doi: 10.1080/10635150701429982.
13. Liu L, Yu L. Estimating species trees from unrooted gene trees. Syst Biol. 2011;60:661–667. doi: 10.1093/sysbio/syr027.
14. Liu L, Yu L, Pearl DK. Maximum tree: a consistent estimator of the species tree. J Math Biol. 2010;60:95–106. doi: 10.1007/s00285-009-0260-0.
15. Mossel E, Roch S. Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE/ACM Trans Comp Biol Bioinf. 2010;7:166–171. doi: 10.1109/TCBB.2008.66.
16. Huang H, He Q, Kubatko LS, Knowles LL. Sources of error for species-tree estimation: Impact of mutational and coalescent effects on accuracy and implications for choosing among different methods. Syst Biol. 2009;59:573–583. doi: 10.1093/sysbio/syq047.
17. Liu L, Yu L, Kubatko LS, Pearl DK, Edwards SV. Coalescent methods for estimating phylogenetic trees. Mol Phylogenet Evol. 2009;53:320–328. doi: 10.1016/j.ympev.2009.05.033.
18. Degnan JH, Rosenberg N, Stadler T. The probability distribution of ranked gene trees on a species tree. Math Biosci. 2012;235:45–55. doi: 10.1016/j.mbs.2011.10.006.
19. Degnan JH, Salter LA. Gene tree distributions under the coalescent process. Evolution. 2005;59:24–37.
20. Rosenberg NA. Counting coalescent histories. J Comput Biol. 2007;14:360–377. doi: 10.1089/cmb.2006.0109.
21. Than C, Ruths D, Innan H, Nakhleh L. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. J Comput Biol. 2007;14:517–535. doi: 10.1089/cmb.2007.A010.
22. Edwards AWF. Estimation of the branch points of a branching diffusion process. J R Stat Soc Ser B. 1970;32:155–174.
23. Ross S. Introduction to Probability Models. San Diego: Academic Press; 2007.
24. Tavaré S. Line-of-descent and genealogical processes, and their applications in population genetics models. Theor Popul Biol. 1984;26:119–164. doi: 10.1016/0040-5809(84)90027-3.
25. Wakeley J. Coalescent Theory. Greenwood Village: Roberts & Company; 2008.
26. Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988;5:568–583. doi: 10.1093/oxfordjournals.molbev.a040517.
27. Rosenberg NA. The probability of topological concordance of gene trees and species trees. Theor Pop Biol. 2002;61:225–247. doi: 10.1006/tpbi.2001.1568.
28. Semple C, Steel M. Phylogenetics, vol. 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford: Oxford University Press; 2003.
29. Harel D, Tarjan RE. Fast algorithms for finding nearest common ancestors. SIAM J Comput. 1984;13:338–355. doi: 10.1137/0213024.
30. Schiever B, Vishkin U. On finding lowest common ancestors: simplification and parallelization. SIAM J Comput. 1988;17:1253–1262. doi: 10.1137/0217079.
31. Than C, Ruths D, Nakhleh L. Phylonet: A software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics. 2008;9:322. doi: 10.1186/1471-2105-9-322.
32. Drummond AJ, Rambaut A. Beast: Bayesian evolutionary analysis by sampling trees. BMC Evolut Biol. 2007;7:214. doi: 10.1186/1471-2148-7-214.
33. Allman ES, Degnan JH, Rhodes JA. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol. 2011;62:833–862. doi: 10.1007/s00285-010-0355-7.

