Skip to main content
Algorithms for Molecular Biology : AMB logoLink to Algorithms for Molecular Biology : AMB
. 2013 Oct 28;8:26. doi: 10.1186/1748-7188-8-26

Recursive algorithms for phylogenetic tree counting

Alexandra Gavryushkina 1,, David Welch 1, Alexei J Drummond 1,2
PMCID: PMC3829674  PMID: 24164709

Abstract

Background

In Bayesian phylogenetic inference we are interested in distributions over a space of trees. The number of trees in a tree space is an important characteristic of the space and is useful for specifying prior distributions. When all samples come from the same time point and no prior information available on divergence times, the tree counting problem is easy. However, when fossil evidence is used in the inference to constrain the tree or data are sampled serially, new tree spaces arise and counting the number of trees is more difficult.

Results

We describe an algorithm that is polynomial in the number of sampled individuals for counting of resolutions of a constraint tree assuming that the number of constraints is fixed. We generalise this algorithm to counting resolutions of a fully ranked constraint tree. We describe a quadratic algorithm for counting the number of possible fully ranked trees on n sampled individuals. We introduce a new type of tree, called a fully ranked tree with sampled ancestors, and describe a cubic time algorithm for counting the number of such trees on n sampled individuals.

Conclusions

These algorithms should be employed for Bayesian Markov chain Monte Carlo inference when fossil data are included or data are serially sampled.

Keywords: Ranked tree, Constraint tree, Resolution, Counting trees, Dynamic algorithms, Bayesian tree prior, Phylogenetics

Background

A phylogenetic tree is the common object of interest in many areas of biological science. The tree represents the ancestral relationships between a group of individuals. Given molecular sequence data sampled from a group of organisms it is possible to infer the historical relationships between these organisms using a statistical model of molecular evolution. At present, Bayesian Markov chain Monte Carlo (MCMC) methods are the dominant inferential tool for inferring molecular phylogenies [1].

It is a recent trend to include fossil evidence into the inference to obtain absolute estimates of divergence times [2,3]. Fossils may restrict the age of the most resent common ancestor of a subgroup of individuals. This imposes a constraint on the tree topology (the discrete component of a genealogy) and therefore reduces the space of allowable genealogies.

Another trend in phylogenetic analyses is serial (or heterochronous) sampling in which molecular data is obtained from significantly different time points and analysed together. This type of data arises most frequently with ancient DNA and rapidly evolving pathogens [4]‐[6]. In this case tip dates become a part of the genealogy.

Including serially sampled or fossil data modifies or restricts the shape of a phylogenetic tree. Little has been done to describe and classify these modified trees. In this paper, we aim to explore the new spaces formed by these trees.

A genealogy consists of discrete and continuous components — the tree topology and the divergence times. The tree topologies form a finite tree space when the number of tips is bounded. An important characteristic of this space is the number of trees in it and we aim to find an efficient way to calculate this number.

In the case that fossil data restricts the tree topology, counting the number of trees that satisfy the imposed constraints reveals how much the constraints reduce the tree space.

The number of trees arises as a constant in tree prior distributions. Typically we model the distribution of tree topologies as independent of the distribution of divergence times. The density function of the distribution of genealogies is then a product of the density function for the divergence times and the distribution function for tree topologies. A common prior on tree topologies is uniform over all allowable topologies so the distribution function is a constant that is equal to one over the number of tree topologies. When inferring tree topologies using Bayesian MCMC methods, we do not usually need to know this constant but in some cases, as described below, the absolute value of the prior distribution is of interest and the constant has to be calculated.

When fossils are used to restrict the age of internal nodes, the tree prior should accurately account for this fact. Heled and Drummond [3] introduced a natural approach for tree prior specification when fossil evidence is employed in the inference. Their method requires counting of ranked phylogenetic trees that obey a number of constraints that arise from including the fossil evidence. The construction requires calculation of the marginal density for the time of the calibration node, the node representing the most recent common ancestor of a clade which may or may not be monophyletic. For a particular location of the calibration node, or particular constraints on the tree topology, the marginal density function is the marginal density function for the divergence times weighted by the number of trees satisfying the constraints. In this case, the weight constants do not cancel in the MCMC scheme and therefore have to be calculated.

Tree counting has a long history. For phylogenetic trees, the counting problem is to find the number of all possible trees on n leaves. For some types of phylogenetic tree, there are known closed form solutions to this problem. For other types, only recursive equations have been derived. In this paper, we consider only rooted trees.

A survey of results on counting different types of rooted trees is presented in [7] where trees with different combinations of the following properties are considered: trees are either labeled (only leaves are labeled) or unlabeled, ranked or non‐ranked, and bifurcating or multifurcating. The results presented in the survey can also be found in [6,8,9].

In [10], Griffiths considered unlabeled, non‐ranked rooted trees such that interior nodes can have one child or more and the root has at least two children. Using generating functions, he derived recursive equations for counting the number of all possible such trees on n leaves with s interior nodes. In [11], Felsenstein considered partially labeled trees, i.e., a tree in which all the leaves are labeled and some interior nodes also may be labeled. He derived the recursive equations for counting the number of rooted, non‐ranked, partially labeled trees with n labeled nodes.

In this paper, we consider a number of counting problems for different classes of phylogenetic trees. First, we describe an effective way of counting the number of all possible fully ranked trees on n leaves, that is, trees on n leaves in which all internal and leaf nodes are ranked.

Second, we find the number of bifurcating trees that resolve a given multifurcating tree with n leaves. We give a solution to this problem for rooted, ranked, labeled trees and generalise the algorithm to count resolutions to fully ranked trees.

Finally, we introduce and formally describe a new type of phylogenetic tree and describe an algorithm for counting the number of all such trees on n leaves. This type of tree is important when we have a serial sample and sampled individuals can be direct ancestors of later sampled individuals. When the population size is small or the fraction of individuals sampled from the population is large, this type of tree should be included in the inference [12,13].

Serial sampling

We mainly follow the terminology from [9] for the definitions of phylogenetic trees. A tree is a finite connected undirected graph with no cycles. A rooted tree is a tree with a single node ρ designated as a root. Every rooted tree T=(V,E,ρ) imposes a partial order on V that is defined as follows: v1 T v2 if a unique simple path from the root to v2 passes through v1. So the root is the smallest element. If v1 T v2 then we say that v1 is an ancestor of v2 and v2 is a descendant of v1. A node in a rooted tree is called interior if it has descendants and a leaf if it has no descendants. The root is considered interior. Denote V the set of interior nodes of T. A node u is a parent of a node v and v is a child of u if v< T u and there is no wV such that v< T w< T u. A rooted tree is called binary if every interior node has exactly two children. It is called weakly binary if every interior node has at most two children. We have chosen this terminology to fit with the usage of “binary” in the phylogenetics literature which may not agree with that in other literatures.

Let X be a finite non‐empty set of labels. A phylogenetic X‐tree is a pair T=(T,ϕ), where T is a tree and ϕ is a bijection from X onto the set of leaves of T (we may omit X and say “tree” instead of “X‐tree” if the set of labels is not specified). The tree T is called an underlying tree or a shape of the phylogenetic tree Inline graphic and ϕ is a labeling function. If the underlying tree of Inline graphic is rooted then Inline graphic is called a rooted phylogenetic tree. In what follows, we consider only rooted trees unless explicitly stated otherwise. A phylogenetic tree is binary (weak binary) if its underlying tree is binary (weak binary). A ranked phylogenetic tree is a pair (T,h), where Inline graphic is a rooted phylogenetic tree and h is an injective function (ranking function) from the set V into the set {1,,|V|} such that v1 T v2 implies h(v1)≤h(v2) for every v1,v2V. In other words, there is a linear order on the interior nodes of T that is consistent with the partial order of T.

Definition 1

A ranked X‐tree is a binary ranked phylogenetic X‐tree.

An example of a ranked tree is given in Figure 1.

Figure 1.

Figure 1

Ranked tree. Ranked X‐tree, X={A,B,C,D,E}. The numbers on the right are values of the ranking function.

In biology, a phylogenetic tree represents the evolutionary history of a collection of sampled individuals. The collection of individuals is represented by the set X. The root of the tree is the most recent common ancestor of X and interior nodes are bifurcation events. The ranking function represents the time order of the bifurcation events. A general problem in evolutionary biology is how to reconstruct the phylogenetic tree from sequence data obtained from sampled individuals. Tackling this problem in a Bayesian framework may require counting the number of all possible histories on a sample of individuals.

When all individuals are sampled at the same time (as in Figure 1) counting tree problem has a simple solution.

Let X be a fixed label set such that |X|=n. The number of all ranked X‐trees up to isomorphism is

R(n)=n!(n1)!2n1

This formula has been derived by many authors. Proofs can be found in [6,7], or [9]. The letter R in the equation comes from the word “ranked”.

The situation is different when individuals are sampled at different times (serially sampled). In this case, we need to define another kind of phylogenetic tree in which leaves are also ranked.

Definition 2

A fully ranked (FR) X‐tree is a pair(T,h), whereInline graphicis a binary rooted phylogenetic X‐tree andh:V→{1,…,l} with|V|<l|V|is a surjective function such that

● v1 T v2 implies h(v1)≤h(v2) and

● h(v1)=h(v2) implies v1=v2 or v1,v2VV.

An example of a fully ranked X‐tree is given in Figure 2.

Figure 2.

Figure 2

Fully ranked tree. Fully ranked X‐tree. X={A,B,C,D,E}. The numbers on the right are values of the ranking function.

Before the tree is reconstructed we observe only leaves (sampled individuals) of the tree that are grouped (pre‐ranked) according to the times they were sampled. For the tree shown in Figure 2, we have two sampling times and hence two groups: A, B, and C form the first group, D and E form the second group.

Let T=(T,h) be a fully ranked X‐tree with h:V→{1,…,l}. Let m=|h(ϕ(X))|, that is, the number of sampling times. Define a pre‐ranking functionĥ from X onto {1,…,m} for tree T such that for all x1,x2X

● h(ϕ(x1))≤h(ϕ(x2)) implies ĥ(x1)ĥ(x2) and

● h(ϕ(x1))=h(ϕ(x2)) iff ĥ(x1)=ĥ(x2).

For the tree given in Figure 1, ĥ(A)=ĥ(B)=ĥ(C)=1 and ĥ(D)=ĥ(E)=2.

Let X and ĥ:X{1,,m} be fixed. We are interested in the number of all fully ranked X‐trees that haveĥ as a pre‐ranking function. Note that this number depends only on the numbers ni=|{x|ĥ(x)=i}|, the number of individuals sampled at the ith time point, not on X and ĥ directly. We denote this quantity by F(n1,…,n m ), where F stands for “fully ranked”. Then

F(n1,,nm)=i=1nmR(nm)R(i)F(n1,,nm1+i) (1)

and F(n)=R(n).

Proof

Consider a continuous process of bifurcation in which lineages may bifurcate in time or be cut and labeled (sampled). The process finishes when all lineages are cut producing a tree. The discrete structure of the tree produced by this process is a fully ranked X‐tree. It is easy to see that every fully ranked X‐tree can be obtained as a result of this process. To count the required number we can count the number of different trees which can be produced by the process if we know that after it finishes there are n i sampled individuals (i.e., cut and labeled lineages) at the ith time point, i.e., we have the sequence (n1,…,n m ).

Suppose that at the (m−1)th time point there are i lineages that are ancestral to n m individuals sampled at time m. When we look at this process backwards in time the bifurcation events become coalescence events. The number of different ways these n m lineages coalesce to i lineages is R(nm)R(i). This is the number of all possible ranked X‐trees on n m individuals but since we are not interested in the structure of the coalescent after we reach i lineages, it is divided by the number of ways in which the remaining i lineages can coalesce. Note that if coalescence patterns are different between the (m−1)th and mth time points then the trees are also different.

Further, for each of these coalescence patterns, we need to count the number of different ways these i lineages and other n1,…,n m−1 lineages can coalesce. This is where we can apply the recursion. We can consider that we also cut these i lineages at time m−1 and label them with the ranked subtrees descendant from these lineages. Then, at time m−1, we have n m−1 sampled individuals and another i sampled individuals and it remains to count the number of trees on the sequence (n1,…,n m−1+i).

Note that two trees are different if they have different numbers of lineages at time m−1. The number of additional i lineages can be between 1 and n m and we need to sum over all possible i to complete the recursion. □

We introduce a third type of tree in which sampled individuals may be direct ancestors of later sampled individuals. We call it a tree with sampled ancestors. This type of tree is not usually considered in phylogenetics since the probability of sampling a direct ancestor is often negligible. In small populations or when a large portion of the population is sampled, however, this can not be ignored.

Let T=(V,E,ρ) be a weak binary tree. Define a set V^ as follows:

V^={vV|deg(v)=1or[deg(v)=2andvis not the root]}

A rooted S‐phylogenetic X‐tree is a pair T=(T,ϕ), where T is a weak binary tree and ϕ:XV^ is a bijection.

Definition 3

A fully ranked X‐tree with sampled ancestors (FRS X‐tree) is a pair (T,h), where Inline graphicis a rooted S‐phylogenetic X‐tree andh:V→{1,…,l} is a surjective function such that

● v1< T v2 implies h(v1)<h(v2) and

● h(v1)=h(v2) implies v1=v2 or v1,v2V^,

(see Figure 3).

Figure 3.

Figure 3

FRS tree. FRS X‐tree with the labeled 1‐degree root. X={A,B,C,D,E}. The numbers on the right are values of the ranking function.

The definition of a pre‐ranking function remains the same for FRS trees. Let S(n1,…,n m ) (with S standing for ‘sampled ancestors’) denote the number of all FRS X‐trees that have the same pre‐ranking function ĥ, where ni=|{x|ĥ(x)=i}|. Then

S(n1,,nm)=i=1nmj=0min{i,nm1}ijnm1j×R(nm)R(i)S(n1,,nm1+ij) (2)

and S(n)=R(n).

Proof

Consider the same process as before with only change of sampling events. Now, at some points of time, some lineages are cut and labeled and others are only labeled but not cut.

Then this equation can be obtained as follows. We have n m individuals that are sampled at the mth time point. At time m−1, there are between 1 and n m ancestral lineages of those n m individuals, depending on the number of coalescence between times m and m−1. Let i denote the number of these lineages. Then there are R(nm)R(i) different possible coalescent patterns that can lead to this situation. Some of these i ancestral lineages may be among the individuals sampled at time m−1, i.e. lineages that are labeled but not cut at time m−1. Let j be the number of those ancestral lineages that are sampled at time m−1. There are ij ways to chose these j lineages out of i and there are nm1j possible ways to chose j sampled at time m−1 individuals that are not cut at time m−1.

Further, at time m−1, there are n m−1 sampled lineages and ij ancestral lineages that are not sampled and it remains to count the number of FRS trees on the sequence (n1,…,n m−1+ij).

Finally, we sum over all possible i and j to complete the recursion. □

Dynamic counting

Calculating the recursions (1) and (2) directly is inefficient and impractical. Here we describe a more efficient algorithm for counting fully ranked trees using these recursions. Rewrite equation (1) as

F(n1,,nm)=R(nm)i=1nmF(n1,,nm1+i)R(i)

Then instead of calculating F(n1,…,n j +α) for j∈{1,…,m−1} and α∈{0,…,n j+1+…+n m } we can calculate

Aj(α)=F(n1,,nj+α)R(α) (3)

using recurrence equations:

Aj(0)=R(nj)i=1njAj1(i)and (4)
Aj(α+1)=(nj+α)(nj+α+1)α(α+1)Aj(α)+R(nj+α+1)R(α+1)Aj1(nj+α+1). (5)

This leads to Algorithm 1 to calculate F(n1,…,n m ). Let n be the number of sampled individuals, i.e., n=i=1mni. Calculation of all the R(i) takes O(n) steps and calculation of all the A j takes at most O(n) steps. In total, the algorithm takes O(mn) steps.

Algorithm 1 Calculating the number of fully ranked trees

A similar approach leads to an O(mn2)‐time algorithm for counting FRS trees. The description of this algorithm is in Appendix Appendix 1: Algorithm for counting FRS trees.

Constraints

In phylogenetic analysis, it is common to have some limited information about the ancestors of sampled individuals. We consider two types of such information. First, we may know that a subgroup of sampled individuals is monophyletic. That means that the most recent common ancestor of the subgroup is not an ancestor of any other individual that does not belong to the subgroup. Second, we may know the relative ages of the most recent common ancestors of monophyletic subgroups. This known information imposes constraints on the space of possible phylogenetic trees representing the evolutionary history of sampled individuals. The question is how many phylogenetic trees satisfy the constraints on a group of sampled individuals?

The number of resolutions of a constraint tree

We first describe a problem for contemporaneous sampling in terms of constraint trees. We call a rooted tree multifurcated if each interior node has at least 2 children. Note that in contrast to the common terminology we assume that a binary tree is also multifurcated. If we replace the word “binar” with “multifurcated” in the definition of a ranked tree we obtain a more general class of trees.

Definition 4

A constraint X‐tree is a multifurcated ranked phylogenetic X‐tree.

An example of a constraint tree is given in Figure 4. A constraint tree represents prior information about clades and ranking. Each interior node constrains a subgroup of individuals, leaves that are descendant from this node, to by monophyletic. The most recent common ancestor of the whole group of individuals, the root node, is also regarded as a constraint. The ranking function constrains the ages of the most recent common ancestors of the monophyletic subgroups to have a specified order.

Figure 4.

Figure 4

Constraint tree. Constraint tree, labels are omitted. Subtree 2 is coloured green. It has two child nodes that are leaves, therefore, n2=2. The ancestor function for this tree is defined as f(2)=f(3)=1 and f(4)=2. A compact notation for this constraint tree is (n1,,nk,f)=(0,2,3,2,{(2,1),(3,1),(4,2)}).

We say that a ranked X‐tree T1=(T1,h1)resolves a constraint X‐tree T2=(T2,h2) if there is an isomorphic embedding of T2 into T1, i.e., there is an injective mapping f:V 2 →V 1 such that

● ϕ 1 (x)=f(ϕ 2 (x)) for each x∈X,

● 

vT2u

iff f(v)T1f(u) for each u,v∈V 2 , and

● h 2 (v)≤h 2 (u) implies h 1 (f(v))≤h 1 (f(u)) for each u,vV2.

We wish to calculate the number of ranked trees that resolve a given constraint tree T=(T,ϕ,h). It is easy to see that this number depends only on the underlying tree T and ranking function h, but does not depend on the labeling function ϕ or the label set X .

We now introduce some notation in order to define recursive equations. We label interior nodes according to their ranks such that node i is a node v such that h(v)=i. A subtree induced by node i and its children is called subtree i. Child nodes in a subtree may be leaves in the initial tree. Let n i ≥0 denote the number of such child nodes in subtree i. Let f:{2,…,k}→{1,…k−1} be the parent function on interior nodes, i.e., f(i)=j whenever j is a parent to i. When k=1, i.e., there is only one interior node, f=∅. See Figure 4 for an example of introduced notation. Note that a tuple (n 1 ,…,n k ,f) completely defines a pair (T,h).

Let R r (n 1 ,…,n k ,f) be the number of ranked trees resolving a constraint tree defined by the tuple (n 1 ,…,n k ,f). The superscript r stands for “resolution”. Then the following equations hold.

Rr(2,)=1, (6)
Rr(n1,,nk1,2,f)=iCni2Rr(n1,,ni1,,nk1,2,f)+Rr(n1,,nf(k)+1,,nk1,f|{2,,k1})and (7)
Rr(n1,,nk,f)=iCni2Rr(n2,,ni1,,nk,f),ifnk>2, (8)

where C is the collection of nodes that have more than 2 children and at least 2 of them are leaves, i.e., C={i<k | n i ≥2 and n i i >2} and α i =|{a| f(a)=i}|. Note that n i i is the number of children of node i.

Proof

When a constraint tree has 2 leaves, it is unique and is resolution of itself. So Equation (6) is trivial. To explain the main sum in Equation (7) and (8) we consider the constraint tree which is defined by (3,3,{(2,1)}) and shown in Figure 5, left. The last interior node of a resolving tree (that is, the interior node with the highest rank or the furthest node from the root) is either a parent to leaves in subtree 1 or leaves in subtree 2. Suppose it is the first case (see Figure 5, centre). Since leaves have distinct labels from X, there are 32 ways to chose two leaves that are children of that last node. We can partition all the resolving trees for which the last node is in subtree 1 in 32 groups. The number of trees in each group is the number of trees that resolve a constraint tree defined by (2,3,{(2,1)}) and shown on the right of Figure 5. A similar argument holds if the last node in a resolving tree is a parent to nodes from subtree 2.

Figure 5.

Figure 5

Recursive approach. The last interior node in a resolving tree locates in subtree 2.

So in the general case, there are k subtrees and if the last node is in subtree i we have ni2 ways to choose two lineages that coalesce and then we should count the number of trees resolving the tree defined by (n 1 ,…,n i −1,…,n k ,f). Note that in the example above, we consider the tree with more than 2 leaves in each subtree. However, the last interior node of a resolving tree can not be in subtree i for i<k if there is not enough leaves in this subtree. This can happen either if there are less than 2 leaves in subtree i or if there are 2 leaves in subtree i and node i has only these two leaves as its children. Both cases imply that any parent to leaves of subtree i in a resolving tree has a lesser rank than the rank of node k. This explains why we sum only over the elements of the set C.

Finally, we should consider one more case which explains why there is one more summand in equation (7). If the last node in a constraint tree has only two children, i.e., there are 2 leaves in subtree k, then there is one more group of resolving trees, the group that consists of resolving trees that have this node as the last node. □

Dynamic counting

We will calculate R r (n 1 ,…,n k ,f) for the corresponding constraint tree. In order to find R r (n 1 ,…,n k ,f), at each step s, we will calculate numbers Rr(x1,,xt,f|t1) with itxi=s, t −1 ={2,…,t}, and t≤k. Note that we do not have to calculate all such numbers. To determine which numbers are required we define two upper triangular matrices m and M of size k×k.

Suppose we draw a horizontal line which is strictly below the line that passes through node j and strictly above the line that passes through node j+1 (or all the leaves if j=k). Then m i,j is the minimal possible number of intersections of this line with branches of subtree i in a resolving tree and M i,j is the maximal possible number. An example is given in Figure 6.

Figure 6.

Figure 6

Defining matrices m andM. Two trees that resolve a constraint tree from Figure 4 (only resolutions of subtree 2 are shown) and three ways to draw a horizontal line. The yellow lines correspond to the minimal number of intersections and the red line, to maximal. Thus, m2,2=2 and M2,2=3. We can also note that m1,2=M1,2=1.

Let a i,j =|{x≤j ∣ f(x)=i}| for i≤j. So a i,j is the number of children of node i with ranks at most j. Then

Mi,j=ni+αiai,jandmi,j=2ifai,j=0,1ifai,j>0andMi,j>0,0otherwise.

Let t≤k and x1,,xtN. We call a tuple (x1,,xt,f|t1) eligible if m i,t ≤x i M i,t for 1≤it.

We now turn to Algorithm 2 to count resolutions. At each step s≤n, we construct a set S s . A unique element of S n is R r (n 1 ,…,n k ,f) and calculating elements of S s only requires elements of S s−1 .

Algorithm 2 Calculating the number of resolutions of a constraint tree
Proposition 1

When k is fixed Algorithm 2 does at most O(n k ) steps.

Proof

The algorithm does O(k) steps for each eligible tuple and, since we assume k is a constant, it is O(1). For given j there are

i=1j(Mi,jmi,j+1)

eligible tuples of size j. Since M i,j m i,j +1≤n i i −1, for the total number of eligible tuples we have

j=1ki=1j(Mi,jmi,j+1)i=1k(n1+α11)··(ni+αi1)<k(n1+k1)··(nk+k1)k(nk+k1)k=O(nk)

The number of resolutions of a fully ranked constraint tree

We can generalise the results of the previous section to fully ranked trees. Now we replace the word “binary” with “multifurcated” in the definition of a fully ranked tree to get

Definition 5

A fully ranked constraint X‐tree is a pair (T,h), whereInline graphic is a multifurcated rooted phylogenetic X‐tree and h is a function such that h:V→{1,…,l}, where |V|<l|V|, and

● v 1 T v 2 implies h(v 1 )≤h(v 2 ),

● h(v 1 )=h(v 2 ) implies v 1 =v 2 or v1,v2VV.

We say that a fully ranked X‐tree T1=(T1,h1)resolves a fully ranked constraint X‐tree T2=(T2,h2) if there is an isomorphic embedding of T2 into T1, i.e., there is an injective mapping f:V 2 →V 1 such that

● ϕ 1 (x)=f(ϕ 2 (x)) for each x∈X,

● 

vT2u

iff f(v)T1f(u) for each u,v∈V 2 ,

● h 2 (v)≤h 2 (u) implies h 1 (f(v))≤h 1 (f(u)) for each u,v∈V 2 , and

● h 2 (v)=h 2 (u) iff h 1 (f(v))=h 1 (f(u)) for each u,v∈V 2 .

The problem is to count the number of fully ranked resolutions of a fully ranked constraint tree.

Let T=(T,h) be a fully ranked constraint tree with h:V→{1,…,l} and|V|=k. Again we label interior nodes with numbers from {1,…,k}. However, labels and ranks of interior nodes do not necessary coincide now. Let r:{1,,k}h(V) be an injective increasing function which maps the kth interior node to its rank. Note that r(1) is always equal to 1 because the root node always has rank 1 and that r(k)<l because only leaves can have rank l. Now, node i is the node that has rank r(i) and subtree i is the subtree which is induced by node i and its children. Since leaves in subtrees are now ranked, we need more parameters to encode the tree. Let n= [ n i,j ] be a matrix of sizek×l, wheren i,j is the number of leaves of rankj in subtreei or the number of children of nodei with rankj. Note that if nodei is a parent of nodej thenn i,r(j)=1 andn i,x =0 forxr(j) and this means that the parent function is uniquely defined by the matrix n and functionr and the number of resolutions ofT depends only on (n,r).

If a constraint tree has only 2 leaves then there is only 1 tree resolving it:

Fr((2),r0)=Fr((1,1),r0)=1

withr 0 mapping 1 to 1.

For the main recursion we need to determine the location of the last interior node in a resolving tree. As before, this node can be a parent to leaves in different subtrees. Also, since leaves now may have different ranks, the last interior node in a resolving tree may be ranked in different ways with respect to ranks of the leaves. LetN c,x denote the number of children of nodec with ranks at leastx, that is, Nc,x=j=xlnc,j. For eachp∈{r( k )+1,…,l}, we define a setC p of candidate subtrees in which the last node can be placed at levelp, i.e. between the (p−1)th andpth time points.

Cp={cNc,p2andNc,r(c)+1>2}

See Figure 7 for an example of subtree where the last node can be placed between the (p−1)th andpth time points.

Figure 7.

Figure 7

Location of a new node between the (p−1)‐th andp‐th time points. A new node can be placed in subtreec between the (p−1)‐th andp‐th time points if the number of green branches is greater or equal to 2 and the number of all branches in subtreec is greater than 2.

For eachp andc such thatcC p , there is a distinct group of resolving trees. The quantity of trees in this group is equal to the number of resolutions of a constraint tree which is defined by matrix nc,p=[ni,jc,p] of sizek×p such that

– 

ni,jc,p=ni,j

for 1≤ik andj<p,

– 

nc,pc,p=Nc,p1

,

– 

ni,pc,p=Ni,p

forik andic.

If nodek has only 2 children then, as before, there is one more group of resolving trees. This group consists of the resolving trees in which the last interior node coincides with nodek. Letn=[ni,j] be a matrix of size (k−1)×r(k) such that

– 

ni,j=ni,j

for 1≤ik andj<r(k),

– 

ni,r(k)=Ni,r(k)

for 1≤ik−1.

This matrix defines a constraint tree for which the number of resolutions is equal to the number of trees in the last group.

Then the main recursion is as follows:

Fr(n,r)=p=r(k)+1lcCpNc,p2Fr(nc,p,r)+Fr(n,r|{1,,k1})ifNk,r(k)+1=2, (9)
Fr(n,r)=p=r(k)+1lcCpNc,p2Fr(nc,p,r)otherwise, (10)

Using these equations, we can calculateF r (n,r) forO(m 2 nk) steps, wherem is the number of sampling points, i.e.,m=lk. The calculation is described in Appendix Appendix 2: Algorithm for counting fully ranked resolutions of a fully ranked constraint tree.

Conclusions

In Table 1, we summarised the complexity (upper bounds) of the counting algorithms. We can see that counting fully ranked and FRS trees is reasonably fast, particularly when the number of sampling points is small. Counting of resolutions of a constraint tree is an expensive procedure but for smallk it remains possible. In practice,k is typically small so this algorithm will be of practical use. Counting resolutions of an FRS constraint tree, using the same methods, has a greater cost and is less elegant, but its practicality and bounds on its complexity remain to be assessed.

Table 1.

The complexity of algorithms

  No constraints
( k =1)
k constraints
Contemporaneous
 
 
sampling
O(n)
O(n k )
(m=1)
 
 
Serial sampling
 
 
with no sampled
O(mn)
O(m 2 n k )
ancestors
 
 
Serial sampling
 
 
with sampled
O(mn 2 )

ancestors    

The table summaries the complexity of the counting algorithms, wheren is the sample size,k is the number of constraints, andm is the number of sampling time‐points. We assume thatk is fixed.

These algorithms can be implemented in software for phylogenetic analysis that involves serial sampling scheme or limited prior knowledge about ancestors of particular clades for calculating tree prior distributions.

Appendix 1: Algorithm for counting FRS trees

Rewrite equation (2) as follows:

S(n1,,nm)=i=0nmA(i,nm1,nm)S(n1,,nm1+i)

where A(i,a,b)=x=0min{a,bi}R(b)R(i+x)i+xxax. Then we can calculateS(n 1 ,…,n m ) recursively using Algorithm 3. We use the notation Nj=i=jmni.

Algorithm 3 Calculating the number of FRS trees

At each stepj>1 of the algorithm, we calculateS(n 1 ,…,n j +α) for 0≤αN j+1. We skip stepj=1 and do not calculateS( n 1 +α) for 0≤αN 2 becauseS(x)=R(x) and we have already calculated all the necessaryR(x). Further, for calculating eachS(n 1 ,…,n j +α), we need to calculate the coefficientsA and this is the most expensive part of the algorithm. So, at each step j >1, we need to calculateA(i, n j −1,n j +α) for 0≤in j +α and 0≤αN j+1.

First, we calculate these values forα=0. Denote

B(i,a,b,x)=R(b)R(i+x)i+xxax,forxaandi+xb

then

A(i,a,b)=x=0min{a,bi}B(i,a,b,x)

where 0≤ib,a=n j−1, andb=n j . Wheni is fixed, calculation ofA(i,n j−1,n j ) requires values ofB(i,n j−1, n j , x ) for 0≤ x m i n {n j−1,n j i }. Rewrite this asB(n j β,n j−1,n j ,x) for 0≤xmin{n j−1,β} withβ= n j i. Now we can calculate these values for 0≤βn j and, hence, for 0≤in j using recursive equations:

B(nj,nj1,nj,0)=1 ()
B(nj(β+1),nj1,nj,x)=(nj(β+1)+x)(njβ)B(njβ,nj1,nj,x)2 (11)
B(nj(β+1),nj1,nj,(β+1))=(njβ)(nj1β)B(njβ,nj1,nj,β)(β+1)2 (12)

We use equation (11) for 0≤xmin{ n j −1, β }. If m i n { n j −1, β +1}≠ m i n { n j −1, β } and therefore m i n { n j −1, β +1}= m i n { n j −1, β }+1= β +1 then we also use equation (12). The cost of this calculation is dominated by the number of the summands B ( n j β , n j −1, n j , x ) for 0≤ x m i n { n j −1, β } and 0≤ β n j . This number is O(nj2).

Now we need to calculate A (0, n j −1, n j + α ),…, A ( n j + α , n j −1, n j + α ) for 1≤ α N j +1. Having A (0, n j −1, n j + α ),…, A ( n j + α , n j −1, n j + α ) calculated (note that we have already calculated these values for α =0), we can calculate A (0, n j −1, n j +( α +1)),…, A ( n j +( α +1), n j −1, n j +( α +1)) using equations

A(i,a,b+1)=(b+1)b2A(i,a,b)+b+1iab+1iifbi<a,(b+1)b2A(i,a,b)ifabiandA(b+1,a,b+1)=1 (13)

where 0≤ i b , a = n j −1, and b = n j + α .

Moreover, for each α , we can optimize calculation of the second summands in the first case of equation (13) using recursion

nj+αi+1nj1nj+α(i+1)=(nj+αi)2(i+1)(nj1njα+i+1)nj+αinj1nj+αi (14)

To apply this recursion we need an initial value. This value depends on n j −1, n j , and α and, since only the first case of equation (13) contains the second summand, it is not necessary the value of this summand for i =0. There are 3 cases for calculation of all the initial values that are necessary at step j .

Case 1: N j < n j −1. That means that we use the first case of equation (13) for all α . In this case, for all α , the initial value for recursion (14) is the value of the second summand in (13) when i =0. So we need to calculate nj+α0nj1nj+α+0 for 1≤ α N j +1 and those are nj1nj+1,,nj1nj+Nj+1. This takes O ( N j ) steps.

Case 2: n j + α 0 = n j −1 for some α 0 ∈{0,…, N j +1}. If α 0 >0 then, for all α α 0 , we use the first case of (13) and therefore we need to calculate nj1nj+1,,nj1nj+α0, this takes O ( α 0 ) steps. For all α > α 0 , we use the second case first and, starting from i = α α 0 , we use only the first case. So we need to calculate nj+ααα0nj1nj+α(αα0) for α ∈{ α 0 +1,…, N j +1}, which arenj1+11,,nj1+(Nj+1α0)(Nj+1α0). This takes O ( N j +1 α 0 ) steps. In total, we have O ( N j +1) steps.

Case 3: n j −1< n j . That means that, for all α , we use the second case first and, starting from i = n j + α n j −1, we use only the first case. Calculatenj+1nj1,,nj+Nj+1nj1 for O ( n j −1+ N j +1) steps.

Each case costs at most O ( N j ). Provided that the initial values are calculated, the cost of the rest calculation at step j is dominated by the number of the coefficients A ( i , n j −1, n j + α ) for 0≤ i n j + α and 1≤ α N j +1 and it isO(Nj2). Summing upO(Nj2)X, O ( N j −1), and O(Nj2) gives us the cost of each step j , which is O(Nj2). Since 1< j m and calculation of R ( i ) for 0≤ i N 1 = n takes O ( n ), the algorithm does O ( m n 2 ) steps in total.

Appendix 2: Algorithm for counting fully ranked resolutions of a fully ranked constraint tree

The algorithm for counting resolutions of a constraint tree requires a few changes to count resolutions of a fully ranked constraint tree. Recall that, at each step s , we calculated the set S s which consists of the numbers of resolutions of intermediate trees with s leaves. To construct S s , for each element of S s −1, we proposed a collection of tuples that may define intermediate trees with s leaves. To accept eligible tuples we defined two matrices m and M. The general scheme for counting resolutions of a fully ranked constraint tree is the same. However, we need to define matrices m and M for a fully ranked constraint tree and describe the procedure of proposing new tuples. The procedure becomes more technical because of additional ranking.

We will calculate S r (n, r ) for the corresponding fully ranked constraint tree T. Let

ai,j=|{xnodeiis a parent of nodexandr(x)j}|

for 1≤ i k and r ( i )≤ j l −1, i.e., a i , j is the number of interior nodes that are children of node i and have rank at most j . Define two matrices m and M of size k ×( l −1) such that

Mi,j=Ni,j+1andmi,j=2ifai,j=0,1ifai,j>0andMi,j>0,0otherwise.

As before, if we consider a horizontal line that is strictly between the horizontal line that passes through the nodes of rank j and the horizontal line that passes through the nodes of rank ( j +1) then these matrices determine the minimal and maximal possible numbers of intersections of this line with branches of subtree i .

Let x= [ x i , j ] be a matrix of size t × q , where t k and q l , and t={1,…, t }. A tuple (x, r |t) is eligible if

–  x i , j x= n i , j for 1≤ i t and 1≤ j < q , and

– 

mi,q1xqiMi,q1

for 1≤ i t .

Having an intermediate tree with s −1 leaves, we need to consider all the possible ways to transform this tree to a tree with s leaves. First we need to add a new leaf to some subtree. We give the highest rank to this leaf. After that we may add new time points below the last time point. If we add new time points, we should rerank the leaves with the highest rank such that the new tree has enough leaves at each time point. Let x= [ x i , j ] be a matrix of size t × q for t k and q l . This matrix represents an intermediate tree with s −1 leaves. Suppose we add a new leaf to subtree i 0 , where 1≤ i 0 t . Let q 0 be the number of time points in a new tree and, therefore, q q 0 r ( t +1) (or q q 0 l if t = k ). Define a functiona d d o n e ((x,r| t ),i 0 ,q 0 ) which returns a tuple (x,r|t), wherex=[xi,j] is a matrix of sizet×q 0 such that

– 

xi0,q0=xi0,qj=qq01ni0,j+1

,

– 

xi,q0=xi,qj=qq01ni,j

for i i 0 ,

– 

xi,j=xi,j

for 1≤i t andj<q, and

– 

xi,j=ni,j

for 1≤it andqj<q 0 .

If this tuple is eligible then it represents a new intermediate tree with s leaves.

At each steps of the algorithm, for each tuple (x,r| t ) such thatF r (x,r| t )∈S s−1, we need to propose a collection of new tuples. The first part of this procedure is Algorithm 4. To stop the procedure, when we recognise that a new tree can not have x time points, we use a predicateext(x) which is true if we can extend the number of time points in a new tree tox. The number of time points in a new tree can not always be extended to any arbitrary number because there may not be enough leaves of the highest rank to rerank them in such a way that there will ben i,j leaves of each new rankj. Soext(q) is always true because the tree to which we add a new leaf already hasq time points.

Algorithm 4 Working with a tuple (x,r| t ) such thatF r (x, r | t )∈ S s−1 (part 1)

If t <k then the proposing procedure contains the second part. In this case, we also can increase the number of leaves in an intermediate tree by adding a new subtree. That means that one of the leaves of the tree becomes a parent to two new leaves. Again we may add new time points. Let q 0 be the number of time points in a new tree and, therefore, r ( t +1)< q 0 r ( t +2) (or r ( t +1)< q x 0 l when t +1= k ). We define another function a d d c o n s t r a i n t ((x, r | t ), q 0 ) which returns a tuple (x,r|t+1), where x=[xi,j] is a matrix of size ( t +1)× q 0 such that

– 

xt+1,q0=2j=r(t+1)+1q01nt+1,j

,

– 

xt+1,j=nt+1,j

for 1≤ j < q 0 ,

– 

xi,q0=xi,qj=qq01ni,j

for 1≤ i t ,

– 

xi,j=xi,j

for 1≤ i t and j < q , and

– 

xi,j=ni,j

for 1≤ i t and q j < q 0 .

This tuple defines a new tree if it is eligible. The second part of the procedure is Algorithm 5. The values of q 0 and e x t ( q 0 ) are as after running Algorithm 4.

Algorithm 5 Working with a tuple (x, r | t ) such that F r (x, r | t )∈ S s −1 (part 2)

Finally, as before, the main algorithm calculates setsS s recursively for 0≤sn, wheren is the number of leaves in a constraint tree T, and a unique element ofS n isS r (n,r).

Let m be a number of sampling points, i.e., m =lk.

Proposition 2

Ifk is fixed then the algorithm does at mostO(m) steps for each eligible tuple.

Proof

Let (x r | t ) be an eligible tuple. First, having all the necessary summands for equations (9) and (10), we need to calculate F r (x, r | t ). This takes at most O ( t ×[ r ( t +1)− r ( t )]) steps. Since we assume k is a constant and since t k and r ( t +1)− r ( t )≤ m , it is O ( m ). Second, we need to perform the procedure of proposing new tuples for this tuple. The most expensive part here is calculation of the last columns of matrices x because it involves calculation of the sums j=qq01ni,j. Note that, storing the values of xi,qj=qq01ni,j for 1≤ i t at each step q 0 , we can optimise these calculations. Then it takes O ( t ×( q r ( t +2)) steps and it is O ( m ) again. □

Proposition 3

If k is fixed then the algorithm has O(mnk) time complexity.

Proof

Let i x be such a number that r ( i x )≤ x < r ( i x +1). Then the number of eligible tuples is

j=1l1i=1ij(Mi,jmi,j+1)

Note that

Mi,jMi,j+1Ni,j+1Ni,r(i)+1+1=bi+1

2where b 1 +…+ b k = n + k −1. Then

j=1l1i=1ij(Mi,jMi,j+1)<(l1)(b1+1)(bk+1)<(l1)(nk)k=O(mnk)

From this and Proposition 2, it follows that the algorithm does at mostO(m 2 n k ) steps. □

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The authors equally contributed to conceive the work. AG drafted the manuscript and DW and AJD revise it. All authors read and approved the final manuscript.

Contributor Information

Alexandra Gavryushkina, Email: sasha.gavryushkina@auckland.ac.nz.

David Welch, Email: david.welch@auckland.ac.nz.

Alexei J Drummond, Email: alexei@cs.auckland.ac.nz.

Acknowledgements

AJD was partially funded by a Rutherford Discovery Fellowship from the Royal Society of New Zealand.

References

  1. Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol Biol Evol. 1997;14(7):717–724. doi: 10.1093/oxfordjournals.molbev.a025811. [DOI] [PubMed] [Google Scholar]
  2. Yang Z, Rannala B. Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Mol Biol Evol. 2005;23:212–226. doi: 10.1093/molbev/msj024. [DOI] [PubMed] [Google Scholar]
  3. Heled J, Drummond A. Tree priors for relaxed phylogenetics and divergence time estimation. Syst Biol. 2012;61:138–149. doi: 10.1093/sysbio/syr087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Rodrigo AG, Felsenstein J. The Evolution of HIV. Baltimore: Johns Hopkins Univ Press; 1999. [Google Scholar]
  5. Drummond A, Pybus O, Rambaut A, Forsberg R, Rodrigo A. Measurably evolving populations. Trends Ecol Evol. 2003;18:481–488. doi: 10.1016/S0169-5347(03)00216-7. [DOI] [Google Scholar]
  6. Felsenstein J. Inferring Phylogenies. Sunderland: Sinauer Associates; 2004. [Google Scholar]
  7. Murtagh F. Counting dendograms: a survey. Discrete Appl Math. 1984;7:191–199. doi: 10.1016/0166-218X(84)90066-0. [DOI] [Google Scholar]
  8. Gordon AD. A review of hierarchical classification. J R Stat Soc A. 1987;150(2):119–137. doi: 10.2307/2981629. [DOI] [Google Scholar]
  9. Semple C, Steel M. Phylogenetics. New York: Oxford University Press; 2003. [Google Scholar]
  10. Griffiths RC. Counting genealogical trees. J Math Biol. 1987;25:423–431. doi: 10.1007/BF00277166. [DOI] [PubMed] [Google Scholar]
  11. Felsenstein J. The number of evolutionary trees. Syst Zool. 1978;23:27–33. [Google Scholar]
  12. Stadler T. Sampling‐through‐time in birth‐death trees. J Theor Biol. 2010;267(3):396–404. doi: 10.1016/j.jtbi.2010.09.010. [DOI] [PubMed] [Google Scholar]
  13. Stadler T, Kouyos RD, von Wyl V, Yerly S, Böni J, Bürgisser P, Klimkait T, Joos B, Rieder P, Xie D, Günthard HF, Drummond A, Bonhoeffer S. the Swiss HIV Cohort Study. Estimating the basic reproductive number from viral sequence data. Mol Biol Evol. 2012;29:347–357. doi: 10.1093/molbev/msr217. [DOI] [PubMed] [Google Scholar]

Articles from Algorithms for Molecular Biology : AMB are provided here courtesy of BMC

RESOURCES