Enumeration of Ancestral Configurations for Matching Gene Trees and Species Trees

Filippo Disanto; Noah A Rosenberg

doi:10.1089/cmb.2016.0159

2017 Sep 1;24(9):831–850.

PMCID: PMC5610458 PMID: 28437136

Abstract

Given a gene tree and a species tree, ancestral configurations represent the combinatorially distinct sets of gene lineages that can reach a given node of the species tree. They have been introduced as a data structure for use in the recursive computation of the conditional probability under the multispecies coalescent model of a gene tree topology given a species tree, the cost of this computation being affected by the number of ancestral configurations of the gene tree in the species tree. For matching gene trees and species trees, we obtain enumerative results on ancestral configurations. We study ancestral configurations in balanced and unbalanced families of trees determined by a given seed tree, showing that for seed trees with more than one taxon, the number of ancestral configurations increases for both families exponentially in the number of taxa n. For fixed n, the maximal number of ancestral configurations tabulated at the species tree root node and the largest number of labeled histories possible for a labeled topology occur for trees with precisely the same unlabeled shape. For ancestral configurations at the root, the maximum increases with $k_0^n$, where $k_0 \approx 1.5028$ is a quadratic recurrence constant. Under a uniform distribution over the set of labeled trees of given size, the mean number of root ancestral configurations grows with $\sqrt{3/2}\,(4/3)^n$ and the variance with $\sim 1.4048\,(1.8215)^n$. The results provide a contribution to the combinatorial study of gene trees and species trees.

Keywords: combinatorics, gene trees, phylogenetics, species trees

1. Introduction

Investigations of the evolution of genomic regions along species tree branches have generated new combinatorial structures that can assist in studying gene trees and species trees (Maddison, 1997; Degnan and Salter, 2005; Than and Nakhleh, 2009; Degnan et al., 2012; Wu, 2012). Among these structures are ancestral configurations, structures that for a given gene tree topology and species tree topology describe the possible sets of gene lineages that can reach a given node of the species tree (Wu, 2012).

Ancestral configurations represent the set of objects over which recursive computations are performed in a fundamental calculation for inference of species trees from information on multiple genetic loci: the evaluation of gene tree probabilities conditional on species trees (Wu, 2012). Because of the appearance of ancestral configurations in sets over which sums are computed [e.g., Eq. (7) of Wu (2012)], solutions to enumerative problems involving ancestral configurations contribute to an understanding of the computational complexity of phylogenetic calculations.

Under the assumption that a gene tree and a species tree have a matching labeled topology t, we examine the number of ancestral configurations that can appear at the nodes of the species tree. Extending results of Wu (2012), whose appendix reported the number of ancestral configurations for caterpillar species trees and established a lower bound for completely balanced species trees, we study the number of ancestral configurations when t belongs to families of trees characterized by a balanced or unbalanced pattern and a seed tree. As a special case, we derive upper and lower bounds on the number of ancestral configurations possessed by matching gene trees and species trees of given size. Finally, we study the mean and the variance of the number of ancestral configurations when t is a random labeled tree of given size selected under a uniform distribution.

2. Preliminaries

We study ancestral configurations for rooted binary labeled trees. We start with some definitions and preliminary results. In Section 2.1, we recall basic properties of rooted binary labeled trees. In Section 2.2, we recall properties of generating functions that will be used to derive some of our enumerative results. Following Wu (2012), in Section 2.3, we define ancestral configurations, and we determine a recursive procedure to compute their number for matching gene trees and species trees at a given species tree node. We then relate the total number of ancestral configurations in a tree to the number of ancestral configurations at the root of the tree.

2.1. Labeled topologies

A labeled topology, or tree for short, of size $|t| = n$ is a bifurcating rooted tree with n labeled taxa (Fig. 1A). We assume without loss of generality a linear (alphabetical) order among the set of possible labels for the taxa of a tree. A tree of size n has leaves labeled using the first n labels in this order. Given two trees t₁ and t₂, we write $t_1 \cong t_2$ and say that t₁ is isomorphic to t₂ when, removing labels at their taxa, t₁ and t₂ share the same unlabeled topology. The set of trees of size n is denoted by T_n, and $T = \bigcup_{n \ge 1} T_n$ denotes the set of all trees of any size. The number of trees of size $n \ge 2$ can be computed as $|T_n| = (2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ (Felsenstein, 1978), which can be rewritten for $n \ge 1$ as

$$|T_n| = \frac{(2n-2)!}{2^{n-1}\,(n-1)!}. \tag{1}$$

The exponential generating function associated with the sequence $|T_n|$ is defined as

$$T(z) = \sum_{t \in T} \frac{z^{|t|}}{|t|!} = \sum_{n \ge 1} \frac{|T_n|\, z^n}{n!}, \tag{2}$$

and it is given by (Flajolet and Sedgewick, 2009, Example II.19)

$$T(z) = 1 - \sqrt{1 - 2z}. \tag{3}$$

Throughout the article, most of our results are purely combinatorial. Where a probability distribution on the set of labeled topologies of a given size is needed, we assume a uniform probability distribution over the set of trees of given size.

2.2. Exponential growth and analytic combinatorics

Following Flajolet and Sedgewick (2009), a sequence of non-negative numbers a_n is said to have exponential growth kⁿ or, equivalently, to be of exponential order k when

$$\limsup_{n \to \infty} a_n^{1/n} = k.$$

This relationship can be rephrased as $a_n = k^n s(n)$, where s is a subexponential factor, that is, $\limsup_{n \to \infty} s(n)^{1/n} = 1$. By these definitions, a sequence a_n grows exponentially in n when its exponential order strictly exceeds 1.

The exponential order of a sequence gives basic information about its speed of growth and enables comparisons with other sequences. In particular, from the definition, it follows that if $a_n$ has exponential order k_a and $b_n$ has exponential order $k_b > k_a$, then the sequence of ratios $a_n / b_n$ converges to 0 exponentially fast as $n \to \infty$. If two sequences $a_n$ and $b_n$ have the same exponential growth, then we write $a_n \bowtie b_n$.

We are interested in the exponential growth of several increasing sequences of non-negative integers. Several results will be obtained through techniques of analytic combinatorics [see Sections IV and VI of Flajolet and Sedgewick (2009)]. The entries of a sequence of integers $(a_n)_{n \ge 0}$ can be interpreted as the coefficients of the power series expansion at $z = 0$ of a function $A(z) = \sum_{n \ge 0} a_n z^n$, the generating function of the sequence. Considering z as a complex variable, under suitable conditions, there exists a general correspondence between the singular expansion of the generating function $A(z)$ near its dominant singularity (the one nearest to the origin) and the asymptotic behavior of the associated coefficients a_n. In particular, the exponential order of the sequence $(a_n)$ is given by the inverse of the modulus of the dominant singularity of $A(z)$. For instance, the exponential order of the sequence $|T_n|/n!$, with $|T_n|$ as in Equation (1), is 2 because $z = 1/2$ is the dominant singularity of the associated generating function [Eq. (3)]. In other words, $|T_n|/n!$ increases with a subexponential multiple of $2^n$ as n becomes large.
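As a numerical illustration of the definition of exponential order, the following Python sketch evaluates $(|T_n|/n!)^{1/n}$ from the standard count $|T_n| = (2n-2)!/[2^{n-1}(n-1)!]$. The function names are hypothetical and serve only this illustration. The values approach 2 slowly, as expected when a subexponential factor is present.

```python
from math import factorial


def num_labeled_topologies(n: int) -> int:
    """|T_n| = (2n-2)! / (2^(n-1) (n-1)!): rooted binary labeled trees with n leaves."""
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))


def nth_root_estimate(n: int) -> float:
    """(|T_n| / n!)^(1/n); by the definition above this tends to the exponential order."""
    return (num_labeled_topologies(n) / factorial(n)) ** (1.0 / n)


# The estimates increase toward 2 as n grows.
for n in (10, 100, 500):
    print(n, round(nth_root_estimate(n), 4))
```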

2.3. Gene trees, species trees, and ancestral configurations

In this section, we define the object on which our study focuses: the ancestral configurations of a gene tree G in a species tree S. Ancestral configurations have been introduced by Wu (2012). In our framework, where exactly one gene lineage has been selected from each species, we assume G and S to have the same labeled topology t.

2.3.1. Ancestral configurations

Suppose R is a realization of a gene tree G in a species tree S, where we focus on the case of $G = S = t$ (Fig. 1). In other words, R is one of the evolutionary possibilities for the gene tree G on the matching species tree S. Viewed backward in time, for a given node k of t, consider the set $C(k, R)$ of gene lineages (edges of G) that are present in S at the point right before node k.

As in Wu (2012), the set Inline graphic is called the ancestral configuration of the gene tree at node k of the species tree. Taking the tree t depicted in Figure 1A and considering the realization R₁ of the gene tree in the species tree as given in Figure 1B, we see that the gene lineages a, b, and are those present in the species tree at the point right before the root node m. The set Inline graphic is thus the ancestral configuration of the gene tree at node m of the species tree. Similarly, the ancestral configuration of the gene tree at node of the species tree is the set of gene lineages . In Figure 1C, where a different realization R₂ of the same gene tree is depicted, the ancestral configuration at the root m of the species tree is the set of gene lineages Inline graphic . The ancestral configuration at node is .

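Let $\mathcal{R}(G, S)$ be the set of possible realizations of the gene tree $G = t$ in the species tree $S = t$. For a given node k of t, by considering all possible elements $R \in \mathcal{R}(G, S)$ and writing $C(k, R)$ for the ancestral configuration at node k under realization R, we define the set

$$C(k) = \{\, C(k, R) : R \in \mathcal{R}(G, S) \,\} \tag{4}$$

and the number

$$c_k = |C(k)|. \tag{5}$$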

Thus, Inline graphic corresponds to the number of different ways the gene lineages of G can reach the point right before node k in S, when all possible realizations of the gene tree G in the species tree S are considered. For instance, taking t as in Figure 1A, we have , , and

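Note that for two different realizations $R_1, R_2 \in \mathcal{R}(G, S)$ and an internal node k, we do not necessarily have $C(k, R_1) \ne C(k, R_2)$: distinct realizations can produce the same configuration at k.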

For each internal node k, our definition of ancestral configuration specifically excludes as a possibility the case in which all gene tree lineages descended from node k have coalesced at species tree node k so that the configuration would reduce to the single lineage $\{k\}$. Each configuration at node k is considered at the point right before node k in the species tree, and there is thus no time for the gene lineages from the left subtree of k to coalesce with those from the right subtree of k. Our definition is identical to that of Wu (2012), with the exception that we say that a leaf or 1-taxon tree has 0 ancestral configurations, whereas Wu assigns these cases 1 ancestral configuration.

Because we assume gene tree G and species tree S have the same labeled topology t, the set $C(k)$ and the quantity $c_k$ defined in Equations (4) and (5) depend only on node k and tree t. In what follows, we use the term configuration at node k of t to denote an element of $C(k)$. The next result provides a recursive procedure for calculating the number $c_k$ at a given node k of t.

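Proposition 1 Given a tree t with $|t| \ge 2$, the number $c_r$ of possible configurations at the root r of t can be recursively computed as

$$c_r = (c_{r_\ell} + 1)(c_{r_r} + 1), \tag{7}$$

where $r_\ell$ (resp. r_r) denotes the left (resp. right) child of r and $c_r$ is set to 0 when $|t| = 1$.

Proof. If A and B are two sets of sets, we define $A \otimes B = \{\, a \cup b : a \in A,\ b \in B \,\}$. The set of configurations at internal node r can be decomposed as

$$C(r) = \big[ C(r_\ell) \otimes C(r_r) \big] \cup \big[ \{\{r_\ell\}\} \otimes C(r_r) \big] \cup \big[ C(r_\ell) \otimes \{\{r_r\}\} \big] \cup \big\{ \{r_\ell, r_r\} \big\}, \tag{8}$$

where the set unions are disjoint because, as already noted, $\{r_\ell\} \notin C(r_\ell)$ and $\{r_r\} \notin C(r_r)$. We immediately obtain Equation (7), as $|A \otimes B| = |A|\,|B|$ when the sets in A are disjoint from the sets in B. ■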

We reiterate that for Equation (7) to apply for all t with Inline graphic , we must set to 0 the number of configurations at a species tree leaf and at the root of the 1-taxon tree. For the tree depicted in Figure 1A, each configuration in [Eq. (6)] can be obtained as described in Equation (8) from the configurations in and . Note indeed that , as determined by Equation (7).
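The recursion of Proposition 1 can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that a tree is encoded as nested 2-tuples with string labels at the leaves and that a gene lineage is identified with the subtree below it; neither convention comes from the text. The second function lists the root configurations by combining, for each child, either one of its configurations or its single fully coalesced lineage, so that its output size can be compared with the product formula.

```python
def root_configurations(tree) -> int:
    """Number of root configurations: (c_left + 1) * (c_right + 1), with 0 for a leaf."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return (root_configurations(left) + 1) * (root_configurations(right) + 1)


def enumerate_root_configurations(tree) -> set:
    """Root configurations as frozensets of lineages (each lineage named by its subtree)."""
    if not isinstance(tree, tuple):
        return set()
    left, right = tree
    # Each side contributes one of its own configurations or its single coalesced lineage.
    left_options = enumerate_root_configurations(left) | {frozenset([left])}
    right_options = enumerate_root_configurations(right) | {frozenset([right])}
    return {a | b for a in left_options for b in right_options}


example = ((("a", "b"), "c"), ("d", "e"))  # hypothetical 5-taxon tree
print(root_configurations(example), len(enumerate_root_configurations(example)))
```

For the hypothetical tree above, the left root subtree has 2 configurations and the right root subtree has 1, so the product formula gives (2 + 1)(1 + 1) = 6, which matches the number of sets produced by the enumeration.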

2.3.2. Total configurations and root configurations

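Let $N(t)$ be the set of nodes of a tree t. The number of nodes satisfies $|N(t)| = 2|t| - 1$. Writing $c_k$ for the number of configurations at node k, define the total number of configurations in t as the sum

$$c = \sum_{k \in N(t)} c_k.$$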

Let $c_r$ be the number of configurations at the root r of t, or root configurations for short. As is shown in Appendix 1, $c_r$ satisfies the bound

Furthermore, because $c_k \le c_r$ for each node k of t and t has $2|t| - 1$ nodes, we have

$$c_r \le c \le (2|t| - 1)\, c_r. \tag{10}$$

This result indicates that the total number of configurations c and the number of root configurations $c_r$ are equal up to a factor that is at most polynomial in the tree size $|t|$. A consequence is that in measuring c for a family of trees of increasing size n, an exponential growth of the form $k^n$ for the number of root configurations translates into the same exponential growth for the total number of configurations in t. Writing $\bowtie$ for equality of exponential growth (Section 2.2),

$$c_r \bowtie k^n \implies c \bowtie k^n, \tag{11}$$

where, by virtue of Equation (9), Inline graphic .

An equivalent result holds when we consider the expected value $E_n[c]$ of the total number of configurations in a random labeled tree topology of given size n. Indeed, when a tree of size n is selected at random from the set of labeled topologies, Equation (10) gives $E_n[c_r] \le E_n[c] \le (2n - 1)\, E_n[c_r]$. Thus, the exponential growth of $E_n[c]$ with respect to n can be recovered from the exponential growth of $E_n[c_r]$,

$$E_n[c] \bowtie E_n[c_r]. \tag{12}$$

Similarly, for the second moment $E_n[c^2]$, we have $E_n[c_r^2] \le E_n[c^2] \le (2n - 1)^2\, E_n[c_r^2]$, and thus

$$E_n[c^2] \bowtie E_n[c_r^2]. \tag{13}$$

Using these results, in Sections 3 and 5 we will determine the exponential growth of the number of root configurations $c_r$ and the total number of configurations c with respect to size when t is considered in different settings. In Section 3, t belongs to families of unbalanced or balanced trees, whereas in Section 5, we perform our analysis considering t as a random labeled topology of given size.
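The relation between root configurations and total configurations can be checked on small examples. The Python sketch below is a minimal illustration under assumed conventions: trees are nested 2-tuples with integer leaf labels, and the helper names are hypothetical. It returns the number of root configurations, the total summed over all nodes, and the number of leaves n, so that the sandwich $c_r \le c \le (2n - 1)\,c_r$ can be verified directly.

```python
from itertools import count


def caterpillar(n: int):
    """Caterpillar tree with n leaves, built by repeatedly attaching one new leaf."""
    tree = 0
    for label in range(1, n):
        tree = (tree, label)
    return tree


def balanced(h: int, labels=None):
    """Completely balanced tree of depth h (2^h leaves)."""
    labels = labels if labels is not None else count()
    if h == 0:
        return next(labels)
    left = balanced(h - 1, labels)
    right = balanced(h - 1, labels)
    return (left, right)


def root_total_size(tree):
    """Return (root configurations, total configurations over all nodes, number of leaves)."""
    if not isinstance(tree, tuple):
        return 0, 0, 1
    c_left, total_left, n_left = root_total_size(tree[0])
    c_right, total_right, n_right = root_total_size(tree[1])
    c_root = (c_left + 1) * (c_right + 1)
    return c_root, c_root + total_left + total_right, n_left + n_right


for tree in (caterpillar(5), balanced(3)):
    c_root, total, n = root_total_size(tree)
    print(n, c_root, total, c_root <= total <= (2 * n - 1) * c_root)
```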

2.4. Root configurations in small trees

For small values of n, Equation (7) enables the exhaustive computation of the number of root configurations $c_r$ for representative labelings of each of the unlabeled topologies of size n. In Figure 2, each dot corresponds to the logarithm of the number of root configurations for a certain tree shape of size n determined by its x-coordinate. The dots associated with the largest values of $\log c_r$ are connected by the top line, whose growth is linear in n. Indeed, as was shown by Wu (2012), there exist families of trees for which the growth of the number of root configurations is exponential in the tree size. From Equation (9), it follows that the growth of the sequence of the largest number of root configurations in trees of size n must be exponential in n as well.

FIG. 2. — Natural logarithm of the number of root configurations for all possible tree shapes of size . The value for , , is omitted. Dots corresponding to the largest and smallest numbers of root configurations for each n are connected by the top and bottom lines, respectively.

The tree shapes whose labeled topologies possess the largest number of root configurations among trees of fixed size appear in Figure 3 together with their number of root configurations $c_r$. Starting with $n = 2$, each shape in the sequence can be seen to be produced by connecting two smaller shapes also in the sequence (possibly the same shape) to a shared root.

FIG. 3. — Tree shapes of size whose labeled topologies have the largest number of root configurations among trees of size n. The number of root configurations is indicated for each tree. In each tree displayed, the two root subtrees each maximize the number of root configurations among trees of their size.

The tree shape that minimizes the number of root configurations is the caterpillar topology. The number of root configurations in the caterpillar of size n is $n - 1$ (Wu, 2012). The bottom line in Figure 2, which connects dots corresponding to the smallest number of root configurations for a tree with n taxa, grows with $\log(n - 1)$.

These observations show that tree topology can have a considerable impact on the number of ancestral configurations that are possible for a given tree size. Indeed, the next section investigates the effect of tree balance on the number of root configurations in a tree. Figure 2 suggests that for random labeled topologies of a specified size, we can expect the variance of the number of root configurations to be large. We will confirm this claim in Section 5. We will also show that although there exist tree families (e.g., caterpillars) for which the growth of the number of root configurations is polynomial in the tree size, the expected number of root configurations in a random labeled topology of given size n grows exponentially in n.

3. Root Configurations for Unbalanced and Balanced Families of Trees

In this section, we study the number of root configurations for particular families of trees, extending beyond two cases considered by Wu (2012): the caterpillar case, which was studied exactly, and the completely balanced case, for which a loose lower bound of Inline graphic was reported. As balance is an important tree property that influences ancestral configurations, we study unbalanced and balanced families generated by different seed trees. Upper and lower bound results on the number of root configurations for trees of specified size appear in Section 4.

For a given seed tree s, we consider the unbalanced family $\mathcal{U}(s) = (u_h)_{h \ge 0}$ (Fig. 4A) and the balanced family $\mathcal{B}(s) = (b_h)_{h \ge 0}$ (Fig. 4B) defined as follows:

$$u_0 = s, \qquad u_{h+1} = (u_h, s), \tag{14}$$

$$b_0 = s, \qquad b_{h+1} = (b_h, b_h), \tag{15}$$

where $(t_1, t_2)$ is the tree shape obtained by appending trees t₁ and t₂ to a shared root node. Note that the family of caterpillar trees is obtained as $\mathcal{U}(s)$ when $|s| = 1$. For the same seed tree of size 1, $\mathcal{B}(s)$ is the family of completely balanced trees. When $|s| = 2$, $\mathcal{U}(s)$ resembles the lodgepole family $(\lambda_h)_{h \ge 0}$, which is defined recursively by setting $\lambda_0$ as the 1-taxon tree, and $\lambda_{h+1} = (\lambda_h, s)$ with s the 2-taxon tree (Disanto and Rosenberg, 2015). The only difference is that in $\mathcal{U}(s)$, each leaf is in a cherry, whereas $\lambda_h$ has a unique leaf that is not in a cherry. For each family, it is understood that we consider an arbitrary labeling of each unlabeled shape in the family.

FIG. 4. — Unbalanced and balanced families of trees defined from a given seed tree s. **(A)** The unbalanced family is defined by $u_0 = s$, setting $u_{h+1}$ as the tree of size $|u_h| + |s|$ obtained by appending *u_h* and s to a shared root node. **(B)** The balanced family is defined by $b_0 = s$, setting $b_{h+1}$ as the tree of size $2|b_h|$ obtained by appending two copies of *b_h* to a shared root node.

3.1. Unbalanced families

Fix a seed tree s and consider the family $\mathcal{U}(s) = (u_h)_{h \ge 0}$ as defined in Equation (14). Let $\gamma$ be the number of root configurations in s, and define $e_h$ as the number of root configurations in u_h. If s is the 1-taxon tree, then as noted earlier, the number of root configurations $\gamma$ is set to 0. From Proposition 1, we obtain the recursion

$$e_{h+1} = (e_h + 1)(\gamma + 1), \tag{16}$$

starting with $e_0 = \gamma$. As shown in Appendix 2, the generating function

$$E(z) = \sum_{h \ge 0} e_h z^h$$

is described by

$$E(z) = \frac{\gamma + z}{(1 - z)\,[\,1 - (\gamma + 1) z\,]}.$$

For $\gamma \ge 1$, the dominant singularity of $E(z)$, that is, the singularity nearest to the origin, is the solution $z = 1/(\gamma + 1)$ of the equation $1 - (\gamma + 1) z = 0$. Applying Theorem IV.7 of Flajolet and Sedgewick (2009) yields the exponential growth of the sequence $e_h$ with respect to the index h as

$$e_h \bowtie (\gamma + 1)^h. \tag{18}$$

Because u_h has $(h + 1)|s|$ leaves, substituting $h = n/|s| - 1$ in Equation (18), we obtain the next proposition.

Proposition 2 In the unbalanced family $\mathcal{U}(s)$, the exponential growth of the number of root configurations $c_r$ in the size $n = |u_h|$ is

$$c_r \bowtie \left[ (\gamma + 1)^{1/|s|} \right]^n, \tag{19}$$

where $|s|$ is the size of the seed tree and $\gamma$ is its number of root configurations. The total number of configurations in the family has the same exponential growth.

In other words, for values of the number of leaves n at which a member of the unbalanced family exists, the number of root configurations in the unbalanced family grows with $(\gamma + 1)^{n/|s|}$.

When the seed tree is the 1-taxon tree, so that its number of root configurations is $\gamma = 0$ and $\mathcal{U}(s)$ is the sequence of caterpillar trees, Equation (19) gives the exponential growth $1^n$. Indeed, the number of root configurations in the caterpillar family grows like a polynomial function of the size, as immediately follows from Equation (16) [see also Wu (2012)]. Taking $|s| \ge 2$, the number of root configurations in $\mathcal{U}(s)$ becomes exponential in the tree size. Table 1 illustrates that for unbalanced families defined by small seed trees of size greater than one, root configurations in n-taxon trees, provided that a tree with n taxa is in the family, have exponential growth in the range from roughly $1.35^n$ to $1.5^n$.
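The growth in the unbalanced family can be illustrated numerically. The Python sketch below assumes the recursion implied by Proposition 1 for a tree formed from u_h and the seed, namely $e_{h+1} = (e_h + 1)(\gamma + 1)$ with $e_0 = \gamma$, where $\gamma$ is the number of root configurations of the seed and $e_h$ that of u_h; these symbol and function names are used here only for illustration. It compares the n-th root of the count, with $n = (h + 1)|s|$ leaves, against the constant $(\gamma + 1)^{1/|s|}$ listed in Table 1.

```python
def unbalanced_root_counts(gamma: int, h_max: int) -> list:
    """Root configurations e_0, ..., e_{h_max} of the unbalanced family for a seed with gamma."""
    counts = [gamma]
    for _ in range(h_max):
        counts.append((counts[-1] + 1) * (gamma + 1))
    return counts


def unbalanced_growth_constant(gamma: int, seed_size: int) -> float:
    """The constant (gamma + 1)^(1/|s|) whose n-th power describes the growth."""
    return (gamma + 1) ** (1.0 / seed_size)


gamma, seed_size, h = 1, 2, 200  # cherry seed: 2 taxa, 1 root configuration
counts = unbalanced_root_counts(gamma, h)
n = (h + 1) * seed_size
print(counts[:4], round(counts[h] ** (1.0 / n), 4),
      round(unbalanced_growth_constant(gamma, seed_size), 4))
```

With the 1-taxon seed ($\gamma = 0$) the same recursion returns 0, 1, 2, 3, ..., the linear growth of the caterpillar family.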

Table 1.

Approximate Values of the Constants That When Raised to the Power n Describe the Exponential Growth with the Number of taxa n of the Number of Ancestral Configurations in Unbalanced and Balanced Families For Small Seed Trees

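$|s|$	$\gamma$	$(\gamma+1)^{1/|s|}$ (unbalanced)	$k_\gamma^{1/|s|}$ (balanced)	$|s|$	$\gamma$	$(\gamma+1)^{1/|s|}$ (unbalanced)	$k_\gamma^{1/|s|}$ (balanced)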
1	0	1	1.503	5	6	1.476	1.479
2	1	1.414	1.503	6	5	1.348	1.351
3	2	1.442	1.469	6	6	1.383	1.385
4	3	1.414	1.425	6	7	1.414	1.416
4	4	1.495	1.503	6	8	1.442	1.444
5	4	1.380	1.385	6	10	1.491	1.492
5	5	1.431	1.435	6	9	1.468	1.469

Each constant is obtained to three decimal places by numerically evaluating Equation (20).

3.2. Balanced families

The results change when we consider balanced families. For a fixed seed tree s, consider the family $\mathcal{B}(s) = (b_h)_{h \ge 0}$ as defined in Equation (15). Let $\gamma$ be the number of root configurations in seed tree s, and define $c_h$ as the number of root configurations in b_h. If $|s| = 1$, then $\gamma$ is 0. From Proposition 1, we obtain

$$c_{h+1} = (c_h + 1)^2,$$

with $c_0 = \gamma$. Defining the sequence $x_h = c_h + 1$, with $x_0 = \gamma + 1$, it is straightforward to show that $x_{h+1} = x_h^2 + 1$.

Sequence x_h can be studied as in Aho and Sloane (1973, Section 3 and Example 2.2). For $\gamma \ge 0$, a constant $k_\gamma$ exists for which

$$x_h = \left\lfloor k_\gamma^{2^h} \right\rfloor,$$

where $\lfloor k \rfloor$ is the floor function for k. The constant $k_\gamma$ can be approximated using the recursive definition of x_h, summing terms in a series:

$$k_\gamma = (\gamma + 1) \exp\left[ \sum_{i=0}^{\infty} 2^{-i-1} \log\left( 1 + \frac{1}{x_i^2} \right) \right]. \tag{20}$$

Switching back to $c_h$, for $h \ge 0$, we obtain

$$c_h = \left\lfloor k_\gamma^{2^h} \right\rfloor - 1. \tag{21}$$

Thus, because $c_h$ grows with $k_\gamma^{2^h}$, to determine the exponential growth of the number of root configurations, it remains to evaluate the constant $k_\gamma$. Rescaling Equation (21) to consider the number of leaves $n = 2^h |s|$ as a parameter, we obtain the next proposition.

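Proposition 3 In the balanced family $\mathcal{B}(s)$, the exponential growth of the number of root configurations $c_r$ in the size $n = |b_h|$ is

$$c_r \bowtie \left[ k_\gamma^{1/|s|} \right]^n, \tag{22}$$

where $|s|$ is the size of the seed tree and $\gamma$ is its number of root configurations. The constant $k_\gamma$ can be computed as in Equation (20) and bounded by

$$\gamma + 1 < k_\gamma < \gamma + 1 + \frac{1}{\gamma + 1}. \tag{23}$$

The total number of configurations in the family $\mathcal{B}(s)$ has the same exponential growth.

In other words, for values of the number of leaves n at which a member of the balanced family exists, the number of root configurations in the balanced family grows with $k_\gamma^{n/|s|}$.

Proof. It remains only to prove the bound [Eq. (23)]. The lower bound follows quickly from Equation (20), as the exponent is positive. The upper bound is obtained by observing that the sequence $x_i$ is increasing, and thus $x_i > x_0 = \gamma + 1$ for each $i \ge 1$. Therefore, from Equation (20) and the fact that $\sum_{i=0}^{\infty} 2^{-i-1} = 1$, we have

$$k_\gamma < (\gamma + 1) \exp\left[ \log\left( 1 + \frac{1}{(\gamma + 1)^2} \right) \sum_{i=0}^{\infty} 2^{-i-1} \right] = \gamma + 1 + \frac{1}{\gamma + 1}.$$

■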

Comparing the number of root configurations in balanced families with those in unbalanced families (Table 1), we see that the exponential order for balanced families is greater than in unbalanced families, although typically still in the range from roughly $1.35^n$ to $1.5^n$.
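The balanced column of Table 1 can be approximated with a short computation. The Python sketch below assumes the Aho and Sloane style series for the quadratic recurrence $x_{i+1} = x_i^2 + 1$ with $x_0 = \gamma + 1$, namely $k_\gamma = (\gamma + 1)\exp[\sum_{i \ge 0} 2^{-i-1}\log(1 + 1/x_i^2)]$, truncated after a few terms; the function name and the truncation length are choices made for this illustration. Because $x_i$ grows doubly exponentially, a handful of terms already gives many correct digits.

```python
from math import exp, log, log1p


def balanced_constant(gamma: int, terms: int = 8) -> float:
    """Approximate k_gamma by truncating the series after `terms` summands."""
    x = gamma + 1
    log_k = log(x)
    for i in range(terms):
        log_k += log1p(1.0 / (x * x)) / 2 ** (i + 1)
        x = x * x + 1  # next term of the quadratic recurrence
    return exp(log_k)


# Rows of Table 1 as (seed size, gamma): compare the unbalanced and balanced constants.
for seed_size, gamma in ((1, 0), (2, 1), (3, 2), (4, 3)):
    unbalanced = (gamma + 1) ** (1.0 / seed_size)
    balanced = balanced_constant(gamma) ** (1.0 / seed_size)
    print(seed_size, gamma, round(unbalanced, 3), round(balanced, 3))
```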

3.3. Comparing unbalanced and balanced families

For a given seed tree s with $\gamma$ root configurations, the quantities $(\gamma + 1)^{1/|s|}$ and $k_\gamma^{1/|s|}$ determine the exponential orders of the sequences considered in Propositions 2 and 3, respectively. We observe three facts.

(i) Applying the lower bound in Equation (23), $k_\gamma > \gamma + 1$, for a fixed seed tree s, we always have

$$k_\gamma^{1/|s|} > (\gamma + 1)^{1/|s|}. \tag{24}$$

Therefore, the growth of the number of ancestral configurations in the family $\mathcal{B}(s)$ is exponentially faster than the growth in the family $\mathcal{U}(s)$. When s is not small, however, $k_\gamma^{1/|s|}$ can become close to $(\gamma + 1)^{1/|s|}$. For large s, $\gamma$ is also large. Owing to the upper bound in Equation (23), although $k_\gamma > \gamma + 1$, $k_\gamma$ only slightly exceeds $\gamma + 1$. Furthermore, the exponent $1/|s|$ in the expressions for $(\gamma + 1)^{1/|s|}$ and $k_\gamma^{1/|s|}$ further reduces the difference between them.

For instance, if s is the caterpillar tree with 10 leaves, we have $|s| = 10$, $\gamma = 9$, and $(\gamma + 1)^{1/|s|} = 10^{1/10} \approx 1.2589$. In this case, $k_\gamma^{1/|s|}$ is bounded above by a constant near $(10.1)^{1/10} \approx 1.2602$. The increasing similarity of $(\gamma + 1)^{1/|s|}$ and $k_\gamma^{1/|s|}$ is already evident in Table 1, as their values for 6-taxon seed trees are substantially closer to each other than for the smaller 1-, 2-, and 3-taxon seed trees.

(ii) The choice of the seed tree can play an important role in the relative values of $(\gamma + 1)^{1/|s|}$ and $k_\gamma^{1/|s|}$, as taking two different seed trees can flip the inequality in Equation (24). In fact, if s₁ and s₂ are two seed trees of the same size $|s_1| = |s_2| = |s|$ whose numbers of root configurations satisfy $\gamma_1 > \gamma_2$, then

$$(\gamma_1 + 1)^{1/|s|} > k_{\gamma_2}^{1/|s|}. \tag{25}$$

To obtain this result, we note that $\gamma_1 + 1 \ge \gamma_2 + 2 > k_{\gamma_2}$, where the latter inequality follows from the upper bound [Eq. (23)]. The result is observable in Table 1, where at fixed $|s|$ of 4, 5, or 6, $(\gamma + 1)^{1/|s|}$ for some of the shapes exceeds $k_\gamma^{1/|s|}$ for other shapes.

(iii) When the seed tree s is chosen as the 1-taxon tree with $\gamma = 0$, the constant $k_0$ determines an upper bound for the number of root configurations that a tree of given size can have. This result is shown in more detail in the following section. The value of k₀ can be computed numerically from Equation (20):

$$k_0 \approx 1.5028. \tag{26}$$

This constant provides the exact value for which Inline graphic , reported by Wu (2012), provided a lower bound.

4. Smallest and Largest Numbers of Root Configurations for Trees of Fixed Size

We have seen that the number of root configurations for caterpillar trees grows polynomially and that the number of root configurations in unbalanced noncaterpillar families and balanced families grows exponentially. In the examples we have considered, the exponential growth proceeds with roughly $1.35^n$ to $1.5^n$. We now show that the caterpillar trees have the smallest number of root configurations and that the constant k₀ [Eq. (26)], in fact, provides an upper bound on the exponential growth of the number of root configurations as n increases. We characterize the labeled topologies that possess the largest number of root configurations at fixed n.

4.1. Smallest number of root configurations

For the caterpillar tree of size n, the number of root configurations is $n - 1$. We show that this value, $m_n = n - 1$, is the smallest number of root configurations for a tree of size n.

Let $c_r(t)$ denote the number of root configurations of tree t. Let $m_n = \min_{t \in T_n} c_r(t)$. Suppose we have shown for each i with $1 \le i < n$ that

$$m_i = i - 1. \tag{27}$$

The claim clearly holds for $i = 1, 2$, for each of which the sole tree t has $i - 1$ root configurations.

For $n \ge 3$, we use induction to prove Equation (27) for $i = n$. Suppose $t^*$ is a tree of size n such that $c_r(t^*) = m_n$. The number of root configurations of $t^*$ is given by Proposition 1 as the product $[c_r(t^*_\ell) + 1][c_r(t^*_r) + 1]$, where $t^*_\ell$ and $t^*_r$ are the root subtrees of $t^*$. Because $t^*$ has the minimal number of root configurations, $t^*_\ell$ and $t^*_r$ must separately possess the minimal number of root configurations among trees of their size. We can then write $c_r(t^*_\ell) = m_i$ and $c_r(t^*_r) = m_{n-i}$, where, without loss of generality, i is a certain value with $1 \le i \le \lfloor n/2 \rfloor$. Therefore, $m_n$ has the form $(m_i + 1)(m_{n-i} + 1)$. It is determined from the minimum

$$m_n = \min_{1 \le i \le \lfloor n/2 \rfloor} (m_i + 1)(m_{n-i} + 1). \tag{28}$$

Applying the inductive hypothesis [Eq. (27)], we obtain $m_n = \min_{1 \le i \le \lfloor n/2 \rfloor} i\,(n - i)$. In the permissible range for i, the product $i\,(n - i)$ reaches its minimum value at $i = 1$, equaling $n - 1$ as desired.

By induction, we have shown that Equation (27) holds for each $i \ge 1$. Furthermore, the fact that the product in Equation (28) is minimal only at $i = 1$ also demonstrates that those tree shapes of size n with the smallest number of root configurations can be recursively obtained by appending the 1-taxon tree and the tree shape of size $n - 1$ with the smallest number of root configurations to a shared root node. Trees resulting from this recursive construction are exactly those having a caterpillar shape.
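The induction can be mirrored by a small dynamic program. The Python sketch below assumes the minimization over root splits described above, $m_n = \min_i (m_i + 1)(m_{n-i} + 1)$ for $1 \le i \le \lfloor n/2 \rfloor$ with $m_1 = 0$; the function name is hypothetical. It lets one confirm numerically that the minimum equals $n - 1$ for a range of sizes.

```python
def min_root_configurations(n_max: int) -> list:
    """m[n] = smallest number of root configurations over trees with n leaves (m[0] unused)."""
    m = [0, 0]  # index 0 is a placeholder; a 1-taxon tree has 0 configurations
    for n in range(2, n_max + 1):
        m.append(min((m[i] + 1) * (m[n - i] + 1) for i in range(1, n // 2 + 1)))
    return m


m = min_root_configurations(12)
print(m[1:])  # expected to list n - 1 for n = 1, ..., 12
```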

4.2. Largest number of root configurations

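For the largest number of root configurations, we denote $M_n = \max_{t \in T_n} c_r(t)$, where $c_r(t)$ is the number of root configurations of t. Similarly to Equation (28), we seek to identify the trees t that produce the maximum in the following equation and to evaluate that maximum:

$$M_n = \max_{1 \le i \le \lfloor n/2 \rfloor} (M_i + 1)(M_{n-i} + 1). \tag{29}$$

Note that $M_1 = 0$. Taking $a_n = M_n + 1$, we have the recursion

$$a_n = 1 + \max_{1 \le i \le \lfloor n/2 \rfloor} a_i\, a_{n-i},$$

starting with $a_1 = 1$. The sequence $a_n$ was studied by de Mier and Noy (2012, Theorems 1 and 2), where it was shown (i) taking d as the power of 2 nearest to $n/2$, we have $a_n = 1 + a_d\, a_{n-d}$, so that

$$M_n = (M_d + 1)(M_{n-d} + 1);$$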

(ii) for all Inline graphic , , that is,

where the constant k₀ has been already computed in Equation (26).

For small n, the labeled topologies with the largest numbers of root configurations appear in Figure 3. Collecting the results for the smallest and largest number of root configurations, we can state the following facts.

Proposition 4 (i) For each $n \ge 1$, the smallest number of root configurations in a tree of size n is $m_n = n - 1$. The caterpillar tree shape of size n has exactly $n - 1$ root configurations. (ii) For each $n \ge 1$, the largest number of root configurations in a tree of size n, $M_n$, can be bounded as in Equation (30). For $n \ge 2$, if d denotes the power of 2 nearest to $n/2$, then $M_n$ is the number of root configurations in the tree shape t_n recursively defined by taking $t_1$ as the 1-taxon tree and $t_n = (t_d, t_{n-d})$. When $n = 2^h$ for integers h, t_n is the completely balanced tree of depth h and $M_n = \lfloor k_0^n \rfloor - 1$ [Eq. (21)].

As a corollary, we obtain the following result, the proof of which appears in Appendix 3.

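Corollary 1 The exponential growth of the sequences $m_n$ and $M_n$ of the smallest and largest numbers of root configurations follows $m_n \bowtie 1^n$ and $M_n \bowtie k_0^n$. The sequences $\min_{t \in T_n} c_t$ and $\max_{t \in T_n} c_t$, giving, respectively, the smallest and the largest total number of configurations c_t in a tree t of size n, have exponential growth $1^n$ and $k_0^n$.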
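The statements about the largest number of root configurations can be explored numerically. The Python sketch below assumes the maximization over root splits, $M_n = \max_i (M_i + 1)(M_{n-i} + 1)$ with $M_1 = 0$, and uses hypothetical helper names. It also lists the powers of 2 nearest to $n/2$ (two values when there is a tie), so that one can check that splitting at such a value attains the maximum, and it prints the n-th root of $M_n$ at $n = 64$ for comparison with the constant near 1.5028.

```python
def max_root_configurations(n_max: int) -> list:
    """M[n] = largest number of root configurations over trees with n leaves (M[0] unused)."""
    big_m = [0, 0]  # index 0 is a placeholder; a 1-taxon tree has 0 configurations
    for n in range(2, n_max + 1):
        big_m.append(max((big_m[i] + 1) * (big_m[n - i] + 1) for i in range(1, n // 2 + 1)))
    return big_m


def nearest_powers_of_two(n: int) -> list:
    """Powers of 2 below n that are nearest to n/2 (requires n >= 2)."""
    powers = []
    d = 1
    while d < n:
        powers.append(d)
        d *= 2
    best = min(abs(2 * d - n) for d in powers)
    return [d for d in powers if abs(2 * d - n) == best]


M = max_root_configurations(64)
print(M[1:9])
print(round(M[64] ** (1.0 / 64), 5))
```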

The family of tree shapes $(t_n)_{n \ge 1}$ defined in Proposition 4 by the recursive decomposition that takes $t_1$ as the 1-taxon tree and $t_n = (t_d, t_{n-d})$, where d is the power of 2 nearest to $n/2$, already has a place in the study of gene trees and species trees, as it provides the maximally probable tree shapes of Degnan and Rosenberg (2006). Given a labeled topology t of size n, a labeled history of t is a linear ordering of the $n - 1$ internal nodes of t such that the order of the nodes in each path going from the root of t to a leaf of t is increasing (Fig. 5). As reported by Harding (1974) and proved by Hammersley and Grimmett (1974), each labeled topology with t_n as its underlying unlabeled topology possesses the maximal number of labeled histories among labeled topologies of size n. Consider the Yule model for the probability distribution of tree shapes, in which pairs of lineages in a labeled set of n lineages are joined together, at each step choosing uniformly among pairs (Yule, 1925; Harding, 1971; Brown, 1994; McKenzie and Steel, 2000; Steel and McKenzie, 2001; Rosenberg, 2006; Disanto et al., 2013; Disanto and Wiehe, 2013). Among all labeled topologies with size n, those with the largest number of labeled histories, and hence with shape t_n, have the highest probability under the model.

FIG. 5. — The three labeled histories of the labeled topology of size . Each labeled history can be represented by bijectively labeling the internal nodes of t with the integers in in such a way that each path from the root of t to a leaf of t is labeled by an increasing sequence.

For Inline graphic , the maximally probable labeled topologies of size n—those with the most labeled histories—can be recursively characterized as those labeled topologies whose two root subtrees are maximally probable labeled topologies of sizes and , where (Hammersley and Grimmett, 1974; Harding, 1974). This characterization matches our characterization that the unlabeled shapes with the largest number of root configurations are those for which the subtrees have the most root configurations and sizes d and Inline graphic , where is the nearest power of 2 to .

To see that the characterizations are identical so that Inline graphic , note that a specific is the nearest power of 2 to precisely for integers . On the endpoints of the interval, there are two choices for d, but in both cases, one choice is . At the same time, the integers n for which are precisely those in . Thus, for all integers n in . On the lower boundary, for Inline graphic , and . Dividing the integers in into a union of intervals , we see that on each interval and hence for all .

This result shows that for a given tree size, those labeled topologies whose shapes belong to the family $(t_n)_{n \ge 1}$ maximize both the number of root configurations and the number of labeled histories. For these labeled topologies, in Figure 6, we plot the logarithm of the maximum number of labeled histories possible for a labeled topology of size n as a function of the logarithm of the maximum number of root configurations. Although the shapes are the same, the number of labeled histories is considerably larger than the number of root configurations. The growth is approximately linear, suggesting that the maximal number of labeled histories increases approximately exponentially in the maximal number of root configurations.
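For small sizes, the coincidence of the two maximizing shapes can be checked by brute force. The Python sketch below is an illustration under assumed conventions: an unlabeled shape is encoded as the empty tuple for a leaf or a sorted pair of shapes for an internal node, and all function names are hypothetical. The number of labeled histories of a shape with root subtrees a and b is computed as the number of ways to interleave the orderings of their internal nodes, a binomial coefficient times the two subtree counts.

```python
from functools import lru_cache
from math import comb

LEAF = ()  # assumed encoding: a leaf is the empty tuple


@lru_cache(maxsize=None)
def shapes(n: int) -> frozenset:
    """All unlabeled rooted binary tree shapes with n leaves, in canonical form."""
    if n == 1:
        return frozenset([LEAF])
    found = set()
    for i in range(1, n // 2 + 1):
        for a in shapes(i):
            for b in shapes(n - i):
                found.add(tuple(sorted((a, b))))
    return frozenset(found)


@lru_cache(maxsize=None)
def leaves(shape) -> int:
    return 1 if shape == LEAF else leaves(shape[0]) + leaves(shape[1])


@lru_cache(maxsize=None)
def root_configurations(shape) -> int:
    if shape == LEAF:
        return 0
    return (root_configurations(shape[0]) + 1) * (root_configurations(shape[1]) + 1)


@lru_cache(maxsize=None)
def labeled_histories(shape) -> int:
    """Orderings of internal nodes that increase along every root-to-leaf path."""
    if shape == LEAF:
        return 1
    a, b = shape
    internal_a, internal_b = leaves(a) - 1, leaves(b) - 1
    return comb(internal_a + internal_b, internal_a) * labeled_histories(a) * labeled_histories(b)


def maximizers(n: int, score) -> set:
    best = max(score(s) for s in shapes(n))
    return {s for s in shapes(n) if score(s) == best}


for n in range(2, 10):
    same = maximizers(n, root_configurations) == maximizers(n, labeled_histories)
    print(n, len(shapes(n)), same)
```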

FIG. 6. Natural logarithm of the maximum number of histories possible for a labeled topology of size n as a function of the natural logarithm of the maximum number of root configurations possessed by a labeled topology of the same size. The maxima occur at the same set of labeled topologies.

5. The Number of Root Configurations in a Random Labeled Topology

We now study through generating functions the number of root configurations when trees of a given size are randomly selected under a uniform distribution on the set of labeled topologies. In Section 5.1, we show that the expectation $\mathbb{E}_n[c_r]$ of the number of root configurations in a random labeled topology of size n has exponential growth $(4/3)^n$. In Section 5.2, we show that the variance of the number of root configurations has exponential growth $[4/(7(8\sqrt{2}-11))]^n \approx 1.8215^n$. The same results hold for the random total number of configurations.

5.1. Mean number of root configurations

Define the exponential generating function

$$F(z) = \sum_{t \in T} \frac{c_r(t)\, z^{|t|}}{|t|!}, \tag{31}$$

where $c_r(t)$ is the number of root configurations in tree t. As shown in Appendix 4, the function F satisfies

$$F(z) = \frac{[F(z) + T(z)]^2}{2}, \tag{32}$$

where $T(z) = 1 - \sqrt{1 - 2z}$ is the exponential generating function in Equation (3). Solving Equation (32), we obtain a closed form for $F(z)$,

$$F(z) = \sqrt{1 - 2z} - \sqrt{2\sqrt{1 - 2z} - 1}. \tag{33}$$

We have taken the negative root of the quadratic equation, as it is the root that produces the correct value of $F(z)$ at $z = 0$. It can be seen that $F(0) = 0$ is required by noting that the first term in Equation (31) is the z¹ term, as the set T contains only trees of size at least 1, so that Equation (31) has no constant term.

The value of z that cancels the second square root in Equation (33) is $z = 3/8$, which is smaller than the value that cancels the first square root, $z = 1/2$. In the complex plane, both $3/8$ and $1/2$ are singularities of $F(z)$. The dominant singularity is $z = 3/8$, as it is nearer to the origin. To highlight the type of singularity that $F(z)$ has at the point $z = 3/8$, it is convenient to factor the second square root in Equation (33), writing $F(z)$ as

$$F(z) = \sqrt{1 - 2z} - \sqrt{1 - \frac{8z}{3}}\; h(z),$$

where

$$h(z) = \sqrt{\frac{2\sqrt{1 - 2z} - 1}{1 - 8z/3}}$$

is an analytic function in the circle $|z| < 1/2$, except at a removable singularity $z = 3/8$. Thus, we see that at $z = 3/8$, the generating function $F(z)$ has a singularity of the square root type.

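We can then apply Theorems VI.1 and VI.4 of Flajolet and Sedgewick (2009) to recover the asymptotic behavior of the nth coefficient of $F(z)$, denoted $[z^n]F(z)$, as the nth coefficient of the expansion of $F(z)$ at the singularity $z = 3/8$. This expansion is given by

$$F(z) = \frac{1}{2} - \sqrt{\frac{3}{2}}\,\sqrt{1 - \frac{8z}{3}} + O\!\left(1 - \frac{8z}{3}\right).$$

We thus have

$$[z^n]F(z) \sim \sqrt{\frac{3}{2}} \left(\frac{8}{3}\right)^n \frac{1}{2\sqrt{\pi n^3}},$$

where we have used the asymptotic relationship $[z^n]\sqrt{1 - z} \sim -1/(2\sqrt{\pi n^3})$ (Flajolet and Sedgewick, 2009). Dividing $n!\,[z^n]F(z)$ by the number of trees of size n, $|T_n| = (2n-2)!/[2^{n-1}(n-1)!]$, as given in Equation (1), using Stirling's formula $n! \sim \sqrt{2\pi n}\,(n/e)^n$, and noting the definition of $\mathbb{E}_n[c_r]$ as a mean over all labeled topologies, we obtain the asymptotic expected number of root configurations in a random labeled topology of size n:

$$\mathbb{E}_n[c_r] = \frac{n!\,[z^n]F(z)}{|T_n|} \sim \sqrt{\frac{3}{2}} \left(\frac{4}{3}\right)^n. \tag{36}$$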

We summarize these results in a proposition.

Proposition 5 The mean number of root configurations in a random labeled topology of size n among the $|T_n|$ labeled tree topologies is asymptotically

$$\mathbb{E}_n[c_r] \sim \sqrt{\frac{3}{2}} \left(\frac{4}{3}\right)^n.$$

The mean total number of configurations has exponential growth $(4/3)^n$.

In Figure 7A, we can see that the approach of the natural logarithm of the exact mean number of root configurations (computed by evaluating the expansion of the generating function $F(z)$) to the asymptotic value proceeds quickly, so that even with small values of n, the exact mean and the asymptote are quite close on a logarithmic scale.
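The comparison between the exact mean and its asymptote can be reproduced without symbolic series expansion. The sketch below assumes the functional equation F(z) = [F(z) + T(z)]²/2 for the exponential generating function of root configurations, which corresponds to the recursion c_r(t) = (c_r(t₁) + 1)(c_r(t₂) + 1) with c_r = 0 for the 1-taxon tree. Extracting coefficients gives an integer recurrence for f_n, the sum of c_r(t) over all labeled topologies of size n. The names `totals`, `t`, and `f` are illustrative.

```python
from math import comb, sqrt


def totals(n_max):
    """Return lists t, f with t[n] = (2n-3)!! labeled topologies of size n and
    f[n] = sum of root configurations over those topologies (index 0 unused)."""
    t = [0, 1]
    f = [0, 0]
    for n in range(2, n_max + 1):
        t.append(t[-1] * (2 * n - 3))
        # a[i] is n! times the coefficient of z**i in F(z) + T(z), for sizes below n.
        a = [f[i] + t[i] for i in range(n)]
        # Coefficient extraction from F = (F + T)**2 / 2; the sum is always even.
        f.append(sum(comb(n, i) * a[i] * a[n - i] for i in range(1, n)) // 2)
    return t, f


def exact_mean(n, t, f):
    return f[n] / t[n]


def asymptotic_mean(n):
    return sqrt(1.5) * (4 / 3) ** n


if __name__ == "__main__":
    t, f = totals(100)
    for n in (5, 10, 20, 50, 100):
        print(n, exact_mean(n, t, f), asymptotic_mean(n))
```

Exact integer arithmetic is used for the totals so that only the final division introduces floating-point rounding.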

FIG. 7. Mean and variance of the number of root configurations in random labeled topologies of fixed size. **(A)** Exact natural logarithm of the mean, computed from the power series expansion of $F(z)$ [Eq. (33)], and its asymptotic approximation from Proposition 5. **(B)** Exact natural logarithm of the variance, computed from the power series expansion of $G(z)$ [Eq. (39)], and its asymptotic approximation from Proposition 6.

5.2. Variance of the number of root configurations

By applying the same approach used to determine the mean value of the number of root configurations across labeled topologies, in this section, we study the expectation $\mathbb{E}_n[c_r^2]$ and then derive the asymptotic variance of the number of root configurations.

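Define the generating function

$$G(z) = \sum_{t \in T} \frac{c_r(t)^2\, z^{|t|}}{|t|!}. \tag{37}$$

As shown in Appendix 5, the function $G(z)$ satisfies

$$G(z) = \frac{[G(z) + 2F(z) + T(z)]^2}{2}. \tag{38}$$

This equation relates $G(z)$ to the generating functions $F(z)$ and $T(z)$ appearing in Equations (33) and (3). Solving for $G(z)$, we obtain the function

$$G(z) = -\sqrt{4\sqrt{2\sqrt{1-2z}-1} - 2\sqrt{1-2z} - 1} + 2\sqrt{2\sqrt{1-2z}-1} - \sqrt{1-2z}, \tag{39}$$

which has its dominant singularity at $z = \beta = 7(8\sqrt{2}-11)/8 \approx 0.2745$. In the same way as in the derivation of $F(z)$, we have taken the negative root of the quadratic Equation (38) as it is this root that produces the correct value of $G(z)$ at $z = 0$, namely $G(0) = 0$. At the dominant singularity for z, the first square root in Equation (39) cancels. Factoring this square root, the function $G(z)$ can be written as

$$G(z) = -\sqrt{1 - \frac{z}{\beta}}\; k(z) + 2\sqrt{2\sqrt{1-2z}-1} - \sqrt{1-2z},$$

where

$$k(z) = \sqrt{\frac{4\sqrt{2\sqrt{1-2z}-1} - 2\sqrt{1-2z} - 1}{1 - z/\beta}}.$$

The function $k(z)$ is analytic in the circle $|z| < 3/8$, except at the removable singularity $z = \beta$. By Theorems VI.1 and VI.4 of Flajolet and Sedgewick (2009), we can recover the asymptotic behavior of the nth coefficient as

$$[z^n]G(z) \sim \frac{k(\beta)}{2\sqrt{\pi n^3}}\, \beta^{-n}, \qquad k(\beta) = \sqrt{\frac{7(11-\sqrt{2})}{34}} \approx 1.4048.$$

Dividing $n!\,[z^n]G(z)$ by $|T_n|$ and using Stirling's approximation, we get

$$\mathbb{E}_n[c_r^2] = \frac{n!\,[z^n]G(z)}{|T_n|} \sim \sqrt{\frac{7(11-\sqrt{2})}{34}} \left[\frac{4}{7(8\sqrt{2}-11)}\right]^n.$$

To obtain an asymptotic estimate for the variance, we use Equation (36) to note that the exponential growth of $\mathbb{E}_n[c_r]^2$ is $(16/9)^n$. Because $16/9 < 4/[7(8\sqrt{2}-11)] \approx 1.8215$, we have that as $n \to \infty$,

$$\frac{\mathrm{Var}_n[c_r]}{\mathbb{E}_n[c_r^2]} = 1 - \frac{\mathbb{E}_n[c_r]^2}{\mathbb{E}_n[c_r^2]} \to 1,$$

and thus, the variance asymptotically satisfies $\mathrm{Var}_n[c_r] \sim \mathbb{E}_n[c_r^2]$.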

Furthermore, by the relationships shown in Equations (12) and (13), Equation (43) also holds when we replace $c_r$ by c. Thus, the variance of the total number of configurations in a random labeled topology of size n has the same exponential growth, approximately $1.8215^n$.

We summarize these results in a proposition.

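Proposition 6 The variance of the number of root configurations in a random labeled topology of size n among the $|T_n|$ labeled tree topologies is asymptotically

$$\mathrm{Var}_n[c_r] \sim \sqrt{\frac{7(11-\sqrt{2})}{34}} \left[\frac{4}{7(8\sqrt{2}-11)}\right]^n,$$

where $\sqrt{7(11-\sqrt{2})/34} \approx 1.4048$ and $4/[7(8\sqrt{2}-11)] \approx 1.8215$. The variance of the total number of configurations has exponential growth $[4/(7(8\sqrt{2}-11))]^n$.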

Figure 7B demonstrates that on a logarithmic scale, the approach of the exact variance of the number of root configurations (computed from the power series expansions of the generating functions) to the asymptotic value occurs rapidly in n, although slower than was seen for the mean in Figure 7A.
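The exact variance can be computed with the same kind of integer recurrence used for the mean. This sketch assumes the functional equations F = (F + T)²/2 for the first moment and G = (G + 2F + T)²/2 for the second moment, which follow if c_r(t) = (c_r(t₁) + 1)(c_r(t₂) + 1) with c_r = 0 for the 1-taxon tree. The names `moment_totals` and `variance` are illustrative.

```python
from fractions import Fraction
from math import comb, sqrt


def moment_totals(n_max):
    """Return lists t, f, g: number of labeled topologies of size n, and the sums
    of c_r and c_r**2 over those topologies (index 0 unused)."""
    t, f, g = [0, 1], [0, 0], [0, 0]
    for n in range(2, n_max + 1):
        t.append(t[-1] * (2 * n - 3))
        a = [f[i] + t[i] for i in range(n)]             # coefficients of F + T
        b = [g[i] + 2 * f[i] + t[i] for i in range(n)]  # coefficients of G + 2F + T
        f.append(sum(comb(n, i) * a[i] * a[n - i] for i in range(1, n)) // 2)
        g.append(sum(comb(n, i) * b[i] * b[n - i] for i in range(1, n)) // 2)
    return t, f, g


def variance(n, t, f, g):
    """Exact variance of c_r under the uniform distribution on size-n topologies."""
    return Fraction(g[n], t[n]) - Fraction(f[n], t[n]) ** 2


if __name__ == "__main__":
    t, f, g = moment_totals(201)
    growth = 4 / (7 * (8 * sqrt(2) - 11))  # approximately 1.8215
    for n in (10, 50, 100, 200):
        ratio = float(variance(n + 1, t, f, g) / variance(n, t, f, g))
        print(n, ratio, growth)
```

The ratio of successive variances approaches the growth constant more slowly than the mean approaches its asymptote, because the squared mean grows like (16/9)^n, which is only slightly below 1.8215^n.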

6. Conclusions

Under the assumption that the labeled gene tree topology matches the species tree topology, we have studied the number of ancestral configurations in a given phylogenetic tree t. In particular, we have focused on the exponential growth of the number of root configurations in t, a quantity that also describes the exponential growth of the total number of configurations in t.

In Section 3, extending results of Wu (2012), in which the enumeration of ancestral configurations for caterpillar trees and a lower bound for their number in completely balanced trees were determined, we considered special families of trees generated by arbitrary seed trees s, namely the unbalanced family and the balanced family generated by s (Fig. 4). The main results describing the influence of tree balance and the seed tree topology on the number of ancestral configurations are collected in Proposition 2 and Proposition 3 for the unbalanced and balanced cases. We have shown that for each fixed seed tree s, the number of ancestral configurations in the balanced family grows exponentially faster than in the unbalanced family. When the size of the seed tree s is large, however, the difference between the exponential orders of the two integer sequences can become small. We have also observed that the choice of the seed tree can have an important influence on the number of root configurations. In fact, the number of root configurations in the family generated by a seed tree s₁ can grow exponentially faster than in the family generated by a seed tree s₂ when the number of root configurations in s₁ exceeds that of s₂.

When the seed tree s consists of a single taxon ($|s| = 1$), the unbalanced family reduces to the caterpillar family, and the balanced family gives the family of completely balanced trees. As shown in Proposition 4, among trees of size n, the caterpillar tree with n taxa possesses the smallest number of root configurations. When n is a power of 2, the completely balanced tree of size n has the largest number; more generally, the largest number of root configurations occurs at precisely those labeled topologies that for a fixed n generate the largest number of labeled histories. As the caterpillar labeled topologies give rise to the smallest number of labeled histories at fixed n (only one), both the largest and smallest numbers of root configurations occur at trees producing the extrema in the number of labeled histories. The growth of the number of root configurations in the caterpillar family is polynomial, whereas for the completely balanced trees, it is exponential with order $k_0 \approx 1.5028$.
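The contrast between the two extreme families can be illustrated numerically. The sketch below assumes the recursion c_r(t) = (c_r(t₁) + 1)(c_r(t₂) + 1) with c_r = 0 for the 1-taxon tree. For the caterpillar it gives linear growth, and for the completely balanced tree with 2^h taxa it gives the quadratic recurrence x → x² + 1 for c_r + 1, whose growth constant can be estimated as (c_r + 1)^(1/2^h). Function names are illustrative.

```python
from math import exp, log


def caterpillar_root_configs(n):
    """Root configurations of the caterpillar with n taxa under the assumed recursion."""
    c = 0
    for _ in range(n - 1):
        # Attach one more leaf (0 root configurations) next to the current subtree.
        c = (c + 1) * (0 + 1)
    return c


def balanced_root_configs(h):
    """Root configurations of the completely balanced tree with 2**h taxa."""
    c = 0
    for _ in range(h):
        # Both root subtrees are copies of the previous tree.
        c = (c + 1) ** 2
    return c


def growth_constant_estimate(h):
    """Estimate k_0 as (c_r + 1)**(1 / 2**h), using logarithms of exact integers."""
    return exp(log(balanced_root_configs(h) + 1) / 2 ** h)


if __name__ == "__main__":
    print([caterpillar_root_configs(n) for n in range(1, 9)])
    print([balanced_root_configs(h) for h in range(0, 5)])
    print(growth_constant_estimate(10))
```

The estimate stabilizes after only a few doublings because the correction term in the quadratic recurrence shrinks doubly exponentially.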

Assuming a uniform distribution over the labeled topologies with a given size n, in Section 5 we studied the mean and the variance of the number of ancestral configurations in a random labeled topology of size n. By using a generating function approach, in Propositions 5 and 6, we have shown that the mean number of ancestral configurations has exponential growth $(4/3)^n$, whereas the variance has exponential growth approximately $1.8215^n$.

Our results can assist in relating the complexity of algorithms for computing gene tree probabilities based on ancestral configurations—STELLS (Wu, 2012)—to those that use an evaluation based on a different class of combinatorial objects, the coalescent histories (Degnan and Salter, 2005; Rosenberg, 2007; Than et al., 2007; Rosenberg and Degnan, 2010; Rosenberg, 2013; Disanto and Rosenberg, 2015, 2016). In such comparisons, we expect that the ancestral configurations will often grow slower, as is seen in comparing the polynomial growth of the number of ancestral configurations in the caterpillar case with the corresponding exponential growth of the number of coalescent histories. However, the trees with the largest numbers of coalescent histories and the largest number of ancestral configurations are not the same, so that potential exists for each type of algorithm to be favorable in different cases. It remains to be seen whether the complexity of gene tree probability calculations can be reduced by choosing the computational approach based on tree sizes and shapes under consideration.

Many enumerative problems on ancestral configurations remain open. First, we assumed that the gene tree and species tree have the same labeled topology, and we did not study nonmatching gene trees and species trees. As has been seen for coalescent histories (Rosenberg and Degnan, 2010), however, the nonmatching case merits further analysis, as a nonmatching gene tree labeled topology can have more root configurations and more total configurations than the topology that matches the species tree. Consider a caterpillar species tree topology $((\ldots((a_1, a_2), a_3), \ldots), a_n)$, labeling the unique internal node with k descendants b_k for $2 \leq k \leq n$. For a matching caterpillar gene tree, by Proposition 1, the number of configurations at node b_k is $k - 1$, so that the number of root configurations is $n - 1$ and the total number of configurations is $\sum_{k=2}^{n} (k-1) = n(n-1)/2$.

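Now consider a pseudocaterpillar gene tree topology $((\ldots(((a_1, a_2), (a_3, a_4)), a_5), \ldots), a_n)$ with $n \geq 5$, continuing with the caterpillar as the species tree topology. The pseudocaterpillar differs from the caterpillar only in the placement of a₄. We label the node of the gene tree ancestral to a₁ and a₂ by d₂, the node ancestral to a₃ and a₄ by $d_2'$, and the unique node ancestral to k taxa, $4 \leq k \leq n$, by d_k. At nodes b₂, b₃, b₄, and b₅ of the species tree, the configurations are $C(b_2) = \{\{a_1, a_2\}\}$, $C(b_3) = \{\{a_1, a_2, a_3\}, \{d_2, a_3\}\}$, $C(b_4) = \{\{a_1, a_2, a_3, a_4\}, \{d_2, a_3, a_4\}\}$, and $C(b_5) = \{\{a_1, a_2, a_3, a_4, a_5\}, \{d_2, a_3, a_4, a_5\}, \{a_1, a_2, d_2', a_5\}, \{d_2, d_2', a_5\}, \{d_4, a_5\}\}$, with $c(b_2) = 1$, $c(b_3) = 2$, $c(b_4) = 2$, and $c(b_5) = 5$. For $k \geq 6$, $C(b_k)$ is obtained by adding taxon a_k to each configuration in $C(b_{k-1})$ and noting the existence of one additional configuration, $\{d_{k-1}, a_k\}$, so that $c(b_k) = c(b_{k-1}) + 1 = k$. The number of root configurations of the pseudocaterpillar for $n \geq 5$ is n, and the number of total configurations is $1 + 2 + 2 + \sum_{k=5}^{n} k = n(n+1)/2 - 5$. Because $n > n - 1$ and $n(n+1)/2 - 5 > n(n-1)/2$ for $n \geq 6$, root configurations and total configurations are more numerous for the nonmatching pseudocaterpillar topology than for the matching caterpillar.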

Second, when ancestral configurations are grouped according to an equivalence relationship defined in the appendix of Wu (2012) that accounts for symmetries in gene trees, the number of the resulting equivalence classes—the number of nonequivalent ancestral configurations—remains to be investigated. For gene trees and species trees with a matching labeled topology, our enumerations can be used as upper bounds for the number of nonequivalent ancestral configurations, and they can help in measuring the decrease in the number of ancestral configurations when the equivalence relationship is taken into account. We defer this analysis for future work.

7. Appendix 1. Proof of Equation (9)

Given a tree t, fix without loss of generality one of the possible planar representations of the tree t: one of the possible drawings of t in which edges do not cross and intersect only at their endpoints (Fig. 1A).

A root configuration of t uniquely determines a partition of the set of leaves of t in the following way. If $\{k_1, \ldots, k_m\}$ is a root configuration of t, where each k_i is a node of t, then the associated partition is $\{L(k_1), \ldots, L(k_m)\}$, where $L(k_i)$ is the set of leaves of t descended from node k_i (including k_i itself when k_i is a leaf). For instance, the partition of the leaf label set associated with the root configuration depicted in Figure 1B is . Note that for each pair of indices $(i, j)$ with $i \neq j$, the leaves in $L(k_i)$ are either all on the left or all on the right of the leaves in $L(k_j)$ in the planar representation of t.

Without loss of generality, we can assume that the set $\{k_1, \ldots, k_m\}$ is indexed such that if $i < j$, then the leaves in $L(k_i)$ are all depicted in the planar representation to the left of the leaves in $L(k_j)$. Taking the cardinality of each element of the partition determines the vector $(|L(k_1)|, \ldots, |L(k_m)|)$, which represents a composition, or ordered partition, of the integer $n = |t|$. For instance, for the root configurations of the tree of size $n = 6$ depicted in Figure 1A, we obtain the following compositions of 6:

As can be seen in this example, for a given planar representation of t, the mapping from root configurations to compositions is injective (i.e., distinct root configurations yield distinct compositions). For $1 \leq i \leq n$, there are $\binom{n-1}{i-1}$ compositions of n into i parts, as $i - 1$ demarcations must be placed among $n - 1$ possible positions between entries of the length-n vector $(1, 1, \ldots, 1)$ to separate groups of 1s that will be aggregated together. Using the binomial theorem to sum over all possible values of i, the number of distinct compositions of n is $\sum_{i=1}^{n} \binom{n-1}{i-1} = 2^{n-1}$. Because each root configuration is associated with a distinct composition of n, we obtain $c_r(t) \leq 2^{n-1}$, and the proof of Equation (9) is complete.
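The injective map from root configurations to compositions can be illustrated by direct enumeration. In this sketch a tree is a nested tuple with string leaves, read left to right as a planar representation, and a root configuration is taken to be any set of nodes, other than the root alone, whose descendant leaf sets partition the leaves. That reading is an assumption consistent with a count of 0 for the 1-taxon tree. The 6-taxon example tree is hypothetical and is not the tree of Figure 1A.

```python
def leaves(tree):
    """Leaves of a nested-tuple tree, in left-to-right planar order."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def cuts(tree):
    """All node tuples (left to right) whose leaf sets partition the leaves of
    `tree`, starting with the trivial tuple that contains the tree's own root."""
    if isinstance(tree, str):
        return [(tree,)]
    left, right = tree
    return [(tree,)] + [l + r for l in cuts(left) for r in cuts(right)]


def root_configurations(tree):
    """Nontrivial cuts only; the tuple holding just the root is excluded."""
    return cuts(tree)[1:]


def composition(configuration):
    """Sizes of the leaf sets below each node of the configuration, left to right."""
    return tuple(len(leaves(node)) for node in configuration)


if __name__ == "__main__":
    example = ((("a", "b"), "c"), (("d", "e"), "f"))  # hypothetical 6-taxon tree
    configs = root_configurations(example)
    comps = [composition(c) for c in configs]
    n = len(leaves(example))
    print(len(configs), len(set(comps)), 2 ** (n - 1))
    print(sorted(comps))
```

Every composition appears once, so the number of root configurations cannot exceed the number of compositions of n, which is 2^(n-1).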

8. Appendix 2. Proof of Equation (17)

We obtain Equation (17) from Equation (16) by noting that for z close to 0, the following expansion holds:

9. Appendix 3. Proof of Corollary 1

The proof follows from the properties stated in Proposition 4. Part (i) is immediate from Proposition 4 and the definition of the exponential order.

For (ii), we start with m_n. Let $k_m^n$ be the exponential growth of the sequence m_n, so that k_m is its exponential order. Denote by $(t_n)$ the caterpillar family of trees, where t_n is the caterpillar with n taxa. Thus, $c(t_n)$ is the total number of configurations in t_n and $c_r(t_n)$ is its number of root configurations. By Equation (11), we have , and from part (i) of the corollary. Thus, Because total configurations are at least as numerous as root configurations, . Then the growth of m_n has exponential order at most that of $c(t_n)$, so that $k_m \leq 1$. Clearly, however, we cannot have $k_m < 1$, because $m_n \geq 1$ for $n \geq 2$ and $k_m < 1$ would imply that the sequence m_n decreases below 1 with increasing n. Thus, $k_m = 1$.

For the sequence M_n, let $k_M^n$ be the exponential growth of the sequence. This sequence has exponential order k_M. Suppose $(t_n)$ is any sequence of trees with $|t_n| = n$ such that $c(t_n) = M_n$; that is, t_n has the largest total number of configurations among trees of size n. From Equation (11), , where the latter sequence has order smaller than or equal to k₀ because by definition for all n, and from part (i) of the corollary. Thus, At the same time, for all n, we have , as the largest total number of configurations is at least as large as the largest number of root configurations. Thus, $k_M \geq k_0$. It follows that $k_M = k_0$.

10. Appendix 4. Proof of Equation (32)

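The proof follows from the tree decomposition procedure that is illustrated in Figure 8. According to this procedure, each tree t of size n is either the 1-taxon tree, denoted $\bullet$, or it can be created in a unique way by relabeling and appending to a shared root node two smaller trees t₁ and t₂ that become the root subtrees of t. From Proposition 1, the number of root configurations of t can be computed in this case as the product $[c_r(t_1) + 1][c_r(t_2) + 1]$. Summing over all possible trees t, the tree decomposition described in Figure 8 translates into the following decomposition for the generating function $F(z)$:

$$F(z) = \sum_{t \in T} \frac{c_r(t)\, z^{|t|}}{|t|!} = c_r(\bullet)\, z + \sum_{t \in T:\, |t| > 1} \frac{c_r(t)\, z^{|t|}}{|t|!} = c_r(\bullet)\, z + \frac{1}{2} \sum_{(t_1, t_2) \in T \times T} \binom{|t_1| + |t_2|}{|t_1|} \frac{[c_r(t_1) + 1][c_r(t_2) + 1]\, z^{|t_1| + |t_2|}}{(|t_1| + |t_2|)!}. \tag{44}$$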

FIG. 8. Composition of two trees t₁ and t₂ of sizes $n_1$ and $n_2$ to obtain a tree t of size $n = n_1 + n_2$. **(A)** Trees t₁ and t₂ with their original leaf labels. As in Section 2.1, we impose without loss of generality a linear order for the leaves of each tree. **(B)** Relabeling of trees t₁ and t₂. After relabeling, t₁ and t₂ have leaves labeled in a common set of n new labels. For the relabeling procedure, we choose (dotted circles) n₁ elements among the n possible new labels. There are exactly $\binom{n}{n_1}$ different choices. The chosen elements relabel t₁, whereas the elements not selected (dotted squares) relabel t₂. With respect to the linear order, the ith label of t₁ is assigned the label determined by the ith circle. Similarly, the ith label of t₂ is assigned the label determined by the ith square. **(C)** After relabeling t₁ and t₂, the new tree t is obtained by appending t₁ and t₂ to a shared root node. Starting with trees t₁ and t₂ in **(A)**, the same procedure can generate different trees t, one for each possible choice of the n₁ elements (dotted circles) among the n new labels. The only exception is when $t_1 = t_2$, for which the relabelings generate each tree exactly twice.

The first equality is the definition of $F(z)$. In the second equality, the set of trees over which the sum is evaluated is partitioned into two parts, the 1-taxon tree and the trees of size larger than 1. In the third equality, the set of trees t with $|t| > 1$ is realized taking all possible pairs of trees $(t_1, t_2) \in T \times T$ and applying to each pair the procedure in Figure 8, considering all $\binom{|t_1| + |t_2|}{|t_1|}$ possible relabelings of t₁ and t₂. The quantity $c_r(t)$ in the sum is replaced by the product $[c_r(t_1) + 1][c_r(t_2) + 1]$ and the term $z^{|t|}/|t|!$ is replaced by $z^{|t_1| + |t_2|}/(|t_1| + |t_2|)!$. Note the factor $1/2$ that appears in Equation (44) before the summation. This factor takes into account the fact that for each pair $(t_1, t_2)$ with $t_1 \neq t_2$, there exists a symmetric pair $(t_2, t_1)$. Symmetric pairs generate exactly the same trees according to the procedure in Figure 8, and multiplying by $1/2$ is required to avoid double counting. When $t_1 = t_2$, the factor $1/2$ is still required because only half of the relabelings of t₁ and t₂ (Fig. 8B) create nonisomorphic trees when t₁ and t₂ are appended to a shared root node. Finally, observe that the number of root configurations in the 1-taxon tree is 0.

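From Equation (44) and the definitions of $T(z)$ and $F(z)$ in Equations (2) and (31), algebraic manipulations yield

$$F(z) = \frac{1}{2} \left[\sum_{t_1 \in T} \frac{[c_r(t_1) + 1]\, z^{|t_1|}}{|t_1|!}\right] \left[\sum_{t_2 \in T} \frac{[c_r(t_2) + 1]\, z^{|t_2|}}{|t_2|!}\right] = \frac{[F(z) + T(z)]^2}{2}.$$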
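The decomposition, including its factor of 1/2, can be checked by brute force for small n. The sketch below enumerates every labeled topology on labels 1..n as nested tuples, sums the root configurations under the assumed recursion c_r(t) = (c_r(t₁) + 1)(c_r(t₂) + 1), and compares the totals with the coefficient recurrence implied by F = (F + T)²/2. All function names are illustrative.

```python
from itertools import combinations
from math import comb


def labeled_topologies(labels):
    """Yield every rooted binary labeled topology on `labels` as nested tuples."""
    labels = tuple(labels)
    if len(labels) == 1:
        yield labels[0]
        return
    first, rest = labels[0], labels[1:]
    # Keeping `first` in the left subtree lists each unordered pair of root
    # subtrees once, which plays the role of the factor 1/2 in the decomposition.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(x for x in rest if x not in extra)
            for t1 in labeled_topologies(left):
                for t2 in labeled_topologies(right):
                    yield (t1, t2)


def root_config_count(tree):
    """Root configurations under the assumed recursion; leaves are plain integers."""
    if not isinstance(tree, tuple):
        return 0
    return (root_config_count(tree[0]) + 1) * (root_config_count(tree[1]) + 1)


def brute_force_total(n):
    return sum(root_config_count(t) for t in labeled_topologies(range(1, n + 1)))


def recurrence_totals(n_max):
    """t[n] = (2n-3)!! and f[n] from coefficient extraction in F = (F + T)**2 / 2."""
    t, f = [0, 1], [0, 0]
    for n in range(2, n_max + 1):
        t.append(t[-1] * (2 * n - 3))
        a = [f[i] + t[i] for i in range(n)]
        f.append(sum(comb(n, i) * a[i] * a[n - i] for i in range(1, n)) // 2)
    return t, f


if __name__ == "__main__":
    t, f = recurrence_totals(7)
    for n in range(1, 8):
        print(n, t[n], f[n], brute_force_total(n))
```

The enumeration grows as (2n-3)!!, so the brute-force side is practical only for small n, while the recurrence extends to large n.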

11. Appendix 5. Proof of Equation (38)

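The proof follows the case of Equation (32). For $|t| > 1$, the number $c_r(t)^2$ can be obtained as the product $[c_r(t_1) + 1]^2 [c_r(t_2) + 1]^2$, where t₁ and t₂ are the root subtrees of t. The tree decomposition described in Figure 8 yields

$$G(z) = \frac{1}{2} \sum_{(t_1, t_2) \in T \times T} \binom{|t_1| + |t_2|}{|t_1|} \frac{[c_r(t_1) + 1]^2 [c_r(t_2) + 1]^2\, z^{|t_1| + |t_2|}}{(|t_1| + |t_2|)!} = \frac{[G(z) + 2F(z) + T(z)]^2}{2}.$$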

Acknowledgments

The authors thank Elizabeth Allman, James Degnan, and John Rhodes for discussions and NIH grant R01 GM117590 for financial support.

Author Disclosure Statement

No competing financial interests exist.

References

Aho A.V., and Sloane N.J.A. 1973. Some doubly exponential sequences. Fibonacci Q. 11, 429–437.
Brown J.K.M. 1994. Probabilities of evolutionary trees. Syst. Biol. 43, 78–91.
Degnan J.H., and Rosenberg N.A. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2, 762–768.
Degnan J.H., Rosenberg N.A., and Stadler T. 2012. The probability distribution of ranked gene trees on a species tree. Math. Biosci. 235, 45–55.
Degnan J.H., and Salter L.A. 2005. Gene tree distributions under the coalescent process. Evolution 59, 24–37.
de Mier A., and Noy M. 2012. On the maximum number of cycles in outerplanar and series-parallel graphs. Graphs Combinator. 28, 265–275.
Disanto F., and Rosenberg N.A. 2015. Coalescent histories for lodgepole species trees. J. Comput. Biol. 22, 918–929.
Disanto F., and Rosenberg N.A. 2016. Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees. IEEE/ACM Trans. Comput. Biol. Bioinf. 13, 913–925.
Disanto F., Schlizio A., and Wiehe T. 2013. Yule-generated trees constrained by node imbalance. Math. Biosci. 246, 139–147.
Disanto F., and Wiehe T. 2013. Exact enumeration of cherries and pitchforks in ranked trees under the coalescent model. Math. Biosci. 242, 195–200.
Felsenstein J. 1978. The number of evolutionary trees. Syst. Zool. 27, 27–33.
Flajolet P., and Sedgewick R. 2009. Analytic Combinatorics. Cambridge University Press, Cambridge.
Hammersley J.M., and Grimmett G.R. 1974. Maximal solutions of the generalized subadditive inequality. Pages 270–285 in Harding E.F., and Kendall D.G., eds. Stochastic Geometry. Wiley, London.
Harding E.F. 1971. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab. 3, 44–77.
Harding E.F. 1974. The probabilities of the shapes of randomly bifurcating trees. Pages 259–269 in Harding E.F., and Kendall D.G., eds. Stochastic Geometry. Wiley, London.
Maddison W.P. 1997. Gene trees in species trees. Syst. Biol. 46, 523–536.
McKenzie A., and Steel M. 2000. Distributions of cherries for two models of trees. Math. Biosci. 164, 81–92.
Rosenberg N.A. 2006. The mean and variance of the numbers of r-pronged nodes and r-caterpillars in Yule-generated genealogical trees. Ann. Comb. 10, 129–146.
Rosenberg N.A. 2007. Counting coalescent histories. J. Comput. Biol. 14, 360–377.
Rosenberg N.A. 2013. Coalescent histories for caterpillar-like families. IEEE/ACM Trans. Comp. Biol. Bioinf. 10, 1253–1262.
Rosenberg N.A., and Degnan J.H. 2010. Coalescent histories for discordant gene trees and species trees. Theor. Popul. Biol. 77, 145–151.
Steel M., and McKenzie A. 2001. Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosci. 170, 91–112.
Than C., and Nakhleh L. 2009. Species tree inference by minimizing deep coalescences. PLoS Comp. Biol. 5, e1000501.
Than C., Ruths D., Innan H., and Nakhleh L. 2007. Confounding factors in HGT detection: Statistical error, coalescent effects, and multiple solutions. J. Comput. Biol. 14, 517–535.
Wu Y. 2012. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66, 763–775.
Yule G.U. 1925. A mathematical theory of evolution based on the conclusions of Dr. J.C. Willis, F.R.S. Phil. Trans. R. Soc. Lond. B 213, 21–87.
