Author manuscript; available in PMC: 2017 Sep 1.
Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2015 Oct 5;13(5):913–925. doi: 10.1109/TCBB.2015.2485217

Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees

Filippo Disanto 1, Noah A Rosenberg 2
PMCID: PMC5096406  NIHMSID: NIHMS822293  PMID: 26452289

Abstract

Coalescent histories provide lists of species tree branches on which gene tree coalescences can take place, and their enumerative properties assist in understanding the computational complexity of calculations central in the study of gene trees and species trees. Here, we solve an enumerative problem left open by Rosenberg concerning the number of coalescent histories for gene trees and species trees with a matching labeled topology that belongs to a generic caterpillar-like family. By bringing a generating function approach to the study of coalescent histories, we prove that for any caterpillar-like family with seed tree $t$, the sequence $(h_n)_{n \geq 0}$ describing the number of matching coalescent histories of the $n$th tree of the family grows asymptotically as a constant multiple of the Catalan numbers. Thus, $h_n \sim \beta_t c_n$, where the asymptotic constant $\beta_t > 0$ depends on the shape of the seed tree $t$. The result extends a claim demonstrated only for seed trees with at most 8 taxa to arbitrary seed trees, expanding the set of cases for which detailed enumerative properties of coalescent histories can be determined. We introduce a procedure that computes from $t$ the constant $\beta_t$ as well as the algebraic expression for the generating function of the sequence $(h_n)_{n \geq 0}$.

Keywords: Catalan numbers, caterpillar-like trees, coalescent, enumeration, generating functions, phylogenetics

1 Introduction

Coalescent histories, mathematical structures representing combinatorially distinct ways in which a given gene tree can coalesce along the branches of a given species tree, are important in a variety of phylogenetic problems [6], [14], [15]. They arise, for example, in proofs concerning theoretical properties of species tree inference algorithms [1], [18], in empirical analyses of gene tree probability distributions [16], and in studies of gene trees under hybridization [21]. Many of these applications trace to the appearance of coalescent histories in a sum performed in a fundamental calculation for inference of species trees from information on multiple genetic loci, the evaluation of gene tree probabilities conditional on species trees [5].

Owing to uses of coalescent histories in sets over which sums are computed, as well as in state spaces of certain phylogenetic Markov chains [7], [10], [11], solutions to enumerative problems involving coalescent histories contribute to an understanding of the computational complexity of phylogenetic calculations. A recursion for the number of coalescent histories for a given gene tree and species tree has been established [13], and several studies have reported exact numerical results and closed-form expressions for the number of coalescent histories for small trees and for specific types of trees of arbitrarily large size [4]–[6], [13]–[15], [19]. The latter computations have proceeded both by solving or deploying the recursion in specific cases [13]–[15], [19], as well as by identifying correspondences between coalescent histories and other combinatorial structures for which enumerative results have already been established [4]–[6].

One class of gene trees and species trees of particular interest for enumeration of coalescent histories is the caterpillar-like families, trees that have a caterpillar shape, except that the caterpillar subtree with r taxa is replaced by a subtree of size r that is not necessarily a caterpillar subtree (Fig. 1). For the simplest caterpillar-like family, the caterpillar trees themselves, if the gene tree and species tree have the same caterpillar labeled topology with n taxa, then, as reported in [5], the number of coalescent histories is a Catalan number,

$$c_{n-1} = \frac{1}{n}\binom{2n-2}{n-1}. \tag{1}$$

Fig. 1.

A caterpillar-like family of species trees $(t^{(n)})_{n \geq 0}$. For a seed tree $t$, by adding $n \geq 0$ branches each with 1 leaf, we obtain the $n$th tree of the family, $t^{(n)}$. If $t$ has 2 taxa, then $(t^{(n)})_{n \geq 0}$ is simply the caterpillar family.

For Tr-caterpillar-like families, in which the r-taxon subtree of an n-taxon caterpillar species tree is replaced by an r-taxon subtree Tr (Fig. 1), by employing the recursion, Rosenberg [14] obtained the exact number of coalescent histories for all n, for each Tr with r ≤ 8, in the case that the gene tree and species tree have the same labeled topology. Rosenberg [14] argued that in each of these cases, as n → ∞, the number of coalescent histories is asymptotic to a constant multiple of the Catalan numbers. A proof of this result has been presented in full for each case with r ≤ 5 [4], [13], [14], and by computer algebra for cases with r = 6, 7, and 8 [14].

Each case considered by [14] involved cumbersome computations specific to the choice of Tr, limiting the generality of the approach. While no reason exists to suspect that the method of [14] would not extend to larger r, it is desirable to find another method that is practical for a general Tr. Here, using a substantially different strategy that brings to studies of coalescent histories the methods of analytic combinatorics, we produce an enumeration result that covers caterpillar-like families in general. We show that the result of [14] applies to all caterpillar-like families, not only those for which Tr has r ≤ 8. That is, we demonstrate that for any Tr, as n → ∞, the number of coalescent histories in the Tr-caterpillar-like family is asymptotic to a constant multiple of the Catalan numbers—thus extending a result known only for r ≤ 8 to arbitrarily large r. We describe a method and symbolic tool for computing the constant. Finally, we discuss the impact of the results in mathematical phylogenetics.

2 Preliminaries

2.1 Species trees and coalescent histories

We consider binary rooted leaf-labeled species trees, taking a single arbitrary labeling (without loss of generality) to represent a given unlabeled species tree topology. We consider an arbitrarily labeled species tree and its unlabeled tree interchangeably, treating the labeling as implicit.

We examine coalescent histories for the case in which gene trees and species trees have the same labeled topology t, terming a coalescent history in this case a matching coalescent history. To be a matching coalescent history, a mapping h from the internal nodes of t (viewed as the gene tree) to the branches of t (viewed as the species tree) must satisfy two conditions (Fig. 2): (a) for each leaf x in t, if x descends from node k in t, then x descends from branch h(k) in t; (b) for each pair of internal nodes k1 and k2 in t, if k2 descends from k1 in t, then branch h(k2) descends from or coincides with branch h(k1) in t. We henceforth consider only matching coalescent histories, treating “matching” as implicit; we also refer simply to histories for short.
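Conditions (a) and (b) can be checked by brute force on small trees. The sketch below is an illustration only, not the enumeration method developed in this paper. It assumes trees are written as nested Python tuples with distinct leaf labels, and it names each branch by the set of leaves below it; the function names are hypothetical.

```python
from itertools import product

def clades(tree):
    """Return (leaf set of tree, leaf sets of all internal nodes of tree)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    left, right = tree
    left_leaves, left_internal = clades(left)
    right_leaves, right_internal = clades(right)
    here = left_leaves | right_leaves
    return here, left_internal + right_internal + [here]

def count_matching_histories(tree):
    """Count maps h satisfying conditions (a) and (b) by exhaustive search."""
    _, internal = clades(tree)
    # Condition (a): node k may map only to the branch above a node v whose
    # leaf set contains every leaf descending from k.
    candidates = [[v for v in internal if k <= v] for k in internal]
    count = 0
    for choice in product(*candidates):
        h = dict(zip(internal, choice))
        # Condition (b): if k2 descends from k1, branch h(k2) must descend
        # from or coincide with branch h(k1).
        if all(h[k2] <= h[k1] for k1 in internal for k2 in internal if k2 < k1):
            count += 1
    return count

print(count_matching_histories(((("A", "B"), "C"), "D")))   # 4-taxon caterpillar
print(count_matching_histories((("A", "B"), ("C", "D"))))   # 4-taxon balanced tree
```

The search is exponential in the number of internal nodes, so it is practical only for trees with a handful of taxa.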

Fig. 2.

Matching coalescent histories. (A) A matching coalescent history. (B) A mapping from the internal nodes of a tree to its branches that does not satisfy condition (a). Leaf B is descended from node k but does not descend from branch h(k). (C) A mapping from the internal nodes of a tree to its internal branches that does not satisfy condition (b). Node k2 is descended from node k1, but branch h(k2) is strictly ancestral to branch h(k1).

2.2 Caterpillar-like families of species trees

For a binary species tree $t$ with at least 2 taxa, we denote by $(t^{(n)})_{n \geq 0}$ the caterpillar-like family generated by seed tree $t$. This family is recursively defined by taking $t^{(0)} = t$ and letting $t^{(n+1)}$ be the tree obtained by appending $t^{(n)}$ and a single leaf to a shared root (Fig. 1).

Our interest is in the number of matching coalescent histories of $t^{(n)}$ for $n \geq 0$, a quantity we denote by $h_n(t)$ or simply $h_n$. We note that whereas [14] indexed trees by their numbers of taxa, here $n$ represents the number of taxa appended above the root of the seed tree, so that if seed tree $t$ has $|t|$ taxa, then $|t| + n$ gives the number of taxa in $t^{(n)}$.
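The recursive definition of the family can be sketched in a few lines. This is a minimal illustration that assumes trees are nested tuples; the generated leaf names (`x1`, `x2`, ...) and the function names are hypothetical.

```python
def caterpillar_like(seed, n, prefix="x"):
    """Build the nth tree of the family: start from the seed and repeatedly
    join the current tree and one new leaf at a shared root."""
    tree = seed
    for i in range(1, n + 1):
        tree = (tree, f"{prefix}{i}")
    return tree

def num_taxa(tree):
    """Number of leaves of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return 1
    return num_taxa(tree[0]) + num_taxa(tree[1])

seed = (("A", "B"), ("C", "D"))
third = caterpillar_like(seed, 3)
print(third)                             # seed with three appended leaves
print(num_taxa(third), num_taxa(seed) + 3)
```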

2.3 Principles of analytic combinatorics

We rely on techniques of analytic combinatorics [8] to obtain our enumerative results, and recall several key points. In general, an integer sequence $(a_n)_{n \geq 0}$ can be associated with a formal power series $A(z) = \sum_{n=0}^{\infty} a_n z^n$, also termed the generating function of the integers $a_n$. Considering $z$ as a complex variable, typically in a neighborhood of 0, features of the function $A(z)$ are related to the growth of the coefficients $a_n$.

More precisely, generating functions, considered as complex functions, enable analyses of the asymptotic growth of the associated integer sequences through the analysis of their singularities in the complex plane. In particular, under suitable conditions, there exists a general correspondence between the singular expansion of a generating function A(z) near its dominant singularities—those nearest the origin—and the asymptotic behavior of the associated coefficients an (Chapter VI of [8]). We make use of theorems that describe this correspondence.

2.4 Catalan numbers

The Catalan sequence appears often in combinatorics [8], [9], [17] and features prominently in our analysis. Rewriting eq. (1) with index n rather than n − 1,

$$c_n = \frac{1}{n+1}\binom{2n}{n}. \tag{2}$$

The associated generating function is well known [17]:

$$C(z) = \sum_{n=0}^{\infty} c_n z^n = \frac{1 - \sqrt{1-4z}}{2z}. \tag{3}$$

By definition, if $[z^n]f(z)$ denotes the $n$th term in the power series expansion of $f(z)$ at $z = 0$, we have

$$c_n = [z^n]C(z) = \frac{1}{2}[z^{n+1}]\left(1 - \sqrt{1-4z}\right) = -\frac{1}{2}[z^{n+1}]\sqrt{1-4z}. \tag{4}$$

Here, $1 - \sqrt{1-4z}$ is replaced by $-\sqrt{1-4z}$, as the constant 1 does not contribute to the power series expansion for terms of order $n+1$, with $n \geq 0$. Asymptotically, applying Stirling's formula $n! \sim \sqrt{2\pi n}\,(n/e)^n$ to eq. (2), the Catalan sequence satisfies

$$c_n \sim \frac{4^n}{n^{3/2}\sqrt{\pi}}. \tag{5}$$
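Eqs. (2), (3), and (5) can be checked numerically. The sketch below relies on the standard identity $C(z) = 1 + zC(z)^2$, which follows from eq. (3) but is not stated in the text, and it works with logarithms to avoid floating-point overflow. The function names are illustrative.

```python
from math import comb, exp, log, pi

def catalan(n):
    """Closed form of eq. (2)."""
    return comb(2 * n, n) // (n + 1)

def catalan_by_convolution(n_max):
    """Coefficients c_0..c_{n_max} read off from C(z) = 1 + z*C(z)^2,
    i.e. c_{n+1} = sum_{i=0}^{n} c_i * c_{n-i}."""
    c = [1]
    for n in range(n_max):
        c.append(sum(c[i] * c[n - i] for i in range(n + 1)))
    return c

def asymptotic_ratio(n):
    """c_n divided by the approximation 4^n / (n^{3/2} sqrt(pi)) of eq. (5)."""
    log_approx = n * log(4) - 1.5 * log(n) - 0.5 * log(pi)
    return exp(log(catalan(n)) - log_approx)

print(catalan_by_convolution(8))
for n in (10, 100, 1000):
    print(n, asymptotic_ratio(n))   # ratios approach 1 as n grows
```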

3 The number of matching coalescent histories for caterpillar-like families

We aim to find a procedure that evaluates the number of coalescent histories hn(t) for matching gene trees and species trees in the caterpillar-like family that begins with seed tree t, and moreover, to show that

$$h_n(t) \sim \beta_t c_n, \tag{6}$$

where the multiplier βt > 0 for the Catalan sequence is a constant depending on t. In other words, we wish to demonstrate that as n → ∞, hn/cn converges to a constant βt > 0 that depends on the seed tree t.

First, in Section 3.1, we determine a lower bound for the number of matching coalescent histories of the nth tree t(n) of the caterpillar-like family with seed tree t. Next, in Section 3.2, we introduce a concept of m-rooted histories of a species tree t(n). The section provides an iterative construction of the rooted histories of t(n+1) from those of t(n), describing the construction by means of a convenient labeling scheme. We follow a commonly used combinatorial enumeration strategy [2], [3] that determines a recursive succession rule for successive collections of objects in a sequence and then uses this rule to compute a generating function. In Section 3.3, we use the iterative construction to produce a bivariate generating function whose coefficients hn,m are the numbers of m-rooted histories for trees t(n). We next obtain the generating function for the integer sequence (hn)n≥0 describing the number of matching coalescent histories for the t(n). Finally, using the lower bound from Section 3.1, in Section 3.4, we apply methods of analytic combinatorics to study the asymptotic behavior of hn.

3.1 Lower bound for hn

For our asymptotic analysis, we will need an initial lower bound for hn. To produce this bound, we first define V as the tree with 2 taxa. Recalling that we index trees so that the number of taxa in a tree exceeds by n the number of taxa in the seed tree, we have [4], [13], [14]

$$h_n(V) = c_{n+1}.$$

We can then use a constructive procedure, illustrated in detail in Figure 3, to show that for any seed tree t with |t| ≥ 2,

$$h_n(t) \geq h_n(V) = c_{n+1}. \tag{7}$$

For a seed tree t, we can superimpose V on t so that the root rV of V matches the root rt of t (Fig. 3B). The two leaves of V are identified with two of the leaves of t, one on each side of the root of t. Generating caterpillar-like families by adding n single branches separately to V and to t, the superposition of V on t extends, so that V(n) is superimposed on t(n) (Fig. 3C). The n caterpillar branches of t(n) and V(n) then correspond.

Fig. 3.

Superposition of the caterpillar tree family on a caterpillar-like tree family with arbitrary seed tree of size |t| ≥ 2. (A) A seed tree t and the seed tree V for the caterpillar family. (B) Superposition of V on t, so that the roots rV and rt overlap. (C) Superposition of V(2) (shaded internal nodes) on t(2) (shaded and unshaded nodes). The n = 2 caterpillar branches in V(2) and t(2) overlap, and rV still matches rt. (D) A matching coalescent history of t(2) (dashed and dotted arrows) determines a matching coalescent history of V(2) (dashed arrows) by ignoring arrows from the unshaded nodes.

Each matching coalescent history h of t(n) determines a corresponding matching coalescent history h′ of V(n) by considering the restriction of h to the set of internal nodes of t(n) that correspond to internal nodes of V(n) (Fig. 3D). Conversely, every matching coalescent history of V(n) arises as such a restriction: it can be extended to t(n), for example, by mapping each remaining internal node of t(n) to the branch immediately above it. Thus, for any seed tree t, the number of matching coalescent histories of t(n) is greater than or equal to that of V(n). In symbols, we have eq. (7). We will use this result in Section 3.4.

3.2 Iterative generation of rooted histories

This section describes the iterative procedure that for a seed tree t eventually enables us to determine a formula for hn. First, in Section 3.2.1, we discuss m-rooted histories, which extend the concept of matching coalescent histories, introducing an additional parameter m. Next, in Section 3.2.2, we examine the relationship between rooted histories and the extended coalescent histories of [13], importing results on extended coalescent histories into the more convenient framework of rooted histories. We expand our goal of enumerating matching coalescent histories for t(n), considering a more general problem of enumerating for m ≥ 1 the m-rooted histories of t(n).

In Section 3.2.3, we define an operator Ω for constructing the rooted histories of t(n+1) from the rooted histories of t(n). Next, in Section 3.2.4, we introduce a labeling scheme that in Section 3.2.5 enables us to switch from counting rooted histories to counting multisets of labels. At the end of Section 3.2, we will have converted our enumeration problem into an enumeration that is more convenient for constructing a generating function.

3.2.1 m-rooted histories

Consider a tree t with |t| ≥ 2, and suppose that the branch above the root of t (the root-branch) is divided into infinitely many components. A matching coalescent history mapping the internal nodes of t onto the branches of t is said to be m-rooted for m ≥ 1 if the root of t is mapped exactly onto the mth component of the root-branch (Fig. 4). It is said to be rooted if it is m-rooted for some m. Components are numbered so that component m = 1 is immediately above the root node, and m is greater for components that are farther from the root.

Fig. 4.

Rooted histories of a tree. (A) A 3-rooted history. The root-branch is divided into infinitely many components, the third of which receives the image of the root. (B) A 1-rooted history. The number of 1-rooted histories corresponds to the number of matching coalescent histories of the tree.

For a rooted history $h$ of a tree $t$, $m = m(h)$ denotes the component of the root-branch of $t$ that receives the image of the root of $t$. $H_{n,m}(t)$ denotes the set of $m$-rooted histories of $t^{(n)}$, and $H_n(t) = \bigcup_{m=1}^{\infty} H_{n,m}(t)$ the set of its rooted histories. The number of $m$-rooted histories of $t^{(n)}$ is $h_{n,m} = |H_{n,m}|$, and the number of 1-rooted histories $h_n = h_{n,1}$ is also the number of matching coalescent histories. Enumerating the matching coalescent histories of $t^{(n)}$ is equivalent to enumerating its 1-rooted histories.

3.2.2 Rooted histories and extended histories

Rooted histories are closely related to extended coalescent histories, as defined by [13]. We use this relationship to study properties of rooted histories. Rosenberg [13] defined the set of k-extended coalescent histories of a tree t with |t| ≥ 1 for integers k ≥ 1; we also consider k = 0 by setting the number of 0-extended histories to 0.

A k-extended history is defined as a coalescent history for a species tree whose root-branch is divided into exactly k ≥ 0 parts. In other words, the root-branch has exactly k ≥ 0 possible components onto which a k-extended history can map the gene tree root. Here we consider matching k-extended histories, so that the internal nodes of a tree t are mapped to the branches of t and its k components above the root. For convenience, we refer to extended histories by the index k, reserving the index m for rooted histories.

By the definitions of $k$-extended and $m$-rooted histories, for each $k \geq 0$, the set of $k$-extended histories of a tree is exactly the set of all $m$-rooted histories with $1 \leq m \leq k$. Therefore, for a tree $t$ with at least 2 leaves, if we label by $e_{t,k}$ its number of $k$-extended histories, then for each $m \geq 1$ the number of $m$-rooted histories of $t$ is

$$h_{0,m} = e_{t,m} - e_{t,m-1}. \tag{8}$$

Note that for m = 1, we explicitly use in eq. (8) the fact that et,0 is defined and equal to 0. In addition to setting et,0 = 0 for any tree t, as in [13] we set et,k = 1 for all k ≥ 1 in the case that t has exactly 1 leaf.

Suppose |t| ≥ 1 and k ≥ 0. Denote by tL and tR the left and right subtrees of the root of t. We can compute et,k recursively as in Theorem 3.1 of [13]:

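$$e_{t,k} = \begin{cases} 0 & \text{if } |t| \geq 1 \text{ and } k = 0, \\ 1 & \text{if } |t| = 1 \text{ and } k \geq 1, \\ \sum_{i=1}^{k} e_{t_L,i+1}\, e_{t_R,i+1} & \text{if } |t| \geq 2 \text{ and } k \geq 1. \end{cases} \tag{9}$$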

As was already observed in the remarks following Corollary 3.2 of [13], by eq. (9), for any tree t with |t| ≥ 1, for positive integers k ≥ 1, the function f(k) = et,k is a polynomial in k. With our extension to permit k = 0, we can extend this fact to k ≥ 0 for |t| ≥ 2: for any tree t with |t| ≥ 2, and for k ≥ 0, we claim that the function f(k) = et,k is a polynomial in k. Note that in allowing k = 0, we claim et,k is a polynomial in k only for |t| ≥ 2; for |t| = 1, et,k is not a polynomial in k because et,0 = 0 and et,k = 1 for k ≥ 1.

To prove the claim, fix t with |t| ≥2 and consider the variable k over domain [1, ∞). We demonstrate that f(k) is a polynomial in k for domain [0, ∞) by showing that the closed-form for f(k) has a factor of k, so that our choice et,0 = 0 in eq. (9) is compatible with the polynomial expression valid for k ≥ 1.

Observe that for $i \geq 1$, $e_{t_L,i}$ and $e_{t_R,i}$ are polynomials in $i$, say $P_{t_L}(i)$ and $P_{t_R}(i)$. Replacing terms $e_{t_L,i+1}$ and $e_{t_R,i+1}$ in the recursion in eq. (9) by polynomials $P_{t_L}(i+1)$ and $P_{t_R}(i+1)$, we obtain

$$\sum_{i=1}^{k} e_{t_L,i+1}\, e_{t_R,i+1} = \sum_{i=1}^{k} P_{t_L}(i+1)\, P_{t_R}(i+1) = \sum_{i=1}^{k} P'(i), \tag{10}$$

where $P'(i)$ denotes a polynomial in $i$ that results from the product of $P_{t_L}(i+1)$ and $P_{t_R}(i+1)$. By Faulhaber's formula for sums of powers of integers, symbolic sums of the form $\sum_{i=1}^{k} i^p$ for a fixed integer $p \geq 0$ are polynomials containing a factor of $k$ in their closed forms (Section 6.5 of [9]); for example, $\sum_{i=1}^{k} i^3 = k^2(k+1)^2/4$. Thus, because the polynomial $P'(i)$ is a linear combination of terms of the form $i^p$, the closed-form expression for the sum $\sum_{i=1}^{k} P'(i)$ appearing in eq. (10) also has a factor of $k$. It therefore has a value of 0 at $k = 0$.
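As a worked instance of eq. (10), take $t = ((A, B), (C, D))$. Both root subtrees have 2 taxa, and for a 2-taxon tree eq. (9) gives $e_{t_L,i} = e_{t_R,i} = i$, so that $P'(i) = (i+1)^2$. The sum then expands term by term with Faulhaber's formula, and the factor of $k$ is visible in the result:

```latex
\begin{align*}
e_{t,k} &= \sum_{i=1}^{k} (i+1)^2
         = \sum_{i=1}^{k} i^2 + 2\sum_{i=1}^{k} i + \sum_{i=1}^{k} 1 \\
        &= \frac{k(k+1)(2k+1)}{6} + k(k+1) + k \\
        &= \frac{k}{6}\left(2k^2 + 9k + 13\right).
\end{align*}
```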

Functions et,k for trees t with 1 ≤ |t| ≤ 9 and k ≥ 1 appear in Tables 1-4 of [13]. For |t| ≥ 2, as we have shown, these example polynomials are divisible by the variable representing the number of components of the root-branch. By eq. (8), we immediately obtain the following result.

Proposition 1

For any tree t with |t| ≥ 2 and for m ≥ 1, the number h0,m of m-rooted histories of t is a polynomial in m that can be computed by the difference in eq. (8) using et,k as in eq. (9).

As an example of Proposition 1, consider the tree $t = ((A, B), (C, D))$, identifying this arbitrary labeling with the unlabeled tree (()()). By applying the recursive procedure in eq. (9), we find that for $k \geq 0$, the number of $k$-extended coalescent histories for $t$ is $e_{t,k} = \frac{1}{6}k(2k^2 + 9k + 13)$ [13]. The difference in eq. (8) yields that for $m \geq 1$, the number of $m$-rooted histories of $t$ is $h_{0,m} = e_{t,m} - e_{t,m-1} = m^2 + 2m + 1$.
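The recursion in eq. (9) and the difference in eq. (8) translate directly into code. The following sketch assumes trees are nested tuples whose leaves are any non-tuple labels; the function names `extended` and `rooted` are illustrative, and the code evaluates the quantities numerically rather than producing the polynomials symbolically.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def extended(tree, k):
    """Number e_{t,k} of k-extended histories, following eq. (9)."""
    if k == 0:
        return 0
    if not isinstance(tree, tuple):      # |t| = 1 and k >= 1
        return 1
    left, right = tree                   # |t| >= 2 and k >= 1
    return sum(extended(left, i + 1) * extended(right, i + 1)
               for i in range(1, k + 1))

def rooted(tree, m):
    """Number h_{0,m} of m-rooted histories, following eq. (8)."""
    return extended(tree, m) - extended(tree, m - 1)

seed = (("A", "B"), ("C", "D"))
for m in range(1, 6):
    # Compare with the closed forms k(2k^2 + 9k + 13)/6 and m^2 + 2m + 1.
    print(m, extended(seed, m), rooted(seed, m))
```

Memoization with `lru_cache` keeps the cost modest because the same subtree and index pairs recur many times in the nested sums.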

3.2.3 Rooted histories of t(n+1) from those of t(n)

This section introduces an operator Ω that generates the rooted histories of t(n+1) from those of t(n). For each rooted history h′ of t(n+1), there exists exactly one rooted history h of t(n) with h′ ∈ Ω(h). Recalling the definitions of the sets Hn,m(t) and Hn(t) of m-rooted and rooted histories of t(n), we define Ω as follows.

Definition

Let $\mathcal{P}(X) = \{x : x \subseteq X\}$ denote the power set of set $X$, and fix tree $t$. The operator $\Omega$ is a function

$$\Omega : H_n(t) \to \mathcal{P}(H_{n+1}(t)),$$

where for a rooted history $h \in H_n(t)$, $\Omega(h)$ is the set of rooted histories $h' \in H_{n+1}(t)$ for which the restriction of $h'$ to $t^{(n+1)}$ excluding its most basal caterpillar branch coincides with the rooted history $h$ of $t^{(n)}$.

Denote by b1, b2, . . . , bn+1 the caterpillar branches in t(n+1), from the least basal b1 to the most basal bn+1 (Fig. 5). Upon removal of the most basal caterpillar branch bn+1 from t(n+1), the root of t(n+1)—to which branch bn+1 is attached—is replaced by a demarcation between the first and second components of the root-branch of t(n). For instance, in Fig. 5A, starting from tree t = ((A, B), (C, D)), we consider h‴, a 3-rooted history of t(3). By removing the most basal caterpillar branch b3 of t(3), we reduce to the 1-rooted history h″ of t(2) (Fig. 5B). Next, by removing the caterpillar branch b2 of t(2), we reduce to the 2-rooted history h′ of t(1) (Fig. 5C). By removing the remaining caterpillar branch b1 from t(1), we reduce to the 2-rooted history h of t = t(0) (Fig. 5D). Therefore, by the definition of Ω, we have h′ ∈ Ω(h), h″ ∈ Ω(h′), and h‴ ∈ Ω(h″).

Fig. 5.

The relationships among rooted histories for sequential members of caterpillar-like families. For a rooted history h‴ of t(3), with t = ((A, B), (C, D)), the figure sequentially removes caterpillar branches. By definition, a rooted history h′ of t(n+1) belongs to the set Ω(h) if, by removing the most basal caterpillar branch bn+1 in t(n+1), we recover the rooted history h of t(n). Note that when we remove the basal caterpillar branch bn+1 from t(n+1), the root of t(n+1)—to which the branch bn+1 is attached—becomes the boundary between the first and second components of the root-branch of t(n), and is depicted as a horizontal segment. (A) h‴ ∈ Ω(h″). (B) h″ ∈ Ω(h′). (C) h′ ∈ Ω(h). (D) h. For each rooted history, the value of the parameter m, representing the component of the root-branch that receives the image of the root, is shown.

By definition, $\Omega$ has the property that for each rooted history $h' \in H_{n+1}(t)$, with $n \geq 0$, there exists exactly one rooted history $h \in H_n(t)$ such that $h' \in \Omega(h)$. In other words, for each $n \geq 0$, the set of rooted histories $H_{n+1}(t)$ can be partitioned as a disjoint union,

$$H_{n+1}(t) = \bigsqcup_{h \in H_n(t)} \Omega(h). \tag{11}$$

The set Hn+1(t) is therefore generated without double occurrences of any rooted history by applying Ω to the rooted histories in Hn(t). It follows immediately that in performing n iterations of Ω to obtain Ω[. . . [Ω[Ω(H0)]] . . .] from the set H0 of rooted histories of t(0), all the rooted histories of t(n) are generated exactly once.

3.2.4 Labels for rooted histories

The operator Ω, starting from the rooted histories of t(n), generates the rooted histories of t(n+1). In this section, we introduce a labeling scheme, giving each m-rooted history h of t(n) a label L(h) = (n, m). We then describe how Ω acts on the labels of the rooted histories, characterizing the set of labels L[Ω(h)] = {L(h′) : h′ ∈ Ω(h)}. Our goal is to represent each set Hn of rooted histories of t(n) by the multiset of its labels, reducing the enumeration of |Hn,m| to the problem of counting certain ordered pairs (n, m) iteratively generated by simple rules that reflect how the rooted histories in Hn+1 are generated according to rule Ω from the rooted histories in Hn by eq. (11).

In our labeling, each rooted history $h \in H_n(t)$ that maps the root of $t^{(n)}$ onto the $m$th component of the root-branch of $t^{(n)}$ receives label $L(h) = (n, m)$. Enumeration of $h_n = |H_{n,1}|$ then reduces to enumeration of those rooted histories labeled by $(n, 1)$.

Note that a label $(n, m)$ does not uniquely specify an $m$-rooted history of $t^{(n)}$: a tree $t^{(n)}$ has in general many $m$-rooted histories, each receiving the label $(n, m)$. In other words, if $h, \bar{h} \in H_n(t)$ and $L(h) = L(\bar{h})$, then $h$ and $\bar{h}$ are not necessarily the same rooted history of $t^{(n)}$. We will, however, consider for $n \geq 0$ multisets of labels in which we find a copy of the label $(n, m)$ for each $m$-rooted history of $t^{(n)}$.

To characterize how the operator $\Omega$ acts on the labels for rooted histories, consider an $m$-rooted history $h \in H_n(t)$, so that $h$ maps the root of $t^{(n)}$ onto the $m$th component of the root-branch of $t^{(n)}$. This history is labeled $L(h) = (n, m)$. For instance, taking the seed tree $t = ((A, B), (C, D))$, the history $h$ of $t = t^{(0)}$ depicted in Figure 6A is labeled $L(h) = (0, 3)$, whereas the history $h$ of $t^{(1)}$ in Figure 6C has $L(h) = (1, 1)$.

Fig. 6.

Generation of rooted histories of $t^{(n+1)}$ from rooted histories of $t^{(n)}$, as given by rule $\Omega$ applied to seed tree $t = ((A, B), (C, D))$. To obtain rooted histories of $t^{(n+1)}$ (right) from rooted histories of $t^{(n)}$ (left), we choose the component $m'$ of the root-branch of $t^{(n+1)}$ onto which the root of $t^{(n+1)}$ is mapped (solid arrows). The smallest among infinitely many possible choices are depicted. For all nodes of $t^{(n+1)}$ except the root, the rooted history generated for $t^{(n+1)}$ coincides with the generating rooted history of $t^{(n)}$ (dashed arrows). (A) A case with $m \geq 2$. A 3-rooted history $h$ of $t^{(0)}$, labeled $(0, 3)$, is shown. (B) $\Omega(h)$ for $h$ in (A). 2-, 3-, and 4-rooted histories of $t^{(1)}$ belonging to $\Omega(h)$ are shown and are labeled $(1, 2)$, $(1, 3)$, and $(1, 4)$, respectively. Because $m \geq 2$, $m' \geq m - 1$ as in eq. (12). (C) A case with $m = 1$. A 1-rooted history $h$ of $t^{(1)}$, labeled $(1, 1)$, is shown. (D) $\Omega(h)$ for $h$ in (C). 1- and 2-rooted histories of $t^{(2)}$ belonging to $\Omega(h)$ are shown and are labeled $(2, 1)$ and $(2, 2)$, respectively. Because $m = 1$, $m' \geq m$.

By applying $\Omega$ to a history $h$ of $t^{(n)}$ with $L(h) = (n, m)$, we produce a set of rooted histories $\Omega(h) \subseteq H_{n+1}(t)$. The set of labels for $\Omega(h)$,

$$L[\Omega(h)] = \{L(h') : h' \in \Omega(h)\},$$

is determined according to the rule:

$$L[\Omega(h)] = \begin{cases} \{(n+1, m') : m' \geq m\} & \text{if } m = 1, \\ \{(n+1, m') : m' \geq m-1\} & \text{if } m \geq 2, \end{cases} \tag{12}$$

where $m'$ denotes the value of the parameter $m$ (the component of the root-branch of $t^{(n+1)}$ to which the root is mapped) for the rooted histories $h' \in \Omega(h)$ of $t^{(n+1)}$.

The rule in eq. (12) distinguishes between two cases depending on whether the value of the parameter m = m(h) of the generating rooted history h is equal to or exceeds 1. In both cases, the set L[Ω(h)] contains infinitely many labels, each with its first component equal to n+1, as the labels refer to rooted histories of t(n+1). The value of the second component m′ ranges in [m − 1, ∞) if m ≥ 2, and in [1, ∞) if m = 1.

Recall that according to the definition of Ω, from an m-rooted history h of t(n) (Fig. 6A and 6C), we generate an m′-rooted history h′ ∈ Ω(h) of t(n+1) (Fig. 6B and 6D) by (i) choosing the component m′ of the root-branch of t(n+1) onto which h′ maps the root of t(n+1), and (ii) letting h′ coincide with h on all nodes of t(n+1) except the root. The rooted history h′ coincides with h once we remove the most basal caterpillar branch of t(n+1).

Figure 6 illustrates both cases of eq. (12). In step (i), infinitely many choices of m′ are possible, because the root-branch of t(n+1) is divided into infinitely many parts. The most basal caterpillar branch in t(n+1) is attached at the border between the first and second components of the root-branch of t(n). Thus, the addition of the (n + 1)st caterpillar branch eliminates a component of the root-branch, so that if the starting rooted history h has m ≥ 2 (Fig. 6A), then the root of t(n) maps to component m − 1 of the root-branch of t(n+1). The root of t(n+1) can map to this same branch, or to any branch m′ with m′ ≥ m − 1. For instance, in Figure 6B, one of the rooted histories h′ generated by a rooted history h with m = 3 has m′ = m − 1 = 2.

If h has m = 1, however, then production of h′ is slightly different (Fig. 6C). By definition, the parameter m for a rooted history cannot be smaller than 1. The value m′ = m − 1 is not permitted, and m′ remains greater than or equal to m = 1 (Fig. 6D).
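The two cases of eq. (12) can be written out for the labels used in Figure 6. The first line below is the case $m \geq 2$ (Fig. 6A and 6B), and the second is the case $m = 1$ (Fig. 6C and 6D); each set is infinite, and only its smallest elements are listed.

```latex
\begin{align*}
L(h) = (0, 3) &\;\Longrightarrow\; L[\Omega(h)] = \{(1, 2),\, (1, 3),\, (1, 4),\, \ldots\}, \\
L(h) = (1, 1) &\;\Longrightarrow\; L[\Omega(h)] = \{(2, 1),\, (2, 2),\, (2, 3),\, \ldots\}.
\end{align*}
```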

3.2.5 Counting the labels of rooted histories

The labeling scheme in Section 3.2.4 encodes the application of the operator Ω to the rooted histories of t(n). Now that we have described the set of labels L[Ω(h)] arising from the label L(h) according to the rule in eq. (12), the problem of counting a set of rooted histories becomes a problem of counting the set of the associated labels along with their multiplicities—or the multiset of the labels.

For $n \geq 0$ and $m \geq 1$, we use $\Omega((n, m))$ to denote, with an abuse of notation, the set of labels $L[\Omega(h)]$ when $L(h) = (n, m)$. Recalling that iterative application of $\Omega$ to the rooted histories $H_0$ of tree $t^{(0)}$ generates the rooted histories $H_n$ of $t^{(n)}$, the enumeration of $|H_{n,m}|$ for tree $t = t^{(0)}$ becomes a problem of counting those labels of the form $(n, m)$ that are generated when we iteratively apply the operator $\Omega$ as $\Omega[\ldots[\Omega[\Omega(L_0)]]\ldots]$ starting from the multiset of labels $L_0 = \{L(h) : h \in H_0(t)\}$ (Fig. 7).

Fig. 7.

Iterative application of a rule for generating the multiset of the labels of the rooted histories of a tree t(n). The iterative procedure starts with the multiset L0 that contains those labels of the form {(0, m) : m ≥ 1} associated with the rooted histories of a seed tree t = t(0). In the first step of the iteration, we apply Ω (eq. (13)) to each label of L0. In the second step, we apply Ω to each label resulting from the first step, and so on. The number of m-rooted histories of t(n) corresponds to the number of labels (n, m), considered with their multiplicity, generated after the nth step of the iteration.

Eq. (12) characterizes the set of labels L[Ω(h)] of the rooted histories in Ω(h) in terms of the label L(h) of rooted history h. If L(h) = (n, m), then Ω((n, m)) denotes the set of labels L[Ω(h)]. Thus, converting the notation from histories to labels, eq. (12) becomes

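$$\Omega((n, m)) = \begin{cases} \{(n+1, m') : m' \geq m\} & \text{if } m = 1, \\ \{(n+1, m') : m' \geq m-1\} & \text{if } m \geq 2. \end{cases} \tag{13}$$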

For the seed tree $t$, we count $h_{n,m} = |H_{n,m}|$ by evaluating the number of occurrences of the ordered pair $(n, m)$ in the multiset $L_n$ defined as

$$L_n = L[H_n(t)] = \{L(h) : h \in H_n(t)\}. \tag{14}$$

In symbols, we have

$$h_{n,m} = \left|\{\ell \in L_n : \ell = (n, m)\}\right|. \tag{15}$$

By eq. (11), each multiset $L_n$ is generated iteratively (Fig. 7). We start with the multiset of labels

$$L_0 = \{L(h) : h \in H_0(t)\}. \tag{16}$$

For each $n \geq 0$, the multiset $L_{n+1}$ is obtained as

$$L_{n+1} = \biguplus_{(n,m) \in L_n} \Omega((n, m)), \tag{17}$$

where the symbol $\uplus$ denotes the union operator for multisets. Thus, in $M = M_1 \uplus M_2$, if an element $x$ appears $n_1$ times in $M_1$ and $n_2$ times in $M_2$, then it appears $n_1 + n_2$ times in $M$. Eq. (17) provides an iterative generation of the labels for the rooted histories of $H_{n+1}(t)$ from the labels of the rooted histories of $H_n(t)$, retaining information about the multiplicity of occurrences of each label.
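The iteration in eqs. (16) and (17) can be simulated with a multiset of labels. Because each set $\Omega((n, m))$ is infinite, the sketch below truncates the second component at $m \leq n + 1$, where $n$ is the target number of steps. This truncation is an assumption of the sketch rather than part of the construction; it is harmless for the count of the label $(n, 1)$ because one application of $\Omega$ lowers $m$ by at most 1. The function names are illustrative, and the seed enters only through the values $h_{0,m}$.

```python
from collections import Counter

def omega(label, m_max):
    """Labels generated from (n, m) by eq. (13), truncated at m' <= m_max."""
    n, m = label
    lowest = 1 if m == 1 else m - 1
    return [(n + 1, m_new) for m_new in range(lowest, m_max + 1)]

def count_histories(h0, n):
    """Multiplicity of the label (n, 1) after n multiset unions as in eq. (17).

    h0(m) must return h_{0,m}, the number of m-rooted histories of the seed.
    """
    m_max = n + 1
    labels = Counter({(0, m): h0(m) for m in range(1, m_max + 1)})
    for _ in range(n):
        nxt = Counter()
        for label, multiplicity in labels.items():
            for new_label in omega(label, m_max):
                nxt[new_label] += multiplicity
        labels = nxt
    return labels[(n, 1)]

# Seed ((A, B), (C, D)), for which h_{0,m} = m^2 + 2m + 1.
print([count_histories(lambda m: (m + 1) ** 2, n) for n in range(6)])
# 2-taxon seed V, for which h_{0,m} = 1; the counts should be c_{n+1}.
print([count_histories(lambda m: 1, n) for n in range(6)])
```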

3.3 Rooted histories and generating functions

We have now obtained eq. (15), which gives an equivalence between the number of m-rooted histories of t(n) and the number of labels (n, m) in the multiset Ln, and eqs. (16) and (17), which give through Ω (eq. (13)) an iterative procedure that generates the family of multisets (Ln)n≥0. In this section, we translate the iterative procedure into algebraic terms, determining the generating function associated with the integer sequence (hn)n≥0.

First, in Section 3.3.1, we characterize a generating function g(y) for the sequence (h0,m)m≥1. Next, in Section 3.3.2, we deduce an equation satisfied by the bivariate generating function F(y, z) for (hn,m)n≥0,m≥1. In Section 3.3.3, we solve the equation, obtaining the desired generating function f(z) for the sequence (hn,1)n≥0. This generating function can be written in turn as a function of g(y).

3.3.1 Generating function for (h0,m)m≥1

In this section, we characterize the generating function g(y) that counts for a given seed tree t the labels in the multiset L0 describing the labels of the rooted histories of t.

Fix the seed tree t. Recalling the equivalence in eq. (15), define the generating function

$$g(y) = \sum_{(0,m) \in L_0} y^m = \sum_{m=1}^{\infty} h_{0,m}\, y^m, \tag{18}$$

the mth coefficient of whose power series expansion provides the number $h_{0,m}$ of labels (0, m) appearing in $L_0$. By Proposition 1, $h_{0,m}$ can be expressed as a polynomial in the variable m and can thus be decomposed as a finite linear combination of terms of the form $m^k$, where k is a non-negative integer. That is, for a certain finite set of non-negative integers with largest element K,

$$h_{0,m} = \sum_{k=0}^{K} w_k\, m^k, \tag{19}$$

where the $w_k$ are constants.

We introduce generating functions $g_{m^k}$, one for each k from 0 to K, in which the mth coefficient is $m^k$:

$$g_{m^k}(y) = \sum_{m=1}^{\infty} m^k\, y^m. \tag{20}$$

Because K is finite, the desired generating function g(y) can be written as a finite linear combination of this new collection of generating functions $g_{m^0}(y), g_{m^1}(y), \ldots, g_{m^K}(y)$. More precisely, by substituting in eq. (18) the polynomial in eq. (19) and switching the order of summation, we obtain

$$g(y) = \sum_{k=0}^{K} w_k\, g_{m^k}(y). \tag{21}$$

We now state a lemma that characterizes the generating functions $g_{m^k}(y)$.

Lemma 1

For each non-negative integer k from 0 to K, the generating function $g_{m^k}(y)$ in eq. (20) is rational with denominator $(1-y)^{k+1}$. That is, $g_{m^k}(y)$ has the form

$$g_{m^k}(y) = \frac{P(y)}{(1-y)^{k+1}},$$

where P(y) is a polynomial in y.

Proof

We proceed by induction on k. If k = 0, then by eq. (20), $g_{m^0}(y) = 1/(1-y) - 1 = y/(1-y)$. Assume the inductive hypothesis for $g_{m^k}(y)$. Applying eq. (20) to $g_{m^{k+1}}(y)$, we can recover $g_{m^{k+1}}(y)$ as

$$g_{m^{k+1}}(y) = y\, \frac{\partial g_{m^k}(y)}{\partial y}, \tag{22}$$

which by the quotient rule for derivatives is a rational function with denominator $(1-y)^{k+2}$.

The proof of the lemma gives a recursive procedure in eq. (22) to compute the functions $g_{m^k}(y)$. By eq. (21), we immediately obtain from the lemma a result about the generating function g(y).

Proposition 2

The generating function g(y) whose mth coefficient $[y^m]g(y)$ is the number of m-rooted histories $h_{0,m}$ of a seed tree t can be written as a finite linear combination

$$g(y) = \sum_{j=1}^{J} q_j\, \frac{y^{a_j}}{(1-y)^{b}}, \tag{23}$$

where b ≥ 1 and J ≥ 1 are positive integers, each $a_j$ is a non-negative integer, and the $q_j$ are constants.

As an example, we show how the procedure in Proposition 2 can determine the generating function g(y) for t = ((A, B), (C, D)), the same example seed tree for which we computed the polynomial $h_{0,m}$ via Proposition 1. Recall from Section 3.2.2 that $h_{0,m} = m^2 + 2m + 1$. To obtain the generating function g(y) that has coefficients $[y^m]g(y) = m^2 + 2m + 1$, we sum generating functions for the monomials $m^2$, 2m, and 1. We already know $g_{m^0}(y)$, and by applying eq. (22), we have

$$g_{m^0}(y) = \frac{y}{1-y}, \qquad g_{m^1}(y) = y\, \frac{\partial g_{m^0}(y)}{\partial y} = \frac{y}{(1-y)^2}, \qquad g_{m^2}(y) = y\, \frac{\partial g_{m^1}(y)}{\partial y} = \frac{y(y+1)}{(1-y)^3}.$$

Thus,

$$g(y) = g_{m^0}(y) + 2\, g_{m^1}(y) + g_{m^2}(y) = \frac{y^3 - 3y^2 + 4y}{(1-y)^3}. \tag{24}$$

In eq. (24), g(y) is written as in eq. (23), taking b = 3, J = 3, $(a_1, a_2, a_3) = (1, 2, 3)$, and $(q_1, q_2, q_3) = (4, -3, 1)$.
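The recursion of eq. (22) is mechanical enough to sketch in Python with polynomials stored as coefficient lists (index = power of y). The helper names below are hypothetical, and the sketch assumes the common denominator $(1-y)^{K+1}$, so that b = K + 1 in eq. (23); it relies on the quotient-rule consequence of eq. (22) that, if $g_{m^k}(y) = P_k(y)/(1-y)^{k+1}$, then $P_{k+1}(y) = y\,[(1-y)P_k'(y) + (k+1)P_k(y)]$.

```python
def trim(p):
    """Drop trailing zero coefficients (keeping at least one entry)."""
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p

def poly_add(p, q):
    n = max(len(p), len(q))
    p, q = p + [0] * (n - len(p)), q + [0] * (n - len(q))
    return trim([a + b for a, b in zip(p, q)])

def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return trim(out)

def poly_diff(p):
    return [i * c for i, c in enumerate(p)][1:] or [0]

def numerators(K):
    """Numerators P_0, ..., P_K with g_{m^k}(y) = P_k(y) / (1 - y)^(k + 1)."""
    P = [[0, 1]]  # P_0(y) = y
    for k in range(K):
        inner = poly_add(poly_mul([1, -1], poly_diff(P[k])), [(k + 1) * c for c in P[k]])
        P.append(poly_mul([0, 1], inner))
    return P

def g_numerator(w):
    """Numerator of g(y) over (1 - y)^(K + 1), where h_{0,m} = sum_k w[k] * m^k."""
    K = len(w) - 1
    P = numerators(K)
    total = [0]
    for k, wk in enumerate(w):
        term = [wk * c for c in P[k]]
        for _ in range(K - k):  # bring the term to the common denominator
            term = poly_mul(term, [1, -1])
        total = poly_add(total, term)
    return total

def linear_combination(w):
    """Return (b, [(a_j, q_j), ...]) as in eq. (23)."""
    num = g_numerator(w)
    return len(w), [(a, q) for a, q in enumerate(num) if q != 0]

# h_{0,m} = m^2 + 2m + 1 for the seed tree ((A, B), (C, D))
print(linear_combination([1, 2, 1]))  # (3, [(1, 4), (2, -3), (3, 1)])
```

The representation returned here need not be in lowest terms for other seed trees; eq. (23) only requires some common power b of (1 − y).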

3.3.2 Bivariate generating function for (hn,m)n≥0,m≥1

Given t, the polynomial nature of $h_{0,m}$ in m enabled us to obtain a generating function for $h_{0,m}$. We now use the iterative procedure in eq. (17) to determine an equation that characterizes the bivariate generating function with coefficients $h_{n,m}$. We represent each label of the form (n, m) by a symbolic algebraic expression in the variables y and z, so that (n, m) is replaced by $z^n y^m$. Let $L = \biguplus_{n=0}^{\infty} L_n$ be the multiset of all m-rooted histories for all trees $t^{(n)}$. Considering y and z as complex variables in two sufficiently small neighborhoods of 0, we aim to characterize the bivariate function F(y, z) that admits the expansion

$$F(y,z) = \sum_{(n,m) \in L} z^n y^m,$$

where the sum is over all labels in the multiset L and thus has a term for each m-rooted history of each $t^{(n)}$. In particular, the function F(y, z) is the bivariate generating function of the integers $h_{n,m}$, and its Taylor expansion can be written as

$$F(y,z) = \sum_{m=1}^{\infty} \sum_{n=0}^{\infty} h_{n,m}\, z^n y^m, \tag{25}$$

where the coefficients $h_{n,m}$ appear explicitly.

By differentiating F(y, z) with respect to y and then taking y = 0, we obtain

$$\frac{\partial F}{\partial y}(0,z) = \sum_{n=0}^{\infty} h_{n,1}\, z^n. \tag{26}$$

Thus, for each n ≥ 0, we have

$$h_n = h_{n,1} = [z^n]\left(\frac{\partial F}{\partial y}(0,z)\right).$$

By representing each label of the form (n, m) by the symbolic expression $z^n y^m$ and assuming the complex variables y and z are sufficiently close to 0, the recursive generation in eq. (17) of the multisets of labels $L_0, L_1, L_2, \ldots$ determines an equation for F(y, z), demonstrated in Appendix 1:

$$F(y,z)\left[1 - \frac{z}{y(1-y)}\right] = g(y) - z\, \frac{\partial F}{\partial y}(0,z). \tag{27}$$

Eq. (27) holds if the complex variables y and z are in two sufficiently small neighborhoods of 0, and it characterizes the generating function F(y, z).

3.3.3 Generating function for (hn,1)n≥0

We now have an equation satisfied by the bivariate generating function F(y, z). Further, we have eq. (26), which demonstrates that the desired generating function for the sequence $(h_n)_{n \geq 0}$ is obtained from $\frac{\partial F}{\partial y}(0,z)$. By applying the kernel method [2], [12], we can determine the power series $\frac{\partial F}{\partial y}(0,z)$ from eq. (27).

The idea of the method consists of coupling the two variables (z, y) as (z, y(z)) in such a way that two conditions hold. First, (i) substituting y = y(z) cancels the kernel of the equation, that is, the factor 1 − z/[y(1 − y)] on the left-hand side of eq. (27). Second, (ii) for z near 0, the value of y(z) remains in a sufficiently small neighborhood of y = 0, so that eq. (27) still holds near z = 0 after substituting y = y(z). This condition is required, as the power series expansion in eq. (25) for F(y, z) has been assumed to be valid in a neighborhood of (y, z) = (0, 0), and the derivation of eq. (27) relies on the fact that y and z are sufficiently close to 0. If the two conditions hold, then

$$z\, \frac{\partial F}{\partial y}(0,z) = g(y(z)),$$

so that g(y(z)) must admit a power series expansion at z = 0, because $z\, \frac{\partial F}{\partial y}(0,z)$ does.

The required substitution couples y and z in such a way that 1 − z/[y(1 − y)] = 0, so that $y(z) = (1 \pm \sqrt{1-4z})/2$. To determine whether to take the negative root $y_1(z)$ or the positive root $y_2(z)$, we note that if z is near 0, then $y_1(z)$ approaches 0, so that $y_1(z)$ lies in a neighborhood of y = 0 and $g(y_1(z))$ admits a power series expansion for z near 0. For $y_2(z)$, however, if z is near 0, then $y_2(z)$ approaches 1, and thus, $g(y_2(z))$ is not a power series for z near 0 due to the pole of the function g(y) at y = 1 (Proposition 2). The only solution satisfying both (i) and (ii) is consequently

$$Y(z) = y_1(z) = \frac{1 - \sqrt{1-4z}}{2}, \tag{28}$$

which, with the generating function C(z) of the Catalan numbers as in eq. (3), satisfies Y(z) = zC(z). Substituting y = Y(z) in eq. (27), we have $\frac{\partial F}{\partial y}(0,z) = g(Y(z))/z$, yielding the following result.

Proposition 3

Fix tree t. Let g(y) be the generating function associated with the polynomial $h_{0,m}$ (eq. (18)). Let Y(z) be as in eq. (28). Then the generating function $f(z) = \sum_{n=0}^{\infty} h_n z^n$ is given by

$$f(z) = \frac{\partial F}{\partial y}(0,z) = \frac{g(Y(z))}{z} = \frac{1}{z}\, g\!\left(\frac{1 - \sqrt{1-4z}}{2}\right). \tag{29}$$

The proposition thus determines the generating function f(z) = g(Y(z))/z for the integer sequence describing the number of matching coalescent histories of species trees in the caterpillar-like family $(t^{(n)})_{n \geq 0}$. The function g depends on the seed tree t, whereas Y(z) is fixed in eq. (28) and does not depend on t.
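The two facts that drive the kernel method, namely that Y(z) = zC(z) and that this choice cancels the kernel (equivalently, Y(z)(1 − Y(z)) = z), can be checked on truncated power series. The sketch below assumes the standard Catalan formula $c_n = \binom{2n}{n}/(n+1)$; the truncation order 12 and the helper names are arbitrary choices made for illustration.

```python
from math import comb

def catalan(n):
    """n-th Catalan number c_n."""
    return comb(2 * n, n) // (n + 1)

def series_mul(a, b, order):
    """Product of two power series given as coefficient lists, truncated at z^order."""
    out = [0] * (order + 1)
    for i, ai in enumerate(a[: order + 1]):
        for j, bj in enumerate(b[: order + 1 - i]):
            out[i + j] += ai * bj
    return out

ORDER = 12
# Y(z) = z C(z): the coefficient of z^n is c_{n-1} for n >= 1, and 0 for n = 0.
Y = [0] + [catalan(n - 1) for n in range(1, ORDER + 1)]
Y_squared = series_mul(Y, Y, ORDER)

# Coefficients of Y(z) * (1 - Y(z)); the kernel 1 - z / [y(1 - y)] vanishes iff this equals z.
kernel_check = [Y[i] - Y_squared[i] for i in range(ORDER + 1)]
print(kernel_check)  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Because Y(z) has zero constant term, the composition g(Y(z)) is a well-defined power series, which is condition (ii) in the text.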

As an example, recall that for t = ((A, B), (C, D)), in eq. (24), we have computed the generating function g for the number $h_{0,m}$ of m-rooted histories of $t = t^{(0)}$. By Proposition 3, the generating function for the number $h_n$ of matching coalescent histories of $t^{(n)}$ is

$$f(z) = \sum_{n=0}^{\infty} h_n z^n = \frac{1}{z}\, g\!\left(\frac{1 - \sqrt{1-4z}}{2}\right) = \frac{4\,(1 - \sqrt{1-4z})\,(3 - z + \sqrt{1-4z})}{z\,(1 + \sqrt{1-4z})^3}.$$

Taking the Taylor expansion of f, we obtain

$$f(z) = 4 + 13z + 42z^2 + 138z^3 + 462z^4 + 1573z^5 + 5434z^6 + 19006z^7 + 67184z^8 + \cdots \tag{30}$$

The coefficients $h_n$ accord with the enumeration of matching coalescent histories reported in Corollary 3.9 of [13] and Table 3 of [14] for caterpillar-like families with seed tree t = ((A, B), (C, D)), except that those results tabulated numbers of coalescent histories by the number of taxa, whereas here, we use the index of the caterpillar-like family. Thus, in this example, the coefficient of $z^n$ gives the number of matching coalescent histories for a tree with n + 4 taxa, as |t| = 4. Shifting the index in the formula from [13], [14] to agree with our indexing scheme, we obtain $[(5(n+4) - 12)/(4(n+4) - 6)]\, c_{(n+4)-1} = [(5n+8)/(4n+10)]\, c_{n+3}$ for the number of matching coalescent histories of $t^{(n)}$. This formula gives precisely the coefficients in the Taylor expansion in eq. (30).
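The agreement between the Taylor coefficients and the shifted closed formula can be reproduced with integer power series. The following sketch composes $g(y) = \sum_m h_{0,m} y^m$ with Y(z) = zC(z) term by term, using the fact that $Y(z)^m$ starts at $z^m$, so only finitely many m contribute below a given order. Function names are hypothetical, and the Catalan numbers are assumed to be $c_n = \binom{2n}{n}/(n+1)$.

```python
from math import comb

def catalan(n):
    """n-th Catalan number c_n."""
    return comb(2 * n, n) // (n + 1)

def series_mul(a, b, order):
    """Product of two power series given as coefficient lists, truncated at z^order."""
    out = [0] * (order + 1)
    for i, ai in enumerate(a[: order + 1]):
        for j, bj in enumerate(b[: order + 1 - i]):
            out[i + j] += ai * bj
    return out

def histories_by_composition(h0, n_max):
    """Coefficients h_0, ..., h_{n_max} of f(z) = g(Y(z)) / z, with Y(z) = z C(z)."""
    order = n_max + 1
    Y = [0] + [catalan(k) for k in range(order)]   # [z^n] Y = c_{n-1}
    power = [1] + [0] * order                      # Y^0
    f_tilde = [0] * (order + 1)                    # g(Y(z)) = sum_m h_{0,m} Y^m
    for m in range(1, order + 1):                  # Y^m = O(z^m): larger m cannot contribute
        power = series_mul(power, Y, order)
        for i, coeff in enumerate(power):
            f_tilde[i] += h0(m) * coeff
    return f_tilde[1:]                             # dividing by z shifts the index by one

def closed_form(n):
    """[(5n + 8) / (4n + 10)] c_{n+3}; the division is exact because h_n is an integer."""
    return (5 * n + 8) * catalan(n + 3) // (4 * n + 10)

series = histories_by_composition(lambda m: m * m + 2 * m + 1, 8)
print(series)  # [4, 13, 42, 138, 462, 1573, 5434, 19006, 67184]
print(series == [closed_form(n) for n in range(9)])  # True
```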

3.4 Asymptotic behavior of hn

From Proposition 3, we have the generating function f that counts matching histories of $t^{(n)}$ for a given fixed seed tree t. Applying techniques of analytic combinatorics as introduced in Section 2.3, we can determine the asymptotic behavior of the coefficients of the generating function

$$\tilde{f}(z) = \sum_{n=1}^{\infty} h_{n-1}\, z^n = z f(z) = g(Y(z)), \tag{31}$$

with Y(z) as in eq. (28). To simplify notation, we work with $\tilde{f}$ instead of f.

First, in Section 3.4.1, we obtain an asymptotic equivalence between hn and βtcn, where βt is a constant depending on the seed tree t, and the cn are the Catalan numbers (eq. (1)). Next, in Section 3.4.2, we produce a general procedure to determine the constants βt, employing this procedure to obtain values of βt for all seed trees t with |t| ≤ 9. We demonstrate that our values of βt accord with constant multiples of the Catalan numbers previously obtained according to a different method [14] for seed trees with |t| ≤ 8.

3.4.1 A general asymptotic result

Recall that given t, Proposition 2 provides a procedure to determine the rational function g in eq. (31). Writing g as the finite linear combination in eq. (23), the values of b, J, and the $(a_j)_{1 \leq j \leq J}$ and $(q_j)_{1 \leq j \leq J}$ can all be computed.

As noted in Section 2.3, the expansion of $\tilde{f}$ at its dominant singularity characterizes the asymptotic behavior of the coefficients $h_{n-1}$. Appendix 2 obtains this expansion at the dominant singularity $z = \frac{1}{4}$,

$$\tilde{f}(z) = \alpha_t + \beta_t \left(-\frac{\sqrt{1-4z}}{2}\right) \pm O(1-4z) \tag{32}$$

$$\phantom{\tilde{f}(z)} \sim \alpha_t + \beta_t \left(-\frac{\sqrt{1-4z}}{2}\right), \tag{33}$$

with

$$\alpha_t = \sum_{j=1}^{J} 2^{\,b-a_j}\, q_j \tag{34}$$

$$\beta_t = \sum_{j=1}^{J} 2^{\,b+1-a_j}\, (a_j + b)\, q_j. \tag{35}$$

Note that in eq. (32), the seed tree affects only the constants $\alpha_t$ and $\beta_t$ computed in eqs. (34) and (35) from g, as written in the linear combination in eq. (23). Excluding the constant $\alpha_t$ that does not influence the asymptotic behavior of the coefficients, the main term of the expansion of $\tilde{f}(z)$ (eq. (33)) is the product of $\beta_t$ and the generating function $-\sqrt{1-4z}/2$, whose nth coefficient for n ≥ 1 is the Catalan number $c_{n-1}$ (eq. (4)).

Theorem VI.4 of [8] indicates that under conditions satisfied by $\tilde{f}$, the asymptotic coefficients of a generating function as n → ∞ are obtained from the expansion of the function at the dominant singularity; moreover, the error term in the asymptotic coefficients can be computed from the error term in the singular expansion. Applying the theorem to the expansion in eq. (32), we obtain the asymptotic behavior of the coefficients $[z^n]\tilde{f}(z) = h_{n-1}$.

Proposition 4

For any seed tree t, when n → ∞, the number hn of matching coalescent histories for t(n) satisfies

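$$h_{n-1} = [z^n]\tilde{f}(z) \sim \beta_t\, [z^n]\!\left(-\frac{\sqrt{1-4z}}{2}\right) \pm O\!\left(\frac{4^n}{n^2}\right) = \beta_t\, c_{n-1} \pm O\!\left(\frac{4^n}{n^2}\right), \tag{36}$$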

where βt is a constant that depends on t. The constant βt is computed in eq. (35) once the function g, defined in eq. (18), is written as the linear combination in eq. (23).

We immediately obtain the following corollary, corresponding to our initial claim in eq. (6).

Corollary 1

For any seed tree t, there exists a constant βt > 0 (eq. (35)) such that when n → ∞,

$$h_n \sim \beta_t\, c_n. \tag{37}$$

Proof

The result follows from Proposition 4 by noting that if $\beta_t > 0$, then

$$\lim_{n \to \infty} \frac{h_{n-1}}{\beta_t\, c_{n-1}} = 1 \pm \lim_{n \to \infty} \frac{O(4^n/n^2)}{\beta_t\, c_{n-1}} = 1.$$

Note that we are claiming $\beta_t > 0$. From the definition of $\beta_t$ in eq. (35), because the $q_j$ are permitted to be negative, it is not immediately clear that $\beta_t > 0$. Proposition 4 eliminates the possibility that $\beta_t$ is negative, as $h_{n-1}$ is necessarily positive. To show that $\beta_t \neq 0$, note that by eq. (36), $\beta_t = 0$ would give

$$h_{n-1} = O\!\left(\frac{4^n}{n^2}\right), \tag{38}$$

so that $h_{n-1}/(4^n/n^2)$ would remain bounded by a constant as n → ∞.

We now apply the lower bound $h_n \geq c_{n+1}$ from eq. (7). By eq. (7), we have

$$\frac{h_{n-1}}{4^n/n^2} \geq \frac{c_n}{4^n/n^2} = \sqrt{\frac{n}{\pi}}\; \frac{c_n}{4^n/(n^{3/2}\sqrt{\pi})}.$$

As n → ∞, $\sqrt{n/\pi}$ diverges to ∞, while $c_n/[4^n/(n^{3/2}\sqrt{\pi})]$ converges to 1 by eq. (5). Therefore, the sequence $h_{n-1}/(4^n/n^2)$ must diverge and eq. (38) cannot hold. Thus, $\beta_t \neq 0$.

As an example of Corollary 1, consider t = ((A, B), (C, D)). By decomposing the function g of eq. (24) as in eq. (23), we have already obtained the parameters b, J, $(a_j)_{1 \leq j \leq J}$, and $(q_j)_{1 \leq j \leq J}$ in Section 3.3.1. Therefore, computing $\beta_t$ as in eq. (35), we obtain

$$\beta_t = 2^{1+3-1}(1+3)(4) + 2^{1+3-2}(2+3)(-3) + 2^{1+3-3}(3+3)(1) = 80.$$

Eq. (37) then produces $h_n \sim 80\, c_n$. Note that the limit $h_n \sim \frac{5}{4} c_{n+3}$ produced for this tree from $h_n = [(5n+8)/(4n+10)]\, c_{n+3}$ in Section 3.3.3 agrees with the limiting result $h_n \sim 80\, c_n$. Recalling eq. (2),

$$\frac{h_n}{c_n} = \frac{5n+8}{4n+10}\, \frac{c_{n+3}}{c_n} \sim \frac{5}{4}\, \frac{\binom{2n+6}{n+3}/(n+4)}{\binom{2n}{n}/(n+1)} \sim \frac{5}{4}\, 4^3 = 80.$$
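Eqs. (34) and (35) are simple enough to evaluate exactly with rational arithmetic. The sketch below (function names are hypothetical) recomputes $\alpha_t$ and $\beta_t$ for t = ((A, B), (C, D)) from the parameters of eq. (24), and then prints the ratio $h_n/c_n$ for a few values of n using the closed formula quoted in Section 3.3.3, assuming $c_n = \binom{2n}{n}/(n+1)$. The ratio approaches 80 only slowly, so no particular printed digits are claimed here.

```python
from fractions import Fraction
from math import comb

def alpha_beta(b, a, q):
    """Evaluate eqs. (34) and (35) for g(y) = sum_j q[j] * y^a[j] / (1 - y)^b."""
    alpha = sum(Fraction(2) ** (b - aj) * qj for aj, qj in zip(a, q))
    beta = sum(Fraction(2) ** (b + 1 - aj) * (aj + b) * qj for aj, qj in zip(a, q))
    return alpha, beta

def catalan(n):
    return comb(2 * n, n) // (n + 1)

def h(n):
    """Closed form [(5n + 8) / (4n + 10)] c_{n+3} for the seed tree ((A, B), (C, D))."""
    return (5 * n + 8) * catalan(n + 3) // (4 * n + 10)

def ratio(n):
    return float(Fraction(h(n), catalan(n)))

alpha_t, beta_t = alpha_beta(3, (1, 2, 3), (4, -3, 1))
print(alpha_t, beta_t)  # 11 80

for n in (10, 100, 1000):
    print(n, ratio(n))  # increases toward beta_t = 80
```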

3.4.2 Determining βt from the seed tree t

We have shown in Corollary 1 that the number of matching coalescent histories hn for the caterpillar-like family t(n) is, for a constant βt, asymptotic to βtcn. We can now assemble our results to describe a procedure that given a seed tree t with |t| ≥ 2 determines both the generating function with coefficients hn and the constant βt.

  • (i) Determine by eq. (9) the polynomial $e_{t,k}$ in k ≥ 0 that counts k-extended histories of t.

  • (ii) Compute from eq. (8) the polynomial in m that counts for m ≥ 1 the number of m-rooted histories of t.

  • (iii) Obtain the generating function $g(y) = \sum_{m=1}^{\infty} h_{0,m}\, y^m$ with coefficients $h_{0,m}$ by using Proposition 2.

  • (iv) Determine the generating function $f(z) = \sum_{n=0}^{\infty} h_n z^n$ with coefficients $h_n$ by applying Proposition 3.

  • (v) Write g(y) as a linear combination according to eq. (23), determining the values of b, J, and the $a_j$ and $q_j$.

  • (vi) Compute the asymptotic constant $\beta_t$ from eq. (35).

We have programmed this procedure in Mathematica; starting from a given seed tree t, our program CatFamily.nb can automatically compute for the caterpillar-like family t(n) the generating function with coefficients hn and the asymptotic constant βt. Using this program, we have evaluated βt for each seed tree with 9 taxa, collecting the results in Table 1.

TABLE 1.

Asymptotic constants $\beta_t$ with $h_n \sim \beta_t c_n$, for seed trees t with 9 taxa.

| Seed tree t (image) | $\beta_t$ | $\beta_t^*$ | Seed tree t (image) | $\beta_t$ | $\beta_t^*$ |
|---|---|---|---|---|---|
| nihms-822293-t0008.jpg | 65,536 | 1 | nihms-822293-t0009.jpg | 128,864 | 4,027/2,048 |
| nihms-822293-t0010.jpg | 81,920 | 5/4 | nihms-822293-t0011.jpg | 166,624 | 5,207/2,048 |
| nihms-822293-t0012.jpg | 94,208 | 23/16 | nihms-822293-t0013.jpg | 197,296 | 12,331/4,096 |
| nihms-822293-t0014.jpg | 104,448 | 51/32 | nihms-822293-t0015.jpg | 224,704 | 3,511/1,024 |
| nihms-822293-t0016.jpg | 138,240 | 135/64 | nihms-822293-t0017.jpg | 308,576 | 9,643/2,048 |
| nihms-822293-t0018.jpg | 118,784 | 29/16 | nihms-822293-t0019.jpg | 262,000 | 16,375/4,096 |
| nihms-822293-t0020.jpg | 113,408 | 443/256 | nihms-822293-t0021.jpg | 250,272 | 7,821/2,048 |
| nihms-822293-t0022.jpg | 148,480 | 145/64 | nihms-822293-t0023.jpg | 339,504 | 21,219/4,096 |
| nihms-822293-t0024.jpg | 177,664 | 347/128 | nihms-822293-t0025.jpg | 417,632 | 13,051/2,048 |
| nihms-822293-t0026.jpg | 141,312 | 69/32 | nihms-822293-t0027.jpg | 326,240 | 10,195/2,048 |
| nihms-822293-t0028.jpg | 193,536 | 189/64 | nihms-822293-t0029.jpg | 464,128 | 1,813/256 |
| nihms-822293-t0030.jpg | 121,472 | 949/512 | nihms-822293-t0031.jpg | 182,912 | 1,429/512 |
| nihms-822293-t0032.jpg | 157,888 | 2,467/1,024 | nihms-822293-t0033.jpg | 243,904 | 3,811/1,024 |
| nihms-822293-t0034.jpg | 187,776 | 1,467/512 | nihms-822293-t0035.jpg | 296,064 | 2,313/512 |
| nihms-822293-t0036.jpg | 214,720 | 3,355/1,024 | nihms-822293-t0037.jpg | 344,512 | 5,383/1,024 |
| nihms-822293-t0038.jpg | 296,192 | 1,157/256 | nihms-822293-t0039.jpg | 487,808 | 3,811/512 |
| nihms-822293-t0040.jpg | 251,136 | 981/256 | nihms-822293-t0041.jpg | 410,112 | 801/128 |
| nihms-822293-t0042.jpg | 162,560 | 635/256 | nihms-822293-t0043.jpg | 214,016 | 209/64 |
| nihms-822293-t0044.jpg | 219,136 | 107/32 | nihms-822293-t0045.jpg | 306,112 | 4,783/1,024 |
| nihms-822293-t0046.jpg | 268,288 | 131/32 | nihms-822293-t0047.jpg | 294,784 | 2,303/512 |
| nihms-822293-t0048.jpg | 177,664 | 347/128 | nihms-822293-t0049.jpg | 425,216 | 1,661/256 |
| nihms-822293-t0050.jpg | 249,344 | 487/128 | nihms-822293-t0051.jpg | 366,720 | 2,865/512 |
| nihms-822293-t0052.jpg | 353,536 | 1,381/256 | nihms-822293-t0053.jpg | 532,224 | 2,079/256 |

Values of $\beta_t$ appear for each of the 46 unlabeled species trees with 9 taxa. For each species tree t, we also provide the constant $\beta_t^* = \beta_t/4^8$ (eq. (40)). Trees are listed in increasing order by rank as defined in Section 2 of [14]. In the left column, each seed tree t itself belongs to a caterpillar-like family whose seed tree has fewer than 9 taxa. In these cases, we recover the values of $\beta_t^*$ as determined in Table 3 of [14].

Recall that Rosenberg [14] reported the asymptotic constant multiples of the Catalan numbers, $\beta_t^*$, which represent asymptotic numbers of coalescent histories for seed trees with up to 8 taxa, indexing the results by the number of taxa m rather than by the index n of the caterpillar-like family. Also recall that for seed tree t, tree $t^{(n)}$ has m = |t| + n taxa (Fig. 1). In the notation of [14], writing $A_{t_m,1}$ as the number of matching coalescent histories in the caterpillar-like tree with seed tree t and m ≥ |t| taxa, we have $h_n = A_{t_m,1}$.

By eq. (5), we have the asymptotic equivalence $c_n \sim c_{n+k}/4^k$ for each positive integer k. Therefore,

$$A_{t_m,1} = h_n \sim \beta_t\, c_n \sim \frac{\beta_t}{4^{|t|-1}}\, c_{n+|t|-1} = \beta_t^*\, c_{m-1}, \tag{39}$$

where the asymptotic constant $\beta_t$ of Corollary 1 is normalized to obtain

$$\beta_t^* = \frac{\beta_t}{4^{|t|-1}}. \tag{40}$$

This computation converts the asymptotic constant multiple $\beta_t$ of $c_n$ into a corresponding multiple $\beta_t^*$ of $c_{m-1}$, as reported in [14] for small trees. Comparing Table 1 with Table 3 of [14], we see that for the cases examined by [14], the values of $\beta_t^*$ we compute from the associated $\beta_t$ agree with the values that were previously reported. This agreement is unsurprising; our method for calculating the constants $\beta_t$ and $\beta_t^*$ is simply a computational implementation based on our theorems, and the agreement confirms the validity of the implementation. Although [14] considered only |t| ≤ 8, our method applies for arbitrary |t|.
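The normalization of eq. (40) is a one-line computation. The sketch below (the function name is hypothetical) applies it to the worked example ((A, B), (C, D)) and checks it against a few pairs transcribed from Table 1, where |t| = 9.

```python
from fractions import Fraction

def normalized_constant(beta_t, seed_size):
    """Normalized constant beta_t / 4^(|t| - 1) of eq. (40), as an exact fraction."""
    return Fraction(beta_t, 4 ** (seed_size - 1))

# Seed tree ((A, B), (C, D)): beta_t = 80 and |t| = 4.
print(normalized_constant(80, 4))  # 5/4

# Pairs (beta_t, normalized constant) transcribed from Table 1, all with |t| = 9.
table_rows = [
    (65536, Fraction(1)),
    (81920, Fraction(5, 4)),
    (128864, Fraction(4027, 2048)),
    (532224, Fraction(2079, 256)),
]
for beta_t, expected in table_rows:
    assert normalized_constant(beta_t, 9) == expected
```

The value 5/4 obtained for the 4-taxon seed tree reappears in Table 1 for a 9-taxon tree in the left column, as expected for a tree that belongs to a caterpillar-like family with a smaller seed tree.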

The cost of evaluating $\beta_t$ grows quadratically in |t|. The recursive step (i) requires at most |t| − 1 recursive calls, one for each internal node of t. Step (ii) is a polynomial subtraction at most linear in |t|, producing the polynomial $h_{0,m}$ with order at most equal to the order of $e_{t,m}$ minus 1, that is, at most |t| − 2. Step (iii) determines the generating function g(y) (eq. (18)) from $h_{0,m}$ and the generating functions $g_{m^k}(y)$ (eq. (20)). For each k with 0 ≤ k ≤ |t| − 2, $g_{m^k}(y)$ is computed in k recursive calls of eq. (22). As the order of $h_{0,m}$ is at most |t| − 2, the total cost for calculating g(y) is thus quadratic in |t|. Steps (iv), (v), and (vi) do not involve recursion and are at most linear in |t|. Thus, because step (iii) is the most expensive step, we see that the cost of the procedure that determines the asymptotic constant $\beta_t$ increases as $O(|t|^2)$.

4 Conclusions

In this paper, we have solved a problem left open by [14] on determining the number of coalescent histories for gene trees and species trees that have a matching labeled topology and that belong to a generic caterpillar-like family. We have proven that for any seed tree t, the integer sequence (hn)n≥0, whose nth element represents the number of matching coalescent histories of the caterpillar-like tree t(n), grows asymptotically as a constant multiple of the Catalan numbers, that is, hn ~ βtcn, where the constant βt > 0 depends on the shape of the seed tree t. Rosenberg [14] had previously obtained this result for seed trees with at most 8 taxa; here, by using a succession rule for recursive enumeration and then applying techniques of analytic combinatorics, we have not only proven the existence of the constant βt for seed trees of any size, we have also produced a procedure that computes βt as well as the expression for the generating function of the integers (hn)n≥0.

The numerical results on the constants $\beta_t$ extend the empirical observation of [14] that the caterpillar-like families that produce the largest numbers of matching coalescent histories are those whose seed tree has a high level of balance. By extending from seed trees with |t| ≤ 8 taxa to those with |t| = 9, we observe that the constants $\beta_t$ for the caterpillar-like families with the largest and smallest numbers of matching coalescent histories become further separated, so that for n large, many more coalescent histories exist by which a gene tree can match the species tree for some species trees than for others. For the 9-taxon seed tree with the largest $\beta_t$, $\beta_t^* \approx 8.12$, compared to $\beta_t^* = 1$ for the seed tree with the smallest $\beta_t$. Our procedure for evaluating $\beta_t$ and $\beta_t^*$ as a function of the seed tree can now enable further systematic analyses of the correlates of the constants $\beta_t$ and $\beta_t^*$, to facilitate additional explorations of determinants of the numbers of matching coalescent histories.

Nevertheless, although the constants $\beta_t$ and $\beta_t^*$ do depend on the seed tree, we have shown that otherwise, all caterpillar-like families are asymptotically equivalent in their numbers of matching coalescent histories. Computation time is often a challenge in phylogenetic problems, as the discrete structures of phylogenetics can grow rapidly in number with the number of taxa. Our results contribute to the study of computational complexity in phylogenetics, as the complexity of the evaluation of probabilities important in characterizing gene tree distributions [5] is proportional to the number of coalescent histories. That all caterpillar-like families have the same growth pattern up to a constant suggests that as the number of taxa increases, such evaluations will be comparably complex for all caterpillar-like trees. In large trees, the caterpillar branches contribute to the asymptotic growth of the number of matching coalescent histories, which follows a multiple of the Catalan numbers, and the seed tree contributes only to the constant by which the Catalan numbers are multiplied.

The extent to which other tree families follow the Catalan sequence in their numbers of matching coalescent histories remains unknown, though we have recently found a family, the lodgepole family—defined iteratively by setting λ0 to a tree with one taxon and sequentially forming λn+1 by appending λn and a cherry to a shared root—for which the number of matching coalescent histories grows faster than with a constant multiple of the Catalan numbers [6]. Further analysis of this heterogeneous behavior of the increase in the number of coalescent histories will be useful in performing comparisons of coalescent history algorithms with algorithms that obtain similar phylogenetic probabilities but that do not rely on coalescent histories [20]. The use of our substantially different approach employing analytic combinatorics opens new methods for theoretical analysis of coalescent histories and can potentially assist in understanding when Catalan-like growth, the rapid growth of the lodgepole family, and intermediate or perhaps still faster growth patterns will apply.

We note, however, that our strategy for evaluating the asymptotic properties of the number of coalescent histories in caterpillar-like families has, like the work of [14], relied on the fact that the difficulty of the general problem of enumerating coalescent histories is partly evaded by restricting attention to caterpillar-like trees. In the recursion for the number of coalescent histories given a matching gene tree and species tree [13, eq. 1], a term arising from the subtree with fewer branches collapses to 1 for the caterpillar case, greatly simplifying the recursion. This reduction enabled the work of [14] for caterpillar-like families, and it also enables our approach of iteratively adding single-taxon branches to define the operator Ω and the generating function with coefficients $h_{n,m}$. Thus, in enumerating coalescent histories for matching lodgepole gene trees and species trees, we proceeded by a different method, establishing a bijection between coalescent histories and established combinatorial structures [6]. We do expect, however, that a generating function approach will be fruitful in other scenarios, perhaps including cases with gene trees and species trees that are caterpillar-like, but non-matching.

Acknowledgments

We acknowledge grant support from the National Science Foundation (DBI-1146722) and the National Institutes of Health (R01 GM117590). A Mathematica notebook CatFamily.nb implementing the procedure in Section 3.4.2 for obtaining from a seed tree t the generating function f(z), the coefficients hn, and the constant βt is available from the authors.

Biographies

Filippo Disanto received the PhD degree in theoretical computer science from both the University of Siena and the University of Paris VII in 2010. After receiving the PhD degree, he was a postdoc at CNRS in Montpellier and at the Institut für Genetik, University of Cologne. Since November 2013, he has been a postdoc in the Rosenberg Laboratory, Stanford University. His main research interests include combinatorics and its applications.

Noah A. Rosenberg received the PhD degree in biological sciences from Stanford University in 2001 and completed postdoctoral training at the University of Southern California. He was on the faculty of the University of Michigan from 2005 to 2011, and he is currently a professor in the Department of Biology at Stanford University. His research interests include human evolutionary genetics, population-genetic theory, and mathematical phylogenetics.

Appendix 1. The equation for F(y, z)

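In this appendix, we complete the derivation of eq. (27) satisfied by F(y, z). In the generating function F(y, z) (eq. (25)), each monomial $z^n y^m$ corresponds to a label $(n, m) \in L_n$ that in turn represents an m-rooted history of $t^{(n)}$. Recall that the multisets of labels $L_0, L_1, L_2, \ldots$ (eq. (14)) can be iteratively generated according to eq. (17) through the operator Ω defined in eq. (13), starting from the multiset $L_0$. Also recall that by considering the multiset of labels $L = \biguplus_{n=0}^{\infty} L_n$, we can write $F(y,z) = \sum_{(n,m) \in L} z^n y^m$. We use the iterative generation of the family of multisets $(L_n)_{n \geq 0}$ to obtain an equation for F.

By eq. (13), for n ≥ 0 and m ≥ 2, for each occurrence in $L_n$ of a label (n, m), a copy of each label in the set

$$\Omega((n,m)) = \{(n+1, m+j) : j \geq -1\}$$

belongs to the multiset $L_{n+1}$. Thus, in algebraic terms, each time that an expression $z^n y^m$ with n ≥ 0 and m ≥ 2 is counted in the generating function F (written $z^n y^m \in F$ in what follows), the terms $z^{n+1} \sum_{j=m-1}^{\infty} y^j$ appear in F as well. Summing over all $z^n y^m \in F$ with n ≥ 0 and m ≥ 2, we obtain

$$\sum_{z^n y^m \in F:\, n \geq 0,\, m \geq 2} \left( z^{n+1} \sum_{j=m-1}^{\infty} y^j \right) = \frac{z}{y} \sum_{z^n y^m \in F:\, n \geq 0,\, m \geq 2} \left( z^n y^m \sum_{j=0}^{\infty} y^j \right). \tag{41}$$

Similarly, for n ≥ 0 and m = 1, for each occurrence in $L_n$ of a label (n, 1), a copy of each label in the set Ω((n, 1)) = {(n + 1, j) : j ≥ 1} appears in the multiset $L_{n+1}$. Thus, for each term $z^n y \in F$, with n ≥ 0, the terms $z^{n+1} \sum_{j=1}^{\infty} y^j$ are counted in F as well. Summing these terms for all $z^n y \in F$ with n ≥ 0,

$$\sum_{z^n y \in F:\, n \geq 0} \left( z^{n+1} \sum_{j=1}^{\infty} y^j \right) = zy \sum_{z^n y \in F:\, n \geq 0} \left( z^n \sum_{j=0}^{\infty} y^j \right). \tag{42}$$

Notice that the sum of the expressions in eqs. (41) and (42) is the algebraic representation of the multiset of labels $L \setminus L_0$. More precisely, each term $z^n y^m \in F$ associated with a label $(n, m) \in L_n$, with n ≥ 1, is counted exactly once in the sum of eqs. (41) and (42). Therefore, to complete the description of F, we require only those terms $z^0 y^m$ associated with labels $(0, m) \in L_0$. These terms are represented by

$$\sum_{(0,m) \in L_0} z^0 y^m = \sum_{m=1}^{\infty} h_{0,m}\, y^m = g(y), \tag{43}$$

considering that $h_{0,m} = |\{\ell \in L_0 : \ell = (0,m)\}|$ (eq. (15)) and that by definition, $g(y) = \sum_{m=1}^{\infty} h_{0,m}\, y^m$ (eq. (18)).

We can now equate the full generating function F(y, z) to the sum of eqs. (43), (41), and (42), obtaining

$$F(y,z) = g(y) + \frac{z}{y} \sum_{z^n y^m \in F:\, n \geq 0,\, m \geq 2} \left( z^n y^m \sum_{j=0}^{\infty} y^j \right) + zy \sum_{z^n y \in F:\, n \geq 0} \left( z^n \sum_{j=0}^{\infty} y^j \right).$$

Applying the fact that $\sum_{j=0}^{\infty} y^j = 1/(1-y)$ for y near 0 in the complex plane, we then have

$$F(y,z) = g(y) + \frac{z}{y(1-y)} \left( \sum_{z^n y^m \in F:\, n \geq 0,\, m \geq 2} z^n y^m \right) + \frac{zy}{1-y} \left( \sum_{z^n y \in F:\, n \geq 0} z^n \right). \tag{44}$$

By eq. (25) and the fact that the multisets $L_n$ of labels (n, m) for m-rooted histories of $t^{(n)}$ have $h_{n,m}$ elements,

$$\sum_{z^n y \in F:\, n \geq 0} z^n = \frac{\partial F}{\partial y}(0,z),$$

$$\sum_{z^n y^m \in F:\, n \geq 0,\, m \geq 2} z^n y^m = \left( \sum_{z^n y^m \in F:\, n \geq 0,\, m \geq 1} z^n y^m \right) - \left( \sum_{z^n y \in F:\, n \geq 0} z^n y \right) = F(y,z) - y\, \frac{\partial F}{\partial y}(0,z).$$

Substituting in eq. (44), the last two expressions yield

$$F(y,z) = g(y) + \frac{z}{y(1-y)} \left( F(y,z) - y\, \frac{\partial F}{\partial y}(0,z) \right) + \frac{zy}{1-y}\, \frac{\partial F}{\partial y}(0,z), \tag{45}$$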

which can be rewritten as in eq. (27).
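For completeness, the rearrangement from eq. (45) to eq. (27) can be spelled out as follows. This is only a sketch of the intermediate algebra, using no quantities beyond those defined above: the term in F(y, z) is moved to the left-hand side, and the two terms in $\frac{\partial F}{\partial y}(0,z)$ are combined.

```latex
\begin{align*}
F(y,z) - \frac{z}{y(1-y)}\,F(y,z)
  &= g(y) - \frac{z}{1-y}\,\frac{\partial F}{\partial y}(0,z)
          + \frac{zy}{1-y}\,\frac{\partial F}{\partial y}(0,z) \\
  &= g(y) - \frac{z(1-y)}{1-y}\,\frac{\partial F}{\partial y}(0,z), \\
F(y,z)\left[1 - \frac{z}{y(1-y)}\right]
  &= g(y) - z\,\frac{\partial F}{\partial y}(0,z).
\end{align*}
```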

Appendix 2. The dominant singularity and singular expansion of $\tilde{f}(z)$

This appendix obtains the singular expansion of $\tilde{f}(z)$ described in eq. (32). In eq. (31), we have defined $\tilde{f}(z)$ as a composition $\tilde{f}(z) = g(Y(z))$, with the internal function Y(z) as in eq. (28) and the external function g(y) as in eq. (23). Owing to the presence of the square root in the expression for Y(z), the dominant singularity of the internal function Y(z), that is, the singularity nearest the origin of the complex plane, is at $z = \frac{1}{4}$. Computing the value of Y(z) at its dominant singularity, we obtain $Y(\frac{1}{4}) = \frac{1}{2}$. In particular, we have $Y(\frac{1}{4}) < 1$, where 1 is the radius of convergence of the power series corresponding to the external function g in $\tilde{f}$. Indeed, it immediately follows from Proposition 2 that y = 1 is the dominant singularity of g(y).

As detailed in Section VI.9 of [8], on dominant singularities of compositions, we are in the setting of the subcritical case, in which the inequality $Y(\frac{1}{4}) < 1$ implies that the dominant singularity of g(Y(z)) coincides with the dominant singularity $z = \frac{1}{4}$ of the internal function Y(z) rather than the dominant singularity y = 1 of the external function g(y). The desired singular expansion of $\tilde{f}(z) = g(Y(z))$ at the dominant singularity $z = \frac{1}{4}$ can be obtained by inserting y = Y(z) in the regular (non-singular) expansion of g(y) at $y = Y(\frac{1}{4}) = \frac{1}{2}$.

To recover the expansion of g(y) at $y = \frac{1}{2}$, we expand and then sum each term $q_j\,[y^{a_j}/(1-y)^b]$ of the finite linear combination in eq. (23). At $y = \frac{1}{2}$, each of these terms is an analytic function, and we can thus use Taylor's formula to produce the desired expansion. We obtain at $y = \frac{1}{2}$

$$q_j\, \frac{y^{a_j}}{(1-y)^{b}} = 2^{\,b-a_j}\, q_j + 2^{\,b+1-a_j}\,(a_j + b)\, q_j \left(y - \frac{1}{2}\right) \pm O\!\left(\left(y - \frac{1}{2}\right)^{2}\right).$$

By summing over the indices 1 ≤ j ≤ J of eq. (23), the expansion of g(y) at $y = \frac{1}{2}$ is

$$g(y) = \alpha_t + \beta_t \left(y - \frac{1}{2}\right) \pm O\!\left(\left(y - \frac{1}{2}\right)^{2}\right), \tag{46}$$

with the constants $\alpha_t$ and $\beta_t$ defined as in eqs. (34) and (35). Plugging y = Y(z) from eq. (28) into eq. (46), we finally obtain the singular expansion of $\tilde{f}(z)$ at $z = \frac{1}{4}$ as in eq. (32).
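The Taylor coefficients used above, and the final substitution, follow from a short computation. In the sketch below, $\phi_j$ is a name introduced here only for illustration (it does not appear in the original text) and denotes the jth term of eq. (23) without its coefficient $q_j$.

```latex
\begin{align*}
\phi_j(y) &= \frac{y^{a_j}}{(1-y)^{b}}, &
\phi_j\!\left(\tfrac12\right) &= 2^{-a_j}\,2^{b} = 2^{\,b-a_j}, \\
\phi_j'(y) &= \frac{a_j\,y^{a_j-1}}{(1-y)^{b}} + \frac{b\,y^{a_j}}{(1-y)^{b+1}}, &
\phi_j'\!\left(\tfrac12\right) &= a_j\,2^{\,b+1-a_j} + b\,2^{\,b+1-a_j}
                               = 2^{\,b+1-a_j}\,(a_j+b), \\
Y(z) - \tfrac12 &= -\frac{\sqrt{1-4z}}{2}, &
\left(Y(z) - \tfrac12\right)^{2} &= \frac{1-4z}{4}.
\end{align*}
```

The last line shows why the linear term of eq. (46) becomes the square-root term of eq. (32), while the quadratic error term becomes O(1 − 4z).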

Contributor Information

Filippo Disanto, Department of Biology, Stanford University, Stanford, CA, USA. fdisanto@stanford.edu.

Noah A. Rosenberg, Department of Biology, Stanford University, Stanford, CA, USA. noahr@stanford.edu.

REFERENCES

  • 1. Allman ES, Degnan JH, Rhodes JA. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J. Math. Biol. 2011;62:833–862. doi: 10.1007/s00285-010-0355-7.
  • 2. Banderier C, Bousquet-Mélou M, Denise A, Flajolet P, Gardy D, Gouyou-Beauchamps D. Generating functions for generating trees. Discr. Math. 2002;246:29–55.
  • 3. Barcucci E, Del Lungo A, Pergola E, Pinzani R. ECO: a methodology for the enumeration of combinatorial objects. J. Differ. Equ. Appl. 1999;5:435–490.
  • 4. Degnan JH. PhD thesis. University of New Mexico; Albuquerque: 2005. Gene tree distributions under the coalescent process.
  • 5. Degnan JH, Salter LA. Gene tree distributions under the coalescent process. Evolution. 2005;59:24–37.
  • 6. Disanto F, Rosenberg NA. Coalescent histories for lodgepole species trees. J. Comp. Biol. 2015;22. doi: 10.1089/cmb.2015.0015.
  • 7. Dutheil JY, Ganapathy G, Hobolth A, Mailund T, Uyenoyama MK, Schierup MH. Ancestral population genomics: the coalescent hidden Markov model approach. Genetics. 2009;183:259–274. doi: 10.1534/genetics.109.103010.
  • 8. Flajolet P, Sedgewick R. Analytic Combinatorics. Cambridge University Press; Cambridge: 2009.
  • 9. Graham RL, Knuth DE, Patashnik O. Concrete Mathematics. 2nd ed. Addison-Wesley; Boston: 2008.
  • 10. Hobolth A, Christensen OF, Mailund T, Schierup MH. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet. 2007;3:294–304. doi: 10.1371/journal.pgen.0030007.
  • 11. Hobolth A, Dutheil JY, Hawks J, Schierup MH, Mailund T. Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection. Genome Res. 2011;21:349–356. doi: 10.1101/gr.114751.110.
  • 12. Prodinger H. The kernel method: a collection of examples. Sém. Lothar. Combin. 2004;50:B50f.
  • 13. Rosenberg NA. Counting coalescent histories. J. Comp. Biol. 2007;14:360–377. doi: 10.1089/cmb.2006.0109.
  • 14. Rosenberg NA. Coalescent histories for caterpillar-like families. IEEE/ACM Trans. Comp. Biol. Bioinf. 2013;10:1253–1262. doi: 10.1109/tcbb.2013.123.
  • 15. Rosenberg NA, Degnan JH. Coalescent histories for discordant gene trees and species trees. Theor. Pop. Biol. 2010;77:145–151. doi: 10.1016/j.tpb.2009.12.004.
  • 16. Rosenberg NA, Tao R. Discordance of species trees with their most likely gene trees: the case of five taxa. Syst. Biol. 2008;57:131–140. doi: 10.1080/10635150801905535.
  • 17. Stanley RP. Enumerative Combinatorics Volume 2. Cambridge University Press; New York: 1999.
  • 18. Than CV, Rosenberg NA. Consistency properties of species tree inference by minimizing deep coalescences. J. Comp. Biol. 2011;18:1–15. doi: 10.1089/cmb.2010.0102.
  • 19. Than C, Ruths D, Innan H, Nakhleh L. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. J. Comp. Biol. 2007;14:517–535. doi: 10.1089/cmb.2007.A010.
  • 20. Wu Y. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution. 2012;66:763–775. doi: 10.1111/j.1558-5646.2011.01476.x.
  • 21. Yu Y, Than C, Degnan JH, Nakhleh L. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst. Biol. 2011;60:138–149. doi: 10.1093/sysbio/syq084.
