Author manuscript; available in PMC: 2022 Oct 1.
Published in final edited form as: Adv Appl Math. 2021 Aug 23;131:102265. doi: 10.1016/j.aam.2021.102265

Enumeration of coalescent histories for caterpillar species trees and p-pseudocaterpillar gene trees

Egor Alimpiev 1, Noah A Rosenberg 1
PMCID: PMC8415704  NIHMSID: NIHMS1731199  PMID: 34483422

Abstract

For a fixed set X containing n taxon labels, an ordered pair consisting of a gene tree topology G and a species tree topology S bijectively labeled with the labels of X possesses a set of coalescent histories—mappings from the set of internal nodes of G to the set of edges of S describing possible lists of edges in S on which the coalescences in G take place. Enumerations of coalescent histories for gene trees and species trees have produced suggestive results regarding the pairs (G, S) that, for a fixed n, have the largest number of coalescent histories. We define a class of 2-cherry binary tree topologies that we term p-pseudocaterpillars, examining coalescent histories for non-matching pairs (G, S) in the case in which S has a caterpillar shape and G has a p-pseudocaterpillar shape. Using a construction that associates coalescent histories for (G, S) with a class of “roadblocked” monotonic paths, we identify the p-pseudocaterpillar labeled gene tree topology that, for a fixed caterpillar labeled species tree topology, gives rise to the largest number of coalescent histories. The shape that maximizes the number of coalescent histories places the “second” cherry of the p-pseudocaterpillar equidistantly from the root of the “first” cherry and from the tree root. A symmetry in the numbers of coalescent histories for p-pseudocaterpillar gene trees and caterpillar species trees is seen to exist around the maximizing value of the parameter p. The results provide insight into the factors that influence the number of coalescent histories possible for a given gene tree and species tree.

Keywords: Catalan numbers, coalescent histories, Dyck paths, monotonic paths, phylogenetics

Mathematics subject classification (2010): 05A15, 05A16, 05A19, 05C05, 92D10

1. Introduction

In mathematical phylogenetics, a coalescent history represents the paired list of coalescences in a gene tree together with their associated edges of a species tree. Consider two binary, rooted, leaf-labeled trees, G and S, with leaves labeled by the same label set X, such that each label in X is associated with exactly one leaf of G and exactly one leaf of S. We regard G as a gene tree representing the evolution of genealogical lineages in a group of species, and S as the species tree representing the evolutionary descent of the species themselves.

For a gene tree G evolving on a species tree S, a coalescent history is a mapping from the set of internal nodes of G to the set of internal edges of S, such that two rules are followed: (i) the image of an internal node v of G is ancestral in S to each leaf of S that shares a label with some leaf descended from v in G; (ii) the image of an internal node v of G is ancestral in S to the images of each of its descendant nodes. The biological interpretation of (i) is that a set of gene lineages can only find a common ancestor on a species tree edge that it is possible for them all to reach; the interpretation of (ii) is that the gene lineages descended from a descendant node coalesce at least as recently as do the gene lineages descended from its ancestral nodes. Note that we regard a node as trivially ancestral to and descended from itself. The coalescent histories for (G, S) can be viewed as describing a discrete class of evolutionary scenarios for the lineages of G on the edges of S.

A variety of studies have enumerated the coalescent histories for pairs (G, S), both by a recursive approach that applies for all (G, S) [13, 18], and by closed-form formulas and bijective constructions suited to particular families of trees [1, 5, 6, 9, 13, 14, 15, 16]. These enumeration studies, primarily considering matching gene trees and species trees with G = S and having particular emphasis on shapes such as caterpillars, 4-pseudocaterpillars, and caterpillar-like families (Figure 1), have informally observed that in specified classes of trees, the largest number of coalescent histories tends to occur when the pair (G, S) has two features: multiple sequences exist in which the coalescences of G can be arranged, and many edges of S exist on which those coalescences can take place.

Figure 1:


Three tree shapes with n = 11 leaves. (A) Caterpillar tree shape. (B) p-pseudocaterpillar tree shape. For this tree, p = 8. (C) Caterpillar-like tree shape. The seed tree has size 9.

Rosenberg [13] observed that for small trees with at most n = 9 leaves and G = S, the largest numbers of coalescent histories among tree pairs (G, G) with fixed n were seen for trees that had structure similar to caterpillar trees, but that unlike caterpillars, had more than one possible sequence of coalescences. Rosenberg [14] and Disanto & Rosenberg [6] examined tree families (G, G), with n growing arbitrarily large in specified caterpillar-like tree families. Beginning with a seed tree, these studies generated families of increasingly large trees by sequentially adding taxa so that the next tree in a family was formed by placing the current tree and a single leaf on opposite sides of a new root. They saw that across all seed trees of a fixed small size, as the number of leaves grew without bound, the largest numbers of coalescent histories were achieved when the seed tree had many different sequences in which its coalescences could be arranged. Disanto & Rosenberg [5] constructed a tree family, the lodgepole family, that, unlike caterpillar families, grows so that as the number of leaves increases, trees accumulate both new sequences in which coalescences can be arranged and new places for them to occur. This family is the family of matching tree pairs (G, G) with the largest-known number of coalescent histories as n increases without bound.

Despite many observations suggesting that coalescent histories tend to increase in number when G has many sequences in which coalescences can be arranged and many edges on which those coalescences can occur, existing results in support of this view have focused on small trees [13] and on informal interpretations of specific families with large limits as n → ∞ [6, 14]; no result has formally demonstrated the observation in a class of trees for a fixed finite n of arbitrary size. We devise a scenario to formalize this idea characterizing cases with the largest numbers of coalescent histories. We fix the species tree S to be a caterpillar, and we consider a family of non-matching gene trees G, the p-pseudocaterpillars. We show that in this class of non-matching pairs (G, S) with fixed n, the largest number of coalescent histories is achieved precisely when G combines the two elements: many coalescence sequences, and many edges on which those coalescences can take place.

Our approach is one of relatively few to examine enumerations of coalescent histories in the case that G is not necessarily equal to S [9, 13, 15, 16, 18]. The strategy employs a construction that enumerates coalescent histories for non-matching caterpillar trees. Generalizing a result of Degnan [1] for enumeration of coalescent histories for matching caterpillar trees, Himwich & Rosenberg [9] produced a bijection with monotonic paths for use in enumerating coalescent histories for non-matching caterpillar pairs (G, S). We use this monotonic-path construction to enumerate coalescent histories for the class of non-matching trees that considers a caterpillar species tree S and a p-pseudocaterpillar gene tree G.

Section 2 introduces definitions and notation. Section 3 gives an example that motivates the general calculation. In Section 4, we enumerate coalescent histories in the general case. Section 5 gives special cases with specified values of p. Finally, in Section 6, for a specified caterpillar species tree S of fixed size n, considering all possible values of p, we obtain the maximal number of coalescent histories across all non-matching p-pseudocaterpillar trees G. Section 7 discusses a symmetry in p for fixed n, and we conclude with a discussion in Section 8. The computations illustrate how the monotonic path approach of Himwich & Rosenberg [9] in the case of caterpillar species trees can be extended to enumerate coalescent histories in more cases beyond that of caterpillar gene trees.

2. Preliminaries

We formally define coalescent histories and p-pseudocaterpillars in Sections 2.1 and 2.2, and we introduce results concerning the Catalan numbers in Section 2.3. In Section 2.4, we describe the use of monotonic paths to enumerate coalescent histories for caterpillar tree pairs.

2.1. Coalescent histories

The definitions in this article closely follow Himwich & Rosenberg [9]. Henceforth, we treat all “trees” as binary and rooted; trees are leaf-labeled, but the labels are sometimes omitted for convenience. The set of vertices or nodes of a tree can be divided into leaf nodes and non-leaf internal nodes. For tree G, we say that a node v1 is descended from a node v2 if the path from v1 to the root of G travels through v2; v2 is then ancestral to v1. Ancestor–descendant relationships also apply to edge–edge pairs and edge–node pairs. A node or edge is trivially descended from and ancestral to itself. Each internal node, including the root, possesses an associated internal edge immediately ancestral to it. Thus, viewed forward in time, the root—like other internal nodes—has in-degree 1 and out-degree 2.

We consider pairs (G, S) in which G represents a gene tree, describing the descent of a set of genealogical lineages, and S represents a species tree, describing the descent of a set of species. G and S are assumed to have the same number of leaves, n. We assume that the leaf set of G and the leaf set of S are labeled by the same label set X, and that each label in X is assigned to exactly one leaf of G and to exactly one leaf of S. This assumption corresponds to an assumption that exactly one gene lineage is sampled in each of the n species.

For the pair (G, S), we can formally define the functions known as coalescent histories.

Definition 1. Consider a pair of trees (G, S) that are binary, rooted, and leaf-labeled, with the labels in bijective correspondence. A coalescent history is a function α from the set of internal nodes of G to the set of internal edges of S, satisfying two conditions:

  1. For each internal node v in G, all labels for leaves descended from v in G label leaves descended from edge α(v) in S.

  2. For each pair of internal nodes υ1 and υ2 in G, if υ2 is descended from υ1, then α(υ2) is descended from α(υ1) in S.

In this definition, nodes of G represent coalescent events for the gene lineages, and edges of S represent species tree edges along which the gene lineages evolve. A coalescent history reflects the biological process of coalescence, in which descendants cannot coalesce farther back in time than their ancestors. Ancestor–descendant relations are preserved under the mapping α.
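Definition 1 can be checked mechanically on small trees. The following is a minimal Python sketch, not the authors' method: it assumes trees are written as nested 2-tuples with string leaf labels, identifies each internal node (and the internal edge immediately above it) by the set of leaf labels descended from it, and counts the functions α by recursing from the root of G toward its leaves. The function names are illustrative.

```python
from functools import lru_cache


def clade(tree):
    """Set of leaf labels below a node; a leaf is a str, an internal node a 2-tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return clade(tree[0]) | clade(tree[1])


def internal_clades(tree):
    """Clades of all internal nodes (root included), children listed before parents."""
    if isinstance(tree, str):
        return []
    return internal_clades(tree[0]) + internal_clades(tree[1]) + [clade(tree)]


def count_coalescent_histories(gene_tree, species_tree):
    """Count the maps alpha of Definition 1 for the pair (gene_tree, species_tree)."""
    # Each internal edge of S is identified with the clade of the node just below it.
    edges = internal_clades(species_tree)

    @lru_cache(maxsize=None)
    def count(node, edge):
        # Ways to place the internal nodes strictly below `node`, given alpha(node) = edge.
        total = 1
        for child in node:
            if isinstance(child, str):
                continue
            below = clade(child)
            # Condition 1: every leaf below the child lies below edge e (below <= e).
            # Condition 2: e is descended from the edge of the parent (e <= edge).
            total *= sum(count(child, e) for e in edges if below <= e <= edge)
        return total

    root = clade(gene_tree)
    return sum(count(gene_tree, e) for e in edges if root <= e)
```

For example, with the 4-leaf caterpillar `S = ((("A", "B"), "C"), "D")`, the call `count_coalescent_histories(S, S)` counts the histories of a matching caterpillar pair, and replacing the gene tree by `(("A", "B"), ("C", "D"))` gives a non-matching pair. Condition 2 is enforced only between parent and child, which suffices because descent between edges is transitive.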

Informally, considering the biological process of coalescence, we often refer to internal nodes as coalescences. We refer to gene tree coalescences “happening,” “occurring,” or “taking place” on species tree edges. For example, a gene tree coalescence v in G “can occur” on a species tree edge w in S if and only if there exists a coalescent history α with α(υ) = w.

2.2. Caterpillars and p-pseudocaterpillars

As our goal is to enumerate the coalescent histories in the case that G has a p-pseudocaterpillar topology and S has a caterpillar topology, we define caterpillar and p-pseudocaterpillar shapes for binary, rooted tree topologies.

Definition 2. A caterpillar tree is a binary, rooted tree that has an internal node that is descended from all other internal nodes.

In a caterpillar tree, each internal node has at least one leaf as an immediate descendant (Figure 1A). Equivalently, a caterpillar tree is a tree that has only one cherry node: an internal node with exactly two descendant leaves. Rosenberg [13] defined binary, rooted pseudocaterpillar trees with n ≥ 4 leaves as trees in which all internal nodes except one have at least one immediate leaf descendant. The node that provides the exception has two cherry nodes as its immediate descendants. We generalize the earlier definition of pseudocaterpillar trees to consider generalized pseudocaterpillar trees. To define this concept, we denote by υL and υR the left and right descendant nodes of an internal node υ.

Definition 3. A generalized pseudocaterpillar tree is a binary, rooted tree that has at least four leaves and that satisfies two conditions. (i) The tree possesses exactly two cherry nodes. (ii) For each internal node υ, at least one of υL, υR has no more than two descendant leaves.

In other words, a generalized pseudocaterpillar tree is formed from a caterpillar tree, replacing one of the leaves not descended from the unique cherry node by a second cherry node (Figure 1B).

A generalized pseudocaterpillar can be described by two numbers: the total number of leaves n and the position p of the “second” cherry. To precisely identify p for a generalized pseudocaterpillar tree, we label the leaves by natural numbers starting from left to right, placing the “first” cherry (the one present in the caterpillar from which the generalized pseudocaterpillar has been generated) on the left. We define the position of the second cherry as the number corresponding to its second leaf from the left (Figure 2A). A generalized pseudocaterpillar tree with a second cherry in position p, 4 ≤ p ≤ n, is termed a p-pseudocaterpillar tree. The pseudocaterpillar trees in the sense of Rosenberg [13] are 4-pseudocaterpillars.

Figure 2:


Labeled p-pseudocaterpillar gene tree and caterpillar species tree. (A) A p-pseudocaterpillar gene tree G with coordinates for the leaves. (B) A species tree S with the internal edges labeled. Both trees have n leaves. The trees are drawn in canonical form, so that the path from the left-most leaf to the root contains all other internal nodes for the caterpillar, and all other internal nodes except one for the p-pseudocaterpillar.

Note that for convenience, our description of the value of p in a p-pseudocaterpillar tree relies on a canonical orientation of the tree. This value of p can also be identified as the number of leaves in the smallest subtree that contains both cherries of the tree.
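The canonical form and the two descriptions of p can be illustrated with a short Python sketch. It assumes a nested-tuple representation of trees (string leaves, 2-tuple internal nodes) chosen only for illustration; the function names are hypothetical. The tree is built in canonical orientation, and p is then recovered as the number of leaves in the smallest subtree that contains both cherries.

```python
def caterpillar(labels):
    """Caterpillar in canonical form: the cherry joins the two left-most labels."""
    tree = (labels[0], labels[1])
    for label in labels[2:]:
        tree = (tree, label)
    return tree


def pseudocaterpillar(labels, p):
    """p-pseudocaterpillar in canonical form; the second cherry joins leaves p-1 and p."""
    if not 4 <= p <= len(labels):
        raise ValueError("p must satisfy 4 <= p <= n")
    left = caterpillar(labels[:p - 2])               # leaves 1, ..., p-2
    second_cherry = (labels[p - 2], labels[p - 1])   # leaves p-1 and p (1-based)
    tree = (left, second_cherry)
    for label in labels[p:]:                         # leaves p+1, ..., n
        tree = (tree, label)
    return tree


def num_leaves(tree):
    return 1 if isinstance(tree, str) else num_leaves(tree[0]) + num_leaves(tree[1])


def count_cherries(tree):
    if isinstance(tree, str):
        return 0
    if all(isinstance(child, str) for child in tree):
        return 1
    return count_cherries(tree[0]) + count_cherries(tree[1])


def position_p(tree):
    """Number of leaves in the smallest subtree containing both cherries."""
    if count_cherries(tree) != 2:
        raise ValueError("expected a tree with exactly two cherries")
    while True:
        holders = [child for child in tree if count_cherries(child) == 2]
        if not holders:
            return num_leaves(tree)
        tree = holders[0]   # descend while a single child still holds both cherries
```

For instance, `pseudocaterpillar(list("ABCDEFGHIJ"), 6)` produces a 10-leaf tree whose second cherry is `("E", "F")`, and `position_p` applied to it returns 6.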

Our interest is in the case in which the gene tree has a p-pseudocaterpillar topology for some p, and the species tree has a caterpillar topology. A species tree with n leaves has n − 1 edges on which gene tree coalescences can happen; we also label these edges with natural numbers, following the order from Degnan & Salter [3] (Figure 2B).

2.3. Catalan numbers

It is useful to introduce the Catalan number sequence 1, 1, 2, 5, 14, 42, 132, 429, …, as it features prominently in our analysis. Letting Cn be the nth Catalan number for n ≥ 0,

$$C_n = \binom{2n}{n} - \binom{2n}{n-1} = \frac{1}{n+1}\binom{2n}{n}. \tag{1}$$

Considering the many combinatorial interpretations of this sequence [8, 17], we make use of the fact that Cn is the number of monotonic paths that travel from (0, 0) to (n, n) on a square lattice of size n × n, that begin with a 1-unit step to the right, and that do not cross the diagonal from (0, 0) to (n, n), where a monotonic path is a path that proceeds exclusively by 1-unit steps up or to the right.
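The lattice-path interpretation can be checked numerically. The following Python sketch, with illustrative function names, computes the Catalan numbers from the closed form in eq. (1) and, independently, by dynamic programming over lattice points on or below the diagonal.

```python
from math import comb


def catalan(n):
    """Closed form from eq. (1): C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def count_paths_below_diagonal(n):
    """Monotonic paths from (0, 0) to (n, n) that never rise above the line y = x."""
    # paths[x][y] = number of admissible paths from the origin to the point (x, y)
    paths = [[0] * (n + 1) for _ in range(n + 1)]
    paths[0][0] = 1
    for x in range(n + 1):
        for y in range(x + 1):            # only points with y <= x are admissible
            if x == 0 and y == 0:
                continue
            from_left = paths[x - 1][y] if x > 0 else 0
            from_below = paths[x][y - 1] if y > 0 else 0
            paths[x][y] = from_left + from_below
    return paths[n][n]
```

A path that stays on or below the diagonal necessarily begins with a right-step when n ≥ 1, so the two descriptions of the path class agree.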

A Catalan triangle is a combinatorial structure that counts monotonic paths to lattice points on or below the diagonal [12]. Entry (n, k) of the Catalan triangle gives the number of monotonic paths on the square lattice that travel from the origin to a point (n, k) and that do not travel above the y = x line. The number of such paths, which have n “right-steps” and k “up-steps,” is [12]:

$$C(n,k) = \begin{cases} 1, & k = 0, \\ \binom{n+k}{k} - \binom{n+k}{k-1}, & 1 \le k \le n, \\ 0, & k > n. \end{cases} \tag{2}$$

For n ≥ 1 and 1 ≤ k ≤ n, this function satisfies the first-order recurrence

$$C(n,k) = C(n-1,k) + C(n,k-1).$$

A Catalan trapezoid is obtained in a similar way, except that we allow m − 1 additional up-steps to happen starting at the origin, so that monotonic paths that do not travel above the diagonal from (0, m − 1) to (n, n + m − 1) are tabulated. The number m is called the order of the trapezoid; m = 1 corresponds to the Catalan triangle. Entry (n, k) of the Catalan trapezoid of order m is given by

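$$C_t(n,k,m) = \begin{cases} \binom{n+k}{k}, & 0 \le k \le m-1, \\ \binom{n+k}{k} - \binom{n+k}{k-m}, & m \le k \le n+m-1, \\ 0, & k > n+m-1, \end{cases} \tag{3}$$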

and it satisfies a similar recurrence $C_t(n, k, m) = C_t(n-1, k, m) + C_t(n, k-1, m)$ for n ≥ 1 and 1 ≤ k ≤ n + m − 1. With the origin in the lower left corner, the first columns of the Catalan triangle and the Catalan trapezoid of order 3 appear below.
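Entries of both structures can be generated directly from eqs. (2) and (3). The Python sketch below is a minimal illustration with hypothetical function names: it implements the two closed forms and a helper that lists the first columns with the origin in the lower left corner, so that row k = 0 is printed last.

```python
from math import comb


def catalan_triangle(n, k):
    """Entry (n, k) of the Catalan triangle, eq. (2)."""
    if k == 0:
        return 1
    if k > n:
        return 0
    return comb(n + k, k) - comb(n + k, k - 1)


def catalan_trapezoid(n, k, m):
    """Entry (n, k) of the Catalan trapezoid of order m, eq. (3)."""
    if k <= m - 1:
        return comb(n + k, k)
    if k > n + m - 1:
        return 0
    return comb(n + k, k) - comb(n + k, k - m)


def first_columns(entry, columns, rows):
    """Rows listed top to bottom, so the origin sits in the lower left corner."""
    return [[entry(n, k) for n in range(columns)] for k in reversed(range(rows))]


if __name__ == "__main__":
    for row in first_columns(catalan_triangle, 6, 6):
        print(row)
    print()
    for row in first_columns(lambda n, k: catalan_trapezoid(n, k, 3), 6, 8):
        print(row)
```

Setting m = 1 in `catalan_trapezoid` reproduces `catalan_triangle`, in line with the statement that the triangle is the trapezoid of order 1.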


2.4. Bijection between coalescent histories and monotonic paths for caterpillars

Building on work of Degnan [1], the bijective construction of Himwich & Rosenberg [9] enumerates coalescent histories for pairs consisting of a caterpillar gene tree and a caterpillar species tree by bijectively associating each coalescent history with a monotonic path that does not travel above the diagonal of a square lattice. The coalescent histories are then enumerated by counting the bijectively-associated monotonic paths.

In the construction, given a caterpillar species tree S and a caterpillar gene tree G with n leaves, a square (n − 1) × (n − 1) lattice is examined. The coalescent histories for (G, S) correspond to monotonic paths from (0, 0) to (n − 1, n − 1), with each right-step corresponding to a species tree internal edge, and each up-step corresponding to a gene tree coalescence. The pair (G, S) specifies a set of roadblocks, points in the lattice through which monotonic paths are not permitted to travel. The number of coalescent histories for (G, S) then equals the number of monotonic paths that do not travel above the diagonal and that do not travel through any of the roadblocks. In the case that G and S have the same caterpillar labeled topology, no roadblocks exist, and the number of monotonic paths, and hence the number of coalescent histories, is the Catalan number $C_{n-1}$ [1].

The construction of Himwich & Rosenberg [9] also applies to caterpillar subtrees. Suppose G possibly has fewer leaves than S, so that the label set for G is a subset of the label set for S. If we use the term partial coalescent history to describe mappings that satisfy Definition 1 except that the label set of G is a subset of the label set of S rather than a bijectively-associated label set, then the number of partial coalescent histories for a caterpillar pair (G, S) is obtained by counting roadblocked monotonic paths to an associated point that is not necessarily the point (n − 1, n − 1).

For details, see Himwich & Rosenberg [9]. We illustrate the construction in an example.
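The counting step of the construction can be sketched in Python as a dynamic program over lattice points. The rule that determines which points are roadblocks for a given pair (G, S) is given in [9] and is not reproduced here, so this sketch takes the roadblocks as an arbitrary input set; the function name and argument layout are assumptions made for illustration.

```python
def count_roadblocked_paths(end, roadblocks=()):
    """Monotonic paths from (0, 0) to `end` that stay on or below y = x
    and avoid every point listed in `roadblocks`."""
    end_x, end_y = end
    blocked = set(roadblocks)
    paths = {}
    for x in range(end_x + 1):
        for y in range(min(x, end_y) + 1):   # restrict to points with y <= x
            if (x, y) in blocked:
                paths[(x, y)] = 0            # no path may pass through a roadblock
            elif (x, y) == (0, 0):
                paths[(x, y)] = 1
            else:
                # arrive by a right-step from (x-1, y) or an up-step from (x, y-1)
                paths[(x, y)] = paths.get((x - 1, y), 0) + paths.get((x, y - 1), 0)
    return paths.get((end_x, end_y), 0)
```

With no roadblocks and endpoint (n − 1, n − 1), the count is the Catalan number for matching caterpillars; an endpoint off the main diagonal corresponds to the partial coalescent histories described above.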

3. Example

Our approach to extending the construction of Himwich & Rosenberg [9] to count coalescent histories for a caterpillar species tree and a non-matching p-pseudocaterpillar gene tree—a tree with one extra cherry—can be understood with an example. Consider a gene tree G with 10 leaves, with cherry node (E,F) as shown in Figure 3A, and a species tree S as shown in Figure 3B.

Figure 3:


Example (gene tree, species tree) pair. (A) 6-pseudocaterpillar gene tree G with “second cherry” (E,F). The pivotal coalescence is circled in red. (B) Caterpillar species tree S.

The key to counting coalescent histories for (G, S) is to examine the specific gene tree coalescence circled in red in Figure 3A, indicating the most recent common ancestor of both cherries of G. We call this node the pivotal coalescence. In a coalescent history, this pivotal coalescence can take place on any internal edge ancestral to species F in the species tree. We partition all coalescent histories for (G, S) by the position of this pivotal coalescence. For each placement of the pivotal coalescence, we then count the number of coalescent histories by counting monotonic paths on particular diagrams for the subtrees generated by the pivotal coalescence.

Suppose the pivotal coalescence of G happens on edge 5 of S. Then all the coalescences in the “left” subtree descended from the pivotal coalescence must happen on or before edge 5. This left subtree is now a caterpillar (((A,B),C),D), coalescing on a caterpillar (((((A,B),C),D),E),F). We can now follow the construction of Himwich & Rosenberg [9] to enumerate partial coalescent histories through a bijection with monotonic paths.

In particular, the number of ways that the gene tree coalescences of (((A,B),C),D) can occur on species tree (((((A,B),C),D),E),F) is equal to the number of monotonic paths on a Catalan triangle restricted to 5 right-steps and 3 up-steps (Figure 4A). The up-steps correspond to the 3 coalescences in the subtree (((A,B),C),D), and the right-steps correspond to the 5 edges of (((((A,B),C),D),E),F) on which they can take place. Following eq. (2), the number of monotonic paths that travel from (0, 0) to (5, 3) and that do not travel above the diagonal is $\binom{8}{3} - \binom{8}{2} = 28$. Because gene tree coalescence (E,F) must occur on species tree edge 5 when the pivotal coalescence occurs on edge 5, coalescence (E,F) does not introduce additional coalescent histories. Thus, 28 possible partial coalescent histories place the pivotal coalescence on species tree edge 5.

Figure 4:


Catalan triangle construction for enumerating coalescent histories for Figure 3. Following Himwich & Rosenberg [9], up-steps represent gene tree coalescences and are labeled in the diagram by the leaf participating in the coalescence (by both leaves for the first coalescence). Right-steps represent species tree internal edges and are labeled in the diagram by the leaf immediately descended from the associated edge (by both leaves for the first edge). The numbers indicated represent counts of monotonic paths according to eq. (2). (A) Diagram corresponding to the left subtree descended from the pivotal gene tree coalescence. (B) Diagram corresponding to the portion of the gene tree ancestral to the pivotal coalescence. The arrow indicates that all coalescences other than those depicted in the diagram have already happened by the starting point, the first species tree internal edge ancestral to F.

We now need to consider the coalescences ancestral to the pivotal coalescence. Coalescences involving leaf G can happen on edge 6 or on any edge ancestral to 6, coalescences with leaf H can happen on edge 7 or any edge ancestral to 7, provided that leaf G has already participated in a coalescence, and so on. Again following the construction of Himwich & Rosenberg [9], the possible assignments of gene tree coalescences to species tree edges in this upper part of the species tree can be described by a Catalan triangle with 4 right-steps and 4 up-steps (Figure 4B). There are 14 possible monotonic paths.

To obtain the total number of coalescent histories with pivotal coalescence on edge 5, we now multiply the two numbers we already have: for each of the 28 partial coalescent histories for coalescences descended from the pivotal coalescence, there are 14 ways for the coalescences ancestral to it to happen. Hence, 392 coalescent histories exist with pivotal coalescence on edge 5.

To obtain the total count of coalescent histories for (G, S), we must consider all other possible locations of the pivotal coalescence, and sum their associated numbers of coalescent histories. With this idea, however, we are now ready for the general case.
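As an arithmetic check on this example, the contribution of each placement of the pivotal coalescence can be tabulated with a few lines of Python. The sketch assumes the counts derived in Section 4 for n = 10 and p = 6: a Catalan triangle entry C(k, p − 3) for the left subtree, k − p + 2 placements of coalescence (E,F), and a Catalan triangle entry C(n − p, n − k − 1) for the coalescences ancestral to the pivotal coalescence. Variable names are illustrative.

```python
from math import comb


def catalan_triangle(n, k):
    """Entry (n, k) of the Catalan triangle, eq. (2)."""
    if k == 0:
        return 1
    if k > n:
        return 0
    return comb(n + k, k) - comb(n + k, k - 1)


n, p = 10, 6   # the 10-leaf, 6-pseudocaterpillar example of Figure 3
rows = []
for k in range(p - 1, n):                            # pivotal coalescence on edge k = 5, ..., 9
    left = catalan_triangle(k, p - 3)                # caterpillar (((A,B),C),D) on edges 1..k
    right = k - p + 2                                # coalescence (E,F) on edges 5..k
    upper = catalan_triangle(n - p, n - k - 1)       # coalescences above the pivotal one
    rows.append((k, left, right, upper, left * right * upper))

total = sum(row[-1] for row in rows)

if __name__ == "__main__":
    for row in rows:
        print(row)
    print("total:", total)
```

Under these formulas, the placement on edge 5 contributes 28 × 1 × 14 = 392, as computed above, and the five placements contribute 392, 1344, 2025, 1760, and 770, for a total of 6291.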

4. General construction

The example in Section 3 illustrates that we can enumerate coalescent histories for a p-pseudocaterpillar gene tree and a caterpillar species tree by dividing the problem into three components: placement of the pivotal coalescence, and two enumerations, one for coalescences descended from the pivotal coalescence, and the other for coalescences ancestral to it. We describe these two enumerations in full generality, and complete the calculation by summing over all placements of the pivotal coalescence.

Consider a p-pseudocaterpillar gene tree G and a caterpillar species tree S, both with n leaves, that are bijectively labeled with the same set of distinct labels. Suppose G and S have an identical leaf labeling, by which we mean that when G and S are drawn in canonical form (Figure 2), the gene tree and species tree labels are listed in the same order when reading them from left to right. Figure 3 illustrates an identical leaf labeling. Note that labelings in which the labels in one or both cherries of G are transposed with respect to S also qualify as identical.

Using our numerical labeling scheme for edges of gene trees and species trees (Figure 2), the pivotal coalescence can take place on any species tree edge from p − 1 to n − 1. Suppose it happens on edge k, p − 1 ≤ k ≤ n − 1.

4.1. Coalescences descended from the pivotal coalescence

Label by $S_k$ the subtree of S whose root is the node immediately descended from edge k. Label the subtree of G whose root node is the pivotal coalescence by $G^{*}$. The left subtree of $G^{*}$, which we denote $G^{*}_{\ell}$, is a caterpillar with p − 3 coalescences. By the assumption that the pivotal coalescence takes place on species tree edge k, all coalescences in $G^{*}_{\ell}$ must occur on edges 1, 2, …, k.

Following Himwich & Rosenberg [9], the partial coalescent histories for $(G^{*}_{\ell}, S_k)$, with p − 3 gene tree coalescences and k species tree edges on which they take place, correspond to monotonic paths from (0, 0) to (k, p − 3) that do not travel above the y = x line. The number of partial coalescent histories therefore corresponds to Catalan triangle entry (k, p − 3). By eq. (2), this quantity, which we denote $\ell_k$, equals

$$\ell_k = \binom{k+p-3}{p-3} - \binom{k+p-3}{p-4}. \tag{4}$$

The right subtree of $G^{*}$, or $G^{*}_{r}$, has exactly one coalescence, which can happen on any of the edges p − 1, p, …, k. Hence, the number of coalescent histories for $(G^{*}_{r}, S_k)$ is

$$r_k = k - p + 2. \tag{5}$$

Combining the left and right subtrees of $G^{*}$, from eqs. (4) and (5), the number of partial coalescent histories for $(G^{*}, S_k)$ is $\ell_k r_k$.

4.2. Coalescences ancestral to the pivotal coalescence

To examine coalescences ancestral to the pivotal coalescence, the pivotal coalescence can be viewed as a “leaf” of a caterpillar gene tree $G^{u}$ whose coalescences occur on species tree edges numbered k or greater. In this view, $G^{u}$ is the (n − p + 1)-leaf caterpillar tree in which the subtree rooted at the pivotal coalescence is replaced by a leaf, so that the pivotal coalescence is a leaf in the cherry of $G^{u}$.

$G^{u}$ has n − p gene tree coalescences, which take place on species tree edges k, k + 1, …, n − 1, a total of n − k edges. It is possible for multiple coalescences in $G^{u}$ to occur on edge k; taking into account that edge k has k + 1 descendant leaves, and p − 1 coalescences have already occurred including the pivotal coalescence, at most k − p + 1 coalescences of $G^{u}$ can occur on edge k.

The coalescences of $G^{u}$ therefore correspond to monotonic paths that do not travel above a specified diagonal of a trapezoidal lattice. The number of right-steps is n − k − 1, one for each non-root edge on which coalescences take place, and the number of up-steps is n − p, one for each gene tree coalescence in $G^{u}$. The order of the trapezoid is k − p + 2, one more than the number of coalescences of $G^{u}$ that can occur on the initial edge k.

We can count these monotonic paths using eq. (3), or by noting that the number of monotonic paths is symmetric with respect to interchange of the starting and ending points. In other words, the number of monotonic paths up and to the right from the lower left vertex of a trapezoid to the upper right vertex is the same as the number of monotonic paths down and to the left from the upper right vertex to the lower left vertex. By this symmetry, the number of paths on a Catalan trapezoid is then equal to one of the entries in the right-most column of some Catalan triangle. Assigning coordinates on the lattice, we have

$$C(n-p,\, n-k-1) = C_t(n-k-1,\, n-p,\, k-p+2). \tag{6}$$

A visual explanation appears in Figure 5.
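The identity in eq. (6) can also be confirmed numerically. The Python sketch below, with illustrative function names, compares the Catalan triangle entry from eq. (2), the Catalan trapezoid entry from eq. (3), and a direct count of monotonic paths in the trapezoidal lattice, over a range of (n, p, k).

```python
from math import comb


def catalan_triangle(n, k):
    """Entry (n, k) of the Catalan triangle, eq. (2)."""
    if k == 0:
        return 1
    if k > n:
        return 0
    return comb(n + k, k) - comb(n + k, k - 1)


def catalan_trapezoid(n, k, m):
    """Entry (n, k) of the Catalan trapezoid of order m, eq. (3)."""
    if k <= m - 1:
        return comb(n + k, k)
    if k > n + m - 1:
        return 0
    return comb(n + k, k) - comb(n + k, k - m)


def trapezoid_paths(n, k, m):
    """Direct count of monotonic paths from (0, 0) to (n, k) on or below y = x + m - 1."""
    paths = {}
    for x in range(n + 1):
        for y in range(k + 1):
            if y > x + m - 1:
                continue                     # point lies above the allowed diagonal
            if x == 0 and y == 0:
                paths[(x, y)] = 1
            else:
                paths[(x, y)] = paths.get((x - 1, y), 0) + paths.get((x, y - 1), 0)
    return paths.get((n, k), 0)


def check_eq6(max_n=12):
    """Return True if eq. (6) holds for all 4 <= p <= n <= max_n and p - 1 <= k <= n - 1."""
    for n in range(4, max_n + 1):
        for p in range(4, n + 1):
            for k in range(p - 1, n):
                triangle = catalan_triangle(n - p, n - k - 1)
                trapezoid = catalan_trapezoid(n - k - 1, n - p, k - p + 2)
                direct = trapezoid_paths(n - k - 1, n - p, k - p + 2)
                if not (triangle == trapezoid == direct):
                    return False
    return True
```

For the case shown in Figure 5A (n = 10, p = 6, k = 7), all three computations give the trapezoid entry with 2 right-steps, 4 up-steps, and order 3.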

Figure 5:


Monotonic path construction for coalescences ancestral to the pivotal coalescence. (A) Exchanging the starting and ending points of monotonic paths, the number of monotonic paths in a trapezoidal lattice is equal to an entry in a Catalan triangle. Following the notation of Figure 3 with n = 10 and p = 6, suppose the pivotal coalescence happens on species tree edge k = 7. Up to two additional coalescences can happen on edge k = 7, producing a trapezoid. Thus, the figure encodes partial coalescent histories for gene tree coalescences (((((A,B),C),D),E),F) with G, ((((((A,B),C),D),E),F),G) with H, (((((((A,B),C),D),E),F),G),H) with I, and ((((((((A,B),C),D),E),F),G),H),I) with J on species tree edges ancestral to A, B, C, D, E, F, G, and H. The number of monotonic paths is obtained from a Catalan trapezoid of order k − p + 2 = 3 with n − k − 1 = 2 right-steps and n − p = 4 up-steps. By a symmetry in which the number of permissible paths from starting point (0, 0) to ending point (n − k − 1, n − p) is equivalent to the number of permissible paths from the ending point to the starting point, we can count paths from this alternate perspective, evaluating entry (n − p, n − k − 1) = (4, 2) of a Catalan triangle. (B) Monotonic path construction for each of the n − p + 1 = 5 options for placement of the pivotal coalescence. For a placement of the pivotal coalescence shown in the left-hand diagram, the number of coalescent histories for the coalescences ancestral to the pivotal coalescence, $(G^{u}, S)$, is obtained by counting monotonic paths in the right-hand diagram from the lower-left vertex to an associated point on the right-hand edge.

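Denote by $u_k$ the number of coalescent histories for $(G^{u}, S)$, the caterpillar of coalescences ancestral to the pivotal coalescence. Using eq. (3) or the symmetry argument with eq. (2) to count monotonic paths in a triangular lattice with n − p right-steps and n − k − 1 up-steps, we have

$$u_k = C(n-p,\, n-k-1) = \binom{2n-p-k-1}{n-k-1} - \binom{2n-p-k-1}{n-k-2}. \tag{7}$$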

4.3. Full formula

We have shown that the number of partial coalescent histories for the subtree rooted at the pivotal coalescence, $(G^{*}, S_k)$, that place the pivotal coalescence on edge k is $\ell_k r_k$, and that for each of these partial coalescent histories, the number of partial coalescent histories for the coalescences ancestral to the pivotal coalescence, $(G^{u}, S)$, is $u_k$. Because each coalescent history for (G, S) consists of a partial coalescent history for $(G^{*}, S_k)$, a placement of the pivotal coalescence, and a partial coalescent history for $(G^{u}, S)$, the number of coalescent histories for the case in which the pivotal coalescence happens on edge k is $\ell_k r_k u_k$. Summing over values of k, we have proven the following theorem.

Theorem 4. Consider a caterpillar species tree S with n ≥ 4 leaves and an identically-labeled p-pseudocaterpillar gene tree G with n leaves and 4 ≤ p ≤ n. With C(n, k) as in eq. (2), the number of coalescent histories for (G, S) is

$$h(n,p) = \sum_{k=p-1}^{n-1} \ell_k r_k u_k = \sum_{k=p-1}^{n-1} C(k,\, p-3)\,(k-p+2)\,C(n-p,\, n-k-1). \tag{8}$$

Note that eq. (8) can be seen to apply for (n, p) with p = 3 and 3 ≤ p ≤ n, and hence for n = 3. In this case, G is viewed as a caterpillar gene tree whose cherry joins leaves 2 and 3. The left subtree $G^{*}_{\ell}$ descended from the pivotal coalescence has no coalescences, so $\ell_k = 1$; this enumeration accords with the definition of the function C in eq. (2), where we have C(k, p − 3) = C(k, 0) = 1 for all k.

A convenient form of eq. (8) for computation is as follows:

$$h(n,p) = \sum_{k=p-1}^{n-1} \frac{(k-p+2)^2\,(k-p+4)\,(2n-p-k-1)!\,(k+p-3)!}{(k+1)!\,(n-k-1)!\,(n-p+1)!\,(p-3)!}. \tag{9}$$
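Both forms of h(n, p) are straightforward to evaluate. The following Python sketch, with illustrative function names, implements the sum of products in eq. (8) and the factorial form in eq. (9), so that the two can be compared for any 4 ≤ p ≤ n.

```python
from math import comb, factorial


def catalan_triangle(n, k):
    """Entry (n, k) of the Catalan triangle, eq. (2)."""
    if k == 0:
        return 1
    if k > n:
        return 0
    return comb(n + k, k) - comb(n + k, k - 1)


def h_sum(n, p):
    """Eq. (8): sum over placements k of the pivotal coalescence."""
    if not 4 <= p <= n:
        raise ValueError("requires 4 <= p <= n")
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def h_factorial(n, p):
    """Eq. (9): the same sum with each term written in factorials."""
    if not 4 <= p <= n:
        raise ValueError("requires 4 <= p <= n")
    total = 0
    for k in range(p - 1, n):
        numerator = ((k - p + 2) ** 2 * (k - p + 4)
                     * factorial(2 * n - p - k - 1) * factorial(k + p - 3))
        denominator = (factorial(k + 1) * factorial(n - k - 1)
                       * factorial(n - p + 1) * factorial(p - 3))
        # Each term equals a product of integer path counts, so the division is exact.
        total += numerator // denominator
    return total


if __name__ == "__main__":
    n = 10
    for p in range(4, n + 1):
        print(p, h_sum(n, p), h_factorial(n, p))
```

The factorial form follows from writing each Catalan triangle entry C(a, b) as (a − b + 1)/(a + 1) times a binomial coefficient; integer division is used in the sketch only because each term is known to be an integer.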

4.4. Identical and non-identical leaf labelings

The results of Himwich & Rosenberg [9] enable a result on leaf labelings. We claim that for a fixed caterpillar species tree, an identically-labeled p-pseudocaterpillar gene tree—the focus of our analysis—has strictly more coalescent histories than any non-identically-labeled p-pseudocaterpillar. The argument is that any non-identically-labeled gene tree introduces at least one “roadblock,” decreasing its associated number of monotonic paths compared to the case of identical labels.

Proposition 5. Consider a caterpillar species tree S with n ≥ 4 leaves and a value of p, 4 ≤ p ≤ n. The number of coalescent histories for (G, S), with G a p-pseudocaterpillar gene tree bijectively labeled with the same n labels as S, is bounded above by h(n, p), with equality if and only if G and S are identically labeled.

Proof. Theorem 4 demonstrates that the number of coalescent histories is h(n, p) in the identically-labeled case. We must show that a non-identically-labeled G produces fewer coalescent histories.

Fix n and p. Consider caterpillar species tree S and p-pseudocaterpillar gene tree G, bijectively labeled with n labels {A1, A2, …, An}, but not necessarily identically labeled. Suppose that from left to right, A1, A2, …, An label the leaves of S when S appears in canonical form.

In eq. (8), the expression $h(n,p)=\sum_{k=p-1}^{n-1}\ell_k r_k u_k$ is a sum from k = p − 1 to n − 1 of a product of three quantities. Each quantity counts the number of monotonic paths on a Catalan triangle, trivially so in the case of $r_k = C(k-p+2,\,1) = k-p+2$, which represents the number of monotonic paths that proceed k − p + 2 steps to the right and one step up.

If we now change G to a possibly non-identically-labeled p-pseudocaterpillar G′, then the coalescent histories can be enumerated by a corresponding decomposition $h'(n,p)=\sum_{k=p-1}^{n-1}\ell'_k r'_k u'_k$, where, with the pivotal coalescence on species tree edge k, $\ell'_k$, $r'_k$, and $u'_k$ count partial coalescent histories for the three components of G′ that correspond to those counted by $\ell_k$, $r_k$, and $u_k$ for G, respectively.

To demonstrate that h′(n, p) < h(n, p), we argue that $\ell'_k \le \ell_k$, $r'_k \le r_k$, and $u'_k \le u_k$, and that for G′ ≠ G, at least one of these inequalities is strict. Following the argument of Corollary 11 of Himwich & Rosenberg [9], the quantities $\ell'_k$, $r'_k$, and $u'_k$ count monotonic paths that lie below or on the y = x line, that respectively proceed from (0, 0) to (k, p − 3), (0, 0) to (k − p + 2, 1), and (0, 0) to (n − p, n − k − 1), possibly with roadblocks.

When G′ = G, no roadblocks occur, so that $\ell'_k=\ell_k$, $r'_k=r_k$, and $u'_k=u_k$. When G′ ≠ G, however, at least one of the three components of G′ differs from the corresponding component of G: (i) the component counted by $\ell'_k$; (ii) the component counted by $r'_k$; (iii) the component counted by $u'_k$. In the first case, for at least one k, a roadblock occurs in tabulating the corresponding partial coalescent histories, so that $\ell'_k<\ell_k$. Similarly, in the second case, for at least one k, a roadblock occurs, so that $r'_k<r_k$; in the third case, for at least one k, a roadblock occurs, so that $u'_k<u_k$. □

Note that for the sum describing the number of coalescent histories of (G, S) to even proceed over the full range from k = p − 1 to n − 1, the first p labels of G from left to right when G is written in canonical form must be a permutation of A1, A2, …, Ap. Otherwise, at least one label of G must be indexed by a value that exceeds p and therefore cannot descend from edge p − 1 of S.

5. Small p

The case of identically-labeled G and S produces the largest number of coalescent histories among all p-pseudocaterpillar gene trees and caterpillar species trees with fixed (n, p) and 4 ≤ p ≤ n. Note that for p = 3, the case of identically-labeled G and S produces more coalescent histories than any non-identically-labeled pair; both G and S are caterpillars in this case, and fixing n, the number of coalescent histories for matching caterpillars exceeds that for any non-matching pair of caterpillars (Remark 15 of [2]; Corollary 11 of [9]).

We now return to the case of identically-labeled (G, S) and evaluate eq. (8) for fixed small p.

5.1. Exact formulas for fixed p

Fixing the variable p allows us to obtain exact formulas for h(n, p) as a rational function of n. The smallest case is p = 3, so that the function h(n, 3) is defined for all n ≥ 3:

$$h(n,3)=\sum_{k=2}^{n-1} C(k,\,0)\,(k-1)\,C(n-3,\,n-k-1).$$

We rewrite the summand for h in the expanded form from eq. (9). We then obtain the sum using the Wilf-Zeilberger algorithm for computing sums that involve binomial coefficients.

Proposition 6. For all n ≥ 3, the following identity holds:

$$h(n,3)=\sum_{k=2}^{n-1}\frac{(k-1)^2\,(2n-k-4)!}{(n-2)!\,(n-k-1)!}=\frac{3\,(2n-4)!}{n!\,(n-3)!}. \tag{10}$$

Proof. Set m = n − 1. Let the function F (m, k) be the ratio of the summand to the right-hand side of eq. (10):

$$F(m,k)=\frac{(k-1)^2\,m\,(m+1)\,(m-2)!\,(2m-k-2)!}{6\,(m-1)\,(2m-3)!\,(m-k)!}.$$

This function and a proof certificate

$$R(m,k)=-\frac{(k-2)\,(2m-k-1)\,(k^2m-k^2+k-2m)}{2m^2\,(2m-1)\,(k-1)\,(m-k+1)}$$

satisfy the assumptions of the Wilf-Zeilberger theorem [11, Theorem 7.1.1]. Hence, the sum $\sum_{k=2}^{m}F(m,k)$ does not depend on m. We know that $\sum_{k=2}^{m}F(m,k)=1$ when m = 2, from which eq. (10) follows by substituting n = m + 1. □
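The Wilf-Zeilberger condition used in this proof can be spot-checked with exact rational arithmetic. The sketch below assumes the standard form of the condition, F(m+1, k) − F(m, k) = G(m, k+1) − G(m, k) with G = R·F, and assumes that the certificate reads R(m, k) = −(k − 2)(2m − k − 1)(k²m − k² + k − 2m) / [2m²(2m − 1)(k − 1)(m − k + 1)]. The check is restricted to 2 ≤ k ≤ m − 1, because R has a pole at k = m + 1 that would need a limiting argument.

```python
from fractions import Fraction
from math import factorial


def F(m, k):
    """Summand of eq. (10) divided by its right-hand side, with n = m + 1."""
    if k < 2 or k > m:
        return Fraction(0)
    num = (k - 1) ** 2 * m * (m + 1) * factorial(m - 2) * factorial(2 * m - k - 2)
    den = 6 * (m - 1) * factorial(2 * m - 3) * factorial(m - k)
    return Fraction(num, den)


def R(m, k):
    """Assumed reading of the proof certificate for p = 3."""
    num = -(k - 2) * (2 * m - k - 1) * (k * k * m - k * k + k - 2 * m)
    den = 2 * m * m * (2 * m - 1) * (k - 1) * (m - k + 1)
    return Fraction(num, den)


def G(m, k):
    return R(m, k) * F(m, k)


def wz_condition_holds(m, k):
    """Check F(m+1, k) - F(m, k) == G(m, k+1) - G(m, k) exactly."""
    return F(m + 1, k) - F(m, k) == G(m, k + 1) - G(m, k)


def normalized_sum(m):
    """Sum of F(m, k) over k = 2..m; the proof shows it is independent of m."""
    return sum(F(m, k) for k in range(2, m + 1))


if __name__ == "__main__":
    print(all(wz_condition_holds(m, k) for m in range(3, 15) for k in range(2, m)))
    print([normalized_sum(m) for m in range(2, 8)])
```

A passing check on a finite grid is evidence for the transcription of F and R, not a substitute for the symbolic verification in the proof.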

It is convenient to write eq. (10) as a product of a rational function of n and a Catalan number,

$$h(n,3)=\frac{3\,(n-2)}{2\,(2n-3)}\,C_{n-1}.$$

For other small values of p, we follow the proof in Proposition 6 to obtain analogous expressions (Table 1). The corresponding proof certificates R(m, k) appear in Appendix A.

Table 1:

Closed-form expressions for the function h(n, p) for fixed values of p (eq. (8)). Wilf-Zeilberger proof certificates appear in Appendix A. The next three terms for $\lim_{n\to\infty}[h(n,p)/C_{n-1}]$ are 179587/65536 for p = 10, 384199/131072 for p = 11, and 1631605/524288 for p = 12.

| p | h(n, p) | $\lim_{n\to\infty} h(n,p)/C_{n-1}$ |
|---|---|---|
| 3 | $\frac{3(n-2)}{2(2n-3)}\,C_{n-1}$ | $\frac{3}{4}$ |
| 4 | $\frac{(19n-40)(n-3)}{4(2n-3)(2n-5)}\,C_{n-1}$ | $\frac{19}{16}$ |
| 5 | $\frac{(49n^2-254n+315)(n-4)}{4(2n-3)(2n-5)(2n-7)}\,C_{n-1}$ | $\frac{49}{32}$ |
| 6 | $\frac{(467n^3-4319n^2+12798n-12096)(n-5)}{16(2n-3)(2n-5)(2n-7)(2n-9)}\,C_{n-1}$ | $\frac{467}{256}$ |
| 7 | $\frac{(1067n^4-15263n^3+78997n^2-174673n+138600)(n-6)}{16(2n-3)(2n-5)(2n-7)(2n-9)(2n-11)}\,C_{n-1}$ | $\frac{1067}{512}$ |
| 8 | $\frac{(4751n^5-96706n^4+762163n^3-2898044n^2+5296836n-3706560)(n-7)}{32(2n-3)(2n-5)(2n-7)(2n-9)(2n-11)(2n-13)}\,C_{n-1}$ | $\frac{4751}{2048}$ |
| 9 | $\frac{(10393n^6-284776n^5+3155822n^4-18055844n^3+56078685n^2-89321220n+56756700)(n-8)}{32(2n-3)(2n-5)(2n-7)(2n-9)(2n-11)(2n-13)(2n-15)}\,C_{n-1}$ | $\frac{10393}{4096}$ |
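The closed forms in Table 1 can be compared against the sum in eq. (8). The sketch below does so for the rows p = 3, 4, 5 only, using exact rationals; it assumes the Catalan triangle C(n, k) = binom(n+k, k) − binom(n+k, k−1) and the Catalan numbers C_n = binom(2n, n)/(n + 1), and its function names are illustrative.

```python
from fractions import Fraction
from math import comb


def catalan(n):
    return comb(2 * n, n) // (n + 1)


def catalan_triangle(n, k):
    """Assumed form of eq. (2)."""
    return 1 if k == 0 else comb(n + k, k) - comb(n + k, k - 1)


def h(n, p):
    """Eq. (8)."""
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def closed_form(n, p):
    """Rows p = 3, 4, 5 of Table 1, as exact rationals."""
    c = catalan(n - 1)
    if p == 3:
        return Fraction(3 * (n - 2), 2 * (2 * n - 3)) * c
    if p == 4:
        return Fraction((19 * n - 40) * (n - 3), 4 * (2 * n - 3) * (2 * n - 5)) * c
    if p == 5:
        return Fraction((49 * n * n - 254 * n + 315) * (n - 4),
                        4 * (2 * n - 3) * (2 * n - 5) * (2 * n - 7)) * c
    raise ValueError("only p = 3, 4, 5 are covered in this sketch")


if __name__ == "__main__":
    for p in (3, 4, 5):
        ratio = closed_form(2000, p) / catalan(1999)
        print(p, float(ratio))  # approaches the limit column of Table 1
```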

5.2. Asymptotic behavior for small p

We can extend beyond the exact formulas for h(n, p) for small p in Section 5.1 to show that for each fixed p, there exists a constant $\beta_p$ such that $h(n,p)\sim\beta_p C_{n-1}$ as n → ∞. The approach follows Disanto & Rosenberg [6], who considered matching gene trees and species trees in caterpillar-like families, in which trees had a caterpillar shape with a caterpillar subtree replaced by a “seed tree” t of the same size. They assumed $G = S = t^{(n)}$, with $t^{(n)}$ consisting of t augmented by n “caterpillar branches” appended to its root.

The framework makes use of additional definitions. An r-extended coalescent history is a coalescent history for the case in which a species tree is assumed to have its root-branch divided into r ≥ 1 components [13]. Labeling these components from 1 to r with branch 1 closest to the species tree root, an m-rooted coalescent history is an r-extended coalescent history in which the gene tree root coalesces on species tree branch m, 1 ≤ m ≤ r. The number of m-rooted coalescent histories $h_{n,m}$ for $G = S = t^{(n)}$ then equals $h_{n,m} = e_{n,m} - e_{n,m-1}$, with $e_{n,0} = 0$, where $e_{n,m}$ is its number of m-extended coalescent histories.

Disanto & Rosenberg [6] devised an iterative procedure for obtaining the coalescent histories for $t^{(n+1)}$ from the coalescent histories for $t^{(n)}$, n ≥ 0. For a fixed seed tree t, the generating function for the sequence $h_{0,m}(t)$ counting m-rooted coalescent histories for t is written

$$g(y)=\sum_{m=1}^{\infty}h_{0,m}(t)\,y^m.$$

The bivariate generating function for the sequence $h_{n,m}(t)$, counting m-rooted histories for $(G,S)=(t^{(n)},t^{(n)})$, is denoted

$$F(y,z)=\sum_{m=1}^{\infty}\sum_{n=0}^{\infty}h_{n,m}(t)\,z^n y^m.$$

The univariate generating function f(z) for the sequence $h_{n,1}(t)$, counting coalescent histories for $(G,S)=(t^{(n)},t^{(n)})$, satisfies

$$f(z)=\sum_{n=0}^{\infty}h_{n,1}(t)\,z^n=\frac{\partial F}{\partial y}(0,z).$$

Disanto & Rosenberg [6] obtained the result

$$f(z)=\frac{g\!\left(\frac{1-\sqrt{1-4z}}{2}\right)}{z}. \tag{11}$$

By examining the expansion of f(z) around its dominant singularity, they showed that given t, there exists a positive constant $\beta_t$ such that $h_{n,1}(t)\sim\beta_t C_{n-1}$.

The construction of Disanto & Rosenberg [6] that enumerated coalescent histories of $t^{(n+1)}$ from those of $t^{(n)}$ does not use G = S. Thus, it applies for identically-labeled caterpillar-like families generated from nonmatching seed trees $t_G$ and $t_S$ of the same size, as does the associated procedure for obtaining the generating function f(z). In particular, it applies for the number of coalescent histories for caterpillar-like families with p-pseudocaterpillar G and identically-labeled caterpillar S.

We first derive an expression for $e_{(n,p),r}$, the number of r-extended coalescent histories for the p-pseudocaterpillar gene tree with n ≥ p leaves on an identically-labeled caterpillar species tree of n leaves. In our notation, dividing the root-branch amounts to adding right-steps on the diagram for $u_k$ (Section 4.2) and increasing the range of the index k. We have

$$e_{(n,p),r}=\sum_{k=p-1}^{n+r-2}C(k,\,p-3)\,(k-p+2)\,C_t(n-p,\,n+r-k-2,\,r). \tag{12}$$

To obtain this expression, note that the extension of the species tree from 1 to r branches ancestral to the root does not affect the C(k, p − 3) and k − p + 2 terms, representing coalescences descended from the pivotal coalescence. However, the Catalan trapezoid that tabulates coalescent histories ancestral to the pivotal coalescence is affected. The number of coalescences ancestral to the pivotal coalescence continues to be n − p. The number of available branches is now n + r − k − 2 instead of n − k − 1. Traversing the paths “forward,” the trapezoid has order k − p + 2, one more than the number of coalescences that can occur on the species tree branch on which the pivotal coalescence takes place, giving a count of $C_t(n+r-k-2,\,n-p,\,k-p+2)$ (eq. (3)). The number of vertices on the upper edge of the trapezoid is r, so that if paths are traversed in reverse order, the number of r-extended coalescent histories is, equivalently, $C_t(n-p,\,n+r-k-2,\,r)$.

The associated number of m-rooted coalescent histories, 1 ≤ m ≤ r, then satisfies

$$h_{(n,p),m}=e_{(n,p),m}-e_{(n,p),m-1}. \tag{13}$$

Noting that n = p for the seed tree for the p-pseudocaterpillar family and applying eqs. (12) and (13) gives generating function $g_p(y)$,

$$g_p(y)=\sum_{m=1}^{\infty}h_{(p,p),m}\,y^m=\sum_{m=1}^{\infty}\frac{m\,(m+2)\,(2p+m-5)!}{(p-3)!\,(p+m-1)!}\,y^m. \tag{14}$$

For small p, the generating functions $g_p$ can be simplified as in Table 2.

Table 2:

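Generating functions. Generating function $g_p(y)$ counts m-rooted histories $h_{(p,p),m}$ for caterpillar-like families with a seed p-pseudocaterpillar gene tree and identically-labeled caterpillar species tree (eq. (14)); generating function $f_p(z)$ counts coalescent histories $h_{(p+k,p),1}$ (eq. (15)).

| p | $g_p(y)$ | $f_p(z)$ |
|---|---|---|
| 3 | $\frac{y}{(y-1)^2}$ | $\frac{2\left(-\sqrt{1-4z}+1\right)}{z\left(\sqrt{1-4z}+1\right)^2}$ |
| 4 | $\frac{y^2-3y}{(y-1)^3}$ | $\frac{8\left(z-\sqrt{1-4z}+1\right)}{z\left(\sqrt{1-4z}+1\right)^3}$ |
| 5 | $\frac{2y^3-8y^2+9y}{(y-1)^4}$ | $\frac{8\left(\sqrt{1-4z}-1\right)\left(2z-3\sqrt{1-4z}-6\right)}{z\left(\sqrt{1-4z}+1\right)^4}$ |
| 6 | $\frac{5y^4-25y^3+44y^2-28y}{(y-1)^5}$ | $\frac{16\left(-10z^2+33z+15z\sqrt{1-4z}-4\sqrt{1-4z}+4\right)}{z\left(\sqrt{1-4z}+1\right)^5}$ |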

From $g_p(y)$, we then obtain the generating function $f_p(z)$ that counts coalescent histories as the numbers of leaves in the gene tree and species tree increase from p:

$$f_p(z)=\sum_{k=0}^{\infty}h_{(p+k,p),1}\,z^k=\frac{g_p\!\left(\frac{1-\sqrt{1-4z}}{2}\right)}{z}. \tag{15}$$

Using eq. (14), these generating functions can also be simplified for small p (Table 2).

Expanding the entries in Table 2, we obtain, for example:

$$f_3(z)=1+3z+9z^2+28z^3+90z^4+297z^5+1001z^6+3432z^7+O(z^8)$$
$$f_4(z)=3+11z+37z^2+124z^3+420z^4+1441z^5+5005z^6+17576z^7+O(z^8).$$

Each function gives the values h(n, p) (eq. (8)) as n is incremented beginning with n = p.
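The expansions above can be reproduced by composing power series directly, without the closed forms of Table 2. The sketch below uses the coefficients of eq. (14) for $g_p$ and the series y(z) = (1 − √(1 − 4z))/2 = Σ C_j z^(j+1), then divides by z as in eq. (15). Helper names are illustrative, and the truncation degree is a free parameter of the sketch.

```python
from math import comb, factorial


def seed_rooted_histories(p, m):
    """Coefficient of y^m in g_p(y), eq. (14); the division is exact."""
    return (m * (m + 2) * factorial(2 * p + m - 5)
            // (factorial(p - 3) * factorial(p + m - 1)))


def multiply_truncated(a, b, degree):
    """Product of two coefficient lists, discarding powers above `degree`."""
    out = [0] * (degree + 1)
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if i + j > degree:
                break
            out[i + j] += ai * bj
    return out


def f_coefficients(p, terms):
    """First `terms` coefficients of f_p(z) = g_p(y(z)) / z, eq. (15)."""
    # y(z) has no constant term; its z^(j+1) coefficient is the Catalan number C_j.
    y = [0] + [comb(2 * j, j) // (j + 1) for j in range(terms)]
    composed = [0] * (terms + 1)
    power = [1] + [0] * terms  # y(z)^0
    for m in range(1, terms + 1):
        power = multiply_truncated(power, y, terms)  # y(z)^m starts at z^m
        weight = seed_rooted_histories(p, m)
        composed = [c + weight * q for c, q in zip(composed, power)]
    return composed[1:]  # dividing by z shifts every coefficient down by one


if __name__ == "__main__":
    print(f_coefficients(3, 8))
    print(f_coefficients(4, 8))
```

Because y(z)^m begins at z^m, truncating the outer sum at m = terms loses nothing below the chosen degree.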

For p-pseudocaterpillar gene trees, we can compare the values for the limiting constants βp for two choices of the species tree S: the case in which the species tree has the same p-pseudocaterpillar labeled topology, and the case of an identically-labeled caterpillar species tree. The former value, from Disanto & Rosenberg [6], exceeds the latter for p = 3 and p = 4 (Table 3). For p = 5 to p = 9, however, βp for the non-matching caterpillar S exceeds that for a matching p-pseudocaterpillar.

Table 3:

Numerical values of the constant $\beta_p$ describing $\lim_{n\to\infty}[h(n,p)/C_{n-1}]$, the asymptotic ratio of the number of coalescent histories h(n, p) to the Catalan number $C_{n-1}$. The gene tree has a p-pseudocaterpillar topology. Values for the case that the gene tree and species tree S have a matching p-pseudocaterpillar topology are taken from Table 1 of Disanto & Rosenberg [6]; values for identically-labeled caterpillar S are taken from Table 1.

| p | $\beta_p$, matching p-pseudocaterpillar S | $\beta_p$, caterpillar S |
|---|---|---|
| 3 | 1.0000 | 0.7500 |
| 4 | 1.2500 | 1.1875 |
| 5 | 1.4375 | 1.5313 |
| 6 | 1.5938 | 1.8242 |
| 7 | 1.7305 | 2.0840 |
| 8 | 1.8535 | 2.3198 |
| 9 | 1.9663 | 2.5374 |

Rosenberg & Degnan [16] had shown that the case of a 4-pseudocaterpillar gene tree and an identically-labeled caterpillar species tree produced more coalescent histories ($\beta_p$ = 1.1875) than the case of matching caterpillar gene tree and species tree ($C_{n-1}$ coalescent histories, and hence a limiting ratio of 1). The table demonstrates that p-pseudocaterpillar gene trees for each p from 5 to 9 also produce more coalescent histories than the matching caterpillar gene tree.

6. Maximal number of coalescent histories for fixed n

Applying Theorem 4, we can calculate h(n, p) systematically for small n and all p with 3 ≤ p ≤ n. Table 4 shows the values of h(n, p) for all (n, p) with n ≤ 12.

Table 4:

Values of the function h(n, p) (eq. (8)) for small values of n and p.

p
n 3 4 5 6 7 8 9 10 11 12
3 1
4 3 3
5 9 11 9
6 28 37 37 28
7 90 124 134 124 90
8 297 420 473 473 420 297
9 1001 1441 1665 1735 1665 1441 1001
10 3432 5005 5885 6291 6291 5885 5005 3432
11 11934 17576 20930 22766 23354 22766 20930 17576 11934
12 41990 62322 74932 82537 86149 86149 82537 74932 62322 41990

The table suggests two patterns. First, we can see that a symmetry exists in which h(n, p) = h(n, n − p + 3). We will verify this symmetry in Section 7. Second, we observe that for each n, the value of p that maximizes h(n, p) lies in the middle, repeating for two adjacent values of p when n is even. We state this result formally in the following theorem.

Theorem 7. Consider a caterpillar species tree S with n ≥ 4 leaves. Among identically-labeled p-pseudocaterpillar gene trees G with n leaves and 3 ≤ p ≤ n, the value of p that maximizes the number of coalescent histories h(n, p) for (G, S) is

$$p_m=\frac{n+3}{2} \tag{16}$$

if n is odd. If n is even, then two adjacent maxima exist:

$$p_{m1}=\frac{n+2}{2},\qquad p_{m2}=\frac{n+4}{2}. \tag{17}$$

For n = 3 and n = 4, the result is trivial, as n = 3 requires p = 3, and for n = 4, h(n, 3) = h(n, 4) = 3. For n ≥ 5, the proof proceeds in three steps.

  1. First, in Section 6.1, for n ≥ 5 and 4 ≤ p ≤ n, we describe a difference function D(n, p) that measures the change in the function h(n, p) when we increment p by 1 for fixed n.

  2. Next, in Section 6.2, we show that the difference function D(n, p) is positive for p = 4 (Lemma 8) and negative for p = n (Lemma 9), and that it monotonically decreases as the integer p is incremented from 4 to n (Lemma 10).

  3. Finally, in Section 6.3, we deduce that for fixed n ≥ 5, if D(n, p) ≠ 0 for all p, 4 ≤ p ≤ n, then a unique integer p exists at which h(n, p) is maximal; two maxima exist if D(n, p) = 0 for some p. We confirm that the maxima of h(n, p) are described by eqs. (16) and (17).

6.1. Difference function

For n ≥ 5 and 4 ≤ p ≤ n, we define the difference function of h(n, p):

$$D(n,p)=h(n,p)-h(n,p-1). \tag{18}$$

Because h(n, p) is defined as a sum, D(n, p) is also an expression involving a summation.

We find a closed-form expression for D(n, p). We start by expanding eq. (18) using eq. (9):

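$$D(n,p)=\sum_{k=p-1}^{n-1}\left[\frac{(k-p+2)^2(k-p+4)\,(k+p-3)!\,(2n-p-k-1)!}{(k+1)!\,(p-3)!\,(n-k-1)!\,(n-p+1)!}-\frac{(k-p+3)^2(k-p+5)\,(k+p-4)!\,(2n-p-k)!}{(k+1)!\,(p-4)!\,(n-k-1)!\,(n-p+2)!}\right]-\frac{3\,(2n-2p+2)!\,(2p-6)!}{(n-p+1)!\,(n-p+2)!\,(p-4)!\,(p-1)!}.$$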

Notice that because the sums in expressions for h(n, p) and h(n, p − 1) have different summation limits, we obtain an additional term outside the sum. The sum in the expression for D(n, p) can be transformed into a closed form, which gives the following formula:

$$D(n,p)=\frac{2p\,(np+7p-2p^2-6)\,(2n-2p+2)!\,(2p-5)!}{n\,(n-1)\,(n-p)!\,(n-p+2)!\,(p-3)!\,p!}-\frac{3\,(2n-2p+2)!\,(2p-6)!}{(n-p+1)!\,(n-p+2)!\,(p-4)!\,(p-1)!}. \tag{19}$$

The proof appears in Appendix B.
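Independently of the derivation in Appendix B, the closed form for D(n, p) can be compared with the defining difference in eq. (18). The sketch below assumes that eq. (19) reads as the difference of two factorial ratios with numerator polynomial np + 7p − 2p² − 6, and that the Catalan triangle is C(n, k) = binom(n+k, k) − binom(n+k, k−1); all names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def catalan_triangle(n, k):
    """Assumed form of eq. (2)."""
    return 1 if k == 0 else comb(n + k, k) - comb(n + k, k - 1)


def h(n, p):
    """Eq. (8)."""
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def D_closed(n, p):
    """Assumed reading of eq. (19), valid for n >= 5 and 4 <= p <= n."""
    shared = factorial(2 * n - 2 * p + 2)
    first = Fraction(
        2 * p * (n * p + 7 * p - 2 * p * p - 6) * shared * factorial(2 * p - 5),
        n * (n - 1) * factorial(n - p) * factorial(n - p + 2)
        * factorial(p - 3) * factorial(p),
    )
    second = Fraction(
        3 * shared * factorial(2 * p - 6),
        factorial(n - p + 1) * factorial(n - p + 2)
        * factorial(p - 4) * factorial(p - 1),
    )
    return first - second


if __name__ == "__main__":
    n = 9
    print([int(D_closed(n, p)) for p in range(4, n + 1)])
    print([h(n, p) - h(n, p - 1) for p in range(4, n + 1)])
```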

6.2. Sign of the difference function

Using eq. (19) for D(n, p), we prove three lemmas concerning the sign of D(n, p).

Lemma 8. For p = 4, the function D(n, p) is positive for all n ≥ 5.

Proof. When we substitute p = 4 into eq. (19), we obtain

$$D(n,4)=\frac{4\,(2n-5)!}{n!\,(n-4)!}-\frac{(2n-6)!}{(n-2)!\,(n-3)!}=\frac{2\,(7n-15)\,(2n-7)!}{n!\,(n-5)!}.$$

Because n ≥ 5, all terms in this fraction are positive. □

Lemma 9. For p = n, the function D(n, p) is negative for all n ≥ 5.

Proof. Substituting p = n into eq. (19), we obtain

$$D(n,n)=-\frac{2\,(n-6)\,(2n-5)!}{n!\,(n-3)!}-\frac{3\,(2n-6)!}{(n-4)!\,(n-1)!}.$$

For n ≥ 7, D(n, n) is quickly seen to be a sum of two negative numbers. It remains to check the cases of n = 5 and n = 6: D(5, 5) = −2 and D(6, 6) = −9. □

Lemma 10. D′(n, p) = D(n, p) − D(n, p − 1) is negative for n ≥ 5 and 5 ≤ p ≤ n. That is, for each n ≥ 5, D(n, p) monotonically decreases as the integer p is incremented from p = 4 to p = n.

Proof. The expression for D′(n, p) can be simplified to

$$D'(n,p)=\frac{2\,(4p^2-4np-20p+11n+27)\,(2n-2p+2)!\,(2p-8)!}{(n-p+1)!\,(n-p+3)!\,(p-4)!\,(p-2)!}.$$

Because 5 ≤ p ≤ n, the term that determines the sign of D′(n, p) is the polynomial f(n, p) = 4p² − 4np − 20p + 11n + 27 in the numerator. Solving the inequality f(n, p) < 0 for p, we obtain

$$\frac{n+5}{2}-\frac{1}{2}\sqrt{n^2-n-2}<p<\frac{n+5}{2}+\frac{1}{2}\sqrt{n^2-n-2}.$$

The left-hand term is bounded above by 3 for all n ≥ 5, and the right-hand term exceeds n for all n ≥ 5. Hence, because 5 ≤ p ≤ n, all possible values of (n, p) satisfy the inequality. □

6.3. Location of the maximum

As a result of Lemmas 8–10, for n ≥ 5, as p is incremented from 4 to n, D(n, p) monotonically decreases (Lemma 10) from a positive value at p = 4 (Lemma 8) to a negative value at p = n (Lemma 9). Hence, h(n, p) increases from p = 3 to a maximum then decreases until p = n.

Two cases are possible. Given n, a unique value $p = p_{m1}$ could exist at which D(n, p) = 0, in which case $h(n, p_{m1}) = h(n, p_{m1} - 1)$, and both $p_{m1}$ and $p_{m2} = p_{m1} - 1$ are maxima. Alternatively, if D(n, p) ≠ 0 for all p, then h(n, p) is maximized at the largest value of p for which D(n, p) > 0.

If n ≥ 6 is even, then inserting $p=\frac{n+4}{2}$ into eq. (19), we obtain $D\!\left(n,\frac{n+4}{2}\right)=0$. Hence, $h\!\left(n,\frac{n+4}{2}\right)=h\!\left(n,\frac{n+2}{2}\right)$, and maxima of h(n, p) occur at both $p_{m1}=\frac{n+4}{2}$ and $p_{m2}=\frac{n+2}{2}$.

If n ≥ 5 is odd, we show that a value $p_m$ ≥ 4 exists for which $D(n, p_m) > 0$ and $D(n, p_m + 1) < 0$. This value $p_m$ maximizes h(n, p).

Lemma 11. For odd n ≥ 5, with n = 2k + 1 and k ≥ 2, (i) D(n, k + 2) > 0, and (ii) D(n, k + 3) < 0.

Proof. (i) We insert (n, p) = (2k + 1, k + 2) into eq. (19), obtaining the positive quantity

$$D(2k+1,\,k+2)=\frac{(2k)!\,(2k-2)!}{(k-1)!\,(k!)^2\,(k+1)!}.$$

(ii) Inserting (n, p) = (2k + 1, k + 3) into eq. (19), we obtain

$$D(2k+1,\,k+3)=-\frac{2\,(2k)!\,(2k-3)!}{(k-2)!\,(k!)^2\,(k+1)!},$$

a quantity that is negative. □

We conclude h(2k + 1, k + 2) > h(2k + 1, k + 1), but h(2k + 1, k + 3) < h(2k + 1, k + 2). Hence, for odd n ≥ 5, writing $k=\frac{n-1}{2}$, $p_m=\frac{n+3}{2}$ maximizes h(n, p). The proof of Theorem 7 is complete.
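Theorem 7 can be illustrated by brute force: for each n, evaluate h(n, p) for all p and list the maximizing values. The sketch below assumes the Catalan triangle C(n, k) = binom(n+k, k) − binom(n+k, k−1) in eq. (8); `maximizers` is an illustrative helper name.

```python
from math import comb


def catalan_triangle(n, k):
    """Assumed form of eq. (2)."""
    return 1 if k == 0 else comb(n + k, k) - comb(n + k, k - 1)


def h(n, p):
    """Eq. (8)."""
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def maximizers(n):
    """All p in [3, n] at which h(n, p) attains its maximum, in increasing order."""
    values = {p: h(n, p) for p in range(3, n + 1)}
    best = max(values.values())
    return [p for p, v in values.items() if v == best]


def predicted(n):
    """Eqs. (16) and (17)."""
    if n % 2 == 1:
        return [(n + 3) // 2]
    return [(n + 2) // 2, (n + 4) // 2]


if __name__ == "__main__":
    for n in range(5, 13):
        print(n, maximizers(n), predicted(n))
```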

6.4. Asymptotic growth of the maximal number of coalescent histories

With the value $p_m$ that maximizes h(n, p) established, we now examine the asymptotic growth of the maximum. We quickly verify that for a fixed caterpillar species tree with n leaves, across all p-pseudocaterpillar gene trees with fixed n, the maximal number of coalescent histories exceeds the Catalan number $C_{n-1}$ describing the number of coalescent histories for the matching caterpillar. In Section 5.2, we showed that for fixed small p ≥ 4, as n increases, the number of coalescent histories grows with a constant multiple of $C_{n-1}$, with the constant exceeding 1. Here we show that for each n ≥ 7, the maximal number of coalescent histories, that is, $h(n, p_m)$ for odd n and $h(n, p_{m1}) = h(n, p_{m2})$ for even n, exceeds the corresponding Catalan number. We abbreviate $p_m = p_{m1} = p_{m2}$ for even n, so that the sequence of values of $h(n, p_m)$ is well-defined for n ≥ 4.

Proposition 12. For odd n ≥ 7, $h(n,p_m)>C_{n-1}$, and for even n ≥ 8, $h(n,p_m)=h(n,p_{m1})=h(n,p_{m2})>C_{n-1}$, where h is defined by eq. (8), $p_m$ by eq. (16), $p_{m1}$ and $p_{m2}$ by eq. (17), and $C_{n-1}$ by eq. (1).

Proof. For n = 7, we have $h(n, p_m)$ = h(7, 5) = 134, which exceeds $C_6$ = 132. For n = 8, $h(n, p_{m1}) = h(n, p_{m2})$ = h(8, 5) = h(8, 6) = 473, which exceeds $C_7$ = 429.

Lemma 4.2 of Rosenberg & Degnan [16] showed that for n ≥ 9, $h(n,4)>C_{n-1}$. By Theorem 7, for odd n ≥ 9, $h(n, p_m) > h(n, 4)$, and for even n ≥ 10, $h(n, p_{m1}) = h(n, p_{m2}) > h(n, 4)$. Thus, because the maximal number of coalescent histories across all p exceeds the number for p = 4, and because the number of coalescent histories for p = 4 exceeds $C_{n-1}$, the maximum exceeds $C_{n-1}$. □

We introduce the definition of the exponential order of a sequence: a sequence $\{a_n\}$ has exponential order k if $\limsup_{n\to\infty}\sqrt[n]{a_n}=k$ [7]. In other words, $a_n = k^n s(n)$ where s(n) is a subexponential factor with $\limsup_{n\to\infty}\sqrt[n]{s(n)}=1$. If sequences $a_n$ and $b_n$ have the same exponential order, we write $a_n \bowtie b_n$.

The Catalan numbers $C_n$ have exponential order 4, as Stirling’s approximation to $C_n=(2n)!/[(n+1)(n!)^2]$ gives $C_n\sim 4^n/(n^{3/2}\sqrt{\pi})$. Plotting $\log h(n, p_m)$ and $\log C_{n-1}$ as functions of n, we see that they grow approximately linearly with similar slopes (Figure 6). We therefore claim that the sequence $h(n, p_m)$ also has exponential order 4.

Figure 6:

For caterpillar species trees of size n, the natural logarithms of the maximal number of coalescent histories across p-pseudocaterpillar gene trees ($h(n, p_m)$) and the number of coalescent histories of the matching caterpillar gene tree topology ($C_{n-1}$). The quantity $h(n, p_m)$ is computed according to Theorem 7, and $C_{n-1}$ follows eq. (1).

Proposition 13. For caterpillar species trees with n leaves, the sequence $h(n, p_m)$ describing the maximal number of coalescent histories across all p-pseudocaterpillar gene trees of size n has exponential order 4, so that $h(n,p_m)\bowtie C_n$.

Proof. By Lemmas 17 and 19 in Appendix C, $C_{n-2}\le h(n,p_m)\le n\,C_{n+2}$ for n ≥ 3, from which

$$\lim_{n\to\infty}\sqrt[n]{C_{n-2}}\le\lim_{n\to\infty}\sqrt[n]{h(n,p_m)}\le\lim_{n\to\infty}\sqrt[n]{n\,C_{n+2}}.$$

As the left-hand and right-hand limits both equal 4, we conclude $\lim_{n\to\infty}\sqrt[n]{h(n,p_m)}=4$, $h(n,p_m)=4^n s(n)$ for some subexponential s(n), and $h(n,p_m)\bowtie C_n$. □
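The sandwich used in this proof, and the slow approach of the n-th root toward 4, can be observed numerically. The sketch below assumes the bounds read $C_{n-2}\le h(n,p_m)\le n\,C_{n+2}$ and takes $p_m$ from Theorem 7, using (n + 2)/2 for even n; the helper names are illustrative, and the n-th root is computed through logarithms to avoid overflow.

```python
from math import comb, exp, log


def catalan(n):
    return comb(2 * n, n) // (n + 1)


def catalan_triangle(n, k):
    """Assumed form of eq. (2)."""
    return 1 if k == 0 else comb(n + k, k) - comb(n + k, k - 1)


def h(n, p):
    """Eq. (8)."""
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def h_max(n):
    """h(n, p_m), with p_m from eqs. (16) and (17)."""
    p_m = (n + 3) // 2 if n % 2 == 1 else (n + 2) // 2
    return h(n, p_m)


def nth_root(value, n):
    """n-th root of a (possibly very large) positive integer."""
    return exp(log(value) / n)


if __name__ == "__main__":
    for n in (10, 50, 100, 200):
        print(n, nth_root(catalan(n - 2), n), nth_root(h_max(n), n),
              nth_root(n * catalan(n + 2), n))
```

The convergence is slow because of the subexponential factor, so the printed roots remain visibly below 4 even for n in the hundreds.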

7. Symmetry

We now verify the symmetry h(n, p) = h(n, n − p + 3) observed in Table 4 for all (n, p) with 3 ≤ p ≤ n. For convenience, given a pseudocaterpillar tree with second cherry at position p, we define its dual as the pseudocaterpillar tree with second cherry at position n − p + 3 (Figure 7).

Figure 7:

Dual p-pseudocaterpillar gene trees with the same number of coalescent histories (1665) if paired with the identically-labeled caterpillar species tree. (A) (n, p) = (9, 5). (B) (n, p) = (9, 7).

We show that for a fixed caterpillar species tree, the number of coalescent histories of an identically-labeled pseudocaterpillar is equal to the number of coalescent histories of its dual. Thus, the formula for the number of coalescent histories has a symmetry in the position of the second cherry on the p-pseudocaterpillar gene tree.

Theorem 14. For all (n, p) with 3 ≤ p ≤ n, h(n, p) = h(n, n − p + 3).

For n = 3, the claim is trivial, as p = n − p + 3 = 3. For n = 4, the claim is also trivial, as h(4, 3) = h(4, 4) = 3. For n ≥ 5, we proceed in three steps.

  1. First, in Section 7.1, we introduce a dual difference function D*(n, p) that measures the change in h(n, n − p + 3) as p is incremented for fixed n.

  2. Next, in Section 7.2, we show that the dual difference function D*(n, p) is equal to the difference function D(n, p) for all allowed values of (n, p).

  3. Finally, in Section 7.3, we use this equality of difference functions to complete the proof of the symmetry of h(n, p).

7.1. Dual difference function

We define a function D* “dual” to the difference function D (eq. (18)):

$$D^*(n,p)=h(n,\,n-p+3)-h(n,\,n-p+4). \tag{20}$$

The function is well defined for n ≥ 5 and 4 ≤ p ≤ n, where n − p + 3 and n − p + 4 lie in [3, n].

Using the definition of h(n, p) from eq. (9), we obtain

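$$D^*(n,p)=\sum_{k=n-p+3}^{n-1}\left[\frac{(k-n+p-1)^2(k-n+p+1)\,(k+n-p)!\,(n+p-k-4)!}{(k+1)!\,(p-2)!\,(n-k-1)!\,(n-p)!}-\frac{(k-n+p-2)^2(k-n+p)\,(k+n-p+1)!\,(n+p-k-5)!}{(k+1)!\,(p-3)!\,(n-k-1)!\,(n-p+1)!}\right]+\frac{3\,(2p-6)!\,(2n-2p+2)!}{(p-3)!\,(p-2)!\,(n-p)!\,(n-p+3)!}.$$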

This sum can be simplified to get a closed form for D*:

$$D^*(n,p)=\frac{4\,(p-3)\,(n^2+2p^2-3np+5n-9p+10)\,(2p-7)!\,(2n-2p+3)!}{n\,(n-1)\,(n-p+1)\,(p-4)!\,(p-2)!\,(n-p)!\,(n-p+3)!}+\frac{3\,(2p-6)!\,(2n-2p+2)!}{(p-3)!\,(p-2)!\,(n-p)!\,(n-p+3)!}. \tag{21}$$

The proof appears in Appendix D.
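As with eq. (19), the closed form for the dual difference function can be checked against its definition in eq. (20). The sketch below assumes that eq. (21) reads as the sum of two factorial ratios with numerator polynomial n² + 2p² − 3np + 5n − 9p + 10, and uses the Catalan triangle C(n, k) = binom(n+k, k) − binom(n+k, k−1); the names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def catalan_triangle(n, k):
    """Assumed form of eq. (2)."""
    return 1 if k == 0 else comb(n + k, k) - comb(n + k, k - 1)


def h(n, p):
    """Eq. (8)."""
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def D_star_closed(n, p):
    """Assumed reading of eq. (21), valid for n >= 5 and 4 <= p <= n."""
    poly = n * n + 2 * p * p - 3 * n * p + 5 * n - 9 * p + 10
    first = Fraction(
        4 * (p - 3) * poly * factorial(2 * p - 7) * factorial(2 * n - 2 * p + 3),
        n * (n - 1) * (n - p + 1) * factorial(p - 4) * factorial(p - 2)
        * factorial(n - p) * factorial(n - p + 3),
    )
    second = Fraction(
        3 * factorial(2 * p - 6) * factorial(2 * n - 2 * p + 2),
        factorial(p - 3) * factorial(p - 2) * factorial(n - p) * factorial(n - p + 3),
    )
    return first + second


def D_star_definition(n, p):
    """Eq. (20)."""
    return h(n, n - p + 3) - h(n, n - p + 4)


if __name__ == "__main__":
    n = 9
    print([int(D_star_closed(n, p)) for p in range(4, n + 1)])
    print([D_star_definition(n, p) for p in range(4, n + 1)])
```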

7.2. Dual difference function is equal to the difference function

This section verifies the equality of the difference function and its dual.

Lemma 15. For all n ≥ 5 and 4 ≤ p ≤ n, the dual difference function equals the difference function:

$$D^*(n,p)=D(n,p).$$

Proof. We simplify D*(n, p)/D(n, p) using eqs. (19) and (21), verifying that this ratio equals 1. □

7.3. Completing the proof

Rearranging terms in the definitions of the difference functions by eqs. (18) and (20), we have

$$h(n,p)-h(n,\,n-p+3)=h(n,\,p-1)-h(n,\,n-p+4). \tag{22}$$

Decrementing p from n to 4, eq. (22) gives a chain of equalities $h(n,n)-h(n,3)=h(n,n-1)-h(n,4)=h(n,n-2)-h(n,5)=\cdots=h(n,3)-h(n,n)$.

In particular, for each p from 3 to n, $h(n,p)-h(n,\,n-p+3)=-[h(n,p)-h(n,\,n-p+3)]$. Both sides of this equation must then equal zero, from which we conclude $h(n,p)=h(n,\,n-p+3)$ for each p, 3 ≤ p ≤ n. The proof of Theorem 14 is complete.

Theorem 14 can strengthen Proposition 12. We now know that $h(n,p)>C_{n-1}$ for all (n, p) with n ≥ 9 and 4 ≤ p ≤ n − 1. From Rosenberg & Degnan [16], $h(n,4)>C_{n-1}$ for n ≥ 9. By Theorem 14, $h(n, n-1) = h(n, 4) > C_{n-1}$. In the proof of Theorem 7, we show that for n odd, h(n, p) increases as p increases from 4 to $\frac{n+3}{2}$, and Theorem 14 indicates that h(n, p) decreases as p increases from $\frac{n+3}{2}$ to n − 1; similarly, for n even, h(n, p) increases as p increases from 4 to $\frac{n+2}{2}$, with $h\!\left(n,\frac{n+2}{2}\right)=h\!\left(n,\frac{n+4}{2}\right)$, then decreases as p increases from $\frac{n+4}{2}$ to n − 1. Thus, $h(n,p)>C_{n-1}$ for all (n, p) with n ≥ 9 and 4 ≤ p ≤ n − 1.
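The symmetry of Theorem 14 is easy to observe numerically from eq. (8). The sketch below assumes the Catalan triangle C(n, k) = binom(n+k, k) − binom(n+k, k−1) and checks h(n, p) = h(n, n − p + 3) over a finite range; `is_symmetric` is an illustrative helper name.

```python
from math import comb


def catalan_triangle(n, k):
    """Assumed form of eq. (2)."""
    return 1 if k == 0 else comb(n + k, k) - comb(n + k, k - 1)


def h(n, p):
    """Eq. (8)."""
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def is_symmetric(n):
    """True if h(n, p) equals h(n, n - p + 3) for every p in [3, n]."""
    return all(h(n, p) == h(n, n - p + 3) for p in range(3, n + 1))


if __name__ == "__main__":
    print([h(9, p) for p in range(3, 10)])   # row n = 9 of Table 4
    print(all(is_symmetric(n) for n in range(3, 31)))
```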

8. Discussion

We have developed a method for counting coalescent histories in cases in which the gene tree and species tree topologies do not match, considering p-pseudocaterpillar gene trees together with an identically-labeled caterpillar species tree. Using a combinatorial construction, we find that the recursive formula from Rosenberg [13] can be evaluated non-recursively as a sum (eq. (9))—which can in turn be simplified to a closed form for fixed small p (Section 5). The number of coalescent histories h(n, p) (eq. (8)) has a symmetry in p (Theorem 14), and the maximum over values of p for each n is attained when the “second cherry” lies in the “middle” of the gene tree (Theorem 7).

Results on the value of p that maximizes h(n, p) verify an informal observation from previous studies. It has been noted that for fixed n, large numbers of coalescent histories tend to occur when two conditions are met: the number of distinct sequences in which coalescences can be arranged is large, as is the number of species tree branches describing potential placements of those coalescences [5, 6, 13, 14, 16]. For a fixed caterpillar species tree, identically-labeled p-pseudocaterpillar gene trees represent a tradeoff of these two features. As p increases, more sequences exist for coalescences descended from the pivotal coalescence. However, the number of species tree branches on which the pivotal coalescence can occur decreases, so that fewer species tree branches exist on which the larger number of coalescence sequences can occur. That h(n, p) is maximized when p lies in the “middle” aligns with the informal observation that both conditions—many coalescence sequences, and many species tree branches on which coalescences take place—are important for generating large numbers of coalescent histories.

Table 5 compares coalescent histories in three cases: matching caterpillars, matching p-pseudocaterpillars, and caterpillar species trees with identically-labeled non-matching p-pseudocaterpillar gene trees. For a caterpillar species tree, as the number of species n grows to 9 or greater, the number of coalescent histories for identically-labeled p-pseudocaterpillar gene trees with 4 ≤ p ≤ n − 1 exceeds the Catalan number of coalescent histories for the matching gene tree (Proposition 12). For fixed p, more coalescent histories can occur for the non-matching p-pseudocaterpillar gene tree and identically-labeled caterpillar species tree than for matching p-pseudocaterpillars (Tables 3 and 5).

Table 5:

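Numbers of coalescent histories for matching caterpillar gene trees and species trees ($C_{n-1}$, right-hand column), caterpillar species trees and identically-labeled p-pseudocaterpillar gene trees (top entry in each cell), and matching p-pseudocaterpillar gene trees and species trees (bottom entry). Top entries are from Table 4, and bottom entries are from Rosenberg [13]. Each cell is written as top entry / bottom entry.

| n | p = 3 | p = 4 | p = 5 | p = 6 | p = 7 | p = 8 | p = 9 | Matching caterpillar |
|---|---|---|---|---|---|---|---|---|
| 3 | 1 / 2 | | | | | | | 2 |
| 4 | 3 / 5 | 3 / 4 | | | | | | 5 |
| 5 | 9 / 14 | 11 / 13 | 9 / 10 | | | | | 14 |
| 6 | 28 / 42 | 37 / 42 | 37 / 37 | 28 / 28 | | | | 42 |
| 7 | 90 / 132 | 124 / 138 | 134 / 130 | 124 / 112 | 90 / 84 | | | 132 |
| 8 | 297 / 429 | 420 / 462 | 473 / 453 | 473 / 416 | 420 / 354 | 297 / 264 | | 429 |
| 9 | 1001 / 1430 | 1441 / 1573 | 1665 / 1584 | 1735 / 1511 | 1665 / 1368 | 1441 / 1155 | 1001 / 858 | 1430 |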

In related work, Disanto & Munarini [4] considered matching caterpillar gene trees and species trees, identifying the leaf whose replacement by a cherry in both trees would give rise to the greatest increase in the number of coalescent histories (measured as a ratio). This speciation, the splitting of a leaf node of G and S into two child nodes, can be interpreted as extending the trees by adding the “second cherry” that converts a caterpillar into a p-pseudocaterpillar. Disanto & Munarini [4] determined the value of p with which the p-pseudocaterpillar tree pair with n leaves would have the largest number of coalescent histories. Asymptotically, this value of p is equal to n/2 [4, i(n) for $D_n$ in Table 1], “in the middle,” as in our result in Section 6. Theorem 7 can then be seen to prove an analogous result in a nonmatching case, as the gene tree gains a cherry node whereas the species tree gains only a caterpillar leaf.

Our p = 3 case has a direct geometric interpretation in the framework of Himwich & Rosenberg [9], as it describes a non-matching pair of caterpillar trees (Figure 8). Its coalescent histories are described by monotonic paths on a lattice with a single roadblock. The p = n case, for which the number of coalescent histories is equal to the p = 3 case, can also be represented in a diagram with one roadblock, obtained by reflecting monotonic paths of the p = 3 case across y = n − 1 − x. Because monotonic paths not traveling above the y = x diagonal of a square lattice correspond to Dyck paths, and the coalescent histories for p = 3 correspond to Dyck paths beginning with two up-steps, the sequence $a_n = h(n, 3) = h(n, n)$ for n ≥ 3 gives the number of Dyck paths of length n − 1 beginning with two up-steps (OEIS A000245). We can also write $h(n,3)=h(n,n)=a(n)=C_{n-1}-C_{n-2}$.
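The identity relating the p = 3 and p = n cases to a difference of consecutive Catalan numbers can be confirmed from eq. (8). The sketch below assumes the Catalan triangle C(n, k) = binom(n+k, k) − binom(n+k, k−1) and the Catalan numbers C_n = binom(2n, n)/(n + 1); `catalan_difference` is an illustrative helper name.

```python
from math import comb


def catalan(n):
    return comb(2 * n, n) // (n + 1)


def catalan_triangle(n, k):
    """Assumed form of eq. (2)."""
    return 1 if k == 0 else comb(n + k, k) - comb(n + k, k - 1)


def h(n, p):
    """Eq. (8)."""
    return sum(
        catalan_triangle(k, p - 3) * (k - p + 2) * catalan_triangle(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def catalan_difference(n):
    """C_{n-1} - C_{n-2}, the value claimed for h(n, 3) = h(n, n)."""
    return catalan(n - 1) - catalan(n - 2)


if __name__ == "__main__":
    for n in range(3, 11):
        print(n, h(n, 3), h(n, n), catalan_difference(n))
```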

Figure 8:

Symmetry in the number of coalescent histories for a caterpillar species tree and identically-labeled p-pseudocaterpillar gene trees, for the cases of (n, 3) and (n, n). (A) For p = 3, coalescent histories correspond to roadblocked monotonic paths on a lattice with one roadblock. The first coalescence (B,C) on a gene tree cannot happen on the species tree edge ancestral to A and B. (B) The p = n case can also be seen to correspond to roadblocked monotonic paths. The second-to-last coalescence (E,F) can occur only on the edge ancestral to the species tree root.

Our work provides an extension of an earlier study of coalescent histories for non-matching caterpillars [9]. We expect that the method we have used has potential for extension to cases with more than two cherries, with the species tree remaining a caterpillar. In such an extension, each additional cherry would generate an additional “pivotal” coalescence and an additional summation based on the placement of that coalescence.

Acknowledgements.

We thank two reviewers for detailed comments. We acknowledge NIH grant R01 GM131404 for support.

A. Wilf-Zeilberger certificates for formulas in Table 1

This appendix gives the proof certificates for the identities in Table 1, all of which have similar proofs to Proposition 6. Only the Wilf-Zeilberger proof certificate R(m, k) differs across the cases. We list the proof certificates for the remaining identities in Table 6.

Table 6:

Wilf-Zeilberger proof certificates R(m, k) for expressions in Table 1.

p | R(m, k)
--- | ---
4 | $-\{(k-3)(k-2m+2)[k^3(19m^2-59m+42)+k^2(-19m^2+86m-72)+k(-65m^2+134m-68)-13m^2+m+14]\}/[2(k-2)k(m-1)^2(2m-3)(19m-2)(k-m-1)]$
5 | $-\{(k-4)(k-2m+3)[k^4(49m^3-303m^2+578m-330)+k^3(-49m^3+388m^2-858m+540)+k^2(-428m^3+2483m^2-4470m+2394)-2k(59m^3+13m^2-459m+342)+4(86m^3-477m^2+847m-480)]\}/[2(k-3)(k-1)(k+2)(m-2)^2(2m-5)(49m^2-58m+3)(k-m-1)]$
6 | $-\{(k-5)(k-2m+4)[k^5(467m^4-4786m^3+17233m^2-25394m+12600)+7k^4(133m^3-867m^2+1754m-1080)+k^3(-8870m^4+90676m^3-324682m^2+473420m-230304)+k^2(-9780m^4+80459m^3-234273m^2+285790m-127176)+3k(7531m^4-84096m^3+314435m^2-463542m+227352)+18(1705m^4-14774m^3+47147m^2-65566m+32088)]\}/[2(k-4)(k-2)(k+2)(k+3)(m-3)^2(2m-7)(467m^3-1517m^2+1126m-40)(k-m-1)]$
7 | $-\{(k-6)(k-2m+5)[k^6(1067m^5-16330m^4+94585m^3-256250m^2+319728m-143640)+k^5(2134m^5-30293m^4+163673m^3-415318m^2+486024m-204120)+k^4(-34377m^5+532064m^4-3106039m^3+8451124m^2-10543892m+4705320)+k^3(-104394m^5+1517539m^4-8380799m^3+21722234m^2-26044120m+11362200)+2k^2(62807m^5-1114568m^4+7105518m^3-20362003m^2+25974226m-11663880)+4k(165353m^5-2490002m^4+14086017m^3-37154572m^2+45368164m-20112720)+288(2001m^5-25450m^4+128585m^3-322100m^2+388324m-173040)]\}/[2(k-5)(k-3)(k+2)(k+3)(k+4)(m-4)^2(2m-9)(1067m^4-6727m^3+13027m^2-7697m+210)(k-m-1)]$
8 | $-\{(k-7)(k-2m+6)[k^7(4751m^6-101457m^5+860555m^4-3681375m^3+8289854m^2-9182568m+3825360)+k^6(23755m^6-495790m^5+4118327m^4-17279672m^3+38197692m^2-41536512m+16964640)+k^5(-206286m^6+4477837m^5-38491018m^4+166405493m^3-377604962m^2+420063216m-174878640)-3k^4(468054m^6-9846179m^5+82298858m^4-346924603m^3+769533814m^2-839428224m+344802960)+k^3(-228007m^6+517454m^5+26651409m^4-221650784m^3+680141896m^2-877046808m+386542080)+k^2(13435271m^6-292844363m^5+2510774503m^4-10762507513m^3+24114363462m^2-26483935680m+10933917840)+k(30799166m^6-626694930m^5+5077031510m^4-20873258910m^3+45641745164m^2-49626013680m+20475669600)+600(37811m^6-656541m^5+4768367m^4-18525795m^3+39673190m^2-43034952m+17843760)]\}/[2(k-6)(k-4)(k+2)(k+3)(k+4)(k+5)(m-5)^2(2m-11)(4751m^5-49196m^4+178555m^3-265930m^2+136104m-3024)(k-m-1)]$
9 | $-\{(k-8)(k-2m+7)[k^8(10393m^7-295169m^6+3444763m^5-21287315m^4+74661412m^3-147149156m^2+148827072m-58378320)+k^7(93537m^7-2629468m^6+30398601m^5-186195970m^4+647532468m^3-1265548552m^2+1268906064m-492972480)+k^6(-463559m^7+13467688m^6-160270925m^5+1007098102m^4-3582455306m^3+7142486032m^2-7285730520m+2869943328)+k^5(-7040967m^7+199285124m^6-2316820929m^5+14253708290m^4-49732292718m^3+97411690736m^2-97808625096m+38061051840)+k^4(-14147654m^7+380248249m^6-4210420826m^5+24757920421m^4-82938771506m^3+157002585316m^2-153809636904m+59074800624)+2k^3(39720207m^7-1170499052m^6+14047475388m^5-88527503360m^4+314151700473m^3-621935009348m^2+628262828172m-245216321280)+12k^2(34497145m^7-972446354m^6+11227172479m^5-68469189584m^4+236767720390m^3-460533277076m^2+460767938016m-179311977936)+72k(9940368m^7-263143441m^6+2885073381m^5-16942211875m^4+57248478267m^3-110112505504m^2+109806816564m-42781636800)+8640(57016m^7-1276772m^6+12427702m^5-68227355m^4+223726804m^3-426503693m^2+425617338m-166486320)]\}/[2(k-7)(k-5)(k+2)(k+3)(k+4)(k+5)(k+6)(m-6)^2(2m-13)(10393m^6-160060m^5+931642m^4-2537428m^3+3195589m^2-1476928m+27720)(k-m-1)]$

B. Proof of the closed form for D(n, p) from eq. (19)

In this appendix, we prove the closed-form expression for the difference function D(n, p). In particular, we focus on the term that contains a summation over k.

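Lemma 16. For all (n, p) with n ≥ 5 and 4 ≤ p ≤ n, the following identity holds:

$$
\begin{aligned}
F(n,p) &= \sum_{k=p-1}^{n-1} \left[ \frac{(k-p+2)^2 (k-p+4)\,(k+p-3)!\,(2n-p-k-1)!}{(k+1)!\,(p-3)!\,(n-k-1)!\,(n-p+1)!} - \frac{(k-p+3)^2 (k-p+5)\,(k+p-4)!\,(2n-p-k)!}{(k+1)!\,(p-4)!\,(n-k-1)!\,(n-p+2)!} \right] \\
&= \frac{2p\,(np + 7p - 2p^2 - 6)\,(2n-2p+2)!\,(2p-5)!}{n(n-1)\,(n-p)!\,(n-p+2)!\,(p-3)!\,p!}.
\end{aligned}
$$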

Proof. Let $\Delta_k$ denote the forward difference operator in k, meaning that $\Delta_k(f) = f(k+1) - f(k)$. Let f(n, p, k) be the summand in the expression for F(n, p). We sum the equation

$$f(n,p,k) = \Delta_k\bigl(h_{n,p}(k)\bigr) \tag{23}$$

over k, from k = p − 1 to k = n − 1. The left-hand side of eq. (23) is the summand in the statement of the lemma, and the function $h_{n,p}(k)$ is the output of Gosper’s algorithm [10, 11]:

$$
\begin{aligned}
h_{n,p}(k) = {} & \bigl[(k+1)(2n-p-k)\bigl(-k^3n^2 + 2k^3n - k^3 + 3k^2n^2p - 9k^2n^2 - 4k^2np + 15k^2n - 3k^2p + 6k^2 \\
& \quad - 3kn^2p^2 + 18kn^2p - 29kn^2 + 2knp^2 - 6knp + 4kn - 3kp^2 + 12kp - 11k \\
& \quad + n^2p^3 - 9n^2p^2 + 25n^2p - 21n^2 + 3np^2 - 14np + 15n - p^3 + 6p^2 - 11p + 6\bigr) \\
& \quad \times (k+p-4)!\,(2n-p-k-1)!\bigr] \\
& \big/ \bigl[(n-1)\,n\,(p-3)\,(k+1)!\,(n-p+2)\,(p-4)!\,(n-k-1)!\,(n-p+1)!\bigr].
\end{aligned}
\tag{24}
$$

We verify eq. (23) by using eq. (24). After summation, the left-hand side becomes F(n, p). The right-hand side telescopes, so that all terms except the first and the last cancel:

$$F(n,p) = h_{n,p}(n) - h_{n,p}(p-1). \tag{25}$$

We obtain the statement of the lemma by algebraic simplification of the right-hand side of eq. (25). □
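Lemma 16 and the telescoping step can be spot-checked with exact rational arithmetic. The following is a minimal sketch, not the authors' Mathematica computation. It reads the factor in the denominator of eq. (24) as (n − k − 1)!, reads the term 4kn in the polynomial as written below, and assumes the convention 1/(−1)! = 0, so that $h_{n,p}(n) = 0$. All function names are hypothetical.

```python
from fractions import Fraction
from math import factorial as fact


def summand(n, p, k):
    """f(n, p, k): the bracketed summand of F(n, p) in Lemma 16."""
    first = Fraction(
        (k - p + 2) ** 2 * (k - p + 4) * fact(k + p - 3) * fact(2 * n - p - k - 1),
        fact(k + 1) * fact(p - 3) * fact(n - k - 1) * fact(n - p + 1),
    )
    second = Fraction(
        (k - p + 3) ** 2 * (k - p + 5) * fact(k + p - 4) * fact(2 * n - p - k),
        fact(k + 1) * fact(p - 4) * fact(n - k - 1) * fact(n - p + 2),
    )
    return first - second


def lemma16_sum(n, p):
    """Left-hand side of Lemma 16, summed term by term."""
    return sum(summand(n, p, k) for k in range(p - 1, n))


def closed_form(n, p):
    """Right-hand side of Lemma 16."""
    return Fraction(
        2 * p * (n * p + 7 * p - 2 * p * p - 6) * fact(2 * n - 2 * p + 2) * fact(2 * p - 5),
        n * (n - 1) * fact(n - p) * fact(n - p + 2) * fact(p - 3) * fact(p),
    )


def gosper_poly(n, p, k):
    """Polynomial factor in the numerator of eq. (24)."""
    return (
        -k**3*n**2 + 2*k**3*n - k**3 + 3*k**2*n**2*p - 9*k**2*n**2
        - 4*k**2*n*p + 15*k**2*n - 3*k**2*p + 6*k**2
        - 3*k*n**2*p**2 + 18*k*n**2*p - 29*k*n**2 + 2*k*n*p**2
        - 6*k*n*p + 4*k*n - 3*k*p**2 + 12*k*p - 11*k
        + n**2*p**3 - 9*n**2*p**2 + 25*n**2*p - 21*n**2
        + 3*n*p**2 - 14*n*p + 15*n - p**3 + 6*p**2 - 11*p + 6
    )


def gosper_h(n, p, k):
    """h_{n,p}(k) from eq. (24); assumed to be 0 at k = n, where 1/(n-k-1)! vanishes."""
    if k == n:
        return Fraction(0)
    num = ((k + 1) * (2 * n - p - k) * gosper_poly(n, p, k)
           * fact(k + p - 4) * fact(2 * n - p - k - 1))
    den = ((n - 1) * n * (p - 3) * fact(k + 1) * (n - p + 2)
           * fact(p - 4) * fact(n - k - 1) * fact(n - p + 1))
    return Fraction(num, den)


for n in range(5, 13):
    for p in range(4, n + 1):
        total = lemma16_sum(n, p)
        assert total == closed_form(n, p)                      # Lemma 16
        for k in range(p - 1, n):                              # eq. (23)
            assert gosper_h(n, p, k + 1) - gosper_h(n, p, k) == summand(n, p, k)
        assert total == gosper_h(n, p, n) - gosper_h(n, p, p - 1)  # eq. (25)
```

A finite check of this kind does not replace the symbolic verification of eq. (23). It serves only to catch transcription errors in the long polynomial.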

C. Proofs of inequalities required for the proof of Proposition 13

In the proof of Proposition 13, we make use of lower and upper bounds for $h(n, p_m)$. First, we prove the lower bound.

Lemma 17. For all n ≥ 3, $h(n, p_m) \ge C_{n-2}$.

Proof. From Theorem 7, $h(n, p_m) \ge h(n, 3)$. Comparing h(n, 3) from eq. (10) and $C_{n-2}$ from eq. (1), we obtain

$$\frac{h(n,3)}{C_{n-2}} = \frac{3(n-2)}{n} \ge 1$$

for all n ≥ 3. Hence, $C_{n-2} \le h(n,3) \le h(n,p_m)$, as desired. □

To prove the upper bound, we first need an identity concerning Catalan numbers.

Lemma 18. For n ≥ 3, the Catalan number $C_n$ can be decomposed as a sum.

  (i) For even n ≥ 4, with n = 2m + 2 for m ≥ 1,
    $$C_{2m+2} = \sum_{k=m+2}^{2m+2} C(k, m+1)\, C(m, 2m+2-k).$$
  (ii) For odd n ≥ 3, with n = 2m + 1 for m ≥ 1,
    $$C_{2m+1} = \sum_{k=m+1}^{2m+1} C(k, m)\, C(m, 2m+1-k).$$

Proof. (i) We use two ways of counting monotonic paths. $C_{2m+2}$ gives the number of monotonic paths that travel from (0, 0) to (2m + 2, 2m + 2) on a square lattice, without traveling above the diagonal connecting (0, 0) to (2m + 2, 2m + 2). Each of these paths passes through exactly one vertical edge from a point (k, m + 1) to a point (k, m + 2), where k ranges from m + 2 to 2m + 2.

The number of monotonic paths that travel from (0, 0) to (k, m + 1) and that do not travel above the diagonal is C(k, m + 1). The number of monotonic paths from (k, m + 2) to (2m + 2, 2m + 2) that do not travel above the diagonal is obtained by traversing the paths in reverse order, from (2m + 2, 2m + 2) down and to the left, reaching (k, m + 2) (Figure 5A). The associated number of paths is C(m, 2m + 2 − k). Hence, the total number of monotonic paths from (0, 0) to (2m + 2, 2m + 2) that do not travel above the diagonal is $\sum_{k=m+2}^{2m+2} C(k, m+1)\, C(m, 2m+2-k)$.

(ii) The argument in the odd case proceeds in the same way. Each path from (0, 0) to (2m + 1, 2m + 1) passes through exactly one vertical edge from a point (k, m) to a point (k, m + 1), where k ranges from m + 1 to 2m + 1. The number of paths from (0, 0) to (k, m) is C(k, m), and the number of paths from (k, m + 1) to (2m + 1, 2m + 1) is C(m, 2m + 1 − k). □
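The decomposition in Lemma 18 can be confirmed for small n by counting paths directly. The sketch below is illustrative. It computes C(n, k) by a lattice recursion on the last step of the path rather than by a closed formula, so the check relies only on the path-counting definition used in the proof. The function names are hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def paths(n, k):
    """Monotonic paths from (0, 0) to (n, k) that never go above y = x."""
    if k < 0 or k > n:
        return 0
    if n == 0:
        return 1
    # The last step is either a right step from (n-1, k) or an up step from (n, k-1).
    return paths(n - 1, k) + paths(n, k - 1)


def catalan(n):
    """Catalan number C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def decomposition(n):
    """Right-hand side of Lemma 18 for n >= 3, split on the crossed vertical edge."""
    if n % 2 == 0:
        m = n // 2 - 1  # n = 2m + 2
        return sum(paths(k, m + 1) * paths(m, 2 * m + 2 - k)
                   for k in range(m + 2, 2 * m + 3))
    m = (n - 1) // 2  # n = 2m + 1
    return sum(paths(k, m) * paths(m, 2 * m + 1 - k)
               for k in range(m + 1, 2 * m + 2))


for n in range(3, 21):
    assert paths(n, n) == catalan(n)
    assert decomposition(n) == catalan(n)
```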

We are now ready to prove the upper bound.

Lemma 19. For all n ≥ 3, $h(n, p_m) \le n\,C_{n+2}$.

Proof. We split the proof into two cases, according to the expressions for pm from Theorem 7.

First, assume n is even, with n = 2m for m ≥ 2. Then $p_m = m + 1$, and by Theorem 4,

$$h(2m, m+1) = \sum_{k=m}^{2m-1} (k-m+1)\, C(k, m-2)\, C(m-1, 2m-k-1),$$

or, equivalently,

$$h(2m, m+1) = \sum_{k=m+2}^{2m+1} (k-m-1)\, C(k-2, m-2)\, C(m-1, 2m-k+1).$$

Using the decomposition in Lemma 18,

$$2m\,C_{2m+2} - h(2m, m+1) = \sum_{k=m+2}^{2m+1} \bigl[ 2m\, C(k, m+1)\, C(m, 2m-k+2) - (k-m-1)\, C(k-2, m-2)\, C(m-1, 2m-k+1) \bigr] + \frac{2m\,(m+2)\,(3m+3)!}{(m+1)!\,(2m+3)!}.$$

The summand in the first term is nonnegative, as the function C(n, k) is monotonically increasing with respect to both arguments, and 2m ≥ k − m − 1 because k ≤ 2m + 1. The remaining term is also nonnegative. Hence, for even n we indeed have $h(n, p_m) \le n\,C_{n+2}$.

Now assume n is odd, with n = 2m − 1 and m ≥ 2. We then must show $(2m-1)\,C_{2m+1} \ge h(2m-1, m+1)$. By Theorem 4, we have

$$h(2m-1, m+1) = \sum_{k=m}^{2m-2} (k-m+1)\, C(k, m-2)\, C(m-2, 2m-k-2),$$

or, equivalently,

$$h(2m-1, m+1) = \sum_{k=m+1}^{2m-1} (k-m)\, C(k-1, m-2)\, C(m-2, 2m-k-1).$$

Using the decomposition in Lemma 18, we have

$$(2m-1)\,C_{2m+1} - h(2m-1, m+1) = \sum_{k=m+1}^{2m-1} \bigl[ (2m-1)\, C(k, m)\, C(m, 2m-k+1) - (k-m)\, C(k-1, m-2)\, C(m-2, 2m-k-1) \bigr] + \frac{(2m-1)\,(2m^3 + 7m^2 + 9m + 2)\,(3m)!}{m!\,(2m+2)!}.$$

As is true in the even case, the summand is termwise nonnegative, as is the remaining term. We conclude that for odd n, $h(n, p_m) \le n\,C_{n+2}$, completing the proof. □
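The two bounds, together with the even-case and odd-case decompositions used in the proof of Lemma 19, can be spot-checked numerically. The sketch below is illustrative. It assumes the general form of Theorem 4 inferred from the two instances quoted in this appendix, the standard closed form C(n, k) = (n − k + 1)/(n + 1) · binom(n + k, k) for the path counts, and $p_m = \lfloor (n+3)/2 \rfloor$, which gives m + 1 for both n = 2m and n = 2m − 1. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial as fact


def catalan(n):
    """Catalan number C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def cat_tri(n, k):
    """Monotonic paths from (0, 0) to (n, k) that never rise above y = x."""
    if k < 0 or k > n:
        return 0
    return (n - k + 1) * comb(n + k, k) // (n + 1)


def h(n, p):
    """Assumed general form of Theorem 4 (inferred from the instances above)."""
    return sum(
        (k - p + 2) * cat_tri(k, p - 3) * cat_tri(n - p, n - k - 1)
        for k in range(p - 1, n)
    )


def p_max(n):
    """Assumed maximizing value p_m: m + 1 for n = 2m and for n = 2m - 1."""
    return (n + 3) // 2


# Lemmas 17 and 19: C_{n-2} <= h(n, p_m) <= n * C_{n+2}.
for n in range(3, 17):
    top = h(n, p_max(n))
    assert top == max(h(n, p) for p in range(3, n + 1))
    assert catalan(n - 2) <= top <= n * catalan(n + 2)

for m in range(2, 10):
    # Even case, n = 2m.
    even_terms = [
        2 * m * cat_tri(k, m + 1) * cat_tri(m, 2 * m - k + 2)
        - (k - m - 1) * cat_tri(k - 2, m - 2) * cat_tri(m - 1, 2 * m - k + 1)
        for k in range(m + 2, 2 * m + 2)
    ]
    even_rest = Fraction(2 * m * (m + 2) * fact(3 * m + 3),
                         fact(m + 1) * fact(2 * m + 3))
    assert all(t >= 0 for t in even_terms)
    assert 2 * m * catalan(2 * m + 2) - h(2 * m, m + 1) == sum(even_terms) + even_rest

    # Odd case, n = 2m - 1.
    odd_terms = [
        (2 * m - 1) * cat_tri(k, m) * cat_tri(m, 2 * m - k + 1)
        - (k - m) * cat_tri(k - 1, m - 2) * cat_tri(m - 2, 2 * m - k - 1)
        for k in range(m + 1, 2 * m)
    ]
    odd_rest = Fraction((2 * m - 1) * (2 * m**3 + 7 * m**2 + 9 * m + 2) * fact(3 * m),
                        fact(m) * fact(2 * m + 2))
    assert all(t >= 0 for t in odd_terms)
    assert (2 * m - 1) * catalan(2 * m + 1) - h(2 * m - 1, m + 1) == sum(odd_terms) + odd_rest
```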

D. Proof of the closed form for D(n, p) from eq. (21)

Here we prove the closed-form expression for the dual difference function in eq. (21) from Section 7.1.

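Lemma 20. For all (n, p) with n ≥ 5 and 4 ≤ p ≤ n, the following identity holds:

$$
\begin{aligned}
F(n,p) &= \sum_{k=n-p+3}^{n-1} \left[ \frac{(k-n+p-1)^2 (k-n+p+1)\,(k+n-p)!\,(n+p-k-4)!}{(k+1)!\,(p-2)!\,(n-k-1)!\,(n-p)!} - \frac{(k-n+p-2)^2 (k-n+p)\,(k+n-p+1)!\,(n+p-k-5)!}{(k+1)!\,(p-3)!\,(n-k-1)!\,(n-p+1)!} \right] \\
&= \frac{4(p-3)\,(n^2 + 2p^2 - 3np + 5n - 9p + 10)\,(2p-7)!\,(2n-2p+3)!}{n(n-1)(n-p+1)\,(p-4)!\,(p-2)!\,(n-p)!\,(n-p+3)!}.
\end{aligned}
$$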

Proof. As in Appendix B, let $\Delta_k$ denote the forward difference operator in k, meaning that $\Delta_k(f) = f(k+1) - f(k)$. Let f(n, p, k) be the summand in the expression for F(n, p).

We sum the equation

$$f(n,p,k) = \Delta_k\bigl(h_{n,p}(k)\bigr) \tag{26}$$

over k, from k = n − p + 3 to k = n − 1. The left-hand side is the summand in the statement of the lemma, and the function $h_{n,p}(k)$ is the output of Gosper’s algorithm [10, 11]:

$$
\begin{aligned}
h_{n,p}(k) = {} & \bigl[(k+1)\bigl(k^3n^2 - 2k^3n + k^3 - 3k^2n^3 + 3k^2n^2p + k^2n^2 - 4k^2np + 4k^2n - 3k^2p + 6k^2 \\
& \quad + 3kn^4 - 6kn^3p + 4kn^3 + 3kn^2p^2 - 2kn^2p - 2kn^2 - 2knp^2 + 4knp + 3kp^2 - 12kp + 11k \\
& \quad - n^5 + 3n^4p - 3n^4 - 3n^3p^2 + 6n^3p - 3n^3 + n^2p^3 - 3n^2p^2 + 4n^2p - 3n^2 - 2np + 4n \\
& \quad - p^3 + 6p^2 - 11p + 6\bigr)\,(n+k-p)!\,(n+p-k-4)!\bigr] \\
& \big/ \bigl[n(n-1)\,(k+1)!\,(p-2)!\,(n-k-1)!\,(n-p+1)!\bigr].
\end{aligned}
\tag{27}
$$

With h as in eq. (27), eq. (26) is verified algebraically. After summation of eq. (26), the left-hand side becomes F(n, p), and the right-hand side telescopes. All terms except the first and the last cancel, leaving

$$F(n,p) = h_{n,p}(n) - h_{n,p}(n-p+3). \tag{28}$$

The lemma then follows by algebraic simplification of the right-hand side of eq. (28). □
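As in Appendix B, the identity of Lemma 20 and the telescoping relation can be spot-checked with exact arithmetic. The sketch below is illustrative and uses hypothetical function names. It assumes the convention 1/(−1)! = 0, so that $h_{n,p}(n) = 0$ in eq. (28).

```python
from fractions import Fraction
from math import factorial as fact


def summand(n, p, k):
    """f(n, p, k): the bracketed summand of F(n, p) in Lemma 20."""
    first = Fraction(
        (k - n + p - 1) ** 2 * (k - n + p + 1) * fact(k + n - p) * fact(n + p - k - 4),
        fact(k + 1) * fact(p - 2) * fact(n - k - 1) * fact(n - p),
    )
    second = Fraction(
        (k - n + p - 2) ** 2 * (k - n + p) * fact(k + n - p + 1) * fact(n + p - k - 5),
        fact(k + 1) * fact(p - 3) * fact(n - k - 1) * fact(n - p + 1),
    )
    return first - second


def lemma20_sum(n, p):
    """Left-hand side of Lemma 20, summed term by term."""
    return sum(summand(n, p, k) for k in range(n - p + 3, n))


def closed_form(n, p):
    """Right-hand side of Lemma 20."""
    return Fraction(
        4 * (p - 3) * (n * n + 2 * p * p - 3 * n * p + 5 * n - 9 * p + 10)
        * fact(2 * p - 7) * fact(2 * n - 2 * p + 3),
        n * (n - 1) * (n - p + 1) * fact(p - 4) * fact(p - 2)
        * fact(n - p) * fact(n - p + 3),
    )


def gosper_poly(n, p, k):
    """Polynomial factor in the numerator of eq. (27)."""
    return (
        k**3*n**2 - 2*k**3*n + k**3 - 3*k**2*n**3 + 3*k**2*n**2*p + k**2*n**2
        - 4*k**2*n*p + 4*k**2*n - 3*k**2*p + 6*k**2
        + 3*k*n**4 - 6*k*n**3*p + 4*k*n**3 + 3*k*n**2*p**2 - 2*k*n**2*p
        - 2*k*n**2 - 2*k*n*p**2 + 4*k*n*p + 3*k*p**2 - 12*k*p + 11*k
        - n**5 + 3*n**4*p - 3*n**4 - 3*n**3*p**2 + 6*n**3*p - 3*n**3
        + n**2*p**3 - 3*n**2*p**2 + 4*n**2*p - 3*n**2 - 2*n*p + 4*n
        - p**3 + 6*p**2 - 11*p + 6
    )


def gosper_h(n, p, k):
    """h_{n,p}(k) from eq. (27); assumed to be 0 at k = n, where 1/(n-k-1)! vanishes."""
    if k == n:
        return Fraction(0)
    num = (k + 1) * gosper_poly(n, p, k) * fact(n + k - p) * fact(n + p - k - 4)
    den = n * (n - 1) * fact(k + 1) * fact(p - 2) * fact(n - k - 1) * fact(n - p + 1)
    return Fraction(num, den)


for n in range(5, 13):
    for p in range(4, n + 1):
        total = lemma20_sum(n, p)
        assert total == closed_form(n, p)                          # Lemma 20
        for k in range(n - p + 3, n):                              # eq. (26)
            assert gosper_h(n, p, k + 1) - gosper_h(n, p, k) == summand(n, p, k)
        assert total == gosper_h(n, p, n) - gosper_h(n, p, n - p + 3)  # eq. (28)
```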

References

  • [1] Degnan JH. Gene tree distributions under the coalescent process. PhD thesis, University of New Mexico, Albuquerque, 2005.
  • [2] Degnan JH and Rhodes JA. There are no caterpillars in a wicked forest. Theoretical Population Biology, 105:17–23, 2015.
  • [3] Degnan JH and Salter LA. Gene tree distributions under the coalescent process. Evolution, 59(1):24–37, 2005.
  • [4] Disanto F and Munarini E. Local height in weighted Dyck models of random walks and the variability of the number of coalescent histories for caterpillar-shaped gene trees and species trees. SN Applied Sciences, 1:578, 2019.
  • [5] Disanto F and Rosenberg NA. Coalescent histories for lodgepole species trees. Journal of Computational Biology, 22:918–929, 2015.
  • [6] Disanto F and Rosenberg NA. Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 13:913–925, 2016.
  • [7] Flajolet P and Sedgewick R. Analytic Combinatorics. Cambridge University Press, Cambridge, 2009.
  • [8] Graham RL, Knuth DE, and Patashnik O. Concrete Mathematics. Addison-Wesley, Boston, 2nd edition, 1994.
  • [9] Himwich ZM and Rosenberg NA. Roadblocked monotonic paths and the enumeration of coalescent histories for non-matching caterpillar gene trees and species trees. Advances in Applied Mathematics, 113:101939, 2020.
  • [10] Paule P and Schorn M. A Mathematica version of Zeilberger’s algorithm for proving binomial coefficient identities. Journal of Symbolic Computation, 20:673–698, 1995.
  • [11] Petkovšek M, Wilf HS, and Zeilberger D. A=B. CRC Press, Boca Raton, 1996.
  • [12] Reuveni S. Catalan’s trapezoids. Probability in the Engineering and Informational Sciences, 28:353–361, 2014.
  • [13] Rosenberg NA. Counting coalescent histories. Journal of Computational Biology, 14:360–377, 2007.
  • [14] Rosenberg NA. Coalescent histories for caterpillar-like families. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10:1253–1262, 2013.
  • [15] Rosenberg NA. Enumeration of lonely pairs of gene trees and species trees by means of antipodal cherries. Advances in Applied Mathematics, 102:1–17, 2019.
  • [16] Rosenberg NA and Degnan JH. Coalescent histories for discordant gene trees and species trees. Theoretical Population Biology, 77:145–151, 2010.
  • [17] Stanley RP. Catalan Numbers. Cambridge University Press, Cambridge, 2015.
  • [18] Than C, Ruths D, Innan H, and Nakhleh L. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology, 14:517–535, 2007.
