Abstract
Given a gene tree topology and a species tree topology, a coalescent history represents a possible mapping of the list of gene tree coalescences to associated branches of a species tree on which those coalescences take place. Enumerative properties of coalescent histories have been of interest in the analysis of relationships between gene trees and species trees. The simplest enumerative result identifies a bijection between coalescent histories for a matching caterpillar gene tree and species tree with monotonic paths that do not cross the diagonal of a square lattice, establishing that the associated number of coalescent histories for n-taxon matching caterpillar trees (n ⩾ 2) is the Catalan number $C_{n-1} = \frac{1}{n}\binom{2n-2}{n-1}$. Here, we show that a similar bijection applies for non-matching caterpillars, connecting coalescent histories for a non-matching caterpillar gene tree and species tree to a class of roadblocked monotonic paths. The result provides a simplified algorithm for enumerating coalescent histories in the non-matching caterpillar case. It enables a rapid proof of a known result that given a caterpillar species tree, no non-matching caterpillar gene tree has a number of coalescent histories exceeding that of the matching gene tree. Additional results on coalescent histories can be obtained by a bijection between permissible roadblocked monotonic paths and Dyck paths. We study the number of coalescent histories for non-matching caterpillar gene trees that differ from the species tree by nearest-neighbor-interchange and subtree-prune-and-regraft moves, characterizing the non-matching caterpillar with the largest number of coalescent histories. We discuss the implications of the results for the study of the combinatorics of gene trees and species trees.
Keywords: Catalan numbers, coalescent histories, Dyck paths, monotonic paths, nearest-neighbor-interchange, subtree-prune-and-regraft
Mathematics subject classification: 05A15, 05A19, 05B35, 92B10, 92D15
1. Introduction
In the mathematical study of evolutionary trees, genetic lineages can be treated as evolving along the branches of a species phylogeny, a tree that represents the evolutionary relationships among a set of species [4, 12, 13]. A tree describing a set of genetic lineages that descend from a common ancestor is a gene tree, and a tree relating the species themselves is a species tree. Looking backward in time, in a gene tree of genetic lineages sampled from representative individuals of a given set of species, a pair of genetic lineages can coalesce, or find a common ancestor, only after the common ancestor of their species is reached. More generally, a set of two or more genetic lineages has a most recent common ancestor only after the most recent common ancestor of their associated species is reached.
The study of the relationship between gene trees and species trees (usually treated as binary, rooted, and leaf-labeled) has generated a number of novel combinatorial structures [3, 5, 6, 12, 20, 21, 25, 27, 28]. Among these are coalescent histories, structures that describe the possible locations on a species tree where the coalescences of a gene tree can take place [6, 16]. More precisely, for a (binary, rooted, leaf-labeled) gene tree topology G and a (binary, rooted, leaf-labeled) species tree topology S on the same set of taxa, a coalescent history h associates with each coalescence in G an edge of S, such that two properties are satisfied: (i) the species tree edge h(u) associated with a gene tree coalescence u is ancestral to all lineages that descend from u; (ii) for any pair of gene tree coalescences u, v for which u lies on a path from v to a leaf of the gene tree, h(u) lies on a path from h(v) to a leaf of the species tree. From a biological perspective, this pair of constraints encodes the rules that (i) gene lineages can coalesce only in a branch of the species tree in which it is possible for their ancestors to coexist, and that (ii) ancestors can coalesce no more recently than their descendants.
Rosenberg [16] provided a recursion that enumerates coalescent histories for arbitrary gene tree and species tree topologies. For gene tree topology G and species tree topology S, so that the taxon set of S is a superset of that of G but not necessarily the same set, let T(G, S) denote the minimal displayed subtree of S that contains all the taxa of G, that is, the subtree of S rooted at the node that corresponds to the most recent common ancestor of the taxa with the same labels as the taxa in G. Let d(G, S) ⩾ 0 denote the number of edges that separate the root of T(G, S) from the root of S. Let GL and GR denote the left and right subtrees of G. We define an integer parameter m ⩾ 1, and write a recursion for a function BG,S,m:
$$B_{G,S,m} = \sum_{k=1}^{m} B_{G_L,\,T(G_L,S),\,k+d(G_L,S)} \; B_{G_R,\,T(G_R,S),\,k+d(G_R,S)}. \tag{1}$$
The base case is obtained by setting BG,S,m to 1 for all m in the case that G has only one taxon. With these definitions, the number of coalescent histories for gene tree topology G and species tree topology S is BG,S,1.
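The recursion can be sketched in a few lines of Python. The sketch assumes that eq. 1 has the product-sum form B_{G,S,m} = Σ_{k=1}^{m} B_{G_L, T(G_L,S), k+d(G_L,S)} · B_{G_R, T(G_R,S), k+d(G_R,S)}; this form, the nested-tuple tree encoding, and the function names are illustrative assumptions rather than part of the original presentation.

```python
def leaves(t):
    """Leaf labels of a tree given as a nested tuple; a leaf is a string."""
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def locate(taxa, s, depth=0):
    """Return (T, d): the minimal subtree of s containing taxa, and the
    number of edges separating its root from the root of s."""
    if not isinstance(s, str):
        for child in s:
            if taxa <= leaves(child):
                return locate(taxa, child, depth + 1)
    return s, depth

def num_histories_recursion(g, s, m=1):
    """B_{G,S,m} under the assumed form of eq. 1 (s must equal T(g, s))."""
    if isinstance(g, str):          # base case: a one-taxon gene tree
        return 1
    t_left, d_left = locate(leaves(g[0]), s)
    t_right, d_right = locate(leaves(g[1]), s)
    return sum(
        num_histories_recursion(g[0], t_left, k + d_left)
        * num_histories_recursion(g[1], t_right, k + d_right)
        for k in range(1, m + 1)
    )

cat4 = ((("A", "B"), "C"), "D")
print(num_histories_recursion(cat4, cat4))  # matching 4-leaf caterpillars
```

For small trees the output can be checked by listing the coalescent histories by hand; for instance, the gene tree ((A, B), C) on the species tree ((A, C), B) admits only the history that places both coalescences above the species tree root.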
Caterpillar species trees, in which an internal node exists that is descended from all other internal nodes, represent a special case in which enumeration of the coalescent histories is simpler than in the general case of arbitrary species trees. Thus, although exact and asymptotic results are known for certain additional shapes [9, 16, 18], enumerative properties have been explored most extensively for caterpillar species trees and shapes that closely resemble them [2, 6, 10, 16, 17, 19]. First, for a matching caterpillar gene tree and species tree—a caterpillar gene tree and species tree with the same labeled topology—Degnan [2] found a bijection between coalescent histories and monotonic paths on a square lattice that do not cross above the y = x diagonal, a quantity well-known to be described by the Catalan number sequence [23, item 24]. Eq. 1 recovers the Catalan numbers in this case [16, Corollary 3.5], and can be used to show that the number of coalescent histories for matching gene trees and species trees in small “caterpillar-like families” is asymptotic to a constant multiple of the Catalan numbers [16, 17]. This asymptotic behavior has been demonstrated for caterpillar-like families of arbitrary size using techniques of analytic combinatorics [10].
Enumerative results have been comparatively little studied, however, in the case that labeled gene trees and species trees disagree in topology. Than et al. [26] performed a numerical investigation, finding that the number of coalescent histories for non-matching gene tree and species tree topologies generally decreases with increasing subtree-prune-and-regraft (SPR) distance between the trees. Rosenberg & Degnan [19] demonstrated that for the caterpillar species tree topology with n ⩾ 7 taxa, there exists a non-matching gene tree topology with more coalescent histories than the matching caterpillar gene tree topology. Nevertheless, for caterpillar species tree topologies, Degnan & Rhodes [3] showed that no non-matching caterpillar gene tree topology can exceed the matching caterpillar gene tree topology in number of coalescent histories; indeed, the constructive example of Rosenberg & Degnan [19] of a non-matching gene tree topology with more coalescent histories than the matching caterpillar was not itself a caterpillar.
Here, we extend the monotonic path approach of Degnan [2] to non-matching caterpillar gene tree and species tree topologies. We show that coalescent histories for non-matching caterpillar gene tree and species tree topologies can be bijectively associated with a set of roadblocked monotonic paths that do not cross above the y = x diagonal of a square lattice. The approach immediately recovers the result of Degnan & Rhodes [3] that non-matching caterpillar gene tree topologies do not exceed the matching caterpillar gene tree topology in number of coalescent histories. It enables calculations of the number of coalescent histories for caterpillar gene tree topologies that differ from the species tree by common transformations—nearest-neighbor-interchange and subtree-prune-and-regraft. We characterize non-matching caterpillar gene trees with the largest numbers of coalescent histories, finding that the number of coalescent histories in such cases is asymptotically equivalent to that in the matching case.
2. Preliminaries
2.1. Caterpillar trees
We consider binary, rooted, leaf-labeled trees with leaf labels bijectively drawn from a label set X containing n distinct labels. For convenience, a “tree” refers to a binary, rooted, leaf-labeled tree. Trees contain two types of nodes, leaf nodes and non-leaf, or internal, nodes. Because trees are rooted, we say that a node v1 of a tree G is descended from another node v2 if the shortest path from v1 to the root node contains v2. We also say that v2 is ancestral to v1. Ancestor–descendant relationships also apply to pairs of edges and to pairs containing a vertex and an edge. A node or edge is trivially descended from itself, and it is also trivially ancestral to itself. The root node is an internal node.
We focus on caterpillar trees, trees in which there exists an internal node descended from all other internal nodes (Figure 1A). A caterpillar tree has exactly one cherry node, a node with exactly two descendant leaves. Among leaves, the longest path length to the root of a caterpillar tree with n leaves is n – 1.
Figure 1.

Transformations of caterpillar trees. (A) A caterpillar tree G1. The vector of labels for G1, in canonical order, is (A, B, C, D, E, F, G, H, I, J). The adjacent pairs of leaves are (A, B), (A, C), (B, C), (C, D), (D, E), (E, F), (F, G), (G, H), (H, I), and (I, J). (B) A tree G2 that differs from G1 by nearest-neighbor-interchange. Leaves E and F are exchanged. (C) A tree obtained from G1 by forward incrementation of leaves C, D, and E. (D) A tree obtained from G1 by reverse incrementation of leaves C, D, and E. The tree in (C) can also be viewed as the result of a subtree-prune-and-regraft operation, with the branch leading to leaf E pruned and regrafted; the tree in (D) can be viewed as the result of an SPR operation, with the branch leading to leaf C pruned and regrafted. In each panel, the red line indicates which leaves are permuted.
The number of distinct caterpillar trees possible for a label set X with n distinct labels is n!/2: the leaf separated from the root by only one edge has n possible labels, the leaf two edges from the root then has n – 1 possible labels, and so on. In this assignment of labels, the leaves descended from the cherry node are exchangeable. Hence, only one labeling is possible for these leaves, giving a total of n(n – 1)(n – 2) ×⋯× 3 = n!/2 labelings. These labelings represent the n!/2 caterpillar labeled topologies for label set X.
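The count n!/2 can be checked by direct enumeration for small n. The sketch below writes each caterpillar as a label vector with the two cherry labels first and keeps one representative per exchangeable cherry pair; the function name is hypothetical.

```python
from itertools import permutations
from math import factorial

def caterpillar_label_vectors(labels):
    """One label vector per labeled caterpillar topology.

    The first two entries label the cherry and are exchangeable, so only
    vectors whose first entry sorts before the second are kept.
    """
    return [p for p in permutations(labels) if p[0] < p[1]]

for n in range(2, 8):
    assert len(caterpillar_label_vectors(range(n))) == factorial(n) // 2

print(len(caterpillar_label_vectors("ABCD")))  # 4!/2 labeled caterpillars
```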
For convenience, we organize the labels in an n-leaf caterpillar tree G canonically in a vector g of length n. For i = 3, 4,…, n, entry i in the vector is the label of the leaf separated from the root by n – i + 1 edges. Entries 1 and 2 are the labels for the leaves in the cherry. Two vectors of labels g and s are considered to be equivalent if and only if one of the following two conditions holds: (1) gi = si for all i, or (2) g1 = s2, g2 = s1, and gi = si for each i = 3, 4,…, n.
Two leaves in a caterpillar tree are considered to be adjacent if they are separated by exactly two or three edges (Figure 1A). Equivalently, leaves are adjacent if and only if their indices in the sequence of labels for the tree differ by 1, or if one is entry 1 and the other is entry 3.
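As an illustration, the adjacent pairs can be computed from the canonical label vector by measuring edge distances. The sketch assumes the internal nodes are numbered 1 to n – 1 from the cherry toward the root, so that the leaf in position k ⩾ 3 attaches to node k – 1 and the two cherry leaves attach to node 1; the helper names are hypothetical.

```python
def parent_node(idx):
    """Internal node directly above the leaf at 1-based position idx."""
    return max(idx - 1, 1)

def leaf_distance(a, b):
    """Number of edges between the leaves at distinct 1-based positions a, b."""
    return abs(parent_node(a) - parent_node(b)) + 2

def adjacent_pairs(g):
    """Pairs of labels whose leaves are separated by two or three edges."""
    n = len(g)
    return [
        (g[a - 1], g[b - 1])
        for a in range(1, n + 1)
        for b in range(a + 1, n + 1)
        if leaf_distance(a, b) <= 3
    ]

print(adjacent_pairs("ABCDEFGHIJ"))
```

For the label vector (A, B, C, D, E, F, G, H, I, J), the pairs returned are those listed in the caption of Figure 1.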
A component of a caterpillar tree is a subset of adjacent leaves, excluding from the definition the subset consisting solely of the pair of leaves in the cherry. Formally, a subset of labels X′ ⊂ X is a component of G if X′ ≠ {g1, g2}, where g1 and g2 are the labels of the leaves in the cherry, and for any pair of distinct labels x, y ∈ X′, there exists a sequence of distinct elements x, xi1, xi2,…, xij, y ∈ X′ in which each consecutive pair of elements labels adjacent leaves in G.
It is convenient to number the internal nodes of an n-leaf caterpillar tree from 1 to n – 1 in increasing order from the cherry node toward the root. These nodes are ordered by ancestor–descendant relationships, so that the node of smallest value in any nonempty subset of internal nodes descends from all other elements of the subset. We call this node the minimal node of the subset. It is also useful to consider that a tree possesses an internal edge ancestral to its root node; thus, identifying each internal node with its immediate ancestral edge, a nonempty subset of internal edges has a minimal edge.
2.2. Relationships between pairs of caterpillar trees
The labelings of distinct caterpillar trees with the same label set differ by a permutation of the vector of leaf labels. We will have occasion to examine pairs of caterpillar trees whose labelings differ by specific types of permutation: nearest-neighbor-interchange and subtree-prune-and-regraft [24].
Consider two distinct caterpillar trees G and S, bijectively labeled from the same set of n distinct labels.
Definition 1. Caterpillar trees G and S differ by a nearest-neighbor-interchange, or NNI move, if S can be obtained from G by exchanging the labels of a pair of adjacent leaves in G that are separated by exactly three edges (Figure 1B).
Note that our definition of adjacent leaves includes the leaves corresponding to labels g1 and g2 in the canonical ordering. These two leaves are separated by only two edges, so this pair is the only pair of adjacent leaves whose exchange is not an NNI move; exchanging them leaves the labeled topology unchanged.
Definition 2. Caterpillar trees G and S differ by a subtree-prune-and-regraft, or SPR move, if there exists an ordered pair of edges (e1, e2) in G with the property that if edge e1 is cut, edge e2 is subdivided in two by placement of a new vertex v of degree two, and the subtree descended from e1 is connected to vertex v such that v now has degree three and is ancestral to the subtree, then tree S is obtained (Figure 1C, 1D).
In an SPR move, note that it is possible for the edge e2 to be the edge ancestral to the root of G.
Definition 3. Caterpillar trees G and S differ by a cyclic permutation if there exists a component G′ of G and a component S′ of S such that the labels of S′ represent a cyclic permutation of the labels of G′.
By definition of a component, this definition excludes permutations that simultaneously involve leaves separated from the root by the fewest edges and leaves separated from the root by the most edges, unless all leaves are involved.
Definition 4. Caterpillar trees G and S differ by an incrementation if they differ by a cyclic permutation and at most one label has positions in the canonical label vectors of G and S that differ by more than one.
S can differ from G by a forward or a reverse cycle or incrementation (Figure 1C, 1D). If S differs from G by a forward incrementation or cycle, then G differs from S by a reverse incrementation or cycle, and vice versa. Note that each cyclic permutation that exchanges two leaves is concurrently a forward incrementation, a reverse incrementation, and an NNI move.
We can immediately observe that a pair of caterpillar trees G and S differ by an SPR move if and only if they also differ by an incrementation of the leaf labels. SPR moves that convert caterpillars to caterpillars necessarily prune and regraft a single leaf. If a leaf is pruned from G and regrafted to S, then depending on which leaf is pruned and where it is regrafted, S can differ from G by either a forward or a reverse incrementation. Therefore, enumeration of coalescent histories in the case that caterpillar trees differ by an SPR move is performed by enumeration in the associated case of a forward or a reverse incrementation.
2.3. Coalescent histories
We study coalescent histories for a caterpillar gene tree G and a caterpillar species tree S, treated as binary, rooted, leaf-labeled caterpillar trees, each with n leaves labeled by labels bijectively drawn from the same set X. This setting corresponds to considering G to represent the tree formed by sampling a single gene lineage in each of the n species present in species tree S. Gene tree G and species tree S are said to be matching if G and S have the same labeled topology, and they are said to be non-matching otherwise.
Formally, a coalescent history can be defined as follows [19].
Definition 5. Consider an ordered pair of binary, rooted, leaf-labeled trees (G, S) whose labels are bijectively drawn from the same label set X. A coalescent history is a function h from the set of internal nodes of G to the set of internal edges of S that satisfies two conditions:
For each internal node v of G, all leaf labels for leaves descended from v in G label leaves descended from edge h(v) in S.
For all pairs of internal nodes v1, v2 in G, if node v2 is descended from node v1 in G, then edge h(v2) is descended from edge h(v1) in S.
An illustration appears in Figure 2. Recall that we consider that S contains an edge ancestral to its root; this edge can be the image of an internal node of G under a coalescent history mapping. Note that because an edge is trivially descended from itself, in part 2 of Definition 5, it is permissible for h(v2) to equal h(v1).
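For caterpillar trees, Definition 5 can be checked exhaustively on small examples. The following sketch assumes that G and S are given by their canonical label vectors g and s, that internal node j of G is ancestral to the first j + 1 labels of g, and that internal edge i of S is ancestral to the first i + 1 labels of s. It tests every function from nodes to edges, so it is practical only for small n; the function name is hypothetical.

```python
from itertools import product

def count_histories_brute_force(g, s):
    """Count functions h satisfying both conditions of Definition 5
    for caterpillars with canonical label vectors g and s."""
    n = len(g)
    nodes = range(1, n)                                   # nodes and edges 1..n-1
    below_node = {j: set(g[: j + 1]) for j in nodes}      # leaves under node j of G
    below_edge = {i: set(s[: i + 1]) for i in nodes}      # leaves under edge i of S
    count = 0
    for h in product(nodes, repeat=n - 1):                # h[j-1] is the image of node j
        cond1 = all(below_node[j] <= below_edge[h[j - 1]] for j in nodes)
        cond2 = all(h[j - 1] <= h[j] for j in range(1, n - 1))
        count += cond1 and cond2
    return count

print([count_histories_brute_force("ABCDEF"[:n], "ABCDEF"[:n]) for n in range(2, 7)])
```

Condition 2 reduces to monotonicity of h here because, in a caterpillar, node j descends from node j + 1 and edge i descends from edge i + 1.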
Figure 2.
Coalescent histories. (A) A gene tree G and species tree S with the same label set. The gene tree appears in blue, and the species tree appears in black. (B) The coalescent history depicted in (A) for (G, S). The arrows connect internal nodes of G to their associated edges in S.
We will have occasion to use the concept of a partial coalescent history.
Definition 6. Consider an ordered pair of binary, rooted, leaf-labeled trees (G, S) whose labels are drawn from the same label set X, not necessarily bijectively. A partial coalescent history is a function h from the set of internal nodes of G to the set of internal edges of S, satisfying the two conditions in Definition 5.
We say that if G is empty, then (G, S) has one partial coalescent history. For nonempty G, because the labels in G are not necessarily the same as those of S, it is possible that for some nodes v in G, S has no edge that can serve as the image of a node in G. In this case, the pair (G, S) has no partial coalescent histories. When connecting the purely graphical definition of coalescent histories in Definition 5 to the biological context in which coalescent histories arise, we say that an internal node v of G is a gene tree coalescence; the coalescence is said to occur on edge h(v) of S.
2.4. Catalan numbers and monotonic paths
We recall a number of results concerning Catalan numbers and their use in counting paths along the edges of square lattices. The Catalan sequence {Cn}n⩾0 satisfies
$$C_n = \frac{1}{n+1}\binom{2n}{n},$$
beginning from n = 0, with values 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, …
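The listed values follow from the standard closed form C_n = binom(2n, n)/(n + 1), as a short check confirms; the function name is hypothetical.

```python
from math import comb

def catalan(n):
    """n-th Catalan number via the closed form binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(10)])
```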
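Catalan numbers can be placed in the combinatorial construction known as Catalan’s triangle [14], of which we display the first six columns:
$$\begin{array}{cccccc}
 & & & & & 42 \\
 & & & & 14 & 42 \\
 & & & 5 & 14 & 28 \\
 & & 2 & 5 & 9 & 14 \\
 & 1 & 2 & 3 & 4 & 5 \\
1 & 1 & 1 & 1 & 1 & 1
\end{array}$$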
In this triangle, the initial 1 in the lower left corner is denoted D(0, 0). Other entries are denoted D(n, k), with n as the horizontal distance from the lower left corner and k as the vertical distance from this entry.
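For n, k with 0 ⩽ k ⩽ n and (n, k) ≠ (0, 0), the entries D(n, k) satisfy the recursion relation
$$D(n,k) = D(n,k-1) + D(n-1,k), \qquad D(n,k) = 0 \text{ if } k < 0 \text{ or } k > n, \tag{2}$$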
with initial condition D(0, 0) = 1. The general formula for D(n, k) is
$$D(n,k) = \binom{n+k}{k} - \binom{n+k}{k-1} = \frac{n-k+1}{n+1}\binom{n+k}{k}. \tag{3}$$
In particular, for k = n, we have D(n, n) = Cn.
The entry D(n, k) counts the number of monotonic paths on the lattice in the first quadrant of the (n, k) plane (including the coordinate axes) that do not cross the line k = n, where a monotonic path is a path from (0, 0) to (n, k) that proceeds by steps upward and to the right on the lattice.
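The recursion and the closed form for Catalan’s triangle can be compared numerically. The sketch below takes the recursion to be D(n, k) = D(n, k – 1) + D(n – 1, k), with D(0, 0) = 1 and zero entries outside 0 ⩽ k ⩽ n, and the closed form to be binom(n + k, k) – binom(n + k, k – 1); both are stated here as assumptions consistent with the path-counting interpretation, and the function names are hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def D(n, k):
    """Entry of Catalan's triangle computed by the two-term recursion."""
    if k < 0 or k > n:
        return 0
    if n == 0:                      # here k == 0 as well
        return 1
    return D(n, k - 1) + D(n - 1, k)

def D_closed(n, k):
    """Closed form obtained from the reflection principle."""
    return comb(n + k, k) - (comb(n + k, k - 1) if k >= 1 else 0)

for n in range(12):
    for k in range(n + 1):
        assert D(n, k) == D_closed(n, k)

print([D(5, k) for k in range(6)])  # column n = 5 of the triangle
```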
We will also make use of extensions of Catalan’s triangle known as Catalan’s trapezoids of order m, which contain an initial column of m entries equal to 1, rather than a single entry [14]. Entries Dm(n, k) in Catalan’s trapezoids satisfy a version of eq. 2:
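$$D_m(n,k) = \begin{cases} 1, & n = 0 \text{ and } 0 \le k \le m-1, \\ D_m(n,k-1) + D_m(n-1,k), & n \ge 1 \text{ and } 0 \le k \le n+m-1, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$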
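We have D1(n, k) = D(n, k). The first five columns of Catalan’s trapezoid of order 3 appear below:
$$\begin{array}{ccccc}
 & & & & 90 \\
 & & & 28 & 90 \\
 & & 9 & 28 & 62 \\
 & 3 & 9 & 19 & 34 \\
1 & 3 & 6 & 10 & 15 \\
1 & 2 & 3 & 4 & 5 \\
1 & 1 & 1 & 1 & 1
\end{array}$$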
An entry in the trapezoid can be calculated in closed form as
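$$D_m(n,k) = \begin{cases} \binom{n+k}{k}, & 0 \le k < m, \\ \binom{n+k}{k} - \binom{n+k}{k-m}, & m \le k \le n+m-1, \\ 0, & k > n+m-1. \end{cases} \tag{5}$$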
The entry Dm(n, k) in Catalan’s trapezoid of order m counts the number of monotonic paths on the lattice in the first quadrant of the (n, k) plane (including the coordinate axes) that do not cross the line k = n + m – 1.
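A similar numerical check applies to Catalan’s trapezoids. The sketch assumes that the entries obey the same two-term recursion as Catalan’s triangle, with D_m(0, 0) = 1 and zero entries outside 0 ⩽ k ⩽ n + m – 1, and that the closed form is the reflection-principle expression binom(n + k, k) – binom(n + k, k – m) for k ⩾ m; the function names are hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def D_m(m, n, k):
    """Entry of Catalan's trapezoid of order m, by recursion."""
    if n < 0 or k < 0 or k > n + m - 1:
        return 0
    if n == 0 and k == 0:
        return 1
    return D_m(m, n, k - 1) + D_m(m, n - 1, k)

def D_m_closed(m, n, k):
    """Number of monotonic paths from (0, 0) to (n, k) staying on or
    below the line k = n + m - 1 (assumed closed form)."""
    if k < 0 or k > n + m - 1:
        return 0
    if k < m:
        return comb(n + k, k)
    return comb(n + k, k) - comb(n + k, k - m)

for m in range(1, 5):
    for n in range(8):
        for k in range(n + m):
            assert D_m(m, n, k) == D_m_closed(m, n, k)

print([D_m(3, 4, k) for k in range(7)])  # column n = 4 of the order-3 trapezoid
```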
3. Bijection of coalescent histories and roadblocked monotonic paths
3.1. Matching gene trees and species trees
Degnan [2] proved that the number of coalescent histories for a matching caterpillar gene tree G and species tree S with n labels is the Catalan number Cn–1, demonstrating a bijection between coalescent histories and monotonic paths that do not cross the y = x diagonal of a square lattice. We will discuss this well-known correspondence, as the bijective approach is useful for the non-matching case.
Lemma 7. The coalescent histories for a matching n-leaf caterpillar gene tree G and species tree S can be bijectively associated with monotonic paths that do not cross the y = x diagonal of an (n – 1) × (n – 1) lattice.
Proof. Label the internal nodes of G sequentially from 1 to n – 1, using 1 for the internal node nearest the cherry and n – 1 for the root. For each internal node of G, identify the label for the node with the edge immediately ancestral to it. Similarly, sequentially label the internal nodes of S from 1 to n – 1, proceeding from the cherry toward the root and identifying the label for each node with its immediate ancestral edge.
For each j with 1 ⩽ j ⩽ n – 1, denote by Gj the subtree of the gene tree rooted at node j, and for each i with 1 ⩽ i ⩽ n – 1, denote by Si the subtree of the species tree rooted at node i. We also define G0 and S0 to be empty subtrees of the gene tree and species tree, respectively. Denote by Ai,j the set of partial coalescent histories for (Gj, Si). For matching G and S, for each j with 0 ⩽ j ⩽ n – 1, Gj = Sj. Hence, by definition of a coalescent history, for each internal node j ⩾ 1 of G, the image h(j) in a coalescent history h of (G, S) must be ancestral in S to all leaves of S labeled by labels in Gj. The edges of S with this property are edges j, j + 1,…, n – 1. For j ⩾ 1, we have j ⩽ h(j) ⩽ n – 1, and ∣Ai,j∣ = 0 for all (i, j) with i < j.
Each partial coalescent history in Ai,j is formed in one of two ways. Gene tree node j ⩾ 1 is mapped either to species tree internal edge i, or to one of the edges 1, 2,…, i – 1. The former case produces ∣Ai,j–1∣ partial coalescent histories, each obtained by appending the coalescence of gene tree node j to a partial coalescent history for (Gj–1, Si). The latter case produces ∣Ai–1, j∣ partial coalescent histories; because no gene tree coalescences in such a partial coalescent history occur on species tree edge i, each such partial coalescent history for (Gj, Si) is a partial coalescent history for (Gj, Si–1). Hence, we have
$$|A_{i,j}| = |A_{i,j-1}| + |A_{i-1,j}|, \tag{6}$$
with the constraint ∣Ai,j∣ = 0 for j ⩾ 1 and i < j. For j = 0 and 0 ⩽ i ⩽ n – 1, we have ∣Ai,0∣ = 1 by the convention that (G, S) has one partial coalescent history for empty G. We set ∣Ai,j∣ = 0 for all (i, j) that do not satisfy 0 ⩽ i, j ⩽ n – 1.
Recursion 6 and its base cases, with i in the role of n and j in the role of k, is precisely eq. 2. Setting i = j = n – 1, eq. 2 gives the recursion for enumerating the set of monotonic paths that do not cross the y = x diagonal of an (n – 1) × (n – 1) square lattice, a set with Cn–1 elements. In the bijection between coalescent histories and monotonic paths, each step to the right in the lattice, incrementing i, corresponds to incorporating an additional edge of the species tree as a possible location for gene tree coalescences, and each step up, incrementing j, corresponds to occurrence of a gene tree coalescence. ■
We can read a coalescent history of (G, S) from its associated monotonic path (Figure 3). For example, in a 10-leaf tree, the monotonic path that proceeds through (0,0), (3,0), (3,2), (6,2), (6,3), (7,3), (7,7), (9,7), and (9,9) has no gene tree coalescences on edge 1 of the species tree above (A, B) or on edge 2 above ((A, B), C). Gene tree coalescences (A, B) and ((A, B), C) occur on edge 3 above species tree node (((A, B), C), D). No gene tree coalescences occur on edges 4 or 5. Gene tree coalescence (((A, B), C), D) occurs on edge 6. Four gene tree coalescences occur on edge 7 above species tree node (((((((A, B), C), D), E), F), G), H). The two remaining gene tree coalescences occur on edge 9 above the species tree root.
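The reading of a coalescent history from a monotonic path can be made mechanical. In the sketch below, a path is given by its corner points, as in the example above, and each vertical unit step from height j – 1 to height j at horizontal position i is recorded as h(j) = i; the function name is hypothetical.

```python
def history_from_path(corners):
    """Map each gene tree internal node j to the species tree edge on
    which it coalesces, given the corner points of a monotonic path."""
    h = {}
    for (x0, y0), (x1, y1) in zip(corners, corners[1:]):
        if x0 == x1:                           # vertical segment: coalescences
            for j in range(y0 + 1, y1 + 1):
                h[j] = x0
    return h

path = [(0, 0), (3, 0), (3, 2), (6, 2), (6, 3), (7, 3), (7, 7), (9, 7), (9, 9)]
print(history_from_path(path))
```

For the example path, nodes 1 and 2 map to edge 3, node 3 maps to edge 6, nodes 4 through 7 map to edge 7, and nodes 8 and 9 map to edge 9, in agreement with the description in the text.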
Figure 3.
The correspondence between monotonic paths that do not cross above the y = x diagonal of an (n – 1) × (n – 1) square lattice and coalescent histories for a matching caterpillar gene tree and species tree with n = 10 leaves. The lower left corner represents the origin (0, 0). Monotonic paths from (0, 0) to (i, j) represent the partial coalescent histories Ai,j for (Gj, Si). Values ∣Ai,j∣ are taken from eq. 2, using (i, j) in place of (n, k). Species tree internal edges are read from left to right: AB labels the species tree internal edge from which A and B descend, and each successive label indicates the internal edge ancestral both to the leaf corresponding to the associated label and to the caterpillar subtree containing all prior labels. Gene tree internal nodes are read in the same manner from bottom to top. The monotonic path shown in red indicates the locations on the species tree of the gene tree coalescences of a specific coalescent history: coalescences (A, B) and ((A, B), C) occur above species tree node (((A, B), C), D), coalescence (((A, B), C), D) occurs above species tree node ((((((A, B), C), D), E), F), G), coalescences ((((A, B), C), D), E), (((((A, B), C), D), E), F), ((((((A, B), C), D), E), F), G), and (((((((A, B), C), D), E), F), G), H) occur above species tree node (((((((A, B), C), D), E), F), G), H), and coalescences ((((((((A, B), C), D), E), F), G), H), I) and (((((((((A, B), C), D), E), F), G), H), I), J) occur above the species tree root.
The bijection between coalescent histories and monotonic paths generates a set of values of ∣Ai,j∣ that considers each i and j with 0 ⩽ i, j ⩽ n – 1 and i ⩾ j. These values can be depicted in a lattice so that the value ∣Ai,j∣ is associated with the coordinate of lattice point (i, j) (Figure 3). Indeed, they correspond exactly to the entries of Catalan’s triangle (eq. 3), with i in the role of n and j in the role of k.
The construction takes advantage of the caterpillar shape of both gene tree and species tree. Because internal nodes of a caterpillar tree can be placed in order with each entry descended from the next until the root is reached, simply stating the next leaf label suffices to specify the leaves descended from the next internal node. Movement from left to right in Figure 3 indicates movement from the cherry of the species tree toward the root, and movement from bottom to top indicates coalescence in the gene tree.
3.2. Non-matching gene trees and species trees
Our key insight is that a version of the construction of Degnan [2] linking coalescent histories and monotonic paths applies even if the gene tree and species tree are non-matching, provided that both continue to be caterpillars. Coalescent histories for non-matching caterpillars can be associated with roadblocked monotonic paths that do not cross above the y = x diagonal of an (n – 1) × (n – 1) square lattice.
Definition 8. In a lattice, a roadblocked monotonic path is a monotonic path that is not permitted to pass through certain specified lattice points. We term these lattice points roadblocks.
Consider a caterpillar gene tree G and a caterpillar species tree S, whose leaves are both bijectively associated with the same set of n leaves, but that do not necessarily match. As in Section 3.1, we associate points on the x-axis of an (n – 1) × (n – 1) lattice with species tree internal edges in S, and we associate points on the y-axis with gene tree internal nodes in G. We continue to label internal nodes of G and S in increasing order from 1 to n – 1, from the cherry to the root, indexing the gene tree internal nodes by j and the species tree internal nodes by i.
As is true in the matching case, for each j from 1 to n – 1, each coalescent history must have h(j) ⩾ j, as a gene tree internal node j must map to a species tree internal edge ancestral to at least as many leaves as descend from node j in G. Hence, each coalescent history for (G, S) corresponds to a monotonic path that has j ⩽ i and hence does not cross the y = x diagonal of the lattice. However, an additional constraint is imposed by the fact that G and S do not necessarily match.
Given G and S, let π(G) denote the permutation of the gene tree leaf labels g = (g1, g2,…, gn) represented by the species tree leaf labels s = (s1, s2,…, sn). The action of π sends the vector of leaf labels from one n-tuple to another, and we denote the index in S of gk, the kth label of G, by πk(G).
For the leaf labels g1, g2,…, gn in G, let f(gk) denote the minimal internal edge of S ancestral to leaf sπk(G), the species tree leaf with label gk. For a matching gene tree and species tree (G, S), π is the identity permutation so that πk(G) = k; we then have f(g1) = f(g2) = 1, and f(gk) = k – 1 for 3 ⩽ k ⩽ n.
For general (G, S) that do not necessarily match, by Definition 5, (i) if k = 1 or k = 2, then f(gk) = maxℓ∈{1,2} πℓ(G) – 1, and (ii) if 3 ⩽ k ⩽ n, then f(gk) = maxℓ∈{1, 2,…, k} πℓ(G) – 1. This rule encodes the fact that a gene tree coalescence can occur only on a species tree edge ancestral to all species tree leaves labeled by the elements of the set of labels for leaves descended from the gene tree coalescence.
Consider the partial coalescent histories Ai,j with i ⩾ j. As in Section 3.1, for j ⩾ 1, ∣Ai,j∣ = 0 for all (i, j) with i < j. For each j from 1 to n – 1, the minimal internal edge of S that is ancestral to all leaves labeled by labels of leaves of G that descend from gene tree internal node j is f(gj+1). Therefore, for j ⩾ 1, we have ∣Ai,j∣ = 0 for all (i, j) with i < f(gj+1). Note that these (i, j) are the only roadblocks: for j ⩾ 1, f(gj+1) ⩾ j, as f(gj+1) is one less than the maximum of j + 1 distinct elements of {1, 2,…, n}; this maximum is at least j + 1, so that the constraint i < f(gj+1) subsumes the constraint i < j. For j ⩾ 1, because ∣Ai,j∣ = 0 for all lattice points (i, j) with i < f(gj+1), all such points are roadblocks.
We also note that for 1 ⩽ j ⩽ j′ ⩽ n – 1, f(gj′+1) ⩾ f(gj+1). The set of descendant leaves of internal node j′ of G contains as a subset the descendant leaves of internal node j of G. Hence, the minimal internal edge of S ancestral to all labels that label leaves descended from internal node j′ of G has an index at least as great as the corresponding internal edge of S associated with internal node j of G. Consequently, if (i, j) is a roadblock, then because i < f(gj+1) and f(gj′+1) ⩾ f(gj+1) for j′ ⩾ j, we can conclude that (i, j′) is a roadblock for each j′ with j ⩽ j′ ⩽ i.
As in Section 3.1, each partial coalescent history in Ai,j is formed in one of two ways. For j ⩾ 1, gene tree node j is mapped either to species tree internal edge i, or to one of the edges 1, 2,…,i – 1. The former case produces ∣Ai,j–1∣ partial coalescent histories, and the latter produces ∣Ai–1,j∣. Hence, the recursion ∣Ai,j∣ = ∣Ai,j–1∣ + ∣Ai–1,j∣ is still satisfied. We still have the constraints ∣Ai,j∣ = 0 for j ⩾ 1 and i < j, ∣Ai,0∣ = 1 for j = 0 and 0 ⩽ i ⩽ n – 1, and ∣Ai,j∣ = 0 for all (i, j) that do not satisfy 0 ⩽ i, j ⩽ n – 1. We also have the new constraint ∣Ai,j∣ = 0 for all (i, j) that satisfy i < f(gj+1).
The set of roadblocks for (G, S) is defined by BG,S = {(i, j) ∣ 1 ⩽ j ⩽ i ⩽ n – 1 and i < f(gj+1)}. We have therefore demonstrated the following proposition.
Proposition 9. Consider a caterpillar gene tree G and a caterpillar species tree S with n leaves. (G, S) can be associated with a set of roadblocks BG,S such that the coalescent histories for (G, S) bijectively correspond to roadblocked monotonic paths that do not cross the y = x diagonal of an (n – 1) × (n – 1) lattice.
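The enumeration implied by Proposition 9 can be sketched as a short dynamic program over the lattice. The sketch assumes that G and S are given by canonical label vectors g and s, computes f(gj+1) as one less than the largest index in s among the first j + 1 labels of g, and applies the recursion ∣Ai,j∣ = ∣Ai,j–1∣ + ∣Ai–1,j∣ with roadblocks at the points with i < f(gj+1). The function names are hypothetical.

```python
def roadblock_thresholds(g, s):
    """Return {j: f(g_{j+1})} for j = 1..n-1."""
    pos = {label: idx + 1 for idx, label in enumerate(s)}   # 1-based index in S
    n = len(g)
    return {j: max(pos[x] for x in g[: j + 1]) - 1 for j in range(1, n)}

def count_histories(g, s):
    """Number of roadblocked monotonic paths from (0, 0) to (n-1, n-1)
    that stay on or below the diagonal."""
    n = len(g)
    f = roadblock_thresholds(g, s)
    A = [[0] * n for _ in range(n)]          # A[i][j] stands for |A_{i,j}|
    for i in range(n):
        A[i][0] = 1                          # empty gene tree: one partial history
    for i in range(1, n):
        for j in range(1, i + 1):            # entries with i < j stay 0
            if i < f[j]:
                continue                     # (i, j) is a roadblock
            A[i][j] = A[i][j - 1] + A[i - 1][j]
    return A[n - 1][n - 1]

print(count_histories("ABCDEFGHIJ", "ABCDEFGHIJ"))   # matching case
print(count_histories("ABDC", "ABCD"))               # C and D exchanged
```

The table has (n – 1)² entries beyond the base cases, so the computation takes time quadratic in n once the thresholds f(gj+1) are known.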
By definition of BG,S, we immediately see that if (i, j) is a roadblock for 1 ⩽ j ⩽ i ⩽ n – 1, then (k, j) is a roadblock as well for each k with j ⩽ k ⩽ i. We can also see that if (i, j) is a roadblock for 1 ⩽ j ⩽ i ⩽ n – 1, then (i, ℓ) is a roadblock as well for each ℓ with j ⩽ ℓ ⩽ i; this result follows from the fact that f(gj′+1) ⩾ f(gj+1) for 1 ⩽ j ⩽ j′ ⩽ n – 1. We have the following remark.
Remark 10. Consider a caterpillar gene tree G and a caterpillar species tree S with n leaves. The roadblock set BG,S consists of a set of points (i, j) with 1 ⩽ j ⩽ i ⩽ n – 1 such that if (i, j) ∈ BG,S, then (i) (k, j) ∈ BG,S for all k with j ⩽ k ⩽ i, and (ii) (i, ℓ) ∈ BG,S for all ℓ with j ⩽ ℓ ⩽ i.
Figure 4 illustrates the correspondence between coalescent histories and roadblocked monotonic paths. In Figure 4, we have (f(g1), f(g2), f(g3), f(g4), f(g5), f(g6), f(g7), f(g8), f(g9)) = (5, 5, 5, 5, 6, 8, 8, 9, 9). Because f(g1+1) = 5, (4, 1) is a roadblock, as are (3, 1), (2, 1), and (1, 1) for the same reason ((i, j) is a roadblock if j ⩽ i < f(gj+1)). Because f(g2+1) = 5, (4, 2) is also a roadblock, as are (3, 2) and (2, 2). We can also identify (4, 2), (3, 2), and (2, 2) as roadblocks by Remark 10, as a consequence of the fact that (4, 1), (3, 1), and (2, 1) are roadblocks. Continuing through all (i, j), we identify 15 roadblocks in Figure 4.
Figure 4.
The correspondence between monotonic paths that do not cross above the y = x diagonal of an (n – 1) × (n – 1) square lattice and coalescent histories for a non-matching caterpillar gene tree and species tree with n = 10 leaves. Roadblocks are indicated by circles on lattice points; no roadblocked monotonic paths traverse the shaded regions. Otherwise, the figure design follows Figure 3.
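The recursion ∣Ai,j∣ = ∣Ai,j–1∣ + ∣Ai–1,j∣ with roadblocks can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the dictionary `f`, which maps each gene tree internal node j to the value f(gj+1), are assumptions introduced here. The example uses the values of f quoted above for Figure 4.

```python
def roadblocks(n, f):
    """Roadblocks (i, j) with j <= i < f[j], for j = 1..n-1.

    f[j] is assumed to hold f(g_{j+1}), the index of the minimal internal
    edge of S on which gene tree coalescence j can occur.
    """
    return [(i, j) for j in range(1, n) for i in range(j, n) if i < f[j]]


def count_roadblocked_paths(n, f):
    """Count monotonic paths from (0, 0) to (n-1, n-1) that stay on or below
    y = x and avoid all roadblocks, via |A_ij| = |A_i,j-1| + |A_i-1,j|."""
    a = [[0] * n for _ in range(n)]      # a[i][j] = |A_ij|; zero above the diagonal
    for i in range(n):
        a[i][0] = 1                      # |A_i0| = 1
    for j in range(1, n):
        for i in range(j, n):
            if i >= f[j]:                # otherwise (i, j) is a roadblock
                a[i][j] = a[i][j - 1] + a[i - 1][j]
    return a[n - 1][n - 1]


# Values of f quoted in the text for the n = 10 example of Figure 4.
figure4 = dict(enumerate((5, 5, 5, 5, 6, 8, 8, 9, 9), start=1))
# For a matching pair, f(g_{j+1}) = j and no roadblocks occur.
matching = {j: j for j in range(1, 10)}

print(len(roadblocks(10, figure4)))           # 15 roadblocks
print(count_roadblocked_paths(10, figure4))   # 235 roadblocked paths
print(count_roadblocked_paths(10, matching))  # 4862, the Catalan number C_9
```

Under these assumptions, the example reproduces the 15 roadblocks identified above, and the matching case recovers the Catalan number.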
From Proposition 9, we immediately obtain that the number of coalescent histories for (G, S) is given by the number of roadblocked monotonic paths that do not cross above the y = x diagonal of an (n – 1) × (n – 1) lattice, where the roadblocks are those in the set BG,S. We also obtain a simple proof of the following corollary, which appeared as Remark 15 of Degnan & Rhodes [3].
Corollary 11. Consider a caterpillar gene tree G and a caterpillar species tree S with n leaves. The number of coalescent histories for (G, S) is strictly greater for G = S than for each choice of G ≠ S.
Proof. By Proposition 9, coalescent histories for (G, S) correspond to roadblocked monotonic paths that do not cross the y = x diagonal of an (n – 1) × (n – 1) lattice.
For G = S, applying Lemma 7, the number of coalescent histories is the number of monotonic paths that do not cross the y = x diagonal. Adding a roadblock to the lattice necessarily reduces the number of monotonic paths from (0, 0) to (n – 1, n – 1), as each lattice point has at least one monotonic path that passes through it. Because the number of coalescent histories for (G, S) is equal to the number of roadblocked monotonic paths on the lattice, it suffices to show that for G ≠ S, at least one lattice point is a roadblock.
Because G ≠ S, there exists some internal node j of G at least one of whose descendant leaves has a label not contained in the label set of the leaves descended from internal node j of S. This leaf has j < f(gj+1). Hence, (j, j) is a roadblock, and (G, S) is associated with fewer monotonic paths than is (S, S). ■
3.3. Roadblock sets
Given a caterpillar species tree S, Remark 10 suggests a characterization of the possible sets of roadblocks, considering all caterpillar gene trees G. Each roadblock set has the property that within a row, all points to the left of a roadblock and on or below the y = x diagonal are also roadblocks. Within a column, all points above a roadblock and on or below the y = x diagonal are roadblocks.
Proposition 12. Consider a caterpillar species tree S with n leaves. For each caterpillar gene tree G with n leaves, denote its associated roadblock set by BG,S. Considering all n!/2 possible caterpillar gene trees, the distinct roadblock sets are bijectively associated with the Cn–1 monotonic paths on the (n – 1) × (n – 1) lattice that do not cross the y = x diagonal.
Proof. Consider a roadblock set BG,S. For each i from 1 to n – 2, we identify the largest j such that (i, j) is not a roadblock. Call this value ji. A unique monotonic path connects (0, 0), (1, j1), (2, j2),…, (n – 2, jn–2), (n – 1, n – 1): by Remark 10, for each i and each j > ji, (i, j) is either a roadblock or it lies above the y = x line. Hence, denoting j0 = 0 and jn–1 = n – 1, for each i from 1 to n – 1, a monotonic path from (i – 1, ji–1) to (i, ji) must proceed horizontally by length 1 and then vertically by length ji – ji–1.
To show that this construction is injective, note that distinct monotonic paths are associated with distinct roadblock sets: consider a point (i, ji) appearing in one monotonic path P1 but not in another one, P2. Because ji is the largest value of j that is not a roadblock for path P1, (i, ji) must be a roadblock for P2. For surjectivity, consider a monotonic path from (0, 0) to (n – 1, n – 1) that does not cross the y = x line. For each (i, ji) in the path, 1 ⩽ i ⩽ n – 2, where ji is the largest value of j for which the point (i, j) is in the path, we assign each point (i, ℓ) with ji < ℓ ⩽ i to be a roadblock. ■
Figure 5 provides an illustration of Proposition 12, showing how the monotonic path associated with a roadblock set is constructed and vice versa. The monotonic path associated with a roadblock set can be viewed as the monotonic path that comes as close as possible to the roadblocks. The roadblock set for a monotonic path is the set of points above and to the left of the path.
Figure 5.
The correspondence between roadblock sets, monotonic paths that do not cross above the y = x diagonal of an (n – 1) × (n – 1) square lattice, and Dyck paths of semi-length n – 1. Given a roadblock set, the associated monotonic path is constructed by identifying for each x coordinate from 0 to n – 1 the lattice point of greatest y coordinate that lies on or below the diagonal and is not a roadblock, and then constructing the unique monotonic path through those points. Similarly, given a monotonic path, its roadblock set is obtained by placing roadblocks at each lattice point on or below the diagonal that lies above and to the left of the path. (A) Roadblock set symmetric across the line y = n – 1 – x. (B) Roadblock set asymmetric across the line y = n – 1 – x. (C) Roadblock set asymmetric across the line y = n – 1 – x, obtained by reflecting the roadblocks in (B) over this line. (D) Symmetric Dyck path associated with the roadblock set in (A). (E) Asymmetric Dyck path associated with the roadblock set in (B). (F) Asymmetric Dyck path associated with the roadblock set in (C), obtained by reversing the Dyck path in (E). The roadblock sets in (B) and (C) both generate 235 monotonic paths from (0, 0) to (9, 9).
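The two directions of the construction in Proposition 12 can be sketched as follows. The function names are hypothetical, and the code assumes that its input is a roadblock set satisfying the two closure properties of Remark 10; it does not check them. The example roadblock set is built from the values of f quoted for Figure 4.

```python
def path_from_roadblocks(n, blocks):
    """Lattice points of the monotonic path that hugs a roadblock set."""
    top = [0]                                        # j_0 = 0
    for i in range(1, n - 1):
        # j_i: largest j <= i such that (i, j) is not a roadblock
        top.append(max(j for j in range(i + 1) if (i, j) not in blocks))
    top.append(n - 1)                                # j_{n-1} = n - 1
    points = [(0, 0)]
    for i in range(1, n):
        points.append((i, top[i - 1]))               # one horizontal step
        points.extend((i, j) for j in range(top[i - 1] + 1, top[i] + 1))
    return points


def roadblocks_from_path(n, points):
    """Roadblock set above and to the left of a monotonic path."""
    blocks = set()
    for i in range(1, n - 1):
        top = max(j for (x, j) in points if x == i)  # j_i for this column
        blocks.update((i, l) for l in range(top + 1, i + 1))
    return blocks


# Roadblocks (i, j) with j <= i < f(g_{j+1}), using the Figure 4 values of f.
f = (5, 5, 5, 5, 6, 8, 8, 9, 9)
blocks = {(i, j) for j, fj in enumerate(f, start=1) for i in range(j, fj)}
path = path_from_roadblocks(10, blocks)
print(roadblocks_from_path(10, path) == blocks)      # the round trip recovers the set
```

An empty roadblock set maps to the staircase path that touches the diagonal at every step, and the round trip returns the original roadblock set.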
The number of distinct caterpillar trees is n!/2, whereas the number of distinct roadblock sets is the smaller Cn–1. For a given caterpillar species tree, we can place the n!/2 caterpillar gene trees into equivalence classes, where two gene trees are said to be history-equivalent if and only if they are associated with the same roadblock set. Two history-equivalent caterpillar trees G1 and G2 have the same set of roadblocks and the same set of monotonic paths, and hence, the same set of coalescent histories, up to permutation of the leaf labels. These equivalence classes were termed history classes by Rosenberg & Tao [20], so that two caterpillars with the same roadblocks are in the same history class.
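As an illustration of history equivalence, the sketch below enumerates all leaf orderings of a caterpillar gene tree for small n and groups them by roadblock set. It rests on an assumption about the labeling that is inferred from Section 3.2 rather than stated there in this form: leaves of S are numbered 1 to n from the cherry toward the root, and a leaf at position p ⩾ 2 lies below internal edge p – 1 (position 1 shares edge 1 with position 2). The names `thresholds` and `history_classes` are hypothetical.

```python
from itertools import permutations
from math import comb


def thresholds(order):
    """order[k] is the position in S of the (k+1)-th leaf of G, reading G from
    the cherry toward the root. Entry j-1 of the result is the minimal internal
    edge of S ancestral to all leaves below gene tree internal node j."""
    deepest = max(order[0], 2)          # positions 1 and 2 both sit below edge 1
    result = []
    for pos in order[1:]:
        deepest = max(deepest, pos)
        result.append(deepest - 1)
    return tuple(result)


def history_classes(n):
    """Distinct threshold vectors, one per roadblock set (history class)."""
    return {thresholds(order) for order in permutations(range(1, n + 1))}


for n in (4, 5, 6):
    # Compare the number of classes with the Catalan number C_{n-1}.
    print(n, len(history_classes(n)), comb(2 * n - 2, n - 1) // n)
```

Each leaf ordering is visited twice, once for each order of the two cherry leaves, which does not affect the set of classes. Row j of the lattice contains roadblocks at i = j, …, tj – 1 when the threshold is tj, so distinct threshold vectors correspond to distinct roadblock sets.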
By Proposition 12, for a fixed species tree, the number of history classes considering all caterpillar trees is Cn–1; this result accords with the computation of 5 history classes for n = 4 [15, Table V] and 14 for n = 5 [20, Table 3]. We have also seen in Corollary 11 that Cn–1 is the largest possible number of coalescent histories for a pair of caterpillar trees. We now ask how many of the values 1, 2,…, Cn–1 can be the number of coalescent histories for some caterpillar gene tree and species tree.
The simplest upper bound on this quantity is Cn–1. To improve on this bound, it is convenient to use the bijection between monotonic paths that do not cross the y = x diagonal of the (n – 1) × (n – 1) lattice and Dyck paths of semi-length n – 1 [22, Corollary 6.3.2]. Each monotonic path represents a series of steps by (1, 0) or (0, 1) from (0, 0) to (n – 1, n – 1), with x ⩾ y at each step. Each Dyck path represents a series of steps by (1, 1) or (1, −1) from (0, 0) to (2n – 2, 0), with y ⩾ 0 at each step. The coalescent histories for (G, S) can therefore be associated with Dyck paths, where each up-step represents addition of a species in the species tree and each down-step represents a gene tree coalescence.
A Dyck path of semi-length n – 1 has 2n – 2 total up-steps and down-steps. The steps of Dyck paths can be written as a sequence, with U denoting up-steps and D denoting down-steps. A Dyck path can be reversed in the following manner: we take the sequence of U and D steps in the path, reverse the order of steps, and exchange the positions of U and D steps. Thus, a path UUUDUDDUDD becomes UUDUUDUDDD. Reversing a Dyck path corresponds to traversing the path in reverse order. A reversed Dyck path is itself a Dyck path; if the sequence of U and D steps in a Dyck path is reversed, then y ⩽ 0 at each step; exchanging the positions of the U and D steps reflects the path over the y = 0 axis.
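The reversal operation is easy to experiment with in code. In the sketch below, Dyck paths are strings over U and D, and monotonic paths are strings over E (a step by (1, 0)) and N (a step by (0, 1)); this encoding and the function names are choices made for illustration.

```python
def is_dyck(word):
    """True if the U/D word never dips below y = 0 and ends at y = 0."""
    height = 0
    for step in word:
        height += 1 if step == "U" else -1
        if height < 0:
            return False
    return height == 0


def reverse_dyck(word):
    """Reverse the order of the steps and exchange U and D."""
    swap = {"U": "D", "D": "U"}
    return "".join(swap[step] for step in reversed(word))


def to_dyck(steps):
    """Map a monotonic path (E = right, N = up) to a Dyck path (E -> U, N -> D)."""
    return steps.translate(str.maketrans("EN", "UD"))


word = "UUUDUDDUDD"
print(reverse_dyck(word))                 # UUDUUDUDDD, as in the text
print(is_dyck(reverse_dyck(word)))        # True
print(reverse_dyck(reverse_dyck(word)) == word)  # reversal is an involution
```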
Lemma 13. Consider a caterpillar species tree S with n leaves. Consider gene trees G1 and G2 with n leaves such that (i, j) is in the roadblock set BG1,S if and only if (n – 1 – j, n – 1 – i) is in the roadblock set BG2,S. Then (G1, S) and (G2, S) have the same number of coalescent histories.
Proof. We show that the coalescent histories for (G1, S) can be bijectively associated with the coalescent histories for (G2, S). Consider a coalescent history for (G1, S). Identify its associated monotonic path M1 according to Proposition 9, and identify the Dyck path P1 associated with this monotonic path. Reverse P1 to obtain $\overline{P}_1$, and identify the monotonic path $\overline{M}_1$ associated with $\overline{P}_1$.
Because M1 avoids each roadblock (i, j) in BG1,S, after i + j steps, P1 cannot have taken i up-steps and j down-steps. Because $\overline{P}_1$ is the reverse of P1, after 2n – 2 – i – j steps, $\overline{P}_1$ cannot have taken n – 1 – j up-steps and n – 1 – i down-steps. The monotonic path $\overline{M}_1$ therefore avoids the point (n – 1 – j, n – 1 – i) for each roadblock (i, j) in BG1,S. Hence, $\overline{M}_1$ avoids each roadblock in BG2,S, and it therefore represents a coalescent history for (G2, S). Similarly, beginning from the coalescent history for (G2, S) associated with $\overline{M}_1$, we find that M1 represents a coalescent history for (G1, S). ■
The lemma demonstrates that for two roadblock sets, if their roadblocks can be obtained by transforming each roadblock (i, j) of one into a roadblock (n – 1 – j, n – 1 – i) of the other, then the associated caterpillar gene trees have the same number of coalescent histories.
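A quick numerical check of Lemma 13 is sketched below: a roadblock set and its reflection (i, j) → (n – 1 – j, n – 1 – i) are passed to the same path-counting routine. The function names are hypothetical, and the example roadblock set is the one obtained from the values of f quoted for Figure 4.

```python
def count_paths(n, blocks):
    """Monotonic paths (0, 0) -> (n-1, n-1) on or below y = x avoiding blocks."""
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        a[i][0] = 1
    for j in range(1, n):
        for i in range(j, n):
            if (i, j) not in blocks:
                a[i][j] = a[i][j - 1] + a[i - 1][j]
    return a[n - 1][n - 1]


def reflect(n, blocks):
    """Reflect each roadblock across the line y = n - 1 - x."""
    return {(n - 1 - j, n - 1 - i) for (i, j) in blocks}


f = (5, 5, 5, 5, 6, 8, 8, 9, 9)            # values of f for the n = 10 example
blocks = {(i, j) for j, fj in enumerate(f, start=1) for i in range(j, fj)}
mirror = reflect(10, blocks)

print(mirror != blocks)                    # True: this roadblock set is asymmetric
print(count_paths(10, blocks), count_paths(10, mirror))  # 235 235
```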
Consider a set of points B on or below the y = x diagonal of the first quadrant of the (n – 1) × (n – 1) lattice (and not on lines y = 0 or x = n – 1) with the property that if (i, j) ∈ B, then (k, j) ∈ B for all k with j ⩽ k ⩽ i and (i, ℓ) ∈ B for all ℓ with j ⩽ ℓ ⩽ i. By Proposition 12, given a caterpillar species tree, B is the roadblock set for some caterpillar gene tree. We term such a set a caterpillar-friendly roadblock set.
Definition 14. Consider a caterpillar-friendly roadblock set B for the (n – 1) × (n – 1) lattice. We say that B is symmetric if for each (i, j) ∈ B, (n – 1 – j, n – 1 – i) is also in B. Otherwise, B is asymmetric.
In a symmetric caterpillar-friendly roadblock set, when the points in the roadblock set are reflected across the line y = n – 1 – x, the same roadblock set is obtained (Figure 5A). For an asymmetric caterpillar-friendly roadblock set, a different roadblock set is obtained by this reflection (Figure 5B and 5C).
For the (n – 1) × (n – 1) lattice, denote by Qn–1 and Rn–1 the numbers of symmetric and asymmetric caterpillar-friendly roadblock sets, respectively. By Lemma 13, the asymmetric caterpillar-friendly roadblock sets can be partitioned into disjoint pairs such that the associated caterpillar gene trees for the two entries in a pair give rise to the same number of coalescent histories. Hence, considering all caterpillar gene trees and species trees, the number of distinct values possible for the number of coalescent histories is bounded above by Qn–1 + Rn–1/2, or because Qn–1 + Rn–1 = Cn–1, by (Cn–1 + Qn–1)/2.
We obtain Qn–1 by counting all ways of placing roadblocks (i, j) with i + j ⩽ n – 1. By symmetry we then assign points (n – 1 – j, n – 1 – i) to be roadblocks as well. Because of the bijection between roadblock sets and monotonic paths (Proposition 12), each set of roadblocks (i, j) with i + j ⩽ n – 1 is bijectively associated with a monotonic path from (0, 0) to a point (i, n – 1 – i) for some i with 0 ⩽ i ⩽ n – 1.
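Lemma 15. The value of Qn–1 is $\binom{n-1}{\lfloor (n-1)/2 \rfloor}$.
Proof. Using eq. 3, the number of monotonic paths from (0, 0) to (i, n – 1 – i) for some i with 0 ⩽ i ⩽ n – 1, where only points with i ⩾ n – 1 – i lie on or below the diagonal, is obtained by the sum
$$Q_{n-1} = \sum_{i=\lceil (n-1)/2 \rceil}^{n-1} \left[ \binom{n-1}{i} - \binom{n-1}{i+1} \right] = \sum_{i=\lceil (n-1)/2 \rceil}^{n-1} \binom{n-1}{i} - \sum_{i=\lceil (n-1)/2 \rceil}^{n-1} \binom{n-1}{i+1}.$$
The first sum gives $2^{n-2} + \frac{1}{2}\binom{n-1}{(n-1)/2}$ for odd n, and $2^{n-2}$ for even n. The second sum gives $2^{n-2} - \frac{1}{2}\binom{n-1}{(n-1)/2}$ for odd n, and $2^{n-2} - \binom{n-1}{(n-2)/2}$ for even n. Combining these cases, the result follows. ■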
This result appeared in Bonin et al. [1, Theorem 2.5] as the number of distinct first halves for Dyck paths, and in Deng et al. [7, Theorem 4.2] as the number of Dyck paths invariant under reversal.
Proposition 16. The size of the set of values that can equal the number of coalescent histories for at least one pair (G, S) consisting of an n-leaf caterpillar gene tree G and an n-leaf caterpillar species tree S is bounded above by Tn–1 = (Cn–1 + Qn–1)/2, or
$$T_{n-1} = \frac{1}{2}\left[ \frac{1}{n}\binom{2n-2}{n-1} + \binom{n-1}{\lfloor (n-1)/2 \rfloor} \right].$$
This quantity, which appeared in a bijectively related context in Bonin et al. [1, Theorem 4.2], gives the number of distinct Dyck paths up to reversal. Numerical values of the formulas in Lemma 15 and Proposition 16 are shown in Table 1.
Table 1.
The number of distinct values possible for the number of coalescent histories of a caterpillar gene tree and a caterpillar species tree.
| Number of leaves n | Number of distinct roadblock sets | Number of roadblock sets associated with symmetric Dyck paths | Number of roadblock sets associated with asymmetric Dyck paths | Upper bound on the number of distinct values for the number of coalescent histories | Exact number of distinct values for the number of coalescent histories |
|---|---|---|---|---|---|
| Notation | Cn–1 | Qn–1 | Rn–1 | Tn–1 | |
| Formula | $\frac{1}{n}\binom{2n-2}{n-1}$ | $\binom{n-1}{\lfloor (n-1)/2 \rfloor}$ | Cn–1 – Qn–1 | (Cn–1 + Qn–1)/2 | |
| OEIS record | A000108 | A001405 | A306292 | A007123 | |
| 2 | 1 | 1 | 0 | 1 | 1 |
| 3 | 2 | 2 | 0 | 2 | 2 |
| 4 | 5 | 3 | 2 | 4 | 4 |
| 5 | 14 | 6 | 8 | 10 | 10 |
| 6 | 42 | 10 | 32 | 26 | 21 |
| 7 | 132 | 20 | 112 | 76 | 56 |
| 8 | 429 | 35 | 394 | 232 | 154 |
| 9 | 1430 | 70 | 1360 | 750 | 440 |
| 10 | 4862 | 126 | 4736 | 2494 | 1373 |
| 11 | 16796 | 252 | 16544 | 8524 | 4310 |
| 12 | 58786 | 462 | 58324 | 29624 | 13925 |
The table considers all pairs of labeled caterpillar trees, both matching and non-matching, that are possible for a given label set.
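The closed-form columns of Table 1 can be regenerated with a short script. The sketch assumes that Qn–1 is the central binomial coefficient with upper index n – 1 and lower index ⌊(n – 1)/2⌋, which agrees with OEIS A001405 and with the tabulated values; the function names are introduced here for illustration. The exact counts in the last column are not reproduced, as they require exhaustive enumeration.

```python
from math import comb


def catalan(m):
    """Catalan number C_m."""
    return comb(2 * m, m) // (m + 1)


def symmetric_sets(n):
    """Q_{n-1}: number of symmetric caterpillar-friendly roadblock sets."""
    return comb(n - 1, (n - 1) // 2)


def asymmetric_sets(n):
    """R_{n-1} = C_{n-1} - Q_{n-1}."""
    return catalan(n - 1) - symmetric_sets(n)


def upper_bound(n):
    """T_{n-1} = (C_{n-1} + Q_{n-1}) / 2."""
    return (catalan(n - 1) + symmetric_sets(n)) // 2


for n in range(2, 13):
    print(n, catalan(n - 1), symmetric_sets(n), asymmetric_sets(n), upper_bound(n))
```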
4. Non-recursive enumeration of coalescent histories
With the correspondence between coalescent histories for non-matching caterpillars and roadblocked monotonic paths established, we now turn to enumerating the coalescent histories of possibly non-matching caterpillar gene trees and species trees. We can do so recursively by enumerating roadblocked monotonic paths according to Proposition 9; we can also obtain a non-recursive formula by applying eq. 1.
Without loss of generality, considering the two subtrees immediately descended from the root of a tree, we treat the left subtree as having a number of leaves greater than or equal to that of the right subtree. The right subtree of a caterpillar tree then has a single leaf, so that in eq. 1, the right subtree GR always has exactly one leaf in each successive step of the recursion. Hence, the term BGR,T(GR,S),k+d(GR,S) follows the base case of the recursion and is equal to 1. Eq. 1, describing the number of coalescent histories for a caterpillar gene tree G and a species tree S, then reduces to
$$B_{G,S,m} = \sum_{k=1}^{m} B_{G_L,\,T(G_L,S),\,k+d(G_L,S)}, \tag{7}$$
with initial condition BG,S,m = 1 for all m when G has a single leaf.
If S is also a caterpillar tree with n leaves, then we can iterate the recursion n – 1 times, at each step reducing the size of the left subtree GL by one, until GL has a single leaf, the base case applies, and the summand equals 1. Each iteration introduces a new summation, with its upper limit depending on the associated d(GL, S), the number of edges that separate the root of T(GL, S) from the root of S. Continuing to label internal nodes of G from 1 to n – 1 in increasing order from the cherry to the root, we associate internal node j of G with index kn–j. Setting the integer parameter m equal to 1, we have
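$$B_{G,S,1} = \sum_{k_1=1}^{1} \sum_{k_2=1}^{k_1+c_{n-2}} \sum_{k_3=1}^{k_2+c_{n-3}} \cdots \sum_{k_{n-1}=1}^{k_{n-2}+c_1} 1, \tag{8}$$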
where the constant cj represents the number of additional edges of S that are possible locations for gene tree coalescence j but that are not possible for gene tree coalescence j + 1.
For 1 ⩽ j ⩽ n – 1, consider gene tree internal node j. Let Lj be the set of labels for all j + 1 leaves descended from j. Following the definitions in eq. 1, let Tj(G, S) denote the smallest subtree of S that has the property that each label in Lj labels one of its leaves, and let dj denote the number of edges separating the root of Tj(G, S) from the root of S. Then dj + 1 gives the number of edges of S on which gene tree coalescence j can occur (the +1 represents the root edge of S). The quantity uj = n – 1 – j – dj, equal to the number of edges of S ancestral to at least j + 1 leaves (or n – j) but on which gene tree coalescence j cannot occur, represents the number of roadblocks (i, j) with fixed j and i ⩾ j.
For j = 1, 2,…, n–2, the desired quantity cj, the number of additional edges of S available for coalescence j but not for coalescence j + 1, equals cj = dj – dj+1. We have therefore shown the following proposition.
Proposition 17. Consider a caterpillar gene tree G and a caterpillar species tree S with n leaves. The number of coalescent histories for (G, S) is obtained by eq. 8, where the vector (c1, c2,…, cn–2) is obtained as a function c(G, S) that depends only on the topologies of G and S.
Note that if G and S match, then for each j from 1 to n – 1, Gj = Tj(G, S), and hence dj = n – 1 – j, uj = 0, and no roadblocks occur. We have cj = 1 for each j from 1 to n – 2, and eq. 8 becomes
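$$B_{G,S,1} = \sum_{k_1=1}^{1} \sum_{k_2=1}^{k_1+1} \sum_{k_3=1}^{k_2+1} \cdots \sum_{k_{n-1}=1}^{k_{n-2}+1} 1,$$
equal to the Catalan number Cn–1 [16, Theorem 3.4].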
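We take as an example the gene tree and species tree in Figure 4. We report the values of the uj, dj, and cj in Table 2. The number of coalescent histories is
$$\sum_{k_1=1}^{1} \sum_{k_2=1}^{k_1} \sum_{k_3=1}^{k_2+1} \sum_{k_4=1}^{k_3} \sum_{k_5=1}^{k_4+2} \sum_{k_6=1}^{k_5+1} \sum_{k_7=1}^{k_6} \sum_{k_8=1}^{k_7} \sum_{k_9=1}^{k_8} 1 = 235.$$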
Table 2.
Quantities associated with the enumeration of coalescent histories for a caterpillar gene tree (((((((((A, F), B), C), D), G), I), H), J), E) and species tree (((((((((A, B), C), D), E), F), G), H), I), J).
| Internal node index in gene tree G (j) | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Summation index (n – j) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Number of roadblocks (uj) | 0 | 1 | 1 | 2 | 1 | 1 | 2 | 3 | 4 |
| Distance between root of Tj (G, S) and root of S (dj) | 0 | 0 | 1 | 1 | 3 | 4 | 4 | 4 | 4 |
| Nodes possible for coalescence j but not j + 1 (cj = dj – dj+1) | NA | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 |
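| Summation term | $\sum_{k_1=1}^{1}$ | $\sum_{k_2=1}^{k_1}$ | $\sum_{k_3=1}^{k_2+1}$ | $\sum_{k_4=1}^{k_3}$ | $\sum_{k_5=1}^{k_4+2}$ | $\sum_{k_6=1}^{k_5+1}$ | $\sum_{k_7=1}^{k_6}$ | $\sum_{k_8=1}^{k_7}$ | $\sum_{k_9=1}^{k_8}$ |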
We can also obtain this result by recursive summation of roadblocked monotonic paths (Figure 4).
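The nested summation of eq. 8 can be evaluated without writing out n – 1 loops by tracking, for each value of the current summation index, how many index tuples lead to it. The sketch below follows the reading of eq. 8 suggested by the surrounding text, in which the index for gene tree node j runs from 1 to the previous index plus cj and the outermost index takes only the value 1; the function names are hypothetical. The example uses the dj row of Table 2.

```python
def c_from_d(d):
    """d = (d_1, ..., d_{n-1}); returns (c_1, ..., c_{n-2}) with c_j = d_j - d_{j+1}."""
    return tuple(d[j] - d[j + 1] for j in range(len(d) - 1))


def count_histories_eq8(c):
    """Evaluate the nested sum for c = (c_1, ..., c_{n-2})."""
    weights = {1: 1}                      # the outermost index k_1 equals 1
    for c_j in reversed(c):               # c_{n-2} is used first, c_1 last
        nxt = {}
        for k_prev, w in weights.items():
            for k in range(1, k_prev + c_j + 1):
                nxt[k] = nxt.get(k, 0) + w
        weights = nxt
    return sum(weights.values())


d_table2 = (4, 4, 4, 4, 3, 1, 1, 0, 0)    # d_1, ..., d_9 from Table 2
c_table2 = c_from_d(d_table2)
print(c_table2)                           # (0, 0, 0, 1, 2, 0, 1, 0)
print(count_histories_eq8(c_table2))      # 235
print(count_histories_eq8((1,) * 8))      # 4862 = C_9 for a matching pair
```

The value 235 agrees with a direct count of roadblocked monotonic paths on the lattice for the same example.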
By exhaustive use of Proposition 17, we can evaluate all possible values of the number of coalescent histories for the n!/2 caterpillar gene trees with n leaves. This exhaustive computation applies eq. 8 with all possible vectors (c1, c2,…, cn–2) that correspond to gene trees; in other words, the Cn–1 vectors of nonnegative integers with $\sum_{k=j}^{n-2} c_k \leq n-1-j$ for each j from 1 to n – 2 [23, replacing ai in item 81 with 1 – ck].
The upper bound from Proposition 16 on the number of distinct values for the number of coalescent histories is relatively tight for small n, but is already more than double the exact computation for n = 12 (Table 1). The smallest case in which the number of distinct values (21) differs from the upper bound (26) occurs with n = 6 leaves, in which 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 14, 16, 17, 19, 22, 23, 26, 28, 32, and 42 are achievable values for the number of coalescent histories. Values 5, 9, 10, 14, and 19 are each achieved with two distinct sets of roadblocks that are not equivalent when reversing their associated Dyck paths.
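The exhaustive computation can be sketched for small n. The enumeration below assumes that the admissible vectors (c1,…, cn–2) are exactly the nonnegative integer vectors whose suffix sums satisfy cj + … + cn–2 ⩽ n – 1 – j; this follows from dj = cj + … + cn–2 together with uj = n – 1 – j – dj ⩾ 0. The function names and the evaluation routine are illustrative.

```python
def c_vectors(n):
    """Yield all (c_1, ..., c_{n-2}) with c_j >= 0 and c_j + ... + c_{n-2} <= n-1-j."""
    def extend(suffix, total):
        j = n - 2 - len(suffix)            # index of the entry chosen next
        if j == 0:
            yield suffix
            return
        for c_j in range(n - j - total):   # c_j = 0, ..., n-1-j-total
            yield from extend((c_j,) + suffix, total + c_j)
    yield from extend((), 0)


def count_histories(c):
    """Nested sum in which each index runs from 1 to the previous index plus c_j."""
    weights = {1: 1}
    for c_j in reversed(c):
        nxt = {}
        for k_prev, w in weights.items():
            for k in range(1, k_prev + c_j + 1):
                nxt[k] = nxt.get(k, 0) + w
        weights = nxt
    return sum(weights.values())


vectors = list(c_vectors(6))
values = sorted({count_histories(c) for c in vectors})
print(len(vectors))   # 42 vectors, one per roadblock set
print(len(values))    # 21 distinct numbers of coalescent histories
print(values)
```

For n = 6, the sketch visits 42 vectors and returns the 21 achievable values listed above.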
Because Qn–1 ≪ Cn–1, the upper bound for the number of distinct values for the number of coalescent histories of a caterpillar pair is asymptotically equivalent to Cn–1/2, half the maximum number of coalescent histories for caterpillars. Thus, although the number of caterpillars n!/2 grows much faster than the maximal number of coalescent histories Cn–1, asymptotically only at most half the values in the range of possible values for the number of coalescent histories are achieved by actual caterpillar gene trees.
5. Special families of caterpillar gene trees and species trees
From Propositions 9 and 17, we can obtain a variety of corollaries that describe the number of coalescent histories for special pairs of non-matching caterpillar trees. For certain classes of pairs, the number of coalescent histories can be obtained in closed form.
5.1. Nearest-neighbor-interchange
For a fixed caterpillar species tree S, we first consider caterpillar gene trees G that differ from S by a single nearest-neighbor-interchange move (NNI). We have the following result.
Proposition 18. Consider a caterpillar species tree S with n leaves and a caterpillar gene tree G that differs from S by an NNI move. (i) The roadblock set BG,S consists of a single point (i, i) on the diagonal of the square lattice, for some i with 1 ⩽ i ⩽ n – 2. (ii) The number of coalescent histories for (G, S) is Cn–1 – CiCn–1–i.
Proof. We use the bijection between coalescent histories and roadblocked monotonic paths (Proposition 9). We label leaves on the trees from 1 to n as in Section 3.2, using the permutation π to map the leaves of G to the leaves of S. By Definition 1, an NNI move exchanges a single pair of leaves labeled k and k + 1 in G for some k ∈ {2, 3,…, n–1}, or it exchanges leaves 1 and 3. Let ks be the smaller of the two labels for the leaves participating in the NNI move, and let kℓ be the larger of the two labels. We then have πks(G) = ks + 1 and πks+1(G) = ks if ks ∈ {2, 3,…, n – 1}, and π1(G) = 3 and π3(G) = 1 if ks = 1.
(i) Following Section 3.2, for (G, S) differing by one NNI move, the minimal internal edge of S ancestral to leaf ks of G is f(gks) = ks if 2 ⩽ ks ⩽ n – 1, and f(g1) = 2 if ks = 1. The roadblocks (i, j) in the square lattice are those points that satisfy i < f(gj+1). By construction, (ks – 1, ks – 1) is the only roadblock if 2 ⩽ ks ⩽ n – 1, and (1, 1) is the only roadblock if ks = 1.
(ii) The number of coalescent histories for (G, S) is the number of coalescent histories for the case of no roadblocks, or Cn–1 (Lemma 7), minus the number of monotonic paths from (0, 0) to (n – 1, n – 1) that do not cross the diagonal and that pass through the roadblock. For a roadblock at (i, i), this latter quantity is CiCn–1–i, multiplying the number of monotonic paths Ci from (0, 0) to (i, i) that do not cross the diagonal by the number of monotonic paths from (i, i) to (n – 1, n – 1) that do not cross the diagonal. ■
Figure 6A illustrates the result of Proposition 18 with a pair (G, S) that differ by a single NNI move. The number of coalescent histories for the example is 4274, as obtained by Proposition 9. Using Proposition 18, we see that 4274 = C9 – C4C5 = 4862 – 14 × 42.
Figure 6.
The number of coalescent histories for caterpillar gene tree G and species tree S that differ by nearest-neighbor-interchange (NNI) moves. (A) G and S differ by a single NNI move. (B) G and S differ by multiple disjoint NNI moves.
Extending Proposition 18, we can establish a formula for the number of coalescent histories when the permutation π performs k NNI moves in such a way that π consists of disjoint cycles of length 2. The roadblock set for (G, S) with such a permutation contains k points on the diagonal of the (n – 1) × (n – 1) lattice. For this roadblock set, we count the monotonic paths that do not cross the diagonal by use of the inclusion–exclusion principle.
Proposition 19. Consider a caterpillar species tree S with n leaves and a caterpillar gene tree G that differs from S by k disjoint NNI moves. Then (i) the roadblock set BG,S consists of k distinct points (ij, ij) on the diagonal of the square lattice, j ∈ {1, 2,…, k}, with 0 < i1 < … < ik < n – 1. (ii) The number of coalescent histories for (G, S) can be written
$$C_{n-1} + \sum_{\ell=1}^{k} (-1)^{\ell} \sum_{1 \leq j_1 < \cdots < j_\ell \leq k} C_{i_{j_1}} \left( \prod_{r=1}^{\ell-1} C_{i_{j_{r+1}} - i_{j_r}} \right) C_{n-1-i_{j_\ell}}. \tag{9}$$
Proof. As in Proposition 18, we rephrase a problem of enumerating coalescent histories in the language of roadblocked monotonic paths.
(i) Because the k NNI moves are disjoint, we can apply Proposition 18i sequentially k times, once for each NNI move. Each of the k NNI moves is associated with a roadblock on the diagonal of the square lattice, the location of which is determined by the identity of the pair of leaves exchanged: if ks is the smaller of the two labels for leaves of G participating in the move, then the roadblock location is (ks – 1, ks – 1) for 2 ⩽ ks ⩽ n – 1 and (1, 1) for ks = 1. Label the roadblocks (i1, i1), (i2, i2),…, (ik, ik), with 0 < i1 < i2 < … < ik < n – 1.
(ii) The total number of monotonic paths on the (n – 1) × (n – 1) square lattice that do not cross the diagonal is Cn–1 (Section 2.4). To obtain the desired quantity, we must subtract from Cn–1 the number of paths that pass through at least one of the k roadblocks. By the inclusion–exclusion principle, this quantity can be written as a sum over nonempty subsets β of the k roadblocks of the number of monotonic paths that pass through all roadblocks in β. In particular, for each j from 1 to k, denoting by βj the set of monotonic paths that pass through roadblock (ij, ij), the number of monotonic paths that pass through at least one roadblock is
$$\left| \bigcup_{j=1}^{k} \beta_j \right| = \sum_{\ell=1}^{k} (-1)^{\ell+1} \sum_{1 \leq j_1 < \cdots < j_\ell \leq k} \left| \beta_{j_1} \cap \cdots \cap \beta_{j_\ell} \right|.$$
The cardinality ∣βj1 ⋂ ⋯ ⋂ βjℓ∣, representing the number of monotonic paths that pass through (0, 0), (ij1, ij1), …, (ijℓ, ijℓ), and (n – 1, n – 1), is a product of Catalan numbers, one for each pair of consecutive points that must be traversed:
$$\left| \beta_{j_1} \cap \cdots \cap \beta_{j_\ell} \right| = C_{i_{j_1}} C_{i_{j_2} - i_{j_1}} \cdots C_{i_{j_\ell} - i_{j_{\ell-1}}} C_{n-1-i_{j_\ell}}.$$
The result then follows. ■
Figure 6B illustrates Proposition 19 for a case with three disjoint NNI moves. In this example, roadblocks in a 10-leaf tree appear at (1, 1), (4, 4), and (8, 8). By Proposition 19, the number of coalescent histories is C9 – (C1C8 + C4C5 + C8C1) + (C1C3C5 + C1C7C1 + C4C4C1) – (C1C3C4C1) = 2179.
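The inclusion–exclusion count for roadblocks on the diagonal can be sketched as follows. The function name and argument convention, a list of the values i for roadblocks (i, i), are assumptions made for the example; each subset of roadblocks contributes a signed product of Catalan numbers, one factor per gap between consecutive points that a path must visit.

```python
from itertools import combinations
from math import comb


def catalan(m):
    """Catalan number C_m."""
    return comb(2 * m, m) // (m + 1)


def count_with_diagonal_roadblocks(n, diag):
    """Paths on the (n-1) x (n-1) lattice, on or below y = x, avoiding the
    diagonal roadblocks (i, i) for i in diag, where 0 < i < n - 1."""
    total = catalan(n - 1)                          # empty subset: all paths
    for size in range(1, len(diag) + 1):
        for subset in combinations(sorted(diag), size):
            stops = (0,) + subset + (n - 1,)        # points the paths must visit
            term = 1
            for a, b in zip(stops, stops[1:]):
                term *= catalan(b - a)              # paths between consecutive stops
            total += (-1) ** size * term
    return total


print(count_with_diagonal_roadblocks(10, [4]))        # 4274, a single NNI move
print(count_with_diagonal_roadblocks(10, [1, 4, 8]))  # 2179, three disjoint NNI moves
```

The loop visits all 2^k subsets of the k roadblocks, which is manageable because k is smaller than n; for large k, the lattice recursion of Proposition 9 is the cheaper option.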
Proposition 18 additionally has the consequence that for a fixed caterpillar species tree S, the largest number of coalescent histories seen for a non-matching caterpillar gene tree G occurs for a gene tree that differs from S by a single NNI move.
Corollary 20. Consider a caterpillar species tree S with n leaves. Considering all possible caterpillar gene trees G ≠ S with n leaves, (i) the largest number of coalescent histories for (G, S) is obtained when the roadblock set BG,S consists of the single point (n/2 – 1, n/2 – 1) or (n/2, n/2) for even n, or ((n – 1)/2, (n – 1)/2) for odd n. (ii) It equals Cn–1 – C⌊(n–1)/2⌋C⌈(n–1)/2⌉.
Proof. First, note that for each pair (G, S) whose roadblock set BG,S has more than one point, we can identify a pair (G′, S) whose roadblock set consists of a single point in BG,S and that hence has at least as many coalescent histories as (G, S). By Remark 10, the roadblock in a roadblock set consisting of a single point must be located on the diagonal.
Applying Proposition 18ii, for the caterpillar gene tree G that maximizes the number of coalescent histories with fixed S, that number of coalescent histories must equal Cn–1 – CiCn–1–i for some i with 1 ⩽ i ⩽ n – 2. By Corollary 3.11 of Rosenberg [16], for fixed n, this quantity is maximized when i = ⌊(n – 1)/2⌋ or i = ⌈(n – 1)/2⌉.
From the proof of Proposition 18, for fixed S, this maximum is associated with the gene trees G that differ from S in that the leaves abutting the middle coalescence in the path from cherry to root of G are transposed; the case of n odd has one such coalescence and the case of n even has two. ■
Corollary 20 extends Corollary 11 by giving the exact number of coalescent histories for the pair (G, S) that has the largest number of coalescent histories among non-matching caterpillars. For odd n, the number of coalescent histories in Corollary 20 is $C_{n-1} - C_{(n-1)/2}^2$. For example, for S = ((((A, B), C), D), E) and G = ((((A, B), D), C), E), the number of coalescent histories is C4 – C2C2 = 14 – 2 × 2 = 10. For even n, the number of coalescent histories in Corollary 20 is Cn–1 – Cn/2–1Cn/2. For example, for S = (((((A, B), C), D), E), F), both G = (((((A, B), D), C), E), F) and G = (((((A, B), C), E), D), F) have C5 – C2C3 = 42 – 2 × 5 = 32 coalescent histories. Figure 6A gives the largest number of coalescent histories among non-matching caterpillars with n = 10 leaves.
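The location of the best single roadblock can be checked numerically by scanning the formula of Proposition 18ii over all diagonal positions. The sketch below is illustrative only, with a hypothetical function name, and assumes n ⩾ 3 so that at least one non-matching caterpillar exists.

```python
from math import comb


def catalan(m):
    """Catalan number C_m."""
    return comb(2 * m, m) // (m + 1)


def best_single_roadblock(n):
    """Largest value of C_{n-1} - C_i C_{n-1-i} over 1 <= i <= n-2, together
    with the positions i at which the maximum is attained."""
    counts = {i: catalan(n - 1) - catalan(i) * catalan(n - 1 - i)
              for i in range(1, n - 1)}
    best = max(counts.values())
    return best, sorted(i for i, value in counts.items() if value == best)


print(best_single_roadblock(5))    # (10, [2])
print(best_single_roadblock(6))    # (32, [2, 3])
print(best_single_roadblock(10))   # (4274, [4, 5])
```

The maximum is attained at one middle position for odd n and at two for even n, as stated in the proof of Corollary 20.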
We can quickly observe that the largest number of coalescent histories among discordant caterpillar trees grows at the same rate as the number of coalescent histories for matching caterpillars.
Corollary 21. Considering all non-matching caterpillar pairs (G, S) with n leaves, the largest number of coalescent histories for (G, S) is asymptotic to Cn–1.
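Proof. Using Stirling’s approximation, $n! \sim \sqrt{2\pi n}\,(n/e)^n$, we can verify that $C_n \sim 4^n/(n^{3/2}\sqrt{\pi})$.
If n is odd, then the largest number of coalescent histories for a non-matching pair satisfies
$$C_{n-1} - C_{(n-1)/2}^2 \sim \frac{4^{n-1}}{(n-1)^{3/2}\sqrt{\pi}} - \frac{8 \cdot 4^{n-1}}{(n-1)^{3}\pi}.$$
If n is even, then
$$C_{n-1} - C_{n/2-1}C_{n/2} \sim \frac{4^{n-1}}{(n-1)^{3/2}\sqrt{\pi}} - \frac{4^{n-1}}{[(n/2-1)(n/2)]^{3/2}\pi}.$$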
In both cases, the leading term dominates, and Cn–1 – C⌊(n–1)/2⌋C⌈(n–1)/2⌉ ~ Cn–1. ■
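The asymptotic statement can be illustrated numerically by computing the exact ratio of the largest non-matching count to Cn–1 for increasing n. This is a sanity check under the formula of Corollary 20, not a substitute for the proof; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def catalan(m):
    """Catalan number C_m."""
    return comb(2 * m, m) // (m + 1)


def max_nonmatching_ratio(n):
    """Exact ratio (C_{n-1} - C_floor C_ceil) / C_{n-1}, with floor and ceil
    taken of (n - 1) / 2."""
    m = n - 1
    lower, upper = m // 2, m - m // 2
    return Fraction(catalan(m) - catalan(lower) * catalan(upper), catalan(m))


for n in (3, 10, 50, 200):
    print(n, float(max_nonmatching_ratio(n)))
```

The printed ratios increase toward 1; the subtracted term is smaller than the leading term by a factor on the order of n^(3/2).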
5.2. Reverse incrementation of the leaf labels
Next, for a fixed caterpillar species tree S, we consider gene trees G that differ from S by incrementation.
Consider a caterpillar species tree S with n leaves and a caterpillar gene tree G that differs from S by an incrementation. By definition, the leaves of some component G′ of G and S′ of S differ by cyclic permutation. Recall that an incrementation with two labels is an NNI move.
Proposition 22. If G is obtained by a reverse incrementation of S, then the roadblock set BG,S consists of a set of consecutive points on the diagonal of the square lattice.
Proof. Consider labels ks, kℓ ∈ {1, 2,…, n}, with ks < kℓ and kℓ ≠ 2. By definition of reverse incrementation, for some component of G with leaves sequentially labeled ks, ks + 1,…, kℓ from the cherry toward the root, associated leaves of S are labeled πks(G) = ks + 1, πks+1(G) = ks + 2,…, πkℓ–1(G) = kℓ, πkℓ(G) = ks.
As in the proof of Proposition 18, we compute the minimal internal edge of S ancestral to each leaf gk of G, k ∈ {ks, ks + 1,…, kℓ}. We obtain f(gk) = k if 2 ⩽ k ⩽ n and f(g1) = 2 if k = 1.
The roadblocks are the points (i, j) satisfying i < f(gj+1). We therefore find that the roadblocks are precisely those points (ks – 1, ks – 1),…, (kℓ – 2, kℓ – 2) if ks > 1 and (1, 1),…, (kℓ – 2, kℓ – 2) if ks = 1. ■
Because all roadblocks lie on the diagonal under reverse incrementation, eq. 9 can be applied to count coalescent histories. In the application of eq. 9, the distinct points on the diagonal through which monotonic paths cannot pass are (ks – 1, ks – 1),…, (kℓ – 2, kℓ – 2) if ks > 1 and (1, 1),…, (kℓ – 2, kℓ – 2) if ks = 1.
For example, in Figure 7A, the reverse incrementation of leaf labels C, D, and E has ks = 3 and kℓ = 5, so that the roadblocks lie at (2, 2) and (3, 3). The number of coalescent histories is obtained by eq. 9 as C9 – (C2C7 + C3C6) + C2C1C6 = 3608.
Figure 7.
The number of coalescent histories for reverse incrementations. (A) G differs from S by a reverse incrementation. (B) G differs from S by a reverse incrementation that includes all labels. (C) G differs from S by a composition of multiple disjoint reverse incrementations.
If the reverse incrementation permutes all the labels, then all points (1, 1),…, (n – 2, n – 2) are roadblocks, and the number of coalescent histories is the number of monotonic paths not crossing the diagonal that lies one unit below the y = x line (Figure 7B). As the number of monotonic paths that do not pass above a diagonal of a square lattice, this computation gives Cn–2 coalescent histories. At the same time, the inclusion–exclusion computation of eq. 9 produces a sum that traverses all subsets of the points (1, 1),…, (n – 2, n – 2).
Thus, by use of eq. 9, this construction gives a combinatorial proof of a Catalan number identity.
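Corollary 23. The Catalan number Cn–2 can be written as an alternating sum of products of Catalan numbers, where the sum proceeds over all compositions of n – 1:
$$C_{n-2} = \sum_{k=1}^{n-1} (-1)^{k-1} \sum_{\substack{a_1 + \cdots + a_k = n-1 \\ a_1, \ldots, a_k \geqslant 1}} \prod_{i=1}^{k} C_{a_i}.$$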
This identity can be seen as counting Dyck paths of semi-length n – 1 with no internal returns to the origin in two ways. Cn–2 gives the number of Dyck paths of semi-length n – 2, as a Dyck path of semi-length n – 1 with no internal returns begins with an up-step that is followed by a Dyck path of semi-length n – 2 and then a down-step. The right-hand side instead uses the inclusion–exclusion principle to perform the computation by excluding Dyck paths of semi-length n – 1 that have at least one internal return to the origin.
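The identity of Corollary 23 can be checked numerically for small n. The sketch below assumes the reading in which a composition (a_1, …, a_k) of n – 1 contributes the product of the C_{a_i} with sign (–1)^{k–1}, the sign pattern that the inclusion–exclusion argument produces. The helper names are illustrative.

```python
from math import comb


def catalan(m):
    """Catalan number C_m."""
    return comb(2 * m, m) // (m + 1)


def compositions(total):
    """Yield every composition (ordered tuple of positive parts) of `total`."""
    if total == 0:
        yield ()
        return
    for first in range(1, total + 1):
        for rest in compositions(total - first):
            yield (first, *rest)


def alternating_catalan_sum(n):
    """Sum over compositions (a_1, ..., a_k) of n - 1 of
    (-1)^(k-1) * C_{a_1} * ... * C_{a_k}."""
    total = 0
    for parts in compositions(n - 1):
        product = 1
        for a in parts:
            product *= catalan(a)
        total += (-1) ** (len(parts) - 1) * product
    return total


if __name__ == "__main__":
    for n in range(2, 11):
        print(n, alternating_catalan_sum(n), catalan(n - 2))
```

Because there are 2^(n–2) compositions of n – 1, the check is exponential in n and is meant only as a sanity test.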
Interestingly, a reverse cycle that permutes all labels, even if it is not an incrementation, gives a Catalan number of coalescent histories, as it generates a roadblock set that consists of one or more diagonal lines. For example, with S = (((((((((A, B), C), D), E), F), G), H), I), J), the reverse incrementation G = (((((((((B, C), D), E), F), G), H), I), J), A) gives C8 = 1430 coalescent histories (Figure 7B), the reverse cycle G = (((((((((C, D), E), F), G), H), I), J), A), B) gives C7 = 429 coalescent histories, the reverse cycle G = (((((((((D, E), F), G), H), I), J), A), B), C) gives C6 = 132 coalescent histories, and so on.
We note also that eq. 9 continues to apply if S differs from G by multiple disjoint reverse incrementations, as in Figure 7C, which adds a two-leaf incrementation (an NNI move) to Figure 7A. In this case, the roadblocks lie at (2, 2), (3, 3), and (7, 7), and the number of coalescent histories is C9 – (C2C7 + C3C6 + C7C2) + (C2C1C6 + C2C5C2 + C3C4C2) – C2C1C4C2 = 3002.
5.3. Forward incrementation of the leaf labels
In the case that G represents a forward rather than a reverse incrementation of S, the roadblocks appear in a triangular region rather than exclusively on the diagonal of the square lattice.
Proposition 24. If G is obtained by forward incrementation of S, then the roadblock set BG,S consists of a triangle of points on and below the diagonal of the square lattice.
Proof. Consider labels ks, kℓ ∈ {1, 2, …, n}, with ks < kℓ and kℓ ≠ 2. By definition of forward incrementation, for some component of G with leaves sequentially labeled ks, ks + 1, …, kℓ from the cherry toward the root, associated leaves of S are labeled πks (G) = kℓ, πks + 1(G) = ks, πks+2(G) = ks + 1, …, πkℓ (G) = kℓ – 1.
We use Proposition 9 and compute the minimal internal edge of S ancestral to each leaf gk of G, k ∈ {ks, ks + 1, …, kℓ}. We obtain f(gk) = kℓ – 1.
The roadblocks are the points (i, j) satisfying i < f(gj+1). Hence, the roadblocks are points (ks – 1, ks – 1),…, (kℓ – 2, ks – 1), (ks, ks), …,(kℓ – 2, ks), …, (kℓ – 2, kℓ – 2). ■
We can use Catalan’s trapezoids to count coalescent histories for forward incrementations, noting that every monotonic path from (0, 0) to (n – 1, n – 1) passes through exactly one point on a diagonal line from the lower right corner of the triangle of roadblocks, (kℓ – 2, ks – 1) for 2 ⩽ ks ⩽ n – 1 and (kℓ – 2, 1) for ks = 1, to the bottom edge or right edge of the lattice (Figure 8A).
Figure 8.
The number of coalescent histories for forward incrementations. (A) G differs from S by a forward incrementation. All paths must pass through the dashed red line. (B) The number of paths from (4, 1) on the dashed red line to (9, 9). (C) The number of paths from (5, 0) on the dashed red line to (9, 9). (D) G differs from S by a composition of two forward incrementations. All paths must pass through the four dashed red lines. The solid line represents the Dyck path associated with the roadblock set (see Figure 5).
If 2 ⩽ ks ⩽ n – 1, then this line has points (kℓ – 1 + c, ks – 2 – c) for c = 0, 1, …, min(ks – 2, n – kℓ); if ks = 1, then the line has a single point (kℓ – 1, 0). We can combine the two cases with the Kronecker delta, capturing the line with the expression (kℓ – 1 + c, ks – 2 + δks,1 – c) for c = 0, 1, …, min(ks – 2 + δks,1, n – kℓ).
We can then count monotonic paths from (0,0) to some point on the line and from there to (n – 1, n – 1).
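Proposition 25. Consider a caterpillar species tree S with n leaves and a caterpillar gene tree G that differs from S by a forward incrementation described by the component ks, …, kℓ of G. The number of coalescent histories for (G, S) can be written
$$\sum_{c=0}^{\min(k_s - 2 + \delta_{k_s,1},\; n - k_\ell)} D(k_\ell - 1 + c,\; k_s - 2 + \delta_{k_s,1} - c)\; D_{k_\ell - k_s + 2 - \delta_{k_s,1} + 2c}(n - k_\ell - c,\; n - k_s + 1 - \delta_{k_s,1} + c),$$
where functions D and Dm follow eqs. 3 and 5, respectively.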
Proof. Each roadblocked monotonic path from (0, 0) to (n – 1, n – 1) proceeds through exactly one point on the diagonal associated with the forward incrementation. The number of paths to arrive at that point from (0, 0) is tabulated by Catalan’s triangle (eq. 3), and the number of paths from that point to (n – 1, n – 1) by Catalan’s trapezoids (eq. 5). Summing the product of the two counts over the points of the diagonal gives the result. ■
Figure 8A provides an example. In the figure, ks = 3 and kℓ = 5, so that each roadblocked monotonic path must pass through (4, 1) or (5, 0). Figure 8B illustrates the Catalan trapezoid from (4, 1) to (9, 9), and Figure 8C illustrates the Catalan trapezoid from (5, 0) to (9, 9). Because the number of paths from (0, 0) to (4, 1) is 4 and the number of paths from (0, 0) to (5, 0) is 1, the number of coalescent histories is 4 × 572 + 1 × 429 = 2717. This value is returned by the proposition, which gives D(4, 1)D4(5, 8) + D(5, 0)D6(4, 9) = 2717.
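The count in this example can be cross-checked by direct enumeration on the lattice rather than by Catalan's triangle and trapezoids. The sketch below builds the triangular roadblock set given in the proof of Proposition 24 and counts roadblocked monotonic paths by dynamic programming. The names `forward_roadblocks` and `count_paths` are illustrative, and the handling of ks = 1 (lowest roadblock row at height 1) is an assumption taken from the corner point (kℓ – 2, 1) stated in the text.

```python
def forward_roadblocks(ks, kl):
    """Roadblocks for a forward incrementation of labels ks, ..., kl:
    rows j = ks-1, ..., kl-2, with row j holding (j, j), ..., (kl-2, j).
    For ks = 1 the lowest row is taken to be j = 1 (assumption, see lead-in)."""
    lowest_row = max(ks - 1, 1)
    return {(i, j) for j in range(lowest_row, kl - 1) for i in range(j, kl - 1)}


def count_paths(start, end, blocked=frozenset()):
    """Count monotonic paths (unit right and up steps) from `start` to `end`
    that never rise above the line y = x and avoid every point in `blocked`."""
    (x0, y0), (x1, y1) = start, end
    ways = {start: 1}
    for x in range(x0, x1 + 1):
        for y in range(y0, min(x, y1) + 1):
            if (x, y) == start:
                continue
            if (x, y) in blocked:
                ways[(x, y)] = 0
                continue
            # A point is reached from its left neighbor or from the point below it.
            ways[(x, y)] = ways.get((x - 1, y), 0) + ways.get((x, y - 1), 0)
    return ways.get(end, 0)


if __name__ == "__main__":
    blocked = forward_roadblocks(3, 5)
    print(sorted(blocked))
    print(count_paths((0, 0), (9, 9), blocked))
    # The two terms of the decomposition through (4, 1) and (5, 0).
    print(count_paths((0, 0), (4, 1)) * count_paths((4, 1), (9, 9)))
    print(count_paths((0, 0), (5, 0)) * count_paths((5, 0), (9, 9)))
```

The dynamic program touches each lattice point once, so it runs in time quadratic in n and can serve as a reference against which the closed-form sums are compared.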
Note that we can analyze cases with multiple disjoint forward incrementations by identifying their associated negatively sloping diagonals through which all monotonic paths must pass. The number of coalescent histories can be obtained by a nested sum counting monotonic paths that pass through exactly one point on each diagonal. Changing the perspective to consider the Dyck path associated with the roadblock set for a composition of disjoint forward incrementations, each peak in the Dyck path generates a diagonal, and we can tabulate monotonic paths that pass through points on each of these diagonals.
For example, in Figure 8D, the Dyck path associated with the roadblock set has four peaks. All monotonic paths from (0, 0) to (9, 9) must pass through two of these, at (1, 0) and (9, 8). The other two peaks generate diagonals through which all monotonic paths must pass, so that all paths must pass through (4, 1) or (5, 0) and through (8, 4) or (9, 3). The number of paths passing through (4, 1) and (8, 4) is D(4, 1)D4(4, 3)D5(1, 5) = 700; the number of paths through (4, 1) and (9, 3) is D(4, 1)D4(5, 2)D7(0, 6) = 84; the number through (5, 0) and (8, 4) is D(5, 0)D6(3, 4)D5(1, 5) = 175; and the number through (5, 0) and (9, 3) is D(5, 0)D6(4, 3)D7(0, 6) = 35. In total, the number of paths is 994.
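The nested sum in this example can be reproduced with a short script. The sketch below assumes, from the way the symbols are used in the worked example, that Dm(a, b) counts paths with a right-steps and b up-steps that start m – 1 units below the diagonal and never rise above it, and that D(x, y) is the case m = 1. The count is obtained by the reflection principle rather than by tabulating the trapezoid, and the names `trapezoid` and `paths_through` are illustrative. The product formula is valid only when no roadblock lies strictly between consecutive chosen points, as is the case here.

```python
from itertools import product
from math import comb


def trapezoid(m, right, up):
    """Paths with `right` right-steps and `up` up-steps that start m - 1 units
    below the diagonal y = x and never rise above it (reflection principle)."""
    if right < 0 or up < 0 or up - right >= m:
        return 0
    total = comb(right + up, up)
    # Paths that touch the line one unit above the diagonal are reflected
    # into paths with up - m up-steps.
    bad = comb(right + up, up - m) if up >= m else 0
    return total - bad


def paths_through(diagonals, n):
    """Sum, over one point chosen on each diagonal, of the product of the
    path counts for the segments (0, 0) -> ... -> (n - 1, n - 1)."""
    total = 0
    for choice in product(*diagonals):
        stops = [(0, 0), *choice, (n - 1, n - 1)]
        term = 1
        for (xa, ya), (xb, yb) in zip(stops, stops[1:]):
            term *= trapezoid(xa - ya + 1, xb - xa, yb - ya)
        total += term
    return total


if __name__ == "__main__":
    # Diagonals named in the text for Figure 8D.
    print(paths_through([[(4, 1), (5, 0)], [(8, 4), (9, 3)]], 10))
```

The number of terms is the product of the diagonal lengths, so the cost grows with the number of peaks of the associated Dyck path rather than with n – 1 nested indices.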
With this perspective, we can see that such an approach to enumeration applies to any Dyck path, not just those that represent disjoint forward incrementations: for every peak in the Dyck path, a diagonal list of points is generated through which each monotonic path from (0, 0) to (n – 1, n–1) must pass. We consider all possible choices of points, one on each diagonal, and tabulate paths through those points by use of Catalan’s triangle and Catalan’s trapezoids. For a general pair of caterpillar trees, such an approach can reduce the number of summations in eq. 8 from n – 1 to the number of peaks in the associated Dyck path.
The number of Dyck paths of semi-length n with exactly k peaks follows the Narayana numbers $N(n, k) = \frac{1}{n}\binom{n}{k}\binom{n}{k-1}$ [8, Section 6.1]. The mean number of peaks in a Dyck path chosen at random then follows $\frac{1}{C_n}\sum_{k=1}^{n} k\,N(n, k)$, which, by noting $k\binom{n}{k} = n\binom{n-1}{k-1}$ and applying eq. 5.23 in Table 169 of Graham et al. [11] to complete the summation, gives a mean of (n + 1)/2. Thus, because we consider semi-length n – 1, this approach reduces the mean number of nested summations from n – 1 in eq. 8 to n/2.
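The mean of (n + 1)/2 can be confirmed with exact arithmetic. The sketch below uses the standard closed form for the Narayana numbers, N(n, k) = (1/n) binom(n, k) binom(n, k – 1), and checks that they sum to the Catalan number before computing the mean. The function names are illustrative.

```python
from fractions import Fraction
from math import comb


def narayana(n, k):
    """Number of Dyck paths of semi-length n with exactly k peaks."""
    return comb(n, k) * comb(n, k - 1) // n


def mean_peaks(n):
    """Exact mean number of peaks over all Dyck paths of semi-length n."""
    counts = [narayana(n, k) for k in range(1, n + 1)]
    weighted = sum(k * c for k, c in enumerate(counts, start=1))
    return Fraction(weighted, sum(counts))


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, mean_peaks(n))
```

For a pair of caterpillar trees with n leaves, the relevant semi-length is n – 1, so the quantity of interest is `mean_peaks(n - 1)`.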
6. Discussion
We have studied coalescent histories for non-matching caterpillar gene trees and species trees, showing that as in the matching case, the number of coalescent histories for non-matching caterpillars can be computed using monotonic paths that do not cross the diagonal of a square lattice (Section 3). The recursion for the number of coalescent histories that applies for arbitrary gene trees and species trees simplifies for non-matching caterpillars to a non-recursive formula dependent only on the caterpillar topologies (Section 4). Using these results, we have counted coalescent histories for non-matching caterpillars differing by nearest-neighbor-interchange (Section 5.1). By studying reverse and forward incrementation, we have also counted coalescent histories for caterpillars differing by subtree-prune-and-regraft (Sections 5.2 and 5.3).
The bijection that connects coalescent histories and monotonic paths (Proposition 9) makes use of roadblocks, lattice points through which paths are not permitted to travel. Roadblocks occur such that if a point (i, j) is a roadblock for i ⩾ j, then (k, j) is also a roadblock for each k with j ⩽ k ⩽ i, as is (i, ℓ) for each ℓ with j ⩽ ℓ ⩽ i (Remark 10). Enumeration of roadblocked monotonic paths given a roadblock set connects to Catalan’s triangle and trapezoids, enabling enumeration of the associated coalescent histories. Interestingly, the distinct roadblock sets can themselves be put into bijection with the monotonic paths that do not cross the diagonal of a square lattice, so that their number also follows the Catalan sequence (Section 3.3).
Our construction linking coalescent histories and roadblocked monotonic paths enables a simple proof of a result of Degnan & Rhodes [3] that for a fixed number of leaves, matching caterpillar trees have more coalescent histories than do non-matching caterpillar trees (Corollary 11). In particular, the lattice construction that enumerates coalescent histories for a non-matching pair of caterpillar trees contains at least one roadblock, whereas the lattice for matching caterpillars has no roadblocks and therefore has more monotonic paths. For a fixed caterpillar species tree, we have identified exactly which non-matching caterpillar gene tree generates the most coalescent histories: it is immediate that this gene tree differs from the species tree by a single NNI move, as the caterpillars differing from the species tree by one NNI move are the only ones that produce only one roadblock. We find that the specific NNI move affecting leaves nearest the “middle” of the species tree generates the largest number of coalescent histories, and that as the number of leaves increases, this value is asymptotically equivalent to the Catalan number Cn–1 (Section 5.1).
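The dependence of the NNI count on the position of the move can be tabulated directly. The sketch below assumes that a single NNI move produces one roadblock on the diagonal at some point (k, k) with 1 ⩽ k ⩽ n – 2, so that the inclusion–exclusion count reduces to Cn–1 – CkCn–1–k. This reading is inferred from the single-roadblock description above and from the two-leaf incrementation of Section 5.2; the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def catalan(m):
    """Catalan number C_m."""
    return comb(2 * m, m) // (m + 1)


def nni_histories(n):
    """Map each roadblock position k (for the point (k, k)) to the number of
    paths avoiding it, C_{n-1} - C_k * C_{n-1-k} (assumed NNI count)."""
    return {k: catalan(n - 1) - catalan(k) * catalan(n - 1 - k)
            for k in range(1, n - 1)}


def best_nni_fraction(n):
    """Largest NNI count as an exact fraction of the matching count C_{n-1}."""
    return Fraction(max(nni_histories(n).values()), catalan(n - 1))


if __name__ == "__main__":
    for k, value in nni_histories(10).items():
        print(k, value)
    for n in (10, 50, 200):
        print(n, float(best_nni_fraction(n)))
```

Printing the fraction for increasing n illustrates the asymptotic equivalence with Cn–1: the subtracted product CkCn–1–k is smallest for k nearest the middle and becomes negligible relative to Cn–1 as n grows.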
The case in which the gene tree differs from the species tree by reverse incrementation produces an elegant result. Recalling that the number of coalescent histories for matching caterpillars is described by a Catalan number, if the gene tree is obtained by a reverse incrementation affecting all leaf labels of the species tree, then the number of coalescent histories is given by the next-smaller Catalan number (Section 5.2). The case of forward incrementation is more complex, but it can be analyzed using Catalan’s trapezoids and suggests further connections to the analysis of Dyck paths (Section 5.3).
This study provides some of the first systematic closed-form results concerning coalescent histories for non-matching gene trees and species trees. Our approach applies only to caterpillars, as the bijection with roadblocked monotonic paths relies on the fact that the internal nodes of a caterpillar tree can be placed in a sequence such that all pairs of internal nodes have an ancestor–descendant relationship. It does suggest, however, that connections to other combinatorial structures such as Dyck paths can assist in enumerating coalescent histories for non-matching gene trees and species trees beyond use of the recursion in eq. 1.
It remains an open question which integers can equal the number of coalescent histories for some caterpillar gene tree and species tree. Rosenberg & Degnan [19, Table 1] observed that for fixed species trees S of size n and certain values t, particularly small ones, large numbers of pairs (G, S) had exactly t coalescent histories, and Rosenberg [18] enumerated the pairs (G, S) with exactly 1 coalescent history (the lonely pairs). We and Degnan & Rhodes [3] have shown that if G and S are caterpillars, then only values t ⩽ Cn–1 can represent the number of coalescent histories. Our NNI results show that no value in the open interval (Cn–1 – C⌊(n–1)/2⌋C⌈(n–1)/2⌉, Cn–1) can be the number of coalescent histories for (G, S). For fixed caterpillar S with n leaves, it would be useful to obtain the size of the set of values of t for which some pair (G, S) has exactly t coalescent histories. We observed that the number of symmetric caterpillar-friendly roadblock sets plus half the number of asymmetric caterpillar-friendly roadblock sets provides an upper bound on this size (Section 3.3).
We note that the question of identifying the integers that represent the number of coalescent histories for some (G, S) with n leaves can be phrased entirely in terms of roadblocked monotonic paths without reference to coalescent histories. Describe a lattice as monotonically roadblocked if for each roadblock (i, j) with i ⩾ j, (k, j) is also a roadblock for each k with j ⩽ k ⩽ i, and (i, ℓ) is a roadblock for each ℓ with j ⩽ ℓ ⩽ i. We seek the number of integers that represent the number of monotonic paths that do not cross the diagonal of some monotonically roadblocked lattice. That the bijection between coalescent histories and roadblocked monotonic paths raises such questions illustrates that constructions enabled by this bijection can be fruitful for studies of the properties of the paths themselves.
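The closure condition in this definition is easy to state as a predicate. The following is a minimal sketch of such a check, with the hypothetical name `is_monotonically_roadblocked`; it tests only the condition written above and does not test whether a roadblock set arises from some pair of caterpillar trees.

```python
def is_monotonically_roadblocked(roadblocks):
    """Return True if, for every roadblock (i, j) with i >= j, the points
    (k, j) and (i, k) are also roadblocks for each k with j <= k <= i."""
    blocked = set(roadblocks)
    for i, j in blocked:
        if i < j:
            continue  # the condition is stated only for points with i >= j
        for k in range(j, i + 1):
            if (k, j) not in blocked or (i, k) not in blocked:
                return False
    return True


if __name__ == "__main__":
    # Triangle produced by a forward incrementation with ks = 3, kl = 5.
    print(is_monotonically_roadblocked({(2, 2), (3, 2), (3, 3)}))
    # A single off-diagonal point lacks the required row and column neighbors.
    print(is_monotonically_roadblocked({(3, 2)}))
```

A brute-force study of the open question could filter candidate roadblock sets with this predicate before counting paths on each surviving lattice.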
Acknowledgments.
We thank E. Allman, J. Degnan, F. Disanto, and J. Rhodes for helpful discussions. We acknowledge NIH grants R01 GM117590 and R01 GM131404 for support.
References
[1] Bonin J, de Mier A, and Noy M. Lattice path matroids: enumerative aspects and Tutte polynomials. J. Comb. Theory Ser. A, 104:63–94, 2003.
[2] Degnan JH. Gene tree distributions under the coalescent process. PhD thesis, University of New Mexico, Albuquerque, 2005.
[3] Degnan JH and Rhodes JA. There are no caterpillars in a wicked forest. Theor. Pop. Biol., 105:17–23, 2015.
[4] Degnan JH and Rosenberg NA. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol., 24:332–340, 2009.
[5] Degnan JH, Rosenberg NA, and Stadler T. The probability distribution of ranked gene trees on a species tree. Math. Biosci., 235:45–55, 2012.
[6] Degnan JH and Salter LA. Gene tree distributions under the coalescent process. Evolution, 59:24–37, 2005.
[7] Deng L-H, Deng Y-P, and Shapiro LW. The Riordan group and symmetric lattice paths. J. Shandong Univ., 50:82–89, 2015.
[8] Deutsch E. Dyck path enumeration. Discr. Math., 204:167–202, 1999.
[9] Disanto F and Rosenberg NA. Coalescent histories for lodgepole species trees. J. Comput. Biol., 22:918–929, 2015.
[10] Disanto F and Rosenberg NA. Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees. IEEE/ACM Trans. Comput. Biol. Bioinf., 13:913–925, 2016.
[11] Graham RL, Knuth DE, and Patashnik O. Concrete Mathematics. Addison-Wesley, Boston, 2nd edition, 2008.
[12] Maddison WP. Gene trees in species trees. Syst. Biol., 46:523–536, 1997.
[13] Pamilo P and Nei M. Relationships between gene trees and species trees. Mol. Biol. Evol., 5:568–583, 1988.
[14] Reuveni S. Catalan’s trapezoids. Prob. Eng. Inform. Sci., 28:353–361, 2014.
[15] Rosenberg NA. The probability of topological concordance of gene trees and species trees. Theor. Pop. Biol., 61:225–247, 2002.
[16] Rosenberg NA. Counting coalescent histories. J. Comput. Biol., 14:360–377, 2007.
[17] Rosenberg NA. Coalescent histories for caterpillar-like families. IEEE/ACM Trans. Comput. Biol. Bioinf., 10:1253–1262, 2013.
[18] Rosenberg NA. Enumeration of lonely pairs of gene trees and species trees by means of antipodal cherries. Adv. Appl. Math., 102:1–17, 2019.
[19] Rosenberg NA and Degnan JH. Coalescent histories for discordant gene trees and species trees. Theor. Pop. Biol., 77:145–151, 2010.
[20] Rosenberg NA and Tao R. Discordance of species trees with their most likely gene trees: the case of five taxa. Syst. Biol., 57:131–140, 2008.
[21] Stadler T and Degnan JH. A polynomial time algorithm for calculating the probability of a ranked gene tree given a species tree. Alg. Mol. Biol., 7:7, 2012.
[22] Stanley RP. Enumerative Combinatorics Volume 2. Cambridge University Press, New York, 1999.
[23] Stanley RP. Catalan Numbers. Cambridge University Press, Cambridge, 2015.
[24] Steel M. Phylogeny: Discrete and Random Processes in Evolution. Society for Industrial and Applied Mathematics, Philadelphia, 2016.
[25] Than C and Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Comp. Biol., 5:e1000501, 2009.
[26] Than C, Ruths D, Innan H, and Nakhleh L. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. J. Comput. Biol., 14:517–535, 2007.
[27] Wu Y. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution, 66:763–775, 2012.
[28] Wu Y. An algorithm for computing the gene tree probability under the multispecies coalescent and its application in the inference of population tree. Bioinformatics, 32:i225–i233, 2016.







