Author manuscript; available in PMC: 2020 Nov 12.
Published in final edited form as: J Math Biol. 2018 Aug 16;78(1-2):155–188. doi: 10.1007/s00285-018-1271-5

Enumeration of compact coalescent histories for matching gene trees and species trees

Filippo Disanto *, Noah A Rosenberg
PMCID: PMC7661175  NIHMSID: NIHMS1643740  PMID: 30116881

Abstract

Compact coalescent histories are combinatorial structures that describe for a given gene tree G and species tree S possibilities for the numbers of coalescences of G that take place on the various branches of S. They have been introduced as a data structure for evaluating probabilities of gene tree topologies conditioning on species trees, reducing computation time compared to standard coalescent histories. When a gene tree and species tree have a matching labeled topology G = S = t, the compact coalescent histories of t are encoded by particular integer labelings of the branches of t, each integer specifying the number of coalescent events of G present in a branch of S. For matching gene trees and species trees, we investigate enumerative properties of compact coalescent histories. We report a recursion for the number of compact coalescent histories for matching gene trees and species trees, using it to study the numbers of compact coalescent histories for small trees. We show that the number of compact coalescent histories equals the number of coalescent histories if and only if the labeled topology is a caterpillar or a bicaterpillar. The number of compact coalescent histories is seen to increase with tree imbalance: we prove that as the number of taxa n increases, the exponential growth of the number of compact coalescent histories follows $4^n$ in the case of caterpillar or bicaterpillar labeled topologies and approximately $3.3302^n$ and $2.8565^n$ for lodgepole and balanced topologies, respectively. We prove that the mean number of compact coalescent histories of a labeled topology of size n selected uniformly at random grows with $3.3750^n$. Our results contribute to the analysis of the computational complexity of algorithms for computing gene tree probabilities, and to the combinatorial study of gene trees and species trees more generally.

Keywords: Compact coalescent histories, gene trees, generating functions, phylogenetics, species trees

Mathematics Subject Classification (2010): 05A15, 05A16, 92B10, 92D15

1. Introduction

The study of the relationships between gene trees, which represent the histories of individual genomic regions, and species trees, representing the histories of populations of organisms, has generated new combinatorial structures (Maddison, 1997; Degnan and Salter, 2005; Rosenberg and Tao, 2008; Than and Nakhleh, 2009; Degnan et al., 2012; Wu, 2012, 2016; Degnan and Rhodes, 2015). Among these structures are coalescent histories, structures that for a given gene tree topology G and species tree S represent possible pairings of the coalescences in G with the branches of S on which the coalescences take place (Degnan and Salter, 2005; Rosenberg, 2007). The use of coalescent histories in calculations of the probability Prob(G|S) (Degnan and Salter, 2005) has motivated the study of the number of coalescent histories possible for a given gene tree topology and species tree topology (Degnan and Salter, 2005; Rosenberg, 2007, 2013; Than et al., 2007; Rosenberg and Degnan, 2010; Disanto and Rosenberg, 2015, 2016). A variety of enumerative results have been derived, primarily in the case in which gene trees and species trees have a matching labeled topology.

Building on the approach of Degnan and Salter (2005), Wu (2016) introduced compact coalescent histories as a tool for simplifying gene tree probability computations (see also Degnan and Rhodes, 2015). Given G and S, Wu’s “CompactCH” algorithm computes Prob(G|S) by grouping into equivalence classes two (or more) coalescent histories h1 and h2 when, in each branch of S, the numbers of coalescences of G specified by h1 and h2 are the same. The resulting equivalence classes are the compact coalescent histories, or compact histories for short. Certain intermediate computations in the probability formula of Degnan and Salter (2005) are identical for all coalescent histories with the same compact history, simplifying the probability computation.

Compact coalescent histories appear in sets over which sums are computed (e.g. Eq. 5 of Wu (2016)). Hence, for a given G and S, similarly to the way that evaluation of Prob(G|S) by the method of Degnan and Salter (2005) depends on the number of coalescent histories, the complexity of the evaluation of Prob(G|S) in CompactCH is affected by the number of compact coalescent histories possible for G and S. By studying this number, Wu (2016) showed that when the size of the species tree is fixed and multiple gene lineages can be sampled per species, CompactCH calculates gene tree probabilities in polynomial time in the number of gene lineages. The approach of Wu (2016) exchanges the slower summation of Degnan and Salter (2005) over all coalescent histories with a given compact history for a faster computation that requires only the number of such coalescent histories.

Here, permitting the size of the species tree to grow, we investigate the number of compact coalescent histories for gene trees and species trees with a matching labeled topology G = S = t. In particular, we measure how the growth of the number of compact coalescent histories of t is affected by its number of taxa and its topology. In Section 3, we present a recursion for the number of compact coalescent histories of a matching gene tree and species tree. Extending a result of Wu (2016)—whose supplement reported that when t has a caterpillar topology of size |t| = n, the number of compact coalescent histories of t equals the number of coalescent histories of t—we show that the number of compact coalescent histories of t equals its number of coalescent histories if and only if t is a caterpillar or bicaterpillar topology. Next, in Section 4, we study the number of compact coalescent histories when t belongs to each of several families of trees with different degrees of imbalance. We demonstrate that unlike in the caterpillar and bicaterpillar cases, the number of compact coalescent histories can be much smaller than the number of coalescent histories when t is not a caterpillar or bicaterpillar. Moreover, we show that when the number of taxa increases, the number of compact coalescent histories grows exponentially faster in the families of more unbalanced trees. Section 5 reports the mean number of compact coalescent histories for a random labeled topology t of given size drawn under a uniform distribution. Our results can assist in relating the complexity of algorithms for computing gene tree probabilities based on compact coalescent histories to those that use an evaluation based on other combinatorial structures, such as coalescent histories and ancestral configurations (Wu, 2012; Disanto and Rosenberg, 2017, 2018).

2. Preliminaries

We investigate the number of compact coalescent histories for rooted binary labeled trees. We recall basic features of tree structures in Section 2.1. In Section 2.2, we give properties of generating functions that will be used for counting compact coalescent histories.

2.1. Labeled topologies

A bifurcating rooted tree with labeled taxa (Fig. 1A) is termed a labeled topology, or “tree” for short. The size of a labeled topology t is its number of taxa |t|. We denote by [t] the unlabeled topology, or “tree shape,” underlying t. This shape is obtained by ignoring labels for the taxa of t.

Figure 1:

Coalescent histories for a gene tree and a species tree with a matching labeled topology G = S = t. (A) A coalescent history. Arrows map the internal nodes of t = ((a, b), ((c, d), (e, f))) to the branches of t. (B) The gene tree topology G = t realized in the matching species tree S = t according to the coalescent history in (A). The mapping in (A) specifies the branches of the species tree (thick lines) where the coalescent events of the gene tree (thin lines) take place.

Without loss of generality, we assume an alphabetical order a ≺ b ≺ c ≺ … over the set {a, b, c, …} of possible labels for the taxa of a labeled topology, using the first n labels for the leaves of a tree of size n.

As it is sometimes important to refer to internal nodes of a labeled topology, it is useful to assign distinct but arbitrary labels to these internal nodes. Unlike the taxon labels, the internal node labels need not be ordered. The labeling of internal nodes is merely a convenience that does not distinguish different trees, and only the taxon labels matter for determining whether two labeled topologies with the same unlabeled topology are distinct. In enumerating labeled topologies, only leaves are considered to be labeled.

We let Tn be the set of labeled topologies of size n. We will require two results concerning Tn.

Proposition 1 (Felsenstein, 1978) For n ≥ 1, the cardinality of $T_n$ is $(2n)!/[2^n (2n-1)\, n!]$.

Proposition 2 (Flajolet and Sedgewick, 2009, Example II.19) The generating function $T(z) = \sum_{t : |t| \geq 1} z^{|t|}/|t|! = \sum_{n=1}^{\infty} |T_n| z^n/n!$ of the sequence $|T_n|/n!$ satisfies $T(z) = 1 - \sqrt{1 - 2z}$.

2.2. Exponential growth and analytic combinatorics

One of our main goals is to evaluate features of the growth of sequences of non-negative integers. Following Flajolet and Sedgewick (2009), we recall a number of results concerning the asymptotic behavior of sequences.

Definition 3 A sequence of non-negative numbers $s_n$ is said to have exponential growth $k^n$, or equivalently, to have exponential order $k$, when $\limsup_{n \to \infty} (s_n)^{1/n} = \lim_{n \to \infty} \sup_{m \geq n} (s_m)^{1/m} = k$.

Equivalently, this relation can be written $s_n = k^n g(n)$, with $g$ a subexponential factor. If the value $k$ of the limit strictly exceeds 1, then the sequence $s_n$ grows exponentially in $n$.

By these definitions, if the exponential order $k_s$ of a sequence $s_n$ is strictly smaller than the exponential order $k_{\tilde{s}}$ of a sequence $\tilde{s}_n$, then the sequence of ratios $s_n/\tilde{s}_n$ converges to 0 exponentially fast as $(k_s/k_{\tilde{s}})^n$. If instead $s_n$ and $\tilde{s}_n$ have the same exponential order, then the increase or decrease of the sequence of ratios $s_n/\tilde{s}_n$ is at most polynomial in $n$, and we write $s_n \bowtie \tilde{s}_n$.

Some of our results will be obtained by applying methods of analytic combinatorics that concern singularities of generating functions (Sections IV and VI of Flajolet and Sedgewick (2009)). More precisely, entries of a sequence of integers $(s_n)_{n \geq 0}$ can be seen as coefficients $([z^n] f)_{n \geq 0}$ of the power series expansion $f(z) = \sum_{n=0}^{\infty} s_n z^n$ at $z = 0$ of a function $f(z)$, the generating function of the sequence. Considering $z$ as a variable in the complex plane $\mathbb{C}$, a correspondence exists between the dominant singularity $z = \rho$ of $f(z)$ (the singularity of smallest distance from the origin in $\mathbb{C}$) and the exponential growth of the coefficients $s_n$. In particular, for $n \to \infty$, the exponential order of the sequence $s_n$ is the inverse of the modulus of the dominant singularity of $f(z)$,

$$s_n = [z^n] f(z) \bowtie \left(\frac{1}{\rho}\right)^n. \qquad (1)$$

For instance, consider the generating function $T(z)$ of the sequence $|T_n|/n!$ (Proposition 2). Due to the branching character of the square root function $\sqrt{1 - 2z}$, $z = 1/2$ is the point of smallest modulus in the complex plane where $T(z)$ fails to be analytic. Hence, $z = 1/2$ is the dominant singularity of $T(z)$. Using Eq. 1, we have

$$\frac{|T_n|}{n!} \bowtie 2^n. \qquad (2)$$
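As a numerical illustration of Proposition 1 and Eq. 2, the following minimal Python sketch evaluates $|T_n|$ exactly and prints the $n$th root of $|T_n|/n!$, which by Eq. 2 should approach 2 from below as $n$ grows. The function names `num_labeled_topologies` and `nth_root_of_ratio` are illustrative choices, not notation from the paper.

```python
from math import factorial


def num_labeled_topologies(n: int) -> int:
    """|T_n| from Proposition 1: (2n)! / [2^n (2n - 1) n!]."""
    # The quotient is an integer, so exact integer division is safe here.
    return factorial(2 * n) // (2 ** n * (2 * n - 1) * factorial(n))


def nth_root_of_ratio(n: int) -> float:
    """(|T_n| / n!)^(1/n); Eq. 2 says this quantity tends to 2."""
    return (num_labeled_topologies(n) / factorial(n)) ** (1.0 / n)


if __name__ == "__main__":
    print([num_labeled_topologies(n) for n in range(1, 7)])
    for n in (10, 50, 100):
        print(n, nth_root_of_ratio(n))
```

The convergence to 2 is slow because of the subexponential factor $g(n)$ in $s_n = k^n g(n)$, which the $n$th root only gradually suppresses.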

3. Compact coalescent histories for matching gene trees and species trees

In this section, we define compact coalescent histories, and we provide a characterization of the compact coalescent histories of a gene tree and species tree (Section 3.1). Next, we report a recursion for the number of compact coalescent histories of a matching gene tree and species tree (Section 3.2), using this recursion to analyze the number of compact coalescent histories for small trees (Section 3.3). We provide a characterization of the trees for which the numbers of coalescent histories and compact coalescent histories are the same (Section 3.4).

We consider a gene tree labeled topology G and a species tree labeled topology S with the same set of leaf labels. The gene tree labeled topology represents the sampling of a single gene lineage in each of n ≥ 1 species.

A partial order can be placed on nodes and branches of a tree, where we write $k_2 \leq k_1$ for a pair of nodes $k_1$, $k_2$ if $k_2$ is descended from $k_1$ in $t$; we write $k_2 < k_1$ if $k_2$ is descended from $k_1$ and $k_1$, $k_2$ are distinct. We also write $b_2 \leq b_1$ if branch $b_2$ is descended from $b_1$ in $t$, and $b_2 < b_1$ if in addition, $b_1$, $b_2$ are distinct. A node or branch is trivially descended from itself.

Let tk be the subtree of t generated by node k, including the branch immediately ancestral to k. Let |tk| be the number of leaves in tk; we identify node k with the branch immediately ancestral to it, so that we also describe tk as the subtree generated by this branch.

3.1. A characterization of compact coalescent histories

We now formally define compact coalescent histories, recalling the definition of coalescent histories (e.g. Than et al., 2007; Rosenberg and Degnan, 2010).

Definition 4 Given a gene tree G and a species tree S, a coalescent history of (G, S) is a function h from the internal nodes of G to the internal branches of S, satisfying two conditions: (i) for each internal node k in G, all leaves descended from node k in G descend from branch h(k) in S; (ii) for all pairs of internal nodes k1 and k2 in G, if k2 is a descendant of k1 in G, then branch h(k2) is descended from branch h(k1) in S.

Here and in our subsequent analysis, we include the root of S as an internal node, and we consider that a branch broot of S exists that is ancestral to the root. Note that in condition (ii), h(k2) is permitted to equal h(k1).

In the case of a matching labeled topology G = S = t, a coalescent history can be regarded as being associated with the single tree t, and the conditions can be simplified: a coalescent history of t is a function h from the internal nodes of t to the internal branches of t satisfying: (i) for each internal node k in t, node k descends from branch h(k) in t; (ii) for all pairs of internal nodes k1 and k2 in t, if k2 is a descendant of k1 in t, then branch h(k2) is descended from branch h(k1) in t.

Coalescent histories (Fig. 1A) represent the topologically distinct configurations that a gene tree labeled topology G can assume in the branching structure of a species tree labeled topology S (Fig. 1B). A coalescent history specifies a possible list of the species tree branches on which the gene tree coalescent events occur.

Following Wu (2016), an equivalence can be defined over the set of coalescent histories for (G, S).

Definition 5 Consider a relation in which two coalescent histories h1, h2 of (G, S) are equivalent when, for each branch b of S, considering all internal nodes k in G, |{k : h1(k) = b}| = |{k : h2(k) = b}|. Each equivalence class of this relation is termed a compact coalescent history, or a compact history for short.

In this equivalence relation, h1 is equivalent to h2 when, in each branch of S, h1 and h2 have the same numbers of coalescent events (Fig. 2A). We represent a compact history of (G, S) by an integer labeling of the internal branches of S, the branch $b$ being labeled by the number $\ell_b$ of coalescent events in that branch (Fig. 2B). We denote by $m = m(h)$ the number $\ell_{\mathrm{root}}$ of coalescent events in the root branch $b_{\mathrm{root}}$ of compact history $h$.

Figure 2:

Equivalence classes of coalescent histories and compact coalescent histories for matching gene trees and species trees. (A) Two coalescent histories of the species tree G = S = t = ((a, b), ((c, d), (e, f))) in the same equivalence class. For each branch of t, the numbers of incoming arrows in the two coalescent histories, representing coalescences on the branch, are the same. (B) The compact coalescent history of the species tree t representing the equivalence class of the coalescent histories depicted in (A). The label for each branch corresponds to the number of incoming arrows in that branch.

Note that from a compact history, the numbers of lineages of G entering the branches of S from below and exiting them above can be extracted. Indeed, in Definition 5, we could instead define h1 and h2 to be equivalent if and only if for each branch of S, (i) h1 and h2 have the same numbers of entering lineages, and (ii) h1 and h2 have the same numbers of exiting lineages (Wu, 2016, Lemma 3.1). This alternative perspective is useful for computing the probability of the set of coalescent histories represented by the compact history, as gene tree probability computations rely on counts of entering and exiting lineages (Degnan and Salter, 2005; Wu, 2016).
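Definitions 4 and 5 can be illustrated for the matching case G = S = t by brute force. The sketch below is an illustration under stated assumptions, not the authors' implementation: a tree is a nested tuple with distinct string leaf labels, each internal node is identified with the branch immediately above it, and the names `coalescent_histories` and `compact_histories` are hypothetical. It enumerates all coalescent histories of t and groups them by the number of coalescences on each branch.

```python
from collections import Counter


def coalescent_histories(t):
    """Yield each coalescent history of G = S = t as a dict {node: branch}."""

    def rec(node, path):
        # `path` lists the branches that `node` may be mapped to, from the
        # highest permitted branch down to the branch above `node` itself.
        if not isinstance(node, tuple):
            yield {}  # leaves carry no coalescence
            return
        for i, image in enumerate(path):
            # Descendants must map at or below `image` (condition (ii)) and
            # at or above themselves (condition (i)).
            allowed = path[i:]
            left = list(rec(node[0], allowed + [node[0]]))
            right = list(rec(node[1], allowed + [node[1]]))
            for hl in left:
                for hr in right:
                    h = {node: image}
                    h.update(hl)
                    h.update(hr)
                    yield h

    yield from rec(t, [t])


def compact_histories(t):
    """Group coalescent histories by their per-branch coalescence counts."""
    classes = {}
    for h in coalescent_histories(t):
        key = frozenset(Counter(h.values()).items())
        classes.setdefault(key, []).append(h)
    return classes


if __name__ == "__main__":
    t = (("a", "b"), (("c", "d"), ("e", "f")))  # the tree of Figs. 1 and 2
    print(len(list(coalescent_histories(t))), len(compact_histories(t)))
```

For the tree of Figs. 1 and 2, this enumeration finds 26 coalescent histories in 24 equivalence classes, so exactly two classes contain a pair of histories, as in Fig. 2A.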

Let $(\ell_b)_b$ be an integer labeling of the internal branches of S, where $\ell_b$ is the label of branch $b$. We will also treat the label of a branch of S as the label of its immediate descendant node, so that the labeling is associated with both the internal branches and the internal nodes of S.

For branch $b$ of S, let $G_b$ be the set of all internal nodes $k$ in G with the following pair of properties: (i) $k$ represents the most recent common ancestor in G of a group of two or more taxa descended from branch $b$ of S; (ii) all taxa descended from $k$ in G are descended from $b$ in S. $|G_b|$ is the number of such nodes. The set $G_b$ represents the set of coalescences of G that have the possibility of occurring on branch $b$ of S. For the root branch $b_{\mathrm{root}}$ of S, we have $|G_{\mathrm{root}}| = |G| - 1$.

We can then characterize the labelings $(\ell_b)_b$ that represent compact histories for (G, S).

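Proposition 6 A labeling $(\ell_b)_b$ of S identifies a compact history h of (G, S) if and only if (i) for all branches $b$ of S other than the root branch, $0 \leq \ell_b \leq |G_b| - \sum_{b' < b} \ell_{b'}$, and (ii) $\ell_{\mathrm{root}} = |G| - 1 - \sum_{b \neq b_{\mathrm{root}}} \ell_b$.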

Proof First, we show that a labeling $(\ell_b)_b$ that represents a compact history satisfies (i) and (ii).

For a subtree $S_b$ of S descended from branch $b$, the sum $\sum_{b' \leq b} \ell_{b'}$ is the total number of coalescent events in $S_b$. By definition of a coalescent history, this quantity is bounded above by the number of internal nodes of the gene tree all of whose descendant taxa in G descend from $S_b$ in S, or $|G_b|$. Removing $\ell_b$ from the sum and noting that $\ell_b \geq 0$ because $\ell_b$ is a count, we obtain (i). For the case in which $b$ is the root branch of S, the total number of internal nodes of G all of whose descendant taxa in G descend from $S_{\mathrm{root}}$ in S is exactly |G| − 1, so that the inequality $\ell_b \leq |G_b| - \sum_{b' < b} \ell_{b'}$ becomes an equality, and we obtain (ii).

We must now show that any labeling $(\ell_b)_b$ that satisfies (i) and (ii) represents a compact history. It suffices to demonstrate that at least one coalescent history h lies in the equivalence class represented by $(\ell_b)_b$. By postorder traversal of S, proceed through the internal branches of S, for each branch b assigning certain nodes k of G the value h(k) = b in the following manner. (1) If $\ell_b = 0$, continue to the next branch of S. (2) If $\ell_b > 0$, by postorder traversal of G, proceed through the internal nodes k of G all of whose taxa are descended from b in S. (3) Assign the value h(k) = b to the first unassigned node of G encountered that either has no internal node descendants in G or that already has all its descendant internal nodes in G assigned values of h. (4) Continue following (3) until $\ell_b$ nodes k of G have been assigned h(k) = b.

That this construction produces a coalescent history h can be seen as follows. Because $(\ell_b)_b$ satisfies (i) by assumption, for each non-root branch b of S, Steps (1)–(4) always find $\ell_b$ internal nodes of G to which the label b can be assigned: because of the postorder traversal of S, the number of unassigned internal nodes of G descended from b is initially $|G_b| - \sum_{b' < b} \ell_{b'}$, and $\ell_b$ is no more than this quantity by (i). Condition (ii) guarantees that all |G| − 1 internal nodes k of G are assigned a value of h(k), with those unassigned when $b_{\mathrm{root}}$ is reached being assigned $h(k) = b_{\mathrm{root}}$. Step (2), which restricts attention to nodes of G all of whose taxa descend from b, guarantees that condition (i) of the definition of a coalescent history is respected by h, and Step (3), which assigns a node only after all its descendant internal nodes, guarantees that h respects condition (ii) of the definition of coalescent histories. □
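The construction in the proof can be sketched in Python for the matching case G = S = t. This is a minimal illustration under assumptions: trees are nested tuples with distinct string leaf labels, each internal node stands for the branch above it, `labels` maps non-root internal nodes to their integer labels (missing entries are read as 0), and the names `internal_postorder` and `history_from_labeling` are hypothetical. A single postorder pass per branch suffices, because any unassigned node met during the pass already has all of its internal descendants assigned.

```python
def internal_postorder(node):
    """Yield the internal nodes of a nested-tuple tree in postorder."""
    if isinstance(node, tuple):
        yield from internal_postorder(node[0])
        yield from internal_postorder(node[1])
        yield node


def history_from_labeling(t, labels):
    """Build one coalescent history h (dict node -> branch) from a labeling."""
    n_internal = sum(1 for _ in internal_postorder(t))
    h = {}
    for b in internal_postorder(t):  # branches of S, descendants first
        if b is t:
            want = n_internal - len(h)  # condition (ii): root takes the rest
        else:
            want = labels.get(b, 0)  # Step (1) when this is 0
        for k in internal_postorder(b):  # Step (2): nodes of G below b
            if want == 0:
                break
            if k in h:
                continue
            # Step (3): assign only if all internal children are assigned.
            if all(c in h for c in k if isinstance(c, tuple)):
                h[k] = b
                want -= 1  # Step (4): repeat until the label is exhausted
        if want != 0:
            raise ValueError("labeling violates condition (i) of Proposition 6")
    return h


if __name__ == "__main__":
    cd, ef = ("c", "d"), ("e", "f")
    r = (cd, ef)
    t = (("a", "b"), r)
    print(history_from_labeling(t, {cd: 1, r: 1}))
```

In the usage example, one coalescence is placed on the branch above (c, d), one on the branch above ((c, d), (e, f)), and the remaining three are sent to the root branch.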

In Proposition 6, condition (i) indicates that the maximal number of coalescent events that can happen in an internal branch $b$ of the species tree, other than the root, is given by the difference between the number $|G_b|$ of coalescences of the gene tree that could potentially occur on that branch and the number of coalescent events present in the internal branches descended from $b$ in S. Condition (ii) states instead that the number of coalescences above the root of S is the total number of coalescences in G, or |G| − 1, minus the number of coalescences in the branches below the root. When $b$ is a leaf of S, $G_b$ is empty, as no coalescences occur in the branch above a leaf node. Note that although the definitions of coalescent histories and compact histories consider only the internal branches of S, we can extend the labeling in compact histories to include $\ell_b = 0$ for branches $b$ of S immediately ancestral to leaf nodes. Proposition 6 still applies if compact histories are taken to include leaf nodes of S with labels of 0; indeed, by (i), $\ell_b = 0$.

Our main interest is in the case of G = S = t. In this case, the number of internal nodes of G that could potentially coalesce on branch b of S is |Gb| = |tb| − 1, so that we have the following corollary.

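Corollary 7 A labeling $(\ell_b)_b$ of t identifies a compact history h of t if and only if (i) for all internal branches $b$ of t other than the root branch of t, $0 \leq \ell_b \leq |t_b| - 1 - \sum_{b' < b} \ell_{b'}$, and (ii) $\ell_{\mathrm{root}} = |t| - 1 - \sum_{b \neq b_{\mathrm{root}}} \ell_b$.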
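Corollary 7 translates directly into an enumeration of the compact histories of t. The sketch below assumes trees given as nested tuples with distinct string leaf labels and |t| ≥ 2; the helper names `n_leaves`, `labelings`, and `compact_histories_by_labeling` are illustrative only. Each non-root branch receives every label permitted by condition (i), and the root label is then forced by condition (ii).

```python
def n_leaves(t):
    """Number of taxa |t| of a nested-tuple tree."""
    return 1 if not isinstance(t, tuple) else n_leaves(t[0]) + n_leaves(t[1])


def labelings(t):
    """Yield (total, labels) for labelings of the subtree t, including the
    branch above t, that satisfy condition (i) of Corollary 7 everywhere."""
    if not isinstance(t, tuple):
        yield 0, {}  # a leaf branch carries no coalescences
        return
    cap = n_leaves(t) - 1  # |t_b| - 1
    for s1, lab1 in labelings(t[0]):
        for s2, lab2 in labelings(t[1]):
            # 0 <= l_b <= |t_b| - 1 - (sum of labels below b)
            for label in range(cap - s1 - s2 + 1):
                yield s1 + s2 + label, {**lab1, **lab2, t: label}


def compact_histories_by_labeling(t):
    """Yield each compact history of t (|t| >= 2) as a dict branch -> label."""
    n = n_leaves(t)
    for s1, lab1 in labelings(t[0]):
        for s2, lab2 in labelings(t[1]):
            # Condition (ii): the root branch takes all remaining coalescences.
            yield {**lab1, **lab2, t: n - 1 - s1 - s2}


if __name__ == "__main__":
    t = (("a", "b"), (("c", "d"), ("e", "f")))
    print(sum(1 for _ in compact_histories_by_labeling(t)))
```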

Compact coalescent histories are closely related to the population histories of Degnan and Rhodes (2015). A compact coalescent history, like a coalescent history, is defined for a pair consisting of a gene tree topology and a species tree topology. A population history in the sense of Degnan and Rhodes (2015) is an integer labeling of the species tree branches that, like a compact coalescent history, tabulates the numbers of coalescences of a gene tree that occur on those branches. However, a population history is defined only given the species tree, and not all population histories of a species tree can represent possible sets of locations for the coalescences of a specified gene tree on that species tree; the population histories of a species tree are exactly the compact coalescent histories associated with the species tree and its matching gene tree.

3.2. Recursion for the number of compact coalescent histories

For a general pair of trees (G, S), the compact coalescent histories can be enumerated by classifying into equivalence classes the coalescent histories listed by the exhaustive recursive enumeration of Rosenberg (2007). In the case of G = S = t, we can provide a recursion for the number of compact coalescent histories itself.

We consider a concept of extended compact coalescent histories, which differ from compact coalescent histories in that it is possible that some of the gene tree coalescences of t have not yet occurred in t, including on the root branch of species tree t; this extension is useful in case t is a subtree of a larger tree (Fig. 3). Let u be the number of coalescences that occur in t, including on its root branch. Let m be the number of coalescences that occur on the root branch of t. The quantities u and m are constrained, with $0 \leq m \leq u \leq |t| - 1$. For compact coalescent histories, we have u = |t| − 1, as all coalescences of t occur in t, possibly on the root branch.

Figure 3:

Schematic illustration of quantities in the recursion for the number of compact coalescent histories. The labels m, m1, and m2 represent the numbers of coalescences on the root branch of a tree t, the root branch of the left subtree t1, and the root branch of the right subtree t2, respectively. The quantities u, u1, and u2 represent the total numbers of coalescences in the tree, left subtree, and right subtree, respectively, including coalescences on the associated root branches.

Tree t has “left” and “right” subtrees $t_1$ and $t_2$, where we consider these subtrees to include their associated root branches. The quantities $u_1$, $u_2$, $m_1$, $m_2$, corresponding to the numbers of coalescences of the left subtree, the right subtree, the root branch of the left subtree, and the root branch of the right subtree, respectively, satisfy $0 \leq m_1 \leq u_1 \leq |t_1| - 1$ and $0 \leq m_2 \leq u_2 \leq |t_2| - 1$. The total number of coalescences in t is $u = u_1 + u_2 + m$, as each coalescence of t must occur in the left subtree of t, the right subtree of t, or on the root branch of t.

Let $A_{t,u,m}$ be the number of extended compact coalescent histories of t in which u coalescences occur, of which m occur on the root branch. By definition of extended compact coalescent histories, $A_{t,0,0} = 1$ for any tree t, as a tree has a single labeling (zeroes on all internal branches) if u = 0 and m = 0 and no coalescences occur. In addition, $A_{t,u,m} = 0$ when u, m fail to satisfy $0 \leq m \leq u \leq |t| - 1$. Let $B_t$ denote the number of compact coalescent histories for a tree t with |t| taxa, and let $B_{t,m} = A_{t,|t|-1,m}$ be the number among these compact coalescent histories in which m coalescences occur on the root branch.

Theorem 8 The number of compact coalescent histories for a tree t with |t| ≥ 2 taxa satisfies

$$B_t = \sum_{m=1}^{|t|-1} A_{t,|t|-1,m}. \qquad (3)$$

The number of extended compact coalescent histories for a tree t with |t| ≥ 1 taxa satisfies

$$A_{t,u,m} = \sum_{u_1 = \max[0,\, u-m-(|t_2|-1)]}^{\min(|t_1|-1,\, u-m)} \; \sum_{m_1=0}^{u_1} \; \sum_{m_2=0}^{u-m-u_1} A_{t_1,u_1,m_1} \, A_{t_2,u-m-u_1,m_2}. \qquad (4)$$

The base cases of the recursion are $A_{t,0,0} = 1$ for the 1-taxon tree, $A_{t,0,0} = A_{t,1,1} = 1$ for the 2-taxon tree, and $A_{t,u,m} = 0$ when u, m fail to satisfy $0 \leq m \leq u \leq |t| - 1$.

Proof Eq. 3 follows from the fact that Bt,m = At,|t|−1,m, noting that the number of coalescences on the root branch of a tree with |t| ≥ 2 taxa satisfies 1 ≤ m ≤ |t| − 1.

For Eq. 4, we decompose each extended compact coalescent history for t into an extended compact coalescent history for $t_1$, an extended compact coalescent history for $t_2$, and a set of coalescences on the root branch of t. We must consider all assignments of $(u_1, u_2, m_1, m_2)$ that produce an extended compact coalescent history with u total coalescences and m coalescences above the root. For each such assignment, the number of extended compact coalescent histories is $A_{t_1,u_1,m_1} A_{t_2,u_2,m_2}$.

To determine permissible values for $(u_1, u_2)$, recall that the total number of coalescences in $t_1$ and $t_2$ together is $u_1 + u_2 = u - m$, so that $0 \leq u_1, u_2 \leq u - m$. However, $u_1 \leq |t_1| - 1$, as at most $|t_1| - 1$ coalescences occur in $t_1$, and similarly, $u_2 \leq |t_2| - 1$. Hence, if as many coalescences as possible are placed in $t_2$ so that $u_2$ is as large as possible, $u_1$ remains bounded below by $u - m - (|t_2| - 1)$. Once $u_1$ and $u_2 = u - m - u_1$ have been specified, $(m_1, m_2)$ satisfies $0 \leq m_1 \leq u_1$ and $0 \leq m_2 \leq u_2$.

The nontrivial base case At,1,1 = 1 for the 2-taxon tree follows by noting from Corollary 7 that this tree has only a single labeling that represents a compact coalescent history, and that this labeling has u = m = 1. □

Using Theorem 8, we can compute the number of compact coalescent histories for arbitrary trees t by applying Eq. 3, recursively applying Eq. 4 to complete the calculation.
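A direct transcription of Theorem 8 into Python might look as follows. This is a sketch under assumptions rather than the authors' code: trees are nested tuples with string leaves, and the function names `size`, `A`, and `B` simply mirror the symbols $|t|$, $A_{t,u,m}$, and $B_t$. Memoization via `functools.lru_cache` avoids recomputing values of $A_{t,u,m}$ shared between branches of the recursion.

```python
from functools import lru_cache


def size(t):
    """Number of taxa |t| of a nested-tuple tree."""
    return 1 if not isinstance(t, tuple) else size(t[0]) + size(t[1])


@lru_cache(maxsize=None)
def A(t, u, m):
    """A_{t,u,m} of Eq. 4: extended compact histories of t with u
    coalescences in total, m of them on the root branch of t."""
    n = size(t)
    if not (0 <= m <= u <= n - 1):
        return 0
    if n == 1:
        return 1  # 1-taxon tree: only u = m = 0 reaches this point
    t1, t2 = t
    n1, n2 = size(t1), size(t2)
    total = 0
    for u1 in range(max(0, u - m - (n2 - 1)), min(n1 - 1, u - m) + 1):
        u2 = u - m - u1
        for m1 in range(u1 + 1):
            for m2 in range(u2 + 1):
                total += A(t1, u1, m1) * A(t2, u2, m2)
    return total


def B(t):
    """B_t of Eq. 3 for |t| >= 2: sum over m of A_{t,|t|-1,m}."""
    n = size(t)
    return sum(A(t, n - 1, m) for m in range(1, n))


if __name__ == "__main__":
    for tree in [
        (((("a", "b"), "c"), "d"), "e"),          # caterpillar, size 5
        ((("a", "b"), ("c", "d")), "e"),          # size 5, not a bicaterpillar
        (("a", "b"), (("c", "d"), "e")),          # bicaterpillar, size 5
    ]:
        print(B(tree))
```

The three size-5 trees in the usage example give 14, 12, and 10 compact coalescent histories, matching the three size-5 entries of Table 1. The 2-taxon base case of Theorem 8 does not need to be coded separately here, since Eq. 4 applied to two 1-taxon subtrees already returns it.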

3.3. Number of compact coalescent histories for small trees

For small values of n, we use Theorem 8 to exhaustively compute the number of compact histories for representative labelings of the unlabeled topologies with n taxa. Table 1 reports these numbers of compact coalescent histories for each unlabeled topology of size 2 ≤ n ≤ 7, where an unlabeled topology is taken to have a specific but arbitrary labeling. For the tree shapes considered, the number of compact coalescent histories is always less than or equal to the number of coalescent histories, with equality only when the two root subtrees are caterpillar trees. This characterization of the condition for equality of the numbers of compact coalescent histories and coalescent histories is shown to hold for arbitrary tree size in Section 3.4.

Table 1:

Numbers of compact coalescent histories and coalescent histories for small trees.

Size | Unlabeled topology (image file) | Number of coalescent histories | Number of compact coalescent histories
2 | nihms-1643740-t0001.jpg | 1 | 1
3 | nihms-1643740-t0003.jpg | 2 | 2
4 | nihms-1643740-t0005.jpg | 5 | 5
4 | nihms-1643740-t0007.jpg | 4 | 4
5 | nihms-1643740-t0009.jpg | 14 | 14
5 | nihms-1643740-t0011.jpg | 13 | 12
5 | nihms-1643740-t0013.jpg | 10 | 10
6 | nihms-1643740-t0015.jpg | 42 | 42
6 | nihms-1643740-t0017.jpg | 42 | 37
6 | nihms-1643740-t0019.jpg | 37 | 33
6 | nihms-1643740-t0021.jpg | 28 | 28
6 | nihms-1643740-t0023.jpg | 26 | 24
6 | nihms-1643740-t0002.jpg | 25 | 25
7 | nihms-1643740-t0004.jpg | 132 | 132
7 | nihms-1643740-t0006.jpg | 138 | 118
7 | nihms-1643740-t0008.jpg | 130 | 108
7 | nihms-1643740-t0010.jpg | 112 | 98
7 | nihms-1643740-t0012.jpg | 113 | 86
7 | nihms-1643740-t0014.jpg | 106 | 90
7 | nihms-1643740-t0016.jpg | 84 | 84
7 | nihms-1643740-t0018.jpg | 84 | 74
7 | nihms-1643740-t0020.jpg | 74 | 66
7 | nihms-1643740-t0022.jpg | 70 | 70
7 | nihms-1643740-t0024.jpg | 65 | 60

Each unlabeled topology corresponds to a single representative labeled topology t.

From the table, we also observe that the number of compact coalescent histories does not always increase with the number of coalescent histories. The fifth tree shape of size 7 has more coalescent histories than the sixth tree shape of size 7, but the latter has more compact coalescent histories. In Section 4, we will observe this phenomenon on a larger scale, identifying two families of trees of increasing size, F1 and F2, such that the number of coalescent histories grows exponentially faster for trees in F1 than for trees in F2, whereas the growth of the number of compact coalescent histories for trees in F1 is exponentially slower than for trees in F2.

Our calculations suggest a correlation between the number of compact histories and tree balance, with more compact histories occurring for less balanced trees. We can examine this claim using the Colless (1982) index, $i_C(t)$, which measures the degree of imbalance of a tree t, summing over all internal nodes k of t the absolute value of the difference between the sizes $\ell_k$, $r_k$ of the left and right subtrees of k. More precisely, $i_C(t) = s_t \sum_k |r_k - \ell_k|$, where $s_t = 2/[(|t|-1)(|t|-2)]$ is a rescaling factor. The index $i_C(t)$ ranges in the interval [0, 1], assuming values close to 1 for more unbalanced trees and values close to 0 for more balanced trees.
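The rescaled Colless index defined above can be computed in a single traversal. The following minimal sketch assumes nested-tuple trees with at least 3 taxa (the rescaling factor is undefined for smaller trees); the name `colless_index` is an illustrative choice.

```python
def colless_index(t):
    """Rescaled Colless index i_C(t) of a nested-tuple tree with >= 3 taxa."""

    def walk(node):
        # Returns (number of leaves, sum of |r_k - l_k| over internal nodes).
        if not isinstance(node, tuple):
            return 1, 0
        n_left, s_left = walk(node[0])
        n_right, s_right = walk(node[1])
        return n_left + n_right, s_left + s_right + abs(n_right - n_left)

    n, total = walk(t)
    if n < 3:
        raise ValueError("the rescaling factor requires |t| >= 3")
    # s_t = 2 / [(|t| - 1)(|t| - 2)]
    return 2 * total / ((n - 1) * (n - 2))


if __name__ == "__main__":
    print(colless_index((((("a", "b"), "c"), "d"), "e")))      # caterpillar
    print(colless_index((("a", "b"), ("c", "d"))))             # balanced
    print(colless_index((("a", "b"), (("c", "d"), ("e", "f")))))
```

A caterpillar attains the maximum value 1, because its internal nodes contribute 0, 1, …, |t| − 2 to the sum, which totals (|t| − 1)(|t| − 2)/2.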

Fig. 4 plots the number of compact histories against iC(t) for the 98 unlabeled topologies with 10 taxa. Trees with a larger Colless index tend to have more compact histories. The Pearson correlation coefficient is 0.9691.

Figure 4:

The natural logarithm of the number of compact coalescent histories for the 98 tree shapes of size n = 10, plotted against the Colless index of imbalance.

For n ≤ 15, we have identified the tree shapes underlying the labeled topologies with the largest and smallest numbers of compact histories among labeled topologies of size n. These shapes are not necessarily those with the largest and smallest numbers of coalescent histories; for example, in Table 1, the shapes with the fewest compact histories and the fewest coalescent histories differ for n = 6, and the shapes with the most compact histories and the most coalescent histories differ for n = 7.

For each n for 2 ≤ n ≤ 15, caterpillar shapes are seen to have the most compact histories, equal to the (n − 1)th Catalan number, their number of coalescent histories (see Section 4). Tree shapes associated with the fewest compact histories for each small n appear in Fig. 5. These shapes have a recursive structure: the nth tree tn for 2 ≤ n ≤ 15 can be decomposed as

$$t_n = (t_d, t_{n-d}), \tag{5}$$

where d is the power of 2 nearest to n/2. In particular, when n is a power of 2, the observed tree decomposition defines tn to be the completely balanced tree shape. Interestingly, the family of tree shapes (tn)n≥1 obtained by iteratively applying Eq. 5 already appears in the study of gene trees and species trees. As shown by Disanto and Rosenberg (2017), for fixed tree size n, the labeled topologies with shape tn have the largest number of “root ancestral configurations,” and they also have the highest probability under the Yule model of speciation (Harding, 1971; Hammersley and Grimmett, 1974; Degnan and Rosenberg, 2006).

Figure 5:

Tree shapes of size 1 ≤ n ≤ 10 whose labeled topologies have the fewest compact histories among shapes of size n. In each tree with n ≥ 2, the two root subtrees each minimize the number of compact coalescent histories among trees of their size. From left to right, the numbers of compact histories are 1, 1, 2, 4, 10, 24, 60, 144, 396, and 1032. For 11 ≤ n ≤ 15, the shapes with the fewest compact histories continue to follow the recursive decomposition in Eq. 5, with 2796, 7200, 19800, 51600, and 139800 compact coalescent histories for n = 11, 12, 13, 14, and 15, respectively.
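
The counts in the caption can be reproduced with a short sketch. It assumes the construction used later for Figs. 11 and 13: a compact history of a tree is built from compact histories h1, h2 of its two root subtrees by choosing branch labels ℓ1 ∈ [0, m1] and ℓ2 ∈ [0, m2], giving root label m1 + m2 + 1 − ℓ1 − ℓ2. The tree encoding and function names are hypothetical.

```python
def suffix_sums(counts):
    """out[j] = number of (history, branch label) choices passing j coalescences upward."""
    out, running = [0] * len(counts), 0
    for m in range(len(counts) - 1, -1, -1):
        running += counts[m]
        out[m] = running
    return out

def root_label_counts(tree):
    """counts[m] = number of compact histories of `tree` with root-branch label m."""
    if not isinstance(tree, tuple):            # leaf: one history, root label 0
        return [1]
    a = suffix_sums(root_label_counts(tree[0]))
    b = suffix_sums(root_label_counts(tree[1]))
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j + 1] += ai * bj          # (m1 - l1) + (m2 - l2) + 1
    return out

def minimal_tree(n):
    """Shape t_n of Eq. 5: split off d, the power of 2 nearest to n/2."""
    if n == 1:
        return "leaf"
    powers, p = [], 1
    while p < n:
        powers.append(p)
        p *= 2
    # A tie between two powers of 2 yields the same unordered pair (d, n - d).
    d = min(powers, key=lambda q: abs(2 * q - n))
    return (minimal_tree(d), minimal_tree(n - d))

minimal_counts = [sum(root_label_counts(minimal_tree(n))) for n in range(1, 16)]
print(minimal_counts)
```

The sketch only evaluates the shapes given by Eq. 5; it does not search all shapes of size n, so it does not by itself confirm minimality.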

3.4. Trees with the same numbers of compact coalescent histories and coalescent histories

In this section, we characterize labeled topologies of matching gene trees and species trees for which the number of compact coalescent histories equals the number of coalescent histories. Because each compact coalescent history represents an equivalence class of coalescent histories, the number of compact histories is less than or equal to the number of coalescent histories. Wu (2016) showed that each compact coalescent history for a caterpillar labeled topology is associated with a single coalescent history, so that the numbers of compact histories and coalescent histories are equal. A caterpillar tree has only one possible sequence in which the coalescences can occur, so that once the locations of the coalescences are specified by the integer labeling of a compact coalescent history, the particular coalescences associated with the nodes are determined.

Following Rosenberg (2007), a bicaterpillar tree is a tree whose two root subtrees are both caterpillar trees (Fig. 6A). A caterpillar of size n ≥ 2 is trivially a bicaterpillar, with subtrees of size 1 and n−1. In a bicaterpillar, no internal node other than the root has the property that both of its immediate descendant nodes are internal nodes; in a caterpillar, not even the root has this property. Any non-bicaterpillar tree has at least one non-root internal node both of whose immediate descendant nodes are internal nodes.

Figure 6:

Families $\gamma_{p,q}$, $\lambda_n$, and $\beta_n$ of labeled topologies. (A) The bicaterpillar labeled topology $\gamma_{2,3}$. Topology $\gamma_{p,q}$ has $|\gamma_{p,q}| = p + q$ taxa. (B) The lodgepole labeled topology $\lambda_3$, where $|\lambda_n| = 2n + 1$. (C) The completely balanced labeled topology $\beta_3$, where $|\beta_n| = 2^n$.

Theorem 9 In a labeled topology t, the number of compact coalescent histories equals the number of coalescent histories if and only if t is a bicaterpillar.

Proof Consider a bicaterpillar tree t. We must show that each compact history of t is associated with only a single coalescent history. Consider a compact history of t. In that compact history, for each of the two caterpillar root subtrees, the list of integer labels for the nodes in that subtree, including the subtree root, uniquely specifies the locations of the coalescences in that subtree. The remaining coalescences necessarily occur above the root of t. Hence, the list of labels for the nodes of t specifies exactly where all coalescences occur, and only one coalescent history is possible for each compact history.

For the reverse direction, suppose t is not a bicaterpillar. Then there must exist an internal node κ other than the root of t whose immediate descendant nodes κ1 and κ2 are internal nodes. These nodes must each have as a descendant a cherry internal node, an internal node with exactly two leaf descendants. Denote these cherry nodes κ1′ and κ2′, with κ1′ possibly equal to κ1 and κ2′ possibly equal to κ2. Let a and b be the leaves that descend from κ1′, and let c and d be the leaves that descend from κ2′. The compact history in which the label for κ is 1, the label for the root of t is |t| − 2, and all other nodes have label 0 has at least two associated coalescent histories. In particular, it is possible that the single coalescence associated with node κ is (a, b), or that it is (c, d). Hence, we have two coalescent histories associated with a single compact history, and the number of compact histories is strictly less than the number of coalescent histories. □

As noted above, Table 1 illustrates that the number of compact coalescent histories is equal to the number of coalescent histories if and only if t is a bicaterpillar for trees t of size 2 ≤ n ≤ 7.

4. Number of compact coalescent histories for special families of trees

We now study the number of compact histories in three families of labeled topologies. We consider bicaterpillar, lodgepole, and completely balanced labeled topologies.

By $\gamma_{p,q}$, we denote a representative bicaterpillar labeled topology having root subtrees of size p ≥ 1 and q ≥ 1 (Fig. 6A). For fixed n ≥ 2, letting p ≤ q, the bicaterpillar trees have (p, q) = (1, n−1), (2, n−2), … , (⌊n/2⌋, ⌈n/2⌉).

Denote by λn a representative lodgepole labeled topology with n cherries and size |λn| = 2n+1 taxa (Fig. 6B). The shape [λn] satisfies the recursion [λn] = ([λn−1], (•, •)), with [λ0] = •. In other words, [λn] is inductively defined by appending [λn−1] and a tree with two leaves—a cherry—to a common root, beginning with the 1-taxon tree [λ0]. Lodgepole trees have been introduced by Disanto and Rosenberg (2015) as an example of a tree family for which the growth of the number of coalescent histories is faster than exponential in the number of taxa. In particular, the number of coalescent histories of λn grows asymptotically like the double factorial (2n + 1)!!.

Finally, in contrast with maximally unbalanced trees $\gamma_{1,n-1}$, we consider completely balanced trees. We denote by $\beta_n$ a representative completely balanced labeled topology of size $|\beta_n| = 2^n$ taxa (Fig. 6C), with shape defined by $[\beta_0] = \bullet$ and $[\beta_n] = ([\beta_{n-1}], [\beta_{n-1}])$. The number of coalescent histories in the family $\beta_n$ is available only by a recursion (Rosenberg, 2007), and the asymptotic growth of this number is not known.

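Setting $c(\gamma_{p,n-p})$, $c(\lambda_n)$, and $c(\beta_n)$ as the numbers of compact histories for $\gamma_{p,n-p}$, $\lambda_n$, and $\beta_n$, respectively, in Sections 4.1, 4.2, and 4.3 we show that for increasing values of n, the exponential growth of the sequences $c(\gamma_{p,n-p})$, $c(\lambda_n)$, and $c(\beta_n)$ with respect to tree sizes $|\gamma_{p,n-p}|$, $|\lambda_n|$, and $|\beta_n|$ is given by

$$c(\gamma_{p,n-p}) \bowtie (k_\gamma)^{|\gamma_{p,n-p}|}, \text{ with } k_\gamma = 4, \tag{6}$$
$$c(\lambda_n) \bowtie (k_\lambda)^{|\lambda_n|}, \text{ with } k_\lambda = \sqrt{\frac{5\sqrt{5}+11}{2}} \approx 3.3302, \tag{7}$$
$$c(\beta_n) \bowtie (k_\beta)^{|\beta_n|}, \text{ with } 2.855 < k_\beta < 2.858. \tag{8}$$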

A remarkable consequence of Eq. 7 is that although the growth of the number of coalescent histories in the lodgepole family is faster than exponential, the number of compact histories in the family grows “only” exponentially, as determined by Eq. 7. Furthermore, although the number of coalescent histories in the lodgepole family grows much faster than in the caterpillar family (Disanto and Rosenberg, 2015), the growth of the number of compact histories in the lodgepole family is exponentially slower than for caterpillars.

In accord with the cases of the small trees illustrated in Fig. 4, we also observe a trend in the values of the exponential orders $k_\gamma$, $k_\lambda$, and $k_\beta$ and the values of the Colless indices $i_C(\gamma_{1,n-1})$, $i_C(\lambda_n)$, and $i_C(\beta_n)$. For maximally unbalanced and completely balanced trees, we have $i_C(\gamma_{1,n-1}) = 1$ and $i_C(\beta_n) = 0$. For n ≥ 1, $i_C(\lambda_n) = 2/[2n(2n-1)] \times [1 + \sum_{i=2}^{n}(2i-3)] = (n^2-2n+2)/[n(2n-1)]$, from which $i_C(\lambda_n) \to 1/2$ as n → ∞. For large n, among the families we consider, the unbalanced caterpillars have the most compact histories, the completely balanced trees have the fewest, and the lodgepole trees, with an intermediate level of balance, have an intermediate number of compact histories.

4.1. Bicaterpillar trees $\gamma_{p,n-p}$

We showed in Section 3.4 that the number $c(\gamma_{p,n-p})$ of compact coalescent histories for $\gamma_{p,n-p}$ equals the number of coalescent histories of $\gamma_{p,n-p}$. This fact enables the computation of $c(\gamma_{p,n-p})$ and its exponential growth.

Theorem 10 For the bicaterpillar tree γp,n−p, (i) the number of compact coalescent histories satisfies

$$c(\gamma_{p,n-p}) = C_p C_{n-p}, \tag{9}$$

where $C_n = \binom{2n}{n}/(n+1)$ is the nth Catalan number, and (ii) the exponential growth of the number of compact coalescent histories satisfies $c(\gamma_{p,n-p}) \bowtie (k_\gamma)^{|\gamma_{p,n-p}|}$, where $k_\gamma = 4$.

Proof (i) The number of coalescent histories for $\gamma_{p,n-p}$ was shown by Rosenberg (2007, Theorem 3.10) to be $C_p C_{n-p}$. The claim follows from the equivalence of compact histories and coalescent histories for bicaterpillars.

(ii) We compute the exponential growth of the number of compact coalescent histories first for the caterpillar $\gamma_{1,n-1}$. Eq. 9 yields $c(\gamma_{1,n-1}) = C_{n-1}$. From $\binom{2n}{n} \bowtie 4^n$ and $|\gamma_{1,n-1}| = n$, it follows that $c(\gamma_{1,n-1}) \bowtie 4^n$.

Rosenberg (2007, Corollary 3.11) showed that for fixed n ≥ 2, over the range 1 ≤ p ≤ ⌊n/2⌋, the number of coalescent histories for the bicaterpillar $\gamma_{p,n-p}$ (Eq. 9) is greatest when p = 1, and it decreases monotonically from $C_{n-1}$ to $C_{\lfloor n/2\rfloor}C_{\lceil n/2\rceil}$ as p increases from 1 to ⌊n/2⌋. Hence, considering bicaterpillars with n taxa, the Catalan number $C_{n-1}$ is both the largest number of coalescent histories and the largest number of compact histories. Note that because $C_n \bowtie 4^n$, the product $C_{\lfloor n/2\rfloor}C_{\lceil n/2\rceil}$, representing the smallest number of coalescent histories and compact histories possible for a bicaterpillar with n taxa, also satisfies $C_{\lfloor n/2\rfloor}C_{\lceil n/2\rceil} \bowtie 4^{\lfloor n/2\rfloor}4^{\lceil n/2\rceil} = 4^n$. Thus, because the number of compact histories satisfies $c(\gamma_{p,n-p}) \bowtie 4^n$ both for the n-taxon bicaterpillar with the fewest compact histories and for the n-taxon bicaterpillar with the most compact histories, it does so for any n-taxon bicaterpillar, irrespective of the value of p. □

The pattern of more compact histories with greater imbalance, seen in comparing the caterpillar, lodgepole, and completely balanced families, is also observed within bicaterpillars as p changes. With the rescaling factor $s_t = 2/[(n-1)(n-2)]$, the Colless index for $\gamma_{p,n-p}$ is

$$i_C(\gamma_{p,n-p}) = \frac{2\left\{(n-2p) + \left[\sum_{i=2}^{p}(i-2)\right] + \left[\sum_{i=2}^{n-p}(i-2)\right]\right\}}{(n-1)(n-2)} = \frac{2\left[p-\left(\frac{n}{2}+1\right)\right]^2 + \frac{n^2-6n+4}{2}}{(n-1)(n-2)}. \tag{10}$$

For fixed n, this quantity decreases as p increases from 1 to ⌊n/2⌋. At p = 1, it has the maximal value of $i_C(\gamma_{1,n-1}) = 1$. At p = ⌊n/2⌋, it is near 1/2: $i_C(\gamma_{n/2,n/2}) = (n-4)/[2(n-1)]$ for even n and $i_C(\gamma_{(n-1)/2,(n+1)/2}) = (n^2 - 6n + 13)/[2(n-1)(n-2)]$ for odd n.
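
Eqs. 9 and 10 can be cross-checked numerically for small n. The sketch below assumes the subtree-combination rule described for Figs. 11 and 13 (root label m1 + m2 + 1 − ℓ1 − ℓ2) to count compact histories directly, and compares the result with the Catalan product and with the closed form of the Colless index. Tree encoding and function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def suffix_sums(counts):
    out, running = [0] * len(counts), 0
    for m in range(len(counts) - 1, -1, -1):
        running += counts[m]
        out[m] = running
    return out

def root_label_counts(tree):
    """counts[m] = number of compact histories of `tree` with root-branch label m."""
    if not isinstance(tree, tuple):
        return [1]
    a = suffix_sums(root_label_counts(tree[0]))
    b = suffix_sums(root_label_counts(tree[1]))
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j + 1] += ai * bj
    return out

def size_and_imbalance(tree):
    if not isinstance(tree, tuple):
        return 1, 0
    nl, il = size_and_imbalance(tree[0])
    nr, ir = size_and_imbalance(tree[1])
    return nl + nr, il + ir + abs(nr - nl)

def caterpillar(n):
    tree = "leaf"
    for _ in range(n - 1):
        tree = (tree, "leaf")
    return tree

def bicaterpillar(p, q):
    return (caterpillar(p), caterpillar(q))

def catalan(n):
    return comb(2 * n, n) // (n + 1)

def colless_closed_form(n, p):
    # Right-hand side of Eq. 10, with numerator and denominator doubled.
    return Fraction((2 * p - n - 2) ** 2 + n * n - 6 * n + 4, 2 * (n - 1) * (n - 2))

for n in range(3, 13):
    for p in range(1, n // 2 + 1):
        tree = bicaterpillar(p, n - p)
        assert sum(root_label_counts(tree)) == catalan(p) * catalan(n - p)    # Eq. 9
        _, total = size_and_imbalance(tree)
        assert Fraction(2 * total, (n - 1) * (n - 2)) == colless_closed_form(n, p)  # Eq. 10

counts_n10 = [sum(root_label_counts(bicaterpillar(p, 10 - p))) for p in range(1, 6)]
print(counts_n10)
```

For n = 10 the printed counts decrease as p grows, in line with the monotonicity noted in the proof of Theorem 10.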

Fig. 7 plots the logarithm of the number of compact histories (Eq. 9) against the Colless index (Eq. 10) for each p from 1 to ⌊n/2⌋, for n = 10, 20, 30, 40, and 50. Following Eqs. 9 and 10, the values of both $\log c(\gamma_{p,n-p})$ and $i_C(\gamma_{p,n-p})$ decrease as p increases. We can also observe from the figure the relatively constant value in p for $\log c(\gamma_{p,n-p})$ suggested by the fact that $c(\gamma_{p,n-p})$ has exponential order 4 in n irrespective of p. For fixed p, with each increment of 10 in n, the figure illustrates that this constant value increases by a value close to $\log(4^{n+10}/4^n) = 10 \log 4 \approx 13.8629$, the value predicted by the exponential order 4 of $c(\gamma_{p,n-p})$ at fixed p.

Figure 7:

The natural logarithm of the number of compact coalescent histories for bicaterpillar tree shapes $\gamma_{p,n-p}$ (Eq. 9), plotted against the Colless index of imbalance (Eq. 10). For each of five values of n, the size of plotted points increases as p ranges from 1 to ⌊n/2⌋, indicating that bicaterpillars with larger p have smaller Colless indices and fewer compact histories.

4.2. Lodgepole trees λn

In this section, we study in detail the number c(λn) of compact histories of the lodgepole labeled topology λn. We prove Eq. 7, and we derive an explicit formula, Eq. 18, for c(λn).

We say that a compact history h of λn generates a compact history h′ of λn+1 if the restriction of h′ to the subtree λn of λn+1 agrees with h when we ignore the label assigned by h to the root branch of λn. For instance, exactly 6 of the 10 compact histories of λ2 depicted in Fig. 8 are generated by the compact history h of λ1 = (a, (b, c)) that has m(h) = 2 and label 0 for the branch above the cherry (b, c). According to this definition, each compact history h′ of λn+1 is generated by exactly one compact history h of λn.

Figure 8:

The 10 compact histories possible for the lodgepole labeled topology λ2.

To enumerate the compact histories of the lodgepole family, we use a generating tree approach (Barcucci et al., 1999; Banderier et al., 2002). We associate each compact history with a labeled node in a tree that represents all possible choices for producing the compact histories: the generating tree. More precisely, the generating tree of the compact histories of the lodgepole family is characterized by the following properties.

Definition 11 The generating tree of the compact coalescent histories of the lodgepole family (λn)n≥0 is the rooted tree in which (i) the node associated with a compact history h of λn for which m(h) = m has depth n and label (m), and (ii) a node (m′) directly descends from a node (m), written (m) ⇝ (m′), when (m′) is associated with a compact history of λn+1 generated by the compact history of λn corresponding to the node (m).

The first levels of the generating tree appear in Fig. 9B. Nodes correspond to the compact histories of λ0, λ1, and λ2; each of the 10 depth-2 nodes is associated with a compact history of λ2 from Fig. 8. As previously observed, 6 of the 10 compact histories of λ2 are generated by the compact history of λ1 with root label 2. Indeed, in Fig. 9B, 6 nodes at depth 2 descend directly from node (2) at depth 1. Different nodes in the generating tree can share the same label, as different compact histories can have the same label for their root branch (Fig. 8).

Figure 9:

Generation of compact coalescent histories of lodgepole labeled topologies. (A) Generation of compact histories of $\lambda_{n+1}$ from a compact history of $\lambda_n$. Let h be a compact history of $\lambda_n$ with label m = m(h) for its root branch. The compact histories h′ of $\lambda_{n+1}$ generated by h are determined by choosing two parameters: (i) the label, 0 or 1, for the branch above the cherry root subtree of $\lambda_{n+1}$, and (ii) the label ℓ ∈ [0, m] for the branch above the root subtree $\lambda_n$ of $\lambda_{n+1}$. If the label in (i) is chosen to be 0, then the label m(h′) = m + 2 − ℓ of the root branch in $\lambda_{n+1}$ ranges in the interval m(h′) ∈ [2, m + 2]. Similarly, if the label chosen in (i) is 1, then the label m(h′) = m + 1 − ℓ ranges in m(h′) ∈ [1, m + 1]. (B) The first levels of the generating tree (Eq. 11). A node (m) at depth n in the generating tree accounts for a compact history of $\lambda_n$ with root branch labeled by m. The root of the generating tree has label (0), as the lodgepole $\lambda_0$ with 1 taxon has no coalescent events. Nodes descending from a generic node (m) are determined by Eq. 11. The 10 nodes at depth 2 account for the compact histories of $\lambda_2$ of Fig. 8.

For an arbitrary compact history h of λn, the value of m(h) = m provides information about the number of compact histories of λn+1 generated by h, or, equivalently, about the number of nodes at depth n + 1 in the generating tree that descend from the node (m) at depth n associated with h. Moreover, taking the integer m as input, the construction in Fig. 9A determines the value m(h′) for all the compact histories h′ generated by h.

The next result iteratively characterizes the structure of the generating tree.

Proposition 12 The generating tree of the compact coalescent histories of the lodgepole family (λn)n≥0 can be produced iteratively, level by level, by the following rule: (i) the root of the generating tree is labeled by (0), and (ii) each node with label (m) in the generating tree has exactly 2m + 2 descendants, which are labeled by (2),(3), … , (m + 2) and (1),(2), … , (m + 1). In symbols,

$$\begin{cases} (0) & \text{root}; \\ (m) \rightsquigarrow (2),(3),\ldots,(m+2),(1),(2),\ldots,(m+1). \end{cases} \tag{11}$$

Proof According to the construction of compact histories described in Fig. 9A, each compact history h of λn with root label m generates exactly 2m + 2 different compact histories h′ of λn+1: one for each value of m(h′) ∈ {2,3, … , m + 2}, when the node above the cherry root subtree of λn+1 has label 0, and one for each value of m(h′) ∈ {1,2, …, m + 1}, when the node above the cherry root subtree of λn+1 has label 1. In particular, this construction characterizes the nodes of the generating tree that directly descend from an arbitrary node (m): for each integer m ≥ 0, the descendants of each node (m) present in the generating tree are (2),(3), … , (m + 2),(1),(2), … , (m + 1). Setting to (0) the label for the root—the node at depth 0—of the generating tree, the characterization of descendant nodes yields the procedure given in Eq. 11 for iteratively producing the generating tree of the compact histories of the lodgepole family. □

As an example, starting from the root node (0) of the generating tree and applying Eq. 11, we find (0) ⇝ (2), (1), which gives the first level of the tree of Fig. 9B. A second application then gives (2) ⇝ (2), (3), (4), (1), (2), (3) and (1) ⇝ (2), (3), (1), (2), from which we recover the second level of the tree.
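
The worked example above can be automated. The following sketch applies the rule of Eq. 11 level by level, storing for each depth how many nodes carry each label; the function names and the use of a Counter are choices made for this illustration.

```python
from collections import Counter

def children(m):
    """Labels of the 2m + 2 nodes that descend directly from a node (m), as in Eq. 11."""
    return list(range(2, m + 3)) + list(range(1, m + 2))

def next_level(level):
    """`level` maps a label m to the number of nodes (m) at the current depth."""
    nxt = Counter()
    for m, count in level.items():
        for child in children(m):
            nxt[child] += count
    return nxt

levels = [Counter({0: 1})]          # depth 0: the root (0)
for _ in range(4):
    levels.append(next_level(levels[-1]))

level_sizes = [sum(level.values()) for level in levels]
print(children(2))                   # descendants of node (2)
print(dict(levels[2]))               # label multiplicities at depth 2
print(level_sizes)                   # number of nodes at depths 0..4
```

The number of nodes at depth n equals the number of compact histories of the nth lodgepole tree, so the depth-2 count is the 10 histories of Fig. 8.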

To count the number of compact histories of the nth lodgepole tree, we make use of the equivalence between the number of nodes with label (m) produced at depth n in the generating tree determined by Eq. 11 and the number cm,n of compact histories of λn with root branch labeled by m.

Let $L(x,z) = \sum_{n=0}^{\infty}\sum_{m=0}^{|\lambda_n|-1} c_{m,n}x^m z^n$ be the bivariate generating function counting nodes (m) at depth n in the generating tree. Note that for each n ≥ 0, because each compact history has a label of at most $|\lambda_n|-1$ above the root, we have $\sum_{m=0}^{|\lambda_n|-1} c_{m,n} = c(\lambda_n)$. Hence, $L(1,z) = \sum_{n=0}^{\infty} c(\lambda_n)z^n$ is the generating function associated with the sequence $c(\lambda_n)$. A functional equation that characterizes L(1, z) can be determined from the structure of the generating tree described in Proposition 12.

Proposition 13 The generating function $L(1,z) = \sum_{n=0}^{\infty} c(\lambda_n)z^n$ satisfies the functional equation

$$L(1,z) = 1 + zL(1,z)^2 + zL(1,z)^3 \equiv \phi(z, L(1,z)). \tag{12}$$

Proof We first derive an equation for the bivariate generating function L(x, z), which is then used to prove Eq. 12. From Proposition 12, each time that an expression $x^m z^n$ is counted in the generating function L(x, z), written $x^m z^n \in L$ in what follows, the terms $\left(\sum_{j=2}^{m+2}x^j + \sum_{j=1}^{m+1}x^j\right)z^{n+1}$ appear in L(x, z) as well. Summing over all possible $x^m z^n \in L$, we obtain

$$\begin{aligned} L(x,z) &= 1 + \sum_{x^m z^n \in L}\left[\left(\sum_{j=2}^{m+2}x^j + \sum_{j=1}^{m+1}x^j\right)z^{n+1}\right] \\ &= 1 + x^2 z\sum_{x^m z^n \in L}\frac{(1-x^{m+1})z^n}{1-x} + xz\sum_{x^m z^n \in L}\frac{(1-x^{m+1})z^n}{1-x} \\ &= 1 + (x^2 z + xz)\left[\frac{L(1,z) - xL(x,z)}{1-x}\right], \end{aligned} \tag{13}$$

where the $1 = x^0 z^0$ term in Eq. 13 accounts for the root of the generating tree (Eq. 11). The root does not appear in the sum on the right-hand side because it is not descended from any node; in summing over all $x^m z^n \in L$ to produce L(x, z) on the left, no term gives rise to $x^0 z^0$ on the right. Collecting terms yields

$$L(x,z)\left[1 + \frac{x^2 z(1+x)}{1-x}\right] = 1 + L(1,z)\left[\frac{xz(1+x)}{1-x}\right], \tag{14}$$

from which we can derive an equation for L(1, z) by applying the “kernel” method (Banderier et al., 2002).

Take X = X(z) such that

$$1 + \frac{X^2 z(1+X)}{1-X} = 0. \tag{15}$$

By replacing x with X in Eq. 14, the left-hand side cancels, giving

$$0 = 1 + L(1,z)\left(-\frac{1}{X}\right),$$

where we note that, by Eq. 15, $\frac{Xz(1+X)}{1-X} = -\frac{1}{X}$ to produce the right-hand side. We then obtain L(1, z) = X, which together with Eq. 15 yields Eq. 12. □
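
As a sanity check on the functional equation, the coefficients of L(1, z) obtained from the generating tree can be substituted into both sides of Eq. 12 and compared term by term. This is a minimal sketch with hypothetical helper names; the truncation order N is arbitrary.

```python
def lodgepole_counts(n_max):
    """c(lambda_n) for n = 0..n_max, as level sizes of the generating tree of Eq. 11."""
    level, counts = {0: 1}, []
    for _ in range(n_max + 1):
        counts.append(sum(level.values()))
        nxt = {}
        for m, k in level.items():
            for child in list(range(2, m + 3)) + list(range(1, m + 2)):
                nxt[child] = nxt.get(child, 0) + k
        level = nxt
    return counts

def truncated_product(a, b, order):
    """Coefficients of a(z) * b(z) up to z^order."""
    out = [0] * (order + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= order:
                out[i + j] += ai * bj
    return out

N = 12
L = lodgepole_counts(N)                      # left-hand side of Eq. 12
L2 = truncated_product(L, L, N)
L3 = truncated_product(L2, L, N)
# Right-hand side 1 + z L^2 + z L^3: multiplying by z shifts coefficients by one.
rhs = [1] + [L2[k] + L3[k] for k in range(N)]

print(L[:6])
print(rhs == L)
```

Agreement of the two coefficient lists up to order N is consistent with Proposition 13 but is of course not a proof.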

From Eq. 12, it is possible to determine the dominant singularity ρ of L(1, z), and thus, from Eq. 1, the exponential growth of the sequence $c(\lambda_n) \bowtie (1/\rho)^n$. Following Section VII.6.1 of Flajolet and Sedgewick (2009), given m ≥ 1 generating functions $y_1(z), \ldots, y_m(z)$ satisfying a system of m non-linear polynomial equations

$$\begin{cases} y_1 = \phi_1(z, y_1, \ldots, y_m) \\ \quad\vdots \\ y_m = \phi_m(z, y_1, \ldots, y_m), \end{cases} \tag{16}$$

the value ρ of the common dominant singularity of $y_1, \ldots, y_m$ can be determined from the algebraic expressions for $\phi_1, \ldots, \phi_m$ through the “characteristic system” associated with Eq. 16. Eq. 64 in Section VII.6.1 of Flajolet and Sedgewick (2009) enables the calculation of the characteristic system of Eq. 16.

In our case, setting $y_1(z) = L(1, z)$, $\phi_1 = \phi$, and m = 1, the associated characteristic system of Eq. 12 is

$$\begin{cases} \tau = \phi(\rho,\tau) = 1 + \rho\tau^2 + \rho\tau^3 \\ 0 = 1 - \dfrac{\partial\phi(\rho,\tau)}{\partial\tau} = 1 - 2\rho\tau - 3\rho\tau^2, \end{cases} \tag{17}$$

and the following theorem holds.

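Theorem 14 In the lodgepole family $(\lambda_n)_{n\geq 0}$, (i) the exponential growth of the number of compact coalescent histories satisfies $c(\lambda_n) \bowtie (k_\lambda)^{|\lambda_n|}$, where

$$k_\lambda = \sqrt{\frac{5\sqrt{5}+11}{2}} \approx 3.3302,$$

and (ii) when n ≥ 1, the number $c(\lambda_n)$ can be computed as

$$c(\lambda_n) = \frac{1}{n}\sum_{i=0}^{n-1} 2^{i+1}\binom{2n}{i}\binom{n}{i+1}. \tag{18}$$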

Proof (i) By solving Eq. 17 in positive real numbers, we obtain $\rho = (5\sqrt{5}-11)/2$, and $c(\lambda_n) \bowtie (1/\rho)^n$. Because the lodgepole $\lambda_n$ has $|\lambda_n| = 2n + 1$ taxa, the number of compact histories in the lodgepole family grows like $(1/\rho)^{(|\lambda_n|-1)/2}$, or

$$c(\lambda_n) \bowtie \left(\sqrt{1/\rho}\right)^{|\lambda_n|}, \tag{19}$$

with respect to the number of taxa $|\lambda_n|$. Setting $k_\lambda = \sqrt{1/\rho}$, Eq. 19 yields the result.

(ii) The exact formula for c(λn) follows from an application of Lagrange inversion to the functional equation of Proposition 13. The complete derivation of Eq. 18 from Eq. 12 can be found in Deutsch (2000), where a class of lattice paths is shown to be enumerated by a generating function satisfying Eq. 12. □

Note that computing the exponential order 1/ρ of the sequence $c(\lambda_n)$ directly from Eq. 18 is not straightforward, and the value of ρ is indeed not reported by Deutsch (2000). Fig. 10 shows numerical values of $c(\lambda_n)^{1/|\lambda_n|}$ converging to the value of $k_\lambda \approx 3.3302$ that determines the exponential growth of the sequence $c(\lambda_n)$ with respect to tree size $|\lambda_n|$.
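
The quantities in Theorem 14 are easy to evaluate numerically. The sketch below implements Eq. 18, checks that ρ = (5√5 − 11)/2 together with τ = (1 + √5)/2 satisfies the characteristic system of Eq. 17 (the value of τ is obtained here by eliminating ρ from the two equations, which gives τ² − τ − 1 = 0; it is not stated in the text), and evaluates the root of c(λn) for a few n. Function and variable names are hypothetical.

```python
from math import comb, exp, log, sqrt

def lodgepole_formula(n):
    """Number of compact histories of the nth lodgepole tree (Eq. 18 for n >= 1)."""
    if n == 0:
        return 1
    total = sum(2 ** (i + 1) * comb(2 * n, i) * comb(n, i + 1) for i in range(n))
    quotient, remainder = divmod(total, n)
    assert remainder == 0               # Eq. 18 is integer-valued
    return quotient

rho = (5 * sqrt(5) - 11) / 2            # dominant singularity from Theorem 14
tau = (1 + sqrt(5)) / 2                 # companion value solving Eq. 17 (derived here)
residual_1 = tau - (1 + rho * tau ** 2 + rho * tau ** 3)
residual_2 = 1 - 2 * rho * tau - 3 * rho * tau ** 2
k_lambda = sqrt(1 / rho)

def nth_root_value(n):
    """c(lambda_n)^(1/|lambda_n|) with |lambda_n| = 2n + 1, computed via logarithms."""
    return exp(log(lodgepole_formula(n)) / (2 * n + 1))

print([lodgepole_formula(n) for n in range(6)])
print(residual_1, residual_2, k_lambda)
print([round(nth_root_value(n), 4) for n in (10, 100, 1000)])
```

Logarithms are used for the nth root because c(λn) quickly exceeds the floating-point range; the slow approach to kλ reflects subexponential factors that the exponential order ignores.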

Figure 10:

Values of $c(\lambda_n)^{1/|\lambda_n|}$ for 0 ≤ n ≤ 100. The dashed horizontal line has ordinate $k_\lambda$, with $k_\lambda \approx 3.3302$ as in Theorem 14. The integers $c(\lambda_n)$ representing the number of compact coalescent histories for the lodgepole family are computed from Eq. 18. As $c(\lambda_n) \bowtie k_\lambda^{|\lambda_n|}$, for increasing n, the sequence $c(\lambda_n)^{1/|\lambda_n|}$ approaches $k_\lambda$.

4.3. Completely balanced trees βn

This section studies the number c(βn) of compact histories for the completely balanced labeled topology βn. We prove Eq. 8, deriving a recursive procedure for calculating c(βn).

Denote by $c_{m,n}$ the number of compact histories of $\beta_n$ with root branch labeled by m. Consider the family of polynomials $B_n(x) = \sum_{m=0}^{|\beta_n|-1} c_{m,n}x^m$, where each term $x^m$ in $B_n(x)$, written $x^m \in B_n$, accounts for a compact history h of $\beta_n$ with m(h) = m. Note that $B_n(1) = \sum_{m=0}^{|\beta_n|-1} c_{m,n} = c(\beta_n)$.

The next proposition gives a recursive procedure for calculating the polynomial $B_{n+1}(x)$.

Proposition 15 The family of polynomials $B_n(x) = \sum_{m=0}^{|\beta_n|-1} c_{m,n}x^m$ satisfies the recursion

$$B_{n+1}(x) = \frac{x\,[B_n(1) - xB_n(x)]^2}{(1-x)^2}, \tag{20}$$

with B0(x) = 1.

Proof The construction of compact histories described in Fig. 11 translates into algebraic terms, determining the following recurrence for the polynomial Bn+1(x):

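$$B_{n+1}(x) = \sum_{x^{m_1} \in B_n}\ \sum_{x^{m_2} \in B_n}\ \sum_{\ell_1=0}^{m_1}\ \sum_{\ell_2=0}^{m_2} x^{m_1+m_2+1-\ell_1-\ell_2} \tag{21}$$
$$= x\left[\left(\sum_{x^{m_1} \in B_n}\sum_{j=0}^{m_1}x^j\right)\left(\sum_{x^{m_2} \in B_n}\sum_{j=0}^{m_2}x^j\right)\right] = x\left(\sum_{x^m \in B_n}\frac{1-x^{m+1}}{1-x}\right)^2 = \frac{x\,[B_n(1) - xB_n(x)]^2}{(1-x)^2}. \tag{22}$$

In particular, the nested sums in Eq. 21 encode the generation of a generic compact history h, accounted for by the term $x^{m_1+m_2+1-\ell_1-\ell_2} \in B_{n+1}$, by appending to a common root two arbitrary compact histories $h_1$ and $h_2$, accounted for by $x^{m_1} \in B_n$ and $x^{m_2} \in B_n$ (step i of Fig. 11), and then choosing new labels $\ell_1 \in [0, m_1]$ and $\ell_2 \in [0, m_2]$ for the two branches descending from the root of h (step ii). Eq. 22 follows from Eq. 21 by substituting $j = m_i - \ell_i$ in each inner sum and evaluating the resulting geometric sums. □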

Figure 11:

Compact histories of completely balanced labeled topologies. Each compact history h of $\beta_{n+1}$ is uniquely obtained by (i) appending two compact histories $h_1$, $h_2$ of $\beta_n$ to a common root node, and (ii) choosing labels $\ell_1$ and $\ell_2$ for the two branches descending from the root of h. If $m_1 = m(h_1)$ and $m_2 = m(h_2)$ are the labels of the root branches of $h_1$ and $h_2$, respectively, then $\ell_1$ ranges in the interval $\ell_1 \in [0, m_1]$, and $\ell_2$ ranges in the interval $\ell_2 \in [0, m_2]$. Once $\ell_1$, $\ell_2$ have been fixed, the label of the root branch in h is determined by $m(h) = m_1 + m_2 + 1 - \ell_1 - \ell_2$. After step (ii), taxa of $h_1$ and $h_2$ are relabeled to obtain a proper completely balanced labeled topology (capital letters). The labeling is applied such that one set of labels is given to the taxa in $h_1$ and another set to the taxa in $h_2$. Note that even when $h_1 = h_2$ (as in the figure), if $\ell_1 \neq \ell_2$, then switching the values for $\ell_1$ and $\ell_2$ generates a different compact history of $\beta_{n+1}$.

By applying Eq. 20 four times, we obtain $B_1(x) = x$, $B_2(x) = x + 2x^2 + x^3$, $B_3(x) = 16x + 32x^2 + 40x^3 + 32x^4 + 17x^5 + 6x^6 + x^7$, and $B_4(x) = 20736x + 41472x^2 + 57600x^3 + 64512x^4 + 60160x^5 + 47616x^6 + 32480x^7 + 19200x^8 + 9824x^9 + 4288x^{10} + 1552x^{11} + 448x^{12} + 97x^{13} + 14x^{14} + x^{15}$. For example, the term $448x^{12} \in B_4$ indicates that $\beta_4$ has exactly 448 compact histories with root branch labeled by m = 12. Using these calculations, we find that the first entries of the sequence $c(\beta_n) = B_n(1)$ are $c(\beta_n)$ = 1, 1, 4, 144, and 360000 for n = 0, 1, 2, 3, and 4, respectively. The sequence $c(\beta_n)$ grows exponentially as specified by the following theorem.
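
The polynomials above can be regenerated mechanically. In the sketch below, a polynomial is stored as a coefficient list, and Eq. 20 is applied by using the identity (Bn(1) − xBn(x))/(1 − x) = Σm cm,n (1 + x + … + x^m), whose coefficients are suffix sums of the cm,n. The function name and list representation are choices made for this illustration.

```python
def next_balanced(b):
    """Apply Eq. 20 once: b[m] = c_{m,n}  ->  coefficients of B_{n+1}(x)."""
    # a[j] = sum of b[m] over m >= j, i.e. coefficients of (B_n(1) - x B_n(x)) / (1 - x)
    a, running = [0] * len(b), 0
    for m in range(len(b) - 1, -1, -1):
        running += b[m]
        a[m] = running
    out = [0] * (2 * len(a))
    for i, ai in enumerate(a):
        for j, aj in enumerate(a):
            out[i + j + 1] += ai * aj        # square, then multiply by x
    return out

B = [[1]]                                     # B_0(x) = 1
for _ in range(4):
    B.append(next_balanced(B[-1]))

for n, poly in enumerate(B):
    print(n, sum(poly), poly if n < 4 else poly[:4])
```

Avoiding the explicit division by (1 − x)² keeps the computation in exact integer arithmetic.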

Theorem 16 In the completely balanced family $(\beta_n)_{n\geq 0}$, (i) the exponential growth of the number of compact coalescent histories satisfies $c(\beta_n) \bowtie (k_\beta)^{|\beta_n|}$, where

$$k_\beta = \exp\left[\sum_{j=0}^{\infty} 2^{-j}\log(1+e_j)\right],$$

and $e_n = B_n'(1)/B_n(1) = \left(\sum_{m=0}^{|\beta_n|-1} m\,c_{m,n}\right)/c(\beta_n)$ is the expected value of m(h) in a compact coalescent history h chosen uniformly at random from the set of compact coalescent histories of $\beta_n$. Furthermore, (ii) $k_\beta$ satisfies the bounds $2.855 < k_\beta < 2.858$.

Proof (i) From Eq. 20, we have

$$c(\beta_{n+1}) = B_{n+1}(1) = [B_n(1) + B_n'(1)]^2 = (1+e_n)^2 B_n(1)^2 = (1+e_n)^2 c(\beta_n)^2, \tag{23}$$

where the second equality follows from a double application of l'Hôpital's rule to the limit

$$B_{n+1}(1) = \lim_{x\to 1}\frac{x\,[B_n(1) - xB_n(x)]^2}{(1-x)^2}.$$

Setting yn = log c(βn), from Eq. 23, we obtain

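$$y_{n+1} = 2y_n + 2\log(1+e_n).$$

This linear recursion has solution

$$y_n = 2^n y_0 + \sum_{j=0}^{n-1} 2^{n-j}\log(1+e_j) = 2^n\left[y_0 + \sum_{j=0}^{\infty} 2^{-j}\log(1+e_j)\right] - \sum_{j=n}^{\infty} 2^{n-j}\log(1+e_j), \tag{24}$$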

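where the two series in Eq. 24 both converge, because they have positive terms and are bounded from above. More precisely, for j ≥ 0, the inequality $1 + e_j \leq 1 + (2^j - 1) = 2^j$ holds, from the interpretation of $e_j$ as the mean value of m(h) for a random compact history h of a balanced tree with $2^j$ taxa. Hence, for each fixed n ≥ 0, the following upper bound for the series in Eq. 24 holds:

$$\sum_{j=n}^{\infty} 2^{n-j}\log(1+e_j) \leq 2^n\sum_{j=n}^{\infty}\frac{\log(2^j)}{2^j} < 2^n\sum_{j=n}^{\infty}\frac{j}{2^j} = 2^n\left[\sum_{j=0}^{\infty}\frac{j}{2^j} - \sum_{j=0}^{n-1}\frac{j}{2^j}\right] = 2^n\left[2 - \frac{2^n - n - 1}{2^{n-1}}\right] = 2n + 2. \tag{25}$$

The second equality in Eq. 25 uses the fact that $\sum_{j=0}^{k} j/2^j = 2^{-k}(2^{k+1} - k - 2)$ for each integer k ≥ −1, which follows by setting x = 1 into

$$\sum_{j=0}^{k}\left(\frac{x}{2}\right)^j j = \frac{x}{2}\sum_{j=0}^{k}\left(\frac{x}{2}\right)^{j-1} j = \frac{x}{2}\left[2\,\frac{\partial}{\partial x}\sum_{j=0}^{k}\left(\frac{x}{2}\right)^j\right] = x\,\frac{\partial}{\partial x}\left[\frac{1-(x/2)^{k+1}}{1-x/2}\right] = \frac{x\,[2^{k+1} - 2(k+1)x^k + kx^{k+1}]}{2^k(2-x)^2}.$$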

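Switching back to $c(\beta_n) = e^{y_n}$, and noting that $c(\beta_0) = 1$ and $|\beta_n| = 2^n$, Eq. 24 yields

$$c(\beta_n) = \frac{\left[c(\beta_0)\exp\left(\sum_{j=0}^{\infty} 2^{-j}\log(1+e_j)\right)\right]^{2^n}}{\exp\left[\sum_{j=n}^{\infty} 2^{n-j}\log(1+e_j)\right]} = \frac{1}{a_n}(k_\beta)^{|\beta_n|},$$

where $a_n = \exp\left[\sum_{j=n}^{\infty} 2^{n-j}\log(1+e_j)\right]$ and $k_\beta$ is the quantity defined in the statement of the theorem.

Note that the sequence $a_n$ is bounded by polynomial functions of the size $|\beta_n| = 2^n$. Indeed, from the trivial inequality $e_j \geq 1$ (with j ≥ 1) and from Eq. 25, for n ≥ 1 we have

$$4 = e^{2\log 2} \leq a_n < e^{2n+2} = e^{2n}e^2 = e^{2\log_2|\beta_n|}e^2 = e^{2\log|\beta_n|/\log 2}\,e^2 = |\beta_n|^{2/\log 2}\,e^2.$$

Hence, the exponential growth of the sequence $c(\beta_n)$ is determined by $c(\beta_n) \bowtie (k_\beta)^{|\beta_n|}$ as claimed.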

(ii) The value of the constant kβ can be bounded by using the first terms of the sequence en. For n ≤ 14, we perform the exact computation of the values of en using the recursion of Proposition 15 for the polynomials Bn(x). By using the exact sequence of rational numbers (en)0≤n≤14, symbolic calculations give

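$$2.8550 < \exp\left[\sum_{j=0}^{14} 2^{-j}\log(1+e_j)\right] < 2.8551.$$

From this inequality, we obtain the bounds for $k_\beta$ claimed in the statement of the theorem:

$$2.8550 < \exp\left[\sum_{j=0}^{14} 2^{-j}\log(1+e_j)\right] < k_\beta = \exp\left[\sum_{j=0}^{14} 2^{-j}\log(1+e_j)\right]\exp\left[\sum_{j=15}^{\infty} 2^{-j}\log(1+e_j)\right] < 2.8551\exp\left[\sum_{j=15}^{\infty} 2^{-j}\log(1+e_j)\right] < 2.8551\,e^{1/1024} < 2.8580,$$

where we have used the inequality $\sum_{j=15}^{\infty} 2^{-j}\log(1+e_j) = \left[\sum_{j=15}^{\infty} 2^{15-j}\log(1+e_j)\right]/2^{15} < 32/2^{15} = 1/1024$ derived directly from Eq. 25. □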

Because the lower and upper bounds for $k_\beta$ given in Theorem 16 are quite close to each other, we can take their mean as an approximation for $k_\beta$, that is, $k_\beta \approx (2.855 + 2.858)/2 = 2.8565$ (Fig. 12). Finally, we observe that by computing more terms of the sequence $e_n$ (here we have used the terms with n ≤ 14), the same approach used in the proof of the theorem can be applied to obtain even more accurate estimates of $k_\beta$. In particular, because $e_n$ increases slowly with respect to the number of taxa $|\beta_n| = 2^n$ (the values of $e_n$ are 0, 1, and 2 for n = 0, 1, and 2, respectively, and they are approximated by 3.1667, 4.6033, 6.4180, 8.7404, 11.7342, 15.6085, 20.6332, 27.1578, 35.6357, 46.6559, 60.9835, and 79.6133 for n = 3, 4, …, 14), the calculation of a few more terms of the sequence $e_n$ can lead to tighter bounds for $k_\beta$.
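
The bounding argument of Theorem 16(ii) can be sketched in a few lines. The code below computes e_j exactly from the polynomials of Proposition 15, accumulates the partial sum of 2^(−j) log(1 + e_j), and bounds the tail by Eq. 25. To keep the pure-Python polynomial products fast, it stops at j = 10 rather than the j = 14 used in the proof, so the resulting bounds are wider than those of the theorem; this cutoff and all names are assumptions of the example.

```python
from math import exp, log

def next_balanced(b):
    """Apply Eq. 20 once to the coefficient list b of B_n(x)."""
    a, running = [0] * len(b), 0
    for m in range(len(b) - 1, -1, -1):
        running += b[m]
        a[m] = running
    out = [0] * (2 * len(a))
    for i, ai in enumerate(a):
        for j, aj in enumerate(a):
            out[i + j + 1] += ai * aj
    return out

def k_beta_bounds(j_max):
    """Lower and upper bounds on k_beta from the terms e_0, ..., e_{j_max}."""
    b, partial, e_values = [1], 0.0, []
    for j in range(j_max + 1):
        total = sum(b)                                   # B_j(1) = c(beta_j)
        weighted = sum(m * c for m, c in enumerate(b))   # B_j'(1)
        e_j = weighted / total                           # exact big integers, float quotient
        e_values.append(e_j)
        partial += log(1 + e_j) / 2 ** j
        if j < j_max:
            b = next_balanced(b)
    # Eq. 25 with n = j_max + 1 bounds the remaining terms of the series.
    tail = (2 * (j_max + 1) + 2) / 2 ** (j_max + 1)
    return exp(partial), exp(partial + tail), e_values

lower, upper, e_values = k_beta_bounds(10)
print([round(e, 4) for e in e_values])
print(lower, upper)
```

Raising `j_max` tightens both bounds, at the cost of squaring polynomials whose degree doubles at each step.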

Figure 12:

Values of $c(\beta_n)^{1/|\beta_n|}$ for 0 ≤ n ≤ 14. The dashed horizontal line has ordinate 2.8565, the mean of the lower and upper bounds found for the exponential order $k_\beta$ of the number of compact histories of completely balanced trees with respect to the number of taxa (Eq. 8). The integers $c(\beta_n)$ are computed as $c(\beta_n) = B_n(1)$, that is, by setting x = 1 in the polynomials $B_n(x)$ obtained recursively from Proposition 15. The last few points are very closely approximated by the horizontal line. As $c(\beta_n) \bowtie k_\beta^{|\beta_n|}$, for increasing n, the sequence $c(\beta_n)^{1/|\beta_n|}$ approaches $k_\beta$.

5. Mean number of compact coalescent histories

In Section 4, we found that the sequence of the number of compact histories can have different exponential orders for different tree families, as seen in the values of kγ = 4 (Eq. 6), kλ ≈ 3.3302 (Eq. 7), and kβ ≈ 2.8565 (Eq. 8) for the bicaterpillar, lodgepole, and balanced families, respectively. Motivated by these observations, we now study the exponential growth of the mean number En[c] of compact histories of a labeled topology selected uniformly at random in the set of labeled topologies Tn. By using generating functions, we show that the mean grows like

$$E_n[c] \bowtie 3.375^n, \tag{26}$$

where the asymptotic constant 3.375 is close to the mean (kγ + kλ + kβ)/3 ≈ 3.3955.

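We start our proof of Eq. 26 by considering all possible labeled topologies of size n, where $c_{m,n}$ now denotes the total number of compact histories with root branch labeled by m. Define $c_n = \sum_{m=0}^{n-1} c_{m,n}$ to be the total number of compact histories of all trees of size n. Let

$$F(x,z) = \sum_{n=1}^{\infty}\sum_{m=0}^{n-1}\frac{c_{m,n}x^m z^n}{n!} = z + \frac{x}{2}z^2 + \left(\frac{x}{2} + \frac{x^2}{2}\right)z^3 + \left(\frac{9x}{8} + \frac{5x^2}{4} + \frac{5x^3}{8}\right)z^4 + \left(\frac{7x}{2} + 4x^2 + \frac{21x^3}{8} + \frac{7x^4}{8}\right)z^5 + \ldots$$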

be the bivariate exponential generating function associated with integers cm,n, where each term xmzn/n! in F(x, z), written xmzn/n! ∈ F, accounts for a compact history h of size n with m(h) = m. The function F(x, z) is characterized by the following proposition.

Proposition 17 The generating function $F(x,z) = \sum_{n=1}^{\infty}\sum_{m=0}^{n-1}\frac{c_{m,n}x^m z^n}{n!}$ satisfies the functional equation

$$F(x,z) = z + \frac{x\,[F(1,z) - xF(x,z)]^2}{2(1-x)^2}. \tag{27}$$

Proof Observe that we can write F(x, z) as the sum

$$F(x,z) = z + \frac{1}{2}\sum_{\frac{x^{m_1}z^{n_1}}{n_1!} \in F}\ \sum_{\frac{x^{m_2}z^{n_2}}{n_2!} \in F}\ \sum_{\ell_1=0}^{m_1}\ \sum_{\ell_2=0}^{m_2}\frac{x^{m_1+m_2+1-\ell_1-\ell_2}\,z^{n_1+n_2}}{(n_1+n_2)!}\binom{n_1+n_2}{n_1}. \tag{28}$$

The initial z in Eq. 28 accounts for the term $x^0 z^1/1!$ in F associated with the compact history of the one-taxon tree. Mirroring the construction of compact histories of size larger than one from smaller compact histories described in Fig. 13, the nested sums and the factor 1/2 in Eq. 28 take into account the presence in F of exactly $\binom{n_1+n_2}{n_1}/2$ copies of the term $x^{m_1+m_2+1-\ell_1-\ell_2}z^{n_1+n_2}/(n_1+n_2)!$, for each fixed pair $(x^{m_1}z^{n_1}/n_1!,\ x^{m_2}z^{n_2}/n_2!) \in F \times F$ and for each choice of $(\ell_1, \ell_2) \in [0, m_1] \times [0, m_2]$. Specifically, each copy of $x^{m_1+m_2+1-\ell_1-\ell_2}z^{n_1+n_2}/(n_1+n_2)!$ is associated with a compact history h that, as in Fig. 13A, can be decomposed into the compact histories $h_1$ and $h_2$ associated with terms $x^{m_1}z^{n_1}/n_1!$ and $x^{m_2}z^{n_2}/n_2!$, and in which the two branches descending from the root are labeled by $\ell_1$ and $\ell_2$, respectively.
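
Reading off coefficients in Eq. 28 gives a direct recursion for the totals cm,n, which the sketch below uses to recover the first terms of F(x, z) and the mean number of compact histories. The number of labeled topologies of size n is taken to be the standard count (2n − 3)!!, a fact not stated in this section; function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def suffix_sums(counts):
    out, running = [0] * len(counts), 0
    for m in range(len(counts) - 1, -1, -1):
        running += counts[m]
        out[m] = running
    return out

def total_counts(n_max):
    """c[n][m] = total number of compact histories with root label m over all
    labeled topologies of size n (coefficient extraction from Eq. 28)."""
    c = {1: [1]}
    for n in range(2, n_max + 1):
        poly = [0] * n
        for n1 in range(1, n):
            a, b = suffix_sums(c[n1]), suffix_sums(c[n - n1])
            ways = comb(n, n1)                 # relabelings of the two subtrees
            for i, ai in enumerate(a):
                for j, bj in enumerate(b):
                    poly[i + j + 1] += ways * ai * bj
        c[n] = [v // 2 for v in poly]          # each history is generated exactly twice
    return c

def num_labeled_topologies(n):
    """(2n - 3)!! for n >= 2, and 1 for n = 1 (standard count, assumed here)."""
    result = 1
    for k in range(3, 2 * n - 2, 2):
        result *= k
    return result

def mean_compact_histories(c, n):
    return Fraction(sum(c[n]), num_labeled_topologies(n))

c = total_counts(30)
print(c[4], c[5])
print([float(mean_compact_histories(c, n)) ** (1 / n) for n in (10, 20, 30)])
```

Dividing c[n][m] by n! gives the coefficients shown in the expansion of F(x, z); the printed nth roots of the mean can be compared informally with the exponential order in Eq. 26.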

Figure 13:

Generation of compact histories from compact histories of smaller trees. (A) Generation of a compact history for a tree from compact histories for its two root subtrees. (B) Generating the same compact history twice when $h_1 = h_2$ and $\ell_1 = \ell_2$. (C,D) Generating the same compact history twice when $h_1 = h_2$ and $\ell_1 \neq \ell_2$. Each compact history $h$ of size $|h| > 1$ is obtained as in (A) by (i) appending to a common root node a pair $(h_1, h_2)$ of compact histories, and (ii) choosing labels $\ell_1, \ell_2$ for the two branches descending from the root of $h$. If $m_1 = m(h_1)$ and $m_2 = m(h_2)$ are the labels of the root branches of $h_1$ and $h_2$, respectively, then $\ell_1$ ranges in the interval $[0, m_1]$, and $\ell_2$ ranges in $[0, m_2]$. The label of the root branch in $h$ is thus $m(h) = m_1 + m_2 + 1 - \ell_1 - \ell_2$, which provides the exponent assigned to variable $x$ in Eq. 28. After step (ii), taxa of $h_1$ and $h_2$ are relabeled to obtain a proper labeled topology underlying $h$. As in Section 2.1, we impose without loss of generality a linear order ≺ for the labels of the taxa of a tree. For the relabeling procedure, we choose $|h_1|$ elements among the $|h| = |h_1| + |h_2|$ new labels possible for the taxa of $h$, where we are using $|h|$, $|h_1|$, and $|h_2|$ here to indicate the number of taxa in the trees underlying $h$, $h_1$, and $h_2$, respectively. There are $\binom{|h|}{|h_1|}$ different choices, producing the binomial coefficient in Eq. 28. The elements chosen relabel $h_1$, and the remaining elements relabel $h_2$. With respect to the order ≺, the $i$th label of $h_1$ is assigned the $i$th label selected. Similarly, the $i$th label of $h_2$ is assigned the $i$th label that was not selected. This construction generates each compact history exactly twice. For this reason, the factor 1/2 appears in Eq. 28 before the summations. More precisely, if the pair $(h_1, h_2)$ considered in step (i) of the procedure has $h_1 \neq h_2$, then each resulting compact history has a copy when we take the pair $(h_2, h_1)$. If $h_1 = h_2$, and we take $\ell_1 = \ell_2$ in step (ii), then the $\binom{|h|}{|h_1|}$ relabelings generate each compact history twice, as can be seen in (B) by switching the labels assigned to $h_1$ and $h_2$. Finally, if $h_1 = h_2$ and we set $\ell_1 \neq \ell_2$ in step (ii), then each compact history generated has an equivalent one obtained as in (C) and (D) by switching both the values of $\ell_1$ and $\ell_2$ and the labels assigned to $h_1$ and $h_2$.

From Eq. 28, algebraic manipulations give

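$$F(x,z) = z + \frac{1}{2}\, x \left[\left(\sum_{\frac{x^{m_1} z^{n_1}}{n_1!} \in F} \frac{z^{n_1}}{n_1!} \sum_{j=0}^{m_1} x^j\right) \left(\sum_{\frac{x^{m_2} z^{n_2}}{n_2!} \in F} \frac{z^{n_2}}{n_2!} \sum_{j=0}^{m_2} x^j\right)\right] = z + \frac{x\,[F(1,z) - xF(x,z)]^2}{2(1-x)^2},$$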

where the last equality uses

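$$\sum_{\frac{x^m z^n}{n!} \in F} \frac{z^n}{n!} \sum_{j=0}^{m} x^j = \sum_{\frac{x^m z^n}{n!} \in F} \frac{z^n (1 - x^{m+1})}{n!\,(1-x)} = \frac{1}{1-x} \left(\sum_{\frac{x^m z^n}{n!} \in F} \frac{z^n}{n!} - x \sum_{\frac{x^m z^n}{n!} \in F} \frac{x^m z^n}{n!}\right) = \frac{F(1,z) - xF(x,z)}{1-x}.$$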
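The factored form of Eq. 28 lends itself to a direct computation of the coefficients $c_{m,n}/n!$. The following Python sketch is an illustration under stated assumptions, not code from the paper: the function names `suffix_sums` and `bivariate_coefficients` are hypothetical, and the recursion used is the coefficient-wise reading of $F = z + (x/2)\,G^2$, where $G$ denotes the inner sum $\sum z^n/n! \sum_{j=0}^{m} x^j$ taken over the terms of $F$.

```python
from fractions import Fraction

def suffix_sums(poly):
    """Given poly[m] = c_{m,n}/n!, return g with g[j] = sum of poly[m] for m >= j.

    g[j] is the coefficient of x^j in sum_m poly[m] * (1 + x + ... + x^m).
    """
    out, running = [], Fraction(0)
    for coeff in reversed(poly):
        running += coeff
        out.append(running)
    return out[::-1]

def bivariate_coefficients(n_max):
    """Return F with F[n][m] = c_{m,n}/n! for 1 <= n <= n_max (n_max >= 1)."""
    F = {1: [Fraction(1)]}              # one-taxon tree: the term x^0 z^1 / 1!
    G = {1: suffix_sums(F[1])}
    for n in range(2, n_max + 1):
        poly = [Fraction(0)] * n        # powers x^0 .. x^(n-1)
        for n1 in range(1, n):          # ordered pairs of root subtree sizes
            for i, a in enumerate(G[n1]):
                for j, b in enumerate(G[n - n1]):
                    # factor x/2 from Eq. 28: shift by one power of x, halve
                    poly[i + j + 1] += a * b / 2
        F[n] = poly
        G[n] = suffix_sums(poly)
    return F

table = bivariate_coefficients(5)
for n in range(1, 6):
    print(n, [str(c) for c in table[n]])
```

Under these assumptions, the rows for $n = 4$ and $n = 5$ should reproduce the bracketed polynomials in the expansion of $F(x,z)$, and summing a row gives the coefficient $c_n/n!$ of the univariate function $F(1,z)$.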

Setting

$$f \equiv F(1,z) = \sum_{n=1}^{\infty} \sum_{m=0}^{n-1} c_{m,n} \frac{z^n}{n!} = \sum_{n=1}^{\infty} c_n \frac{z^n}{n!}, \tag{29}$$

the equation for F(x, z) given in Proposition 17 yields the next result.

Proposition 18 The exponential generating function $f \equiv F(1,z) = \sum_{n=1}^{\infty} c_n z^n/n!$ of the total number of compact coalescent histories of all trees of size $n$ satisfies the equation

$$f = z - (27/2) z^2 - 4f^2 - 4f^3 + 18zf \equiv \psi(z, f). \tag{30}$$

Proof From Eq. 27, we derive Eq. 30 by applying the “quadratic” method. As described by Flajolet and Sedgewick (2009, Section VII.8.2), this method can be used for solving functional equations of the form

$$[g_1 F(x,z) + g_2]^2 = g_3, \tag{31}$$

where the functions gj = gj(x, z, f) are given explicitly, and both F(x, z) and f = f(z) are unknown generating functions. Rearranging terms and completing the square, Eq. 27 can be rewritten as

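$$\left[\frac{x^{3/2} F(x,z)}{\sqrt{2}\,(1-x)} - \frac{1 - 2x + x^2 + x^2 f}{\sqrt{2}\,(1-x)\,x^{3/2}}\right]^2 = \left[\frac{1 - 2x + x^2 + x^2 f}{\sqrt{2}\,(1-x)\,x^{3/2}}\right]^2 - \frac{x f^2}{2(1-x)^2} - z. \tag{32}$$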

This equation has the form given in Eq. 31 when we set

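$$(g_1, g_2, g_3) = \left(\frac{x^{3/2}}{\sqrt{2}\,(1-x)},\; -\frac{1 - 2x + x^2 + x^2 f}{\sqrt{2}\,(1-x)\,x^{3/2}},\; \left[\frac{1 - 2x + x^2 + x^2 f}{\sqrt{2}\,(1-x)\,x^{3/2}}\right]^2 - \frac{x f^2}{2(1-x)^2} - z\right).$$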

Following the quadratic method, suppose there exists a substitution $x = X = X(z)$ for which the expression inside the square on the left-hand side of Eq. 32, $g_1(X, z, f)F(X, z) + g_2(X, z, f)$, vanishes. This substitution makes the right-hand side of Eq. 32 vanish as well and, because of the square on the left-hand side of the equation, it also makes the derivative of the right-hand side with respect to $x$ vanish. Note that because $f$ is a function of $z$ only, neither the substitution $x = X$ nor the derivative of $g_3$ with respect to $x$ affects $f$. We thus have a system of two equations,

$$\begin{cases} g_3(X, z, f) = 0 \\[4pt] \dfrac{\partial g_3(X, z, f)}{\partial x} = 0, \end{cases} \tag{33}$$

which implicitly determines the two unknown functions X and f. The derivative produces

$$\frac{\partial g_3(X, z, f)}{\partial x} = \frac{-3 + 4X - X^2 - 2X^2 f}{2X^4}.$$
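The derivative can be checked by hand. The following short verification is a sketch added for illustration, not a step taken from the original proof. Using $1 - 2x + x^2 = (1-x)^2$ and a difference of squares, $g_3$ simplifies before differentiation:

$$g_3 = \frac{\left[(1-x)^2 + x^2 f\right]^2 - x^4 f^2}{2(1-x)^2 x^3} - z = \frac{(1-x)^2 + 2x^2 f}{2x^3} - z = \frac{1}{2}\left(x^{-3} - 2x^{-2} + (1 + 2f)\,x^{-1}\right) - z,$$

$$\frac{\partial g_3}{\partial x} = \frac{1}{2}\left(-3x^{-4} + 4x^{-3} - (1 + 2f)\,x^{-2}\right) = \frac{-3 + 4x - x^2 - 2x^2 f}{2x^4}.$$

The simplified form of $g_3$ also makes the condition $g_3 = 0$ explicit: it reads $z = [(1-x)^2 + 2x^2 f]/(2x^3)$.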

Solving Eq. 33 for $f$ and $z$ yields $f = -\frac{(X-1)(X-3)}{2X^2}$ and $z = \frac{X-1}{X^3}$, from which we eliminate $X$ to obtain $f = z - (27/2)z^2 - 4f^2 - 4f^3 + 18zf$, as claimed. □
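The elimination of $X$ can be verified without carrying it out symbolically. The Python sketch below is an illustrative check, with hypothetical function names. It uses the parametrization $z = (X-1)/X^3$, $f = -(X-1)(X-3)/(2X^2)$ that follows from Eq. 33 and evaluates the residual $f - \psi(z, f)$ of Eq. 30 in exact rational arithmetic. After multiplying by $X^6$, the residual is a polynomial in $X$ of degree at most 6, so vanishing at more than six distinct nonzero points implies that it vanishes identically.

```python
from fractions import Fraction

def psi(z, f):
    """Right-hand side of Eq. 30."""
    return z - Fraction(27, 2) * z**2 - 4 * f**2 - 4 * f**3 + 18 * z * f

def parametrization(X):
    """Return (z, f) as rational functions of X, obtained from Eq. 33."""
    X = Fraction(X)
    z = (X - 1) / X**3
    f = -(X - 1) * (X - 3) / (2 * X**2)
    return z, f

def residual(X):
    """f - psi(z, f) along the parametrized curve; zero if Eq. 30 holds."""
    z, f = parametrization(X)
    return f - psi(z, f)

# 14 distinct nonzero rational sample points: 1/7, 2/7, ..., 2
sample_points = [Fraction(k, 7) for k in range(1, 15)]
residuals = [residual(X) for X in sample_points]
print(all(r == 0 for r in residuals))
```

The choice of sample points is arbitrary; any seven or more distinct nonzero rationals would serve the same purpose.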

Identifying $f$ with its power series expansion (Eq. 29), we observe that the terms of $f$ of order at most $i \geq 2$ that appear on the left-hand side of Eq. 30 can be determined from the terms of $f$ of order at most $i - 1$ present on the right-hand side of Eq. 30. For example, setting $i = 3$ and writing $c_n^* \equiv c_n/n!$, Eq. 30 gives

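$$\begin{aligned} \left(\mathbf{c_1^* z + c_2^* z^2 + c_3^* z^3} + c_4^* z^4 + \cdots\right) = {} & z - (27/2) z^2 - 4\left(\mathbf{c_1^* z + c_2^* z^2} + c_3^* z^3 + c_4^* z^4 + \cdots\right)^2 \\ & - 4\left(\mathbf{c_1^* z + c_2^* z^2} + c_3^* z^3 + c_4^* z^4 + \cdots\right)^3 + 18z\left(\mathbf{c_1^* z + c_2^* z^2} + c_3^* z^3 + c_4^* z^4 + \cdots\right), \end{aligned}$$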

where the terms of the right-hand side given in bold are the terms of the expansion of $f$ that affect the computation of the terms in bold on the left-hand side. In other words, Eq. 30 can be used for recursively computing the coefficients $[z^n]f = c_n/n!$ of the generating function $f$. Denoting by $p^{(i)}$ the polynomial obtained from a polynomial $p(z)$ by deleting terms of order larger than $i$ in $z$, the polynomial $f_i$ recursively defined by

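$$\begin{cases} f_0 = 0 \\ f_1 = z \\ f_i = \left[z - (27/2) z^2 - 4 f_{i-1}^2 - 4 f_{i-2}^3 + 18 z f_{i-1}\right]^{(i)}, & i \geq 2, \end{cases} \tag{34}$$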

gives the expansion of $f$ up to the term of order $i$. For instance, for $i = 2$ and $i = 3$ we have $f_2 = z + z^2/2$ and $f_3 = z + z^2/2 + z^3$, respectively.
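The iteration of Eq. 34 is straightforward to carry out with exact rational arithmetic. The Python sketch below is one possible implementation, offered as an illustration rather than as the authors' code. Polynomials in $z$ are stored as coefficient lists, and the helper names `poly_mul` and `expand_f` are hypothetical.

```python
from fractions import Fraction

def poly_mul(p, q, order):
    """Multiply coefficient lists p and q (index = power of z), dropping terms above `order`."""
    out = [Fraction(0)] * (order + 1)
    for i, a in enumerate(p):
        if a == 0:
            continue
        for j, b in enumerate(q):
            if i + j > order:
                break
            out[i + j] += a * b
    return out

def expand_f(order):
    """Return [a_0, ..., a_order] with a_n = [z^n] f, for order >= 1, by iterating Eq. 34."""
    z = [Fraction(0), Fraction(1)]
    f_prev2 = [Fraction(0)]                  # f_0 = 0
    f_prev1 = [Fraction(0), Fraction(1)]     # f_1 = z
    for i in range(2, order + 1):
        square = poly_mul(f_prev1, f_prev1, i)                        # f_{i-1}^2, truncated
        cube = poly_mul(poly_mul(f_prev2, f_prev2, i), f_prev2, i)    # f_{i-2}^3, truncated
        z_times_f = poly_mul(z, f_prev1, i)                           # z * f_{i-1}, truncated
        f_i = [-4 * s - 4 * c + 18 * t for s, c, t in zip(square, cube, z_times_f)]
        f_i[1] += 1                    # the term z
        f_i[2] -= Fraction(27, 2)      # the term -(27/2) z^2
        f_prev2, f_prev1 = f_prev1, f_i
    return f_prev1

coefficients = expand_f(10)
print([str(c) for c in coefficients[1:]])
```

Truncating every product at order $i$ keeps the cost of each step quadratic in $i$; an alternative would be to compare coefficients of $z^n$ in Eq. 30 directly, which avoids recomputing the low-order terms at every iteration.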

Increasing the value of i, from the polynomials fi we obtain the expansion

$$f = z + z^2/2 + z^3 + 3z^4 + 11z^5 + (91/2)z^6 + 204z^7 + 969z^8 + 4807z^9 + (49335/2)z^{10} + \cdots, \tag{35}$$

in which the coefficients $c_n/n!$ grow like

$$\frac{c_n}{n!} \bowtie (1/\rho)^n, \tag{36}$$

with ρ corresponding to the dominant singularity of f. From the calculation of the value of ρ, the following theorem determines the exponential growth of the mean number of compact histories in a labeled topology of size n selected uniformly at random.

Theorem 19 The exponential growth of the mean number of compact coalescent histories in a labeled topology of size $n$ selected uniformly at random satisfies $E_n[c] \bowtie 3.375^n$.

Proof We proceed as in Section VII.6.1 of Flajolet and Sedgewick (2009), calculating the value of ρ (Eq. 36) as the positive solution of the characteristic system associated with the functional equation (Eq. 30) satisfied by f:

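$$\begin{cases} \tau = \psi(\rho, \tau) = \rho - (27/2)\rho^2 - 4\tau^2 - 4\tau^3 + 18\rho\tau \\[4pt] 0 = 1 - \dfrac{\partial \psi(\rho, \tau)}{\partial \tau} = 1 + 8\tau + 12\tau^2 - 18\rho. \end{cases} \tag{37}$$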

This characteristic system is obtained from Eq. 64 of Flajolet and Sedgewick (2009), interpreting our Eq. 30 as their Eq. 61. By solving Eq. 37 in positive real numbers, we obtain $\rho = 4/27$, with $1/\rho = 27/4 = 6.75$.

The mean number of compact histories in a labeled topology of size n selected uniformly at random can be computed as En[c] = cn/|Tn|, with |Tn| as in Proposition 1. From Eqs. 36 and 2, the mean En[c] grows like

$$E_n[c] = \frac{c_n/n!}{|T_n|/n!} \bowtie \frac{(27/4)^n}{2^n} = (27/8)^n = 3.375^n,$$

as claimed. □
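The arithmetic in this proof can be confirmed exactly. The Python sketch below is an illustrative check with hypothetical function names. It assumes the companion value $\tau = 1/6$, which the text does not state; the code verifies that the pair $(\rho, \tau) = (4/27, 1/6)$ satisfies both equations of Eq. 37, and then forms the ratio of the growth rate $1/\rho$ of $c_n/n!$ to the growth rate 2 of $|T_n|/n!$ used in the last display.

```python
from fractions import Fraction

def psi(rho, tau):
    """psi from Eq. 30, evaluated at (z, f) = (rho, tau)."""
    return rho - Fraction(27, 2) * rho**2 - 4 * tau**2 - 4 * tau**3 + 18 * rho * tau

def one_minus_dpsi_dtau(rho, tau):
    """Second line of Eq. 37: 1 - d(psi)/d(tau)."""
    return 1 + 8 * tau + 12 * tau**2 - 18 * rho

rho = Fraction(4, 27)
tau = Fraction(1, 6)    # assumed companion value, not given in the text

first_residual = tau - psi(rho, tau)             # 0 if tau = psi(rho, tau)
second_residual = one_minus_dpsi_dtau(rho, tau)  # 0 if the derivative condition holds

growth_of_f = 1 / rho                # exponential order of c_n / n!
growth_of_mean = growth_of_f / 2     # divide by the exponential order 2 of |T_n| / n!

print(first_residual, second_residual, growth_of_f, growth_of_mean)
```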

Fig. 14 shows numerical values of $(E_n[c])^{1/n}$ approaching the exponential order 3.375 of the sequence $E_n[c]$.

Figure 14:

Values of $(E_n[c])^{1/n}$ for $1 \leq n \leq 100$. The dashed horizontal line has ordinate 3.375, given by the exponential order of the sequence $E_n[c]$ (Theorem 19). The expectation $E_n[c]$ is calculated as the ratio $c_n/|T_n|$, where $c_n = n!\,([z^n]f)$ is the total number of compact histories of size $n$ and $|T_n|$ is the number of labeled topologies with $n$ taxa (Proposition 1). The $n$th coefficient $[z^n]f$ in the expansion (Eq. 35) is the coefficient of the term of order $n$ in the polynomial $f_{100}$, obtained recursively as in Eq. 34. As $E_n[c] \bowtie 3.375^n$, for increasing $n$, the sequence $(E_n[c])^{1/n}$ approaches 3.375.
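The values plotted in Fig. 14 can be recomputed along the lines described in the caption. The Python sketch below is an illustration under two assumptions that are not taken from this section: it obtains $[z^n]f$ by comparing coefficients of $z^n$ on both sides of Eq. 30 (equivalent to, but not literally, the truncated iteration of Eq. 34), and it takes $|T_n| = (2n-3)!!$, the standard count of rooted binary labeled topologies, as the content of Proposition 1. The function names are hypothetical.

```python
from fractions import Fraction

def series_coefficients(n_max):
    """Return a with a[n] = [z^n] f = c_n / n!, by comparing coefficients in Eq. 30."""
    a = [Fraction(0)] * (n_max + 1)
    pair = [Fraction(0)] * (n_max + 1)    # pair[k] = [z^k] f^2
    a[1] = Fraction(1)
    for n in range(2, n_max + 1):
        pair[n] = sum(a[i] * a[n - i] for i in range(1, n))
        triple = sum(a[i] * pair[n - i] for i in range(1, n - 1))   # [z^n] f^3
        a[n] = -4 * pair[n] - 4 * triple + 18 * a[n - 1]
        if n == 2:
            a[n] -= Fraction(27, 2)       # the term -(27/2) z^2 affects only n = 2
    return a

def mean_compact_histories(n_max):
    """Return {n: E_n[c]} with E_n[c] = c_n / |T_n|, assuming |T_n| = (2n-3)!!."""
    a = series_coefficients(n_max)
    means = {}
    factorial = 1         # n!
    num_topologies = 1    # |T_n|, with |T_1| = |T_2| = 1
    for n in range(1, n_max + 1):
        factorial *= n
        if n >= 2:
            num_topologies *= 2 * n - 3
        means[n] = a[n] * factorial / num_topologies
    return means

means = mean_compact_histories(100)
roots = {n: float(m) ** (1.0 / n) for n, m in means.items()}
for n in (5, 10, 50, 100):
    print(n, roots[n])
```

Exact fractions are used until the final $n$th root so that rounding enters only in the last step.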

6. Discussion

Considering gene trees and species trees with a matching labeled topology G = S = t, we have studied the number of compact histories of labeled topologies t. We have focused on the exponential growth of the number of compact histories, both when t belongs to special tree families of increasing size and when t is a random labeled tree topology of given size drawn under a uniform distribution. We also characterized the set of labeled topologies in which the coalescent histories are the same as the compact coalescent histories.

In Section 4, in addition to the caterpillar trees $\gamma_{1,n-1}$ already studied by Wu (2016), we considered three other tree families: the bicaterpillar trees $\gamma_{p,n-p}$, the lodgepole trees $\lambda_n$, and the completely balanced trees $\beta_n$. For the caterpillar and bicaterpillar trees, the number of compact histories grows like $c(\gamma_{p,n-p}) \bowtie (k_\gamma)^{|\gamma_{p,n-p}|}$, with $k_\gamma = 4$, whereas for the lodgepole trees, it grows exponentially like $c(\lambda_n) \bowtie (k_\lambda)^{|\lambda_n|}$, where $k_\lambda \approx 3.3302$. Notably, although the growth of the number of coalescent histories in the family $\lambda_n$ is faster than exponential (Disanto and Rosenberg, 2015), the number of compact histories grows “only” exponentially, and in fact exponentially slower than in the family $\gamma_{p,n-p}$. In terms of the relative complexity of the two gene tree probability algorithms CompactCH (Wu, 2016) and COAL (Degnan and Salter, 2005), this result demonstrates that when gene trees and species trees have a particular matching labeled topology $t$, the number of compact coalescent histories processed by CompactCH for calculating the gene tree probability can be much smaller than the number of coalescent histories used by COAL for computing the same probability, although it is still exponential in the size of $t$.

The study of the number c(βn) of compact histories in the family of completely balanced trees βn appears to be more difficult. Indeed, whereas for the caterpillar γ1,n−1 and the lodgepole λn, explicit formulas, Eqs. 9 and 18, could be obtained for enumerating compact histories, in the completely balanced case, the exact enumeration proceeds only recursively. However, the bounds given in Eq. 8 determine the numerical value of the exponential order kβ of the sequence c(βn) with a precision of 2 decimal digits, kβ = 2.8565 ± 0.0015. Theoretical results describing the growth of the number of coalescent histories in the family βn are not known. It is of interest to examine if the generating tree and generating function approaches used here for enumerating compact histories could be extended to the framework of coalescent histories.

By comparison of the values of $k_\gamma$, $k_\lambda$, and $k_\beta$, it can be observed that in more unbalanced trees, the number of compact histories tends to be larger. This correlation is supported by the exhaustive calculation of the number of compact histories for unlabeled topologies of small size (Section 3.3) and by the analysis of bicaterpillar trees with different levels of balance (Section 4.1). More generally, our results prove that for different tree families, the growth of the number of compact histories can be exponentially faster or slower than for other families. An average case analysis of the number of compact histories is conducted in Section 5, where it is shown that the expected number of compact histories of a labeled topology of size $n$ selected uniformly at random grows like $3.3750^n$. Interestingly, the constant 3.3750 is not far from the mean $(k_\gamma + k_\lambda + k_\beta)/3 \approx 3.3955$.

Note that because coalescent histories are at least as numerous as compact histories, the value 3.375 provides a lower bound for the exponential order of the sequence of the mean number of coalescent histories of a labeled topology of size n chosen uniformly at random. This lower bound is unlikely to be precise, as sequences of the number of coalescent histories in specific families substantially exceed this value in exponential order. For example, for caterpillar and bicaterpillar families, the agreement of the number of compact histories with the number of coalescent histories gives an exponential order of 4 for sequences of the number of coalescent histories. An exponential order of 4 has also been associated with caterpillar-like families that begin with a seed tree t(0) and for n ≥ 1 sequentially build a family of trees t(n) by appending t(n−1) and a single taxon to a shared root (Rosenberg, 2013; Disanto and Rosenberg, 2016). Moreover, as noted above, the number of coalescent histories for the lodgepole family grows faster than exponentially (Disanto and Rosenberg, 2015).

Many enumerative problems concerning compact histories remain open. For instance, to understand the computational complexity of gene tree probability algorithms, it would be of interest to obtain comparative results relating numbers of compact histories not only to numbers of coalescent histories, but also to enumerations of the ancestral configurations (Wu, 2012; Disanto and Rosenberg, 2017) and “nonequivalent” ancestral configurations (Wu, 2012; Disanto and Rosenberg, 2018) that arise in alternative probability methods. It would also be of interest to have an explicit characterization of those labeled topologies that, for a given number of taxa, possess the largest and smallest numbers of compact histories. Results from Section 3.3 suggest that the maximally asymmetric caterpillar trees might have the largest number of compact histories, whereas for small n, trees with the smallest number appear to follow a recursive decomposition that appears in other settings (Eq. 5).

We have considered compact coalescent histories only for matching gene trees and species trees. For non-matching trees, the characterization in Section 3.4 of cases in which the numbers of compact histories and coalescent histories are equal does not have a natural extension. For caterpillar gene trees and arbitrary species trees, they continue to be equal: because coalescences in a caterpillar gene tree must follow a unique sequence, the only nonzero labels in a compact history must be associated with species tree internal nodes that all lie on a single path in which any two distinct nodes k1, k2 satisfy k1 < k2 or k2 < k1. Proceeding from the “smallest” node in this path to the species tree root, the nonzero labels in the compact history indicate the gene tree coalescences in the specified unique sequence, identifying only one coalescent history. This reasoning of Wu (2016) for matching caterpillar gene trees and species trees applies to caterpillar gene trees with arbitrary species trees as well.

However, the equivalence of coalescent histories and compact coalescent histories seen with caterpillar gene trees and arbitrary species trees does not extend to the other settings in which the equivalence holds for matching trees. A counterexample is provided by the bicaterpillar (and caterpillar) species tree (((((a, b), e), f), c), d), the bicaterpillar gene tree ((((a, b), c), d),(e, f)), and the compact history with label 1 above subtree (((a, b), e), f), label 4 above the species tree root, and label 0 above all other species tree internal nodes. This compact history corresponds to two coalescent histories: one in which the coalescence above subtree (((a, b), e), f) joins (a, b), and another in which it joins (e, f). Hence, the numbers of compact histories and coalescent histories need not agree, both in the case in which the species tree is a caterpillar or bicaterpillar and in the case in which the gene tree is a non-caterpillar bicaterpillar. At the same time, many combinations of a gene tree and a non-matching species tree, neither of which is caterpillar or bicaterpillar, can have the same numbers of compact histories and coalescent histories. In the many cases in which all cherries in the gene tree involve taxa on opposite sides of the species tree (for example, gene tree (((a, b),(c, d)),((e, f),(g, h))) and species tree (((a, c),(e, g)),((b, d),(f, h)))), only one coalescent history exists, only one compact history exists, and the numbers of compact histories and coalescent histories are trivially equal.

We note that in parallel to the introduction of compact coalescent histories by Wu (2016), a related concept of the population histories of a species tree—equivalent to the compact coalescent histories for a species tree and matching gene tree—was defined by Degnan and Rhodes (2015) for analyzing non-matching caterpillar trees. Using population histories, Degnan and Rhodes (2015, Remark 15) demonstrated that given a caterpillar species tree, the number of coalescent histories, and hence the (equivalent) number of compact coalescent histories, is always larger for the matching gene tree than for a non-matching caterpillar gene tree. We have not compared compact histories for distinct gene trees with a fixed species tree, and we defer a deeper analysis of compact histories of non-matching gene trees and species trees for future work.

Acknowledgments

Support was provided by National Institutes of Health grant R01 GM117590 and by a Rita Levi-Montalcini grant to FD from the Ministero dell’Istruzione, dell’Università e della Ricerca.

References

  1. Banderier C, Bousquet-Mélou M, Denise A, Flajolet P, Gardy D, and Gouyou-Beauchamps D (2002). Generating functions for generating trees. Discr. Math. 246, 29–55.
  2. Barcucci E, Del Lungo A, Pergola E, and Pinzani R (1999). ECO: a methodology for the enumeration of combinatorial objects. J. Differ. Equ. Appl. 5, 435–490.
  3. Colless DH (1982). Phylogenetics, the theory and practice of phylogenetic systematics. Syst. Zool. 31, 100–104.
  4. Degnan JH and Rhodes JA (2015). There are no caterpillars in a wicked forest. Theor. Pop. Biol. 105, 17–23.
  5. Degnan JH and Rosenberg NA (2006). Discordance of species trees with their most likely gene trees. PLoS Genet. 2, 762–768.
  6. Degnan JH, Rosenberg NA, and Stadler T (2012). The probability distribution of ranked gene trees on a species tree. Math. Biosci. 235, 45–55.
  7. Degnan JH and Salter LA (2005). Gene tree distributions under the coalescent process. Evolution 59, 24–37.
  8. Deutsch E (2000). Problem 10658. Am. Math. Monthly 107, 368–370.
  9. Disanto F and Rosenberg NA (2015). Coalescent histories for lodgepole species trees. J. Comput. Biol. 22, 918–929.
  10. Disanto F and Rosenberg NA (2016). Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees. IEEE/ACM Trans. Comput. Biol. Bioinf. 13, 913–925.
  11. Disanto F and Rosenberg NA (2017). Enumeration of ancestral configurations for matching gene trees and species trees. J. Comput. Biol. 24, 831–850.
  12. Disanto F and Rosenberg NA (2018). On the number of non-equivalent ancestral configurations for matching gene trees and species trees. Bull. Math. Biol., in press.
  13. Felsenstein J (1978). The number of evolutionary trees. Syst. Zool. 27, 27–33.
  14. Flajolet P and Sedgewick R (2009). Analytic Combinatorics. Cambridge: Cambridge University Press.
  15. Hammersley JM and Grimmett GR (1974). Maximal solutions of the generalized subadditive inequality. In Harding EF and Kendall DG (Eds.), Stochastic Geometry, pp. 270–285. London: Wiley.
  16. Harding EF (1971). The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Prob. 3, 44–77.
  17. Maddison WP (1997). Gene trees in species trees. Syst. Biol. 46, 523–536.
  18. Rosenberg NA (2007). Counting coalescent histories. J. Comput. Biol. 14, 360–377.
  19. Rosenberg NA (2013). Coalescent histories for caterpillar-like families. IEEE/ACM Trans. Comput. Biol. Bioinf. 10, 1253–1262.
  20. Rosenberg NA and Degnan JH (2010). Coalescent histories for discordant gene trees and species trees. Theor. Pop. Biol. 77, 145–151.
  21. Rosenberg NA and Tao R (2008). Discordance of species trees with their most likely gene trees: the case of five taxa. Syst. Biol. 57, 131–140.
  22. Than C and Nakhleh L (2009). Species tree inference by minimizing deep coalescences. PLoS Comp. Biol. 5, e1000501.
  23. Than C, Ruths D, Innan H, and Nakhleh L (2007). Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. J. Comput. Biol. 14, 517–535.
  24. Wu Y (2012). Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66, 763–775.
  25. Wu Y (2016). An algorithm for computing the gene tree probability under the multispecies coalescent and its application in the inference of population tree. Bioinformatics 32, i225–i233.
