Abstract
The Colijn–Plazzotta ranking is a bijective encoding of the unlabeled binary rooted trees with positive integers. We show that the rank f(t) of a tree t is closely related to its height h, the maximal path length from a leaf to the root. We consider the rank $f(T_n)$ of a random n-leaf tree $T_n$ under each of three models: (i) uniformly random unlabeled unordered binary rooted trees, or unlabeled topologies; (ii) uniformly random leaf-labeled binary trees, or labeled topologies under the uniform model; and (iii) random binary search trees, or labeled topologies under the Yule–Harding model. Relying on the close relationship between tree rank and tree height, we obtain results concerning the asymptotic properties of $\log \log f(T_n)$. In particular, we find $E[\log_2 \log f(T_n)] \sim 2\sqrt{\pi n}$ for uniformly random unlabeled ordered binary rooted trees and uniformly random leaf-labeled binary trees, and for a constant $\alpha \approx 4.31107$, $E[\log_2 \log f(T_n)] \sim \alpha \log n$ for leaf-labeled binary trees under the Yule–Harding model. We show that the mean of $f(T_n)$ itself under the three models is largely determined by the rank $c_{n-1}$ of the highest-ranked tree, the caterpillar, obtaining an asymptotic relationship with $\pi_n c_{n-1}$, where $\pi_n$ is a model-specific function of n. The results resolve open problems, providing a new class of results on an encoding useful in mathematical phylogenetics.
Keywords: Colijn-Plazzotta rank, Mathematical phylogenetics, Tree height
Introduction
The Colijn–Plazzotta rank f(t) of a binary rooted tree t is defined recursively as follows (Colijn and Plazzotta 2018): if $\ell(t)$ and r(t) are the left and right subtrees, respectively, arranged in such a way that $f(\ell(t)) \geq f(r(t))$, then
$$f(t) = \frac{f(\ell(t))\,\bigl(f(\ell(t)) - 1\bigr)}{2} + f(r(t)) + 1.$$
The rank 1 is assigned to a tree with a single leaf.
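As an illustration of the recursive definition, the following minimal Python sketch computes the CP rank of a tree. The nested-tuple encoding (None for a leaf, a pair of subtrees for an internal node) and the name `cp_rank` are assumptions made for this example only; the subtree of larger rank is treated as the left subtree, as in the definition.

```python
def cp_rank(tree):
    """CP rank of a tree given as None (leaf) or a pair (subtree, subtree)."""
    if tree is None:
        return 1  # a single leaf has rank 1
    a, b = cp_rank(tree[0]), cp_rank(tree[1])
    # Arrange the subtrees so that the larger rank plays the role of the left subtree.
    left, right = max(a, b), min(a, b)
    return left * (left - 1) // 2 + right + 1

leaf = None
cherry = (leaf, leaf)
caterpillar4 = (((leaf, leaf), leaf), leaf)  # 4-leaf caterpillar
balanced4 = (cherry, cherry)                 # 4-leaf balanced tree
```

Under this encoding, the two 4-leaf trees receive ranks 5 (caterpillar) and 4 (balanced), in line with the informal statement that unbalanced shapes receive larger ranks.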
In the study of evolutionary trees, statistical summaries of trees are often used for characterizing the outcomes of evolutionary models and for statistical inference of the processes that have given rise to the trees (Fischer et al. 2023). Colijn–Plazzotta rank, or CP rank, has been used as a summary of tree shape in empirical scenarios in which trees of biological relationships are unconcerned with leaf labels, such as in examples with trees of sequences from the same pathogenic organism (Colijn and Plazzotta 2018).
Informally, for a fixed number of leaves, the CP rank is lowest for balanced trees and greatest for unbalanced trees. It has therefore been proposed as a measure of tree balance (Fischer et al. 2023; Rosenberg 2021). In a compilation of mathematical results for tree balance indices that capture many different features of rooted trees, Fischer et al. (2023) have listed a set of basic properties that are of interest for any balance index. Among these are the minimal and maximal values of the index across all trees with a fixed number of leaves, and the mean and variance of the index under the two most frequently used probabilistic models in mathematical phylogenetics. One is the uniform model, also sometimes known as the proportional-to-distinguishable-arrangements or PDA model, which assigns equal probability to all binary rooted labeled trees with a fixed number of leaves. The other is the Yule–Harding model, also sometimes known as the equal-rates Markov or ERM model or simply as the Yule model, in which, conditional on the number of leaves, the probability of a binary rooted labeled tree is proportional to the number of sequences of bifurcations that can give rise to the tree. The mathematical properties of balance indices assist in characterizing the way that balance indices relate to one another and how they perform in empirical settings.
The trees of minimal and maximal CP rank for a fixed number of leaves have been characterized (Rosenberg 2021), and indeed the asymptotic CP ranks of these trees in terms of the number of leaves have also been obtained (Doboli et al. 2024; Rosenberg 2021). The mean and variance under the uniform and Yule–Harding models have been listed as open problems (Fischer et al. 2023, p. 243).
We show here that the asymptotic mean and variance under the Yule–Harding model can be obtained by a connection between this model in the phylogenetics setting and the nearly equivalent formulation of random binary search trees in computer science. First, we show that the order of magnitude of the CP rank of a tree is determined by the height of the tree, the greatest distance from the root to a leaf. By connecting the CP rank to tree height and in turn to probabilistic results for the height, we obtain distributional properties of the CP rank under the Yule–Harding model. We also obtain related results on the closely related uniform model on labeled binary rooted trees and the uniform model on unlabeled binary rooted trees.
Tree Height and the Colijn–Plazzotta Rank
We consider all trees to be binary and rooted. The height of a tree is the maximal path length in edges from the root to a leaf. Two special families of binary trees with n leaves play a key role in our analysis: the caterpillars and the pseudocaterpillars (Figure 1). In a caterpillar with n leaves, every non-leaf has at least one leaf child. This condition forces each caterpillar to consist of a chain of internal (i.e., non-leaf) nodes to which a layer of external nodes is added. The pseudocaterpillars (Rosenberg 2007) (or 4-pseudocaterpillars in the terminology of Alimpiev and Rosenberg (2021)) can be constructed as follows for $n \geq 4$: start with a chain of $n - 3$ internal nodes. Give the bottom node in the chain two internal children, and finally, complete the tree by adding a layer of n external nodes. Caterpillars have height $n - 1$, and pseudocaterpillars have height $n - 2$.
Fig. 1.

Caterpillar and pseudocaterpillar trees. (A) Caterpillar tree; with n leaves, the height of the tree is $n - 1$. (B) Pseudocaterpillar tree; with n leaves, the height of the tree is $n - 2$
Among binary rooted trees with a fixed number of leaves, Rosenberg (2021, Corollary 10) found that the tree with the largest CP rank is the caterpillar. The CP rank of the caterpillar tree with n leaves can be computed recursively via a sequence given by Rosenberg (2021, Theorem 9). It is convenient to shift the index of that sequence by 1, so that here we use $c_k$ to denote the CP rank of the caterpillar with height k and $k + 1$ leaves. The sequence $c_k$ begins 1, 2, 3, 5, 12, 68, 2280 starting at $k = 0$, matching OEIS A108225 (OEIS Foundation Inc. 2025) for $k \geq 1$.
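The listed values can be reproduced by applying the CP recursion to a caterpillar, whose larger subtree is the caterpillar of height one less and whose other subtree is a single leaf of rank 1. The sketch below is illustrative only; the function name `caterpillar_ranks` is an assumption of this example.

```python
def caterpillar_ranks(max_height):
    """CP ranks of the caterpillars of heights 0, 1, ..., max_height."""
    ranks = [1]  # height 0: the single leaf
    for _ in range(max_height):
        c = ranks[-1]
        # Left subtree: previous caterpillar (rank c); right subtree: a leaf (rank 1).
        ranks.append(c * (c - 1) // 2 + 1 + 1)
    return ranks
```

Each step roughly squares the previous value and halves it, which is the source of the doubly exponential growth of the caterpillar rank in the height.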
Lemma 1
Let the sequence $c_h$ be defined by $c_0 = 1$ and $c_h = \frac{c_{h-1}(c_{h-1} - 1)}{2} + 2$ for $h \geq 1$. For every tree t of height h, we have
$$c_h \leq f(t) < c_{h+1}.$$
Proof
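The proof proceeds by induction on h. For $h = 0$, the tree consists of a single leaf, and we have $f(t) = 1 = c_0 < c_1 = 2$. Thus, the statement holds in this case, and we can proceed with the induction step.
For a tree t of height $h \geq 1$, suppose $h_\ell$ and $h_r$ are the heights of subtrees $\ell(t)$ and r(t), respectively. From the induction hypothesis for trees of height less than h and the left–right arrangement so that $f(\ell(t)) \geq f(r(t))$, it follows that
$$c_{h_r} \leq f(r(t)) \leq f(\ell(t)) < c_{h_\ell + 1}.$$
The sequence $c_h$ is increasing (Rosenberg 2021, Lemma 8), so that $h_r < h_\ell + 1$, and hence, $h_r \leq h_\ell$.
Because $h = 1 + \max(h_\ell, h_r)$, it follows that $h_\ell = h - 1$. Thus, we have, again by the induction hypothesis,
$$f(t) = \frac{f(\ell(t))\,\bigl(f(\ell(t)) - 1\bigr)}{2} + f(r(t)) + 1 \geq \frac{c_{h-1}(c_{h-1} - 1)}{2} + 2 = c_h,$$
which proves the lower bound. On the other hand, because $f(r(t)) \leq f(\ell(t)) \leq c_h - 1$,
$$f(t) \leq \frac{f(\ell(t))\,\bigl(f(\ell(t)) + 1\bigr)}{2} + 1 \leq \frac{(c_h - 1)\,c_h}{2} + 1 < \frac{c_h(c_h - 1)}{2} + 2 = c_{h+1},$$
proving the upper bound. This completes the induction.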
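The lemma can also be checked numerically for small ranks. The following sketch is an illustration rather than part of the proof: it inverts the CP recursion to recover the ranks of the two subtrees of the tree with a given rank, computes heights rank by rank, and compares them with the caterpillar ranks. It assumes that the lemma's sequence is the caterpillar sequence with recursion $c_h = c_{h-1}(c_{h-1}-1)/2 + 2$; all function names are hypothetical.

```python
from math import isqrt

def unrank(k):
    """Ranks (left, right) of the subtrees of the tree with CP rank k >= 2."""
    m = k - 1                             # m = left*(left-1)/2 + right, 1 <= right <= left
    left = (1 + isqrt(8 * m - 7)) // 2    # largest left with left*(left-1)/2 < m
    right = m - left * (left - 1) // 2
    return left, right

def heights_by_rank(max_rank):
    """height[k] is the height of the tree with CP rank k (index 0 unused)."""
    height = [None, 0]
    for k in range(2, max_rank + 1):
        left, right = unrank(k)
        height.append(1 + max(height[left], height[right]))
    return height

def caterpillar_ranks(count):
    ranks = [1]
    while len(ranks) < count:
        c = ranks[-1]
        ranks.append(c * (c - 1) // 2 + 2)
    return ranks

def lemma_holds(max_rank):
    """Check c_h <= k < c_{h+1} for every rank k up to max_rank."""
    height = heights_by_rank(max_rank)
    c = caterpillar_ranks(max(height[1:]) + 2)
    return all(c[height[k]] <= k < c[height[k] + 1] for k in range(1, max_rank + 1))
```

Because the ranking is a bijection with the positive integers, iterating over ranks visits every tree shape exactly once, so the check covers all trees of rank at most `max_rank`.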
We conclude that the behavior of the height is to a great extent responsible for the behavior of the Colijn–Plazzotta rank of a tree. Indeed, because the CP rank is bijective with the positive integers (Rosenberg 2021, Proposition 2), the lemma implies that as the positive integers are traversed, for each $h \geq 0$, the ranking proceeds through trees with height h, then proceeds to those with height $h + 1$, and so on. We immediately obtain the following corollaries (which are well known, see Harary et al. (1992)).
Corollary 2
For $h \geq 0$, the number of unlabeled binary rooted trees with height at most h is $c_{h+1} - 1$.
Corollary 3
For $h \geq 0$, the number of unlabeled binary rooted trees with height exactly h is $c_{h+1} - c_h$.
The sequence $c_{h+1} - 1$ begins at $h = 0$ with values 1, 2, 4, 11, 67, 2279 (OEIS A006894). The sequence $c_{h+1} - c_h$ begins at $h = 0$ with values 1, 1, 2, 7, 56, 2212 (OEIS A002658).
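These counts can be cross-checked without reference to the CP rank: a tree of height at most h is either a single leaf or an unordered pair of trees of height at most h − 1. The sketch below, with hypothetical function names, compares this direct count with the values obtained from the caterpillar ranks, assuming the caterpillar recursion $c_h = c_{h-1}(c_{h-1}-1)/2 + 2$ with $c_0 = 1$.

```python
def trees_of_height_at_most(max_height):
    """Number of unlabeled binary rooted trees of height <= h, for h = 0..max_height."""
    counts = [1]  # height 0: only the single leaf
    for _ in range(max_height):
        a = counts[-1]
        # Unordered pairs (with repetition) of smaller trees, plus the single leaf.
        counts.append(a * (a + 1) // 2 + 1)
    return counts

def caterpillar_ranks(max_height):
    ranks = [1]
    for _ in range(max_height):
        c = ranks[-1]
        ranks.append(c * (c - 1) // 2 + 2)
    return ranks

at_most = trees_of_height_at_most(5)
exactly = [at_most[0]] + [at_most[h] - at_most[h - 1] for h in range(1, 6)]
c = caterpillar_ranks(6)
```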
According to Rosenberg (2021, Corollary 14), $c_h \sim 2\beta^{2^h}$ for a constant $\beta > 1$ as $h \to \infty$; note that $\beta = \gamma^2$ for the constant $\gamma$ in Rosenberg (2021), owing to the shift by 1 in $c_h$ relative to the indexing in Rosenberg (2021). We immediately obtain the following result.
Corollary 4
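Uniformly over all trees t with height h, we have, with $\beta > 1$ the growth constant of the caterpillar sequence,
$$2^h \log \beta + O(1) \leq \log f(t) \leq 2^{h+1} \log \beta + O(1),$$
and thus, for $h \geq 1$,
$$\log_2 \log f(t) = h + O(1).$$
In other words, the difference $|\log_2 \log f(t) - h|$ is bounded by a universal constant.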
We now analyze the behavior of the CP rank of random trees, which is mainly determined by the height. Indeed, we proceed by making use of extensive probabilistic results available on tree height under different sets of assumptions.
Uniformly Random Unlabeled Binary Trees
Consider an unlabeled binary rooted tree on n leaves. Each node possesses either 0 offspring (leaves) or 2 offspring (internal nodes). Note that binary trees in which each node possesses either 0 or 2 (and not 1) offspring are sometimes termed full binary trees; here, all binary trees are “full” except where specified. A distinction exists between binary trees in which the left–right order of the children matters (ordered binary trees), and those in which the order is irrelevant (unordered binary trees, or unlabeled topologies in the terminology of mathematical phylogenetics, or Otter trees after Otter (1948)).
Let be a uniformly random ordered binary tree on leaves, also called a random Catalan tree because the number of such trees is
where is the n-th Catalan number (Stanley 2015, Exercise 5). Catalan trees, viewed as ordered binary trees with n leaves, in which each node has 0 or 2 offspring, can be placed in bijection with trees with nodes in which the left–right order matters and each node has 0, 2, or 1 offspring. For the bijection, we consider the latter type of tree, treating its nodes as internal nodes, and add descendant leaves so that each node that started with 0 or 1 offspring now has 2 offspring. Catalan trees are an example of a simply generated family of trees, and the random Catalan tree is also a special case of a conditioned Galton–Watson tree, with an offspring distribution whose support is . See, for example, Sedgewick and Flajolet (1996, p. 224) and Drmota (2009, Section 1.2.7). We denote the CP rank of a random Catalan tree by (C for Catalan).
Let be a uniformly random unordered binary tree, a uniformly random Otter tree. The number of such trees can be calculated recursively. The exact value (Wedderburn–Etherington number, OEIS A001190) for the number of such trees on n leaves follows
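$$U_1 = 1, \qquad U_n = \begin{cases} \displaystyle\sum_{j=1}^{(n-1)/2} U_j\, U_{n-j}, & n \geq 3 \text{ odd}, \\[2ex] \displaystyle\sum_{j=1}^{n/2 - 1} U_j\, U_{n-j} + \frac{U_{n/2}\,(U_{n/2} + 1)}{2}, & n \geq 2 \text{ even}. \end{cases} \tag{1}$$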
The asymptotic approximation follows (Harding 1971; Otter 1948)
$$U_n \sim \gamma\, n^{-3/2} \rho^{-n}, \tag{2}$$
where $\gamma \approx 0.3188$ and $\rho \approx 0.4027$. The CP rank of a random Otter tree carries the label O (for Otter).
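The recursive calculation of the number of unordered binary trees can be sketched as follows. The code uses the standard Wedderburn–Etherington recursion, in which a tree with n leaves is split at the root into an unordered pair of subtrees; the function name and the list-based layout are assumptions of this example.

```python
def wedderburn_etherington(n_max):
    """Numbers of unlabeled unordered binary rooted trees with 1..n_max leaves."""
    u = [0, 1]  # u[n] = number of trees with n leaves; u[0] is a placeholder
    for n in range(2, n_max + 1):
        # Root splits into subtrees of unequal sizes j < n - j.
        total = sum(u[j] * u[n - j] for j in range(1, (n - 1) // 2 + 1))
        if n % 2 == 0:
            # Equal sizes: unordered pairs, allowing two copies of the same shape.
            half = u[n // 2]
            total += half * (half + 1) // 2
        u.append(total)
    return u[1:]
```

For n = 8 the recursion gives 23 shapes, which is the denominator of the uniform probability 1/23 for unlabeled unordered trees with 8 leaves.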
To understand Theorem 5, we define a theta random variable as a random variable with distribution function (Devroye 1997)
$$F(x) = \sum_{j=-\infty}^{\infty} \bigl(1 - 2 j^2 x^2\bigr)\, e^{-j^2 x^2}, \qquad x > 0. \tag{3}$$
CP rank is defined for unordered binary trees. To extend the CP rank to ordered binary trees, we compute the CP rank of the unordered binary tree associated with an ordered binary tree.
Theorem 5
-
(i)Let be a uniformly random unlabeled binary tree with n leaves, with CP rank . Then
and
converges in distribution to a theta random variable as defined by (3). - (ii)
Proof
-
(i)The statement on is a consequence of a result of Flajolet and Odlyzko (1982, Theorem B) about the height of : as , and tends in distribution to a theta random variable. By Corollary 4, the difference is (deterministically, thus almost surely) bounded by a universal constant, so that
is ; for any sequence of random trees of increasing size, this quantity goes to 0 (almost sure convergence, and hence, convergence in probability). The statement on the expected value now follows from the linearity of expectation and the statement on convergence in distribution follows from Slutsky’s theorem applied to the convergence in distribution of and the convergence in probability to 0 of . -
(ii)
The statement on follows in the same fashion from the results of Broutin and Flajolet (2008, Theorem 1 and Theorem 5; 2012, Theorem 1 and Theorem 3) on the height of unlabeled unordered binary trees. These state that the height of a random unlabeled unordered binary tree with n leaves satisfies , and that tends in distribution to a theta random variable. We remark here that our notation differs slightly from Broutin and Flajolet (2012): our constant corresponds to the constant denoted in Broutin and Flajolet (2012), and our distribution function F(x) in (3) is in the notation of Broutin and Flajolet.
Uniformly Random Leaf-Labeled Binary Trees
A leaf-labeled binary tree with n leaves is a binary tree in which the leaves are bijectively labeled from 1 to n, and in which each internal node has two children. The children are unordered. Such trees are also called labeled topologies or cladograms.
We consider a uniformly random cladogram . The number of such trees is
$$(2n-3)!! = \frac{(2n-2)!}{2^{n-1}\,(n-1)!}, \tag{4}$$
all of which are equally likely under this model of randomness (OEIS A001147). The CP rank of a random cladogram is denoted by (L for labeled).
A model of uniformly random cladograms is a special case of more general models on the cladograms, such as Ford’s alpha-splitting model (Ford 2005, 2006) and Aldous’s beta-splitting model (Aldous 1996, 2001). In particular, Aldous (1996, Proposition 4, $\beta = -3/2$ case) showed that the expected height $H_n$ of a random cladogram satisfies
$$E(H_n) \sim 2\sqrt{\pi n}.$$
It is worth pointing out that this result (including the constant $2\sqrt{\pi}$) is the same as for uniformly random unlabeled ordered binary trees (compare to Theorem 5i). This is no coincidence: for every unlabeled ordered binary tree on n leaves, there are n! possibilities to label the leaves and turn it into a leaf-labeled ordered binary tree. Likewise, precisely $2^{n-1}$ possibilities turn a labeled unordered binary tree on n leaves into a labeled ordered binary tree (by switching the order of the children at the $n - 1$ internal nodes). For this reason, the distribution of the height and any other parameters that do not depend on labels or order is the same for three uniform models: unlabeled ordered, labeled unordered, and labeled ordered binary trees (Disanto et al. 2022, Section 3.1). In particular, the following result is equivalent to part (i) of Theorem 5.
Theorem 6
Let be a uniformly random leaf-labeled binary tree with n leaves, with CP rank . Then
and
converges to a theta distribution.
Aldous’s beta-splitting model for random binary trees has a shape parameter $\beta \in [-2, \infty]$, encompassing a limiting unbalanced model ($\beta = -2$), a limiting balanced model ($\beta = \infty$), the Yule model ($\beta = 0$), and the uniform model in Theorem 6 ($\beta = -3/2$). Generally, Aldous (1996, Proposition 4) proved the following results on the height:
For , the ratio tends in probability and in expectation to a constant . There is no explicit expression for this constant, but numerical values can be determined from an implicit equation given by Aldous (1996, Proposition 4). To mention some examples, , and we obtain , , and from the implicit equation (note that Aldous (1996) only gives two digits each). The case corresponds to the Yule model (see Section 5 below for more information). For , all internal nodes split their subtrees (almost) precisely in half: the difference of the subtree sizes is at most 1.
For , . Aldous’s proposition did not report a result for with , but this inequality follows quickly from Aldous’s results reported in the proposition for related quantities. Recently, Aldous and Pittel (2025, Theorem 1.5) showed that with probability approaching 1 with increasing n, where and .
For , , and has a non-degenerate limit distribution.
These results on tree height for cladograms under the beta-splitting model directly impact the Colijn–Plazzotta rank. For example, for , we have for Aldous’s beta-splitting tree with n leaves
Yule–Harding Trees, Random Binary Search Trees
Among the probability distributions that could be placed on the leaf-labeled binary trees with n leaves, perhaps the most frequently considered, along with the uniform distribution of Section 4, is the $\beta = 0$ case of the beta-splitting model. This model corresponds to the random binary search trees, which are identical to Yule or Yule–Harding trees in phylogenetics (Fuchs 2025), except for the convention that random binary search trees are typically indexed by the number of internal nodes and Yule–Harding trees are indexed by the number of leaves. We index trees by the number of leaves, considering random binary search trees in which all internal nodes have two children so that the total number of internal nodes is $n - 1$ when the number of leaves is n.
To be precise, we start with a standard random binary search tree on (internal) nodes and attach a layer of n external nodes, i.e., we give a second child to all (internal) nodes having one child, and give two children to all leaves. The random CP rank of a tree under this model is denoted by (S for search tree).
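For these trees, the height $H_n$ satisfies (Devroye 1986, Theorem 5.1)
$$\frac{H_n}{\log n} \to \alpha \quad \text{in probability},$$
where $\alpha \approx 4.31107$ is the unique solution in $[2, \infty)$ of the equation
$$\alpha \log\!\left(\frac{2e}{\alpha}\right) = 1.$$
Setting
$$h_n = \alpha \log n - \frac{3}{2 \log(\alpha/2)} \log \log n,$$
Reed (2003) and Drmota (2003) showed that $H_n - h_n$ is tight, i.e.,
$$\lim_{M \to \infty} \limsup_{n \to \infty} P\bigl(|H_n - h_n| > M\bigr) = 0. \tag{5}$$
One way to see this result is as follows: Reed (2003, Theorem 1) states that
$$E(H_n) = h_n + O(1)$$
and
$$\mathrm{Var}(H_n) = O(1),$$
from which tightness follows by a standard application of the Chebyshev inequality. Alternatively, one can use Lemmas 8 and 10 of Reed (2003), which provide explicit tail bounds.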
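The height constant for random binary search trees is commonly characterized (Devroye 1986) as the root $c \geq 2$ of $c \log(2e/c) = 1$. Assuming that form of the equation, the following sketch approximates the root by bisection; the function name, bracket, and tolerance are choices made for this example.

```python
import math

def bst_height_constant(tol=1e-12):
    """Root c >= 2 of c*log(2e/c) = 1, found by bisection on [2, 10]."""
    def g(c):
        return c * math.log(2 * math.e / c) - 1

    lo, hi = 2.0, 10.0  # g(2) = 1 > 0 and g(10) < 0; g is decreasing for c > 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = bst_height_constant()
```

The restriction to $c \geq 2$ matters because the same equation has a second, smaller root below 1, which is not the height constant.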
Theorem 7
Let be a random leaf-labeled binary tree with n leaves following the Yule–Harding distribution, with CP rank . Then
and
Proof
The proof is similar to Theorem 5. By Corollary 4, the difference between and the height is bounded, so
goes to 0 (almost surely, thus also in probability). The second part of the result follows immediately via Slutsky’s theorem from the fact that ; the first part follows from the fact that as (Devroye 1986).
Theorem 8
Let be a random leaf-labeled binary tree with n leaves following the Yule–Harding distribution, with CP rank . Then
is a tight sequence of random variables.
Proof
By Corollary 4, there exists an absolute positive constant K such that . Thus,
implies
or
This means that
By (5), this expression goes to 0 if we take and then , showing that the sequence is indeed tight.
Mean and Variance of the Colijn–Plazzotta Rank
Sections 3–5 focus on properties of the distribution of under various models of randomness; in this section, we focus on the distribution of the random CP rank itself. In particular, we study the first-order asymptotics of the mean and variance of the Colijn–Plazzotta rank under the models of randomness from Sections 3–5, investigating , , , and . As pointed out in Section 4, the models of uniformly random unlabeled ordered binary trees (Catalan trees) and uniformly random labeled unordered binary trees are equivalent for our purposes, so that the distributions of and are the same.
We give a general theorem on the mean and variance of the Colijn–Plazzotta rank applicable to all random tree models specifying a certain condition. We then obtain first-order asymptotics for the means and variances of , , and as simple corollaries. The desired means and variances are determined mainly by the extreme cases for Colijn–Plazzotta ranks.
Lemma 9
(i) Among all unlabeled binary rooted trees with n leaves, $n \geq 1$, the Colijn–Plazzotta rank is maximized by the caterpillar. (ii) Among all unlabeled binary rooted trees with n leaves and height $n - 2$ or less, $n \geq 4$, the Colijn–Plazzotta rank is maximized by the pseudocaterpillar.
Proof
(i) This result was proven in Corollary 20 of Rosenberg (2021).
(ii) This result follows by induction and Lemma 1. For $n = 4$, the pseudocaterpillar is the only tree with height at most $n - 2 = 2$. Suppose for induction that for all k, $4 \leq k < n$, the pseudocaterpillar has the maximal Colijn–Plazzotta rank among trees with k leaves and height at most $k - 2$. Among trees t with n leaves and height at most $n - 2$, by definition of the Colijn–Plazzotta rank, the rank f(t) is maximized by choosing its left subtree $\ell(t)$ to have $f(\ell(t))$ as large as possible. The left subtree has at most $n - 1$ leaves and height at most $n - 3$, so that the inductive hypothesis applies: $\ell(t)$ is the pseudocaterpillar with $n - 1$ leaves, the right subtree r(t) is a single leaf, and t is the pseudocaterpillar with n leaves.
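For the following theorem, we recall Rosenberg’s (2021) sequence $c_h$ for the maximal Colijn–Plazzotta rank among trees with height h and $h + 1$ leaves: $c_0 = 1$, and
$$c_h = \frac{c_{h-1}(c_{h-1} - 1)}{2} + 2, \qquad h \geq 1. \tag{6}$$
Equivalently, $c_h$ is the Colijn–Plazzotta rank of a caterpillar of height h. Recall that $c_0 = 1$, $c_1 = 2$, $c_2 = 3$, and $c_3 = 5$.
We also let $d_h$ be the corresponding rank of a pseudocaterpillar of height h. Then $d_2 = 4$, and
$$d_h = \frac{d_{h-1}(d_{h-1} - 1)}{2} + 2, \qquad h \geq 3. \tag{7}$$
The sequences $c_h$ and $d_h$ obey identical recursions, only with different starting points. Sequence $d_h$ begins with $d_2 = 4$, $d_3 = 8$, $d_4 = 30$, and $d_5 = 437$.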
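Because the caterpillar and pseudocaterpillar ranks follow the same recursion from different starting values, one routine can generate both. The sketch below, with hypothetical names, also forms the ratio of the two ranks at a common number of leaves n (pseudocaterpillar of height n − 2 against caterpillar of height n − 1), which shrinks quickly as n grows.

```python
from fractions import Fraction

def iterate_rank(start, steps):
    """Apply x -> x*(x-1)/2 + 2 repeatedly, returning start and all iterates."""
    values = [start]
    for _ in range(steps):
        x = values[-1]
        values.append(x * (x - 1) // 2 + 2)
    return values

caterpillar = iterate_rank(1, 7)        # heights 0..7, i.e. 1..8 leaves
pseudocaterpillar = iterate_rank(4, 4)  # heights 2..6, i.e. 4..8 leaves

def rank_ratio(n):
    """Pseudocaterpillar rank divided by caterpillar rank, both with n leaves (4 <= n <= 8)."""
    return Fraction(pseudocaterpillar[n - 4], caterpillar[n - 1])
```

For example, the ratio is 4/5 at n = 4 and already below 1/25 at n = 8.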
Theorem 10
For a given probability model for random binary rooted trees with n leaves, let
and let be the Colijn–Plazzotta rank of . If and
| 8 |
then
The idea of the result is that under the conditions specified, the CP rank of the n-leaf pseudocaterpillar—the tree of second-largest CP rank among those with n leaves—grows sufficiently slowly that the CP ranks of this tree and all other non-caterpillar trees are negligible in relation to that of the n-leaf caterpillar. The mean and variance of the CP rank of a random tree then depend only on the probability that a tree is a caterpillar and the CP rank of the caterpillar.
Proof
We distinguish two events. If is a caterpillar of height , then . Otherwise, if is some other tree, then its CP rank has upper bound , the CP rank of a pseudocaterpillar of height . These values yield the trivial bounds
| 9 |
| 10 |
By taking the ratio of (9) with , to verify , it suffices to show
| 11 |
Similarly, because so that , by taking the ratio of (10) and , verifying condition (11) suffices for verifying ; we see first that , and then follows by recalling that .
We will show that (8) implies (11). We first prove by induction that for all . This statement is readily verified for and . Now assume that the inequality holds for some positive integer , and write , so that . It follows from the recursions (6) and (7) that
The final fraction is positive since and . Thus,
completing the induction.
It follows (for ) that
by the assumption (8). Because this last expression goes to as n increases without bound, we have verified (11). This completes the proof.
The theorem finds that the asymptotic mean is simply the product of the CP rank of the caterpillar and the probability that a tree is a caterpillar. In all four types of random trees that we consider, we verify that satisfies (8), so that the theorem applies. This verification amounts to demonstrating that caterpillars are sufficiently probable as n grows large; if were to decrease too quickly, then the condition would not be satisfied.
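The number of caterpillar cladograms is n!/2, so that for a random cladogram (and equivalently, for a random Catalan tree), (4) gives the caterpillar probability
$$\pi_n = \frac{n!/2}{(2n-3)!!} = \frac{2^{n-2}\, n!\, (n-1)!}{(2n-2)!}. \tag{12}$$
For a random Otter tree on n leaves, we have no simple explicit expression for the caterpillar probability $\pi_n$. However, we have the asymptotic probability from (2) that a random Otter tree is the unique caterpillar:
$$\pi_n = \frac{1}{U_n}, \tag{13}$$
with $U_n$ the Wedderburn–Etherington number of (1), whose asymptotic growth is given in (2).
Finally, for a random binary search tree (Slowinski 1990, p. 92),
$$\pi_n = \frac{2^{n-2}}{(n-1)!}. \tag{14}$$
Verifying in (12), (13), and (14) that condition (8) is satisfied, we have shown the following theorem.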
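The caterpillar probabilities for the two leaf-labeled models can be evaluated exactly for small n. The sketch below assumes the counts stated in the text: n!/2 caterpillar cladograms among (2n−3)!! cladograms under the uniform model, and the closed form $2^{n-2}/(n-1)!$ under the Yule–Harding model, which is consistent with the tabulated values for small n. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def num_cladograms(n):
    """(2n-3)!!, the number of leaf-labeled binary rooted trees with n leaves."""
    count = 1
    for k in range(1, 2 * n - 2, 2):  # odd numbers 1, 3, ..., 2n-3
        count *= k
    return count

def caterpillar_prob_uniform(n):
    """Probability that a uniformly random cladogram is a caterpillar (n >= 2)."""
    return Fraction(factorial(n), 2 * num_cladograms(n))

def caterpillar_prob_yule(n):
    """Probability of the caterpillar shape under the Yule-Harding model (n >= 2)."""
    return Fraction(2 ** (n - 2), factorial(n - 1))
```

For n = 8 these give 64/429 and 4/315, respectively, so the caterpillar is considerably more probable under the uniform model.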
Theorem 11
With as in (12), (13), and (14), and with corresponding to either (the random Catalan tree), (the random Otter tree), (the random cladogram), or (the random binary search tree), we have
Numerical Computations
We informally examine the extent to which the asymptotic approximations for $E[\log_2 \log f(T_n)]$, $E[f(T_n)]$, and $\mathrm{Var}[f(T_n)]$ agree with the exact values for small n. First, Tables 1 and 2 show the CP rank and the probabilities of all unlabeled unordered binary trees for $n = 1$ to 8 under each of three models: uniformly random unlabeled unordered trees, uniformly random leaf-labeled trees, and Yule–Harding leaf-labeled trees. The much larger CP rank for the caterpillar compared to the pseudocaterpillar (and all other trees) is already visible in these tables.
Table 1.
CP rank and probability under three models for all unlabeled unordered binary trees with n leaves, $1 \leq n \leq 7$. For unlabeled uniform unordered trees, the probability is the reciprocal of the number of such trees, the Wedderburn–Etherington number defined by (1). For leaf-labeled uniform trees, it is the ratio of $n!/2^{s(t)}$ (the number of ways of labeling shape t, where the number of symmetric nodes $s(t)$ is the number of internal nodes whose two descendant subtrees have the same unlabeled shape) and $(2n-3)!!$, the number of leaf-labeled trees with n leaves (4). For leaf-labeled Yule–Harding trees, it is the ratio of $\bigl(n!/2^{s(t)}\bigr)\,(n-1)!/\prod_{r=3}^{n} (r-1)^{m_r(t)}$ and $n!\,(n-1)!/2^{n-1}$, where $m_r(t)$ is the number of internal nodes of t with r descendant leaves, $(n-1)!/\prod_{r=3}^{n} (r-1)^{m_r(t)}$ gives the number of labeled histories of a leaf-labeled tree (the number of sequences in which the tree can be produced by a sequence of bifurcations), and $n!\,(n-1)!/2^{n-1}$ is the total number of labeled histories for n labeled leaves
| n | CP rank | Height | Probability: unlabeled uniform unordered | Probability: leaf-labeled uniform | Probability: leaf-labeled Yule–Harding |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 2 | 1 | 1 | 1 | 1 |
| 3 | 3 | 2 | 1 | 1 | 1 |
| 4 | 5 | 3 | 1/2 | 4/5 | 2/3 |
| 4 | 4 | 2 | 1/2 | 1/5 | 1/3 |
| 5 | 12 | 4 | 1/3 | 4/7 | 1/3 |
| 5 | 8 | 3 | 1/3 | 1/7 | 1/6 |
| 5 | 6 | 3 | 1/3 | 2/7 | 1/2 |
| 6 | 68 | 5 | 1/6 | 8/21 | 2/15 |
| 6 | 30 | 4 | 1/6 | 2/21 | 1/15 |
| 6 | 17 | 4 | 1/6 | 4/21 | 1/5 |
| 6 | 13 | 4 | 1/6 | 4/21 | 4/15 |
| 6 | 9 | 3 | 1/6 | 1/21 | 2/15 |
| 6 | 7 | 3 | 1/6 | 2/21 | 1/5 |
| 7 | 2280 | 6 | 1/11 | 8/33 | 2/45 |
| 7 | 437 | 5 | 1/11 | 2/33 | 1/45 |
| 7 | 138 | 5 | 1/11 | 4/33 | 1/15 |
| 7 | 80 | 5 | 1/11 | 4/33 | 4/45 |
| 7 | 38 | 4 | 1/11 | 1/33 | 2/45 |
| 7 | 23 | 4 | 1/11 | 2/33 | 1/15 |
| 7 | 69 | 5 | 1/11 | 4/33 | 1/9 |
| 7 | 31 | 4 | 1/11 | 1/33 | 1/18 |
| 7 | 18 | 4 | 1/11 | 2/33 | 1/6 |
| 7 | 14 | 4 | 1/11 | 4/33 | 2/9 |
| 7 | 10 | 3 | 1/11 | 1/33 | 1/9 |
Table 2.
CP rank and probability under three models for all unlabeled unordered binary trees with n leaves, $n = 8$. The table design follows Table 1
| n | CP rank | Height | Probability: unlabeled uniform unordered | Probability: leaf-labeled uniform | Probability: leaf-labeled Yule–Harding |
|---|---|---|---|---|---|
| 8 | 2598062 | 7 | 1/23 | 64/429 | 4/315 |
| 8 | 95268 | 6 | 1/23 | 16/429 | 2/315 |
| 8 | 9455 | 6 | 1/23 | 32/429 | 2/105 |
| 8 | 3162 | 6 | 1/23 | 32/429 | 8/315 |
| 8 | 705 | 5 | 1/23 | 8/429 | 4/315 |
| 8 | 255 | 5 | 1/23 | 16/429 | 2/105 |
| 8 | 2348 | 6 | 1/23 | 32/429 | 2/63 |
| 8 | 467 | 5 | 1/23 | 8/429 | 1/63 |
| 8 | 155 | 5 | 1/23 | 16/429 | 1/21 |
| 8 | 93 | 5 | 1/23 | 32/429 | 4/63 |
| 8 | 47 | 4 | 1/23 | 8/429 | 2/63 |
| 8 | 2281 | 6 | 1/23 | 32/429 | 4/105 |
| 8 | 438 | 5 | 1/23 | 8/429 | 2/105 |
| 8 | 139 | 5 | 1/23 | 16/429 | 2/35 |
| 8 | 81 | 5 | 1/23 | 16/429 | 8/105 |
| 8 | 39 | 4 | 1/23 | 4/429 | 4/105 |
| 8 | 24 | 4 | 1/23 | 8/429 | 2/35 |
| 8 | 70 | 5 | 1/23 | 32/429 | 2/21 |
| 8 | 32 | 4 | 1/23 | 8/429 | 1/21 |
| 8 | 19 | 4 | 1/23 | 16/429 | 1/7 |
| 8 | 16 | 4 | 1/23 | 16/429 | 4/63 |
| 8 | 15 | 4 | 1/23 | 8/429 | 4/63 |
| 8 | 11 | 3 | 1/23 | 1/429 | 1/63 |
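The tabulated ranks and probabilities can be regenerated by enumerating tree shapes recursively. The sketch below is one possible way to do so, not necessarily the procedure used for the tables: each shape is built from an unordered pair of smaller shapes, and its CP rank, its number of leaf labelings, and its Yule–Harding probability are computed from those of the subtrees. The function names and the tuple layout are assumptions of this example.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def shapes(n):
    """All shapes with n leaves as (cp_rank, num_labelings, yule_harding_prob)."""
    if n == 1:
        return ((1, 1, Fraction(1)),)
    out = []
    for i in range(1, n // 2 + 1):           # i leaves in one subtree, n - i in the other
        for a in shapes(n - i):
            for b in shapes(i):
                if 2 * i == n and b[0] > a[0]:
                    continue                  # equal sizes: count each unordered pair once
                big, small = (a, b) if a[0] >= b[0] else (b, a)
                same = a[0] == b[0]           # equal ranks mean identical shapes
                rank = big[0] * (big[0] - 1) // 2 + small[0] + 1
                labelings = comb(n, i) * a[1] * b[1] // (2 if same else 1)
                prob = Fraction(1 if same else 2, n - 1) * a[2] * b[2]
                out.append((rank, labelings, prob))
    return tuple(out)

def mean_rank_uniform(n):
    """Exact mean CP rank for uniformly random leaf-labeled trees."""
    total = sum(lab for _, lab, _ in shapes(n))
    return Fraction(sum(rank * lab for rank, lab, _ in shapes(n)), total)

def mean_rank_yule(n):
    """Exact mean CP rank under the Yule-Harding model."""
    return sum(rank * p for rank, _, p in shapes(n))
```

The uniform probability of a shape is its number of labelings divided by the total over all shapes. The factor 2/(n−1) or 1/(n−1) in the Yule–Harding probability reflects the uniform split of the root into subtree sizes, with unordered pairs of distinct subtrees arising in two ways.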
Figure 2 plots the values of , the mean height , and the asymptotic approximation for under the three models. For each of the three models, we can observe similar shapes in plots for its three quantities. The values are greatest for the uniformly random leaf-labeled trees, with asymptotic approximation , followed by the uniformly random unlabeled unordered trees, with asymptotic approximation , and finally, the Yule–Harding leaf-labeled trees, with asymptotic approximation .
Fig. 2.
Expected value of the double logarithm of CP rank, , under three models, for to 20: uniformly random unlabeled unordered binary trees, uniformly random leaf-labeled binary trees, and Yule–Harding leaf-labeled binary trees. Exact values of (open symbols) appear alongside exact values of the expected tree height (open symbols superimposed with crosses) under the three models and the asymptotic expressions (closed symbols, dashed lines): for unlabeled uniform unordered (Theorem 5ii), for leaf-labeled uniform (Theorem 6), and for leaf-labeled Yule–Harding (Theorem 7). , (color figure online)
Figures 3 and 4 plot the exact mean and variance of under the three models alongside the asymptotic approximation based on the contribution of the caterpillar tree, taking the of these quantities to produce a comparable scale to Figure 2. In the figure, we observe that even for quite small n, the exact mean and variance are closely approximated by the asymptotic . The mean and variance are greatest for the uniformly random leaf-labeled trees, for which (12), followed by the uniformly random unlabeled unordered trees, with asymptotic approximation (13). For the Yule–Harding model, caterpillars are least probable (14).
Fig. 3.
Expected value of the CP rank, , under three models, for to 10: uniformly random unlabeled unordered binary trees, uniformly random leaf-labeled binary trees, and Yule–Harding leaf-labeled binary trees. Exact values of (open symbols) appear alongside asymptotic expressions from Theorem 10 (closed symbols, dashed lines), where follows (12) for leaf-labeled uniform and (14) for leaf-labeled Yule–Harding and is the CP rank of the caterpillar with internal nodes and n leaves (6). For unlabeled uniform unordered, is computed as the exact , where is the Wedderburn–Etherington number defined by (1) (color figure online)
Fig. 4.
Variance of the CP rank, , under three models, for to 10: uniformly random unlabeled unordered binary trees, uniformly random leaf-labeled binary trees, and Yule–Harding leaf-labeled binary trees. Exact values of (open symbols) appear alongside asymptotic expressions from Theorem 10 (closed symbols, dashed lines), where follows (12) for leaf-labeled uniform and (14) for leaf-labeled Yule–Harding and is the CP rank of the caterpillar with internal nodes and n leaves (6). For unlabeled uniform unordered, is computed as the exact , where is the Wedderburn–Etherington number defined by (1) (color figure online)
Discussion
We have analyzed the Colijn–Plazzotta rank of rooted binary trees, showing that the rank of a tree is largely determined by its height. Indeed, the ranking proceeds through all trees of a given height h before moving on to trees of height $h + 1$. We have also obtained asymptotic properties of the trees under three different models for selecting random trees, finding in particular the asymptotics of $\log_2 \log f(T_n)$ for random trees $T_n$. The asymptotic mean and variance of the CP rank across trees with n leaves depend only on the probability and CP rank of the n-leaf caterpillar, as the product of the probability and the CP rank of the caterpillar grows faster than the next-highest rank. A summary of mathematical results appears in Table 3.
Table 3.
Summary of the main asymptotic results under three models. refers to a random tree with n leaves under the model, and is the associated random CP rank. Properties of random trees are the same for uniformly random leaf-labeled unordered trees and for uniformly random unlabeled ordered trees
Numerical investigations clarify a pattern observable in the mathematical results, namely that the “uniform” model—uniformly random leaf-labeled trees—has CP ranks greater than the Yule–Harding model on leaf-labeled trees (Figures 2–4). This observation can be viewed as a consequence of the greater probability of the caterpillar shape in the uniform (12) than in the Yule–Harding model (14).
It has been suggested that CP rank can serve as a measure of tree balance and imbalance in empirical studies (Fischer et al. 2023; Rosenberg 2021). We have found that as n grows, the CP rank of the caterpillar grows so fast that for both the uniform and Yule–Harding models on leaf-labeled trees, the mean CP rank across trees with n leaves is asymptotically determined by the contribution of the caterpillar. Hence, as a balance statistic beyond the smallest tree sizes, the use of CP rank would amount primarily to distinguishing caterpillars from non-caterpillars. A potentially more suitable statistic is $\log_2 \log f(t)$, which places the CP ranks of different trees on a similar scale. Due to its extremely large values, the CP rank has been omitted from an empirical comparison of tree balance statistics (Kersting et al. 2025); we suggest that this problem could be resolved by including its double logarithm in its place.
The results have been obtained by connecting studies of CP rank as a quantity of mathematical phylogenetics to the extensive literature on tree height in studies grounded in theoretical computer science. As has been demonstrated here, such applications of theoretical computer science results on tree properties have the potential to provide solutions to unsolved problems in mathematical phylogenetics.
Although we have obtained the asymptotics of the mean and variance of the CP rank under the uniform and Yule–Harding models—the two models for which the mean and variance were noted by Fischer et al. (2023) as open problems—we have not commented on the exact mean and variance. For practical applications of CP rank, an understanding of the asymptotics likely suffices, but we note that the precise determination of the mean and variance of the CP rank remains an open problem.
Acknowledgements
This project developed from conversations at the Analysis of Algorithms meeting in Bath, United Kingdom (AofA2024), and we are grateful to the conference organizers.
Funding
We acknowledge the Natural Sciences and Engineering Research Council of Canada (LD), National Institutes of Health grant NIH R01-HG005855 (NAR), National Science Foundation grant NSF DMS-2450005 (NAR), and Swedish Research Council/Vetenskapsrådet grant 2022-04030 (SW).
Data Availability
The study has no associated data.
Declarations
Conflicts of Interest
The authors declare that they have no conflict of interest.
References
- Aldous D (1996) Probability distributions on cladograms. In Random Discrete Structures (Minneapolis, MN, 1993), volume 76 of IMA Vol. Math. Appl., pages 1–18. Springer, New York
- Aldous D, Pittel B (2025) The critical beta-splitting random tree I: heights and related results. Ann Appl Probab 35:158–195
- Aldous DJ (2001) Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat Sci 16:23–34
- Alimpiev E, Rosenberg NA (2021) Enumeration of coalescent histories for caterpillar species trees and p-pseudocaterpillar gene trees. Adv Appl Math 131:102265
- Broutin N, Flajolet P (2008) The height of random binary unlabelled trees. Fifth Colloquium on Mathematics and Computer Science. Volume AI of Discrete Mathematics and Theoretical Computer Science Proceedings. Nancy, France, pp 121–134
- Broutin N, Flajolet P (2012) The distribution of height and diameter in random non-plane binary trees. Random Struct Algorithms 41:215–252
- Colijn C, Plazzotta G (2018) A metric on phylogenetic tree shapes. Syst Biol 67:113–126
- Devroye L (1986) A note on the height of binary search trees. J Assoc Comput Machinery 33:489–498
- Devroye L (1997) Simulating theta random variates. Statist Probab Lett 31:275–279
- Disanto F, Fuchs M, Paningbatan AR, Rosenberg NA (2022) The distributions under two species-tree models of the number of root ancestral configurations for matching gene trees and species trees. Ann Appl Probab 32:4426–4458
- Doboli MR, Hwang H-K, Rosenberg NA (2024) Periodic behavior of the minimal Colijn-Plazzotta rank for trees with a fixed number of leaves. In 35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, volume 302 of LIPIcs. Leibniz Int. Proc. Inform. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern. Art. No. 18, 14 pages
- Drmota M (2003) An analytic approach to the height of binary search trees II. J Assoc Comput Machinery 50:333–374
- Drmota M (2009) Random Trees: an Interplay between Combinatorics and Probability. Springer, Wien
- Fischer M, Herbst L, Kersting S, Kühn AL, Wicke K (2023) Tree Balance Indices: A Comprehensive Survey. Springer, Cham, Switzerland
- Flajolet P, Odlyzko A (1982) The average height of binary trees and other simple trees. J Comput Syst Sci 25:171–213
- Ford DJ (2005) Probabilities on cladograms: introduction to the alpha model. arXiv:math/0511246v1
- Ford DJ (2006) Probabilities on cladograms: introduction to the alpha model. PhD thesis, Department of Mathematics, Stanford University
- Fuchs M (2025) Shape parameters of evolutionary trees in theoretical computer science. Philos Trans R Soc B, Biol Sci 380:20230304
- Harary F, Palmer EM, Robinson RW (1992) Counting free binary trees admitting a given height. J Combin Inform System Sci 17:175–181
- Harding EF (1971) The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Probab 3:44–77
- Kersting S, Wicke K, Fischer M (2025) Tree balance in phylogenetic models. Philos Trans R Soc B, Biol Sci 380:20230303
- OEIS Foundation Inc. (2025) The On-Line Encyclopedia of Integer Sequences. Published electronically at https://oeis.org
- Otter R (1948) The number of trees. Ann Math 49:583–599
- Reed B (2003) The height of a random binary search tree. J Assoc Comput Machinery 50:306–332
- Rosenberg NA (2007) Counting coalescent histories. J Comput Biol 14:360–377
- Rosenberg NA (2021) On the Colijn-Plazzotta numbering scheme for unlabeled binary rooted trees. Discr Appl Math 291:88–98
- Sedgewick R, Flajolet P (1996) An Introduction to the Analysis of Algorithms. Addison-Wesley, Boston
- Slowinski JB (1990) Probability of n-trees under two models: a demonstration that asymmetrical interior nodes are not improbable. Syst Biol 39:89–94
- Stanley RP (2015) Catalan Numbers. Cambridge University Press, New York