2025 Nov 3;87(12):172. doi: 10.1007/s11538-025-01538-7

Tree Height and the Asymptotic Mean of the Colijn–Plazzotta Rank of Unlabeled Binary Rooted Trees

Luc Devroye 1, Michael R Doboli 2, Noah A Rosenberg 2, Stephan Wagner 3,4
PMCID: PMC12583421  PMID: 41182472

Abstract

The Colijn–Plazzotta ranking is a bijective encoding of the unlabeled binary rooted trees with positive integers. We show that the rank $f(t)$ of a tree $t$ is closely related to its height $h$, the maximal path length from a leaf to the root. We consider the rank $f(\tau_n)$ of a random $n$-leaf tree $\tau_n$ under each of three models: (i) uniformly random unlabeled unordered binary rooted trees, or unlabeled topologies; (ii) uniformly random leaf-labeled binary trees, or labeled topologies under the uniform model; and (iii) random binary search trees, or labeled topologies under the Yule–Harding model. Relying on the close relationship between tree rank and tree height, we obtain results concerning the asymptotic properties of $\log\log f(\tau_n)$. In particular, we find $E\{\log_2\log f(\tau_n)\} \sim 2\sqrt{\pi n}$ for uniformly random unlabeled ordered binary rooted trees and uniformly random leaf-labeled binary trees, and for a constant $\alpha \approx 4.31107$, $E\{\log_2\log f(\tau_n)\} \sim \alpha\log n$ for leaf-labeled binary trees under the Yule–Harding model. We show that the mean of $f(\tau_n)$ itself under the three models is largely determined by the rank $c_{n-1}$ of the highest-ranked tree, the caterpillar, obtaining an asymptotic relationship with $\pi_n c_{n-1}$, where $\pi_n$ is a model-specific function of $n$. The results resolve open problems, providing a new class of results on an encoding useful in mathematical phylogenetics.

Keywords: Colijn-Plazzotta rank, Mathematical phylogenetics, Tree height

Introduction

The Colijn–Plazzotta rank $f(t)$ of a binary rooted tree $t$ is defined recursively as follows (Colijn and Plazzotta 2018): if $\ell(t)$ and $r(t)$ are the left and right subtree, respectively, arranged in such a way that $f(\ell(t)) \ge f(r(t))$, then

$$f(t) = \frac{f(\ell(t))\,\bigl(f(\ell(t))-1\bigr)}{2} + 1 + f(r(t)).$$

The rank 1 is assigned to a tree with a single leaf.
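
The recursion can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical encoding in which a leaf is `None` and an internal node is a pair of subtrees; the encoding and the name `cp_rank` are choices made here, not notation from the paper.

```python
def cp_rank(tree):
    """Colijn-Plazzotta rank; a leaf is None, an internal node is a pair (left, right)."""
    if tree is None:
        return 1  # the single-leaf tree has rank 1
    a, b = cp_rank(tree[0]), cp_rank(tree[1])
    hi, lo = max(a, b), min(a, b)  # arrange subtrees so that f(l(t)) >= f(r(t))
    return hi * (hi - 1) // 2 + 1 + lo


LEAF = None
cherry = (LEAF, LEAF)
caterpillar4 = (((LEAF, LEAF), LEAF), LEAF)
balanced4 = (cherry, cherry)
print(cp_rank(cherry), cp_rank(caterpillar4), cp_rank(balanced4))  # 2 5 4
```

Sorting the two subtree ranks inside the function means the input pairs may be given in either order, which matches the unordered nature of the trees.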

In the study of evolutionary trees, statistical summaries of trees are often used for characterizing the outcomes of evolutionary models and for statistical inference of the processes that have given rise to the trees (Fischer et al. 2023). Colijn–Plazzotta rank, or CP rank, has been used as a summary of tree shape in empirical scenarios in which trees of biological relationships are unconcerned with leaf labels, such as in examples with trees of sequences from the same pathogenic organism (Colijn and Plazzotta 2018).

Informally, for a fixed number of leaves, the CP rank is lowest for balanced trees and greatest for unbalanced trees. It has therefore been proposed as a measure of tree balance (Fischer et al. 2023; Rosenberg 2021). In a compilation of mathematical results for tree balance indices that capture many different features of rooted trees, Fischer et al. (2023) have listed a set of basic properties that are of interest for any balance index. Among these are the minimal and maximal values of the index across all trees with a fixed number of leaves, and the mean and variance of the index under the two most frequently used probabilistic models in mathematical phylogenetics. One is the uniform model, also sometimes known as the proportional-to-distinguishable-arrangements or PDA model, which assigns equal probability to all binary rooted labeled trees with a fixed number of leaves. The other is the Yule–Harding model, also sometimes known as the equal-rates Markov or ERM model or simply as the Yule model, in which, conditional on the number of leaves, the probability of a binary rooted labeled tree is proportional to the number of sequences of bifurcations that can give rise to the tree. The mathematical properties of balance indices assist in characterizing the way that balance indices relate to one another and how they perform in empirical settings.

The trees of minimal and maximal CP rank for a fixed number of leaves have been characterized (Rosenberg 2021), and indeed the asymptotic CP ranks of these trees in terms of the number of leaves have also been obtained (Doboli et al. 2024; Rosenberg 2021). The mean and variance under the uniform and Yule–Harding models have been listed as open problems (Fischer et al. 2023, p. 243).

We show here that the asymptotic mean and variance under the Yule–Harding model can be obtained by a connection between this model in the phylogenetics setting and the nearly equivalent formulation of random binary search trees in computer science. First, we show that the order of magnitude of the CP rank of a tree is determined by the height of the tree, the greatest distance from the root to a leaf. By connecting the CP rank to tree height and in turn to probabilistic results for the height, we obtain distributional properties of the CP rank under the Yule–Harding model. We also obtain related results on the closely related uniform model on labeled binary rooted trees and the uniform model on unlabeled binary rooted trees.

Tree Height and the Colijn–Plazzotta Rank

We consider all trees to be binary and rooted. The height of a tree is the maximal path length in edges from the root to a leaf. Two special families of binary trees with $n$ leaves play a key role in our analysis: the caterpillars, and the pseudocaterpillars (Figure 1). In a caterpillar with $n$ leaves, $n \ge 1$, every non-leaf has at least one leaf child. This condition forces each caterpillar to consist of a chain of $n-1$ internal (i.e. non-leaf) nodes to which a layer of external nodes is added. The pseudocaterpillars (Rosenberg 2007) (or 4-pseudocaterpillars in the terminology of Alimpiev and Rosenberg (2021)) can be constructed as follows for $n \ge 4$: start with a chain of $n-3$ internal nodes. Give the bottom node in the chain two children, and finally, complete the tree by adding a layer of $n$ external nodes. Caterpillars have height $n-1$, and pseudocaterpillars have height $n-2$.

Fig. 1.


Caterpillar and pseudocaterpillar trees. (A) Caterpillar tree with n=8 leaves. The height of the tree is n-1=7. (B) Pseudocaterpillar tree with n=8 leaves. The height of the tree is n-2=6

Among binary rooted trees with a fixed number of leaves, Rosenberg (2021, Corollary 10) found that the tree with the largest CP rank was the caterpillar. The CP rank of the caterpillar tree with $n$ leaves can be computed recursively via a sequence termed $b_n$ by Rosenberg (2021, Theorem 9). It is convenient to shift the index of the sequence by 1 so that here, we will use $c_k$ to correspond to the CP rank of the caterpillar with height $k$ and $k+1$ leaves. The sequence $c_k$ begins 1, 2, 3, 5, 12, 68, 2280 starting at $k=0$, matching OEIS A108225 (OEIS Foundation Inc. 2025) for $k \ge 1$.

Lemma 1

Let the sequence $c_k$ be defined by $c_0 = 1$ and $c_{k+1} = c_k(c_k-1)/2 + 2$ for $k \ge 0$. For every tree $t$ of height $h$, we have

$$c_h \le f(t) < c_{h+1}.$$

Proof

The proof proceeds by induction on $h$. For $h = 0$, the tree consists of a single leaf, and we have $1 = c_0 = f(t) < c_1 = 2$. Thus, the statement holds in this case, and we can proceed with the induction step.

For a tree $t$ of height $h$, suppose $h_\ell < h$ and $h_r < h$ are the heights of subtrees $\ell(t)$ and $r(t)$, respectively. From the induction hypothesis for trees of height less than $h$ and the left–right arrangement so that $f(\ell(t)) \ge f(r(t))$, it follows that

$$c_{h_r} \le f(r(t)) \le f(\ell(t)) < c_{h_\ell+1}.$$

The sequence $c_k$ is increasing (Rosenberg 2021, Lemma 8), so that $h_r < h_\ell + 1$, and hence, $h_r \le h_\ell$.

Because $h = \max(h_\ell, h_r) + 1$, it follows that $h_\ell = h - 1$. Thus, we have, again by the induction hypothesis,

$$f(t) = \frac{f(\ell(t))\,(f(\ell(t))-1)}{2} + 1 + f(r(t)) \ge \frac{f(\ell(t))\,(f(\ell(t))-1)}{2} + 1 + 1 \ge \frac{c_{h-1}(c_{h-1}-1)}{2} + 2 = c_h,$$

which proves the lower bound. On the other hand,

$$f(t) = \frac{f(\ell(t))\,(f(\ell(t))-1)}{2} + 1 + f(r(t)) \le \frac{f(\ell(t))\,(f(\ell(t))-1)}{2} + 1 + f(\ell(t)) = \frac{f(\ell(t))\,(f(\ell(t))+1)}{2} + 1 \le \frac{(c_h-1)\,c_h}{2} + 1 = c_{h+1} - 1,$$

proving the upper bound. This completes the induction.

We conclude that the behavior of the height is to a great extent responsible for the behavior of the Colijn–Plazzotta rank of a tree. Indeed, because the CP rank is bijective with the positive integers (Rosenberg 2021, Proposition 2), the lemma implies that as the positive integers are traversed, for each $h \ge 0$, the ranking proceeds through trees with height $h$, then proceeds to those with height $h+1$, and so on. We immediately obtain the following corollaries (which are well known, see Harary et al. (1992)).

Corollary 2

For $h \ge 0$, the number of unlabeled binary rooted trees with height at most $h$ is $c_{h+1} - 1$.

Corollary 3

For $h \ge 0$, the number of unlabeled binary rooted trees with height exactly $h$ is $c_{h+1} - c_h$.

The sequence $c_{h+1} - 1$ begins at $h = 0$ with values 1, 2, 4, 11, 67, 2279 (OEIS A006894). The sequence $c_{h+1} - c_h$ begins at $h = 0$ with values 1, 1, 2, 7, 56, 2212 (OEIS A002658).
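
Lemma 1 and the two corollaries can be checked by brute force for small heights. The sketch below inverts the rank recursion to recover the subtree ranks of any rank, computes heights directly from both subtrees (without assuming the lemma), and then compares against the sequence $c_k$. The helper names `split_rank` and `height_of_rank` are illustrative and do not come from the paper.

```python
from functools import lru_cache
from math import isqrt


def caterpillar_ranks(k_max):
    """c_0, ..., c_{k_max} from c_0 = 1 and c_{k+1} = c_k (c_k - 1) / 2 + 2."""
    c = [1]
    for _ in range(k_max):
        c.append(c[-1] * (c[-1] - 1) // 2 + 2)
    return c


def split_rank(rank):
    """Invert the recursion: return (f(l(t)), f(r(t))) for a rank >= 2."""
    a = (1 + isqrt(8 * (rank - 2) + 1)) // 2  # largest a with a(a-1)/2 + 2 <= rank
    return a, rank - 1 - a * (a - 1) // 2


@lru_cache(maxsize=None)
def height_of_rank(rank):
    """Height of the tree with the given CP rank, using both subtrees explicitly."""
    if rank == 1:
        return 0
    a, b = split_rank(rank)
    return 1 + max(height_of_rank(a), height_of_rank(b))


c = caterpillar_ranks(6)  # [1, 2, 3, 5, 12, 68, 2280]
ranks = range(1, c[6])    # all trees of height at most 5
lemma_holds = all(c[height_of_rank(f)] <= f < c[height_of_rank(f) + 1] for f in ranks)
exactly_h = [sum(1 for f in ranks if height_of_rank(f) == h) for h in range(6)]
at_most_h = [c[h + 1] - 1 for h in range(6)]
print(lemma_holds, exactly_h, at_most_h)
```

The counts by exact height reproduce the values listed for $c_{h+1} - c_h$, and their running totals match $c_{h+1} - 1$.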

According to Rosenberg (2021, Corollary 14), $c_k \sim 2\gamma^{2^k}$ for a constant $\gamma \approx 1.11625$ as $k \to \infty$; note that $\gamma = \beta^2$ for the constant $\beta$ in Rosenberg (2021), owing to the shift by 1 in $c_k$ relative to the indexing in Rosenberg (2021). We immediately obtain the following result.

Corollary 4

Uniformly over all trees t with height h, we have,

$$2^h + O(1) \le \log_\gamma f(t) \le 2^{h+1} + O(1),$$

and thus, for $h > 0$,

$$\log_2\log_\gamma f(t) = \log_2\log f(t) + O(1) = h + O(1).$$

In other words, the difference $|\log_2\log_\gamma f(t) - h|$ is bounded by a universal constant.
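
As a numerical illustration of the growth rate behind Corollary 4, the constant $\gamma$ can be estimated by solving $c_k = 2\gamma^{2^k}$ for $\gamma$ at increasing $k$. This is only a sketch; the function name `gamma_estimate` is hypothetical, and logarithms of exact integers are used so that the very large $c_k$ never need to be converted to floating point.

```python
import math


def caterpillar_ranks(k_max):
    """c_0, ..., c_{k_max} from c_0 = 1 and c_{k+1} = c_k (c_k - 1) / 2 + 2."""
    c = [1]
    for _ in range(k_max):
        c.append(c[-1] * (c[-1] - 1) // 2 + 2)
    return c


def gamma_estimate(k, c):
    """Solve c_k = 2 * gamma ** (2 ** k) for gamma via logarithms."""
    return math.exp((math.log(c[k]) - math.log(2)) / 2 ** k)


c = caterpillar_ranks(12)
for k in (4, 6, 8, 12):
    print(k, round(gamma_estimate(k, c), 6))
```

The estimates settle quickly because each step of the recursion roughly squares the previous value.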

We now analyze the behavior of the CP rank of random trees, which is mainly determined by the height. Indeed, we proceed by making use of extensive probabilistic results available on tree height under different sets of assumptions.

Uniformly Random Unlabeled Binary Trees

Consider an unlabeled binary rooted tree on n leaves. Each node possesses either 0 offspring (leaves) or 2 offspring (internal nodes). Note that binary trees in which each node possesses either 0 or 2 (and not 1) offspring are sometimes termed full binary trees; here, all binary trees are “full” except where specified. A distinction exists between binary trees in which the left–right order of the children matters (ordered binary trees), and those in which the order is irrelevant (unordered binary trees, or unlabeled topologies in the terminology of mathematical phylogenetics, or Otter trees after Otter (1948)).

Let $\tau_n$ be a uniformly random ordered binary tree on $n \ge 1$ leaves, also called a random Catalan tree because the number of such trees is

$$k_{n-1} = \frac{1}{n}\binom{2n-2}{n-1},$$

where $k_n$ is the $n$-th Catalan number (Stanley 2015, Exercise 5). Catalan trees, viewed as ordered binary trees with $n$ leaves, in which each node has 0 or 2 offspring, can be placed in bijection with trees with $n-1$ nodes in which the left–right order matters and each node has 0, 1, or 2 offspring. For the bijection, we consider the latter type of tree, treating its $n-1$ nodes as internal nodes, and add descendant leaves so that each node that started with 0 or 1 offspring now has 2 offspring. Catalan trees are an example of a simply generated family of trees, and the random Catalan tree is also a special case of a conditioned Galton–Watson tree, with an offspring distribution whose support is $\{0, 2\}$. See, for example, Sedgewick and Flajolet (1996, p. 224) and Drmota (2009, Section 1.2.7). We denote the CP rank of a random Catalan tree by $C_n$ ($C$ for Catalan).

Let $\tau_n$ be a uniformly random unordered binary tree, a uniformly random Otter tree. The number of such trees can be calculated recursively. The exact value $u_n$ (Wedderburn–Etherington number, OEIS A001190) for the number of such trees on $n$ leaves follows

$$u_n = \begin{cases} 1, & n = 1, \\ \sum_{j=1}^{(n-1)/2} u_j u_{n-j}, & \text{odd } n \ge 3, \\ \left(\sum_{j=1}^{n/2-1} u_j u_{n-j}\right) + \dfrac{u_{n/2}(u_{n/2}+1)}{2}, & \text{even } n \ge 2. \end{cases} \tag{1}$$

The asymptotic approximation follows (Harding 1971; Otter 1948)

$$u_n = (1 + o(1))\,\frac{1}{\kappa\, n^{3/2} \rho^n}, \tag{2}$$

where $\kappa \approx 3.13699$ and $\rho \approx 0.40270$. The CP rank of a random Otter tree is denoted by $O_n$ ($O$ for Otter).
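
A short sketch of recursion (1), together with a comparison against approximation (2). The constants are the five-digit values quoted above, so the ratio printed at the end is only indicative; the function names are illustrative.

```python
KAPPA = 3.13699  # constant quoted in (2)
RHO = 0.40270    # constant quoted in (2)


def wedderburn_etherington(n_max):
    """u[n] = number of unlabeled unordered binary rooted trees with n leaves, via (1)."""
    u = [0] * (n_max + 1)
    u[1] = 1
    for n in range(2, n_max + 1):
        total = sum(u[j] * u[n - j] for j in range(1, (n - 1) // 2 + 1))
        if n % 2 == 0:
            half = u[n // 2]
            total += half * (half + 1) // 2  # two subtrees with n/2 leaves each
        u[n] = total
    return u


def asymptotic_u(n):
    """Leading term of (2): 1 / (kappa * n^(3/2) * rho^n)."""
    return 1.0 / (KAPPA * n ** 1.5 * RHO ** n)


u = wedderburn_etherington(200)
print(u[1:9])                     # [1, 1, 1, 2, 3, 6, 11, 23]
print(u[200] / asymptotic_u(200))  # close to 1
```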

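To understand Theorem 5, we define a theta random variable as a random variable with distribution function (Devroye 1997)

$$F(x) = \frac{4\pi^{5/2}}{x^3}\sum_{j=1}^{\infty} j^2 e^{-\pi^2 j^2/x^2} = \sum_{j=-\infty}^{\infty} \left(1 - 2j^2x^2\right) e^{-j^2x^2}, \qquad x > 0. \tag{3}$$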
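
The two series in (3) converge quickly in different regimes (the first for small $x$, the second for large $x$), and their agreement can be checked numerically. The sketch below truncates each series at a fixed number of terms, which is an assumption made here for illustration; the function names are hypothetical.

```python
import math


def theta_cdf_first_form(x, terms=200):
    """First series in (3): (4 pi^(5/2) / x^3) * sum_{j>=1} j^2 exp(-pi^2 j^2 / x^2)."""
    s = sum(j * j * math.exp(-(math.pi * j / x) ** 2) for j in range(1, terms + 1))
    return 4 * math.pi ** 2.5 / x ** 3 * s


def theta_cdf_second_form(x, terms=200):
    """Second series in (3); the j = 0 term is 1 and the +j, -j terms are equal."""
    s = sum((1 - 2 * j * j * x * x) * math.exp(-(j * x) ** 2) for j in range(1, terms + 1))
    return 1 + 2 * s


for x in (0.5, 1.0, 2.0, 5.0):
    print(x, theta_cdf_first_form(x), theta_cdf_second_form(x))
```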

CP rank is defined for unordered binary trees. To extend the CP rank to ordered binary trees, we compute the CP rank of the unordered binary tree associated with an ordered binary tree.

Theorem 5

  • (i)
    Let $\tau_n$ be a uniformly random unlabeled ordered binary tree with $n$ leaves, with CP rank $C_n = f(\tau_n)$. Then
    $$E\{\log_2\log C_n\} \sim 2\sqrt{\pi n},$$
    and
    $$\frac{\log_2\log C_n}{2\sqrt{n}}$$
    converges in distribution to a theta random variable as defined by (3).
  • (ii)
    Let $\tau_n$ be a uniformly random unlabeled unordered binary tree with $n$ leaves, with CP rank $O_n = f(\tau_n)$. Then, with $\kappa$ as in (2),
    $$E\{\log_2\log O_n\} \sim \kappa\sqrt{n},$$
    and
    $$\frac{\log_2\log O_n}{\kappa\sqrt{n/\pi}}$$
    converges in distribution to a theta random variable as defined by (3).

Proof

  • (i)
    The statement on $\tau_n$ is a consequence of a result of Flajolet and Odlyzko (1982, Theorem B) about the height $H_n$ of $\tau_n$: $E\{H_n\}/\sqrt{n} \to 2\sqrt{\pi}$ as $n \to \infty$, and $H_n/(2\sqrt{n})$ tends in distribution to a theta random variable. By Corollary 4, the difference $|\log_2\log C_n - H_n|$ is (deterministically, thus almost surely) bounded by a universal constant, so that
    $$\frac{\log_2\log C_n - H_n}{2\sqrt{n}}$$
    is $O(n^{-1/2})$; for any sequence of random trees of increasing size, this quantity goes to 0 (almost sure convergence, and hence, convergence in probability). The statement on the expected value now follows from the linearity of expectation and the statement on convergence in distribution follows from Slutsky’s theorem applied to the convergence in distribution of $H_n/(2\sqrt{n})$ and the convergence in probability to 0 of $(\log_2\log C_n - H_n)/(2\sqrt{n})$.
  • (ii)
    The statement on $\tau_n$ follows in the same fashion from the results of Broutin and Flajolet (2008, Theorem 1 and Theorem 5; 2012, Theorem 1 and Theorem 3) on the height of unlabeled unordered binary trees. These state that the height $H_n$ of a random unlabeled unordered binary tree with $n$ leaves satisfies $E\{H_n\}/\sqrt{n} \to \kappa$, and that $H_n/(\kappa\sqrt{n/\pi})$ tends in distribution to a theta random variable. We remark here that our notation differs slightly from Broutin and Flajolet (2012): our constant $\kappa \approx 3.13699$ corresponds to the constant denoted $2\sqrt{\pi}/\lambda$ in Broutin and Flajolet (2012), and our distribution function $F(x)$ in (3) is $1 - \Theta(2x)$ in the notation of Broutin and Flajolet.
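
For small $n$, the expected height of a uniformly random ordered binary tree can be computed exactly and compared with the leading term $2\sqrt{\pi n}$. The sketch below counts ordered trees by leaves and height bound with a simple dynamic program; this counting scheme is an illustration chosen here, not the method of Flajolet and Odlyzko, and convergence to the asymptotic value is slow.

```python
import math
from fractions import Fraction


def ordered_height_counts(n_max):
    """counts[h][n] = number of ordered binary trees with n leaves and height <= h."""
    counts = [[0, 1] + [0] * (n_max - 1)]  # height 0: only the single leaf
    for _ in range(1, n_max):
        prev = counts[-1]
        counts.append([0, 1] + [sum(prev[k] * prev[n - k] for k in range(1, n))
                                for n in range(2, n_max + 1)])
    return counts


def expected_height(n, counts):
    """E{H_n} as the sum over h >= 0 of P{H_n > h}, returned as an exact fraction."""
    total = counts[n - 1][n]  # every n-leaf tree has height at most n - 1
    return sum(Fraction(total - counts[h][n], total) for h in range(n - 1))


counts = ordered_height_counts(60)
for n in (4, 15, 60):
    print(n, float(expected_height(n, counts)), 2 * math.sqrt(math.pi * n))
```

For $n = 4$ the exact value is $14/5$: four of the five ordered trees are caterpillar-shaped with height 3, and one is balanced with height 2.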

Uniformly Random Leaf-Labeled Binary Trees

A leaf-labeled binary tree with n leaves is a binary tree in which the leaves are bijectively labeled from 1 to n, and in which each internal node has two children. The children are unordered. Such trees are also called labeled topologies or cladograms.

We consider a uniformly random cladogram τn. The number of such trees is

$$(2n-3)\cdot(2n-5)\cdots 3\cdot 1 = \frac{1}{2^{n-1}}\,\frac{(2n-2)!}{(n-1)!}, \tag{4}$$

all of which are equally likely under this model of randomness (OEIS A001147). The CP rank of a random cladogram is denoted by Ln (L for labeled).

A model of uniformly random cladograms is a special case of more general models on the cladograms, such as Ford’s alpha-splitting model (Ford 2005, 2006) and Aldous’s beta-splitting model (Aldous 1996, 2001). In particular, Aldous (1996, Proposition 4, $\beta = -\tfrac{3}{2}$ case) showed that the expected height of a random cladogram satisfies

$$E\{H_n\} \sim 2\sqrt{\pi n}.$$

It is worth pointing out that this result (including the constant $2\sqrt{\pi}$) is the same as for uniformly random unlabeled ordered binary trees (compare to Theorem 5i). This is no coincidence: for every unlabeled ordered binary tree on $n$ leaves, there are $n!$ possibilities to label the leaves and turn it into a leaf-labeled ordered binary tree. Likewise, precisely $2^{n-1}$ possibilities turn a labeled unordered binary tree on $n$ leaves into a labeled ordered binary tree (by switching the order of the children at the internal nodes). For this reason, the distribution of the height and any other parameters that do not depend on labels or order is the same for three uniform models: unlabeled ordered, labeled unordered, and labeled ordered binary trees (Disanto et al. 2022, Section 3.1). In particular, the following result is equivalent to part (i) of Theorem 5.

Theorem 6

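Let $\tau_n$ be a uniformly random leaf-labeled binary tree with $n$ leaves, with CP rank $L_n = f(\tau_n)$. Then

$$E\{\log_2\log L_n\} \sim 2\sqrt{\pi n},$$

and

$$\frac{\log_2\log L_n}{2\sqrt{n}}$$

converges in distribution to a theta random variable as defined by (3).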

Aldous’s beta-splitting model for random binary trees has a shape parameter $\beta \in [-2, \infty]$, encompassing a limiting unbalanced model ($\beta = -2$), a limiting balanced model ($\beta = \infty$), the Yule model ($\beta = 0$), and the uniform model in Theorem 6 ($\beta = -\tfrac{3}{2}$). Generally, Aldous (1996, Proposition 4) proved the following results on the height $H_n$:

  • For $\beta > -1$, the ratio $H_n/\log n$ tends in probability and in expectation to a constant $g(\beta)$. There is no explicit expression for this constant, but numerical values can be determined from an implicit equation given by Aldous (1996, Proposition 4). To mention some examples, $g(\infty) = 1/\log 2 \approx 1.44270$, and we obtain $g(1) \approx 3.19258$, $g(0) \approx 4.31107$, and $g(-\tfrac{1}{2}) \approx 6.38090$ from the implicit equation (note that Aldous (1996) only gives two digits each). The case $\beta = 0$ corresponds to the Yule model (see Section 5 below for more information). For $\beta = \infty$, all internal nodes split their subtrees (almost) precisely in half: the difference of the subtree sizes is at most 1.

  • For $\beta = -1$, $E\{H_n\} \ge (6/\pi^2 + o(1))(\log n)^2$. Aldous’s proposition did not report a result for $E\{H_n\}$ with $\beta = -1$, but this inequality follows quickly from Aldous’s results reported in the proposition for related quantities. Recently, Aldous and Pittel (2025, Theorem 1.5) showed that $H_n \le (\gamma + \epsilon)(\log n)^2$ with probability approaching 1 with increasing $n$, where $\epsilon > 0$ and $\gamma \approx 42.9$.

  • For $\beta \in (-2, -1)$, $n^{1+\beta} E\{H_n\} \to g(\beta)$, and $n^{1+\beta} H_n$ has a non-degenerate limit distribution.

These results on tree height for cladograms under the beta-splitting model directly impact the Colijn–Plazzotta rank. For example, for $\beta \in (-2, -1)$, we have for Aldous’s beta-splitting tree $\tau_n$ with $n$ leaves

$$E\{\log_2\log f(\tau_n)\} \sim \frac{g(\beta)}{n^{1+\beta}}.$$

Yule–Harding Trees, Random Binary Search Trees

Among the probability distributions that could be placed on the leaf-labeled binary trees with n leaves, perhaps the most frequently considered, along with the uniform distribution of Section 4, is the β=0 case of the beta-splitting model. This model corresponds to the random binary search trees, which are identical to Yule or Yule–Harding trees in phylogenetics (Fuchs 2025), except for the convention that random binary search trees are typically indexed by the number of internal nodes and Yule–Harding trees are indexed by the number of leaves. We index trees by the number of leaves, considering random binary search trees in which all internal nodes have two children so that the total number of internal nodes is n-1 when the number of leaves is n.

To be precise, we start with a standard random binary search tree on n-1 (internal) nodes and attach a layer of n external nodes, i.e., we give a second child to all (internal) nodes having one child, and give two children to all leaves. The random CP rank of a tree under this model is denoted by Sn (S for search tree).

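For these trees, the height $H_n$ satisfies (Devroye 1986, Theorem 5.1)

$$\frac{H_n}{\log n} \xrightarrow{p} \alpha,$$

where $\alpha \approx 4.31107$ is the unique solution in $(2, \infty)$ of the equation

$$\alpha\log(2e/\alpha) = 1.$$

Setting

$$\beta = \frac{3\alpha}{2\alpha - 2} \approx 1.95303,$$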
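
The two constants can be reproduced numerically. The sketch below uses bisection on the interval $(2, 2e)$, where the function $\alpha\log(2e/\alpha) - 1$ changes sign from positive to negative; the choice of bisection and the name `solve_alpha` are illustrative assumptions rather than anything prescribed by the sources cited.

```python
import math


def solve_alpha(tol=1e-12):
    """Root of a * log(2e / a) = 1 in (2, infinity), found by bisection on [2, 2e]."""
    def g(a):
        return a * math.log(2 * math.e / a) - 1

    lo, hi = 2.0, 2 * math.e  # g(2) = 1 > 0 and g(2e) = -1 < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha = solve_alpha()
beta = 3 * alpha / (2 * alpha - 2)
print(round(alpha, 5), round(beta, 5))
```

Because the derivative of $\alpha\log(2e/\alpha)$ is $\log(2/\alpha)$, which is negative for $\alpha > 2$, the function is decreasing there and the root found is the unique one.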

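Reed (2003) and Drmota (2003) showed that $H_n - \alpha\log n + \beta\log\log n$ is tight, i.e.,

$$\limsup_{x\to\infty}\left[\sup_n P\{|H_n - \alpha\log n + \beta\log\log n| \ge x\}\right] = 0. \tag{5}$$

One way to see this result is as follows: Reed (2003, Theorem 1) states that

$$E\{H_n - \alpha\log n + \beta\log\log n\} = O(1)$$

and

$$V\{H_n - \alpha\log n + \beta\log\log n\} = V\{H_n\} = O(1),$$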

from which tightness follows by a standard application of the Chebyshev inequality. Alternatively, one can use Lemmas 8 and 10 of Reed (2003), which provide explicit tail bounds.

Theorem 7

Let $\tau_n$ be a random leaf-labeled binary tree with $n$ leaves following the Yule–Harding distribution, with CP rank $S_n = f(\tau_n)$. Then

$$E\{\log_2\log S_n\} \sim \alpha\log n,$$

and

$$\frac{\log_2\log S_n}{\log n} \xrightarrow{p} \alpha.$$

Proof

The proof is similar to that of Theorem 5. By Corollary 4, the difference between $\log_2\log S_n$ and the height $H_n$ is bounded, so

$$\frac{\log_2\log S_n - H_n}{\log n}$$

goes to 0 (almost surely, thus also in probability). The second part of the result follows immediately via Slutsky’s theorem from the fact that $H_n/\log n \xrightarrow{p} \alpha$; the first part follows from the fact that $E\{H_n/\log n\} \to \alpha$ as $n \to \infty$ (Devroye 1986).
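
The exact expected height under the Yule–Harding model can be tabulated for moderate $n$ and set beside $\alpha\log n - \beta\log\log n$. The sketch relies on a standard property of the model that is assumed here rather than stated in the text: the root splits the $n$ leaves into subtrees of sizes $k$ and $n-k$ with $k$ uniform on $\{1, \ldots, n-1\}$, and the two subtrees are independent Yule–Harding trees. The difference printed in the last column should stay bounded, consistent with the tightness statement (5), though small $n$ says little about the limit.

```python
import math

ALPHA = 4.31107  # constant quoted in the text
BETA = 1.95303   # constant quoted in the text


def yule_height_cdf(n_max):
    """cdf[h][n] = P{H_n <= h} for a Yule-Harding tree with n leaves."""
    cdf = [[0.0, 1.0] + [0.0] * (n_max - 1)]  # height 0: only the single leaf
    for _ in range(1, n_max):
        prev = cdf[-1]
        cdf.append([0.0, 1.0] + [sum(prev[k] * prev[n - k] for k in range(1, n)) / (n - 1)
                                 for n in range(2, n_max + 1)])
    return cdf


def expected_height(n, cdf):
    """E{H_n} as the sum over h >= 0 of P{H_n > h}."""
    return sum(1.0 - cdf[h][n] for h in range(n - 1))


cdf = yule_height_cdf(100)
for n in (4, 20, 100):
    mean_height = expected_height(n, cdf)
    centering = ALPHA * math.log(n) - BETA * math.log(math.log(n))
    print(n, round(mean_height, 4), round(mean_height - centering, 4))
```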

Theorem 8

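Let $\tau_n$ be a random leaf-labeled binary tree with $n$ leaves following the Yule–Harding distribution, with CP rank $S_n = f(\tau_n)$. Then

$$\frac{(\log n)^{\beta\log 2}\,\log S_n}{n^{\alpha\log 2}}$$

is a tight sequence of random variables.

Proof

By Corollary 4, there exists an absolute positive constant $K$ such that $K\cdot 2^{H_n} \ge \log S_n$. Thus,

$$\frac{(\log n)^{\beta\log 2}\,\log S_n}{n^{\alpha\log 2}} \ge x$$

implies

$$2^{H_n} \ge \frac{x\, n^{\alpha\log 2}}{K (\log n)^{\beta\log 2}},$$

or

$$H_n - \alpha\log n + \beta\log\log n \ge \frac{\log(x/K)}{\log 2}.$$

This means that

$$P\left\{\left|\frac{(\log n)^{\beta\log 2}\,\log S_n}{n^{\alpha\log 2}}\right| \ge x\right\} = P\left\{\frac{(\log n)^{\beta\log 2}\,\log S_n}{n^{\alpha\log 2}} \ge x\right\} \le P\left\{H_n - \alpha\log n + \beta\log\log n \ge \frac{\log(x/K)}{\log 2}\right\} \le P\left\{|H_n - \alpha\log n + \beta\log\log n| \ge \frac{\log(x/K)}{\log 2}\right\}.$$

By (5), this expression goes to 0 if we take $\sup_n$ and then $\limsup_{x\to\infty}$, showing that the sequence is indeed tight.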

Mean and Variance of the Colijn–Plazzotta Rank

Sections 3–5 focus on properties of the distribution of $\log\log f(\tau_n)$ under various models of randomness; in this section, we focus on the distribution of the random CP rank $f(\tau_n)$ itself. In particular, we study the first-order asymptotics of the mean and variance of the Colijn–Plazzotta rank under the models of randomness from Sections 3–5, investigating $C_n$, $O_n$, $L_n$, and $S_n$. As pointed out in Section 4, the models of uniformly random unlabeled ordered binary trees (Catalan trees) and uniformly random labeled unordered binary trees are equivalent for our purposes, so that the distributions of $C_n$ and $L_n$ are the same.

We give a general theorem on the mean and variance of the Colijn–Plazzotta rank applicable to all random tree models specifying a certain condition. We then obtain first-order asymptotics for the means and variances of Cn, On, Ln and Sn as simple corollaries. The desired means and variances are determined mainly by the extreme cases for Colijn–Plazzotta ranks.

Lemma 9

(i) Among all unlabeled binary rooted trees with $n$ leaves, $n \ge 1$, the Colijn–Plazzotta rank is maximized by the caterpillar. (ii) Among all unlabeled binary rooted trees with $n$ leaves and height $n-2$ or less, $n \ge 4$, the Colijn–Plazzotta rank is maximized by the pseudocaterpillar.

Proof

  • (i)

    This result was proven in Corollary 10 of Rosenberg (2021).

  • (ii)

    This result follows by induction and Lemma 1. For $n = 4$, the pseudocaterpillar is the only tree with height at most $n - 2 = 2$. Suppose for induction that for all $k$, $4 \le k \le n-1$, the pseudocaterpillar has the maximal Colijn–Plazzotta rank among trees with $k$ leaves and height at most $k-2$. Among trees $t$ with $n$ leaves and height at most $n-2$, by definition of the Colijn–Plazzotta rank, the rank $f(t)$ is maximized by choosing its left subtree $\ell(t)$ to have $f(\ell(t))$ as large as possible. The left subtree $\ell(t)$ has at most $n-1$ leaves and height at most $n-3$, so that the inductive hypothesis applies: $\ell(t)$ is the pseudocaterpillar with $n-1$ leaves, the right subtree $r(t)$ is a single leaf, and $t$ is the pseudocaterpillar with $n$ leaves.

For the following theorem, we recall Rosenberg’s (2021) sequence for the maximal Colijn–Plazzotta rank $c_h$ among trees with height $h \ge 0$ and $h+1$ leaves: $c_0 = 1$, and

$$c_{h+1} = \binom{c_h}{2} + 2, \qquad h \ge 0. \tag{6}$$

Equivalently, $c_h$ is the Colijn–Plazzotta rank of a caterpillar of height $h$. Recall that $c_2 = 3$, $c_3 = 5$, $c_4 = 12$, and $c_5 = 68$.

We also let $d_h$ be the corresponding rank of a pseudocaterpillar of height $h$. Then $d_2 = 4$, and

$$d_{h+1} = \binom{d_h}{2} + 2, \qquad h \ge 2. \tag{7}$$

The sequences $c_h$ and $d_h$ obey identical recursions, only with different starting points. Sequence $d_h$ begins with $d_2 = 4$, $d_3 = 8$, $d_4 = 30$, and $d_5 = 437$.

Theorem 10

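For a given probability model for random binary rooted trees $T_n$ with $n$ leaves, let

$$\pi_n = P\{T_n \text{ is a caterpillar}\},$$

and let $P_n$ be the Colijn–Plazzotta rank of $T_n$. If $\pi_n = o(1)$ and

$$\log(1/\pi_n) = o(2^n), \tag{8}$$

then

$$E\{P_n\} \sim \pi_n c_{n-1}, \qquad V\{P_n\} \sim E\{P_n^2\} \sim \pi_n c_{n-1}^2.$$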

The idea of the result is that under the conditions specified, the CP rank of the n-leaf pseudocaterpillar—the tree of second-largest CP rank among those with n leaves—grows sufficiently slowly that the CP ranks of this tree and all other non-caterpillar trees are negligible in relation to that of the n-leaf caterpillar. The mean and variance of the CP rank of a random tree then depend only on the probability that a tree is a caterpillar and the CP rank of the caterpillar.

Proof

We distinguish two events. If $T_n$ is a caterpillar of height $n-1$, then $P_n = c_{n-1}$. Otherwise, if $T_n$ is some other tree, then its CP rank $P_n$ has upper bound $d_{n-2}$, the CP rank of a pseudocaterpillar of height $n-2$. These values yield the trivial bounds

$$\pi_n c_{n-1} \le E\{P_n\} \le \pi_n c_{n-1} + d_{n-2}, \tag{9}$$
$$\pi_n c_{n-1}^2 \le E\{P_n^2\} \le \pi_n c_{n-1}^2 + d_{n-2}^2. \tag{10}$$

By taking the ratio of (9) with $\pi_n c_{n-1}$, to verify $E\{P_n\} \sim \pi_n c_{n-1}$, it suffices to show

$$\lim_{n\to\infty} \frac{d_{n-2}}{\pi_n c_{n-1}} = 0. \tag{11}$$

Similarly, because $d_{n-2} < c_{n-1}$ so that $(d_{n-2}/c_{n-1})^2 < d_{n-2}/c_{n-1}$, by taking the ratio of (10) and $\pi_n c_{n-1}^2$, verifying condition (11) suffices for verifying $V\{P_n\} \sim E\{P_n^2\} \sim \pi_n c_{n-1}^2$; we see first that $E\{P_n^2\} \sim \pi_n c_{n-1}^2$, and then $V\{P_n\} = E\{P_n^2\} - (E\{P_n\})^2 \sim E\{P_n^2\}$ follows by recalling that $\pi_n = o(1)$.

We will show that (8) implies (11). We first prove by induction that $d_{h-1} < 0.9^{2^{h-3}} c_h$ for all $h \ge 3$. This statement is readily verified for $h = 3$ and $h = 4$. Now assume that the inequality holds for some positive integer $h \ge 4$, and write $Q_h = 0.9^{-2^{h-3}} > 1$, so that $c_h > Q_h d_{h-1}$. It follows from the recursions (6) and (7) that

$$\frac{d_h}{c_{h+1}} = \frac{d_{h-1}^2 - d_{h-1} + 4}{c_h^2 - c_h + 4} < \frac{d_{h-1}^2 - d_{h-1} + 4}{Q_h^2 d_{h-1}^2 - Q_h d_{h-1} + 4} = \frac{1}{Q_h^2} - \frac{(Q_h - 1)(Q_h d_{h-1} - 4Q_h - 4)}{Q_h^2\,(Q_h^2 d_{h-1}^2 - Q_h d_{h-1} + 4)}.$$

The final fraction is positive since $Q_h > 1$ and $d_{h-1} \ge d_3 = 8$. Thus,

$$\frac{d_h}{c_{h+1}} < \frac{1}{Q_h^2} = 0.9^{2^{h-2}},$$

completing the induction.
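
The inequality $d_{h-1} < 0.9^{2^{h-3}} c_h$ can also be confirmed directly for the first several values of $h$ with exact arithmetic. The following is a small sanity check rather than a substitute for the induction; the helper names are illustrative.

```python
from fractions import Fraction


def next_rank(x):
    """One step of recursions (6) and (7): binomial(x, 2) + 2."""
    return x * (x - 1) // 2 + 2


c = [1]                      # c[h] for h = 0, ..., 12
for _ in range(12):
    c.append(next_rank(c[-1]))

d = {2: 4}                   # d[h] for h = 2, ..., 11
for h in range(2, 11):
    d[h + 1] = next_rank(d[h])


def bound_holds(h):
    """Check d_{h-1} < 0.9^(2^(h-3)) * c_h exactly, for 3 <= h <= 12."""
    return d[h - 1] < Fraction(9, 10) ** (2 ** (h - 3)) * c[h]


print([d[h] for h in range(2, 7)])               # [4, 8, 30, 437, 95268]
print(all(bound_holds(h) for h in range(3, 13)))
```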

It follows (for $n \ge 4$) that

$$\log\frac{d_{n-2}}{\pi_n c_{n-1}} \le \log\left(\frac{1}{\pi_n}\, 0.9^{2^{n-4}}\right) = 2^{n-4}\log 0.9 - \log\pi_n = 2^{n-4}\log 0.9 + o(2^n)$$

by the assumption (8). Because this last expression goes to $-\infty$ as $n$ increases without bound, we have verified (11). This completes the proof.

The theorem finds that the asymptotic mean is simply the product of the CP rank of the caterpillar and the probability that a tree is a caterpillar. In all four types of random trees that we consider, we verify that πn satisfies (8), so that the theorem applies. This verification amounts to demonstrating that caterpillars are sufficiently probable as n grows large; if πn were to decrease too quickly, then the condition would not be satisfied.

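The number of caterpillar cladograms is $n!/2$, so that for a random cladogram (and equivalently, for a random Catalan tree), (4) gives

$$\pi_n = \frac{n!}{2}\cdot\frac{1}{(2n-3)!!} = \frac{2^{n-2}}{\frac{1}{n}\binom{2n-2}{n-1}} \sim \frac{2^{n-2}}{\pi^{-1/2}\, n^{-3/2}\, 4^{n-1}} = \frac{n^{3/2}\sqrt{\pi}}{2^n}. \tag{12}$$

For a random Otter tree on $n$ leaves, we have no simple explicit expression for $\pi_n$. However, we have the asymptotic probability from (2) that a random Otter tree is the unique caterpillar:

$$\pi_n = \frac{1}{u_n} \sim \kappa\, n^{3/2} \rho^n. \tag{13}$$

Finally, for a random binary search tree (Slowinski 1990, p. 92),

$$\pi_n = \frac{n!}{2}\cdot\frac{1}{n!\,(n-1)!/2^{n-1}} = \frac{2^{n-2}}{(n-1)!} \sim \left(\frac{2e}{n}\right)^{n} \frac{\sqrt{n}}{4\sqrt{2\pi}}. \tag{14}$$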
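
The exact caterpillar probabilities in (12) and (14) can be compared with their asymptotic forms, and condition (8) can be inspected numerically. This is a rough sketch with illustrative function names; the ratios approach 1 only at rate $O(1/n)$.

```python
import math
from fractions import Fraction


def pi_uniform(n):
    """Exact caterpillar probability for a uniform cladogram: (n!/2) / (2n-3)!!."""
    double_factorial = math.prod(range(1, 2 * n - 2, 2))  # (2n-3)!!
    return Fraction(math.factorial(n), 2 * double_factorial)


def pi_yule(n):
    """Exact caterpillar probability under Yule-Harding: 2^(n-2) / (n-1)!, n >= 2."""
    return Fraction(2 ** (n - 2), math.factorial(n - 1))


def pi_uniform_asymptotic(n):
    return math.sqrt(math.pi) * n ** 1.5 / 2 ** n


def pi_yule_asymptotic(n):
    return (2 * math.e / n) ** n * math.sqrt(n) / (4 * math.sqrt(2 * math.pi))


n = 100
print(float(pi_uniform(n)) / pi_uniform_asymptotic(n))  # close to 1
print(float(pi_yule(n)) / pi_yule_asymptotic(n))        # close to 1
print(math.log(1 / float(pi_yule(n))) / 2 ** n)         # condition (8): tiny
```

Even for the Yule–Harding model, where $\pi_n$ decays faster than exponentially, $\log(1/\pi_n)$ grows only like $n\log n$, far below $2^n$.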

Verifying in (12), (13), and (14) that condition (8) is satisfied, we have shown the following theorem.

Theorem 11

With $\pi_n$ as in (12), (13), and (14), and with $P_n$ corresponding to either $C_n$ (the random Catalan tree), $O_n$ (the random Otter tree), $L_n$ (the random cladogram), or $S_n$ (the random binary search tree), we have

$$E\{P_n\} \sim \pi_n c_{n-1}, \qquad V\{P_n\} \sim E\{P_n^2\} \sim \pi_n c_{n-1}^2.$$
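
For the uniform model on unlabeled unordered trees, the statement can be checked exactly for small $n$: every tree has probability $1/u_n$, so $E\{O_n\}/(\pi_n c_{n-1})$ is the sum of all CP ranks divided by the largest one. The sketch below enumerates the ranks of all $n$-leaf trees recursively; the function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def ranks_with_leaves(n):
    """Sorted CP ranks of all unlabeled unordered binary rooted trees with n leaves."""
    if n == 1:
        return (1,)
    out = []
    for small in range(1, n // 2 + 1):
        large = n - small
        for a in ranks_with_leaves(large):
            for b in ranks_with_leaves(small):
                if small == large and b > a:
                    continue  # equal sizes: count each unordered pair once
                hi, lo = max(a, b), min(a, b)
                out.append(hi * (hi - 1) // 2 + 1 + lo)
    return tuple(sorted(out))


def otter_mean_ratio(n):
    """E{O_n} / (pi_n c_{n-1}) = (sum of ranks) / (caterpillar rank), exactly."""
    ranks = ranks_with_leaves(n)
    return Fraction(sum(ranks), max(ranks))


for n in (6, 8, 10, 12):
    print(n, len(ranks_with_leaves(n)), float(otter_mean_ratio(n)))
```

The ratio is already within about 5% of 1 at $n = 8$ and is numerically indistinguishable from 1 by $n = 12$.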

Numerical Computations

We informally examine the extent to which the asymptotic approximations for $E\{\log_2\log f(\tau_n)\}$, $E\{f(\tau_n)\}$, and $V\{f(\tau_n)\}$ agree with the exact values for small $n$. First, Tables 1 and 2 show the CP rank and the probabilities of all unlabeled unordered binary trees for $n = 1$ to 8 under each of three models: uniformly random unlabeled unordered trees, uniformly random leaf-labeled trees, and Yule–Harding leaf-labeled trees. The much larger CP rank for the caterpillar compared to the pseudocaterpillar (and all other trees) is already visible for $n = 8$.

Table 1.

CP rank $f(t_n)$ and probability under three models for all unlabeled unordered binary trees $t_n$ with $n$ leaves, $1 \le n \le 7$. For unlabeled uniform unordered trees, the probability is the reciprocal of the number of such trees, the Wedderburn–Etherington number defined by (1). For leaf-labeled uniform trees, it is the ratio of $n!/2^{s(t_n)}$ (the number of ways of labeling shape $t_n$, where the number of symmetric nodes $s(t_n)$ is the number of internal nodes whose two descendant subtrees have the same unlabeled shape) and $(2n-3)!!$, the number of leaf-labeled trees with $n$ leaves (4). For leaf-labeled Yule–Harding trees, it is the ratio of $[n!/2^{s(t_n)}]\,[(n-1)!/\prod_{r=2}^{n}(r-1)^{d_r(t_n)}]$ and $n!\,(n-1)!/2^{n-1}$, where $d_r(t_n)$ is the number of internal nodes of $t_n$ with $r$ descendant leaves, $(n-1)!/\prod_{r=2}^{n}(r-1)^{d_r(t_n)}$ gives the number of labeled histories of a leaf-labeled tree (the number of sequences in which the tree can be produced by a sequence of bifurcations), and $n!\,(n-1)!/2^{n-1}$ is the total number of labeled histories for $n$ labeled leaves

Tree drawings $t_n$ (image column) are not reproduced here; each row is identified by its CP rank $f(t_n)$.

| $n$ | $f(t_n)$ | Height | Model: unlabeled uniform unordered | Model: leaf-labeled uniform | Model: leaf-labeled Yule–Harding |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 2 | 1 | 1 | 1 | 1 |
| 3 | 3 | 2 | 1 | 1 | 1 |
| 4 | 5 | 3 | 1/2 | 4/5 | 2/3 |
| 4 | 4 | 2 | 1/2 | 1/5 | 1/3 |
| 5 | 12 | 4 | 1/3 | 4/7 | 1/3 |
| 5 | 8 | 3 | 1/3 | 1/7 | 1/6 |
| 5 | 6 | 3 | 1/3 | 2/7 | 1/2 |
| 6 | 68 | 5 | 1/6 | 8/21 | 2/15 |
| 6 | 30 | 4 | 1/6 | 2/21 | 1/15 |
| 6 | 17 | 4 | 1/6 | 4/21 | 1/5 |
| 6 | 13 | 4 | 1/6 | 4/21 | 4/15 |
| 6 | 9 | 3 | 1/6 | 1/21 | 2/15 |
| 6 | 7 | 3 | 1/6 | 2/21 | 1/5 |
| 7 | 2280 | 6 | 1/11 | 8/33 | 2/45 |
| 7 | 437 | 5 | 1/11 | 2/33 | 1/45 |
| 7 | 138 | 5 | 1/11 | 4/33 | 1/15 |
| 7 | 80 | 5 | 1/11 | 4/33 | 4/45 |
| 7 | 38 | 4 | 1/11 | 1/33 | 2/45 |
| 7 | 23 | 4 | 1/11 | 2/33 | 1/15 |
| 7 | 69 | 5 | 1/11 | 4/33 | 1/9 |
| 7 | 31 | 4 | 1/11 | 1/33 | 1/18 |
| 7 | 18 | 4 | 1/11 | 2/33 | 1/6 |
| 7 | 14 | 4 | 1/11 | 4/33 | 2/9 |
| 7 | 10 | 3 | 1/11 | 1/33 | 1/9 |

Table 2.

CP rank f(tn) and probability under three models for all unlabeled unordered binary trees tn with n leaves, n=8. The table design follows Table 1

n | Tree t_n | CP rank f(t_n) | Height | Probability: unlabeled uniform unordered | Probability: leaf-labeled uniform | Probability: leaf-labeled Yule–Harding
8 | [image Figz] | 2598062 | 7 | 1/23 | 64/429 | 4/315
8 | [image Figaa] | 95268 | 6 | 1/23 | 16/429 | 2/315
8 | [image Figab] | 9455 | 6 | 1/23 | 32/429 | 2/105
8 | [image Figac] | 3162 | 6 | 1/23 | 32/429 | 8/315
8 | [image Figad] | 705 | 5 | 1/23 | 8/429 | 4/315
8 | [image Figae] | 255 | 5 | 1/23 | 16/429 | 2/105
8 | [image Figaf] | 2348 | 6 | 1/23 | 32/429 | 2/63
8 | [image Figag] | 467 | 5 | 1/23 | 8/429 | 1/63
8 | [image Figah] | 155 | 5 | 1/23 | 16/429 | 1/21
8 | [image Figai] | 93 | 5 | 1/23 | 32/429 | 4/63
8 | [image Figaj] | 47 | 4 | 1/23 | 8/429 | 2/63
8 | [image Figak] | 2281 | 6 | 1/23 | 32/429 | 4/105
8 | [image Figal] | 438 | 5 | 1/23 | 8/429 | 2/105
8 | [image Figam] | 139 | 5 | 1/23 | 16/429 | 2/35
8 | [image Figan] | 81 | 5 | 1/23 | 16/429 | 8/105
8 | [image Figao] | 39 | 4 | 1/23 | 4/429 | 4/105
8 | [image Figap] | 24 | 4 | 1/23 | 8/429 | 2/35
8 | [image Figaq] | 70 | 5 | 1/23 | 32/429 | 2/21
8 | [image Figar] | 32 | 4 | 1/23 | 8/429 | 1/21
8 | [image Figas] | 19 | 4 | 1/23 | 16/429 | 1/7
8 | [image Figat] | 16 | 4 | 1/23 | 16/429 | 4/63
8 | [image Figau] | 15 | 4 | 1/23 | 8/429 | 4/63
8 | [image Figav] | 11 | 3 | 1/23 | 1/429 | 1/63
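The entries of Table 2 can be checked mechanically. The following minimal sketch transcribes the ranks and probabilities of the table, confirms that each probability column sums to 1, and reports how much of the exact mean E{f(τ8)} is contributed by the caterpillar (the tree of rank 2598062). The names `ROWS`, `summarize`, and `SUMMARY` are illustrative and do not come from the article.

```python
from fractions import Fraction

# Rows of Table 2 (n = 8): (CP rank, height, leaf-labeled uniform, Yule-Harding).
# The unlabeled uniform probability is the same for every shape (1/23 here).
ROWS = [
    (2598062, 7, "64/429", "4/315"), (95268, 6, "16/429", "2/315"),
    (9455, 6, "32/429", "2/105"),    (3162, 6, "32/429", "8/315"),
    (705, 5, "8/429", "4/315"),      (255, 5, "16/429", "2/105"),
    (2348, 6, "32/429", "2/63"),     (467, 5, "8/429", "1/63"),
    (155, 5, "16/429", "1/21"),      (93, 5, "32/429", "4/63"),
    (47, 4, "8/429", "2/63"),        (2281, 6, "32/429", "4/105"),
    (438, 5, "8/429", "2/105"),      (139, 5, "16/429", "2/35"),
    (81, 5, "16/429", "8/105"),      (39, 4, "4/429", "4/105"),
    (24, 4, "8/429", "2/35"),        (70, 5, "32/429", "2/21"),
    (32, 4, "8/429", "1/21"),        (19, 4, "16/429", "1/7"),
    (16, 4, "16/429", "4/63"),       (15, 4, "8/429", "4/63"),
    (11, 3, "1/429", "1/63"),
]


def summarize(rows):
    """Return {model: (probability total, exact mean rank, caterpillar share of mean)}."""
    k = len(rows)
    ranks = [r[0] for r in rows]
    models = {
        "unlabeled uniform": [Fraction(1, k)] * k,
        "leaf-labeled uniform": [Fraction(r[2]) for r in rows],
        "Yule-Harding": [Fraction(r[3]) for r in rows],
    }
    top = ranks.index(max(ranks))  # the caterpillar has the largest rank
    out = {}
    for name, probs in models.items():
        mean = sum(p * f for p, f in zip(probs, ranks))
        out[name] = (sum(probs), mean, probs[top] * ranks[top] / mean)
    return out


SUMMARY = summarize(ROWS)
for name, (total, mean, share) in SUMMARY.items():
    print(f"{name:22s} total={total}  mean={float(mean):.1f}  "
          f"caterpillar share={float(share):.4f}")
```

Exact rational arithmetic is used so that the column totals can be compared with 1 without rounding error.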

Figure 2 plots the values of $E\{\log_2 \log f(\tau_n)\}$, the mean tree height $E\{H_n\}$, and the asymptotic approximation for $E\{\log_2 \log f(\tau_n)\}$ under the three models. For each of the three models, the plots of its three quantities have similar shapes. The values are greatest for the uniformly random leaf-labeled trees, with asymptotic approximation $2\sqrt{\pi n} \approx 3.54491\sqrt{n}$, followed by the uniformly random unlabeled unordered trees, with asymptotic approximation $3.13699\sqrt{n}$, and finally, the Yule–Harding leaf-labeled trees, with asymptotic approximation $4.31107 \log n$.
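The three asymptotic approximations are easy to tabulate. The sketch below evaluates them for a given n. It rests on two assumptions not stated in this section: that log denotes the natural logarithm, and that α is the root greater than 2 of α log(2e/α) = 1, the classical constant for the height of random binary search trees. The constant κ is taken as quoted in the text, and the function names are illustrative.

```python
import math

KAPPA = 3.13699  # value quoted in the text for unlabeled uniform unordered trees


def solve_alpha(lo=2.0, hi=10.0, tol=1e-12):
    """Bisection for the root > 2 of a*log(2e/a) = 1 (assumed definition of alpha)."""
    def g(a):
        return a * math.log(2 * math.e / a) - 1.0

    # g is positive at lo, negative at hi, and decreasing on (2, infinity).
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


ALPHA = solve_alpha()


def approximations(n):
    """Asymptotic approximations of E{log2 log f(tau_n)} under the three models."""
    return {
        "leaf-labeled uniform": 2 * math.sqrt(math.pi * n),
        "unlabeled uniform unordered": KAPPA * math.sqrt(n),
        "leaf-labeled Yule-Harding": ALPHA * math.log(n),  # natural log assumed
    }


for n in (2, 10, 20):
    print(n, {k: round(v, 3) for k, v in approximations(n).items()})
```

These are leading-order expressions only, so for small n their relative order need not match that of the exact values plotted in Figure 2.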

Fig. 2.

Expected value of the double logarithm of CP rank, $E\{\log_2 \log f(\tau_n)\}$, under three models, for $n=2$ to 20: uniformly random unlabeled unordered binary trees, uniformly random leaf-labeled binary trees, and Yule–Harding leaf-labeled binary trees. Exact values of $E\{\log_2 \log f(\tau_n)\}$ (open symbols) appear alongside exact values of the expected tree height $E\{H_n\}$ (open symbols superimposed with crosses) under the three models and the asymptotic expressions (closed symbols, dashed lines): $\kappa\sqrt{n}$ for unlabeled uniform unordered (Theorem 5ii), $2\sqrt{\pi n}$ for leaf-labeled uniform (Theorem 6), and $\alpha \log n$ for leaf-labeled Yule–Harding (Theorem 7). $\kappa \approx 3.13699$, $\alpha \approx 4.31107$ (color figure online)

Figures 3 and 4 plot the exact mean and variance of $f(\tau_n)$ under the three models alongside the asymptotic approximation based on the contribution of the caterpillar tree, taking $\log_2 \log$ of these quantities to produce a scale comparable to that of Figure 2. In both figures, we observe that even for quite small $n$, the exact mean and variance are closely approximated by the asymptotic expressions $\pi_n c_{n-1}$ and $\pi_n c_{n-1}^2$, respectively. The mean and variance are greatest for the uniformly random leaf-labeled trees, for which $\pi_n \sim \sqrt{\pi}\, n^{3/2}\, (0.5)^n$ (12), followed by the uniformly random unlabeled unordered trees, with asymptotic approximation $\pi_n \sim \kappa n^{3/2} \rho^n \approx 3.13699\, n^{3/2}\, (0.40270)^n$ (13). For the Yule–Harding model, caterpillars are least probable (14).
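The caterpillar probabilities can be compared numerically with the asymptotic form. The sketch below assumes the standard closed forms for the probability of the n-leaf caterpillar: (n!/2)/(2n-3)!! under the uniform model on leaf-labeled trees and 2^(n-2)/(n-1)! under the Yule–Harding model. These are assumed here to correspond to (12) and (14), and for n=8 they reproduce the Table 2 values 64/429 and 4/315. Function names are illustrative.

```python
import math
from fractions import Fraction


def odd_double_factorial(m):
    """Product m*(m-2)*...*3*1 for odd m; returns 1 when m <= 1."""
    out = 1
    while m > 1:
        out *= m
        m -= 2
    return out


def caterpillar_prob_uniform(n):
    """Assumed exact form: n!/2 caterpillar labelings among (2n-3)!! labeled topologies."""
    return Fraction(math.factorial(n) // 2, odd_double_factorial(2 * n - 3))


def caterpillar_prob_yule_harding(n):
    """Assumed exact form under Yule-Harding: 2^(n-2)/(n-1)!."""
    return Fraction(2 ** (n - 2), math.factorial(n - 1))


def caterpillar_prob_uniform_asymptotic(n):
    """Asymptotic form sqrt(pi) * n^(3/2) * 0.5^n quoted in the text."""
    return math.sqrt(math.pi) * n ** 1.5 * 0.5 ** n


for n in (8, 20, 50, 200):
    exact = float(caterpillar_prob_uniform(n))
    ratio = exact / caterpillar_prob_uniform_asymptotic(n)
    print(f"n={n:3d}  uniform exact={exact:.3e}  exact/asymptotic={ratio:.4f}  "
          f"Yule-Harding={float(caterpillar_prob_yule_harding(n)):.3e}")
```

The ratio of exact to asymptotic probability approaches 1 slowly as n grows, while the Yule–Harding probability decays factorially and is therefore far smaller than either uniform value.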

Fig. 3.

Expected value of the CP rank, $E\{f(\tau_n)\}$, under three models, for $n=2$ to 10: uniformly random unlabeled unordered binary trees, uniformly random leaf-labeled binary trees, and Yule–Harding leaf-labeled binary trees. Exact values of $\log_2 \log E\{f(\tau_n)\}$ (open symbols) appear alongside asymptotic expressions $\log_2 \log(\pi_n c_{n-1})$ from Theorem 10 (closed symbols, dashed lines), where $\pi_n$ follows (12) for leaf-labeled uniform and (14) for leaf-labeled Yule–Harding and $c_{n-1}$ is the CP rank of the caterpillar with $n-1$ internal nodes and $n$ leaves (6). For unlabeled uniform unordered, $\pi_n$ is computed as the exact $1/u_n$, where $u_n$ is the Wedderburn–Etherington number defined by (1) (color figure online)

Fig. 4.

Variance of the CP rank, $V\{f(\tau_n)\}$, under three models, for $n=2$ to 10: uniformly random unlabeled unordered binary trees, uniformly random leaf-labeled binary trees, and Yule–Harding leaf-labeled binary trees. Exact values of $\log_2 \log V\{f(\tau_n)\}$ (open symbols) appear alongside asymptotic expressions $\log_2 \log(\pi_n c_{n-1}^2)$ from Theorem 10 (closed symbols, dashed lines), where $\pi_n$ follows (12) for leaf-labeled uniform and (14) for leaf-labeled Yule–Harding and $c_{n-1}$ is the CP rank of the caterpillar with $n-1$ internal nodes and $n$ leaves (6). For unlabeled uniform unordered, $\pi_n$ is computed as the exact $1/u_n$, where $u_n$ is the Wedderburn–Etherington number defined by (1) (color figure online)

Discussion

We have analyzed the Colijn–Plazzotta rank of rooted binary trees, showing that the rank of a tree is largely determined by its height. Indeed, the ranking proceeds through all trees of a given height $h$ before moving on to trees of height $h+1$. We have also obtained asymptotic properties of the trees under three different models for selecting random trees, finding in particular the asymptotics of $E\{\log_2 \log f(\tau_n)\}$ for random trees $\tau_n$. The asymptotic mean and variance of the CP rank across trees with $n$ leaves depend only on the probability and CP rank of the $n$-leaf caterpillar, as the product of the probability and the CP rank of the caterpillar grows faster than the next-highest rank. A summary of mathematical results appears in Table 3.
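The link between rank and height can be illustrated with a short sketch of the recursion from the Introduction, $f(t) = f(\ell)(f(\ell)-1)/2 + 1 + f(r)$ with $f(\ell) \ge f(r)$. The base case that a single leaf has rank 1 is assumed, and it reproduces the caterpillar rank 2598062 of Table 2. The nested-tuple representation of trees and all function names are illustrative choices, not the authors' implementation.

```python
from functools import lru_cache
from math import isqrt


def cp_rank(tree):
    """CP rank of a tree given as nested pairs; a leaf is the empty tuple ()."""
    if tree == ():
        return 1  # assumed base case: a single leaf has rank 1
    a, b = sorted((cp_rank(tree[0]), cp_rank(tree[1])), reverse=True)
    return a * (a - 1) // 2 + 1 + b


@lru_cache(maxsize=None)
def decode(rank):
    """Return (number of leaves, height) of the tree with the given CP rank."""
    if rank == 1:
        return (1, 0)
    # Largest a with a(a-1)/2 + 2 <= rank, then the remaining subtree rank b <= a.
    a = (1 + isqrt(8 * (rank - 2) + 1)) // 2
    b = rank - 1 - a * (a - 1) // 2
    leaves_a, height_a = decode(a)
    leaves_b, height_b = decode(b)
    return (leaves_a + leaves_b, 1 + max(height_a, height_b))


def caterpillar(n):
    """Caterpillar with n leaves as nested tuples."""
    tree = ()
    for _ in range(n - 1):
        tree = (tree, ())
    return tree


heights = [decode(k)[1] for k in range(1, 5001)]
print("heights nondecreasing in rank:",
      all(x <= y for x, y in zip(heights, heights[1:])))
print("8-leaf caterpillar rank:", cp_rank(caterpillar(8)), decode(cp_rank(caterpillar(8))))
```

Decoding inverts the recursion by locating the unique larger-subtree rank whose block of ranks contains the given value, which is why the height never decreases as the rank increases.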

Table 3.

Summary of the main asymptotic results under three models. $\tau_n$ refers to a random tree with $n$ leaves under the model, and $f(\tau_n)$ is the associated random CP rank. Properties of random trees are the same for uniformly random leaf-labeled unordered trees and for uniformly random unlabeled ordered trees

Property | Unlabeled uniform unordered | Leaf-labeled uniform | Leaf-labeled Yule–Harding
$E\{\log \log f(\tau_n)\}$ | Theorem 5 | Theorem 6 | Theorem 7
$\log f(\tau_n)$ | - | - | Theorem 8
$E\{f(\tau_n)\}$ | Theorem 10 | Theorem 10 | Theorem 10
$V\{f(\tau_n)\}$ | Theorem 10 | Theorem 10 | Theorem 10

Numerical investigations clarify a pattern observable in the mathematical results, namely that the “uniform” model (uniformly random leaf-labeled trees) has CP ranks greater than the Yule–Harding model on leaf-labeled trees (Figures 2–4). This observation can be viewed as a consequence of the greater probability of the caterpillar shape in the uniform model (12) than in the Yule–Harding model (14).

It has been suggested that CP rank can serve as a measure of tree balance and imbalance in empirical studies (Fischer et al. 2023; Rosenberg 2021). We have found that as $n$ grows, the CP rank of the caterpillar grows so fast that for both the uniform and Yule–Harding models on leaf-labeled trees, the mean CP rank across trees with $n$ leaves is asymptotically determined by the contribution of the caterpillar. Hence, as a balance statistic beyond the smallest tree sizes, the use of CP rank $f(\tau)$ would amount primarily to distinguishing caterpillars from non-caterpillars. A potentially more suitable statistic is $\log_2 \log f(\tau)$, which places the CP ranks of different trees on a similar scale. Due to its extremely large values, the CP rank has been omitted from an empirical comparison of tree balance statistics (Kersting et al. 2025); we suggest that this problem could be resolved by including its double-logarithm in its place.
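A small sketch illustrates why the double logarithm is a more manageable statistic than the raw rank. It computes the caterpillar rank from the recursion $c \mapsto c(c-1)/2 + 2$, which is the CP recursion with a leaf of rank 1 as the smaller subtree, and then evaluates $\log_2 \log f$. The natural logarithm is assumed for the inner log, and the function names are illustrative.

```python
import math


def caterpillar_rank(n):
    """CP rank of the n-leaf caterpillar via c -> c(c-1)/2 + 2, starting from a leaf (rank 1)."""
    c = 1
    for _ in range(n - 1):
        c = c * (c - 1) // 2 + 2
    return c


def double_log_rank(f):
    """log2(log f); the inner logarithm is taken as natural (an assumption)."""
    if f < 2:
        raise ValueError("log2(log f) is undefined for f < 2")
    # math.log accepts arbitrarily large Python integers.
    return math.log2(math.log(f))


for n in (4, 8, 12, 16, 20):
    c = caterpillar_rank(n)
    print(f"n={n:2d}  decimal digits of rank={len(str(c)):6d}  "
          f"log2 log rank={double_log_rank(c):.3f}")
```

The raw caterpillar rank grows to thousands of digits within a few additional leaves, whereas the double logarithm increases by roughly one unit per added leaf, in line with its close relationship to tree height.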

The results have been obtained by connecting studies of CP rank as a quantity of mathematical phylogenetics to the extensive literature on tree height in studies grounded in theoretical computer science. As has been demonstrated here, such applications of theoretical computer science results on tree properties have the potential to provide solutions to unsolved problems in mathematical phylogenetics.

Although we have obtained the asymptotics of the mean and variance of the CP rank under the uniform and Yule–Harding models—the two models for which the mean and variance were noted by Fischer et al. (2023) as open problems—we have not commented on the exact mean and variance. For practical applications of CP rank, an understanding of the asymptotics likely suffices, but we note that the precise determination of the mean and variance of the CP rank remains an open problem.

Acknowledgements

This project developed from conversations at the Analysis of Algorithms meeting in Bath, United Kingdom (AofA2024), and we are grateful to the conference organizers.

Funding

We acknowledge the Natural Sciences and Engineering Research Council of Canada (LD), National Institutes of Health grant NIH R01-HG005855 (NAR), National Science Foundation grant NSF DMS-2450005 (NAR), and Swedish Research Council/Vetenskapsrådet grant 2022-04030 (SW).

Data Availability

The study has no associated data.

Declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Aldous D (1996) Probability distributions on cladograms. In Random Discrete Structures (Minneapolis, MN, 1993), volume 76 of IMA Vol. Math. Appl., pages 1–18. Springer, New York
  2. Aldous D, Pittel B (2025) The critical beta-splitting random tree I: heights and related results. Ann Appl Probab 35:158–195
  3. Aldous DJ (2001) Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat Sci 16:23–34
  4. Alimpiev E, Rosenberg NA (2021) Enumeration of coalescent histories for caterpillar species trees and p-pseudocaterpillar gene trees. Adv Appl Math 131:102265
  5. Broutin N, Flajolet P (2008) The height of random binary unlabelled trees. Fifth Colloquium on Mathematics and Computer Science. Volume AI of Discrete Mathematics and Theoretical Computer Science Proceedings. Nancy, France, pp 121–134
  6. Broutin N, Flajolet P (2012) The distribution of height and diameter in random non-plane binary trees. Random Struct Algorithms 41:215–252
  7. Colijn C, Plazzotta G (2018) A metric on phylogenetic tree shapes. Syst Biol 67:113–126
  8. Devroye L (1986) A note on the height of binary search trees. J Assoc Comput Machinery 33:489–498
  9. Devroye L (1997) Simulating theta random variates. Statist Probab Lett 31:275–279
  10. Disanto F, Fuchs M, Paningbatan AR, Rosenberg NA (2022) The distributions under two species-tree models of the number of root ancestral configurations for matching gene trees and species trees. Ann Appl Probab 32:4426–4458
  11. Doboli MR, Hwang H-K, Rosenberg NA (2024) Periodic behavior of the minimal Colijn-Plazzotta rank for trees with a fixed number of leaves. In 35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, volume 302 of LIPIcs. Leibniz Int. Proc. Inform. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern. Art. No. 18, 14 pages
  12. Drmota M (2003) An analytic approach to the height of binary search trees II. J Assoc Comput Machinery 50:333–374
  13. Drmota M (2009) Random Trees: an Interplay between Combinatorics and Probability. Springer, Wien
  14. Fischer M, Herbst L, Kersting S, Kühn AL, Wicke K (2023) Tree Balance Indices: A Comprehensive Survey. Springer, Cham, Switzerland
  15. Flajolet P, Odlyzko A (1982) The average height of binary trees and other simple trees. J Comput Syst Sci 25:171–213
  16. Ford DJ (2005) Probabilities on cladograms: introduction to the alpha model. arXiv:math/0511246v1
  17. Ford DJ (2006) Probabilities on cladograms: introduction to the alpha model. PhD thesis, Department of Mathematics, Stanford University
  18. Fuchs M (2025) Shape parameters of evolutionary trees in theoretical computer science. Philos Trans R Soc B, Biol Sci 380:20230304
  19. Harary F, Palmer EM, Robinson RW (1992) Counting free binary trees admitting a given height. J Combin Inform System Sci 17:175–181
  20. Harding EF (1971) The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Probab 3:44–77
  21. Kersting S, Wicke K, Fischer M (2025) Tree balance in phylogenetic models. Philos Trans R Soc B, Biol Sci 380:20230303
  22. OEIS Foundation Inc. (2025). The On-Line Encyclopedia of Integer Sequences. Published electronically at https://oeis.org
  23. Otter R (1948) The number of trees. Ann Math 49:583–599
  24. Reed B (2003) The height of a random binary search tree. J Assoc Comput Machinery 50:306–332
  25. Rosenberg NA (2007) Counting coalescent histories. J Comput Biol 14:360–377
  26. Rosenberg NA (2021) On the Colijn-Plazzotta numbering scheme for unlabeled binary rooted trees. Discr Appl Math 291:88–98
  27. Sedgewick R, Flajolet P (1996) An Introduction to the Analysis of Algorithms. Addison-Wesley, Boston
  28. Slowinski JB (1990) Probability of n-trees under two models: a demonstration that asymmetrical interior nodes are not improbable. Syst Biol 39:89–94
  29. Stanley RP (2015) Catalan Numbers. Cambridge University Press, New York



Articles from Bulletin of Mathematical Biology are provided here courtesy of Springer
