The GFB Tree and Tree Imbalance Indices

Sean Cleary; Mareike Fischer; Katherine St John

doi:10.1007/s11538-025-01522-1

2025 Sep 5;87(10):145.

PMCID: PMC12413428 PMID: 40911217

Abstract

Tree balance plays an important role in various research areas in phylogenetics and computer science. Typically, it is measured with the help of a balance index or imbalance index. There are more than 25 such indices available, recently surveyed in a book by Fischer et al. They are used to rank rooted binary trees on a scale from the most balanced to the least balanced. We show that a wide range of subtree-size based measures satisfying concavity and monotonicity conditions are minimized by the complete or greedy from the bottom (GFB) tree and maximized by the caterpillar tree, yielding an infinitely large family of distinct new imbalance indices. Answering an open question from the literature, we show that one such established measure, the $\hat{s}$ -shape statistic, has the GFB tree as its unique minimizer. We also provide an alternative characterization of GFB trees, showing that they are equivalent to complete trees, which arise in different contexts. We give asymptotic bounds on the expected $\hat{s}$ -shape statistic under the uniform and Yule-Harding distributions of trees, and answer questions for the related Q-shape statistic as well.

Keywords: Tree balance, GFB tree, Phylogenetic tree, Yule model

Introduction

Trees are a canonical data structure, providing an efficient way to implement fundamental concepts such as dynamic sets as well as representing hierarchical and phylogenetic relationships between data (see Cormen et al. (2001) and Semple and Steel (2003)). Much of the power of the tree data structure relies on well-distributed branching that can yield tree height logarithmic in the total size of the tree, and result in efficient access, assuming a reasonable balance. The balance, or lack thereof, often affects the running time of algorithms, with many tree-based algorithms having significantly different running times depending on whether the underlying tree is very balanced (an element in a balanced binary search tree on n leaves can be found in $O (log n)$ time) or very imbalanced (the same search takes $O (n)$ time for the pectinate or caterpillar tree) (Cormen et al. 2001). Many different measures have been suggested to assess balance, this fundamental property of trees; they are surveyed in Fischer et al. (2023). While similar in format, these indices can yield quite different rankings of trees, as illustrated in Figure 1. This figure compares the rankings of three different indices considered in the present manuscript (the $\hat{s}$ -shape statistic, the Q-shape statistic and the Sackin index) as well as two other well-known indices, namely the popular Colless and total cophenetic indices, for the case $n = 10$.

Fig. 1 — The rankings of all 98 tree shapes of size 10 with respect to the Sackin index, the $\hat{s}$ -shape statistic, the Q-shape statistic, the Colless index, and the total cophenetic index (see Fischer et al. 2023 for a survey of indices). All indices rate the caterpillar tree shown in blue as extremely unbalanced, and the indices rank trees of intermediate balance in different orders. The sets of other minimal trees with respect to the other rankings contain the maximally balanced tree shown in green and/or the GFB tree shown in red (Color figure online)

When the number of leaves, n, is a power of 2, with $n = 2^{h}$ for some h, all imbalance indices are minimal for the fully balanced tree and maximal for the caterpillar tree. For trees which are neither the fully balanced tree nor caterpillars, the values of each of these indices lie in the interval containing these extremes for that imbalance index.

We show a general result that applies to a broad range of tree imbalance indices. If a tree imbalance index function using subtree sizes satisfies some concavity and increasing conditions, the minimum value is achieved by “greedy from the bottom” (GFB) trees, as named in Fischer et al. (2023). We will show that these trees coincide with trees termed “complete trees” (Fill 1996), which have occurred in other contexts. The $\hat{s}$ -shape statistic of Blum and François (2006) satisfies this property. The $\hat{s}$ -shape statistic sums the logarithms of the subtree sizes minus one across the tree, so for a rooted binary tree T, the $\hat{s}$ -shape statistic is $\sum_{v} log (n_{v} - 1)$, where the sum runs over the internal nodes v of T and $n_{v}$ is the number of leaves in the subtree rooted at v. Blum and François (2006) note that, up to sign and once a normalizing constant has been removed, the $\hat{s}$ -shape statistic corresponds to the logarithm of the probability of a tree in the Yule-Harding or equal rates Markov (ERM) model for generating random trees (see Semple and Steel 2003, pp. 29-30), and provides a strong tool for rejecting the Yule-Harding model against the uniform or proportional to distinguishable arrangements (PDA) model (see Kersting et al. 2025 and Section 2 of the present manuscript concerning probabilistic models of phylogenetic trees).

Our findings answer the open question of which trees, among all rooted binary trees of size n, minimize the $\hat{s}$ -shape statistic, as well as the question of whether there is an explicit formula for the minimum value of $\hat{s}$, both of which were posed in Fischer et al. (2023). We further analyze a related measure of tree balance, namely the Q-shape statistic, described by Fill (1996). Motivated by building binary search trees from random permutations, it can be recast as a close parallel to the $\hat{s}$ -shape statistic. We discuss how Fill's work identifies the maximal and minimal tree shapes for this statistic and its moments under the uniform distribution. We consider the distribution of the relevant statistic under the Yule-Harding distribution as well. We describe many indices built from concave functions of subtree sizes as imbalance indices, in that they have maximal values on caterpillar trees and minimal values on GFB trees, which necessarily include fully balanced trees when the size is a power of two. We show that there are infinitely many distinct such indices.

The present manuscript is organized as follows: In Section 2, we present all definitions and notations needed throughout the manuscript. In Section 3, we state some known results from the literature which we will use to derive our results. Section 4 then contains all our results, which are structured as follows: Section 4.1 gives an overview of the minimizing properties of the GFB tree. The results of this subsection are used in Section 4.2 to show that both the $\hat{s}$ -shape statistic and the Q-shape statistic belong to an infinitely large family of different imbalance indices. Section 4.3 then gives two explicit formulas for the minimum value both for $\hat{s}$ and Q, both of which are derived from the GFB tree. Finally, Section 4.4 derives some expected values for the $\hat{s}$ -shape statistic, which also answers open questions posed in Fischer et al. (2023). We conclude with a brief discussion in Section 5.

Definitions

We outline the terminology used, following the standard notions from Fischer et al. (2023); King and Rosenberg (2021); Steel (2016).

Graph theoretical trees and phylogenetic trees

A rooted binary tree, or simply a tree, is a directed graph $T = (V (T), E (T))$ with vertex set V(T) and edge set E(T), containing precisely one vertex of in-degree 0, the root (denoted by $ρ$ ), such that for every $v \in V (T)$ there exists a unique path from $ρ$ to v, and such that all vertices have out-degree 0 or 2. In particular, the edges are directed away from the root. Nodes with out-degree 2 are internal nodes and nodes with out-degree 0 are leaf nodes or leaves. We use $\overset{˚}{V} (T)$ (or simply $\overset{˚}{V}$ ) to denote the set of internal vertices of T and $V_{L} (T)$ to denote the set of leaves of T, respectively.

Tree balance is independent of any leaf labeling, but in phylogenetics, the leaf labeling plays an important role and is used when considering evolutionary models. A rooted binary phylogenetic X-tree $\mathcal{T}$ (or simply phylogenetic tree) is a tuple $\mathcal{T} = (T, ϕ)$, where T is a rooted binary tree and $ϕ$ is a bijection from the set of leaves $V_{L} (T)$ to X. The (unlabeled) tree T is often referred to as the topology or tree shape of $\mathcal{T}$ and X is called the taxon set of $\mathcal{T}$. We assume that the label sets are the numbers: $X = \{1, \dots, n\}$.

We consider two trees T and $T^{'}$ as equal if they are isomorphic; that is, if there exists a mapping $θ : V (T) \to V (T^{'})$ such that for all $u, v \in V (T)$ we have $(u, v) \in E (T) \Leftrightarrow (θ (u), θ (v)) \in E (T^{'})$ and with $θ (ρ (T)) = ρ (T^{'})$ . In particular, $θ$ is a graph isomorphism which preserves the root position. We use $B T_{n}^{*}$ to denote the space (of isomorphism classes) of (rooted binary) trees with n leaves, which are unlabeled.

We use $B T_{n}$ to denote the space (of isomorphism classes) of rooted binary phylogenetic X-trees with $| X | = n$ where the leaves are labeled. Moreover, we recall that $| B T_{1} | = 1$ and $| B T_{n} | = (2 n - 3)!! = (2 n - 3) (2 n - 5) \dots 1$ for $n \geq 2$ (Semple and Steel 2003, Corollary 2.2.4).
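As a small illustration of the counting formula above, the following sketch evaluates $(2 n - 3)!!$ for small n. The function name `num_phylogenetic_trees` is a hypothetical helper introduced only for this example.

```python
def num_phylogenetic_trees(n: int) -> int:
    """Return (2n-3)!!, the number of rooted binary phylogenetic trees on n labeled leaves."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3; the product is empty for n <= 2.
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


print([num_phylogenetic_trees(n) for n in range(1, 7)])
# [1, 1, 3, 15, 105, 945]
```

The count grows super-exponentially, which is one reason why exhaustive checks in this setting are usually carried out on unlabeled tree shapes rather than on labeled trees.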

Vertices and subtrees

We now define some properties of vertices and subtrees, which apply both for trees as well as phylogenetic trees. Throughout, for a given tree T, we denote the number of leaves by n, with $n = | V_{L} (T) |$ . In the special case of $n = 1$ , the tree consists of only one vertex, which is at the same time the root and its only leaf. The size of a tree T is defined as the number of leaf nodes and is sometimes also indicated as |T|.

Whenever there is a directed path $P$ from a vertex u to a vertex v in T, we call u an ancestor of v and v a descendant of u. If $P$ consists of a single edge, u is the parent of v and v is the child of u. The depth $δ_{T} (v)$ of a vertex v in T denotes the number of edges on the unique path from the root $ρ$ of T to v. If two leaves u and v have the same parent w, u and v form a sibling pair or cherry, which we denote by [u, v].

Every internal vertex v of T induces a pending subtree $T_{v}$ , that is, the subtree of T which has v as its root and contains all descendants of v. The number of leaves of this subtree will be referred to as the size of $T_{v}$ and will be denoted by $n_{v}$ . Note that $n_{ρ} = n$ . An internal vertex u with children v and w is called balanced if $| n_{v} - n_{w} | \leq 1$ . Finally, for a (phylogenetic) tree T, we denote its standard decomposition into its maximal pending subtrees (i.e., the subtrees rooted at the children of the root $ρ$ of T) $T_{a}$ and $T_{b}$ by $T = (T_{a}, T_{b})$ . Note that the height h(T) of a tree T is defined as $h (T) = {max}_{v \in V (T)} δ_{T} (v)$ ; that is, the height of a tree coincides with the maximum depth of its vertices.

Special trees

We describe some trees which play important roles in tree balance. The first tree is the caterpillar tree, denoted by $T_{n}^{cat}$, which is a rooted binary tree with n leaves that either consists of only one vertex or contains precisely one cherry (see Figure 2(a)).

Fig. 2 — Three tree shapes on 10 leaves: (a) a caterpillar with $\hat{s}$ -shape of $\sum_{i = 2}^{9} log i = log (9!) \approx 12.8$, (b) a greedy from the bottom (GFB) tree with $\hat{s}$ -shape of $log (9 \cdot 5^{1} \cdot 3^{2}) \approx 6.0$, and (c) a maximally-balanced tree with $\hat{s}$ -shape of $log (9 \cdot 4^{2} \cdot 2^{2}) \approx 6.4$ (the numerical values shown correspond to the natural logarithm), with the middle GFB tree (b) being the unique minimizer for the $\hat{s}$ -shape statistic among all trees with 10 leaves

In considering a tree of size $2^{h}$, the fully balanced tree of height h, $T_{h}^{fb}$, is the tree where every internal node has two children and all leaves have depth exactly h. Note that for $h \geq 1$ both maximal pending subtrees of a fully balanced tree are again fully balanced trees, and we have $T_{h}^{fb} = (T_{h - 1}^{fb}, T_{h - 1}^{fb})$. The maximally balanced (MB) tree with n leaves, denoted by $T_{n}^{mb}$, is the unique rooted binary tree with n leaves in which all internal vertices are balanced. Recursively, a rooted binary tree with $n \geq 2$ leaves is maximally balanced if its root is balanced and its two maximal pending subtrees are maximally balanced, with $T_{n}^{mb} = (T_{⌈ n / 2 ⌉}^{mb}, T_{⌊ n / 2 ⌋}^{mb})$ (see Figure 2(c)).

The greedy from the bottom (GFB) tree for n leaves, denoted by $T_{n}^{gfb}$, is a rooted binary tree with n leaves that results from greedily clustering trees of minimal sizes from an initial forest in the following manner. We start with a forest of n trees each consisting of a single vertex and proceed by successively joining two of the smallest remaining trees in the forest until the forest is resolved into a single tree, with the resulting tree of the shape depicted in Figure 2(b). This construction is described in Coronado et al. (2020, Algorithm 2).
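The greedy clustering just described can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm as stated by Coronado et al.: the nested-tuple encoding (`LEAF = ()`, an internal vertex as a pair), the function names, and the tie-breaking by insertion order are all assumptions made for this example.

```python
import heapq
from itertools import count

LEAF = ()  # assumed encoding: a leaf is (), an internal vertex is a pair (left, right)


def gfb_tree(n):
    """Greedily join two of the smallest trees of the forest until one tree remains."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tie = count()  # insertion counter; breaks ties so that trees are never compared
    forest = [(1, next(tie), LEAF) for _ in range(n)]
    heapq.heapify(forest)
    while len(forest) > 1:
        size_a, _, tree_a = heapq.heappop(forest)
        size_b, _, tree_b = heapq.heappop(forest)
        heapq.heappush(forest, (size_a + size_b, next(tie), (tree_a, tree_b)))
    return forest[0][2]


def internal_sizes(tree):
    """Return (number of leaves, list of n_v over all internal vertices v)."""
    if tree == LEAF:
        return 1, []
    (n_left, s_left), (n_right, s_right) = map(internal_sizes, tree)
    n = n_left + n_right
    return n, [n] + s_left + s_right


print(sorted(internal_sizes(gfb_tree(10))[1], reverse=True))
# [10, 6, 4, 4, 2, 2, 2, 2, 2]
```

For $n = 10$ the subtree sizes minus one multiply to $9 \cdot 5 \cdot 3^{2}$, in agreement with the value given for the GFB tree in the caption of Figure 2.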

The complete tree (as defined in Fill 1996) for n leaves, denoted by $T_{n}^{c}$ , is a rooted binary tree with n leaves that results from creating the fully balanced tree of size $2^{⌊ {log}_{2} (n) ⌋}$ , the largest fully balanced tree with n leaves or fewer, ordering the leaves from the left to the right and then attaching sibling pairs on the leaves from left to right until a total of n leaves are obtained. We will show below that the complete tree coincides with the GFB tree for all sizes, cf. Lemma 3. In order to simplify notation, throughout we use the shorthand $log$ instead of ${log}_{2}$ whenever we refer to a base-2 logarithm.
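The construction of the complete tree can be sketched as follows. This is an illustrative reading of the definition above, not code from Fill (1996): the nested-tuple encoding and the helper names `complete_tree` and `_balanced_with_cherries` are assumptions of this example, and "left to right" is interpreted as filling the left pending subtree first.

```python
LEAF = ()  # assumed encoding: a leaf is (), an internal vertex is a pair (left, right)


def _balanced_with_cherries(height, extra):
    """Fully balanced tree of the given height whose `extra` leftmost leaves become cherries."""
    if height == 0:
        return (LEAF, LEAF) if extra else LEAF
    half = 2 ** (height - 1)          # number of leaves in each maximal pending subtree
    left_extra = min(extra, half)     # fill the left subtree first
    return (_balanced_with_cherries(height - 1, left_extra),
            _balanced_with_cherries(height - 1, extra - left_extra))


def complete_tree(n):
    """Complete tree with n leaves: largest fully balanced tree plus cherries from the left."""
    if n < 1:
        raise ValueError("n must be at least 1")
    h = n.bit_length() - 1            # floor(log2(n))
    return _balanced_with_cherries(h, n - 2 ** h)


def internal_sizes(tree):
    """Return (number of leaves, list of n_v over all internal vertices v)."""
    if tree == LEAF:
        return 1, []
    (n_left, s_left), (n_right, s_right) = map(internal_sizes, tree)
    n = n_left + n_right
    return n, [n] + s_left + s_right


print(sorted(internal_sizes(complete_tree(10))[1], reverse=True))
# [10, 6, 4, 4, 2, 2, 2, 2, 2]
```

For $n = 10$ this yields the same multiset of subtree sizes as the greedy clustering that defines the GFB tree, as one would expect from the claimed coincidence of the two constructions.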

Note that if $n = 2^{h}$, we have $T_{n}^{gfb} = T_{h}^{fb} = T_{n}^{mb}$. The latter equality follows from the fact that in the case where the number of leaves is a power of 2, $T_{h}^{fb}$ is the unique tree all of whose internal vertices have a balance value of zero. The first equality holds, because if $n = 2^{h}$, then $n \equiv 0 \pmod{2^{i}}$ for all $i = 0, \dots, h - 1$ and during the greedy clustering procedure, every tree with $2^{i}$ leaves clusters with another tree of $2^{i}$ leaves for all $i = 0, \dots, h - 1$. This process continues until the single remaining tree is the fully balanced tree.

Tree shape statistics and (im)balance indices

We focus on tree imbalance indices. Following Fischer et al. (2023), we call a function $t : B T_{n}^{*} \to \mathbb{R}$ a tree shape statistic if t(T) depends only on the shape of T and not on the labeling of vertices or the lengths of edges. Such a tree shape statistic t is called an imbalance index if

the caterpillar tree $T_{n}^{cat}$ is the unique tree maximizing t on $B T_{n}^{*}$ for all $n \geq 1$ , and
the fully balanced tree $T_{h}^{fb}$ is the unique tree minimizing t on $B T_{n}^{*}$ for all $n = 2^{h}$ for natural h.

The choice of base for the logarithm only affects the following indices by a multiplicative factor, and so we presume base 2 for all evaluations below. We first focus on several well-known and frequently used tree imbalance indices that are defined in terms of leaf counts of subtrees:

Definition 1

Let $T \in B T_{n}^{*}$ .

Sackin index (cf. Fischer 2021; Fischer et al. 2023; Shao and Sokal 1990): $S (T) = \sum_{v \in \overset{˚}{V} (T)} n_{v} .$
$\hat{s}$ -shape statistic (Blum and François 2006; Fischer et al. 2023): $\hat{s} (T) : = \sum_{v \in \overset{˚}{V} (T)} log (n_{v} - 1) = log (\prod_{v \in \overset{˚}{V} (T)} (n_{v} - 1)) .$
Q-shape statistic (related to Fill 1996): $Q (T) : = \sum_{v \in \overset{˚}{V} (T)} log (n_{v}) = log (\prod_{v \in \overset{˚}{V} (T)} n_{v}) .$

Note that the logarithm base was originally not stated (Blum and François 2006). However, it is common to use base 2 for binary (phylogenetic) trees, and we follow this convention here, as do Fischer et al. (2023). In fact, we only consider logarithms of base 2 here; this is not limited to Definition 1.
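To make Definition 1 concrete, the following sketch evaluates the three indices on the three 10-leaf trees of Figure 2. The nested-tuple encoding, the helper names, and the hard-coded shape `GFB10` are assumptions made for this illustration; the `base` parameter is included only so that both logarithm bases can be compared.

```python
import math

LEAF = ()  # assumed encoding: a leaf is (), an internal vertex is a pair (left, right)


def internal_sizes(tree):
    """Return (number of leaves, list of n_v over all internal vertices v)."""
    if tree == LEAF:
        return 1, []
    (n_left, s_left), (n_right, s_right) = map(internal_sizes, tree)
    n = n_left + n_right
    return n, [n] + s_left + s_right


def sackin(tree):
    return sum(internal_sizes(tree)[1])


def s_hat(tree, base=2.0):
    return sum(math.log(n_v - 1, base) for n_v in internal_sizes(tree)[1])


def q_shape(tree, base=2.0):
    return sum(math.log(n_v, base) for n_v in internal_sizes(tree)[1])


def caterpillar(n):
    tree = LEAF
    for _ in range(n - 1):
        tree = (tree, LEAF)
    return tree


def maximally_balanced(n):
    if n == 1:
        return LEAF
    return (maximally_balanced((n + 1) // 2), maximally_balanced(n // 2))


CHERRY = (LEAF, LEAF)
FOUR = (CHERRY, CHERRY)
GFB10 = ((FOUR, CHERRY), FOUR)  # GFB shape on 10 leaves, written out by hand

for name, tree in [("caterpillar", caterpillar(10)), ("GFB", GFB10),
                   ("MB", maximally_balanced(10))]:
    print(name, sackin(tree), round(s_hat(tree), 3), round(s_hat(tree, math.e), 3),
          round(q_shape(tree), 3))
```

With the natural logarithm, the $\hat{s}$ values come out as roughly 12.8, 6.0 and 6.4, matching the numbers in the caption of Figure 2; in base 2 they are larger by the factor $1 / ln 2$. The Sackin index takes the value 34 on both the GFB tree and the MB tree, which illustrates that it does not separate the two, whereas $\hat{s}$ and Q do.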

Moreover, while we are mainly concerned with two imbalance indices, the $\hat{s}$ -shape statistic and the Q-shape statistic, we also consider broader families containing both indices. The Q-shape statistic is related to measures introduced by Fill (1996) and differs from the $\hat{s}$ -shape statistic by a difference of one before taking the logarithms for each term in the product. Despite the similarity, the statistics yield different rankings (see Figures 1 and 6). Note that Fill (1996) uses Q(T) for the reciprocal of the quantity of which we are taking the logarithm, and defines $L_{n} = - ln (Q (T))$ for the negative of its natural logarithm. $L_{n} (T)$ differs from how we choose to define the Q-shape statistic here in the base of logarithm, but we proceed as above to make the Q-shape statistic more closely parallel to the $\hat{s}$ -shape statistic. Fill (1996) shows that the complete (GFB) tree minimizes the Q-shape statistic and the caterpillar tree maximizes it, and further computes the moments under the uniform distribution as well as the random permutation model. King and Rosenberg (2021) employ a parallel structure to Fill’s methods for similar results on the Sackin index.

Fig. 6 — Ranking differences for the imbalance indices from product functions $π_{c}$ for nine different choices of c with $c > - 2$ . The x-axis represents indices for values $c = - 1.99, - 1.5, - 1, - 0.5, 0, 0.5, 1, 1.5, 2$ , and the y-axis shows the 98 rooted binary trees with 10 leaves. Each tree has an associated line showing how high (imbalanced) or low (balanced) it is ranked for the different choices of c, with the rankings of three trees (caterpillar, GFB, and maximally balanced) highlighted in bold (Color figure online)

Probabilistic models of phylogenetic trees

We consider two popular models of evolution: the Yule-Harding and the uniform models. The Yule-Harding model is a pure birth process in which species are born but do not go extinct. It is a forward process generating a tree T as follows. The process starts with a single vertex and, at each step, chooses a leaf uniformly at random from those present and subsequently replaces that leaf by a cherry. As soon as the desired number n of leaves is reached, leaf labels $X = \{1, \dots, n\}$ are assigned uniformly at random to the leaves. The probability $P_{Y, n} (\mathcal{T})$ of generating a phylogenetic X-tree $\mathcal{T} = (T, ϕ)$ under the Yule-Harding model is then given by Steel (2016, Proposition 3.2):

\begin{matrix} P_{Y, n} (\mathcal{T}) & = \frac{2^{n - 1}}{n!} \cdot \prod_{v \in \overset{˚}{V} (T)} \frac{1}{n_{v} - 1} . \end{matrix}

The uniform model selects a phylogenetic X-tree uniformly at random from the set of all possible phylogenetic trees (Rosen 1978). As $| B T_{n} | = (2 n - 3)!!$ for every $n \geq 1$ (with the convention that $(- 1)!! = 1$; see, for instance, Semple and Steel (2003, Corollary 2.2.4)), the probability $P_{U, n} (\mathcal{T})$ of generating a phylogenetic X-tree $\mathcal{T}$ under the uniform model is thus given by

\begin{matrix} P_{U, n} (\mathcal{T}) & = \frac{1}{(2 n - 3)!!} . \end{matrix}
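The two probability formulas can be evaluated exactly with rational arithmetic. The sketch below is only an illustration under assumed conventions: trees are encoded as nested tuples, the function names are hypothetical, and the probability returned by `yule_probability` is that of one fixed labeling of the given shape.

```python
from fractions import Fraction
from math import factorial

LEAF = ()  # assumed encoding: a leaf is (), an internal vertex is a pair (left, right)


def internal_sizes(tree):
    """Return (number of leaves, list of n_v over all internal vertices v)."""
    if tree == LEAF:
        return 1, []
    (n_left, s_left), (n_right, s_right) = map(internal_sizes, tree)
    n = n_left + n_right
    return n, [n] + s_left + s_right


def yule_probability(tree):
    """Yule-Harding probability of one labeled tree with the given shape."""
    n, sizes = internal_sizes(tree)
    prob = Fraction(2 ** (n - 1), factorial(n))
    for n_v in sizes:
        prob /= (n_v - 1)
    return prob


def uniform_probability(n):
    """Probability 1 / (2n-3)!! of any labeled tree on n leaves under the uniform model."""
    total = 1
    for odd in range(3, 2 * n - 2, 2):
        total *= odd
    return Fraction(1, total)


CHERRY = (LEAF, LEAF)
cat4 = ((CHERRY, LEAF), LEAF)   # caterpillar on 4 leaves
bal4 = (CHERRY, CHERRY)         # fully balanced tree on 4 leaves
print(yule_probability(cat4), yule_probability(bal4), uniform_probability(4))
# 1/18 1/9 1/15
```

For $n = 4$, the caterpillar shape admits 12 distinct labelings and the fully balanced shape admits 3, so the Yule-Harding probabilities sum to $12 / 18 + 3 / 9 = 1$. Taking base-2 logarithms of the Yule-Harding formula gives $log P_{Y, n} = (n - 1) - log (n!) - \hat{s} (T)$, which makes the link to the $\hat{s}$ -shape statistic explicit.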

Prior results

In order to investigate the GFB tree and its relevance for the $\hat{s}$ -shape statistic more deeply, we use the following earlier results.

Lemma 1

(Lemma 5 in Coronado et al. 2020) If $T = (T_{a}, T_{b})$ is a GFB tree, then $T_{a}$ and $T_{b}$ are also GFB trees.

By recursively applying Lemma 1 to the maximal pending subtrees of T, then to their maximal pending subtrees, and so forth, we easily derive the following corollary.

Corollary 1

If T is a GFB tree and v is a vertex of T inducing the pending subtree $T_{v}$ , then $T_{v}$ is also a GFB tree.

The following proposition, which has been adapted from Proposition 5 in Coronado et al. (2020), characterizes the sizes of the maximal pending subtrees of GFB trees.

Proposition 1

(adapted from Proposition 5 in Coronado et al. (2020)) For $n \geq 2$ , we let $T_{n}^{gfb} = (T_{a}, T_{b})$ , where $n_{a}$ and $n_{b}$ denote the sizes of $T_{a}$ and $T_{b}$ , respectively. Let $ℓ_{n} = ⌊ log (n) ⌋$ . Then, we have:

If $2^{ℓ_{n}} \leq n \leq 3 \cdot 2^{ℓ_{n} - 1}$ , then $n_{a} = n - 2^{ℓ_{n} - 1}$ , $n_{b} = 2^{ℓ_{n} - 1}$ and $T_{b} = T_{ℓ_{n} - 1}^{fb}$ .
If $3 \cdot 2^{ℓ_{n} - 1} \leq n \leq 2^{ℓ_{n} + 1} - 1$, then $n_{a} = 2^{ℓ_{n}}$, $n_{b} = n - 2^{ℓ_{n}}$ and $T_{a} = T_{ℓ_{n}}^{fb}$.

We mainly work with $k_{n} = ⌈ log (n) ⌉$ rather than $ℓ_{n} = ⌊ log (n) ⌋$ to simplify some proofs later.

Corollary 2

For $n \geq 2$, we let $T_{n}^{gfb} = (T_{a}, T_{b})$, where $n_{a}$ and $n_{b}$ denote the sizes of $T_{a}$ and $T_{b}$, respectively. Let $k_{n} = ⌈ log (n) ⌉$. Then, we have:

If $2^{k_{n} - 1} < n \leq 3 \cdot 2^{k_{n} - 2}$ , then $n_{a} = n - 2^{k_{n} - 2}$ , $n_{b} = 2^{k_{n} - 2}$ and $T_{b} = T_{k_{n} - 2}^{fb}$ .
If $3 \cdot 2^{k_{n} - 2} \leq n \leq 2^{k_{n}}$ , then $n_{a} = 2^{k_{n} - 1}$ , $n_{b} = n - 2^{k_{n} - 1}$ and $T_{a} = T_{k_{n} - 1}^{fb}$ .

Proof

For all cases in which n is not a power of 2, we have $k_{n} = ℓ_{n} + 1$ . For these cases, substituting $ℓ_{n}$ in Proposition 1 by $k_{n} - 1$ directly leads to the required claims.

So it only remains to consider the case in which n is a power of 2. Note that in this case, $k_{n} = ⌈ log (n) ⌉ = ⌊ log (n) ⌋ = ℓ_{n}$, so we have $n = 2^{ℓ_{n}} = 2^{k_{n}}$. Proposition 1 covers this case in Case 1, which says that $n_{a} = n - 2^{ℓ_{n} - 1}$ and $n_{b} = 2^{ℓ_{n} - 1}$. Using $n_{a} + n_{b} = n$ and $k_{n} = ℓ_{n}$ and $n = 2^{k_{n}}$ in this case, we get: $n_{a} = n - 2^{ℓ_{n} - 1} = 2^{k_{n}} - 2^{k_{n} - 1} = 2^{k_{n} - 1}$ and $n_{b} = 2^{ℓ_{n} - 1} = 2^{k_{n} - 1}$.

Corollary 2 covers this case in Case 2, which states that $n_{a} = 2^{k_{n} - 1}$ and $n_{b} = n - 2^{k_{n} - 1} = 2^{k_{n} - 1}$, in agreement with the values obtained from Proposition 1. $□$
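As a sanity check on the case distinctions, the root split $(n_{a}, n_{b})$ can be computed in three ways: from the floor-based formulation of Proposition 1, from the ceiling-based formulation of Corollary 2, and directly from the greedy clustering. The sketch below is illustrative only; the function names are hypothetical, and the greedy variant tracks subtree sizes rather than full trees.

```python
import heapq


def split_floor(n):
    """Root split (n_a, n_b) of the GFB tree via l_n = floor(log2 n), for n >= 2."""
    ell = n.bit_length() - 1
    half = 2 ** (ell - 1)
    if n <= 3 * half:                      # Case 1: 2^l <= n <= 3 * 2^(l-1)
        return n - half, half
    return 2 ** ell, n - 2 ** ell          # Case 2


def split_ceil(n):
    """Root split (n_a, n_b) of the GFB tree via k_n = ceil(log2 n), for n >= 2."""
    k = (n - 1).bit_length()
    if 4 * n <= 3 * 2 ** k:                # Case 1: n <= 3 * 2^(k-2), which forces k >= 2
        return n - 2 ** (k - 2), 2 ** (k - 2)
    return 2 ** (k - 1), n - 2 ** (k - 1)  # Case 2


def split_greedy(n):
    """Sizes of the last two trees left in the greedy clustering, larger one first."""
    sizes = [1] * n
    heapq.heapify(sizes)
    while len(sizes) > 2:
        heapq.heappush(sizes, heapq.heappop(sizes) + heapq.heappop(sizes))
    return max(sizes), min(sizes)


for n in (2, 7, 8, 10, 13):
    print(n, split_floor(n), split_ceil(n), split_greedy(n))
```

For $n = 10$, all three variants return the split $(6, 4)$. When $n = 3 \cdot 2^{ℓ_{n} - 1}$, both cases of Proposition 1 apply and give the same sizes, so the overlap of the two ranges is harmless.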

Results

We show that the GFB tree plays a fundamental role for the $\hat{s}$ -shape statistic, even more so than it does for other balance indices like the Sackin index, for which it is known to be contained in the set of minimal trees (cf. Fischer 2021; Fischer et al. 2023). Note that the GFB tree plays a similar role for other imbalance indices like the well-known Colless index, for which it is also known to be minimal (Coronado et al. 2020; Fischer et al. 2023). Figure 1 depicts the role of the GFB tree for various indices. Notably, there are indices like the so-called total cophenetic index, which do not assume their minimum values at the GFB tree, but at the MB tree instead (Mir et al. 2013), cf. Figure 1.

We now start to investigate the GFB tree further.

Minimizing properties of the GFB tree

The aim of this section is to show that $T_{n}^{gfb}$ is the unique minimizer of all functions of the form $Φ_{f}$ , which we define for any rooted binary tree T as follows:

\begin{matrix} Φ_{f} (T) & = \sum_{v \in \overset{˚}{V} (T)} f (n_{v}), \end{matrix}

where f is any strictly monotonically increasing and strictly concave function. Moreover, we will show that this implies that the GFB tree is the unique minimizer of the product function, which we define as follows:

\begin{matrix} π_{c} (T) & = \prod_{v \in \overset{˚}{V} (T)} (n_{v} + c), \end{matrix}

where T is a rooted binary tree and $c \in R_{> - 2}$ is a constant. Note that the choice of $c > - 2$ guarantees that $n_{v} + c > 0$ for all $v \in \overset{˚}{V} (T)$ (as the smallest pending subtree size is $n_{v} = 2$ , when v is the parent of a cherry). The fact that all factors of the product function are strictly larger than 0 leads to meaningful properties of the product including the existence of the logarithm of the product function.

In Section 4.2, we will subsequently show that the above minimizing properties have significant implications on tree balance, as they lead to two new families of tree (im)balance indices, one of which can be shown to be a subfamily of the other one. Moreover, our results will lead to answers to open questions concerning existing imbalance indices.
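For small n, the minimizing property of the GFB tree with respect to the product function $π_{c}$ can be checked by exhaustive search. The sketch below enumerates all tree shapes with n leaves and compares exact integer products for $c \in \{- 1, 0, 1\}$; it is a brute-force illustration and not part of the proof. The nested-tuple encoding, the canonical form obtained by sorting children, and all function names are assumptions of this example.

```python
import heapq
from functools import lru_cache
from itertools import count

LEAF = ()  # assumed encoding: a leaf is (), an internal vertex is a sorted pair of subtrees


@lru_cache(maxsize=None)
def shapes(n):
    """All rooted binary tree shapes with n leaves, in canonical (sorted) form."""
    if n == 1:
        return frozenset([LEAF])
    out = set()
    for i in range(1, n // 2 + 1):
        for a in shapes(i):
            for b in shapes(n - i):
                out.add(tuple(sorted((a, b))))
    return frozenset(out)


def canon(tree):
    """Canonical form of a nested-tuple tree (children sorted recursively)."""
    return LEAF if tree == LEAF else tuple(sorted(canon(child) for child in tree))


def gfb_tree(n):
    """Greedy clustering of the two smallest trees; ties broken by insertion order."""
    tie = count()
    forest = [(1, next(tie), LEAF) for _ in range(n)]
    heapq.heapify(forest)
    while len(forest) > 1:
        size_a, _, a = heapq.heappop(forest)
        size_b, _, b = heapq.heappop(forest)
        heapq.heappush(forest, (size_a + size_b, next(tie), (a, b)))
    return forest[0][2]


def pi_c(tree, c):
    """Return (number of leaves, product of n_v + c over all internal vertices v)."""
    if tree == LEAF:
        return 1, 1
    (n_left, p_left), (n_right, p_right) = (pi_c(child, c) for child in tree)
    n = n_left + n_right
    return n, (n + c) * p_left * p_right


def minimizers(n, c):
    """All shapes with n leaves attaining the minimum of pi_c."""
    values = {shape: pi_c(shape, c)[1] for shape in shapes(n)}
    best = min(values.values())
    return [shape for shape, value in values.items() if value == best]


print(len(shapes(10)))                                  # 98
print(minimizers(10, -1) == [canon(gfb_tree(10))])      # True
```

The count of 98 shapes for $n = 10$ agrees with the number of trees ranked in Figure 1. The choice $c = - 1$ corresponds to the $\hat{s}$ -shape statistic and $c = 0$ to the Q-shape statistic, since taking logarithms turns the product into the respective sum.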

We start with the following theorem, parts of which are based on the ideas of Fill (2021, Theorem 4).

Theorem 2

Let $n \in \mathbb{N}$ with $n \geq 2$ and let $f : \mathbb{R}_{\geq 2} \to \mathbb{R}$ be a strictly monotonically increasing and strictly concave function. That is, we have $f (n_{1}) > f (n_{2})$ if and only if $n_{1} > n_{2}$, and we also have $f (λ x + (1 - λ) y) > λ f (x) + (1 - λ) f (y)$ for all $λ \in (0, 1)$ and all $x, y \in \mathbb{R}_{\geq 2}$ with $x \neq y$. We consider $Φ_{f} (T) = \sum_{v \in \overset{˚}{V} (T)} f (n_{v})$. Then, $T_{n}^{gfb}$ is the unique tree in $B T_{n}^{*}$ minimizing $Φ_{f}$.

Proof

Towards a contradiction, we assume n is the smallest number where the minimizing tree is not the GFB tree. We let T be a rooted binary tree with n leaves minimizing $Φ_{f}$ , such that T is not a GFB tree. Since there is only one tree when $n = 1$ , we can assume $n > 1$ .

By Lemma 1, the subtrees of a GFB tree are also GFB trees. T is by assumption not a GFB tree, so there will be at least two pending subtrees of T that would form a GFB tree but do not have a common parent in T. This is due to the fact that T, just like every rooted binary tree, can be obtained from clustering, starting with a forest of one-leaf trees and clustering two trees at a time until a single tree is obtained. We do this to build T by using the two smallest available trees at any point in time (just as in the GFB construction), until no such further clustering is possible. So there must be two trees which are present both in T and $T_{n}^{gfb}$ , that have a common parent in $T_{n}^{gfb}$ but do not in T. Let $T_{a}$ and $T_{b}$ be the smallest such subtrees of T.

We note that as all previous trees of $T_{n}^{gfb}$ were also formed when building T, only two situations are possible:

One of the two trees $T_{a}$, $T_{b}$, say $T_{a}$, is contained in T as a sibling to some tree $T_{c}$, and the other one, $T_{b}$, is a sibling to a subtree of T containing $T_{a}$. This situation is depicted as Case 1 in Figure 3.
The tree $T_{a}$ is a sibling to some tree $T_{c}$ not containing $T_{b}$, and $T_{b}$ is a sibling to some tree $T_{d}$ not containing $T_{a}$. This situation is depicted as Case 2 in Figure 3.

We now consider these two cases separately.

Fig. 3 — Top: Case 1 of the proof of Theorem 2. Here, $T_{a}$ is a sibling of $T_{c}$ in T, with size strictly larger than that of $T_{b}$ , and $T_{b}$ is a sibling of a subtree of T containing $T_{a}$ . The highlighted path $P$ contains all vertices whose induced subtree sizes change when subtrees $T_{b}$ and $T_{c}$ are swapped. The dotted edges and subtrees may or may not exist in T. Bottom: Case 2 of the proof of Theorem 2. Here, $T_{a}$ is a sibling of $T_{c}$ in T, with size strictly larger than that of $T_{b}$ , but not containing $T_{b}$ . Similarly, $T_{b}$ is a sibling of $T_{d}$ in T, with size strictly larger than that of $T_{a}$ , but not containing $T_{a}$ . The highlighted paths $P_{1}$ and $P_{2}$ contain all vertices whose induced subtree sizes change when $T_{b}$ and $T_{c}$ or $T_{a}$ and $T_{d}$ are swapped. The dotted edges and subtrees may or may not exist in T

We start with Case 1 as depicted in Figure 3. We construct a new tree $T^{'}$ as follows: $T^{'}$ is like T, but the subtrees $T_{b}$ and $T_{c}$ are interchanged. We let $P$ be the path highlighted in Figure 3, where $P$ is the path containing all vertices v for which we have $n_{v} (T) \neq n_{v} (T^{'})$, where $n_{v} (T)$ and $n_{v} (T^{'})$ denote the induced subtree sizes of v in T and $T^{'}$, respectively. We note that for all vertices $v \in \overset{˚}{V} (T) \setminus P$ we have $n_{v} (T) = n_{v} (T^{'})$, as these vertices’ subtree sizes are not affected by the subtree swap of $T_{b}$ and $T_{c}$. This reasoning leads to the following observation:
$\begin{matrix} Φ_{f} (T^{'}) & = Φ_{f} (T) - \sum_{v \in P} (f (n_{v} (T)) - f (n_{v} (T) - n_{c} + n_{b})), \end{matrix} \qquad (3)$
where $n_{b}$ and $n_{c}$ are the subtree sizes of $T_{b}$ and $T_{c}$ , respectively. Now, we know that $n_{c} > n_{b}$ , because if $n_{c} < n_{b}$ , the GFB algorithm would not have merged $T_{a}$ and $T_{b}$ (but $T_{a}$ and $T_{c}$ instead). Moreover, if we had $n_{b} = n_{c}$ , then $T_{b}$ and $T_{c}$ would be isomorphic, as all steps prior to merging $T_{a}$ and $T_{b}$ worked in T in the same way as in $T_{n}^{gfb}$ (by choice of $(T_{a}, T_{b})$ as the minimal subtree of $T_{n}^{gfb}$ which could not be formed to build T). In this case, however, $(T_{a}, T_{b})$ would be isomorphic to $(T_{a}, T_{c})$ and thus be contained in T, a contradiction to our choice of $T_{a}$ and $T_{b}$ . Thus, $n_{c} > n_{b}$ and therefore $n_{v} (T) - n_{c} + n_{b} < n_{v} (T)$ for all $v \in P$ . As f is strictly increasing, this implies that $f (n_{v} (T)) > f (n_{v} (T) - n_{c} + n_{b})$ . With Equation (3), this directly leads to:
$\begin{matrix} Φ_{f} (T^{'}) & = Φ_{f} (T) - \sum_{v \in P} \underset{> 0}{\underset{⏟}{(f (n_{v} (T)) - f (n_{v} (T) - n_{c} + n_{b}))}} < Φ_{f} (T) . \end{matrix} \qquad (4)$
Thus, we have $Φ_{f} (T^{'}) < Φ_{f} (T)$ , which contradicts the minimality of T and thus completes this case.
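The swap argument of Case 1 can be illustrated on the smallest possible instance. In the sketch below, which is only a hypothetical example and not taken from the proof, T is the caterpillar on four leaves with $T_{a}$ and $T_{b}$ single leaves and $T_{c}$ a cherry, so that $n_{b} = 1 < n_{c} = 2$ and the path $P$ consists of the single vertex with subtree size 3. The encoding and names are assumptions of this example, and $f = {log}_{2}$ is one admissible choice of f.

```python
import math

LEAF = ()  # assumed encoding: a leaf is (), an internal vertex is a pair (left, right)
CHERRY = (LEAF, LEAF)


def internal_sizes(tree):
    """Return (number of leaves, list of n_v over all internal vertices v)."""
    if tree == LEAF:
        return 1, []
    (n_left, s_left), (n_right, s_right) = map(internal_sizes, tree)
    n = n_left + n_right
    return n, [n] + s_left + s_right


def phi(tree, f):
    """Phi_f(T): sum of f(n_v) over all internal vertices v."""
    return sum(f(n_v) for n_v in internal_sizes(tree)[1])


f = math.log2
T = ((LEAF, CHERRY), LEAF)          # T_a = LEAF and T_c = CHERRY are siblings; T_b = LEAF
T_swapped = ((LEAF, LEAF), CHERRY)  # T_b and T_c interchanged
n_b, n_c = 1, 2
path_sizes = [3]                    # n_v(T) for the vertices v on the path P

# Right-hand side of Equation (3) for this instance.
predicted = phi(T, f) - sum(f(s) - f(s - n_c + n_b) for s in path_sizes)
print(phi(T, f), phi(T_swapped, f), predicted)
```

Here the swap turns the caterpillar into the fully balanced tree on four leaves, and the value of $Φ_{f}$ drops by exactly $f (3) - f (2)$, as Equation (3) predicts.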

Next, we let T be as depicted as Case 2 in Figure 3, with $T_{a}$ a sibling to some tree $T_{c}$ not containing $T_{b}$, and $T_{b}$ a sibling to some tree $T_{d}$ not containing $T_{a}$. We now construct two trees $T^{'}$ and $T^{''}$ as follows: $T^{'}$ is like T, but subtrees $T_{b}$ and $T_{c}$ are swapped. Similarly, $T^{''}$ is like T, but subtrees $T_{a}$ and $T_{d}$ are swapped. We denote by $n_{a}$, $n_{b}$, $n_{c}$ and $n_{d}$ the number of leaves in $T_{a}$, $T_{b}$, $T_{c}$ and $T_{d}$, respectively. As in the first case, we necessarily have $n_{d} > n_{a}$ and $n_{c} > n_{b}$ by choice of $T_{a}$ and $T_{b}$.

We let $T_{1}, \dots, T_{k}$ be the subtrees pending on path $P_{1}$ as depicted in Figure 3 if these subtrees exist in T. Similarly, we let ${\hat{T}}_{1}, \dots, {\hat{T}}_{l}$ be the subtrees pending on path $P_{2}$ as depicted in Figure 3 if these subtrees exist in T. We denote by $n_{i}$ the number of leaves in $T_{i}$ for $i = 1, \dots, k$, and by ${\hat{n}}_{i}$ the number of leaves in ${\hat{T}}_{i}$ for $i = 1, \dots, l$. Moreover, we define $n_{0} = {\hat{n}}_{0} = 0$. With these definitions, we can now introduce variables $t_{i}$ and ${\hat{t}}_{i}$ defined as follows: $t_{i} = \sum_{j = 0}^{i} n_{j}$ for $i = 0, \dots, k$ and ${\hat{t}}_{i} = \sum_{j = 0}^{i} {\hat{n}}_{j}$ for $i = 0, \dots, l$.

We enumerate the vertices of $P_{1}$ such that $v_{0}$ is the parent of $T_{a}$ and $T_{c}$ in T and such that $v_{i}$ is the parent of $v_{i - 1}$ for each $i = 1, \dots, k$ if $k > 0$ ; that is, if trees $T_{1}, \dots, T_{k}$ exist in T. Then, for the subtree sizes $n_{v_{i}}$ we derive:
- $n_{v_{i}} (T) = n_{a} + n_{c} + t_{i}$ for $i = 0, \dots, k$ ,
- $n_{v_{i}} (T^{'}) = n_{a} + n_{b} + t_{i}$ for $i = 0, \dots, k$ ,
- $n_{v_{i}} (T^{''}) = n_{c} + n_{d} + t_{i}$ for $i = 0, \dots, k$ .
Similarly, we enumerate the vertices of $P_{2}$ such that $w_{0}$ is the parent of $T_{b}$ and $T_{d}$ in T and such that $w_{i}$ is the parent of $w_{i - 1}$ for each $i = 1, \dots, l$ if $l > 0$ ; that is, if trees ${\hat{T}}_{1}, \dots, {\hat{T}}_{l}$ exist in T. Then, for the subtree sizes $n_{w_{i}}$ we derive:
- $n_{w_{i}} (T) = n_{b} + n_{d} + {\hat{t}}_{i}$ for $i = 0, \dots, l$ ,
- $n_{w_{i}} (T^{'}) = n_{c} + n_{d} + {\hat{t}}_{i}$ for $i = 0, \dots, l$ ,
- $n_{w_{i}} (T^{''}) = n_{a} + n_{b} + {\hat{t}}_{i}$ for $i = 0, \dots, l$ .

We now set $λ = \frac{n_{d} - n_{a}}{n_{d} - n_{a} + n_{c} - n_{b}}$. Since we have $n_{d} > n_{a}$ and $n_{c} > n_{b}$, we have $λ \in (0, 1)$. We now show that this choice of $λ$ has two additional properties, which will be useful regarding the concavity of f:

We have $n_{b} + n_{d} + {\hat{t}}_{i} = λ (n_{c} + n_{d} + {\hat{t}}_{i}) + (1 - λ) (n_{a} + n_{b} + {\hat{t}}_{i})$ for all $i = 0, \dots, l$ :
$\begin{matrix} λ (n_{c} + n_{d} + {\hat{t}}_{i}) + (1 - λ) (n_{a} + n_{b} + {\hat{t}}_{i}) \\ = λ n_{c} + λ n_{d} - λ n_{a} - λ n_{b} + n_{a} + n_{b} + {\hat{t}}_{i} \\ = n_{a} + n_{b} + {\hat{t}}_{i} + λ (n_{d} - n_{a} + n_{c} - n_{b}) \\ = n_{a} + n_{b} + {\hat{t}}_{i} + \frac{n_{d} - n_{a}}{n_{d} - n_{a} + n_{c} - n_{b}} (n_{d} - n_{a} + n_{c} - n_{b}) \\ = n_{a} + n_{b} + {\hat{t}}_{i} + n_{d} - n_{a} \\ = n_{b} + n_{d} + {\hat{t}}_{i} . \end{matrix}$
Analogously, we have $n_{a} + n_{c} + t_{i} = λ (n_{a} + n_{b} + t_{i}) + (1 - λ) (n_{c} + n_{d} + t_{i})$ for all $i = 0, \dots, k$ .
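The two convex-combination identities above can be checked with exact rational arithmetic. The following is a minimal sketch; the function name `check_lambda_identities` and the sample values are illustrative assumptions, not part of the original argument.

```python
from fractions import Fraction

def check_lambda_identities(n_a, n_b, n_c, n_d, offsets):
    """Return True if both convex-combination identities hold exactly."""
    assert n_d > n_a and n_c > n_b  # hypotheses used in the text
    lam = Fraction(n_d - n_a, (n_d - n_a) + (n_c - n_b))
    assert 0 < lam < 1
    for t in offsets:  # t plays the role of t_i or \hat{t}_i
        first = lam * (n_c + n_d + t) + (1 - lam) * (n_a + n_b + t)
        second = lam * (n_a + n_b + t) + (1 - lam) * (n_c + n_d + t)
        if first != n_b + n_d + t or second != n_a + n_c + t:
            return False
    return True

# Hypothetical subtree sizes with n_d > n_a and n_c > n_b.
print(check_lambda_identities(2, 1, 7, 3, offsets=range(6)))
```

Using `Fraction` avoids rounding, so the equalities are tested exactly rather than up to a tolerance.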

The first of the two points above shows that there exists $λ \in (0, 1)$ with $n_{b} + n_{d} + {\hat{t}}_{i} = λ (n_{c} + n_{d} + {\hat{t}}_{i}) + (1 - λ) (n_{a} + n_{b} + {\hat{t}}_{i})$, so that we have for all $i = 0, \dots, l$:

\begin{matrix} f (n_{b} + n_{d} + {\hat{t}}_{i}) & = f (λ (n_{c} + n_{d} + {\hat{t}}_{i}) + (1 - λ) (n_{a} + n_{b} + {\hat{t}}_{i})) \\ & > λ f (n_{c} + n_{d} + {\hat{t}}_{i}) + (1 - λ) f (n_{a} + n_{b} + {\hat{t}}_{i}), \end{matrix} (5)

where the inequality holds due to the strict concavity of f. Analogously, by the second point, we have for all $i = 0, \dots, k$:

\begin{matrix} f (n_{a} + n_{c} + t_{i}) & = f (λ (n_{a} + n_{b} + t_{i}) + (1 - λ) (n_{c} + n_{d} + t_{i})) \\ & > λ f (n_{a} + n_{b} + t_{i}) + (1 - λ) f (n_{c} + n_{d} + t_{i}) . \end{matrix} (6)

Now we are finally in a position to derive a contradiction, namely by investigating the term $Φ_{f} (T) - λ Φ_{f} (T^{'}) - (1 - λ) Φ_{f} (T^{''})$ in two different ways.

By assumption, T is a minimizer of $Φ_{f}$ , so we have that $Φ_{f} (T^{'}) \geq Φ_{f} (T)$ as well as $Φ_{f} (T^{''}) \geq Φ_{f} (T)$ . Thus:
$\begin{matrix} Φ_{f} (T) - λ Φ_{f} (T^{'}) - (1 - λ) Φ_{f} (T^{''}) \\ \leq Φ_{f} (T) - λ Φ_{f} (T) - (1 - λ) Φ_{f} (T) = 0 . \end{matrix}$ (7)

We now split the sum $Φ_{f} (T) = \sum_{v \in \overset{˚}{V} (T)} f (n_{v})$ into three partial sums, namely over the inner vertices belonging to $P_{1}$, the ones belonging to $P_{2}$, and the ones belonging to neither of the two paths. Note that, as the vertices not contained in either path are not affected by the swaps leading from T to $T^{'}$ and $T^{''}$, respectively, the last sum is the same for $Φ_{f} (T)$, $Φ_{f} (T^{'})$ and $Φ_{f} (T^{''})$, so its contributions cancel in $Φ_{f} (T) - λ Φ_{f} (T^{'}) - (1 - λ) Φ_{f} (T^{''})$. From our above observations concerning the subtree sizes $n_{v_{i}}$ on $P_{1}$ and $n_{w_{i}}$ on $P_{2}$, we derive:

\begin{matrix} Φ_{f} (T) - λ Φ_{f} (T^{'}) - (1 - λ) Φ_{f} (T^{''}) \\ = \sum_{v \in P_{1}} f (n_{v} (T)) + \sum_{v \in P_{2}} f (n_{v} (T)) + \sum_{v \in \overset{˚}{V} \setminus \{P_{1}, P_{2}\}} f (n_{v} (T)) \\ - λ \cdot \sum_{v \in P_{1}} f (n_{v} (T^{'})) - λ \cdot \sum_{v \in P_{2}} f (n_{v} (T^{'})) - λ \cdot \sum_{v \in \overset{˚}{V} \setminus \{P_{1}, P_{2}\}} f (n_{v} (T^{'})) \\ - (1 - λ) \cdot \sum_{v \in P_{1}} f (n_{v} (T^{''})) - (1 - λ) \cdot \sum_{v \in P_{2}} f (n_{v} (T^{''})) - (1 - λ) \cdot \sum_{v \in \overset{˚}{V} \setminus \{P_{1}, P_{2}\}} f (n_{v} (T^{''})) \end{matrix}

\begin{matrix} = \sum_{i = 0}^{k} \underbrace{(f (n_{a} + n_{c} + t_{i}) - λ f (n_{a} + n_{b} + t_{i}) - (1 - λ) f (n_{c} + n_{d} + t_{i}))}_{> 0 \text{ by Eq. (6)}} \\ + \sum_{i = 0}^{l} \underbrace{(f (n_{b} + n_{d} + {\hat{t}}_{i}) - λ f (n_{c} + n_{d} + {\hat{t}}_{i}) - (1 - λ) f (n_{a} + n_{b} + {\hat{t}}_{i}))}_{> 0 \text{ by Eq. (5)}} \\ > 0 . \end{matrix} (9)

The obvious contradiction between Inequality (7), which states that $Φ_{f} (T) - λ Φ_{f} (T^{'}) - (1 - λ) Φ_{f} (T^{''}) \leq 0$, and Inequality (9), which states that $Φ_{f} (T) - λ Φ_{f} (T^{'}) - (1 - λ) Φ_{f} (T^{''}) > 0$, shows that our assumption concerning the existence of T must have been wrong. In fact, this contradiction shows that at least one of the two trees $T^{'}$ and $T^{''}$ must have a lower $Φ_{f}$ value than T. This completes the proof and thus shows that $T_{n}^{gfb}$ is the unique tree minimizing $Φ_{f}$. $□$

Before we investigate the implications of Theorem 2 on imbalance indices, we derive the following corollary.

Corollary 3

Let $n \in N$ , $n \geq 2$ and let $c \in R$ with $c > - 2$ . Let $f : R_{\geq 2} \to R$ be a strictly increasing function. Then, we have that $T_{n}^{gfb}$ is the unique minimizer of $f (π_{c} (T))$ among all rooted binary trees T with n leaves, where $π_{c} (T) = \prod_{v \in \overset{˚}{V} (T)} (n_{v} + c)$ . In particular, this holds for the identity function, $f (x) = x$ for all $x \in R_{\geq 2}$ .

Proof

We start with considering the product function $π_{c}$ . We let c and n be as described in the corollary. We have $n_{v} \geq 2$ for all inner nodes v of a rooted binary tree T, as the smallest possible subtree size is 2 (which is the case in which v is the parent of a cherry). Thus, we have $n_{v} + c > 0$ for all $v \in \overset{˚}{V}$ , as $c > - 2$ by assumption. This, however, means that all factors in $π_{c} (T)$ are strictly larger than 0, which shows that $π_{c} (T) > 0$ . This, in turn, means that $log (π_{c} (T))$ is defined.

Now we consider this term further:

log (π_{c} (T)) = log (\prod_{v \in \overset{˚}{V} (T)} (n_{v} + c)) = \sum_{v \in \overset{˚}{V} (T)} log (n_{v} + c) .

As the logarithm is strictly concave and strictly monotonically increasing, we know by Theorem 2 that the latter sum is uniquely minimized by $T_{n}^{gfb}$ . Thus, the same applies to $log (π_{c} (T))$ . However, by the strict monotonicity of $log$ , the minimum of $log (π_{c} (T))$ is reached precisely when the minimum of $π_{c} (T)$ is reached, which shows that $T_{n}^{gfb}$ is also the unique tree minimizing $π_{c} (T)$ .

Now, for any strictly increasing function $f : R_{\geq 2} \to R$ , we have that $T_{n}^{gfb}$ is also the unique minimizer of $f (π_{c} (T))$ due to the monotonicity of f. This completes the proof. $□$

Implications of the extremal GFB properties on measures of tree balance

The main aim of this section is two-fold: First, we want to show that both functions $Φ_{f}$ and $π_{c}$ as defined in the previous section form families of imbalance indices for certain choices of f and c, respectively. We will continue to show that the imbalance index family based on the product function and a constant c is merely a subfamily of the imbalance index family based on strictly increasing and strictly concave functions f.

Then, we want to use our findings to characterize all trees minimizing the $\hat{s}$ -shape and Q-shape statistics, thus answering several open questions from Fischer et al. (2023).

However, in order to show that a function is an imbalance index, analyzing the minimum as in the previous section is not sufficient. Instead, we also need to investigate the caterpillar in order to investigate the maximum. We start with $Φ_{f}$ . Note that the following theorem can already be found in Hamann (2023, Theorem 4.7), albeit with a different proof.

Theorem 3

Let $n \in N$ , $n \geq 2$ and let $f : R_{\geq 2} \to R$ be strictly monotonically increasing, with $f (n_{1}) > f (n_{2})$ if and only if $n_{1} > n_{2}$ . We consider $Φ_{f} (T) = \sum_{v \in \overset{˚}{V} (T)} f (n_{v})$ . Then, $T_{n}^{cat}$ is the unique tree maximizing $Φ_{f}$ .

Proof

We prove the statement by contradiction. We suppose that there is a non-caterpillar tree T maximizing $Φ_{f}$ . We choose the smallest possible n for which such a tree T with n leaves exists. Thus, for all numbers smaller than n, the unique maximizer of $Φ_{f}$ is the caterpillar. In particular, this shows that $n \geq 4$ , because for any value smaller than 4, there is only one rooted binary tree.

We let $T = (T_{1}, T_{2})$ be the standard decomposition of T. Then, $Φ_{f} (T) = Φ_{f} (T_{1}) + Φ_{f} (T_{2}) + f (n)$ , where the last summand f(n) results from the root $ρ$ of T. This equality shows that T can only maximize $Φ_{f}$ among all trees with n leaves if $T_{1}$ and $T_{2}$ maximize $Φ_{f}$ among all trees with $n_{1}$ and $n_{2}$ leaves, respectively, where $n_{1}$ is the number of leaves of $T_{1}$ and $n_{2}$ is the number of leaves of $T_{2}$ . Thus, as we chose T to be a counterexample of minimal size concerning the statement of the theorem, we know by assumption that $T_{1}$ and $T_{2}$ must be caterpillars. Note that this implies that $n_{1} \geq 2$ and $n_{2} \geq 2$ (which is possible as $n_{1} + n_{2} = n \geq 4$ ), because if we had $n_{2} = 1$ and $T_{1}$ is a caterpillar, then T would be a caterpillar also. The same would happen if $n_{1} = 1$ .

Thus, we know that, as $T_{1}$ and $T_{2}$ are caterpillars with at least two leaves each, each of them has precisely one cherry. Let $[a_{1}, b_{1}]$ denote the cherry of $T_{1}$ and $[a_{2}, b_{2}]$ denote the cherry of $T_{2}$ . The parents of $[a_{1}, b_{1}]$ and $[a_{2}, b_{2}]$ are denoted by $v_{0}$ and $w_{0}$ , respectively. Note that on the path from $v_{0}$ to the root $ρ$ of T, there might be more vertices $v_{1}, \dots, v_{k}$ , all of which – if they exist – are adjacent to a leaf as $T_{1}$ is a caterpillar. Analogously, there might be more vertices $w_{1}, \dots, w_{l}$ on the path from $w_{0}$ to $ρ$ , all of which – if they exist – are adjacent to a leaf as $T_{2}$ is a caterpillar. Note that this means that T looks as depicted on the left-hand side of Figure 4. We denote the leaves adjacent to $v_{i}$ with $x_{i}$ for $i = 1, \dots, k$ (if they exist), and the leaves adjacent to $w_{i}$ with $y_{i}$ for $i = 1, \dots, l$ (if they exist).

Fig. 4 — Trees T and $T^{'}$ as described in the proof of Theorem 3. The only differences between the subtree sizes are at nodes $w_{l}$ and $v_{k + 1}$ , highlighted with a box

Now we assume without loss of generality that $k \geq l$. We consider $w_{l}$ (note that l might be 0) and a leaf $z_{l}$ adjacent to $w_{l}$ (note that $z_{l}$ may be either $a_{2}$ or $b_{2}$ if $l = 0$; otherwise we have $z_{l} = y_{l}$). We now create a tree $T^{'}$ by deleting edge $(w_{l}, z_{l})$, subdividing edge $(ρ, v_{k})$ by introducing a new degree-2 vertex $v_{k + 1}$, adding a new edge $(v_{k + 1}, z_{l})$ and suppressing $w_{l}$. The resulting tree $T^{'}$ is depicted on the right-hand side of Figure 4. Note that between T and $T^{'}$, all subtree sizes are equal except for that of $v_{k + 1}$, which equals $k + 3$ in $T^{'}$ but does not exist in T (as $v_{k + 1}$ is ancestral to $a_{1}$, $b_{1}$, $z_{l}$ and $x_{1}, \dots, x_{k}$ if the latter leaves exist), and that of $w_{l}$, which equals $l + 2$ in T but does not exist in $T^{'}$ (stemming from the fact that $w_{l}$ is ancestral to $a_{2}$, $b_{2}$ and $y_{1}, \dots, y_{l}$ if the latter leaves exist). Thus, we derive for $Φ_{f} (T^{'})$:

Φ_{f} (T^{'}) = Φ_{f} (T) + f (k + 3) - f (l + 2),

which immediately shows that $Φ_{f} (T^{'}) > Φ_{f} (T)$: as we assumed that $k \geq l$, we have $k + 3 > l + 2$ and thus $f (k + 3) > f (l + 2)$ by the strict monotonicity of f. This contradicts the maximality of T and thus completes the proof. $□$
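The bookkeeping of this proof step can be replayed on multisets of subtree sizes. The sketch below assumes, as in the proof, that T joins two caterpillars with $k + 2$ and $l + 2$ leaves and that $T^{'}$ joins caterpillars with $k + 3$ and $l + 1$ leaves; the helper names are hypothetical.

```python
from collections import Counter
from math import log2

def caterpillar_sizes(leaves):
    """Multiset of inner-vertex subtree sizes of a caterpillar with `leaves` leaves."""
    return Counter(range(2, leaves + 1))

def sizes_T_and_T_prime(k, l):
    """Subtree-size multisets of T and T' from the proof, for k >= l >= 0."""
    n = (k + 2) + (l + 2)
    root = Counter([n])
    sizes_T = caterpillar_sizes(k + 2) + caterpillar_sizes(l + 2) + root
    sizes_T_prime = caterpillar_sizes(k + 3) + caterpillar_sizes(l + 1) + root
    return sizes_T, sizes_T_prime

def phi(sizes, f):
    """Phi_f as the sum of f over all inner-vertex subtree sizes."""
    return sum(count * f(size) for size, count in sizes.items())

sizes_T, sizes_T_prime = sizes_T_and_T_prime(k=3, l=1)
# Sizes gained by T' and sizes lost from T: one entry each, k + 3 and l + 2.
print(sizes_T_prime - sizes_T, sizes_T - sizes_T_prime)
print(phi(sizes_T_prime, log2) - phi(sizes_T, log2))
```

Here `log2` is only one possible strictly increasing choice of f; any other strictly increasing function can be passed to `phi` in the same way.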

The maximality of the caterpillar now immediately leads to the following corollary, which shows that each $Φ_{f}$ is indeed an imbalance index.

Corollary 4

Let $n \in N, n \geq 2$ and let $f : R_{\geq 2} \to R$ be a strictly increasing and strictly concave function. Then, $Φ_{f}$ is an imbalance index. Moreover, the GFB tree $T_{n}^{gfb}$ is the only minimizer of this function in $B T_{n}^{*}$ .

Proof

This is a direct consequence of the definition of an imbalance index in combination with Theorems 2 and 3 and the fact that $T_{h}^{fb} = T_{n}^{gfb}$ if $n = 2^{h}$ . $□$

We now use Corollary 4 to answer several open problems from the literature, most notably from Fischer et al. (2023). Note that while it is already known that the $\hat{s}$ -shape statistic is an imbalance index (Fischer et al. 2023), this has not been formally proven yet for the Q-shape statistic. However, while for the $\hat{s}$ -shape statistic the minima have already been known for the cases in which $n = 2^{h}$ (it is $T_{n}^{fb}$ as $\hat{s}$ is an imbalance index), it has not been known what the minimal trees are in cases in which n is not a power of two (Fischer et al. 2023). The following corollary fully characterizes these minima both for $\hat{s}$ and Q by showing that in both cases, $T_{n}^{gfb}$ is the unique minimizer.

Corollary 5

The $\hat{s}$ -shape statistic and the Q-shape statistic are both tree imbalance indices with the property that the GFB tree $T_{n}^{gfb}$ is their only minimizer in $B T_{n}^{*}$ for any value of $n \in N, n \geq 2$ .

Proof

Let $n \in N, n \geq 2$ . We define $f_{\hat{s}} (i) = log (i - 1)$ and $f_{Q} (i) = log (i)$ for $i \in R_{\geq 2}$ . Note that $f_{\hat{s}}$ and $f_{Q}$ are both strictly increasing and strictly concave. Now, by definition of $\hat{s}$ and Q, we have for all rooted binary trees T with n leaves:

\begin{matrix} Φ_{f_{\hat{s}}} (T) & = \sum_{v \in \overset{˚}{V} (T)} f_{\hat{s}} (n_{v}) = \sum_{v \in \overset{˚}{V} (T)} log (n_{v} - 1) = \hat{s} (T), \end{matrix}

as well as

\begin{matrix} Φ_{f_{Q}} (T) & = \sum_{v \in \overset{˚}{V} (T)} f_{Q} (n_{v}) = \sum_{v \in \overset{˚}{V} (T)} log (n_{v}) = Q (T) . \end{matrix}

Applying Corollary 4 completes the proof. $□$
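For small n, the statement of Corollary 5 can be confirmed by exhaustive search. The following sketch enumerates all rooted binary tree shapes as nested tuples and compares the integer products $\prod (n_{v} - 1)$ and $\prod n_{v}$, whose logarithms are $\hat{s}$ and Q. The tuple encoding and all function names are assumptions made for illustration only.

```python
from functools import lru_cache
from math import prod

LEAF = ()

def join(a, b):
    """Canonical rooted binary tree with maximal pending subtrees a and b."""
    return tuple(sorted((a, b)))

@lru_cache(maxsize=None)
def shapes(n):
    """All rooted binary tree shapes with n leaves, as nested tuples."""
    if n == 1:
        return frozenset([LEAF])
    return frozenset(join(a, b)
                     for i in range(1, n // 2 + 1)
                     for a in shapes(i) for b in shapes(n - i))

def inner_sizes(tree):
    """Return (number of leaves, list of subtree sizes n_v of inner vertices)."""
    if tree == LEAF:
        return 1, []
    (la, sa), (lb, sb) = inner_sizes(tree[0]), inner_sizes(tree[1])
    return la + lb, sa + sb + [la + lb]

def pi(tree, c):
    """Product of (n_v + c) over inner vertices; exact for integer c."""
    return prod(s + c for s in inner_sizes(tree)[1])

def gfb(n):
    """Greedy from the bottom: repeatedly join two trees of smallest size."""
    forest = [LEAF] * n
    while len(forest) > 1:
        forest.sort(key=lambda t: inner_sizes(t)[0])
        forest = [join(forest[0], forest[1])] + forest[2:]
    return forest[0]

def caterpillar(n):
    tree = LEAF
    for _ in range(n - 1):
        tree = join(tree, LEAF)
    return tree

def extremes(n, c):
    """Lists of shapes attaining the minimum and the maximum of pi_c."""
    values = {t: pi(t, c) for t in shapes(n)}
    lo, hi = min(values.values()), max(values.values())
    return ([t for t, v in values.items() if v == lo],
            [t for t, v in values.items() if v == hi])

for n in range(2, 9):
    for c in (-1, 0):  # c = -1 corresponds to s-hat, c = 0 to Q
        minimizers, maximizers = extremes(n, c)
        print(n, c, minimizers == [gfb(n)], maximizers == [caterpillar(n)])
```

Comparing integer products instead of floating-point logarithms keeps ties and strict inequalities exact; by the monotonicity of the logarithm, the extremal trees are the same.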

So, we now know that for all values of n, there is only one tree minimizing the $\hat{s}$ -shape statistic (thus answering the question concerning the number of minima posed in Fischer et al. (2023, Chapter 9)), and we have fully characterized this unique minimum as $T_{n}^{gfb}$ . In Section 4.3, we will also deliver explicit formulas to calculate the minimal value of $\hat{s}$ for all n.

We now turn our attention to the product function to show that functions of this type also form a family of tree imbalance indices. We again start with considering the caterpillar.

Corollary 6

Let $n \in N, n \geq 2$ and let $c \in R$ , $c > - 2$ . Then, we have that $T_{n}^{cat}$ is the unique maximizer of $π_{c} (T)$ in $B T_{n}^{*}$ , where $π_{c} (T) = \prod_{v \in \overset{˚}{V} (T)} (n_{v} + c)$ .

Proof

Let $c \in R$ , $c > - 2$ . We can set $f_{c} (i) = log (i + c)$ for $i \in R_{\geq 2}$ . Then, $f_{c}$ is both strictly concave and strictly increasing (as the logarithm with base 2 has these properties). With this function $f_{c}$ , we have for any rooted binary tree T with n leaves:

\begin{matrix} Φ_{f_{c}} (T) & = \sum_{v \in \overset{˚}{V} (T)} f_{c} (n_{v}) = \sum_{v \in \overset{˚}{V} (T)} log (n_{v} + c) \\ & = log (\prod_{v \in \overset{˚}{V} (T)} (n_{v} + c)) = log (π_{c} (T)) . \end{matrix} (10)

By Theorem 3 we know that $T_{n}^{cat}$ is the unique maximizer of $Φ_{f_{c}}$, and, thus, we can conclude that $log (π_{c} (T_{n}^{cat})) > log (π_{c} (T))$ for all rooted binary trees T with n leaves other than $T_{n}^{cat}$. By the monotonicity of $log$, this directly implies $π_{c} (T_{n}^{cat}) > π_{c} (T)$ for all such trees T. This concludes the proof. $□$

We now use Corollary 6 to show that the product function leads to a family of imbalance indices.

Corollary 7

Let $n \in N$ , $n \geq 2$ and let $c \in R, c > - 2$ . Moreover, let $f : R_{\geq 2} \to R$ be a strictly increasing function. Then, $f (π_{c})$ is an imbalance index. Moreover, the GFB tree $T_{n}^{gfb}$ is the only minimizer of this function.

Proof

The fact that $T_{n}^{cat}$ is the unique maximizer of $π_{c}$ follows from Corollary 6, which in turn shows by the monotonicity of f that $T_{n}^{cat}$ is the unique maximizer of $f (π_{c})$ . The fact that $T_{n}^{gfb}$ is the unique minimizer of $f (π_{c})$ was shown in Corollary 3. Using the fact that $T_{h}^{fb} = T_{n}^{gfb}$ if $n = 2^{h}$ and the definition of an imbalance index thus completes the proof. $□$

Corollary 7 shows that the product functions are a family of tree imbalance indices. We further classify this family as merely a subfamily of the family of tree imbalance indices $Φ_{f}$ in the sense that the tree rankings from balanced to imbalanced induced by these indices coincide with rankings induced by members of the $Φ_{f}$ family, as the following proposition shows.

Proposition 4

Let $c \in R, c > - 2$ and $n \in N$ , $n \geq 2$ , let f be a strictly increasing function. We consider the imbalance index $f (π_{c})$ and its induced ranking of trees $T_{n}^{gfb}, \dots, T_{n}^{cat}$ from balanced to imbalanced. Then, there is a strictly increasing and strictly concave function $f_{c} : R_{\geq 2} \to R$ such that $Φ_{f_{c}}$ induces the same ranking as $f (π_{c})$ .

Proof

First we note that by the monotonicity of f, $f (π_{c})$ induces the exact same ranking as $π_{c}$ , which in turn induces the exact same ranking as $log (π_{c})$ by the monotonicity of the logarithm.

Now, we set $f_{c} (i) = log (i + c)$ for $i \in R_{\geq 2}$ . Then, as in the proof of Corollary 6, $f_{c}$ is both strictly concave and strictly increasing, and just as in Equation (10) we can conclude $Φ_{f_{c}} (T) = log (π_{c} (T))$ for all rooted binary trees T with n leaves. This shows that $Φ_{f_{c}}$ and $f (π_{c})$ induce the same rankings of all rooted binary trees with n leaves and thus completes the proof. $□$

Before we turn our attention to deriving explicit formulas for the minimum values of $\hat{s}$ and Q in Section 4.3, we investigate the family of tree imbalance indices $π_{c}$ further. We first show that the family contains infinitely many different members in the sense that for choices of real $c_{1}, c_{2}$ larger than $- 2$ with $c_{1} \neq c_{2}$ , we can find two trees T and $T^{'}$ such that $π_{c_{1}}$ will consider T as more imbalanced than $T^{'}$ and $π_{c_{2}}$ gives the opposite ranking. Thus, there is an uncountably infinite family of genuinely distinct imbalance indices.

Proposition 5

Let $c_{1}, c_{2} \in R$ be distinct with $c_{1}, c_{2} > - 2$ . Then there exist two trees T and $T^{'}$ such that $π_{c_{1}} (T) > π_{c_{1}} (T^{'})$ and $π_{c_{2}} (T) < π_{c_{2}} (T^{'})$ ; that is, the imbalance indices $π_{c_{1}}$ and $π_{c_{2}}$ rank T and $T^{'}$ differently.

Proof

We will construct T and $T^{'}$ as depicted in Figure 5 with suitable choices of sizes of subtrees: $n_{11}$ of $T_{11}$ , $n_{12}$ of $T_{12}$ , $n_{21}$ of $T_{21}$ and $n_{22}$ of $T_{22}$ , respectively. Subtrees $T_{11}$ , $T_{12}$ , $T_{21}$ and $T_{22}$ can then be chosen arbitrarily as long as they have the respective numbers of leaves.

Fig. 5 — Trees T, $T^{'}$ , and $T^{''}$ as needed in the proofs of Propositions 5 and 6. Note that all three trees share the same subtrees $T_{11}$ , $T_{12}$ , $T_{21}$ , and $T_{22}$ , which are depicted schematically as triangles. The stars depict the inner vertices that play an important role in the proofs

Since the subtree sizes induced by the two trees T and $T^{'}$ only differ in two nodes, we can easily express $π_{c} (T^{'})$ using $π_{c} (T)$ for any choice of $c \in R$ as follows:

\begin{matrix} π_{c} (T^{'}) = π_{c} (T) \cdot \frac{(n_{11} + n_{21} + c) \cdot (n_{11} + n_{12} + n_{21} + c)}{(n_{11} + n_{12} + c) \cdot (n_{21} + n_{22} + c)} . \end{matrix}

This implies:

\begin{matrix} π_{c} (T^{'}) ≷ π_{c} (T) & ⟺ (n_{11} + n_{21} + c) \cdot (n_{11} + n_{12} + n_{21} + c) ≷ (n_{11} + n_{12} + c) \cdot (n_{21} + n_{22} + c) \\ & ⟺ c n_{11} + c n_{21} - c n_{22} ≷ n_{11} n_{22} + n_{12} n_{22} - n_{11}^{2} - n_{11} n_{12} - n_{11} n_{21} - n_{21}^{2} \\ & ⟺ c ≷ \frac{n_{11} n_{22} + n_{12} n_{22} - n_{11}^{2} - n_{11} n_{12} - n_{11} n_{21} - n_{21}^{2}}{n_{11} + n_{21} - n_{22}}, \end{matrix} (11)

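(where the $≷$-symbol stands for either > or < consistently throughout). Note that the last equivalence requires $n_{11} + n_{21} - n_{22} > 0$, which will be satisfied by the subtree sizes chosen in the remainder of the proof.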

The proof strategy now is to show that for any choice of $c_{1}, c_{2}$ larger than $- 2$ we can choose $n_{11}$ , $n_{12}$ , $n_{21}$ and $n_{22}$ such that the fraction of Equation (11) lies between $c_{1}$ and $c_{2}$ . By Equation (11), this will show that $π_{c_{1}} (T^{'}) < π_{c_{1}} (T)$ and $π_{c_{2}} (T^{'}) > π_{c_{2}} (T)$ and thus conclude the proof. In the following, we assume without loss of generality that $c_{2} > c_{1}$ by exchanging T and $T^{'}$ if needed.

So now we let $c_{1}, c_{2} \in R$ with $c_{1} > - 2$ and $c_{2} > c_{1}$. We let $k \in N$ be such that $k \cdot (c_{2} - c_{1}) > 2$. As the open interval $(k \cdot c_{1}, k \cdot c_{2})$ then has length larger than 2, it contains two consecutive integers m and $m + 1$, where $m \in Z$. This implies that the two rational numbers $\frac{m}{k}$ and $\frac{m + 1}{k}$ are contained in the open interval $(c_{1}, c_{2})$. We now consider their mean $\frac{2 m + 1}{2 k}$. We have:

c_{1} < \frac{m}{k} < \frac{2 m + 1}{2 k} < \frac{m + 1}{k} < c_{2} .

Following our proof strategy, the proof is thus complete if we can show that we can choose $n_{11}$ , $n_{12}$ , $n_{21}$ and $n_{22}$ such that the fraction of Equation (11) equals $\frac{2 m + 1}{2 k}$ .

We now set $n_{11} = 1$ , $n_{22} = 2$ , $n_{21} = 2 k + 1$ , $n_{12} = 4 k^{2} + 2 + 6 k + 2 m$ . We first verify that these are all valid leaf numbers: that all of these numbers are natural. Clearly, this holds for $n_{11}$ and $n_{22}$ . Moreover, recall that $k \in N$ , so $n_{21}$ is also natural. But as $m \in Z$ , $m < 0$ could be possible. So we need to verify that $n_{12}$ is positive. However, we know that $\frac{m}{k} > c_{1} > - 2$ by the choice of $c_{1}$ and $\frac{m}{k}$ , respectively. This shows that $m > - 2 k$ and thus $2 m > - 4 k$ , which leads to $n_{12} = 4 k^{2} + 2 + 6 k + 2 m > 4 k^{2} + 2 + 6 k - 4 k = 4 k^{2} + 2 + 2 k \in N$ . So our choices of $n_{11}$ , $n_{12}$ , $n_{21}$ and $n_{22}$ result in four positive integers and can be realized as subtree sizes in trees. We show that with these choices, we indeed get that the fraction of Equation (11) equals $\frac{2 m + 1}{2 k}$ :

\begin{matrix} \frac{n_{11} n_{22} + n_{12} n_{22} - n_{11}^{2} - n_{11} n_{12} - n_{11} n_{21} - n_{21}^{2}}{n_{11} + n_{21} - n_{22}} \\ = \frac{2 + 2 (4 k^{2} + 2 + 6 k + 2 m) - 1 - (4 k^{2} + 2 + 6 k + 2 m) - (2 k + 1) - {(2 k + 1)}^{2}}{1 + (2 k + 1) - 2} \\ = \frac{1 + (4 k^{2} + 2 + 6 k + 2 m) - (2 k + 1) - (4 k^{2} + 4 k + 1)}{2 k} \\ = \frac{2 m + 1}{2 k} . \end{matrix}

This completes the proof. $□$
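The construction of this proof is fully explicit and can be replayed with exact rationals. The sketch below picks the smallest admissible k and the smallest integer m above $k \cdot c_{1}$ (one possible choice among many), builds the four subtree sizes, and evaluates the ratio $π_{c} (T^{'}) / π_{c} (T)$ from the displayed formula. The function names are hypothetical.

```python
from fractions import Fraction
from math import floor

def construction(c1, c2):
    """Subtree sizes (n11, n12, n21, n22), threshold and (k, m) for -2 < c1 < c2."""
    c1, c2 = Fraction(c1), Fraction(c2)
    assert -2 < c1 < c2
    k = floor(2 / (c2 - c1)) + 1      # smallest k with k * (c2 - c1) > 2
    m = floor(k * c1) + 1             # smallest integer above k * c1
    n11, n22, n21 = 1, 2, 2 * k + 1
    n12 = 4 * k * k + 2 + 6 * k + 2 * m
    threshold = Fraction(
        n11 * n22 + n12 * n22 - n11**2 - n11 * n12 - n11 * n21 - n21**2,
        n11 + n21 - n22)
    return (n11, n12, n21, n22), threshold, (k, m)

def ratio(sizes, c):
    """pi_c(T') / pi_c(T) as in the displayed formula of the proof."""
    n11, n12, n21, n22 = sizes
    c = Fraction(c)
    return ((n11 + n21 + c) * (n11 + n12 + n21 + c)
            / ((n11 + n12 + c) * (n21 + n22 + c)))

c1, c2 = Fraction(-3, 2), Fraction(1)
sizes, threshold, (k, m) = construction(c1, c2)
print(sizes, threshold, ratio(sizes, c1) < 1 < ratio(sizes, c2))
```

A ratio below 1 at $c_{1}$ and above 1 at $c_{2}$ means that the two indices order T and $T^{'}$ in opposite ways, as claimed in Proposition 5.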

Note that unsurprisingly, it is easier to make $π_{c_{1}}$ and $π_{c_{2}}$ disagree concerning the ranking of T and $T^{'}$ if $c_{2} - c_{1}$ is large. If the difference is larger than one, the value of k chosen in the proof can be 1, which is as small as possible. As our choice of subtree sizes in the proof was $n_{11} = 1$ , $n_{22} = 2$ , $n_{21} = 2 k + 1$ , and $n_{12}$ such that $n_{12} > 4 k^{2} + 2 + 2 k$ , this shows that even if $k = 1$ , we already need $n = 1 + 2 + 3 + 8 = 14$ leaves for our construction. It may be possible to have smaller examples showing different rankings, but it is clear that no two indices $π_{c_{1}}$ and $π_{c_{2}}$ will rank all trees in the same order if $c_{1} \neq c_{2}$ .

We note that T and $T^{'}$ as used in the proof of Proposition 5 only differ in two subtree sizes, yet they will be ranked differently by certain members of the $π_{c}$ family. On the other hand, there are always pairs of trees that are ranked identically for all choices of c. Most prominently, this is of course the case for $T_{n}^{cat}$ and $T_{n}^{gfb}$, but these differ in most subtree sizes. The next proposition shows that for all n at least 6, there are pairs of trees T and $T^{''}$ such that these trees differ only in two subtree sizes and we have $π_{c} (T) > π_{c} (T^{''})$ for all $c \in R, c > - 2$.

Proposition 6

Let $n \in N$ with $n \geq 6$ . Then there exist two trees T and $T^{''}$ which only differ in two subtree sizes such that $π_{c} (T) > π_{c} (T^{''})$ for all $c \in R$ , with $c > - 2$ .

Proof

We give an explicit construction for two non-isomorphic trees T and $T^{''}$ which fulfill the condition. We begin by choosing $n_{ij} \in N$ such that $n_{11} < n_{22}$ and $n_{12} < n_{21}$ and such that $n_{11} + n_{12} + n_{21} + n_{22} = n \geq 6$ . These values of $n_{ij}$ (with $i, j \in {1, 2}$ ) will be used as leaf numbers for the subtrees $T_{ij}$ for T and $T^{''}$ as depicted in Figure 5. Note that the conditions on $n_{ij}$ guarantee that T and $T^{''}$ are not isomorphic.

Moreover, note that as the subtree sizes induced by the two trees T and $T^{''}$ from Figure 5 only differ in two nodes, we can easily express $π_{c} (T^{''})$ using $π_{c} (T)$ for any choice of $c \in R$ :

\begin{matrix} π_{c} (T^{''}) = π_{c} (T) \cdot \frac{(n_{11} + n_{21} + c) \cdot (n_{12} + n_{22} + c)}{(n_{11} + n_{12} + c) \cdot (n_{21} + n_{22} + c)} . \end{matrix}

This shows that we have:

\begin{matrix} π_{c} (T^{''}) ≷ π_{c} (T) & ⟺ (n_{11} + n_{21} + c) \cdot (n_{12} + n_{22} + c) \\ ≷ (n_{11} + n_{12} + c) \cdot (n_{21} + n_{22} + c) \\ ⟺ n_{11} n_{12} + n_{21} n_{22} ≷ n_{11} n_{21} + n_{12} n_{22} \end{matrix}

(where again the $≷$ -symbol is a consistent inequality throughout). As the latter term is completely independent of c, this shows that all indices in the $π_{c}$ family will agree on how to rank T and $T^{''}$ , and this will only be determined by these trees’ subtree sizes. This completes the proof. $□$
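The independence from c can be illustrated numerically. The difference of the two sides of the last line above factors as $(n_{11} - n_{22}) (n_{12} - n_{21})$, which is nonzero under the stated conditions on the $n_{ij}$. The sketch below evaluates the ratio from the displayed formula for the values of c used in Figure 6 and checks that the sign of (ratio minus 1) never changes. It only tests agreement across c; which of the two trees receives the larger value depends on how the labels of Figure 5 are matched to the formula. The function names are assumptions.

```python
from fractions import Fraction

def ratio(n11, n12, n21, n22, c):
    """pi_c(T'') / pi_c(T) as in the displayed formula of the proof."""
    c = Fraction(c)
    return ((n11 + n21 + c) * (n12 + n22 + c)
            / ((n11 + n12 + c) * (n21 + n22 + c)))

def sign(x):
    return (x > 0) - (x < 0)

def ranking_signs(n11, n12, n21, n22, cs):
    """Set of signs of (ratio - 1) over all given values of c."""
    return {sign(ratio(n11, n12, n21, n22, c) - 1) for c in cs}

CS = ["-1.99", "-1.5", "-1", "-0.5", "0", "0.5", "1", "1.5", "2"]
# Smallest admissible sizes: n11 < n22, n12 < n21 and n = 6.
print(ranking_signs(1, 1, 2, 2, CS), sign((1 - 2) * (1 - 2)))
```

A single-element set means that every index in the tested part of the $π_{c}$ family orders the two trees in the same way.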

Figure 6 illustrates the differences of rankings within the $π_{c}$ family by considering the range of values $c = - 1.99, - 1.5, - 1, - 0.5, 0, 0.5, 1, 1.5, 2$ and comparing the rankings induced by these values for $n = 10$ .

Explicit formulas for $π_{- 1} (T_{n}^{gfb})$ and $π_{0} (T_{n}^{gfb})$

The first aim of this section is to provide two alternative direct (non-recursive) formulas to calculate the values of $π_{- 1} (T_{n}^{gfb})$ and $π_{0} (T_{n}^{gfb})$, corresponding to the $\hat{s}$-shape and the Q-shape statistic, respectively. It turns out that the sequence ${(π_{- 1} (T_{n}^{gfb}))}_{n \in N}$ can already be found in the Online Encyclopedia of Integer Sequences (OEIS) (Sloane 1964), namely under reference number A132862. However, the OEIS only contained a recursive formula to calculate ${(π_{- 1} (T_{n}^{gfb}))}_{n \in N}$. Within the scope of the present manuscript, we have submitted our two explicit formulas for $π_{- 1} (T_{n}^{gfb})$ to the OEIS for addition to their database. Moreover, the sequence $π_{0} (T_{n}^{gfb})$ of the Q statistic was added to the OEIS under the identifier A386912 (Sloane 1964, Sequence A386912).

The second aim of this section is to use our new formulas to derive explicit formulas for the minimum value of the $\hat{s}$ -shape statistic (which answers an open question (Fischer et al. 2023, Chapter 9)) as well as for the Q-shape statistic.

We first prove the following lemma to give insight into the greedy clustering algorithm defining $T_{n}^{gfb}$ .

Lemma 2

Let $n \in N$ . Let $A$ be the greedy clustering algorithm constructing $T_{n}^{gfb}$ . Then, at any point during the run of this algorithm, there is at most one tree contained in the present set of trees whose size is not a power of two.

Proof

Towards a contradiction, we assume that at some point during the course of the algorithm, there exists more than one tree in the present set of trees whose size is not a power of two. Consider the first step, j, where two such trees exist, and call them T and $T^{'}$. We assume that $T = (T_{1}, T_{2})$ (where $(T_{1}, T_{2})$ denotes the standard decomposition of T as defined above) was constructed from the trees of minimal size, $T_{1}$ and $T_{2}$, at step $i < j$, and that at the later step j, $T^{'} = (T_{1}^{'}, T_{2}^{'})$ was constructed from the trees of minimal size, $T_{1}^{'}$ and $T_{2}^{'}$. By hypothesis, at step $i - 1$, all trees have size a power of 2, and at step $j - 1$, all trees besides T have size a power of 2. Since both $T_{1}$ and $T_{2}$ have size a power of 2 and the result of joining them, T, does not, we know that $T_{1}$ and $T_{2}$ have different heights, and without loss of generality, we have that $| T_{1} | < | T_{2} | < | T |$. Further, by construction, $h (T_{2}) + 1 = h (T)$, and, by Corollary 2, $h (T_{1}) + 1 = h (T_{2})$. By a similar argument, $| T_{1}^{'} | < | T_{2}^{'} | < | T^{'} |$, $h (T_{2}^{'}) + 1 = h (T^{'})$, and $h (T_{1}^{'}) + 1 = h (T_{2}^{'})$.

At each step, algorithm $A$ joins the two trees of the present set of the smallest size, so all other trees at step $i - 1$ must have at least the size of $T_{2}$, and all other trees at step $j - 1$ must have at least the size of $T_{2}^{'}$. In particular, T must be at least as large as $T_{2}^{'}$. Since T does not have size a power of 2, but $T_{2}^{'}$ does, we conclude from $| T | \geq | T_{2}^{'} |$ that in fact $| T | > | T_{2}^{'} |$. Moreover, as $| T_{2}^{'} |$ is a power of two, $T_{2}^{'}$ as a subtree of the GFB tree must be a fully balanced tree by Corollary 1, so it has minimal height for any tree with $| T_{2}^{'} |$ leaves. Thus, as $| T | > | T_{2}^{'} |$, we conclude that we additionally have $h (T_{2}) + 1 = h (T) > h (T_{2}^{'}) > h (T_{1}^{'})$. This necessarily implies $h (T_{2}) \geq h (T_{2}^{'}) > h (T_{1}^{'})$. Together with $h (T_{2}) > h (T_{1})$, this gives a contradiction, as in step i, $T_{1}$ would not have been clustered with $T_{2}$ as tree $T_{1}^{'}$ of strictly smaller size than $T_{2}$ was still available. This completes the proof. $□$
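The invariant of Lemma 2 can be observed by simulating the greedy clustering on tree sizes alone. The following is a sketch under the assumption that the algorithm always joins two trees of smallest size, as described in the proof; the heap-based bookkeeping and the function names are illustrative.

```python
import heapq

def is_power_of_two(x):
    return x & (x - 1) == 0

def max_non_powers_during_greedy(n):
    """Largest number of non-power-of-two tree sizes present after any step."""
    sizes = [1] * n
    heapq.heapify(sizes)
    worst = 0
    while len(sizes) > 1:
        # Join the two currently smallest trees.
        merged = heapq.heappop(sizes) + heapq.heappop(sizes)
        heapq.heappush(sizes, merged)
        worst = max(worst, sum(not is_power_of_two(s) for s in sizes))
    return worst

print([max_non_powers_during_greedy(n) for n in range(1, 18)])
```

Tracking sizes is enough for this check because the statement of the lemma only concerns the sizes of the trees in the present set.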

The following theorem will prove useful for calculating an explicit formula for the minimum value of the $\hat{s}$ -shape statistic. It counts the number of subtrees of $T_{n}^{gfb}$ for all possible subtree sizes.

Theorem 7

Let $n \geq 1$, and let $a_{n} (i)$ denote the number of subtrees of $T_{n}^{gfb}$ of size i for $i = 1, \dots, n$. Let $k_{i} = ⌈ log_{2} (i) ⌉$. Then, we have:

a_{n} (i) = \begin{cases} ⌊\frac{n}{i}⌋ & \text{if } i = 2^{k_{i}} \text{ and } ((n \bmod i) = 0 \text{ or } (n \bmod i) \geq 2^{k_{i} - 1}), \\ ⌊\frac{n}{i}⌋ - 1 & \text{if } i = 2^{k_{i}} \text{ and } 0 < (n \bmod i) < 2^{k_{i} - 1}, \\ 1 & \text{if } i \neq 2^{k_{i}} \text{ and } ((n - i) \bmod 2^{k_{i} - 1}) = 0, \\ 0 & \text{if } i \neq 2^{k_{i}} \text{ and } ((n - i) \bmod 2^{k_{i} - 1}) > 0 . \end{cases}

The appendix contains a proof of Theorem 7, which is based on Lemma 2.

We now turn our attention to $π_{c} (T_{n}^{gfb})$ .

Corollary 8

Let $n \in N$ , $n \geq 2$ and let $c \in R, c > - 2$ . Then, $π_{c} (T_{n}^{gfb}) = \prod_{i = 2}^{n} {(i + c)}^{a_{n} (i)}$ , where $a_{n} (i)$ is given by Theorem 7.

Proof

By definition, we have $π_{c} (T) = \prod_{v \in \overset{˚}{V} (T)} (n_{v} + c)$ . Since the values of $n_{v}$ range between 2 and n, sorting the subtrees by sizes and using Theorem 7 gives the desired result. $□$
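Theorem 7 and Corollary 8 can be cross-checked against a direct simulation of the greedy clustering. The sketch below assumes $k_{i} = ⌈ log_{2} (i) ⌉$ (the smallest exponent with $2^{k_{i}} \geq i$) and restricts to $i \geq 2$, which is all that the product in Corollary 8 needs. All function names are hypothetical.

```python
import heapq
from collections import Counter
from math import prod

def a(n, i):
    """Number of subtrees of size i >= 2 in the GFB tree, following Theorem 7."""
    k = (i - 1).bit_length()          # assumed k_i = ceil(log2(i))
    half = 1 << (k - 1)               # 2^(k_i - 1)
    if i == 1 << k:                   # i is a power of two
        r = n % i
        return n // i if (r == 0 or r >= half) else n // i - 1
    return 1 if (n - i) % half == 0 else 0

def greedy_counts(n):
    """Counts of inner-vertex subtree sizes produced by greedy clustering."""
    heap = [1] * n
    heapq.heapify(heap)
    counts = Counter()
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        counts[merged] += 1
        heapq.heappush(heap, merged)
    return counts

def pi_gfb(n, c):
    """pi_c of the GFB tree via the product formula of Corollary 8."""
    return prod((i + c) ** a(n, i) for i in range(2, n + 1))

print(all(a(n, i) == greedy_counts(n)[i]
          for n in range(1, 41) for i in range(2, n + 1)))
print([pi_gfb(n, -1) for n in range(1, 10)])
```

With $c = - 1$ the product gives the minimal value of $\prod (n_{v} - 1)$, and with $c = 0$ the minimal value of $\prod n_{v}$; taking logarithms yields the corresponding minima of $\hat{s}$ and Q.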

Corollary 5 together with Corollary 8 leads to an explicit formula for the minimum value for the $\hat{s}$ -statistic, thus solving an open problem stated in Fischer et al. (2023, Chapter 9).

Corollary 9

Let ${\hat{s}}_{n}^{*} = min_{T^{'} \in B T_{n}^{*}} \hat{s} (T^{'})$. Then, we have: ${\hat{s}}_{n}^{*} = log (\prod_{i = 2}^{n} {(i - 1)}^{a_{n} (i)})$.

Proof

By Corollary 5, $\hat{s}$ is uniquely minimized by $T_{n}^{gfb}$, which directly implies ${\hat{s}}_{n}^{*} = \hat{s} (T_{n}^{gfb})$. By definition, this shows ${\hat{s}}_{n}^{*} = log (\prod_{v \in \overset{˚}{V} (T_{n}^{gfb})} (n_{v} - 1))$. Applying Corollary 8 with $c = - 1$ and the monotonicity of the logarithm immediately leads to the required result and thus completes the proof. $□$

Remark 1

The sequence ${(\prod_{i = 2}^{n} {(i - 1)}^{a_{n} (i)})}_{n \geq 1}$, which starts with the values

1, 1, 2, 3, 8, 15, 36, 63, 192, 405, 1080, 2079, 6048, 12285, 31752, 59535, 193536,

has already occurred in different contexts (cf. Bodini et al. 2022). It is listed in the Online Encyclopedia of Integer Sequences (Sloane 1964, Sequence A132862), where only recursive formulas were given. Using the GFB tree, in particular Theorem 7 and Corollary 9, now allows for a simple explicit formula to calculate this sequence.^1 The explicit formulas arising from the present manuscript have meanwhile been submitted to the OEIS.

Analogously, we can derive a result as given by Corollary 9 also for the Q-shape statistic.

Corollary 10

Let $Q_{n}^{*} = min_{T^{'} \in B T_{n}^{*}} Q (T^{'})$. Then, we have: $Q_{n}^{*} = log (\prod_{i = 2}^{n} i^{a_{n} (i)})$.

Proof

Using $c = 0$ , the proof of Corollary 10 is analogous to the proof of Corollary 9. $□$

While the explicit formulas for $π_{- 1} (T_{n}^{gfb})$ and for $π_{0} (T_{n}^{gfb})$ provided by Corollary 8 are non-recursive and thus already an improvement to the previous state of the literature, they heavily depend on Theorem 7. The following theorem provides another direct formula independent of the values $a_{n} (i)$ .

Theorem 8

Let $n \in N$ , $n \geq 2$ and let $c \in R, c > - 2$ . Let $k_{n} = ⌈ log_{2} (n) ⌉$ , and let $d_{n} = n - 2^{k_{n} - 1}$ be the difference between n and the next lower power of 2. Then, we have:

If $d_{n} = 2^{k_{n} - 1}$ (that is, $n = 2^{k_{n}}$ ), then $π_{c} (T_{n}^{gfb}) = π_{c} (T_{k_{n}}^{fb}) = \prod_{i = 0}^{k_{n} - 1} {(2^{k_{n} - i} + c)}^{2^{i}}$ .
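If $d_{n} < 2^{k_{n} - 1}$ (that is, $n < 2^{k_{n}}$ ), then we have:

π_{c} (T_{n}^{gfb}) = π_{c} (T_{k_{n} - 1}^{fb}) \cdot {(2 + c)}^{d_{n}} \cdot \prod_{i = 1}^{k_{n} - 1} [ {(\frac{2^{i + 1} + c}{2^{i} + c})}^{⌊\frac{d_{n}}{2^{i}}⌋} \cdot \frac{2^{i} + (d_{n} - 2^{i} \cdot ⌊\frac{d_{n}}{2^{i}}⌋) + c}{2^{i} + c} ]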

Before we can prove Theorem 8, we briefly show that the complete tree $T_{n}^{c}$ of Fill (1996) coincides with the GFB tree $T_{n}^{gfb}$ , which has been independently shown in Riesterer (2022, Theorem 4.2), albeit with a different proof. This result will enable us to use a different construction of $T_{n}^{gfb}$ , namely to start with $T_{k_{n} - 1}^{fb}$ , fix an orientation and then replace a certain number of leaves by cherries from left to right, until n leaves in total are reached.

Lemma 3

(Theorem 4.2 in Riesterer 2022) Let $n \geq 1$ . Then, $T_{n}^{gfb} = T_{n}^{c}$ .

Proof

We prove the statement by induction on n. In the following, let $T_{n}^{gfb} = (T_{a}, T_{b})$ , where $T_{a}$ has size $n_{a}$ and $T_{b}$ has size $n_{b}$ with $n_{a} \geq n_{b}$ . Similarly, let $T_{n}^{c} = (T_{a}^{c}, T_{b}^{c})$ , where $T_{a}^{c}$ has size $n_{a}^{c}$ and $T_{b}^{c}$ has size $n_{b}^{c}$ with $n_{a}^{c} \geq n_{b}^{c}$ .

For $n = 1$ there is only one tree, so there is nothing to show. We now assume the statement holds for all trees with up to $n - 1$ leaves and consider $n \geq 2$ . Let $ℓ_{n} = ⌊ log_{2} (n) ⌋$ , that is $n \in {2^{ℓ_{n}}, \dots, 2^{ℓ_{n} + 1} - 1}$ . Then, by construction of $T_{n}^{c}$ , we know that if $n \leq 3 \cdot 2^{ℓ_{n} - 1}$ , $n_{b}^{c} = 2^{ℓ_{n} - 1}$ (as in this case, no leaves of $T_{b}$ get replaced by cherries), and $n_{a}^{c} = n - n_{b}^{c} = n - 2^{ℓ_{n} - 1}$ . By Part (1) of Proposition 1, we can then conclude $n_{a} = n_{a}^{c}$ and $n_{b} = n_{b}^{c}$ .

If, on the other hand, $n > 3 \cdot 2^{ℓ_{n} - 1}$ , we know by construction that $T_{a}^{c} = T_{ℓ_{n}}^{fb}$ and thus $n_{a}^{c} = 2^{ℓ_{n}}$ (as all leaves of the left subtree have been replaced by cherries) and thus $n_{b}^{c} = n - n_{a}^{c} = n - 2^{ℓ_{n}}$ . By Part (2) of Proposition 1, we can then conclude $n_{a} = n_{a}^{c}$ and $n_{b} = n_{b}^{c}$ .

Thus, in both cases, the sizes of $T_{a}$ and $T_{a}^{c}$ as well as the sizes of $T_{b}$ and $T_{b}^{c}$ , respectively, coincide, which by induction shows that $T_{a} = T_{a}^{c}$ and $T_{b} = T_{b}^{c}$ . Thus, $T_{n}^{gfb} = (T_{a}, T_{b}) = (T_{a}^{c}, T_{b}^{c}) = T_{n}^{c}$ , which completes the proof. $□$
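The root split used in this proof lends itself to a quick numerical comparison. The sketch below is a plausibility check under stated assumptions, not a replacement for the induction: it takes the split sizes of $T_{n}^{c}$ exactly as stated above (with $ℓ_{n}$ read as the base-2 logarithm rounded down), applies them recursively, and compares the resulting multiset of subtree sizes with the one produced by greedily joining two trees of minimal size. The function names are hypothetical.

```python
import heapq


def greedy_sizes(n):
    """Inner-vertex subtree sizes when always joining two smallest trees."""
    heap = [1] * n
    heapq.heapify(heap)
    sizes = []
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        sizes.append(a + b)
        heapq.heappush(heap, a + b)
    return sizes


def complete_tree_sizes(n):
    """Inner-vertex subtree sizes obtained from the root split of T_n^c."""
    if n == 1:
        return []
    l = n.bit_length() - 1      # floor(log2(n)), at least 1 since n >= 2
    half = 2 ** (l - 1)
    if n <= 3 * half:
        n_b = half              # smaller subtree keeps 2^(l-1) leaves
    else:
        n_b = n - 2 ** l        # larger subtree is fully balanced with 2^l leaves
    n_a = n - n_b
    return [n] + complete_tree_sizes(n_a) + complete_tree_sizes(n_b)


for n in range(1, 301):
    assert sorted(complete_tree_sizes(n)) == sorted(greedy_sizes(n))
```

For example, for $n = 11$ both constructions yield the subtree sizes 11, 7, 4, 4, 3 and five cherries.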

We are now in the position to prove Theorem 8.

Proof of Theorem 8

For the case in which $d_{n} = 2^{k_{n} - 1}$ , the statement is a direct conclusion of the fact that for $n = 2^{k_{n}}$ , the GFB tree and the fully balanced tree coincide.

It remains to consider the case $d_{n} < 2^{k_{n} - 1}$ . In this case, by Lemma 3, we can derive the GFB tree from $T_{k_{n} - 1}^{fb}$ by replacing $d_{n}$ leaves by cherries from left to right. Thus, we can start with $π_{c} (T_{k_{n} - 1}^{fb})$ and modify this product by dividing out all factors of $π_{c} (T_{k_{n} - 1}^{fb})$ that are no longer present in $π_{c} (T_{n}^{gfb})$ and multiplying in the factors newly occurring in the latter term. We do this using the following considerations:

The factor ${(2 + c)}^{d_{n}}$ stems from the $d_{n}$ newly formed cherries, each of which contributes a factor of $2 + c$ . For this new factor, nothing needs to be divided out as the former leaves did not occur in the term $π_{c} (T_{k_{n} - 1}^{fb})$ .
Next, for each $i = 1, \dots, k_{n} - 1$ , we need to check how many subtrees of size $2^{i}$ in $T_{k_{n} - 1}^{fb}$ get replaced by subtrees of size $2^{i + 1}$ when $T_{n}^{gfb}$ is formed. These are precisely the subtrees of size $2^{i}$ in $T_{k_{n} - 1}^{fb}$ all of whose leaves get replaced by cherries. As we fill the tree up from left to right, it can be easily seen that there are $⌊\frac{d_{n}}{2^{i}}⌋$ such trees, explaining the term ${(\frac{2^{i + 1} + c}{2^{i} + c})}^{⌊\frac{d_{n}}{2^{i}}⌋}$ .
Last, for each $i = 1, \dots, k_{n} - 1$ , there may be at most one subtree of size $2^{i}$ in $T_{k_{n} - 1}^{fb}$ some but not all of whose leaves get replaced by cherries. This depends on whether $\frac{d_{n}}{2^{i}}$ is an integer. If it is, then the $d_{n}$ new cherries completely fill up all trees of size $2^{i}$ to which they were added (that is, all these trees have already been considered in the previous term). This is the case when $d_{n} - 2^{i} \cdot ⌊\frac{d_{n}}{2^{i}}⌋ = 0$ , which implies that in this case, the latter term in the equation equals 1 and thus does not contribute to the overall product. If, however, $\frac{d_{n}}{2^{i}}$ is not an integer and thus $d_{n} - 2^{i} \cdot ⌊\frac{d_{n}}{2^{i}}⌋ > 0$ , then there is a tree of size $2^{i}$ to which $d_{n} - 2^{i} \cdot ⌊\frac{d_{n}}{2^{i}}⌋$ leaves get added, namely precisely the “left over” leaves after filling up $⌊\frac{d_{n}}{2^{i}}⌋$ many subtrees of $T_{k_{n} - 1}^{fb}$ with $2^{i}$ new leaves each. This explains the last factor and thus completes the proof. $□$
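The three kinds of factors described in this proof can be assembled and compared with a direct computation. The following sketch makes several assumptions that should be read as such: $k_{n}$ is taken to be the base-2 logarithm of n rounded up, $T_{h}^{fb}$ is taken to have $2^{j}$ inner vertices with $2^{h - j}$ leaves each at depth j, and the GFB tree is built by repeatedly joining two trees of minimal size. Exact rational arithmetic is used so that the divisions in the correction factors introduce no rounding. All function names are hypothetical.

```python
import heapq
from fractions import Fraction


def pi_direct(n, c):
    """pi_c of the GFB tree, computed from the greedy construction."""
    heap = [1] * n
    heapq.heapify(heap)
    result = Fraction(1)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        result *= a + b + c
        heapq.heappush(heap, a + b)
    return result


def pi_fb(h, c):
    """pi_c of the fully balanced tree of height h (2^h leaves)."""
    result = Fraction(1)
    for j in range(h):
        # 2^j inner vertices at depth j, each with 2^(h - j) leaves
        result *= Fraction(2 ** (h - j) + c) ** (2 ** j)
    return result


def pi_from_proof(n, c):
    """pi_c of the GFB tree assembled from the factors named in the proof."""
    k = (n - 1).bit_length()          # ceil(log2(n)) for n >= 2
    d = n - 2 ** (k - 1)
    if d == 2 ** (k - 1):             # n is a power of two
        return pi_fb(k, c)
    result = pi_fb(k - 1, c) * Fraction(2 + c) ** d   # d new cherries
    for i in range(1, k):
        full, rest = divmod(d, 2 ** i)
        # subtrees of size 2^i that are completely filled up to size 2^(i+1)
        result *= Fraction(2 ** (i + 1) + c, 2 ** i + c) ** full
        # at most one partially filled subtree; the factor is 1 if rest == 0
        result *= Fraction(2 ** i + rest + c, 2 ** i + c)
    return result


for c in (-1, 0, 3):
    assert all(pi_from_proof(n, c) == pi_direct(n, c) for n in range(2, 130))
```

For instance, for $n = 7$ and $c = - 1$ the factors are 3 (from $T_{2}^{fb}$ ), 1 (three new cherries), then 3 and 2 for $i = 1$ , and 2 for $i = 2$ , giving 36 in agreement with Remark 1.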

Next, we again turn our attention to the $\hat{s}$ -shape statistic.

Expected values of the $\hat{s}$ -shape statistic

The $\hat{s}$ -shape statistic plays an important role in tree balance, particularly in the context of mathematical phylogenetics and the Yule-Harding model, cf. Kersting et al. (2025). However, so far the expected values of the $\hat{s}$ -shape statistic under the Yule-Harding and the uniform models, which are common distributions of trees, are unknown (Fischer et al. 2023, Chapter 9). In the following, we give bounds on the expected value of the $\hat{s}$ -shape statistic under these two distributions: the uniform distribution where each (binary) tree on n leaves is equally likely, and the Yule-Harding distribution. To show our bounds, we outline and use the elegant approach of King and Rosenberg (2021). They note that the expectation of the Sackin index can be computed in terms of the cluster sizes for any distribution, $θ$ , on trees which has the exchangeability property, that is, for each $T \in B T_{n}$ and each permutation, $σ$ , of its leaf labels, $P_{θ} (T) = P_{θ} (σ (T))$ . For such distributions, the proposition from Than and Rosenberg (2014) applies for the Sackin index, $S_{n}$ :

Proposition 9

(Than and Rosenberg 2014, Lemma 6) If a probability distribution, $Θ$ , on $B T_{n}$ satisfies the exchangeability property, then the expected value for the Sackin index on n-leaf trees is:

E_{θ} [S_{n}] = \sum_{k = 1}^{n - 1} (\begin{matrix} n \\ k \end{matrix}) k p_{n} (k)

where $p_{n} (k)$ is the probability that a given subset $A \subseteq X$ with $| A | = k$ , $1 \leq k \leq n$ , is a cluster of a tree of size n leaves sampled from $B T_{n}$ according to $Θ$ .

Note that the $\hat{s}$ -shape statistic, like the Sackin index, sums across all cluster sizes of a given tree: the Sackin index sums up the respective size k, while the $\hat{s}$ -shape statistic sums the logarithm of said size (namely, $log (k - 1)$ ). This similarity between the indices allows us to use the above approach introduced for the Sackin index also for the $\hat{s}$ -shape statistic:

Proposition 10

If a probability distribution, $Θ$ , on $B T_{n}$ satisfies the exchangeability property, then the expected value for the $\hat{s}$ -shape statistic on n-leaf trees is

E_{θ} [{\hat{s}}_{n}] = \sum_{k = 2}^{n - 1} (\begin{matrix} n \\ k \end{matrix}) log (k - 1) p_{n} (k)

where $p_{n} (k)$ is the probability that a given subset $A \subseteq X$ with $| A | = k$ , $2 \leq k \leq n$ , is a cluster of a tree of size n leaves sampled from $B T_{n}$ according to $Θ$ .

King and Rosenberg (2021) give an elegant proof of the expected value of the Sackin index and the resulting closed form:

Theorem 11

(King and Rosenberg 2021, Corollary 7) The expectation of the Sackin index under the uniform model on rooted binary labeled trees of n leaves is:

E_{U} [S_{n}] = \frac{4^{n - 1}}{C_{n - 1}} - n

where $C_{n} = \frac{1}{n + 1} (\begin{matrix} 2 n \\ n \end{matrix})$ , the $n^{th}$ Catalan number.
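The interplay of Proposition 9 and Theorem 11 can be illustrated with exact arithmetic. The sketch below assumes that, under the uniform model, the cluster probability is $p_{n} (k) = (\begin{matrix} n - 1 \\ k - 1 \end{matrix}) / (\begin{matrix} 2 n - 2 \\ 2 k - 2 \end{matrix})$ , which is the expression substituted for $p_{n} (k)$ in the proof of Theorem 12. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def catalan(m):
    """The m-th Catalan number."""
    return comb(2 * m, m) // (m + 1)


def cluster_prob_uniform(n, k):
    """Assumed probability that a fixed k-subset is a cluster (uniform model)."""
    return Fraction(comb(n - 1, k - 1), comb(2 * n - 2, 2 * k - 2))


def expected_sackin_sum(n):
    """Expected Sackin index via the cluster-size sum of Proposition 9."""
    return sum(comb(n, k) * k * cluster_prob_uniform(n, k) for k in range(1, n))


def expected_sackin_closed(n):
    """Expected Sackin index via the closed form of Theorem 11."""
    return Fraction(4 ** (n - 1), catalan(n - 1)) - n


for n in range(2, 31):
    assert expected_sackin_sum(n) == expected_sackin_closed(n)

print(expected_sackin_closed(4))  # 44/5
```

For $n = 4$ , the value 44/5 can also be confirmed by hand: of the 15 labeled trees, 12 are caterpillars with Sackin index 9 and 3 are fully balanced with Sackin index 8.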

Using the bounds from King and Rosenberg (2021), we can show:

Theorem 12

The expectation of the $\hat{s}$ -shape statistic under the uniform model on rooted binary labeled trees of n leaves is:

\frac{log n}{n} [\frac{4^{n - 1}}{C_{n - 1}} - n] \leq E_{U} [{\hat{s}}_{n}] \leq \frac{4^{n - 1}}{C_{n - 1}} - n

Proof

The upper bound follows directly from the Sackin index being an upper bound for the $\hat{s}$ -shape statistic.

For the lower bound, we use Proposition 10 and the simple bound of $\frac{log (k - 1)}{k} \geq \frac{log (n - 1)}{n}$ for $2 \leq k \leq n$ :

\begin{matrix} E_{θ} [{\hat{s}}_{n}] & = & \sum_{k = 2}^{n - 1} (\begin{matrix} n \\ k \end{matrix}) log (k - 1) p_{n} (k) = \sum_{k = 2}^{n - 1} (\begin{matrix} n \\ k \end{matrix}) \frac{(\begin{matrix} n - 1 \\ k - 1 \end{matrix})}{(\begin{matrix} 2 n - 2 \\ 2 k - 2 \end{matrix})} log (k - 1) \\ \geq & \frac{log (n - 1)}{n C_{n - 1}} (4^{n - 1} - [(\begin{matrix} 2 (n - 1) \\ n - 1 \end{matrix}) + 2 (\begin{matrix} 2 n - 5 \\ n - 3 \end{matrix}) + (\begin{matrix} 2 (n - 1) \\ n - 1 \end{matrix})]) \\ = & \frac{log (n - 1)}{n C_{n - 1}} (4^{n - 1} - [2 (\begin{matrix} 2 (n - 1) \\ n - 1 \end{matrix}) + 2 (\begin{matrix} 2 n - 5 \\ n - 3 \end{matrix})]) . \end{matrix}

This completes the proof. $□$

Discussion and Conclusion

We have introduced two families of imbalance indices, namely $Φ_{f}$ for strictly increasing and strictly concave functions f and the product functions $π_{c}$ for $c > - 2$ , and shown that both of them are uniquely minimized by the GFB tree. For the $\hat{s}$ -shape statistic, which is an important imbalance index used in the phylogenetic literature, this finding answered the open question concerning its extrema from (Fischer et al. 2023, Chapter 9). However, our approach is more general and not just focused on the $\hat{s}$ -shape statistic. In particular, we have shown that our families of imbalance indices contain infinitely many different imbalance indices, some of which might be useful in phylogenetics and other research areas where tree balance plays a role.

Acknowledgements

The authors wish to thank Tom N. Hamann, Kristina Wicke and Volkmar Liebscher for helpful discussions. The present manuscript is based upon work supported by the National Science Foundation under Grant No. DMS-1929284 while the authors were in residence at the Institute for Computational and Experimental Research in Mathematics (ICERM) in Providence, RI, during the Theory, Methods, and Applications of Quantitative Phylogenomics semester program.

Appendix A Proof of Theorem 7

Here we present the somewhat technical proof of Theorem 7.

Proof of Theorem 7

We first analyze the procedure with which the GFB tree is generated. When the greedy clustering is performed to form $T_{n}^{gfb}$ , we refer to all clusterings that involve at least one tree of size $2^{m}$ and no tree of size $2^{m - 1}$ as phase m of the algorithm. Phase 0 includes all steps that cluster subtrees of size $2^{m} = 2^{0} = 1$ , that is, the leaves, to form cherries. If n is odd, this includes as the last step the clustering of a leaf and a cherry (subtree of size 2). Phase 1 contains all steps that involve clustering of cherries. Almost all these clusterings will be two cherries clustered into a new subtree of size 4. The last clustering in phase 1 could cluster a cherry with a subtree of size 3 or 4, if $(n mod 4) = 1$ or 2, respectively.

We now proceed by considering some $i \in {1, \dots, n}$ with $k_{i} = ⌈ log_{2} (i) ⌉$ at the end of phase $k_{i} - 1$ of the algorithm (or, equivalently, the beginning of phase $k_{i}$ ). The trees present at this stage will be referred to as the current set of trees. All trees in the current set must have size at least $2^{k_{i} - 1} + 1$ as all trees of size up to $2^{k_{i} - 1}$ have already been clustered into larger trees. Moreover, all trees in the current set have size strictly smaller than $2^{k_{i} + 1}$ . This is due to the following: first, phase $k_{i} - 1$ only performs clusterings in which at least one tree has size at most $2^{k_{i} - 1}$ . Second, all trees formed throughout the course of the algorithm are again GFB trees by Corollary 1. This implies by Corollary 2 that a tree of size $2^{k_{i} + 1}$ or larger would have two maximum pending subtrees of size at least $2^{k_{i}}$ each, but two such trees could not have been clustered during phase $k_{i} - 1$ of the algorithm. So, all trees T in the current set have size $n_{T} \in S = {2^{k_{i} - 1} + 1, \dots, 2^{k_{i} + 1} - 1}$ . However, by Lemma 2, there is at most one tree present in the current set of trees whose size is not a power of 2. As S only contains one power of 2, namely $2^{k_{i}}$ , we conclude that all trees except possibly for one have size $2^{k_{i}}$ .

We now consider the decomposition $n = a \cdot 2^{k_{i}} + b$ with $b = (n mod 2^{k_{i}}) \in {0, \dots, 2^{k_{i}} - 1}$ and $a = ⌊\frac{n}{2^{k_{i}}}⌋$ . We distinguish three cases depending on the value of b. Note that this decomposition together with the above observations immediately implies that there are $a - 1$ or a trees of size $2^{k_{i}}$ in the current set. In the first case, the only remaining tree has size $2^{k_{i}} + b$ , and in the second case, the only remaining tree has size b.

If $b = 0$ , we have $n = a \cdot 2^{k_{i}}$ . Thus, we have $(n mod 2^{k_{i}}) = 0$ , and obviously all trees in the current set have size $2^{k_{i}}$ , because $n - (a - 1) \cdot 2^{k_{i}} = 2^{k_{i}}$ , so if you consider $a - 1$ trees of size $2^{k_{i}}$ and one remaining tree, the remaining tree has the same size. This directly proves the statement $a_{n} (i) = ⌊\frac{n}{i}⌋$ of the theorem for $i = 2^{k_{i}}$ and $(n mod i) = 0$ (as all trees of the current set are subtrees of $T_{n}^{gfb}$ and no other trees of size $i = 2^{k_{i}}$ can be formed throughout the algorithm). Moreover, as $b = 0$ , we also have for $i \in {2^{k_{i} - 1} + 1, \dots, 2^{k_{i}} - 1}$ that $((n - i) mod 2^{k_{i} - 1}) > 0$ , and as there is no tree of such a size i (as all trees in the current set have size $2^{k_{i}}$ , which by Corollary 2 are formed of two trees of size $2^{k_{i} - 1}$ each), we have $a_{n} (i) = 0$ for such values of i. This is in accordance with the last case of the theorem.
If $0 < b < 2^{k_{i} - 1}$ , we observe that there cannot be a trees of size $2^{k_{i}}$ in the current set. Otherwise, there would be b leaves not contained in any $2^{k_{i}}$ tree, thus belonging to a tree T of size b in the current set – but this would imply that the size of T is strictly smaller than $2^{k_{i} - 1}$ , a contradiction to the current set being the set at the end of phase $k_{i} - 1$ , as T would have had to cluster in this phase. So the only possibility is that there are $a - 1$ trees of size $2^{k_{i}}$ in the current set and one tree T of size $2^{k_{i}} + b$ . But by Corollary 2, T cannot have a subtree of size $2^{k_{i}}$ (as a tree of size $2^{k_{i}}$ cannot be clustered with a tree of size $b < 2^{k_{i} - 1}$ to form a GFB tree, and T has to be a GFB tree, too, as it is a subtree of $T_{n}^{gfb}$ ); it instead has one maximum pending subtree of size $2^{k_{i} - 1}$ and one of size $2^{k_{i} - 1} + b$ . Thus, $T_{n}^{gfb}$ contains $a - 1 = ⌊\frac{n}{i}⌋ - 1$ trees of size $i = 2^{k_{i}}$ , so $a_{n} (2^{k_{i}}) = ⌊\frac{n}{i}⌋ - 1$ , which proves the second case of the theorem as we have $0 < b = (n mod i) < 2^{k_{i} - 1}$ . Further, if $i = 2^{k_{i} - 1} + b$ , we have $n - i = a \cdot 2^{k_{i}} + b - (2^{k_{i} - 1} + b) = (2 a - 1) \cdot 2^{k_{i} - 1}$ , and thus $((n - i) mod 2^{k_{i} - 1}) = 0$ . As we have already seen that a subtree of size $2^{k_{i} - 1} + b$ was formed in the course of the algorithm (and no more such trees can be formed as the algorithm proceeds), we have $a_{n} (2^{k_{i} - 1} + b) = 1$ , which is in accordance with the third case of the theorem. Last, if $i \in {2^{k_{i} - 1} + 1, \dots, 2^{k_{i}} - 1}$ with $i \neq 2^{k_{i} - 1} + b$ , we have seen that no such tree can be formed in the course of the algorithm, so $a_{n} (i) = 0$ . As in this case we have $((n - i) mod 2^{k_{i} - 1}) > 0$ , this is in accordance with the fourth case of the theorem.
Finally, we consider the case $b \geq 2^{k_{i} - 1}$ . In this case, before considering the beginning of phase $k_{i}$ of the algorithm, we consider the beginning of phase $k_{i} - 1$ , which is the end of phase $k_{i} - 2$ . We can write $n = a \cdot 2^{k_{i}} + b = (2 a + 1) \cdot 2^{k_{i} - 1} + b^{'}$ , where $b^{'} = b - 2^{k_{i} - 1} < 2^{k_{i} - 1}$ (as $b < 2^{k_{i}}$ ). Using the same reasoning as above, in the beginning of phase $k_{i} - 1$ , there must either be 2a trees of size $2^{k_{i} - 1}$ and one tree of size $n - 2 a \cdot 2^{k_{i} - 1} = 2^{k_{i} - 1} + b^{'}$ , or there are $2 a + 1$ trees of size $2^{k_{i} - 1}$ and one tree of size $b^{'} < 2^{k_{i} - 1}$ . In the latter case, if in the beginning of phase $k_{i} - 1$ we have $2 a + 1$ trees of size $2^{k_{i} - 1}$ and one tree of size $b^{'} < 2^{k_{i} - 1}$ , note that we must have $b^{'} > 0$ (empty trees are never formed in the algorithm). Then, the tree of size $b^{'}$ will be the first one to be clustered in phase $k_{i} - 1$ with one of the $2 a + 1$ trees of size $2^{k_{i} - 1}$ . This forms a tree of size $2^{k_{i} - 1} + b^{'}$ , and the remaining 2a trees of size $2^{k_{i} - 1}$ will cluster to form a trees of size $2^{k_{i}}$ in the course of phase $k_{i} - 1$ . Thus, in the end of this phase, we have a trees of size $2^{k_{i}}$ and one tree of size $2^{k_{i} - 1} + b^{'} = 2^{k_{i} - 1} + b - 2^{k_{i} - 1} = b$ . This tree of size b would be the first tree to be clustered in phase $k_{i}$ with one of the trees of size $2^{k_{i}}$ to form a tree of size $2^{k_{i}} + b$ . Now, we consider the first case, where in the beginning of phase $k_{i} - 1$ we have 2a trees of size $2^{k_{i} - 1}$ and one tree of size $2^{k_{i} - 1} + b^{'}$ . If $b^{'} > 0$ , phase $k_{i} - 1$ would not involve the last tree; it would only cluster the 2a trees of size $2^{k_{i} - 1}$ to form a trees of size $2^{k_{i}}$ .
The remaining tree of size $2^{k_{i} - 1} + b^{'}$ would, however, be the first one to be clustered in phase $k_{i}$ with one of the trees of size $2^{k_{i}}$ , to form a tree of size $3 \cdot 2^{k_{i} - 1} + b^{'} = 2^{k_{i}} + b^{'} + 2^{k_{i} - 1} = 2^{k_{i}} + b$ . Similarly, if $b^{'} = 0$ , phase $k_{i} - 1$ would first cluster the 2a trees of size $2^{k_{i} - 1}$ to form a trees of size $2^{k_{i}}$ and then, in its last step, it would cluster the remaining tree of size $2^{k_{i} - 1} + b^{'} = 2^{k_{i} - 1}$ with one of the already formed trees of size $2^{k_{i}}$ to form a tree of size $3 \cdot 2^{k_{i} - 1} + b^{'} = 2^{k_{i}} + b$ . So in all cases, we can see (either directly after phase $k_{i} - 1$ or after the first step of phase $k_{i}$ ) that $T_{n}^{gfb}$ contains precisely a trees of size $2^{k_{i}}$ , one of which gets clustered with a tree of size b to form a tree of size $2^{k_{i}} + b$ . Now, if $i = 2^{k_{i}}$ , the $a = ⌊\frac{n}{i}⌋$ many trees of size $2^{k_{i}}$ are in accordance with the first case of the theorem, because here we have $(n mod 2^{k_{i}}) = b \geq 2^{k_{i} - 1}$ . Therefore, in this case we have $a_{n} (i) = ⌊\frac{n}{i}⌋$ . However, if $i \in {2^{k_{i} - 1} + 1, \dots, 2^{k_{i}} - 1}$ , we have $a_{n} (i) = 1$ if $i = b$ (because a tree of size b gets clustered with a tree of size $2^{k_{i}}$ ). This is in accordance with the third case of the theorem, as here we have $n - i = n - b = a \cdot 2^{k_{i}} + b - b = a \cdot 2^{k_{i}}$ , and thus $((n - i) mod 2^{k_{i} - 1}) = 0$ . If, on the other hand, we have $i \in {2^{k_{i} - 1} + 1, \dots, 2^{k_{i}} - 1}$ and $i \neq b$ , then there cannot be a subtree of size i: It would have to be clustered away last in phase $k_{i}$ of the algorithm. So it would have to be there in the end of phase $k_{i} - 1$ (which is not the case as we have analyzed above), or it would have to arise within phase $k_{i}$ .
The latter cannot happen as we have seen that at the latest after one step of phase $k_{i}$ , we are left with only trees of size at least $2^{k_{i}}$ . Thus, we conclude that for such values of i, we have $a_{n} (i) = 0$ . This is in accordance with the fourth case of the theorem, because in this case we have $n - i = a \cdot 2^{k_{i}} + b - i$ . Since both b and i are contained in the set ${2^{k_{i} - 1}, \dots, 2^{k_{i}} - 1}$ , their maximal difference is bounded by $2^{k_{i}} - 2^{k_{i} - 1}$ . We can distinguish two cases: If $b > i$ , we have $n - i = a \cdot 2^{k_{i}} + b - i > a \cdot 2^{k_{i}} = 2 a \cdot 2^{k_{i} - 1}$ as well as $n - i = a \cdot 2^{k_{i}} + b - i < a \cdot 2^{k_{i}} + 2^{k_{i}} - 2^{k_{i} - 1} = (2 a + 1) \cdot 2^{k_{i} - 1}$ , so we have $(2 a + 1) \cdot 2^{k_{i} - 1} > n - i > 2 a \cdot 2^{k_{i} - 1}$ , which shows that $((n - i) mod 2^{k_{i} - 1}) > 0$ . Similarly, if $b < i$ , we have $n - i = a \cdot 2^{k_{i}} + b - i < a \cdot 2^{k_{i}} = 2 a \cdot 2^{k_{i} - 1}$ as well as $n - i = a \cdot 2^{k_{i}} + b - i > a \cdot 2^{k_{i}} - 2^{k_{i}} + 2^{k_{i} - 1} = (2 a - 1) \cdot 2^{k_{i} - 1}$ , so we have $2 a \cdot 2^{k_{i} - 1} > n - i > (2 a - 1) \cdot 2^{k_{i} - 1}$ , which shows that $((n - i) mod 2^{k_{i} - 1}) > 0$ .

This completes the proof of the theorem. $□$
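The structural claim at the start of this proof, namely that at the end of phase $k - 1$ all trees of the current set have sizes between $2^{k - 1} + 1$ and $2^{k + 1} - 1$ and all but possibly one have size $2^{k}$ , can be observed by simulating the greedy clustering. The sketch below is an illustration under stated assumptions: the end of phase m is detected as the first moment at which no tree of size at most $2^{m}$ remains, and the function name `check_phase_structure` is hypothetical.

```python
import heapq


def check_phase_structure(n):
    """Simulate the greedy clustering and test the claim at each phase end.

    Returns True if, whenever no tree of size <= 2^m remains, every tree
    has a size strictly between 2^m and 2^(m + 2) and at most one tree
    has a size different from 2^(m + 1).
    """
    heap = [1] * n
    heapq.heapify(heap)
    m = 0  # index of the phase currently running
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, a + b)
        # heap[0] is the smallest tree; phase m has ended once it exceeds 2^m
        while heap[0] > 2 ** m:
            k = m + 1
            if any(not (2 ** (k - 1) < s < 2 ** (k + 1)) for s in heap):
                return False
            if sum(1 for s in heap if s != 2 ** k) > 1:
                return False
            m += 1
    return True


assert all(check_phase_structure(n) for n in range(1, 301))
```

For $n = 13$ , for example, the current sets at the ends of phases 0, 1 and 2 have the sizes (2, 2, 2, 2, 2, 3), (4, 4, 5) and (5, 8), respectively.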

Funding

Open Access funding enabled and organized by Projekt DEAL. The present manuscript is based upon work supported by the National Science Foundation under Grant No. DMS-1929284 while the authors were in residence at the Institute for Computational and Experimental Research in Mathematics (ICERM) in Providence, RI, during the Theory, Methods, and Applications of Quantitative Phylogenomics semester program.

Data Availability

The authors declare that all data generated for this study are contained in the manuscript. No other sources of data have been used.

Declarations

Competing interests

All authors received the funding declared above. The authors confirm that they have no other financial or non-financial interests to declare.

Footnotes

We note that another explicit, but more complicated, formula was independently derived in Riesterer (2022).

Sean Cleary, Mareike Fischer and Katherine St. John contributed equally to this work.

References

Blum MG, François O (2006) Which random processes describe the tree of life? A large-scale study of phylogenetic tree imbalance. Syst Biol 55(4):685–691
Bodini O, Genitrini A, Gittenberger B, Larcher I, Naima M (2022) Compaction for two models of logarithmic-depth trees: Analysis and experiments. Random Struct Algorithms 61(1):31–61. 10.1002/rsa.21056
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to Algorithms, 2nd edn. MIT Press, Cambridge, MA
Coronado TM, Fischer M, Herbst L, Rosselló F, Wicke K (2020) On the minimum value of the Colless index and the bifurcating trees that achieve it. J Math Biol 80(7):1993–2054. 10.1007/s00285-020-01488-9
Fill JA (1996) On the distribution of binary search trees under the random permutation model. Random Struct Algorithms 8(1):1–25. 10.1002/(sici)1098-2418(199601)8:1<1::aid-rsa1>3.0.co;2-1
Fischer M (2021) Extremal values of the Sackin tree balance index. Ann Comb 25(2):515–541. 10.1007/s00026-021-00539-2
Fischer M, Liebscher V (2021) On the balance of unrooted trees. J Graph Algorithms Appl 25(1):133–150. 10.7155/jgaa.00553
Fischer M, Herbst L, Kersting SJ, Kühn L, Wicke K (2023) Tree Balance Indices - A Comprehensive Survey. Springer, Berlin
Hamann TN (2023) Metaconcepts for rooted tree balance. Master’s thesis. University of Greifswald, Germany
Kersting SJ, Wicke K, Fischer M (2025) Tree balance in phylogenetic models. Philos Trans R Soc Lond B Biol Sci 380(1919):20230303
King MC, Rosenberg NA (2021) A simple derivation of the mean of the Sackin index of tree balance under the uniform model on rooted binary labeled trees. Math Biosci 342:108688
Mir A, Rosselló F, Rotger L (2013) A new balance index for phylogenetic trees. Math Biosci 241(1):125–136. 10.1016/j.mbs.2012.10.005
Riesterer T (2022) Der Greedy-from-the-bottom-Baum und seine Bedeutung für die Phylogenetik. Master’s thesis. University of Greifswald, Germany
Rosen DE (1978) Vicariant patterns and historical explanation in biogeography. Syst Zool 27(2):159. 10.2307/2412970
Semple C, Steel M (2003) Phylogenetics. Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford
Shao K-T, Sokal RR (1990) Tree balance. Syst Zool 39(3):266. 10.2307/2992186
Sloane NJA (1964) The On-Line Encyclopedia of Integer Sequences. http://oeis.org
Steel M (2016) Phylogeny: Discrete and Random Processes in Evolution. Society for Industrial and Applied Mathematics, Philadelphia PA
Than CV, Rosenberg NA (2014) Mean deep coalescence cost under exchangeable probability distributions. Discret Appl Math 174:11–26
