Systematic Biology. 2022 Apr 12;71(5):1210–1224. doi: 10.1093/sysbio/syac027

Robust, Universal Tree Balance Indices

Jeanne Lemant 1,2,3, Cécile Le Sueur 4, Veselin Manojlović 5, Robert Noble 6,7
Editor: James Rosindell
PMCID: PMC9773123  PMID: 35412638

Abstract

Balance indices that quantify the symmetry of branching events and the compactness of trees are widely used to compare evolutionary processes or tree-generating algorithms. Yet, existing indices are not defined for all rooted trees, are unreliable for comparing trees with different numbers of leaves, and are sensitive to the presence or absence of rare types. The contributions of this article are twofold. First, we define a new class of robust, universal tree balance indices. These indices take a form similar to Colless’ index but can account for population sizes, are defined for trees with any degree distribution, and enable meaningful comparison of trees with different numbers of leaves. Second, we show that for bifurcating and all other full m-ary cladograms (in which every internal node has the same out-degree), one such Colless-like index is equivalent to the normalized reciprocal of Sackin’s index. Hence, we both unify and generalize the two most popular existing tree balance indices. Our indices are intrinsically normalized and can be computed in linear time. We conclude that these more widely applicable indices have the potential to supersede those in current use. [Cancer; clone tree; Colless index; Sackin index; species tree; tree balance.]


Tree balance indices, most notably those credited to Sackin (1972) and Colless (1982), are widely used to describe speciation processes, compare cladograms, and assess the correctness of tree reconstruction methods (Shao and Sokal 1990; Mooers and Heard 1997; Fischer et al. 2021). Existing tree balance indices have several important flaws. First, they cannot be applied to any tree in which any node has only one descendant. Second, existing indices are unreliable for comparing trees with different numbers of leaves. Third, because they do not account for population sizes, these indices are sensitive to the omission or inclusion of rare types. The last issue is, for example, a problem in oncology (Chkhaidze et al. 2019; Scott et al. 2020), where methods for determining and classifying evolutionary modes have clinical value (Davis et al. 2017; Maley et al. 2017).

Here, we develop a new class of robust, universal tree balance indices. Our definitions not only extend the tree balance concept and open up new applications but also unify the two main approaches to quantifying balance as proposed by Sackin and Colless. We describe several general advantages of our indices compared to those in current use.

Materials and Methods

Rooted Trees

We consider exclusively rooted trees in which all edges are oriented away from the root (which will be topmost in our figures). This orientation defines a natural order on the tree, from top to bottom: edges descend from the root to the other internal nodes and finally to the terminal nodes or leaves. The out-degree of a node $i$, written $d^+(i)$, is the number of direct descendants, ignoring any subtrees in which all nodes have zero size. Internal nodes have out-degree at least one, whereas leaves have out-degree zero. If all internal nodes have out-degree 1, then the tree is called linear. If all internal nodes have out-degree $m > 1$ then the tree is a full $m$-ary tree, and if $m = 2$ then it is also called bifurcating (such as Fig. 1a,b).

Figure 1.

Contrasting trees. a) Caterpillar tree with Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic. b) Fully symmetric bifurcating tree with Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic. c) Star tree with Inline graphic, Inline graphic, Inline graphic and Inline graphic undefined, Inline graphic. d) Clone tree of the lung tumor CRUK0065 in the TRACERx cohort (Jamal-Hanjani et al. 2017). In the clone tree, nodes represented by empty circles correspond to extinct clones, and the diameters of other nodes are proportional to the corresponding clone population sizes.

Some other tree topologies have particular names. A caterpillar tree (Fig. 1a) is a bifurcating tree in which every internal node except one has exactly one leaf. A fully symmetric tree (Fig. 1b) is such that every internal node with the same depth has the same degree or, equivalently, for each internal node $i$ all the subtrees rooted at the children of $i$ are identical. A star tree (Fig. 1c) is a tree whose leaves are all attached to the root, which is the only internal node.

Node Sizes, Tree Magnitudes, and Leafy Trees

Although our definitions can be applied in other contexts, we will assume that nodes correspond to biological taxa or clones, and on this basis, we assign non-negative node sizes. If we know (or care) only whether each type is extant or extinct—as is typical in taxonomy—then we assign size zero to every node representing an extinct type, and size one otherwise. If nodes represent clones with known population sizes—as is often the case in studies of cancer and microbial evolution—then each node size is equal to the population size of the corresponding clone. The magnitude of a tree or subtree is then defined as the sum of its node sizes (we use magnitude here because a tree’s size is conventionally defined as its number of nodes). We define a leafy tree as a rooted tree in which all internal nodes have size zero.
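The definitions in this subsection can be sketched in a few lines of Python. The tree encoding (a `children` dictionary of child lists plus a `sizes` dictionary) and the function names are illustrative assumptions, not something specified in the article.

```python
def magnitude(children, sizes, node):
    """Magnitude of the subtree rooted at `node`: the sum of its node sizes."""
    return sizes.get(node, 0) + sum(
        magnitude(children, sizes, child) for child in children.get(node, [])
    )


def is_leafy(children, sizes):
    """A rooted tree is leafy if every internal node has size zero."""
    return all(sizes.get(node, 0) == 0 for node, kids in children.items() if kids)


# Hypothetical clone tree: an extant root clone with two descendant clones.
children = {"root": ["a", "b"]}
clone_sizes = {"root": 5, "a": 3, "b": 2}

# The same topology read as a cladogram: extinct ancestor, two extant taxa.
cladogram_sizes = {"root": 0, "a": 1, "b": 1}

print(magnitude(children, clone_sizes, "root"))      # 10
print(is_leafy(children, clone_sizes))               # False
print(magnitude(children, cladogram_sizes, "root"))  # 2, the number of leaves
print(is_leafy(children, cladogram_sizes))           # True
```

The recursive form is chosen for brevity; for very deep trees an explicit stack would avoid Python's recursion limit.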

Cladograms, Taxon Trees, and Clone Trees

Tree types can also be defined in terms of what they represent. Following Podani (2013), we distinguish between two representations used in systematic biology.

We define a cladogram as a rooted tree in which internal nodes represent hypothetical extinct ancestors, leaves represent extant biological taxa, and edges represent evolutionary relationships. This is equivalent to the synchronous cladogram definition of Podani (2013). Every cladogram is by definition a leafy tree, with a magnitude equal to its number of leaves. A common conception is that only bifurcating cladograms can be considered fully resolved. However, the linear two-node cladogram is appropriate for representing serial anagenesis (in which each descendant replaces its ancestor), while budding (in which an ancestor produces a descendant and remains extant) can give rise to cladogram nodes with an out-degree greater than two (Podani 2013). Hence, there is no restriction on cladogram node degrees. An extant ancestor is represented in a cladogram by a leaf stemming from the internal ancestor node, in which case, as Podani notes, “an ancestor is identical to an extant taxon connected directly to it.”

Alternatively, extant or known ancestors may be represented uniquely by internal nodes (as in a genealogy with overlapping generations). Such diagrams are known to organismal biologists as species trees or taxon trees, and to oncologists as clone trees. We define a taxon tree as a rooted tree in which all nodes represent biological taxa, and edges represent ancestor-descendant relationships. Similarly, a clone tree is defined as a rooted tree in which each node represents a clone (a set of cells that share alterations of interest due to common descent), and edges represent the chronology of alterations. Both the taxon tree and the clone tree fit the achronous tree definition of Podani (2013). Clone tree nodes can have any out-degree, including one, and each node (internal nodes included) can be associated with a non-negative size, as illustrated in Figure 1d.

When nodes are associated with sizes, the addition of subtrees comprising even vanishingly small nodes can change leaves into internal nodes and so substantially change the value of existing tree balance indices. This behavior is unsatisfactory because relatively small nodes typically represent either newly created types that have yet to experience evolutionary forces or types on the verge of extinction, and in either case convey negligible information about the mode of evolution. Data sets may also omit rare types due to sampling error or because genetic sequencing methods have imperfect sensitivity (Turajlic et al. 2018).

The change due to the addition of terminal nodes is greater when the tree is a cladogram rather than a taxon or clone tree. For example, when a three-node, two-leaf tree (Fig. 2a) is augmented by adding a node Inline graphic to a leaf Inline graphic (Fig. 2b), the three original nodes retain their positions in the clone tree (middle column of Fig. 2), but in the cladogram (right column) node Inline graphic becomes two nodes (Inline graphic and Inline graphic), the larger of which is now further from the root (see Podani (2013) for further illustrations of this difference). As the size of the new node Inline graphic is continuously reduced to zero, the clone tree changes continuously, whereas the cladogram undergoes an abrupt change of topology when the size of node Inline graphic reaches zero. We conclude that the taxon tree or clone tree representation is more robust than the cladogram representation in the general case in which nodes are associated with sizes and ancestors can be extant. Also, an index that accounts for nonzero internal node sizes can be made more robust than one that does not. Accordingly, we will define indices for the more general domain of clone trees and then obtain results for cladograms as a special case.

Figure 2.

Muller plots (left column), taxon or clone trees (middle column), and cladograms (right column) representing evolution by splitting only (a) and both splitting and budding (b). In a Muller plot, polygons represent proportional subpopulation sizes (vertical axis) over time (horizontal axis), and each descendant is shown emerging from its parent polygon. In the trees, nodes represented by empty circles correspond to extinct types.

Existing Tree Balance Indices

The most widely used tree balance indices are in fact imbalance indices, such that more balanced trees are assigned smaller values. These indices were introduced to study cladograms; they take no account of node size, and, even after applying standard normalizations, they are appropriate only for comparing trees with equal numbers of leaves. The most popular are Sackin’s index and Colless’ index.

Sackin’s index

Let $T$ be a tree with a set of leaves $L(T)$. For a leaf $l \in L(T)$, let $\nu(l)$ be the number of internal nodes between $l$ and the root, which is included in the count. Then, the index credited to Sackin (1972) is

$$I_S(T) = \sum_{l \in L(T)} \nu(l).$$

For two bifurcating trees with the same number of leaves, the less balanced tree has the higher value of $I_S$, as it is in a sense less compact (compare trees a and b in Fig. 1).

Since the value tends to increase with the number of nodes, Shao and Sokal (1990) proposed normalizing $I_S$ with respect to trees on $n$ leaves by subtracting its minimum possible value for such trees and then dividing by the difference between the maximum and minimum possible values. The minimal $I_S$ is reached on the star tree, such as tree c in Figure 1, and hence $\min I_S = n$. The maximum is attained on the caterpillar tree, such as tree a:

$$\max I_S = (n - 1) + \sum_{k=1}^{n-1} k = \frac{(n + 2)(n - 1)}{2}.$$

The normalized index is then

$$I_{S,\mathrm{norm}}(T) = \frac{I_S(T) - n}{\frac{1}{2}(n + 2)(n - 1) - n}.$$
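As a minimal sketch of this calculation, the following Python computes Sackin's index as the sum of leaf depths, together with the min-max normalization, using the star-tree minimum $n$ and the caterpillar maximum $(n+2)(n-1)/2$. The child-list tree encoding and the function names are assumptions made for illustration.

```python
def leaf_depths(children, root):
    """Map each leaf to the number of internal nodes on its path from the root."""
    depths = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        kids = children.get(node, [])
        if not kids:
            depths[node] = depth  # a leaf at edge-depth d has d internal ancestors
        for kid in kids:
            stack.append((kid, depth + 1))
    return depths


def sackin(children, root):
    """Sackin's index: the sum of all leaf depths."""
    return sum(leaf_depths(children, root).values())


def sackin_normalized(children, root):
    """Min-max normalization over trees on n leaves (needs n >= 3)."""
    n = len(leaf_depths(children, root))
    lowest = n                          # star tree
    highest = (n + 2) * (n - 1) // 2    # caterpillar tree
    return (sackin(children, root) - lowest) / (highest - lowest)


# Three hypothetical trees on four leaves.
caterpillar = {"r": ["a", "x"], "x": ["b", "y"], "y": ["c", "d"]}
symmetric = {"r": ["x", "y"], "x": ["a", "b"], "y": ["c", "d"]}
star = {"r": ["a", "b", "c", "d"]}

print(sackin(caterpillar, "r"), sackin_normalized(caterpillar, "r"))  # 9 1.0
print(sackin(symmetric, "r"), sackin_normalized(symmetric, "r"))      # 8 0.8
print(sackin(star, "r"), sackin_normalized(star, "r"))                # 4 0.0
```

The output illustrates the point made in the next paragraph: the fully symmetric tree scores 0.8, far from the star tree's 0, even though both split their descendants evenly.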

This normalized index is not very satisfactory as a balance index because it fails to capture an intuitive notion of balance. For example, it is not obvious why a fully symmetric tree (b) should be considered less balanced than the star tree (c) in Figure 1, yet its $I_{S,\mathrm{norm}}$ value is much larger. To address this issue, Shao and Sokal (1990) further suggested normalizing $I_S$ relative to its extremal values among trees with the same number of internal nodes as well as the same number of leaves. But even then the index remains unreliable for comparing trees with different numbers of leaves. For example, the index is 1 for every caterpillar tree, yet long caterpillar trees are intuitively less balanced than short ones. The conventional $I_S$ normalizations are not defined for trees containing linear parts. Moreover, since $I_S$ does not account for node size, it is sensitive to the addition or removal of subtrees comprising relatively small nodes.

Colless’ index

For an internal node $i$ of a bifurcating tree $T$, define $n_l(i)$ as the number of leaves of the left branch of the subtree rooted at $i$, and $n_r(i)$ as the number of leaves of the right branch. Then, the index defined by Colless (1982) is

$$I_C(T) = \sum_{i \in V_{\mathrm{int}}(T)} \left| n_l(i) - n_r(i) \right|,$$

where $V_{\mathrm{int}}(T)$ is the set of all internal nodes of $T$. The index can be normalized for the set of trees on $n$ leaves by dividing by its maximal value, $(n - 1)(n - 2)/2$, which is reached on the caterpillar tree (as in Fig. 1a).
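A minimal Python sketch of Colless' index for bifurcating trees follows. The child-list encoding and the function name are illustrative assumptions; the function rejects any internal node that does not have exactly two children, reflecting the restriction to bifurcating trees.

```python
def colless(children, root):
    """Sum over internal nodes of |leaves in left branch - leaves in right branch|."""
    # Pre-order traversal; reversing it visits every child before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    leaves_below = {}
    total = 0
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            leaves_below[node] = 1
            continue
        if len(kids) != 2:
            raise ValueError("Colless' index requires a bifurcating tree")
        left, right = kids
        leaves_below[node] = leaves_below[left] + leaves_below[right]
        total += abs(leaves_below[left] - leaves_below[right])
    return total


# Hypothetical trees on four leaves.
caterpillar = {"r": ["a", "x"], "x": ["b", "y"], "y": ["c", "d"]}
symmetric = {"r": ["x", "y"], "x": ["a", "b"], "y": ["c", "d"]}

n = 4
print(colless(caterpillar, "r"), (n - 1) * (n - 2) // 2)  # 3 3
print(colless(symmetric, "r"))                            # 0
```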

Because Colless’ index cannot be applied to multifurcating trees, Mir et al. (2018) recently introduced a family of Colless-like balance indices, including Inline graphic as a special case. Each of these indices Inline graphic is determined by a weight function Inline graphic, which assigns a size to each subtree as a function of its out-degree, and a dissimilarity function Inline graphic. By definition of Inline graphic, Colless-like indices are zero if and only if each internal node divides its descendants into subtrees of equal size. But since these indices are normalized by dividing by the maximal value for trees on the same number of leaves, they are unreliable for comparing trees with different numbers of leaves. In common with Sackin’s index, the total cophenetic index Inline graphic (Mir et al. 2013) (see Appendix), and other existing indices (surveyed by Fischer et al. (2021)), the Colless-like indices so far defined do not account for node sizes and can be applied only to trees in which all nodes have out-degree greater than one.

Desirable Properties of a Universal, Robust Tree Balance Index

Our aim is to derive a tree balance index $J$ that is useful for classifying and comparing rooted trees that can have any distributions of node degrees and node sizes. Here, we specify five desirable properties that such an index should have. The first two axioms relate to extrema. We will call an index universal if it is defined for trees with any degree distribution and obeys these first two axioms. An index that conforms to the other three axioms (which are relevant only when nodes can have arbitrary sizes) will be called robust.

We will begin by introducing some additional notation (see also Table 1). For a tree $T$, we will use $V(T)$ to denote the set of all nodes of $T$, which we will abbreviate to $V$ when the identity of the tree is unambiguous. Let $f(v)$ denote the size of node $v$. Then, $T_i$ denotes the subtree rooted at node $i$ (i.e., the subtree that contains node $i$ and all its descendants); $S_i$ is the magnitude of $T_i$; and $S_i^*$ is the magnitude of $T_i$ excluding its root:

Table 1.

Notation used throughout this article

Properties of a node Inline graphic  
Inline graphic Out-degree
Inline graphic Set of children
Inline graphic Depth
Inline graphic Size
Inline graphic Subtree rooted at Inline graphic
Inline graphic Number of leaves of Inline graphic
Inline graphic Magnitude of Inline graphic (sum of node sizes)
Inline graphic Magnitude of Inline graphic excluding its root
Inline graphic Importance factor
Inline graphic Inline graphic , where Inline graphic
Inline graphic Balance score
Inline graphic Balance score based on Inline graphic
Inline graphic Nonroot dominance factor
Sets of nodes  
Inline graphic All nodes
Inline graphic Internal nodes Inline graphic such that Inline graphic
Inline graphic Leaves
Entropies and tree balance indices
Inline graphic Generalized entropy with parameter Inline graphic
Inline graphic Shannon entropy with base Inline graphic
Inline graphic Sackin’s index
Inline graphic Colless’ index
Inline graphic Total cophenetic index
Inline graphic Colless-like index
Inline graphic Generalized Sackin’s index
Inline graphic Generalized Colless’ index
Inline graphic Tree balance index based on Inline graphic
Inline graphic Normalized inverse Sackin index
Inline graphic A conservative tree balance index
$$S_i = \sum_{v \in V(T_i)} f(v), \qquad S_i^* = S_i - f(i).$$

We will use $\tilde{V}(T)$ or simply $\tilde{V}$ to denote the set of all internal nodes $i$ such that $S_i^* > 0$.

Conventionally, a tree is considered maximally balanced only if every internal node splits its descendants into subtrees on the same number of leaves (Shao and Sokal 1990). We generalize this concept by requiring that every internal node splits its descendants into at least two subtrees of equal magnitude, as in Figure 3a. We call this the equal splits property, and we make it a necessary and sufficient condition for maximal balance.

Figure 3.

a) A tree in which each internal node has null size and splits its descendants into subtrees of equal magnitude, and hence $J = 1$. This tree can be considered balanced only according to an index that accounts for node size. b) A linear tree, for which $J = 0$. c–e) A robust, universal tree balance index $J$ is insensitive to the addition of a subtree of arbitrarily small magnitude if it is added to a leaf (c) or a nonroot node with out-degree 1 (d), but not necessarily if the subtree is added to a nonroot node with greater out-degree (e).

Axiom 1 (Maximum value).

$J(T) \le 1$ for all trees $T$, and $J(T) = 1$ if and only if $T$ has equal splits.

Another convention is that trees with relatively many internal nodes are considered highly imbalanced. According to this convention, linear trees (i.e., trees in which every internal node $i$ has $d^+(i) = 1$, as in Fig. 3b) should be considered even less balanced than caterpillar trees. Also, given that balance implies branching, the most imbalanced split is one that assigns all descendants to one branch and none to any other branches. Hence our second desirable property:

Axiom 2 (Minimum value).

$J(T) \ge 0$ for all trees $T$, and $J(T) = 0$ if and only if $T$ is a linear tree.

Our third desirable property ensures that our index is insensitive to the properties of nodes that have relatively few descendants.

Axiom 3 (Insensitivity).

Let Inline graphic be a tree and Inline graphic be one of its leaves. If we create a new tree Inline graphic from Inline graphic by adding a subtree with finitely many nodes rooted at Inline graphic then Inline graphic as Inline graphic.

Our fourth axiom ensures that a linear section of a tree is regarded as a maximally unequal split.

Axiom 4 (Linear limit).

Let Inline graphic be a tree and Inline graphic with Inline graphic. Let Inline graphic be the unique child of Inline graphic. If we create a new tree Inline graphic from Inline graphic by adding additional subtrees with finitely many nodes rooted at Inline graphic then Inline graphic as Inline graphic.

Lastly, we require continuity with respect to varying node size:

Axiom 5 (Continuity).

Suppose we create a new tree $T'$ by selecting a node of tree $T$ and changing the node’s size from $f$ to $f'$. Then $J(T') \to J(T)$ as $f' \to f$.

Alternative axioms are considered in the Appendix.

Sensitivity to Changes in Out-degree of Nonroot Nodes

By design, our definition of a robust tree balance index does not require insensitivity to the addition or removal of rare types in all cases. To see why, suppose we transform a tree Inline graphic into Inline graphic by adding one or more subtrees of arbitrarily small magnitude, attached to a nonroot node Inline graphic. As illustrated in Figure 3c–e, there are three topologically distinct cases to consider. If Inline graphic is a leaf of Inline graphic (Fig. 3c) or Inline graphic in Inline graphic (Fig. 3d) then Inline graphic due to Axioms 3 or 4. In the first case, Inline graphic is an unimportant node, which we define to mean that Inline graphic. In the second case, if Inline graphic is not an unimportant node in Inline graphic then Inline graphic must have a dominant branch, meaning that Inline graphic has a child Inline graphic such that Inline graphic. The third case, when Inline graphic in Inline graphic (Fig. 3e), is more complicated. If Inline graphic is an unimportant node in Inline graphic then Inline graphic as Inline graphic in Inline graphic, by Axiom 3. If Inline graphic in Inline graphic has a dominant branch Inline graphic in Inline graphic then Inline graphic as Inline graphic in Inline graphic, by Axiom 4. But if neither of those conditions hold then our axioms do not specify the size of the effect on Inline graphic.

Although we could modify Axiom 4 so that Inline graphic is always insensitive to the addition of relatively low-magnitude subtrees—thus increasing the index’s robustness—we argue that this would undermine its utility as a tree balance index. The balance of a node can be conventionally defined as the extent to which it splits its descendants into multiple subtrees of equal magnitude. By this definition, the attachment of a new, relatively low-magnitude subtree to a perfectly balanced node will create an imbalance even as—in fact especially as—the magnitude of this new subtree, relative to the magnitude of the node’s pre-existing descendants, approaches zero. Therefore, it is desirable for a tree balance index to be sensitive to certain changes in node degree, such that in the third scenario considered above, Inline graphic if and only if Inline graphic is an unimportant node or Inline graphic has a dominant branch (Fig. 3e).

Results

General Definition of Universal, Robust Tree Balance Indices

Our general definition depends on two continuous functions of subtree magnitudes:

  • An importance factor Inline graphic with Inline graphic as Inline graphic;

  • A balance score $W$ that assigns $W_i \in [0, 1]$ to each internal node $i$ such that $W_i = 0$ if and only if $d^+(i) = 1$, and $W_i = 1$ if and only if $i$ splits its descendants into at least two equal-magnitude subtrees.

To allow us to define $W$ more rigorously, let $\Delta^k$ denote the set of vectors with $k$ positive components that sum to unity:

$$\Delta^k = \left\{ (p_1, \dots, p_k) \in \mathbb{R}^k : p_j > 0 \text{ for all } j, \ \sum_{j=1}^{k} p_j = 1 \right\}.$$

Then, $W : \bigcup_{k \ge 1} \Delta^k \to [0, 1]$ is such that, for all $k \ge 1$ and all $(p_1, \dots, p_k) \in \Delta^k$:

  • (Associativity) For every permutation $\sigma$ of $\{1, \dots, k\}$, $W(p_{\sigma(1)}, \dots, p_{\sigma(k)}) = W(p_1, \dots, p_k)$;

  • (Maximum value) $W(p_1, \dots, p_k) = 1$ if and only if $k > 1$ and $p_1 = \dots = p_k = 1/k$;

  • (Minimum value) $W(p_1, \dots, p_k) = 0$ if and only if $k = 1$;

  • (Continuity) $W$ is a continuous function with respect to each of its arguments.

We then define a balance index in terms of subtree magnitudes as

$$J(T) = \frac{1}{\sum_{k \in \tilde{V}} g(S_k^*)} \sum_{i \in \tilde{V}} g(S_i^*) \, W\!\left( \frac{S_{j_1}}{S_i^*}, \dots, \frac{S_{j_d}}{S_i^*} \right), \qquad (1)$$

where $d = d^+(i)$ and $j_1, \dots, j_d$ are the children of node $i$ (see Table 1 for a recap of notation). A short proof that this type of index satisfies our five axioms for robustness and universality (Axioms 1–5) is presented in the Appendix.

The balance score $W$ in Equation 1 measures the extent to which an internal node splits its descendants into equal-magnitude subtrees. The importance factor $g$ assigns more weight to nodes that are the roots of large subtrees. In biological terms, this means giving more weight to types that have more descendants. Sackin’s and Colless’ indices similarly assign more weight to nodes that have more descendant leaves or are closer to the root. Mooers and Heard (1997) have argued that it is reasonable to put more weight on nodes deeper within the tree because “those nodes are the most informative, as the subclades they define are older and therefore sample longer periods of evolutionary time.”

A Specific Index Based on the Shannon Entropy

In defining a specific index, we start by opting for the simplest importance factor function: $g(S_i^*) = S_i^*$. The role of the balance score function $W$ is to quantify the extent to which a set of objects (specifically subtrees) have equal magnitude. A well-known index that satisfies the necessary conditions is the normalized Shannon entropy.

Assume a population is partitioned into $k$ types, with each type $j$ accounting for a proportion $p_j$. Then, the Shannon entropy with base $m$ is defined as $H_m = -\sum_{j=1}^{k} p_j \log_m p_j$. If all types have equal frequencies $p_j = 1/k$ then $H_m = \log_m k$. If the types have unequal sizes, then $H_m < \log_m k$. And if the abundance is mostly concentrated on one type $j$, such that $p_j \to 1$, then $H_m \to 0$.
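The three entropy properties just listed are easy to check numerically. The following is a small sketch; the function name `shannon_entropy` is an assumption chosen for illustration.

```python
import math


def shannon_entropy(proportions, base):
    """Shannon entropy of a vector of proportions, with logarithms in `base`."""
    # Terms with p = 0 contribute nothing, by the usual convention 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in proportions if p > 0)


k, m = 4, 2
equal = [1 / k] * k
unequal = [0.7, 0.1, 0.1, 0.1]
concentrated = [0.997, 0.001, 0.001, 0.001]

print(shannon_entropy(equal, m))         # equals log_m(k), i.e. 2 up to rounding
print(shannon_entropy(unequal, m))       # strictly less than log_m(k)
print(shannon_entropy(concentrated, m))  # close to 0

# Using base k normalizes the equal-frequency case to 1.
print(shannon_entropy(equal, k))
```

Choosing the base equal to the number of types is the normalization used for the node balance score in what follows.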

Let $C(i)$ denote the set of children (immediate descendants) of a node $i$, and for $j \in C(i)$ let $p_{ij} = S_j / S_i^*$ denote the relative magnitude of subtree $T_j$ compared to all subtrees attached to $i$.

A balance score based on the normalized Shannon entropy is then

$$W_i^1 = \sum_{j \in C(i)} W_{ij}^1, \quad \text{with } W_{ij}^1 = \begin{cases} -p_{ij} \log_{d^+(i)} p_{ij} & \text{if } p_{ij} > 0 \text{ and } d^+(i) \ge 2, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

For every internal node $i$, the number of frequencies $p_{ij}$ is equal to $d^+(i)$, and if all these frequencies are equal then $H_m = \log_m d^+(i)$, for any base $m$. Changing the base of the logarithm from $m$ to $d^+(i)$ is equivalent to dividing the sum by $\log_m d^+(i)$, which implies that $W_i^1 = 1$ when all the $p_{ij}$ are equal. From aforementioned properties of the Shannon entropy, it then follows that $0 \le W_i^1 \le 1$, with $W_i^1 = 0$ if and only if $d^+(i) = 1$, and $W_i^1 = 1$ if and only if $i$ splits its descendants into at least two equal-magnitude subtrees. Therefore, the following specific balance index satisfies our robustness and universality axioms:

$$J^1(T) = \frac{1}{\sum_{k \in \tilde{V}} S_k^*} \sum_{i \in \tilde{V}} S_i^* \sum_{j \in C(i)} W_{ij}^1. \qquad (3)$$

The calculation of $J^1$ is illustrated in Figure 4a.
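A minimal Python sketch of this entropy-based index follows, computed in a single bottom-up pass over the tree. The tree encoding (a `children` dictionary and a `sizes` dictionary), the function name `j1`, and the choice to raise an error for a tree with no descendants of positive size are all assumptions made for illustration; the example trees are hypothetical and are not those of Figure 4.

```python
import math


def j1(children, sizes, root):
    """Entropy-based balance index for a rooted tree with non-negative node sizes."""
    # Pre-order traversal; reversing it visits every child before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # S[i]: magnitude of the subtree rooted at i (own size plus all descendants).
    S = {}
    for node in reversed(order):
        S[node] = sizes.get(node, 0.0) + sum(S[c] for c in children.get(node, []))

    weighted, total = 0.0, 0.0
    for node in order:
        # Child subtrees of zero magnitude do not count towards the out-degree.
        kids = [c for c in children.get(node, []) if S[c] > 0]
        s_star = sum(S[c] for c in kids)  # magnitude excluding the node itself
        if s_star == 0:
            continue
        total += s_star
        if len(kids) >= 2:
            score = -sum(
                (S[c] / s_star) * math.log(S[c] / s_star, len(kids)) for c in kids
            )
            weighted += s_star * score
    if total == 0:
        # Assumption: treat a tree without positive-size descendants as an error.
        raise ValueError("no node has descendants of positive size")
    return weighted / total


symmetric = {"r": ["x", "y"], "x": ["a", "b"], "y": ["c", "d"]}
unit_leaves = {"a": 1, "b": 1, "c": 1, "d": 1}
print(j1(symmetric, unit_leaves, "r"))                       # 1 up to rounding

linear = {"r": ["a"], "a": ["b"]}
print(j1(linear, {"r": 1, "a": 1, "b": 1}, "r"))             # 0.0

# Attaching a vanishingly small clone to a leaf barely changes the index.
cherry = {"r": ["a", "b"], "a": ["tiny"]}
print(j1(cherry, {"r": 0, "a": 1, "b": 1, "tiny": 1e-9}, "r"))
```

Each node is visited a constant number of times, consistent with the linear-time claim in the abstract.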

Figure 4.

a) An example calculation of Inline graphic. Numbers shown inside nodes are the node sizes. b) All multifurcating leafy trees on six leaves without linear parts and with equally sized leaves, sorted and labelled by Inline graphic value.

The definition simplifies when we restrict the domain to the set of multifurcating leafy trees in which all leaves have equal size $f_0$. This includes cladograms in which internal nodes represent extinct ancestors and leaves correspond to equally important extant types. For all internal nodes $i$ in such trees, $S_i^* = S_i = n_i f_0$, where $n_i$ is the number of leaves of the subtree rooted at node $i$. The general definition of Equation 1 (with $g(S_i^*) = S_i^*$) can then be expressed in terms of node balance scores and leaf counts:

$$J(T) = \frac{1}{\sum_{k \in \tilde{V}} n_k} \sum_{i \in \tilde{V}} n_i W_i, \qquad (4)$$

and the specific definition of Equation 3 becomes

$$J^1(T) = -\frac{1}{\sum_{k \in \tilde{V}} n_k} \sum_{i \in \tilde{V}} \sum_{j \in C(i)} n_j \log_{d^+(i)} \frac{n_j}{n_i}. \qquad (5)$$

For example, Figure 4b shows the $J^1$ values of all leafy trees on six equally sized leaves without linear parts. Unlike Sackin’s and Colless’ indices, $J^1$ does not consider the caterpillar tree the least balanced of these trees.
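For leafy trees with equally sized leaves, the entropy-based index can be computed from leaf counts alone. The sketch below assumes the leaf-count form $-\sum_i \sum_{j \in C(i)} n_j \log_{d^+(i)}(n_j / n_i)$ divided by $\sum_i n_i$, with sums over internal nodes; the helper names `caterpillar`, `leaf_counts`, and `j1_leafy` are hypothetical. It also shows that, under this index, longer caterpillar trees score lower than shorter ones.

```python
import math


def caterpillar(n):
    """Build a bifurcating caterpillar tree on n >= 2 leaves as a child-list dict."""
    children = {}
    for k in range(n - 1):
        # Each spine node has one leaf child; the last spine node has two.
        nxt = ("v", k + 1) if k < n - 2 else ("leaf", n - 1)
        children[("v", k)] = [("leaf", k), nxt]
    return children, ("v", 0)


def leaf_counts(children, root):
    """Number of leaves below each node, computed bottom-up."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    counts = {}
    for node in reversed(order):
        kids = children.get(node, [])
        counts[node] = sum(counts[c] for c in kids) if kids else 1
    return counts


def j1_leafy(children, root):
    """Entropy-based index for a leafy tree whose leaves all have the same size."""
    n = leaf_counts(children, root)
    weighted, total = 0.0, 0.0
    for node, kids in children.items():
        if not kids:
            continue
        total += n[node]
        if len(kids) >= 2:
            weighted -= sum(n[c] * math.log(n[c] / n[node], len(kids)) for c in kids)
    return weighted / total


for leaves in (4, 8, 16):
    print(leaves, round(j1_leafy(*caterpillar(leaves)), 3))
# 4 0.889
# 8 0.686
# 16 0.474
```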

There are of course many alternative options for Inline graphic. For example, Colless’ index can be generalized to define a robust, though not universal, tree balance index on the domain of bifurcating trees (see Appendix). Since the Shannon entropy belongs to families of generalized entropies (Rényi 1961; Chao et al. 2014) parameterized by Inline graphic, the above reasoning can be generalized to define a balance score Inline graphic, and hence a robust, universal balance index Inline graphic, for every Inline graphic (see Appendix). Other candidates for Inline graphic include one minus the variance of the proportional subtree magnitudes or one minus the mean deviation from the median (Mir et al. 2018). We prefer Inline graphic mostly because, as we shall show, it is the only function for which Equation 4 is a generalization of the normalized inverse Sackin index.

Relationship with Colless’ Index

Like Colless’ index and Colless-like indices as previously defined, our new family of tree balance indices is based on the intuitive idea of assigning a value to each internal node, summing these values, and then normalizing the sum. A Colless-like index in the sense of Mir et al. (2018) depends on a function Inline graphic, which assigns node sizes, and a dissimilarity score Inline graphic, where Inline graphic is the set of non-null real vectors. Before normalization, such an index has the form

graphic file with name Equation12.gif

where Inline graphic are the children of node Inline graphic. The function Inline graphic assigns a size to each subtree by summing the node sizes: Inline graphic Neglecting the initial normalizing factor, our general definition (Equation 1) has a similar form and can be considered Colless-like in only a slightly broader sense. Our definition nevertheless differs in two important ways.

First, whereas the unbounded dissimilarity index Inline graphic measures both node imbalance and importance and is undefined for nodes with out-degree one, we split these two roles into a normalized balance score Inline graphic and an unbounded importance factor Inline graphic, and we assign a Inline graphic value (specifically zero) to nodes with out-degree one. This difference enables us to extend the balance index definition to trees with any degree distribution. It also makes it easy to normalize our indices for any tree, simply by dividing by the sum of the importance factors. Furthermore, our normalization is universal, rather than being based on comparison with other trees with the same number of leaves. For example, our Inline graphic indices judge long caterpillar trees less balanced than short ones (Fig. 5a), whereas Sackin’s index, Colless’ index, and the total cophenetic index consider all caterpillar trees on more than two leaves equally imbalanced.

Figure 5.

a) Inline graphic values for caterpillar trees and random trees generated from the Yule and uniform models (1000 trees per data point). All internal nodes have null size and all leaves have equal size. Solid black curves are the means; dashed curves are the 5th and 95th percentiles; and gray curves are Inline graphic divided by the corresponding expectation of Inline graphic (where Inline graphic is the number of leaves). b) Inline graphic distributions for random trees on 64 leaves generated from the Yule and uniform models (1000 trees per model). c) Inline graphic values for 100 random trees on 16 leaves, before and after applying a 1Inline graphic sensitivity threshold. These random trees were generated from the alpha-gamma model with Inline graphic and Inline graphic. d) Inline graphic values for the same set of random trees. e) Absolute change in normalized index values due to applying a 1Inline graphic sensitivity threshold. Results are based on 100 random trees for each number of leaves, generated as in (c) and (d). Inline graphic here is the Colless-like index with Inline graphic and Inline graphic is the mean deviation from the median, as recommended by Mir et al. (2018). f) Values of Inline graphic versus Inline graphic for random multifurcating trees on 16 leaves, with node sizes drawn from a continuous uniform distribution. The dashed reference line has slope 1.

Second, instead of assigning a size to each node as a function of its out-degree, we associate a node’s size with the size of the biological population it represents. This ensures that our indices can be made reliably robust by including population size data.

Relationship with Sackin’s Index

The sum Inline graphic is just another way of expressing Sackin’s index (summing over internal nodes instead of leaves). Therefore, Inline graphic in Equation 4 is essentially a weighted Sackin index (with each term in the sum weighted by the balance score Inline graphic) divided by the unweighted Sackin index. In the special, important case of full m-ary leafy trees (including full m-ary cladograms), the weighted sum in Inline graphic (Equation 5) simplifies yet further. Let Inline graphic denote the set of all trees on Inline graphic leaves such that all internal nodes have the same out-degree Inline graphic, every internal node has null size, and all leaf sizes are equal. Then, we obtain a remarkably simple relationship between Inline graphic and Sackin’s index:

Proposition 6.

Let Inline graphic be a tree on Inline graphic leaves with Inline graphic and Inline graphic for every internal node Inline graphic. Then  

Proposition 6.

 where Inline graphic is the Shannon entropy (base Inline graphic) of the proportional node sizes, Inline graphic is the magnitude of Inline graphic, and Inline graphic. If additionally all leaves of Inline graphic have the same size (so Inline graphic) then  

Proposition 6. (6)

 where Inline graphic is the minimum Inline graphic value of trees in Inline graphic.

The above result is somewhat surprising as it unifies our Colless-like index, which can be viewed as a weighted average of internal node balance scores, and Sackin’s index, which is the sum of all leaf depths. A short proof of Proposition 6 is presented in the Appendix. The converse result, which is also proved in the Appendix, justifies our choice of Inline graphic instead of alternative balance score functions:

Proposition 7.

Let Inline graphic be a tree balance index such that  

Proposition 7.

 where Inline graphic are the children of node Inline graphic, and Inline graphic is a balance score satisfying the conditions stated before Equation 1. Suppose that for all trees Inline graphic, Inline graphic. Then, Inline graphic.

The right-hand side of Equation 6 incidentally provides an alternative way of normalizing Sackin’s index on full m-ary leafy trees, including the bifurcating cladograms on which the index was originally defined. This normalized inverse Sackin index, which we can define as Inline graphic, provides a more satisfactory way of comparing trees that differ in their node degrees or leaf counts. Inline graphic if and only if the tree has minimal depth given Inline graphic, which is equivalent to being fully symmetric, and so Inline graphic is a sound tree balance index in the sense defined by Mir et al. (2018) (see Appendix for a proof). For Inline graphic, we have Inline graphic but Inline graphic as Inline graphic, which makes sense because trees with more leaves can be made less balanced. In particular, when Inline graphic is a caterpillar tree on Inline graphic leaves,

graphic file with name Equation16.gif

as illustrated in Figure 5a. The definition of Inline graphic can be naturally extended to the case Inline graphic by setting Inline graphic if Inline graphic is linear or has only one node. From this point of view, Inline graphic (a Colless-like index) is a generalization of Inline graphic (the normalized reciprocal of Sackin’s index) to the domain of trees with arbitrary degree distributions and arbitrary node sizes.
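The relationship in Proposition 6 can be checked numerically on small cladograms. The Python sketch below is illustrative only: the nested-tuple tree encoding and the names `j1_leafy`, `sackin`, `caterpillar`, and `symmetric` are assumptions made for this example and are not taken from the authors' R code. It evaluates the index as the weighted average of normalized node entropies described in the text (weights equal to subtree leaf counts, logarithm base equal to the node out-degree) and compares it with n log_m(n) divided by Sackin's index, which is what the entropy form reduces to when all leaves have equal size.

```python
import math

# Hypothetical encoding: a leaf is the string "leaf"; an internal node is a
# tuple of child subtrees. Leaves have size 1 and internal nodes have size 0.
LEAF = "leaf"

def n_leaves(tree):
    return 1 if tree == LEAF else sum(n_leaves(child) for child in tree)

def sackin(tree, depth=0):
    """Sackin's index: the sum of all leaf depths."""
    if tree == LEAF:
        return depth
    return sum(sackin(child, depth + 1) for child in tree)

def j1_leafy(tree):
    """Weighted average of normalized node entropies for a leafy tree."""
    numerator = 0.0
    denominator = 0.0
    stack = [tree]
    while stack:
        node = stack.pop()
        if node == LEAF:
            continue
        counts = [n_leaves(child) for child in node]
        total = sum(counts)
        denominator += total  # importance factor of this node
        if len(node) > 1:     # out-degree one contributes a zero balance score
            entropy = -sum(c / total * math.log(c / total, len(node)) for c in counts)
            numerator += total * entropy
        stack.extend(node)
    return numerator / denominator

def caterpillar(n):
    """Bifurcating caterpillar tree on n >= 2 leaves."""
    tree = (LEAF, LEAF)
    for _ in range(n - 2):
        tree = (LEAF, tree)
    return tree

def symmetric(depth, m=2):
    """Fully symmetric m-ary tree with m**depth leaves."""
    return LEAF if depth == 0 else tuple(symmetric(depth - 1, m) for _ in range(m))

if __name__ == "__main__":
    for tree, m in [(caterpillar(4), 2), (symmetric(3), 2), (symmetric(2, 3), 3)]:
        n = n_leaves(tree)
        print(n, j1_leafy(tree), n * math.log(n, m) / sackin(tree))
```

For clarity this sketch recounts leaves at every node, so it runs in quadratic time on caterpillar trees; it is meant as a check of the identity rather than as an efficient implementation.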

Distributions under the Yule and Uniform Models

An immediate corollary of Proposition 6 is that Inline graphic can be used to test whether a set of full m-ary cladograms is consistent with a particular tree-generating model, with exactly the same sensitivity as Sackin’s index. For example, Figure 5a,b shows Inline graphic distributions for random bifurcating trees in Inline graphic generated from the Yule and uniform models. These two distributions overlap only negligibly when the trees have at least a few dozen leaves.

Kirkpatrick and Slatkin (1993) showed that the expectation of Inline graphic for the Yule model is

graphic file with name Equation17.gif

where Inline graphic is Euler’s constant and Inline graphic is the number of leaves. Mir et al. (2013) have shown that the expectation of Inline graphic for the uniform model is

graphic file with name Equation18.gif

which approaches Inline graphic as the number of leaves Inline graphic approaches infinity (Blum et al. 2006; King and Rosenberg 2021). Consistent with Proposition 6, we find that for random trees in Inline graphic generated by either the Yule or the uniform model, a good approximation to the Inline graphic mean is Inline graphic divided by the corresponding expectation of Inline graphic (gray curves in Fig. 5a). As Inline graphic, these approximations approach Inline graphic and zero for the Yule and uniform models, respectively.
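The approximation described above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: it uses the standard closed forms for the expected Sackin index of bifurcating trees, namely 2n times the sum of 1/i for i from 2 to n under the Yule model, and n((2n-2)!!/(2n-3)!! - 1) under the uniform model, which may be written differently from the expressions in the cited papers. The function names are hypothetical.

```python
import math
from fractions import Fraction

def expected_sackin_yule(n):
    """Exact mean Sackin index under the Yule model: 2n * sum_{i=2}^{n} 1/i."""
    return 2 * n * sum(Fraction(1, i) for i in range(2, n + 1))

def expected_sackin_uniform(n):
    """Exact mean Sackin index under the uniform model:
    n * ((2n-2)!! / (2n-3)!! - 1), with the double-factorial ratio
    accumulated as the product of 2k / (2k - 1) for k = 1, ..., n-1."""
    ratio = Fraction(1)
    for k in range(1, n):
        ratio *= Fraction(2 * k, 2 * k - 1)
    return n * (ratio - 1)

def approx_mean_j1(n, expected_sackin):
    """n * log2(n) divided by the expected Sackin index (bifurcating case)."""
    return n * math.log2(n) / float(expected_sackin(n))

if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n,
              approx_mean_j1(n, expected_sackin_yule),
              approx_mean_j1(n, expected_sackin_uniform))
```

Because the Yule expectation grows like 2n ln n, the first ratio tends slowly towards 1/(2 ln 2), roughly 0.72, whereas the uniform expectation grows like n^(3/2), so the second ratio tends to zero.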

Robustness when Applied to Random Trees

To test the robustness of Inline graphic, we generated random multifurcating trees with node sizes drawn from a continuous uniform distribution and then compared Inline graphic values for these trees before and after applying a 1% sensitivity threshold. In the latter case, whenever the combined frequency of a clone and its descendants was below 1%, we merged the corresponding subtree with the clone’s parent, to simulate imperfect detection of rare types. As expected, the Inline graphic values for the two sets of trees were highly similar, with a median absolute difference of only 0.01 for trees that initially had 16 leaves (Fig. 5c). In contrast, the median absolute difference in the normalized Sackin’s index for the same two sets of trees (after resolving any linear parts in the manner of Fig. 2) was 0.20 (Fig. 5d), confirming that Inline graphic is much more robust to the omission of rare types.
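The thresholding step can be sketched as follows. This is one possible reading of the procedure rather than the code used for the analysis: the dictionary-based tree representation and the function names are assumptions, and "merging" is interpreted here as adding the whole population of a rare subtree to the size of its parent node.

```python
def subtree_magnitudes(children, size, root):
    """Return {node: size of the node plus the sizes of all its descendants}."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    magnitude = {}
    for node in reversed(order):  # descendants are handled before ancestors
        magnitude[node] = size[node] + sum(magnitude[c] for c in children.get(node, []))
    return magnitude

def apply_threshold(children, size, root, threshold=0.01):
    """Merge every subtree whose combined frequency is below the threshold
    into the parent of the subtree's root (an assumed interpretation)."""
    magnitude = subtree_magnitudes(children, size, root)
    total = magnitude[root]
    new_children = {}
    new_size = {root: size[root]}
    stack = [root]
    while stack:
        node = stack.pop()
        kept = []
        for child in children.get(node, []):
            if magnitude[child] / total < threshold:
                new_size[node] += magnitude[child]  # rare clone goes undetected
            else:
                kept.append(child)
                new_size[child] = size[child]
                stack.append(child)
        new_children[node] = kept
    return new_children, new_size

if __name__ == "__main__":
    children = {"r": ["a", "b", "c"], "b": ["d"]}
    size = {"r": 0, "a": 500, "b": 490, "c": 5, "d": 5}
    print(apply_threshold(children, size, "r"))
```

Working from the root downwards means that a rare subtree is absorbed as a whole at its highest rare node, and the total population size is conserved.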

As the number of leaves per tree increases, indices such as Sackin’s index and the Colless-like index recommended by Mir et al. (2018) become more robust to the removal of rare types (Fig. 5e). Like Inline graphic, these previously defined indices give more weight to nodes nearer the root. In larger trees, the nodes near the root tend to have large numbers of descendant leaves. It follows that removing a random sample of nodes from near the tips of the tree is likely to have only a modest effect on balance, as the tree’s core structure is preserved. In our results, this effect outweighs an increase in the proportion of nodes removed (a median of 7%, 19%, and 24% of nodes were removed from trees that originally had 16, 32, and 48 leaves, respectively, by applying the 1% sensitivity threshold). Therefore, the robustness benefit of Inline graphic is more pronounced in trees with fewer leaves.

Comparison with a Conservative Tree Balance Index

We additionally investigated the robustness of an alternative new tree balance index Inline graphic, defined as

graphic file with name Equation19.gif

 Inline graphic—which we denoted Inline graphic in a previous paper (Noble et al. 2022)—conforms to an alternative set of axioms that define what we call a conservative tree balance index. This index is maximal not for all trees with equal splits, but only for leafy trees with equal splits (see Appendix for details).

An advantage of Inline graphic is that, unlike Inline graphic, it is always insensitive to adding relatively low-magnitude subtrees to the root of the tree. Nevertheless, as the number of nodes increases, the difference between Inline graphic and Inline graphic rapidly diminishes, unless the root node is disproportionately large (Fig. 6). For example, when Inline graphic and Inline graphic are applied to random multifurcating trees on 16 leaves, with node sizes drawn from a continuous uniform distribution, the linear correlation between the two indices is 0.998 (Inline graphic is approximately 10% smaller than Inline graphic in this case; Fig. 5f). Accordingly, we find that Inline graphic is only slightly more robust than Inline graphic to the removal of rare types when applied to reasonably large random trees (Fig. 5e). For most practical purposes, we see no strong reason to favor Inline graphic over the simpler index Inline graphic.

Figure 6.


Example values of Inline graphic versus the conservative tree balance index Inline graphic. The latter index takes account of the size of each internal node, relative to the sum of its descendant node sizes.

Resolution Power

Mir et al. (2013) have argued that a useful tree balance index should have good resolution power, meaning a low probability of assigning the same value to two trees with the same number of leaves, chosen uniformly at random. Proposition 6 implies that, when applied to full m-ary leafy trees with equally sized leaves, Inline graphic has the same resolution power as Sackin’s index.

Correlations with Pre-existing Indices

To compare Inline graphic to Sackin’s index, a Colless-like index, and the total cophenetic index (defined in the Appendix) on a diverse set of trees, we generated 2000 random multifurcating leafy trees on 100 equally sized leaves using the alpha-gamma model (Chen et al. 2009) via the R package CollessLike (Mir et al. 2018). As shown in Figure 7, our new balance index correlates negatively with the previously defined imbalance indices on this set of random trees, indicating that it captures a similar notion of balance. The strongest correlation is between Inline graphic and the total cophenetic index (Spearman’s Inline graphic for all trees, and Inline graphic for trees with a mean out-degree greater than 3). The marginal histograms in Figure 7 additionally show that more than 85% of these random trees have balance values less than 0.25 according to the previously defined indices, whereas Inline graphic values are more evenly distributed between zero and one, with mean and median approximately equal to 0.6.

Figure 7.


Scatter plots of Inline graphic versus normalized Sackin’s, Colless-like, and total cophenetic indices for 2000 random multifurcating leafy trees with 100 equally sized leaves. Histograms in the margins show the marginal distributions. Dashed reference curves in the first panel are obtained by substituting Inline graphic into Equation 6 with Inline graphic and Inline graphic (upper curve) or Inline graphic (lower curve). We use the Colless-like index with Inline graphic and Inline graphic the mean deviation from the median, as recommended by Mir et al. (2018). Normalization of each index other than Inline graphic depends only on the number of leaves and so does not affect correlations. Trees were generated from the alpha-gamma model with Inline graphic and Inline graphic.

Sensitivity to Certain Changes in Node Degree

As explained in the Methods section, we consider it desirable for tree balance indices to be sensitive to certain changes in node degree. In Inline graphic this sensitivity arises because, in the calculation of the node balance score, the node out-degree features as the base of the logarithm. For example, consider a star tree Inline graphic with Inline graphic leaves each of size Inline graphic. Suppose we add to the root another Inline graphic leaves, each of size Inline graphic. If Inline graphic then Inline graphic since all the leaves have the same size. Otherwise

graphic file with name Equation20.gif

As Inline graphic decreases from Inline graphic towards zero, Inline graphic decreases monotonically to account for the growing loss of balance. And as Inline graphic, so Inline graphic. If we then remove these vanishingly small leaves, the value of Inline graphic will jump from Inline graphic back to 1 because the remaining leaves are of equal size. The sensitivity of Inline graphic to such changes in node degree is thus a straightforward consequence of the conventional notion of node balance. The size of the jump in Inline graphic is at most Inline graphic, and it approaches zero as Inline graphic (i.e., when the new nodes are relatively few). The analyses in Figure 5e,f show that such discontinuities do not compromise the overall robustness of Inline graphic to the removal of rare types.
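The star-tree example can be explored numerically. In the hedged Python sketch below, the only internal node is the root, so the index reduces to the entropy of the proportional leaf sizes taken to the base of the root's out-degree. The function names and argument names are hypothetical, and the limiting value log(n)/log(n + k) is what this entropy form gives when the k added leaves become vanishingly small.

```python
import math

def j1_star(n, f0, k=0, f1=0.0):
    """Index value for a star tree whose zero-size root has n leaves of
    size f0 and k further leaves of size f1 (all sizes strictly positive)."""
    sizes = [f0] * n + [f1] * k
    if any(s <= 0 for s in sizes):
        raise ValueError("leaf sizes must be strictly positive")
    degree = len(sizes)
    if degree < 2:
        return 0.0  # a node with out-degree one has balance score zero
    total = sum(sizes)
    return -sum(s / total * math.log(s / total, degree) for s in sizes)

def limit_small_leaves(n, k):
    """Limit of j1_star(n, f0, k, f1) as f1 tends to zero."""
    return math.log(n) / math.log(n + k)

if __name__ == "__main__":
    n, k = 4, 4
    for f1 in (1.0, 0.5, 0.1, 0.01, 1e-6):
        print(f1, j1_star(n, 1.0, k, f1))
    print("limit:", limit_small_leaves(n, k))
    print("after removing the small leaves:", j1_star(n, 1.0))
```

The printed values decrease as the added leaves shrink, approach the limit, and return to 1 once the small leaves are removed, which is the discontinuity discussed above.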

Implementation and Algorithmic Complexity

Assuming the identity of the root is known, our new indices can be computed from an adjacency matrix in Inline graphic time, where Inline graphic is the number of nodes (or the number of edges plus one). Subtree magnitudes are computed via depth-first search, which takes linear time, and the computation of the balance index takes at most Inline graphic steps, where Inline graphic is the adjacency list of node Inline graphic. Efficient R code for calculating Inline graphic is shared in an online repository (Noble and Lemant 2021).
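A linear-time computation along these lines might look as follows in Python. This is a minimal sketch, not the shared R implementation: it assumes the tree is supplied as adjacency lists (a dictionary mapping each node to its children) rather than an adjacency matrix, that node sizes are given in a second dictionary with missing entries treated as zero, and that the index is the weighted average of normalized node entropies with weights equal to subtree magnitudes excluding the subtree root. Returning zero for a tree with no qualifying internal node is also an assumption.

```python
import math

def j1_index(children, size, root):
    """Entropy-based balance index from adjacency lists, in linear time."""
    # Iterative depth-first search: each node is visited exactly once.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, ()))

    # Subtree magnitudes, filled in from the tips towards the root.
    magnitude = {}
    for node in reversed(order):
        magnitude[node] = size.get(node, 0.0) + sum(
            magnitude[c] for c in children.get(node, ()))

    numerator = 0.0
    denominator = 0.0
    for node in order:
        kids = children.get(node, ())
        below = sum(magnitude[c] for c in kids)  # magnitude excluding the node itself
        if below <= 0:
            continue  # leaves and nodes without sized descendants are skipped
        denominator += below
        if len(kids) >= 2:  # out-degree one gives a balance score of zero
            log_base = math.log(len(kids))
            for c in kids:
                p = magnitude[c] / below
                if p > 0:
                    numerator -= below * p * math.log(p) / log_base
    return numerator / denominator if denominator > 0 else 0.0

if __name__ == "__main__":
    caterpillar = {"r": ["l1", "a"], "a": ["l2", "b"], "b": ["l3", "l4"]}
    leaf_sizes = {"l1": 1, "l2": 1, "l3": 1, "l4": 1}
    print(j1_index(caterpillar, leaf_sizes, "r"))
```

Every child list is scanned a constant number of times, so the total work is proportional to the number of nodes plus the number of edges.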

Discussion

Here, we have defined a new class of tree balance index that unifies, generalizes, and in various ways improves upon previous definitions. Even when restricted to the tree types on which pre-existing indices are defined, our indices enable a more meaningful comparison of trees with different degree distributions or different numbers of leaves. Due to these advantages, our indices have the potential to supersede those in current use.

Our indices also enable important new applications. A challenge in comparing simulated phylogenies and trees inferred from data is that the former are exact, whereas the latter are often incomplete (Scott et al. 2020). In oncology, for example, it has been shown that whether or not a rare tumor clone is detected depends on both methodology and chance (Turajlic et al. 2018). Our balance indices largely solve this problem as they are insensitive to the omission of rare types, as demonstrated briefly here and more comprehensively in a companion paper (Noble et al. 2022).

Because of its unique relationship with Sackin’s index, we especially recommend Inline graphic, a weighted average of the normalized entropies of the internal nodes, as defined in general by Equation 3 and more simply for cladograms by Equation 5. Given that Sackin’s index has been well studied, it is convenient that Inline graphic inherits some of the properties of that index when applied to full m-ary cladograms, including its relatively high sensitivity in distinguishing between alternative tree-generating models (Kirkpatrick and Slatkin 1993; Agapow and Purvis 2002). Within our framework, Sackin’s index is seen not as a general balance index but rather as a normalizing factor, which works as a balance index only in the special case of full m-ary leafy trees (for which the numerator of Inline graphic is independent of tree topology).

Proposition 6 implies that determining the precise moments of Inline graphic for a model that generates full m-ary leafy trees is equivalent to determining the moments of the reciprocal of Sackin’s index. Figure 7 suggests that Inline graphic has interesting relationships with other indices such as the total cophenetic index. These are promising areas for further investigation.

Acknowledgments

We thank Laura Keller, Lisa Lamberti, Niko Beerenwinkel, Francesco Marass, Jack Kuipers, and Katharina Jahn for helpful conversations, and János Podani for advice on terminology.

Appendix

Definition of the Total Cophenetic Index

The cophenetic value Inline graphic of a pair of leaves Inline graphic is the depth of their lowest common ancestor. The total cophenetic index (Mir et al. 2013) is then the sum of the cophenetic values over all pairs of leaves:

graphic file with name Equation21.gif

where Inline graphic is the number of nodes and Inline graphic the number of leaves. As in Sackin’s index, the principle is that an unbalanced tree stretches more than a balanced tree. Being explicitly defined for all multifurcating trees, the total cophenetic index permits meaningful comparison of any two multifurcating trees on the same number of leaves.

For trees on Inline graphic leaves, the minimum of the total cophenetic index is reached on the star tree, with Inline graphic. The maximum is attained on the caterpillar tree:

graphic file with name Equation22.gif

Hence, a normalized version of the total cophenetic index is Inline graphic. This normalized imbalance index is not minimal for all fully symmetric trees. For example, the cophenetic value of the two leftmost leaves of the fully symmetric tree in Figure 1b is two, and so both the un-normalized and normalized cophenetic indices of this tree will be nonzero.
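The definition of the total cophenetic index translates directly into code. The Python sketch below is illustrative: the nested-tuple encoding and function names are assumptions, and the normalization divides by the binomial coefficient C(n, 3), which is the value the index takes on the caterpillar tree (the minimum, on the star tree, being zero). Rather than enumerating pairs of leaves, it counts at each internal node the pairs whose lowest common ancestor is that node.

```python
from math import comb

# Hypothetical encoding: a leaf is the string "leaf"; an internal node is a
# tuple of child subtrees.
LEAF = "leaf"

def cophenetic_summary(tree):
    """Return (number of leaves, total cophenetic index) for a rooted tree."""
    def visit(node, depth):
        if node == LEAF:
            return 1, 0
        counts, phi = [], 0
        for child in node:
            n_child, phi_child = visit(child, depth + 1)
            counts.append(n_child)
            phi += phi_child
        n = sum(counts)
        # Pairs of leaves whose lowest common ancestor is this node.
        pairs_here = comb(n, 2) - sum(comb(c, 2) for c in counts)
        return n, phi + depth * pairs_here
    return visit(tree, 0)

def normalized_total_cophenetic(tree):
    """Total cophenetic index divided by its caterpillar maximum C(n, 3)."""
    n, phi = cophenetic_summary(tree)
    return phi / comb(n, 3) if n >= 3 else 0.0

if __name__ == "__main__":
    caterpillar5 = (LEAF, (LEAF, (LEAF, (LEAF, LEAF))))
    symmetric4 = ((LEAF, LEAF), (LEAF, LEAF))
    star4 = (LEAF, LEAF, LEAF, LEAF)
    for tree in (caterpillar5, symmetric4, star4):
        print(cophenetic_summary(tree), normalized_total_cophenetic(tree))
```

The fully symmetric bifurcating tree on four leaves receives a nonzero value here, which illustrates the point made above that this normalized index is not minimal for all fully symmetric trees.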

Conservative Tree Balance Indices

Our axioms permit Inline graphic to change discontinuously when we add rare types to the root. This is because Axioms 3 and 4 consider the addition of subtrees that have vanishingly small magnitude relative to other subtrees excluding their roots, whereas the relative size of the root of the entire tree is immaterial. For example, consider a two-node linear tree Inline graphic in which the nonroot node has size Inline graphic, relative to the size of the root. Then Inline graphic by Axiom 4. But if we add another child to the root of Inline graphic, also of relative size Inline graphic, then the Inline graphic value of the new tree will be 1 (by Axiom 1), even as Inline graphic. To make our index robust in such cases, we can add another axiom:

Axiom A.1 (Root limit).

Let Inline graphic be a tree with root Inline graphic. Then, Inline graphic as Inline graphic.

But this new axiom conflicts with Axiom 1, which we must then modify, such that equal splits are no longer sufficient for maximal balance:

Axiom A.2 (Alternative maximum value).

Inline graphic for all trees Inline graphic, and Inline graphic only if Inline graphic has equal splits. Furthermore, if Inline graphic has equal splits and is a leafy tree then Inline graphic.

We will call a tree balance index conservative if it conforms to these two alternative axioms in addition to Axioms 2, 3, 4, and 5. This name is appropriate because Axiom A.1 implies that a tree will be considered imbalanced unless there is strong evidence to the contrary (in the form of a relatively small root node). Every conservative index is both universal and robust.

One way to define a class of conservative indices is to add to Equation 1 a nonroot dominance factor Inline graphic with Inline graphic as Inline graphic, and Inline graphic if and only if Inline graphic. We then obtain

graphic file with name Equation23.gif

with Inline graphic. The role of Inline graphic is to quantify the extent to which a node should be considered a leaf (which does not contribute to the index’s value) as opposed to an internal node (which does). Adding this factor has no effect on the balance values assigned to leafy trees, including cladograms, because if an internal node Inline graphic has zero size then Inline graphic. Setting Inline graphic, we can modify Equation 3 to obtain the specific conservative index

graphic file with name Equation24.gif

We previously used Inline graphic instead of Inline graphic to denote the above index (Noble et al. 2022).

Alternative Axioms Proposed by Fischer et al. (2021)

Shortly after we posted a preprint version of the current article, Fischer et al. (2021) posted a preprint in which they proposed two alternative axioms for nonrobust, nonuniversal tree balance indices, such as Sackin’s and Colless’ indices. In these axioms, Inline graphic denotes the set of rooted bifurcating trees with Inline graphic leaves, Inline graphic is the set of all rooted trees with Inline graphic leaves such that Inline graphic for all internal nodes Inline graphic, and the tree balance index is denoted Inline graphic.

Axiom A.3 (Fischer et al. minimum value).

The caterpillar tree with Inline graphic leaves is the unique tree minimizing Inline graphic on Inline graphic (if Inline graphic is defined on multifurcating trees) or on Inline graphic (if Inline graphic is defined only on bifurcating trees) for all Inline graphic.

Axiom A.4 (Fischer et al. maximum value).

The fully symmetric bifurcating tree with Inline graphic leaves is the unique tree maximizing Inline graphic on Inline graphic for all Inline graphic with Inline graphic.

These axioms can be compared with our axioms if we consider only leafy trees in which all leaves have equal size (such as cladograms). Axiom A.4 is then just a special case of our more general Axiom 1 because the fully symmetric bifurcating tree with Inline graphic leaves is the only tree in Inline graphic that has equal splits. But Axiom A.3 is not necessarily consistent with our Axiom 2. In particular, as shown in Figure 4b, our index Inline graphic does not comply with Axiom A.3 in the case of multifurcating leafy trees. We can resolve this incompatibility with the following simplification:

Axiom A.5 (Alternative Fischer et al. minimum value).

The caterpillar tree with Inline graphic leaves is the unique tree minimizing Inline graphic on Inline graphic for all Inline graphic (whether or not Inline graphic is defined on multifurcating trees).

Inline graphic is consistent with Axiom A.5 because, when we consider only bifurcating leafy trees in which all leaves have equal size, Inline graphic is equal to Inline graphic (by Proposition 6), which is inversely proportional to Inline graphic by definition, and the caterpillar tree is the unique bifurcating tree that maximizes Inline graphic (Fischer et al. 2021). Although Axiom 1 does not necessarily imply Axiom A.5, it is reasonable to expect useful universal tree balance indices to satisfy both conditions.

Proof that the Index of Equation 1 Satisfies Our Five Axioms

Proof. Axiom 1 (Maximum value):

We have Inline graphic since Inline graphic and Inline graphic lie between zero and one by definition. Also if any internal node Inline graphic of tree Inline graphic does not split its descendants into at least two equal-magnitude subtrees then Inline graphic by definition and so


Proof. (Axiom 1 (Maximum value):

Now, let Inline graphic be a tree such that every internal node splits its descendants into at least two equal-magnitude subtrees. Then Inline graphic for all Inline graphic by definition. Hence,


Proof. (Axiom 1 (Maximum value):

Axiom 2 (Minimum value): We have Inline graphic since Inline graphic and Inline graphic are always non-negative by definition. Also if Inline graphic is a linear tree then Inline graphic for all Inline graphic by definition, and hence Inline graphic. Conversely, if some internal node Inline graphic has Inline graphic then Inline graphic by definition and, because Inline graphic must be positive by definition, we must have Inline graphic.

Axiom 3 (Insensitivity): Adding a subtree to a leaf Inline graphic changes the tree balance value via the contributions of two sets of nodes: the internal nodes of Inline graphic (including Inline graphic), and all other internal nodes. For each internal node, Inline graphic, as Inline graphic so also Inline graphic (because Inline graphic), which implies Inline graphic by definition, and hence all such contributions approach zero. The contribution of all other internal nodes also approaches zero because Inline graphic and Inline graphic are continuous by definition.

Axiom 4 (Linear limit): Let Inline graphic with Inline graphic. Without loss of generality, let Inline graphic denote the original child of Inline graphic, and Inline graphic denote the newly added children of Inline graphic. Adding subtrees to Inline graphic changes the tree balance value via the contributions of the newly added nodes and of node Inline graphic. As Inline graphic, so Inline graphic for all Inline graphic. This implies that Inline graphic and hence Inline graphic by definition for all Inline graphic. Therefore, the first contribution approaches zero. Also as Inline graphic, we have Inline graphic, and so Inline graphic by definition. Therefore, the second contribution also approaches zero.

Axiom 5 (Continuity): The continuity of Inline graphic follows immediately from the continuity of Inline graphic and Inline graphic. □

New Generalizations of Sackin’s and Colless’ Indices

The number of distinct subtrees that contain a given leaf Inline graphic is equal to its number of ancestors, which is the same as Inline graphic, the depth of Inline graphic. Hence, Sackin’s index is equivalent to the sum of the leaf counts of the subtrees rooted at each internal node. By extension, we can define a new, more general form of Sackin’s index that accounts for node sizes:

graphic file with name Equation27.gif

where Inline graphic is the magnitude of the subtree rooted at node Inline graphic, excluding the root. In the special case of leafy trees in which all leaves have size one, we recover Inline graphic. This new index is not very useful for assessing tree balance because it increases with the total tree magnitude, but in our framework, it performs an important role as a normalizing factor.

If we let Inline graphic denote the magnitude of the left branch of the subtree rooted at Inline graphic, and Inline graphic denote the magnitude of the right branch, then we can generalize Colless’ index to account for node sizes in bifurcating trees:

graphic file with name Equation28.gif

where Inline graphic. This definition reduces to Inline graphic in the case of leafy trees in which all leaves have size one. The right-hand expression above clarifies that the contribution of each node to Colless’ index is the product of the node’s importance (i.e., its number of descendants) and its balance (the degree to which the node splits its descendants into two equal-magnitude subtrees). We further see that Inline graphic for all trees Inline graphic (because Inline graphic for all Inline graphic), which suggests the normalization

graphic file with name Equation29.gif

This new generalization of Colless’ index is more robust than the conventional form, in the sense that its value is insensitive to the addition or removal of relatively small nodes. Inline graphic also enables meaningful comparison of trees with different numbers of leaves. However, the problem remains that Inline graphic applies only to bifurcating trees.

Other Balance Indices Based on Generalized Entropies

As defined by Chao et al. (2014), generalized entropies for Inline graphic are

graphic file with name Equation30.gif

Parameter Inline graphic determines the sensitivity to the type frequencies. Inline graphic is simply the richness (minus 1) of the population, which corresponds to ignoring the frequencies and just counting the types. For Inline graphic, rare types are given more weight than implied by their proportion, whereas for Inline graphic abundant types matter more. Inline graphic is the Gini–Simpson coefficient. In the limit Inline graphic we recover the Shannon entropy Inline graphic.

For Inline graphic, Inline graphic attains its maximum value if and only if all types have equal frequency Inline graphic:

graphic file with name Equation31.gif

We can therefore define a normalized balance score Inline graphic for Inline graphic and Inline graphic:

graphic file with name Equation32.gif

Similarly, one can define Inline graphic for Inline graphic based on the entropy defined by Rényi (1961):

graphic file with name Equation33.gif

In either case, a balance index Inline graphic satisfying our axioms is

graphic file with name Equation34.gif

for any Inline graphic. And in either case, Inline graphic as Inline graphic.
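The entropy-based balance scores in this section can be sketched in Python. The sketch assumes the generalized entropy of order q takes the form (1 - sum of p_i^q)/(q - 1), which is consistent with the special cases listed above (richness minus one at q = 0, the Gini-Simpson coefficient at q = 2, and the Shannon entropy in the limit q = 1), and that the Rényi entropy is ln(sum of p_i^q)/(1 - q). In both cases the node balance score is taken to be the entropy divided by its maximum for the same number of children. Function names are hypothetical.

```python
import math

def generalized_entropy(p, q):
    """Generalized entropy of order q; Shannon entropy (natural log) at q == 1."""
    p = [x for x in p if x > 0]
    if q == 1:
        return -sum(x * math.log(x) for x in p)
    return (1 - sum(x ** q for x in p)) / (q - 1)

def renyi_entropy(p, q):
    """Renyi entropy of order q; Shannon entropy (natural log) at q == 1."""
    p = [x for x in p if x > 0]
    if q == 1:
        return -sum(x * math.log(x) for x in p)
    return math.log(sum(x ** q for x in p)) / (1 - q)

def balance_score(child_magnitudes, q, renyi=False):
    """Entropy of the proportional child magnitudes divided by its maximum,
    which is attained when all children have equal magnitude (q > 0)."""
    n = len(child_magnitudes)
    if n < 2:
        return 0.0  # out-degree one is assigned a balance score of zero
    total = sum(child_magnitudes)
    p = [s / total for s in child_magnitudes]
    entropy = renyi_entropy if renyi else generalized_entropy
    return entropy(p, q) / entropy([1 / n] * n, q)

if __name__ == "__main__":
    for q in (0.5, 1, 2):
        print(q, balance_score([5, 1, 1, 1], q), balance_score([5, 1, 1, 1], q, renyi=True))
```

Either score can then be substituted for the Shannon-based score in the weighted average over internal nodes.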

Proof of Proposition 6

Proof.

By definition of Inline graphic, if Inline graphic is a tree on Inline graphic leaves with Inline graphic and Inline graphic for every internal node Inline graphic then


Proof.

The sum of subtree magnitudes over the set of all internal nodes is equal to the sum of Inline graphic multiplied by leaf size over the set of all leaves:


Proof.

Summing first over the internal nodes and then over their children gives the same result:


Proof.

Let Inline graphic denote the ancestor of node Inline graphic at distance Inline graphic, with Inline graphic and Inline graphic (the root) for all Inline graphic. Then by extension,


Proof.

for any function Inline graphic. In particular, we have


Proof.

Substituting this result into the expression for Inline graphic, we find


Proof.

The right-hand sum is a telescoping series that collapses to give


Proof.

Now since Inline graphic is a leaf, Inline graphic. Also Inline graphic. Hence,


Proof.

If additionally all leaves Inline graphic of Inline graphic have the same size Inline graphic then Inline graphic, Inline graphic, and Inline graphic, which implies Inline graphic

Proof of Proposition 7

Proof.

Since Inline graphic, the conditions are equivalent to


Proof.

where Inline graphic are the children of Inline graphic. Let Inline graphic be a tree in Inline graphic and Inline graphic be an internal node of Inline graphic. Then, Inline graphic and Inline graphic for every child Inline graphic of Inline graphic. Therefore


Proof.

Also, Inline graphic, so we have


Proof.

 


Proof.

Since Inline graphic, this implies


Proof.

Proof that Inline graphic is a Sound Tree Balance Index

Proof.

By the definition of Mir et al. (2018), a sound tree balance index Inline graphic is such that Inline graphic is maximal if and only if Inline graphic is fully symmetric. The fully symmetric full m-ary tree on Inline graphic leaves is the unique tree that minimizes Inline graphic among full m-ary trees on Inline graphic leaves. This minimum value is Inline graphic (since every leaf Inline graphic has the same depth Inline graphic). Because Inline graphic is defined only on full m-ary trees, it follows that Inline graphic is maximal if and only if Inline graphic is fully symmetric. □

Contributor Information

Jeanne Lemant, Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland; Department of Epidemiology and Public Health, Swiss Tropical and Public Health Institute, Kreuzstrasse 2, 4123 Allschwil, Switzerland; University of Basel, Petersplatz 1, 4001 Basel, Switzerland.

Cécile Le Sueur, Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland.

Veselin Manojlović, Department of Mathematics, City, University of London, Northampton Square, London EC1V 0HB, UK.

Robert Noble, Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland; Department of Mathematics, City, University of London, Northampton Square, London EC1V 0HB, UK.

Funding

This work was supported by the National Cancer Institute at the National Institutes of Health [U54CA217376 to R.N. and V.M.]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  1. Agapow  P.M., Purvis  A.  2002. Power of eight tree shape statistics to detect nonrandom diversification: a comparison by simulation of two models of cladogenesis. Syst. Biol.  51(6): 866–872. [DOI] [PubMed] [Google Scholar]
  2. Blum  M.G.B., François  O., Janson  S.  2006. The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance. Ann. Appl. Prob.  16(4): 2195–2214. [Google Scholar]
  3. Chao  A., Chiu  C.-H., Jost  L.  2014. Unifying species diversity, phylogenetic diversity, functional diversity, and related similarity and differentiation measures through hill numbers. Annu. Rev. Ecol. Evol. Syst.  45 (1): 297–324. [Google Scholar]
  4. Chen B., Ford D., Winkel M. 2009. A new family of Markov branching trees: the alpha-gamma model. Electron. J. Prob. 14:400–430.
  5. Chkhaidze K., Heide T., Werner B., Williams M.J., Huang W., Caravagna G., Graham T.A., Sottoriva A. 2019. Spatially constrained tumour growth affects the patterns of clonal selection and neutral drift in cancer genomic data. PLoS Comput. Biol. 15(7):e1007243.
  6. Colless D.H. 1982. Review of phylogenetics: the theory and practice of phylogenetic systematics. Syst. Zool. 31(1):100–104.
  7. Davis A., Gao R., Navin N. 2017. Tumor evolution: linear, branching, neutral or punctuated? Biochim. Biophys. Acta 1867(2):151–161.
  8. Fischer M., Herbst L., Kersting S., Kühn L., Wicke K. 2021. Tree balance indices: a comprehensive survey. arXiv preprint arXiv:2109.12281.
  9. Jamal-Hanjani M., Wilson G.A., McGranahan N., Birkbak N.J., Watkins T.B.K., Veeriah S., Shafi S., Johnson D.H., Mitter R., Rosenthal R., Salm M., Horswell S., Escudero M., Matthews N., Rowan A., Chambers T., Moore D.A., Turajlic S., Xu H., Lee S.-M., Forster M.D., Ahmad T., Hiley C.T., Abbosh C., Falzon M., Borg E., Marafioti T., Lawrence D., Hayward M., Kolvekar S., Panagiotopoulos N., Janes S.M., Thakrar R., Ahmed A., Blackhall F., Summers Y., Shah R., Joseph L., Quinn A.M., Crosbie P.A., Naidu B., Middleton G., Langman G., Trotter S., Nicolson M., Remmen H., Kerr K., Chetty M., Gomersall L., Fennell D.A., Nakas A., Rathinam S., Anand G., Khan S., Russell P., Ezhil V., Ismail B., Irvin-Sellers M., Prakash V., Lester J.F., Kornaszewska M., Attanoos R., Adams H., Davies H., Dentro S., Taniere P., O’Sullivan B., Lowe H.L., Hartley J.A., Iles N., Bell H., Ngai Y., Shaw J.A., Herrero J., Szallasi Z., Schwarz R.F., Stewart A., Quezada S.A., Le Quesne J., Van Loo P., Dive C., Hackshaw A., Swanton C. 2017. Tracking the evolution of non–small-cell lung cancer. N. Engl. J. Med. 376(22):2109–2121.
  10. King M.C., Rosenberg N.A. 2021. A simple derivation of the mean of the Sackin index of tree balance under the uniform model on rooted binary labeled trees. Math. Biosci. 342:108688.
  11. Kirkpatrick M., Slatkin M. 1993. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution 47(4):1171–1181.
  12. Maley C.C., Aktipis A., Graham T.A., Sottoriva A., Boddy A.M., Janiszewska M., Silva A.S., Gerlinger M., Yuan Y., Pienta K.J., Anderson K.S., Gatenby R., Swanton C., Posada D., Wu C.-I., Schiffman J.D., Shelley Hwang E., Polyak K., Anderson A.R.A., Brown J.S., Greaves M., Shibata D. 2017. Classifying the evolutionary and ecological features of neoplasms. Nat. Rev. Cancer 17(10):605–619.
  13. Mir A., Rosselló F., Rotger L.A. 2013. A new balance index for phylogenetic trees. Math. Biosci. 241(1):125–136.
  14. Mir A., Rotger L., Rosselló F. 2018. Sound Colless-like balance indices for multifurcating trees. PLoS One 13(9):e0203401.
  15. Mooers A.O., Heard S.B. 1997. Inferring evolutionary process from phylogenetic tree shape. Q. Rev. Biol. 72(1):31–54.
  16. Noble R., Lemant J. 2021. RUtreebalance: robust, universal tree balance indices. https://zenodo.org/badge/latestdoi/399934945.
  17. Noble R., Burri D., Le Sueur C., Lemant J., Viossat Y., Kather J.N., Beerenwinkel N. 2022. Spatial structure governs the mode of tumour evolution. Nat. Ecol. Evol. 6(2):207–217.
  18. Podani J. 2013. Tree thinking, time and topology: comments on the interpretation of tree diagrams in evolutionary/phylogenetic systematics. Cladistics 29(3):315–327.
  19. Rényi A. 1961. On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, vol. 4; University of California Press. p. 547–562.
  20. Sackin M.J. 1972. “Good” and “bad” phenograms. Syst. Biol. 21(2):225–226.
  21. Scott J.G., Maini P.K., Anderson A.R.A., Fletcher A.G. 2020. Inferring tumor proliferative organization from phylogenetic tree measures in a computational model. Syst. Biol. 69(4):623–637.
  22. Shao K.-T., Sokal R.R. 1990. Tree balance. Syst. Zool. 39(3):266.
  23. Turajlic S., Xu H., Litchfield K., Rowan A., Horswell S., Chambers T., O’Brien T., Lopez J.I., Watkins T.B.K., Nicol D., Stares M., Challacombe B., Hazell S., Chandra A., Mitchell T.J., Au L., Eichler-Jonsson C., Jabbar F., Soultati A., Chowdhury S., Rudman S., Lynch J., Fernando A., Stamp G., Nye E., Stewart A., Xing W., Smith J.C., Escudero M., Huffman A., Matthews N., Elgar G., Phillimore B., Costa M., Begum S., Ward S., Salm M., Boeing S., Fisher R., Spain L., Navas C., Grönroos E., Hobor S., Sharma S., Aurangzeb I., Lall S., Polson A., Varia M., Horsfield C., Fotiadis N., Pickering L., Schwarz R.F., Silva B., Herrero J., Luscombe N.M., Jamal-Hanjani M., Rosenthal R., Birkbak N.J., Wilson G.A., Pipek O., Ribli D., Krzystanek M., Csabai I., Szallasi Z., Gore M., McGranahan N., Van Loo P., Campbell P., Larkin J., Swanton C. 2018. Deterministic evolutionary trajectories influence primary tumor growth: TRACERx renal. Cell 173(3):595–610.e11.

Articles from Systematic Biology are provided here courtesy of Oxford University Press