A new universal system of tree shape indices

Robert Noble; Kimberley Verity

doi:10.1101/2023.07.17.549219

This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Dec 17:2023.07.17.549219. [Version 4] doi: 10.1101/2023.07.17.549219

PMCID: PMC10705254 PMID: 38077096

Abstract

The comparison and categorization of tree diagrams is fundamental to large parts of biology, linguistics, computer science, and other fields, yet the indices currently applied to describing tree shape have important flaws that complicate their interpretation and limit their scope. Here we introduce a new system of indices with no such shortcomings. Our indices account for node sizes and branch lengths and are robust to small changes in either attribute. Unlike currently popular phylogenetic diversity, phylogenetic entropy, and tree balance indices, our definitions assign interpretable values to all rooted trees and enable meaningful comparison of any pair of trees. Our self-consistent definitions further unite measures of diversity, richness, balance, symmetry, effective height, effective outdegree, and effective branch count in a coherent system, and we derive numerous simple relationships between these indices. The main practical advantages of our indices are in 1) quantifying diversity in non-ultrametric trees; 2) assessing the balance of trees that have non-uniform branch lengths or node sizes; 3) comparing the balance of trees with different leaf counts or outdegrees; 4) obtaining a coherent, generic, multidimensional quantification of tree shape that is robust to sampling error and inferential error. We illustrate these features by comparing the shapes of trees representing the evolution of HIV and of Uralic languages, and trees generated by computational models of tumour evolution. Given the ubiquity of tree structures, we identify a wide range of applications across diverse domains.

Keywords: tree indices, tree shape, tree balance, phylogenetic diversity, phylogenetic entropy, rooted trees

Tree shape indices that quantify key properties of rooted trees – such as the effective number of leaves, average out-degree, and balance – have myriad applications. Conservation biologists use phylogenetic diversity values to determine which actions will preserve the most biodiversity (Tucker et al., 2017; Veron et al., 2019). Tree balance indices are used to compare models and to infer parameter values in systematic biology (Mooers and Heard, 1997; Purvis and Agapow, 2002), virology (Chindelevitch et al., 2021; Barzilai and Schrago, 2023), epidemiology (Leventhal et al., 2012; Colijn and Gardy, 2014), and oncology (Scott et al., 2020; Noble et al., 2022). Computer scientists seek to balance binary trees to make them more efficient as data structures (Albers and Westbrook, 2005). Numerous indices designed for such tasks have previously been proposed (Pavoine and Bonsall, 2011; Tucker et al., 2017; Fischer et al., 2023). However, no existing index provides a general purpose method for fairly evaluating the shape of any rooted tree. This paper introduces a system of such indices.

Rather than simply adding to a profusion of indices, our aim here is to solve important open problems: How can we modify existing phylogenetic diversity and entropy indices so that they are meaningful when applied to non-ultrametric trees? How can we define a tree balance index that accounts for both branch lengths and node sizes? How can we likewise generalize the concepts of outdegree, branch count, and node count? How can we unite all these types of tree shape index in a coherent system, so that their interrelationships can be easily understood? Only by solving these problems can we arrive at a general purpose method for fairly evaluating the shape of any rooted tree.

Among current diversity indices for generic rooted trees, arguably the most sophisticated are those introduced by Chao et al. (2010), which generalize and unify previous definitions of Hill (1973), Faith (1992), Jost (2006) and Allen et al. (2009). In quantifying the effective number of types in a data set, the ${}^{q}{\overline{D}}$ indices of Chao et al. (2010) account for both node sizes (type frequencies) and branch lengths (degree of dissimilarity between types). Nevertheless, a critical shortcoming of these indices, which limits their applications, is that they assign meaningful values only to leafy ultrametric trees (that is, trees in which the only non-zero-sized nodes are leaves, all equally distant from the root) (Chao et al., 2010; Leinster and Cobbold, 2012). We will further show that the ${}^{q}{\overline{D}}$ indices of Chao et al. (2010) are not fully self-consistent and have peculiar properties for $q > 1$. Moreover, the relationships between these diversity indices and other types of index, such as tree balance indices, are generally opaque, which thwarts multi-dimensional analysis.

Conventional tree balance and imbalance indices – including those attributed to Sackin (1972) and Colless (1982), the total cophenetic index of Mir et al. (2013), and others reviewed by Fischer et al. (2023) – are also flawed. These indices, which are meant to quantify the extent to which each internal node splits its descendants into equally sized subtrees, are not defined for all rooted trees, do not permit meaningful comparison of trees with differing leaf counts, and are highly sensitive to the addition or removal of rare types (Noble et al., 2022; Lemant et al., 2022). We recently introduced a family of tree balance indices that solve these problems and that have additional desirable properties (Lemant et al., 2022). Those indices are defined for any degree distribution, account for node sizes, and enable meaningful comparison of trees with different numbers of leaves. However, they do not account for branch lengths, which restricts their applications because branch lengths often convey important information (for example, genetic distance in virus evolution, or elapsed time in the evolution of species).

Here we define a new system of indices that resolve all the aforementioned problems by accounting for node sizes and branch lengths, being robust to small changes to the tree, assigning meaningful values to all rooted trees, and belonging to a coherent framework, so that mathematical relationships between the indices are well characterized. Our system captures fundamental properties such as diversity (effective number of leaves), tree balance (the extent to which each internal node splits its descendants into equally sized subtrees), and bushiness (average effective outdegree). Given that our indices share the desirable properties but not the flaws of prior indices, we discuss their potential to supersede current methods in a wide range of applications.

Materials and Methods

Hill numbers as a basis for defining robust, universal, interpretable tree indices

A rooted tree is a tree in which one node is designated the root and all branches are directed away from the root. Our aim is to define indices that are useful for categorizing and comparing the shapes of unlabelled rooted trees that have three attributes: tree topology, non-negative node sizes, and non-negative branch lengths. These indices should be generic and model-agnostic, meaning that they make no assumptions about what the tree represents or the process by which it was generated. In evolutionary trees, for example, the size of a node can correspond to the population size of the respective biological type, or simply to whether a type is extant (node size 1) or extinct (0), while branch lengths can represent genetic distance, morphological difference, or elapsed time. Linguists use similar structures with unequal branch lengths to study the evolution of languages (Honkola et al., 2013; Atkinson and Gray, 2005). In computing, the size of a search tree node corresponds to the probability of it being visited.

In this general context, a useful index should be robust, universal, and interpretable. A loose definition of robustness is that small changes to the tree have only small effects on the index value, except where sensitivity is desirable; universal means that the index is defined for all rooted trees; and interpretable implies a simple, consistent interpretation, enabling meaningful comparison of any pair of rooted trees. Lemant et al. (2022) provide more rigorous, axiomatic definitions. We follow these axiomatic definitions and call a tree index with all three properties an RUI index. In practical terms, robustness implies that an index is relatively insensitive to the effects of issues such as sampling error, inferential error, omission of rare types, imperfect genetic sequencing, and incomplete resolution of ancestral relationships. All our indices are dimensionless, but the diversity indices can be re-scaled in terms of the branch length unit where desired.

We begin by recalling the family of diversity indices attributed to Hill (1973). These Hill numbers are functions of a set of proportions $P = \{p_{1}, \dots, p_{n}\}$ with $0 ⩽ p ⩽ 1$ for all $p \in P$ and $\sum_{i = 1}^{n} p_{i} = 1$ . Every Hill number of order $q ⩾ 0$ can be written as

{}^{q}D(P) := \left( \sum_{i = 1}^{n} p_{i}^{q} \right)^{\frac{1}{1 - q}} \quad \text{with} \quad {}^{1}D(P) := \lim_{q \to 1} {}^{q}D(P) = \exp \left( - \sum_{i = 1}^{n} p_{i} \log p_{i} \right) .

Hence ${}^{q}D$ is the exponential of the Rényi entropy of order $q$ (Rényi, 1961), which we will denote ${}^{q}H$ , and ${}^{1}H$ is Shannon’s entropy (Shannon, 1948). Another important special case is

{}^{0}D(P) := | \{ p \in P : p > 0 \} | ,

which is simply the number of types, or richness. Following Pielou (1966) and Jost (2010), we further define the evenness indices

{}^{q}J(P) := \begin{cases} \frac{\log {}^{q}D(P)}{\log {}^{0}D(P)} \in [0, 1] & \text{if } {}^{0}D(P) > 1 \\ 1 & \text{otherwise.} \end{cases}

For completeness, we set ${}^{q}D (\emptyset) = 0$ and ${}^{q}J (\emptyset) = 1$ .
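The definitions above translate directly into a few lines of code. The following is a minimal sketch, not an implementation from the paper; the function names `hill_number` and `evenness` and the example proportions are illustrative assumptions.

```python
import math

def hill_number(p, q):
    """Hill number of order q >= 0 for a collection of proportions p."""
    p = [x for x in p if x > 0]           # zero proportions never contribute
    if not p:
        return 0.0                         # convention: D of the empty set is 0
    if q == 0:
        return float(len(p))               # richness
    if math.isclose(q, 1.0):
        # limit q -> 1: exponential of Shannon entropy
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

def evenness(p, q):
    """Evenness J = log(qD) / log(0D), set to 1 when richness is at most 1."""
    richness = hill_number(p, 0)
    if richness <= 1:
        return 1.0
    return math.log(hill_number(p, q)) / math.log(richness)

even = [0.25, 0.25, 0.25, 0.25]            # four equally sized types
skewed = [0.7, 0.1, 0.1, 0.1]              # one dominant type
for p in (even, skewed):
    print(hill_number(p, 0), hill_number(p, 1), hill_number(p, 2), evenness(p, 1))
```

For the equal proportions every order returns four effective types and evenness one, whereas for the skewed proportions the diversity falls as $q$ increases while the richness stays at four.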

We can apply these indices to a rooted tree $T$ simply by equating $P (T) = \{p_{1}, \dots, p_{n}\}$ to the proportional sizes of the $n$ nodes of $T$ , including the internal nodes. Assigning non-zero sizes to internal nodes makes sense, for example, in the case of a tumour clone tree (Noble et al., 2022). The richness index ${}^{0}D (T) = {}^{0}D (P (T))$ then quantifies the number of non-zero-sized nodes in the tree, which we will refer to as the counted nodes. In an evolutionary tree, counted nodes correspond to extant types. For each $q > 0$ , the diversity index ${}^{q}D (T) = {}^{q}D (P (T))$ can be interpreted as an effective number of counted nodes, while ${}^{q}J (T) = {}^{q}J (P (T))$ gauges the evenness of the counted node sizes.

Clearly ${}^{q}D$ and ${}^{q}J$ are insensitive to small changes to proportional node sizes. For $q > 0$ , ${}^{q}D$ is also generally robust to the addition or removal of relatively small nodes (and the degree of robustness increases with $q$ ), whereas ${}^{0}D$ and ${}^{q}J$ are not, as is appropriate for indices that are meant to quantify richness and evenness. ${}^{q}D$ and ${}^{q}J$ are universal because they can be applied to any set of node sizes, and they are interpretable as described above. Yet although these indices are RUI, they are inadequate for assessing tree shape because they depend only on node sizes, ignoring both tree topology and branch lengths.

Many indices that capture aspects of tree shape have previously been defined (surveys include Pavoine and Bonsall (2011), Tucker et al. (2017), and Fischer et al. (2023)) but, to the best of our knowledge, none is RUI (Table 1). We address this deficiency by developing new RUI tree indices that extend the basic indices ${}^{q}D$ and ${}^{q}J$ to account for tree topology and branch lengths. We do this using three types of weighted mean, which we refer to as the longitudinal mean, the node-wise mean, and the star mean (Table 2). Our consistent definitions ensure that our indices can be precisely related to each other and to ${}^{q}D$ and ${}^{q}J$ in numerous meaningful ways, so that all the indices belong to a single coherent system.

Table 1.

Properties of some previously defined non-RUI tree indices (see main text for definitions and citations.)

	Robust		Universal	Interpretable

	Robustly accounts for node sizes?	Robustly accounts for branch lengths?	Defined for all rooted trees?	Has a simple, consistent interpretation?	Can meaningfully compare any pair of rooted trees?

Faith’s PD	No	Yes

Allen et al’s $H_{P}$	Yes			Only for leafy ultrametric trees
Chao et al’s ${}^{q}{\overline{D}}$	Yes			Only for leafy ultrametric trees

Sackin’s index	No			Yes	No
Colless’s index
Total cophenetic index

Lemant et al’s $J^{q}$	Yes	No	Yes	Only if uniform branch lengths

Table 2.

Nature, notation and interpretation of RUI tree indices, including prior indices (top row) and new indices (second, third and fourth rows). Counted nodes are those with non-zero size.

Branches or nodes	Type of average	Richness	Diversity (with q > 0)	Evenness (with q > 0)

Nodes	None	${}^{0}D$ = number of counted nodes	${}^{q}D$ = effective number of counted nodes	${}^{q}J$ = evenness of counted node sizes
Branches	Longitudinal mean	${{}^{0}D}_{L}$ = average branch count across the tree	${{}^{q}D}_{L}$ = effective number of maximally distant leaves	${{}^{q}J}_{L}$ = evenness of branch sizes across the tree (tree symmetry if leafy and ultrametric)
Branches	Node-wise mean	${{}^{0}D}_{N}$ = average effective outdegree, ignoring branch sizes	${{}^{q}D}_{N}$ = average effective outdegree, accounting for branch sizes	${{}^{q}J}_{N}$ = tree balance
Branches	Star mean	${{}^{0}D}_{S}$ = effective number of non-root nodes	${{}^{q}D}_{S}$ = effective number of branches, accounting for branch sizes	${{}^{q}J}_{S}$ = evenness of all branch sizes

Further preliminary definitions

In a rooted tree, the depth of a node is the sum of the branch lengths along the unidirectional path from the root to the node. The height of the tree is the maximum depth of its non-zero-sized nodes. Nodes with no descendants are called leaves and non-leaves are called internal nodes. We define the size of a branch as the sum of the proportional node sizes that descend (directly or indirectly) from the branch. For example, in the three-leaf tree depicted in Figure 1a, the branches descending from the root have sizes $\frac{1}{3}$ and $\frac{2}{3}$ , and the other two branches each have size $\frac{1}{3}$ . The size of any segment of a branch is the same as the size of the branch.
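As a small illustration of these definitions, the sketch below computes node depths, tree height, and branch sizes for a three-leaf tree. The dictionary encoding, the node names, and the layout (one leaf branch of length 1, plus an internal branch of length `lam` carrying two leaf branches of length `1 - lam`) are assumptions made for this example, chosen to be consistent with the branch sizes quoted above.

```python
# Each node maps to (parent, length of the branch leading to it, node size).
# Hypothetical encoding of a three-leaf tree; lam is the internal branch length.
lam = 0.5
tree = {
    "root": (None, 0.0, 0.0),
    "x":    ("root", lam, 0.0),          # zero-sized internal node
    "a":    ("root", 1.0, 1.0),
    "b":    ("x", 1.0 - lam, 1.0),
    "c":    ("x", 1.0 - lam, 1.0),
}

def depth(tree, node):
    """Sum of branch lengths on the path from the root to `node`."""
    parent, length, _ = tree[node]
    return 0.0 if parent is None else length + depth(tree, parent)

def branch_sizes(tree):
    """Size of the branch above each non-root node: the summed proportional
    sizes of that node and all of its descendants."""
    total = sum(size for _, _, size in tree.values())
    sizes = {name: 0.0 for name in tree}
    for name, (_, _, size) in tree.items():
        node = name
        while node is not None:           # credit this node to every branch above it
            sizes[node] += size / total
            node = tree[node][0]
    del sizes["root"]                     # the root has no incoming branch
    return sizes

sizes = branch_sizes(tree)
# Height: maximum depth over the non-zero-sized (counted) nodes.
height = max(depth(tree, n) for n, (_, _, s) in tree.items() if s > 0)
print(sizes)
print(height)
```

Under this assumed layout the branch above `x` has size 2/3, every other branch has size 1/3, and the height is 1 for any `lam` between 0 and 1.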

Fig. 1. — a) A leafy bifurcating ultrametric tree with three equally sized leaves. In this and every subsequent tree diagram, open circles indicate zero-sized nodes. b) Index values versus branch length $λ$ for the three-leaf tree. The y-axis is log-transformed so that the curves for all diversity indices appear piecewise linear. ${}^{0}{\overline{D}}$ is slightly greater than ${{}^{1}J}_{N}$ whenever $0 < λ < 1$ .

A leafy tree is such that all internal nodes have zero size (equivalently, all counted nodes are leaves). A tree is ultrametric if all its leaves have the same depth after the removal of all subtrees that contain only zero-sized branches (corresponding to extinct lineages in an evolutionary tree). A caterpillar tree is a bifurcating tree in which every internal node except one has exactly one child leaf. A star tree is a tree in which all non-zero-sized branches are attached to the root. We define a piecewise star tree as a tree that can be divided into transverse intervals such that, within each interval, all the non-zero-sized branches are attached to a common node. For example, the leafy ultrametric tree in Figure 1a is a star tree if $λ = 0$ or $λ = 1$ and is otherwise a caterpillar tree. To simplify our notation, we will usually omit the tree as a function argument (for example, writing ${}^{0}D$ instead of ${}^{0}D (T)$ ).

It will be helpful to recall that, for a sequence of positive real numbers $X = x_{1}, \dots, x_{n}$ , real number $r \neq 0$ , and set of positive weights $W = w_{1}, \dots, w_{n}$ , the weighted power mean of exponent $r$ is

M_{r}(X; W) := \left( \frac{\sum_{i = 1}^{n} w_{i} x_{i}^{r}}{\sum_{i = 1}^{n} w_{i}} \right)^{\frac{1}{r}} .

$M_{0}$ is defined from the limit as

M_{0}(X; W) := \exp \left( \frac{\sum_{i = 1}^{n} w_{i} \log x_{i}}{\sum_{i = 1}^{n} w_{i}} \right) .

$M_{- 1}$ , $M_{0}$ and $M_{1}$ are respectively the weighted harmonic, geometric, and arithmetic means. $M_{- \infty}$ and $M_{\infty}$ respectively return the minimum and the maximum. Power means are closely related to Hill numbers as, for all $q ⩾ 0$ and any sequence of proportions $P$ ,

{}^{q}D (P) = {[M_{q - 1} (P; P)]}^{- 1} .

(0.1)
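Equation 0.1 can be checked numerically. The following sketch implements the weighted power mean and recovers Hill numbers from it; the function names and the example proportions are illustrative assumptions.

```python
import math

def weighted_power_mean(x, w, r):
    """Weighted power mean M_r(X; W) of positive numbers x with positive weights w."""
    total = sum(w)
    if r == 0:
        # geometric mean, the limit of M_r as r -> 0
        return math.exp(sum(wi * math.log(xi) for xi, wi in zip(x, w)) / total)
    return (sum(wi * xi ** r for xi, wi in zip(x, w)) / total) ** (1.0 / r)

def hill_via_power_mean(p, q):
    """Hill number of order q written as the reciprocal of M_{q-1}(P; P)."""
    p = [x for x in p if x > 0]
    return 1.0 / weighted_power_mean(p, p, q - 1)

p = [0.5, 0.3, 0.2]
for q in (0, 1, 2):
    print(q, hill_via_power_mean(p, q))
```

With these proportions, order 0 uses the harmonic mean and returns the richness 3, order 1 uses the geometric mean and returns the exponential of Shannon's entropy, and order 2 uses the arithmetic mean and returns $1 / \sum_{i} p_{i}^{2}$.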

Prior tree balance and imbalance indices

The most popular conventional tree imbalance indices can be expressed in the form

I_{A} = \sum_{i \in V} n_{i} F_{A} (i),

where $V$ is the set of all internal nodes and $n_{i}$ is the number of leaves that descend from node $i$ , and $F_{A} (i)$ is a function that defines a particular index. For $I_{S}$ (Sackin’s index), $I_{C}$ (Colless’ index) and $I_{Φ}$ (the total cophenetic index) we have

F_{S} (i) = 1, F_{C} (i) = |p_{i_{1}} - p_{i_{2}}|, F_{Φ} (i) = \frac{n_{i} - 1}{2},

where $p_{i_{1}}$ is the proportion of the $n_{i}$ leaves that descend from the left child branch of $i$, and $p_{i_{2}}$ is the proportion that descend from the right child branch. Being imbalance indices, these three indices assign higher values to less balanced trees. $I_{C}$ is defined only for bifurcating trees (in which all internal nodes have outdegree two). $I_{S}$ and $I_{Φ}$ are defined only for trees in which all internal nodes have outdegree greater than one. By convention, each index is normalized over the set of trees on $n > 2$ leaves by subtracting its minimum value over such trees and then dividing by the difference between its maximum and its minimum. The minima of $I_{S}$, $I_{C}$ and $I_{Φ}$ are $n$, 0 and 0, and the maxima are $(n + 2) (n - 1) / 2$, $\binom{n - 1}{2}$ and $\binom{n + 1}{3}$, respectively (Shao and Sokal, 1990; Rogers, 1993; Mir et al., 2013).

Lemant et al. (2022) proposed instead defining tree balance or imbalance indices in the form of the weighted arithmetic mean

\frac{1}{\sum_{i \in V} w_{i}} \sum_{i \in V} w_{i} F (i),

where $w_{i}$ is the weight assigned to node $i$ , and $F (i)$ quantifies the degree to which node $i$ splits its descendants into equally sized subtrees. For example, we can obtain an alternative normalization of Colless’ index by setting $w_{i} = n_{i}$ and $F (i) = F_{C} (i)$ . The normalizing factor $\sum_{i \in V} w_{i}$ is then Sackin’s index. An advantage of this approach is that it allows us to compare the balance of any pair of trees for which $F$ is defined, rather than only trees with equal leaf counts.
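As a rough illustration of the sum form and of the weighted-mean normalization, the following sketch computes Sackin's and Colless' indices for two four-leaf bifurcating trees, and then the weighted mean with $w_{i} = n_{i}$ and $F = F_{C}$. The nested-tuple encoding and the function names are illustrative choices, not notation from the text.

```python
def leaf_count(node):
    """Number of leaves below a node; leaves are strings, internal nodes are tuples."""
    return 1 if isinstance(node, str) else sum(leaf_count(c) for c in node)

def internal_nodes(node):
    """Yield every internal node (tuple) of the tree, root included."""
    if not isinstance(node, str):
        yield node
        for child in node:
            yield from internal_nodes(child)

def sackin(tree):
    # F_S(i) = 1, so the index is the sum of n_i over internal nodes
    return sum(leaf_count(i) for i in internal_nodes(tree))

def colless(tree):
    # n_i * |p_i1 - p_i2| equals the absolute difference in child leaf counts
    return sum(abs(leaf_count(i[0]) - leaf_count(i[1])) for i in internal_nodes(tree))

def colless_weighted_mean(tree):
    """Weighted mean with w_i = n_i and F = F_C, that is, Colless divided by Sackin."""
    return colless(tree) / sackin(tree)

caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
for t in (caterpillar, balanced):
    print(sackin(t), colless(t), colless_weighted_mean(t))
```

For the caterpillar tree this gives $I_{S} = 9$ and $I_{C} = 3$, which are the stated maxima for $n = 4$, while the fully balanced tree gives $I_{S} = 8$ and $I_{C} = 0$. Because the weighted mean divides by the tree's own Sackin index rather than by a maximum over trees with the same leaf count, it can be evaluated on trees of any size.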

Definition of the normalizing factor $\overline{h}$

Consistent with Lemant et al. (2022), our new index definitions are based on weighted means. Our preferred weights require us to define the normalizing factor

\overline{h} := \sum_{b \in B} s_{b} l_{b} ⩽ h,

where $B$ is the set of all branches in the tree, $s_{b} \in [0, 1]$ is the size of branch $b$ , $l_{b}$ is the length of branch $b$ , and $h$ is the tree height. We can interpret $\overline{h}$ (denoted $\overline{T}$ in Chao et al. (2010)) as the effective tree height or as the average counted node depth. In computer science, $\overline{h}$ is called the weighted path length (Albers and Westbrook, 2005). For leafy trees with uniform leaf sizes and uniform branch lengths, $\overline{h} = l I_{S} / {}^{0}D$ , where $l$ is the branch length and $I_{S}$ is Sackin’s index. Hence $\overline{h}$ can also be considered a generalization of Sackin’s index. Indeed, we have previously argued that Sackin’s index is best interpreted not as a general imbalance index but rather as a normalizing factor, which works as an imbalance index only in the special case of trees with uniform node sizes, uniform branch lengths, and uniform outdegree (Lemant et al., 2022). $\overline{h} = h$ if and only if the tree is leafy and ultrametric.
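The sketch below evaluates $\overline{h}$ for a leafy ultrametric three-leaf tree and for a non-ultrametric variant. The list-of-pairs encoding, the function names, and the assumed layout of the three-leaf tree (one leaf branch of length 1, an internal branch of length `lam`, and two leaf branches of length `1 - lam`) are illustrative assumptions.

```python
def effective_height(branches):
    """h-bar: sum over branches of (branch size) * (branch length).
    `branches` is a list of (size, length) pairs, with sizes as proportions."""
    return sum(size * length for size, length in branches)

def three_leaf_branches(lam):
    # Assumed layout: leaf branch of length 1, internal branch of length lam
    # (size 2/3), and two leaf branches of length 1 - lam below it.
    return [(1 / 3, 1.0), (2 / 3, lam), (1 / 3, 1.0 - lam), (1 / 3, 1.0 - lam)]

for lam in (0.0, 0.25, 0.5, 1.0):
    # leafy and ultrametric, so h-bar equals the height h = 1
    print(lam, effective_height(three_leaf_branches(lam)))

# Non-ultrametric variant: one terminal branch shortened from 0.5 to 0.2.
uneven = [(1 / 3, 1.0), (2 / 3, 0.5), (1 / 3, 0.5), (1 / 3, 0.2)]
print(effective_height(uneven))   # below the height h = 1
```

The first loop illustrates the "if" direction of the final statement above, and the shortened branch shows how $\overline{h}$ drops below $h$ once the leaves are no longer equally deep.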

Definition of the longitudinal mean

The basic idea of the longitudinal mean is that we split the tree into transverse intervals, calculate an index value based on the proportional sizes of the branch segments within each interval, and then take a weighted average of these within-interval index values. Let $I$ denote the set of transverse intervals created by locating an interval boundary at every node depth (dashed lines in Figure 1a), excluding intervals that contain only zero-sized branches. Each interval $i \in I$ then contains a set $B_{i}$ of branch segments, all of the same length, which we will refer to as the interval height $h_{i}$ . Let

S_{i} := \sum_{b \in B_{i}} s_{b} \in (0, 1],

where $s_{b}$ is the size of branch segment $b$ . Then $S_{i} = 1$ for all intervals $i$ if and only if the tree is leafy and ultrametric. It follows that

\sum_{i \in I} S_{i} h_{i} = \overline{h} .

Now for each $b \in B_{i}$, define the within-interval proportional branch size $p_{b} := s_{b} / S_{i}$ and let $P_{i} := \{p_{b} : b \in B_{i}, p_{b} > 0\}$. Then $\sum_{p \in P_{i}} p = \sum_{b \in B_{i}} p_{b} = 1$ for all intervals $i \in I$.

Finally, for index $F$ and tree $T$ , we define the longitudinal mean of order $r$ of $F$ as the functional $F \mapsto M_{long, r} (F)$ such that

M_{long, r}(F)(T; w) := \begin{cases} \left( \frac{\sum_{i \in I(T)} w_{i} [F(P_{i})]^{r}}{\sum_{i \in I(T)} w_{i}} \right)^{\frac{1}{r}} & \text{if } h > 0 \\ F(\emptyset) & \text{otherwise,} \end{cases}

(0.2)

where the weight $w > 0$ is a function of $i$ that remains to be specified. Hence $M_{long, r} (F)$ is a weighted power mean of the $F$ values assigned to the intervals. For succinctness, we will omit the argument $T$ and specify $w$ only where necessary.

Example 0.1

For the function $F (x_{1}, \dots, x_{n}) = \sum_{k = 1}^{n} x_{k}$ we have

M_{long, r} (F) = {(\frac{\sum_{i \in I} w_{i} {(\sum_{p \in P_{i}} p)}^{r}}{\sum_{i \in I} w_{i}})}^{\frac{1}{r}} = 1 .

New longitudinal mean indices

We define new tree indices as longitudinal means of ${}^{q}D$ and ${}^{q}J$ with $w_{i} = S_{i} h_{i}$ , so that the index value assigned to each interval $i$ is weighted by the product of the length $h_{i}$ and the summed sizes $S_{i}$ of the branch segments that $i$ contains. First, we define

{{}^{q}D}_{L} := M_{long, 0} ({}^{q}D) .

(0.3)

This is equivalent to ${{}^{q}H}_{L} = M_{long, 1} ({}^{q}H)$ with $D_{L} = \exp H_{L}$. In particular,

{{}^{0}D}_{L} = \begin{cases} \exp \left( \frac{1}{\overline{h}} \sum_{i \in I} S_{i} h_{i} \log |P_{i}| \right) & \text{if } h > 0 \\ 0 & \text{otherwise,} \end{cases}

{{}^{1}D}_{L} = \begin{cases} \exp \left( - \frac{1}{\overline{h}} \sum_{i \in I} h_{i} \sum_{b \in B_{i}} s_{b} \log \frac{s_{b}}{S_{i}} \right) & \text{if } h > 0 \\ 0 & \text{otherwise.} \end{cases}

We can interpret ${{}^{0}D}_{L}$ as the average tree width or, more precisely, as the geometric mean number of branches counted across the tree. In an evolutionary tree where branch lengths correspond to elapsed time, ${{}^{0}D}_{L}$ equates to average richness across time, excluding extinct lineages. For $q > 0$ , ${{}^{q}D}_{L}$ can be interpreted as the effective number of counted nodes maximally distant from the root or – because all maximally distant counted nodes must be leaves – as the effective number of maximally distant leaves. In biological terms, this corresponds to the effective number of extant types maximally distinct from the root type.

Second, we define

{{}^{q}J}_{L} := M_{long, 1} ({}^{q}J) = \begin{cases} \frac{1}{\overline{h}} \sum_{i \in I} S_{i} h_{i} \, {}^{q}J (P_{i}) & \text{if } h > 0 \\ 1 & \text{otherwise.} \end{cases}

(0.4)

Just as ${}^{q}J$ measures the evenness of node sizes, so ${{}^{q}J}_{L}$ measures the average evenness of branch sizes across the tree. If the tree is leafy and ultrametric then ${{}^{q}J}_{L} = 1$ for $q > 0$ if and only if the tree is fully symmetric. Hence, when applied to leafy ultrametric trees, ${{}^{q}J}_{L}$ can be interpreted as a symmetry index (also known as a sound balance index (Mir et al., 2018)).

Figure 1b illustrates how ${{}^{0}D}_{L}$ , ${{}^{1}D}_{L}$ and ${{}^{1}J}_{L}$ (and other index values yet to be defined) vary with branch length $λ$ for the three-leaf tree of Figure 1a.
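The longitudinal indices can be sketched in code by listing, for each transverse interval, its height and the sizes of the branch segments it contains. The function names below, and the assumed layout of the three-leaf tree (two segments of sizes 1/3 and 2/3 above depth `lam`, three segments of size 1/3 below it), are illustrative assumptions rather than the authors' implementation.

```python
import math

def shannon_hill(p):
    """Hill number of order 1: exponential of Shannon entropy."""
    return math.exp(-sum(x * math.log(x) for x in p if x > 0))

def longitudinal_indices(intervals):
    """Return (0D_L, 1D_L, 1J_L) for a list of intervals.

    Each interval is (height h_i, [sizes s_b of its branch segments]), with at
    least one non-zero size. Weights are w_i = S_i * h_i, as in the text.
    """
    h_bar = sum(h * sum(sizes) for h, sizes in intervals)
    if h_bar == 0:
        return 0.0, 0.0, 1.0                      # values assigned when h = 0
    log_d0 = log_d1 = j1 = 0.0
    for h, sizes in intervals:
        S = sum(sizes)
        p = [s / S for s in sizes if s > 0]       # within-interval proportions P_i
        w = S * h / h_bar                         # normalized weight
        d0, d1 = len(p), shannon_hill(p)
        log_d0 += w * math.log(d0)                # geometric mean of richness
        log_d1 += w * math.log(d1)                # geometric mean of diversity
        j1 += w * (math.log(d1) / math.log(d0) if d0 > 1 else 1.0)
    return math.exp(log_d0), math.exp(log_d1), j1

def three_leaf_intervals(lam):
    # Assumed layout: two segments above depth lam, three below.
    return [(lam, [1 / 3, 2 / 3]), (1.0 - lam, [1 / 3, 1 / 3, 1 / 3])]

for lam in (0.0, 0.5, 1.0):
    print(lam, longitudinal_indices(three_leaf_intervals(lam)))
```

Under this layout ${{}^{0}D}_{L} = 2^{λ} 3^{1 - λ}$, so its logarithm is linear in $λ$, which agrees with the remark in the caption of Figure 1 that the diversity curves appear linear on a log-transformed axis.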

Definition of the node-wise mean: first special case

In the special case in which all branches have the same length $l$ , we can obtain a node-wise mean by calculating an index value for each node, based on the node’s child branch sizes, and then taking a weighted average of these node index values. We previously used this approach to define new tree balance indices (Lemant et al., 2022).

Let $V$ denote the set of all internal nodes, excluding nodes with only zero-sized descendants. Let $C_{i}$ denote the subtree containing only $i$ and its children. For $i \in V$ and $b \in C_{i}$ , let $s_{b}$ denote the size of $b$ and define

S_{i} = \sum_{b \in C_{i}} s_{b} \in (0, 1] .

Then $S_{i} = 1$ for all nodes $i$ if and only if the tree is a leafy piecewise star tree. It follows that

\sum_{i \in V} S_{i} l = \overline{h} .

Now for each $b \in C_{i}$, define the proportional branch size $p_{b} := s_{b} / S_{i}$ and let $P_{i} := \{p_{b} : b \in C_{i}, p_{b} > 0\}$. We then define the node-wise mean of order $r$ of index $F$ as the weighted power mean of the $F$ values assigned to the nodes:

M_{node, r}(F)(T; w) := \begin{cases} \left( \frac{\sum_{i \in V(T)} w_{i} [F(P_{i})]^{r}}{\sum_{i \in V(T)} w_{i}} \right)^{\frac{1}{r}} & \text{if } h > 0 \\ F(\emptyset) & \text{otherwise,} \end{cases}

(0.5)

where the weight $w > 0$ is a function of $i$ that remains to be specified.
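A minimal sketch of Equation 0.5 follows. Since the weights are left unspecified at this point, the code defaults to $w_{i} = S_{i}$, which is an assumption made here by analogy with the longitudinal weights; the function name, the input encoding, and the six-leaf example tree are likewise illustrative.

```python
import math

def node_wise_mean(child_sizes, F, r, weights=None):
    """Weighted power mean of F(P_i) over internal nodes (uniform branch lengths).

    child_sizes: one list of child branch sizes s_b per internal node.
    weights: one weight per node; defaults to S_i, an assumption of this
    sketch because the text leaves w unspecified here.
    """
    S = [sum(sizes) for sizes in child_sizes]
    w = S if weights is None else weights
    # F is applied to the proportional child branch sizes P_i of each node
    values = [F([s / Si for s in sizes if s > 0]) for sizes, Si in zip(child_sizes, S)]
    total = sum(w)
    if r == 0:
        return math.exp(sum(wi * math.log(v) for wi, v in zip(w, values)) / total)
    return (sum(wi * v ** r for wi, v in zip(w, values)) / total) ** (1.0 / r)

richness = len   # 0D of a list of positive proportions

# Hypothetical six-leaf tree: the root has two children, each with three
# equally sized leaves; all branches have the same length.
six_leaf = [[1 / 2, 1 / 2], [1 / 6] * 3, [1 / 6] * 3]
print(node_wise_mean(six_leaf, richness, 0))   # geometric mean
print(node_wise_mean(six_leaf, richness, 1))   # arithmetic mean
```

Under these assumed weights the root (richness 2) carries weight 1 and each other internal node (richness 3) carries weight 1/2, so both means lie between 2 and 3.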

Definition of the node-wise mean: second special case

In the case of a piecewise star tree with $h > 0$ , we can set the index value of each internal node $k$ as the longitudinal mean index value of the subtree $C_{k}$ . We then have

\begin{array}{l} M_{node, r, t}(F)(T; u, w) = \left( \frac{\sum_{k \in V(T)} u_{k} [M_{long, r}(F)(C_{k}; w)]^{t}}{\sum_{k \in V(T)} u_{k}} \right)^{\frac{1}{t}} \\ = \left( \frac{1}{\sum_{k \in V(T)} u_{k}} \sum_{k \in V(T)} u_{k} \left( \frac{\sum_{i \in I(T)} w_{i k} [F(P_{i k})]^{r}}{\sum_{i \in I(T)} w_{i k}} \right)^{\frac{t}{r}} \right)^{\frac{1}{t}} , \end{array}

(0.6)

where $t$ is the exponent of the across-nodes power mean, $u_{k} > 0$ is the weight assigned to node $k$ , $P_{i k}$ contains the proportional sizes of all branch segments that belong to both subtree $C_{k}$ and interval $i$ , and $w_{i k} > 0$ is the weight assigned to $k$ associated with interval $i$ .

To keep our system internally consistent we would like, in the case of piecewise star trees, the node-wise mean of any index to be equal to the longitudinal mean of the same index. Comparing Equation 0.6 with the definition of the longitudinal mean (Equation 0.2), we see that the right-hand sides are equivalent if and only if three conditions hold:

r = t, \sum_{i \in I (T)} w_{i k} = u_{k}, \sum_{k \in V (T)} u_{k} = \sum_{i \in I (T)} w_{i} .

Under these conditions, summing index values across subtree intervals and then across nodes gives the same result as summing across tree intervals. We then have for any piecewise star tree $T$ with $h > 0$ ,

M_{node, r}(F)(T; w) = \left( \frac{\sum_{k \in V(T)} \sum_{i \in I(T)} w_{i k} [F(P_{i k})]^{r}}{\sum_{i \in I(T)} w_{i}} \right)^{\frac{1}{r}} .

In the particular case $F = {}^{q}D$ , the index value assigned to each node $k$ (that is, the longitudinal mean index value of the subtree $C_{k}$ ) measures the diversity of the child branches of $k$ . When $C_{k}$ has $m$ branches of equal length and size, the node diversity of $k$ is $m$ . In the case $m > 1$ , as one branch length is reduced towards zero while all else is kept constant, the node diversity of $k$ decreases continuously to $m - 1$ . Decreasing instead the size of one branch has the same effect provided $q > 0$ . Hence the diversity value assigned to each node can be interpreted as an effective outdegree, and the node-wise mean diversity can be interpreted as an average effective outdegree. When $q = 0$ the effective outdegree ignores branch sizes. As $q$ increases, the effective outdegree gives less weight to branches of smaller size. We would like to retain this interpretation as we generalize the definition of the node-wise mean.
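The continuity property described above can be checked numerically for $q = 0$. The sketch below computes the longitudinal mean richness of a single star subtree whose child branches have equal size, assuming the exponent $r = 0$ and the weights $S_{i} h_{i}$ used for ${{}^{0}D}_{L}$; the function name and the example branch lengths are illustrative.

```python
import math

def star_node_richness(lengths):
    """Longitudinal mean richness (order r = 0, weights S_i * h_i) of a star
    subtree whose child branches have equal size and the given lengths."""
    m = len(lengths)
    size = 1.0 / m                                   # equal branch sizes
    cuts = sorted({0.0, *lengths})                   # interval boundaries
    num = den = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        present = sum(1 for l in lengths if l >= hi)  # branches spanning (lo, hi)
        weight = present * size * (hi - lo)           # S_i * h_i
        num += weight * math.log(present)
        den += weight
    return math.exp(num / den) if den > 0 else 0.0

for eps in (1.0, 0.5, 0.1, 1e-9):
    # three child branches; the third is shortened towards zero
    print(eps, star_node_richness([1.0, 1.0, eps]))
```

With three child branches the value falls smoothly from 3 towards 2 as the third branch shrinks, which is the behaviour expected of an effective outdegree.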

Definition of the node-wise mean: general case

In extending the definition to all rooted trees, we want to ensure that, as with the longitudinal mean, the node-wise mean changes continuously as we vary branch lengths. We illustrate this general issue with an example.

Example 0.2

Consider a leafy ultrametric tree with six leaves such that the root has two descendant branches each of length $λ$ , and both non-root internal nodes have three descendant branches, all of length $1 - λ$ . When $λ = \frac{1}{2}$ (Figure 2a), it follows from our special-case definition (Equation 0.5) that the root has richness 2, the internal nodes each have richness 3, and the node-wise mean richness is intermediate between 2 and 3. As $λ$ increases from $\frac{1}{2}$ to 1, the node richness values should remain unchanged but the root node richness should be given greater weight, so that the node-wise mean richness (which we will denote ${{}^{0}D}_{N}$ ) approaches 2 continuously as $λ \to 1$ (Figure 2b).

Fig. 2. — a) The six-leaf tree considered in Example 0.2 with branch length $λ = \frac{1}{2}$ . b) As $λ \to 1$ , the tree approaches a two-leaf star tree. c) As $λ \to 0$ , the tree approaches a six-leaf star tree.

At the other extreme, as $λ$ decreases from $\frac{1}{2}$ to 0, we would like ${{}^{0}D}_{N}$ to increase continuously to 6 (Figure 2c). Given that the weight assigned to the root node richness should decrease as $λ$ decreases, the only way to achieve the required increase in ${{}^{0}D}_{N}$ is to increase the richness value assigned to each non-root internal node $k$ . We can do this by making the richness value assigned to $k$ depend not only on the child branches of $k$ but also, to an increasing degree as $λ$ decreases, on the other branches that run alongside the branches of $k$ .

Generalizing from the example we conclude that, when the distance between node $k$ and any ancestor $j$ of $k$ (in the example, the root) is less than the height of $C_{k}$ (in the example, when $λ < \frac{1}{2}$ ), the index value assigned to $k$ should depend not only on the branches of $C_{k}$ (the child branches of $k$ ) but also on branch segments that descend from $j$ and that coexist in transverse intervals with the branches of $C_{k}$ . The weight assigned to $k$ depends only on $C_{k}$ but the index value assigned to $k$ is a weighted average of index values across $k$ and all ancestors of $k$ .

To formalize this concept, we first define, for interval $i \in I$ and node $j \in V$ ,

S_{i T_{j}} = \sum_{b \in B_{i} \cap T_{j}} s_{b} \in [0, 1], S_{i C_{j}} = \sum_{b \in B_{i} \cap C_{j}} s_{b} \in [0, 1],

where $T_{j}$ is the subtree containing $j$ and all its descendants. This implies

\sum_{i \in I} S_{i T_{r}} h_{i} = \sum_{i \in I} \sum_{j \in V} S_{i C_{j}} h_{i} = \overline{h},

where $r$ is the root (and hence $T_{r}$ is the entire tree). $S_{i C_{j}}$ is a generalization of the $S_{i}$ used in our previous definitions, whereas $S_{i T_{j}}$ is a new concept. For each $b \in B_{i} \cap T_{j}$ , let

p_{b} = \begin{cases} s_{b} / S_{i T_{j}} & \text{if } S_{i T_{j}} > 0 \\ 0 & \text{otherwise,} \end{cases}

and define $P_{i j} = \{p_{b} : b \in B_{i} \cap T_{j}, p_{b} > 0\}$. We then define the node-wise mean as the triple power mean

\begin{array}{l} M_{node, r, s, t} (F) (T; u, v, w) = \\ {(\frac{1}{\sum_{k \in V (T)} u_{k}} \sum_{k \in V (T)} u_{k} {[\frac{1}{\sum_{j \in A_{k}} v_{j k}} \sum_{j \in A_{k}} v_{j k} {(\frac{\sum_{i \in I (T)} w_{i k} {[F (P_{i j})]}^{r}}{\sum_{i \in I (T)} w_{i k}})}^{\frac{s}{r}}]}^{\frac{t}{s}})}^{\frac{1}{t}}, \end{array}

where $A_{k}$ is the set containing $k$ and all ancestors of $k$ , $s$ is the exponent of the across-ancestors power mean, and $v_{j k}$ are the ancestor weights. This expression is consistent with Equation 0.6 if and only if

t = s = r, \sum_{j \in A_{k}} v_{j k} = u_{k} = \sum_{i \in I (T)} w_{i k}, \sum_{k \in V (T)} u_{k} = \sum_{i \in I (T)} w_{i} .

(0.7)

We then arrive at a simpler general definition

M_{node, r} (F) (T; v, w) : = \begin{cases} {(\frac{1}{\sum_{k \in V (T)} u_{k}} \sum_{k \in V (T)} \sum_{j \in A_{k}} \frac{v_{j k}}{u_{k}} \sum_{i \in I (T)} w_{i k} {[F (P_{i j})]}^{r})}^{\frac{1}{r}} & \text{if } h > 0 \\ F (\emptyset) & \text{otherwise.} \end{cases}

Integral forms of the node-wise and longitudinal means

Since our preferred ancestor weights are best expressed as integrals, we will find it useful to define the longitudinal and node-wise means even more generally by integrating over depths instead of summing over intervals. Suppose we assign a non-negative density $f_{b} (x)$ to every branch $b$ at every depth $x$ , with $f_{b} (x) = 0$ for every $x$ at which $b$ is absent. Define the tree height $h : = \max \{x : f_{b} (x) > 0, b \in B\}$ , where $B$ is the set of all branches. We can then define branch size $s_{b}$ as the non-increasing function of depth $x$ :

\overline{h} : = \sum_{b \in B} \int_{0}^{h} f_{b} (x) d x, \quad s_{b} (x) : = \begin{cases} \frac{1}{\overline{h}} \sum_{g \in G_{b}} \int_{x}^{h} f_{g} (t) d t & \text{if } h > 0 \\ 0 & \text{otherwise,} \end{cases}

where $G_{b}$ is the set containing $b$ and all branches that descend from $b$ . Let

S_{T_{j}} (x) : = \sum_{b \in B_{j}} s_{b} (x) \in [0, 1] .

For each $b \in B_{j}$ , define the proportional branch size

p_{b j} (x) : = \begin{cases} s_{b} (x) / S_{T_{j}} (x) & \text{if } S_{T_{j}} (x) > 0 \\ 0 & \text{otherwise.} \end{cases}

Let $P_{j} (x) : = \{p_{b j} (x) : b \in B_{j}, p_{b j} (x) > 0\}$ . We then define the node-wise mean of an index $F$ as

M_{node, r} (F) (T; v, w) : = \begin{cases} {(\frac{1}{\sum_{k \in V (T)} u_{k}} \sum_{k \in V (T)} \sum_{j \in A_{k}} \frac{v_{j k}}{u_{k}} \int_{0}^{h} w_{k} (x) {[F (P_{j} (x))]}^{r} d x)}^{\frac{1}{r}} & \text{if } h > 0 \\ F (\emptyset) & \text{otherwise,} \end{cases}

where $w_{k} (x)$ is the weight assigned to node $k$ at depth $x$ , and

u_{k} = \int_{0}^{h} w_{k} (x) d x .

The longitudinal mean can similarly be defined in terms of integrals as

M_{long, r} (F) (T; w) : = \begin{cases} {(\frac{\int_{0}^{h} w (x) [F (P (x))]^{r} d x}{\int_{0}^{h} w (x) d x})}^{\frac{1}{r}} & \text{if } h > 0 \\ F (\emptyset) & \text{otherwise,} \end{cases}

where $P (x) : = \{p_{b} (x) : b \in B, p_{b} (x) > 0\}$ ,

p_{b} (x) : = \begin{cases} s_{b} (x) / S (x) & \text{if } S (x) > 0 \\ 0 & \text{otherwise,} \end{cases} \quad S (x) : = \sum_{b \in B} s_{b} (x) = \sum_{j \in V} \sum_{b \in C_{j}} s_{b} (x) .

Our previous definitions are included as special cases in which the branch density is zero except at each counted node, where it is equal to the node size. In an evolutionary tree, branch density corresponds to population size, and branch size corresponds to number of extant descendants. Although it is beyond the scope of the current manuscript, we note that the integral forms would permit us to apply our indices to a more general class of tree, such that the size of any branch is allowed to vary along its length.

New node-wise mean indices

To define new tree indices as node-wise means of ${}^{q}D$ and ${}^{q}J$ , we first set $w_{k} = S_{C_{k}}$ , where

S_{C_{k}} (x) : = \sum_{b \in C_{k}} s_{b} (x),

and we define the normalization factor

{\overline{h}}_{C_{j}} : = \int_{0}^{h} S_{C_{j}} (x) d x, ⟹ \overline{h} = \sum_{j \in V} {\overline{h}}_{C_{j}} = \int_{0}^{h} S (x) d x .

Let $d_{k}$ denote the depth of node $k$ and let $d_{j k} = d_{k} - d_{j}$ denote the distance from $j$ to $k$ . Let $j^{'}$ denote the parent of node $j$ . The ancestor weight function $v$ should have three properties. First, as an assumption of our general definition (Equation 0.7),

\sum_{j \in A_{k}} v_{j k} = u_{k} = \int_{0}^{h} w_{k} (x) d x .

Second, $v_{j k}$ should decrease as $d_{j^{'} j}$ decreases. Third, $v_{j k}$ should increase as the overlap between $C_{j}$ and $C_{k}$ increases. A simple way to satisfy all three conditions is to set

v_{j k} = \int_{α_{j k}}^{β_{j k}} S_{C_{k}} (x) d x,

where $α_{j k} : = d_{k} + d_{j k}$ and

β_{j k} : = \begin{cases} α_{j k} + d_{j^{'} j} & \text{if } j \text{ is not the root} \\ \infty & \text{otherwise.} \end{cases}

Given the above choices of $w$ and $v$ , we define the node-wise mean diversity of order $q$ as

{{}^{q}D}_{N} : = M_{node, 0} ({}^{q}D) .

(0.8)

This is equivalent to ${{}^{q}H}_{N} = M_{node, 1} ({}^{q}H)$ with ${{}^{q}D}_{N} = \exp {{}^{q}H}_{N}$ . In particular,

{{}^{0}D}_{N} = \begin{cases} \exp (\frac{1}{\overline{h}} \sum_{k \in V} \frac{1}{{\overline{h}}_{C_{k}}} \sum_{j \in A_{k}} v_{j k} \int_{0}^{h} S_{C_{k}} (x) \log |P_{j} (x)| d x) & \text{if } h > 0 \\ 0 & \text{otherwise,} \end{cases}

{{}^{1}D}_{N} = \begin{cases} \exp (\frac{1}{\overline{h}} \sum_{k \in V} \frac{1}{{\overline{h}}_{C_{k}}} \sum_{j \in A_{k}} v_{j k} \int_{0}^{h} S_{C_{k}} (x) \, {}^{1}H (P_{j} (x)) d x) & \text{if } h > 0 \\ 0 & \text{otherwise,} \end{cases}

where

{}^{1}H (P_{j} (x)) = - \sum_{b \in B_{j}} \frac{s_{b} (x)}{S_{T_{j}} (x)} \log \frac{s_{b} (x)}{S_{T_{j}} (x)} .

As previously explained, we can interpret ${{}^{q}D}_{N}$ as an average effective outdegree (branching factor in computer science) that accounts for branch lengths only $(q = 0)$ or for both branch lengths and branch sizes $(q > 0)$ . Less formally, ${{}^{q}D}_{N}$ quantifies the bushiness of the tree.

With the same $w$ and $v$ , we define the universal tree balance ${{}^{q}J}_{N}$ as

{{}^{q}J}_{N} : = M_{node, 1} ({}^{q}J) = \begin{cases} \frac{1}{\overline{h}} \sum_{k \in V} \frac{1}{{\overline{h}}_{C_{k}}} \sum_{j \in A_{k}} v_{j k} \int_{0}^{h} S_{C_{k}} (x) \, {}^{q}J (P_{j} (x)) d x & \text{if } h > 0 \\ 1 & \text{otherwise.} \end{cases}

(0.9)

In the case of uniform branch lengths, this definition simplifies to

{{}^{q}J}_{N} = \frac{1}{\overline{h}} \sum_{i \in V} S_{i} \, {}^{q}J (P_{i}),

where $S_{i}$ and $P_{i}$ are defined as in Equation 0.5. This means that for trees with uniform branch lengths, ${{}^{q}J}_{N}$ is identical to our previous definition of the tree balance index $J^{q}$ (Lemant et al., 2022), excepting one important difference. Whereas our prior index assigns a balance score of zero to any node that has outdegree 1, the above definition instead assigns a balance score of one. Therefore linear trees are considered maximally unbalanced according to $J^{q}$ but maximally balanced according to ${{}^{q}J}_{N}$ . This difference ensures that all our new evenness indices have consistent definitions and interpretations.

Example 0.3

Consider the perfectly balanced, bifurcating, leafy tree with four leaves and branch lengths $λ$ (upper two branches) and $1 - λ$ (lower four branches), as shown in Figure 3a. For all $q ⩾ 0$ , if $λ ⩾ \frac{1}{2}$ then ${{}^{q}D}_{N} = 2$ , and otherwise ${{}^{q}D}_{N} = 4^{1 - λ}$ , as shown in Figure 3b (dark blue curve). A step-by-step derivation is in the Appendix.

Fig. 3. — a) The four-leaf tree considered in Examples 0.3 and 0.6. b) Index values versus branch length $λ$ for the tree of Example 0.6. Curves for indices with parameter $q$ are independent of the value of $q ⩾ 0$ . The y-axis is log-transformed so that the curves for all diversity indices except ${}^{0}{\overline{D}}$ and $M_{node, 1} ({}^{q}D)$ appear piecewise linear. c) ${}^{q}{\overline{D}}$ and ${{}^{q}D}_{L}$ values for the four-leaf tree considered in Example 0.6, for varied $q$ with $λ = \frac{1}{2}$ .

The above example illustrates that, for leafy ultrametric trees, the node-wise mean diversity, like the longitudinal mean diversity, is a piecewise exponential function of branch lengths. Equivalently, the entropy indices are piecewise linear. This property depends on our defining the ancestor weight function $v_{j k}$ as an integral of $S_{C_{k}}$ . Because $S_{C_{k}}$ is a step function, the integrals in all our node-wise mean index definitions are simply sums of areas of rectangles, and the widths of these rectangles are linear functions of branch lengths. Our definitions are designed so that, although Equations 0.8 and 0.9 might appear complicated, in practice they produce relatively simple expressions.
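The rectangle-sum structure described above can be checked numerically. The following is a minimal sketch for the four-leaf tree of Example 0.3 only, not a general implementation: the function names `overlap` and `nodewise_richness_four_leaf` are hypothetical, and the branch counts and step functions are hard-coded from the description of that tree (so $\overline{h} = h = 1$).

```python
import math

def overlap(lo, hi, a, b):
    """Length of the intersection of the intervals [lo, hi] and [a, b]."""
    return max(0.0, min(hi, b) - max(lo, a))

def nodewise_richness_four_leaf(lam):
    """0D_N for the four-leaf tree of Example 0.3, for 0 <= lam <= 1.

    Assumption: upper branches have length lam and lower branches 1 - lam,
    all leaves have equal size, so h = h-bar = 1.
    """
    log_d = 0.0

    # Root r: S_{C_r}(x) = 1 on [0, lam]; the root is its own only ancestor.
    u_r = 1.0 * overlap(0.0, 1.0, 0.0, lam)
    if u_r > 0:
        v_rr = 1.0 * overlap(0.0, math.inf, 0.0, lam)  # alpha = 0, beta = infinity
        inner = u_r * math.log(2)                       # |P_r(x)| = 2 on [0, lam]
        log_d += v_rr * inner / u_r

    # Each of the two internal nodes k at depth lam: S_{C_k}(x) = 1/2 on [lam, 1].
    u_k = 0.5 * overlap(0.0, 1.0, lam, 1.0)
    if u_k > 0:
        v_kk = 0.5 * overlap(lam, 2 * lam, lam, 1.0)       # alpha = d_k, beta = d_k + lam
        v_rk = 0.5 * overlap(2 * lam, math.inf, lam, 1.0)  # alpha = d_k + d_rk = 2 * lam
        inner_k = u_k * math.log(2)  # two branches of T_k coexist with C_k
        inner_r = u_k * math.log(4)  # four branches of T_r coexist with C_k
        log_d += 2 * (v_kk * inner_k + v_rk * inner_r) / u_k

    return math.exp(log_d)  # the factor 1 / h-bar is omitted because h-bar = 1

for lam in (0.0, 0.25, 0.5, 0.8):
    print(lam, nodewise_richness_four_leaf(lam))
```

Every integral reduces to a height times an interval overlap, which is why the result is piecewise exponential in $λ$.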

The star mean and new star mean indices

Like the longitudinal and node-wise means, the star mean is based on branch sizes. Unlike those other two means, but in common with the node-size indices ${}^{q}D$ and ${}^{q}J$ , the star mean ignores tree topology. The idea is that, in effect, we rearrange the tree by reattaching all branches to the root to form a star tree, while retaining branch sizes and lengths, and then calculate the longitudinal (equivalently node-wise) mean index value of the star tree. For index $F$ and tree $T$ , we define the star mean of order $r$ of $F$ such that

M_{star, r} (F) (T; w^{*}) : = \begin{cases} {(\frac{\int_{0}^{h} w^{*} (x) {[F (P^{*} (x))]}^{r} d x}{\int_{0}^{h} w^{*} (x) d x})}^{\frac{1}{r}} & \text{if } h > 0 \\ F (\emptyset) & \text{otherwise,} \end{cases}

(0.10)

where $P^{*} (x) : = \{p_{b}^{*} (x) : b \in B, p_{b}^{*} (x) > 0\}$ ,

p_{b}^{*} (x) : = \begin{cases} s_{b} (x + d_{b}) / S^{*} (x) & \text{if } S^{*} (x) > 0 \\ 0 & \text{otherwise,} \end{cases} \quad S^{*} (x) : = \sum_{b \in B} s_{b} (x + d_{b}),

and $d_{b}$ is the depth of the parent node of branch $b$ . Note that

\int_{0}^{h} S^{*} (x) d x = \int_{0}^{h} S (x) d x = \overline{h} .

With $w^{*} = S^{*}$ , we define the star mean diversity of order $q$ as

{{}^{q}D}_{S} : = M_{star, 0} ({}^{q}D),

(0.11)

which is equivalent to ${{}^{q}H}_{S} = M_{star, 1} ({}^{q}H)$ with ${{}^{q}D}_{S} = \exp {{}^{q}H}_{S}$ . In particular,

{{}^{0}D}_{S} = \begin{cases} \exp (\frac{1}{\overline{h}} \int_{0}^{h} S^{*} (x) \log |P^{*} (x)| d x) & \text{if } h > 0 \\ F (\emptyset) & \text{otherwise,} \end{cases}

{{}^{1}D}_{S} = \begin{cases} \exp (\frac{1}{\overline{h}} \int_{0}^{h} S^{*} (x) \, {}^{1}H (P^{*} (x)) d x) & \text{if } h > 0 \\ F (\emptyset) & \text{otherwise.} \end{cases}

${{}^{q}D}_{S}$ quantifies the effective number of branches in the tree, either accounting for branch lengths only $(q = 0)$ or for both branch lengths and branch sizes $(q > 0)$ . Because every non-root node has exactly one parent branch, and because ${{}^{0}D}_{S}$ accounts for branch lengths but not sizes, ${{}^{0}D}_{S}$ can also be interpreted as an effective number of non-root nodes. We also define an index that quantifies the evenness of all branch sizes:

{{}^{q}J}_{S} : = M_{star, 1} ({}^{q}J) = \begin{cases} \frac{1}{\overline{h}} \int_{0}^{h} S^{*} (x) \, {}^{q}J (P^{*} (x)) d x & \text{if } h > 0 \\ 1 & \text{otherwise.} \end{cases}

(0.12)

Figures 1b and 3b illustrate how ${{}^{0}D}_{S}$ , ${{}^{1}D}_{S}$ and ${{}^{1}J}_{S}$ values vary with branch lengths for three- and four-leaf trees.

Non-normalized indices

Although our focus is on indices that describe shape, rather than size, we note that every longitudinal, node-wise, or star mean diversity index can be converted into a non-normalized diversity index simply by omitting the normalization factor. Such indices are useful in applications where the unit of branch length should be retained, such as when assessing loss of richness or diversity due to the removal of a node. In particular, we will find it useful to define the non-normalized entropy index

H_{P}^{'} : = - \sum_{b \in B} l_{b} p_{b} \log p_{b} = \sum_{i \in I} h_{i} \log {}^{1}D (P_{i}) .

(0.13)

Results

${{}^{q}D}_{L}$ improves on prior indices for non-ultrametric trees

Our indices ${{}^{0}D}_{L}$ and ${{}^{1}D}_{L}$ are similar to well-known pre-existing indices but with important improvements (Table 3). The phylogenetic diversity of Faith (1992) – which is popular among conservation biologists – is defined as

P D : = \sum_{b \in B} l_{b} .

Table 3.

Advantages of using our indices instead of previously defined indices.

Prior index	Proposed replacement	Equation	Advantages of replacement

Allen et al.’s $H_{P}$	$H_{P}^{'}$	0.13	Interpretable for non-ultrametric trees
Chao et al.’s ${}^{q}{\overline{D}}$	${{}^{q}D}_{L}$	0.3	Bounded and interpretable for non-ultrametric trees; more self-consistent; more intuitive for $q > 1$
All prior tree balance and imbalance indices	${{}^{q}J}_{N}$	0.9	Defined for all rooted trees; can meaningfully compare any pair of trees; accounts for node sizes and branch lengths


Phylogenetic entropy (Allen et al., 2009) – a previous generalization of Shannon’s entropy – is defined in our notation as

H_{P} : = - \sum_{b \in B} l_{b} s_{b} \log s_{b} .

Chao et al. (2010) defined normalized versions of these indices that can be written as

{}^{0}{\overline{D}} = \frac{P D}{\overline{h}} = \frac{1}{\overline{h}} \sum_{i \in I} h_{i} |B_{i}| = \frac{\sum_{i \in I} h_{i} {}^{0}D (Q_{i})}{\sum_{i \in I} S_{i} h_{i}},

{}^{1}{\overline{D}} = \exp (\frac{H_{P}}{\overline{h}}) = \exp (- \frac{1}{\overline{h}} \sum_{i \in I} h_{i} \sum_{b \in B_{i}} s_{b} \log s_{b}) = \exp (\frac{\sum_{i \in I} h_{i} \log {}^{1}D (Q_{i})}{\sum_{i \in I} S_{i} h_{i}}),

where $Q_{i} = \{s_{b} : b \in B_{i}\}$ .

A first problem with these definitions is that, for non-ultrametric trees, phylogenetic entropy lacks a clear interpretation. This issue is due to $H_{P}$ being defined in terms of sets of branch sizes $Q_{i}$ instead of sets of within-interval proportional branch sizes $P_{i} = \{p_{b} = s_{b} / S_{i} : b \in B_{i}\}$ , as illustrated by the following example.

Example 0.4

Consider the three-node, two-leaf tree with leaf sizes $p$ and $1 - p$ , and leaf depths $1 + λ$ and $λ$ , respectively (Figure 4a). For this tree, as $λ \to 0$ ,

\begin{matrix} P D = 1 + 2 λ \to 1, \\ \exp H_{P} = \exp [- (1 + λ) p \log p - λ (1 - p) \log (1 - p)] \to p^{- p} . \end{matrix}

Fig. 4. — a) The two-leaf tree considered in Examples 0.4 and 0.5. b) Index values for the tree of Example 0.5 with $p = \frac{1}{4}$ . As branch length $λ$ decreases, the previously defined indices ${}^{0}{\overline{D}}$ and ${}^{1}{\overline{D}}$ (grey curves) increase monotonically until both ${}^{0}{\overline{D}} > {}^{0}D$ and ${}^{1}{\overline{D}} > {}^{0}D$ . In contrast, our new indices ${{}^{0}D}_{L}$ and ${{}^{1}D}_{L}$ (black curves) decrease monotonically as $λ$ decreases, with ${{}^{0}D}_{L} < {}^{0}D$ and ${{}^{1}D}_{L} < {}^{0}D$ for all values of $λ$ .

Therefore $P D$ behaves as expected but, except when $p = 0$ or $p = 1$ , $\exp H_{P}$ approaches a limit greater than 1. Hence, for sufficiently small $λ$ , $\exp H_{P}$ (which is supposed to be a measure of diversity) is greater than $P D$ (a measure of richness). Moreover, whereas we expect diversity to be maximal when node sizes are equal, the limiting value of $\exp H_{P}$ is maximal when the node sizes are unequal (specifically, $\exp H_{P} \approx 1.44$ when $p = e^{- 1} \approx 0.37$ ). If we instead use our index $H_{P}^{'}$ (Equation 0.13) then we obtain

\exp H_{P}^{'} = \exp [λ (- p \log p - (1 - p) \log (1 - p))] \to 1,

as we would expect.
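The limits in Example 0.4 are easy to confirm numerically. The sketch below simply evaluates the closed-form expressions given above for the two-leaf tree; the function names are hypothetical and the small value of $λ$ stands in for the limit $λ \to 0$.

```python
import math

def xlogx(p):
    """p * log(p), with the usual convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0

def two_leaf_entropy_indices(p, lam):
    """Return (PD, exp H_P, exp H'_P) for the two-leaf tree of Example 0.4.

    Assumption: leaf sizes p and 1 - p at depths 1 + lam and lam, as in the text.
    """
    pd = (1 + lam) + lam
    h_p = -(1 + lam) * xlogx(p) - lam * xlogx(1 - p)
    h_p_prime = lam * (-xlogx(p) - xlogx(1 - p))
    return pd, math.exp(h_p), math.exp(h_p_prime)

p_star = math.exp(-1)  # the leaf size at which the limit p ** (-p) peaks
pd, exp_hp, exp_hp_prime = two_leaf_entropy_indices(p_star, 1e-9)
print(pd, exp_hp, exp_hp_prime)
```

With $λ$ close to zero, the second value stays near $e^{1 / e}$ while the first and third are both close to 1.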

A second problem is that if the tree is not ultrametric then ${}^{0}{\overline{D}}$ and ${}^{1}{\overline{D}}$ do not correspond to weighted means. If and only if the tree is leafy and ultrametric, $S_{i} = 1$ and $Q_{i} = P_{i}$ for all $i$ and so

{}^{0}{\overline{D}} = M_{long, 1} ({}^{0}D), {}^{1}{\overline{D}} = M_{long, 0} ({}^{1}D) = {{}^{1}D}_{L},

with $w_{i} = S_{i} h_{i} = h_{i}$ in both cases. Otherwise, the numerator weights $h_{i}$ are unequal to the denominator weights $S_{i} h_{i}$ . As previously noted (Chao et al., 2010; Leinster and Cobbold, 2012), this implies that ${}^{0}{\overline{D}}$ and ${}^{1}{\overline{D}}$ can take values exceeding the number of counted nodes when applied to non-ultrametric (or non-leafy) trees. Therefore these normalized indices lack a universal interpretation in terms of effective numbers of counted nodes (or extant types) (Leinster and Cobbold, 2012).

We avoid both problems by defining our richness and diversity indices as weighted means of the within-interval proportional branch sizes in all cases. As illustrated by the following example, the differences between ${{}^{0}D}_{L}$ and ${}^{0}{\overline{D}}$ and between ${{}^{1}D}_{L}$ and ${}^{1}{\overline{D}}$ are generally unbounded and can be relatively large even when branch sizes and node sizes are not very unequal.

Example 0.5

Consider the three-node tree of Figure 4a with $p < \frac{1}{2}$ . We have ${}^{0}D = 2$ , $\overline{h} = p + λ$ , and

\begin{matrix} {}^{0}{\overline{D}} = \frac{(1 + λ) + λ}{p + λ} > \frac{1 + 2 λ}{\frac{1}{2} + λ} = 2, \\ {}^{1}{\overline{D}} = \exp (\frac{- (1 + λ) p \log p - λ (1 - p) \log (1 - p)}{p + λ}) \to \frac{1}{p} > 2 \text{ as } λ \to 0 . \end{matrix}

It follows that ${}^{0}{\overline{D}} > {}^{0}D$ for all $λ$ , and we can choose $λ$ sufficiently small such that also ${}^{1}{\overline{D}} > {}^{0}D$ (Figure 4b, grey curves). For the same three-node tree, our new indices are instead

\begin{matrix} {{}^{0}D}_{L} = \exp (\frac{λ \log 2}{p + λ}) < \exp (\frac{λ \log 2}{λ}) = 2, \\ {{}^{1}D}_{L} = \exp (\frac{λ (- p \log p - (1 - p) \log (1 - p))}{p + λ}) < \exp (\frac{λ \log 2}{λ}) = 2 . \end{matrix}

Therefore ${{}^{0}D}_{L} < {}^{0}D$ and ${{}^{1}D}_{L} < {}^{0}D$ for all $λ ⩾ 0$ , as we would expect (Figure 4b, black curves). As $λ \to 0$ , both ${{}^{0}D}_{L}$ and ${{}^{1}D}_{L}$ approach 1, consistent with the fact that the tree has exactly one non-root node when $λ = 0$ . As $λ \to \infty$ , the tree becomes increasingly close to being an ultrametric star tree, and hence ${{}^{0}D}_{L} \to {}^{0}{\overline{D}}$ and ${{}^{1}D}_{L} \to {}^{1}{\overline{D}}$ (convergence between dashed curves and between solid curves in Figure 4b).
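To make the contrast in Example 0.5 concrete, the following sketch computes both pairs of indices from a list of transverse intervals. It is an illustrative calculation under stated assumptions rather than a reference implementation: the interval representation and the function names are hypothetical, and it assumes $0 < p < 1$ .

```python
import math

def neg_xlogx_sum(values):
    """-sum(v * log v) over the positive entries of `values`."""
    return -sum(v * math.log(v) for v in values if v > 0)

def two_leaf_diversities(p, lam):
    """Return (0D-bar, 1D-bar, 0D_L, 1D_L) for the tree of Example 0.5."""
    # Transverse intervals as (length h_i, sizes s_b of the branches present).
    intervals = [(lam, [p, 1 - p]), (1.0, [p])]
    h_bar = sum(h * sum(sizes) for h, sizes in intervals)  # equals p + lam

    # Indices of Chao et al.: raw sizes Q_i, numerator weights h_i.
    d0_bar = sum(h * len(sizes) for h, sizes in intervals) / h_bar
    d1_bar = math.exp(sum(h * neg_xlogx_sum(sizes) for h, sizes in intervals) / h_bar)

    # Longitudinal means: proportional sizes P_i, weights S_i * h_i.
    log_d0 = log_d1 = 0.0
    for h, sizes in intervals:
        s_i = sum(sizes)
        props = [s / s_i for s in sizes]
        log_d0 += s_i * h * math.log(len(props))
        log_d1 += s_i * h * neg_xlogx_sum(props)
    return d0_bar, d1_bar, math.exp(log_d0 / h_bar), math.exp(log_d1 / h_bar)

print(two_leaf_diversities(0.25, 0.1))
```

The only differences between the two families are the use of $P_{i}$ in place of $Q_{i}$ and of $S_{i} h_{i}$ in place of $h_{i}$ as numerator weights, yet for small $λ$ the first two values exceed the node count of 2 while the last two do not.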

${{}^{q}D}_{L}$ is more self-consistent and intuitive than the ${}^{q}{\overline{D}}$ of Chao et al. (2010)

Additional problems with the ${}^{q}{\overline{D}}$ indices of Chao et al. (2010) are that they are not self-consistent, and that they have counter-intuitive properties when $q > 1$ . The general definition can be expressed as

{}^{q}{\overline{D}} = {(\frac{1}{\overline{h}} \sum_{i \in I} h_{i} \sum_{b \in B_{i}} s_{b}^{q})}^{\frac{1}{1 - q}},

which can be restructured as

{}^{q}{\overline{D}} = {(\frac{1}{\overline{h}} \sum_{i \in I} h_{i} {[{(\sum_{b \in B_{i}} s_{b}^{q})}^{\frac{1}{1 - q}}]}^{1 - q})}^{\frac{1}{1 - q}} = {(\frac{1}{\overline{h}} \sum_{i \in I} h_{i} {[{}^{q}D (Q_{i})]}^{1 - q})}^{\frac{1}{1 - q}} .

Hence for leafy ultrametric trees we have

{}^{q}{\overline{D}} = M_{long, 1 - q} ({}^{q}D),

with $w_{i} = h_{i} = S_{i} h_{i}$ . We have thus shown that, in the case of leafy ultrametric trees, every ${}^{q}{\overline{D}}$ can be expressed as a weighted mean of within-interval diversities. But ${}^{0}{\overline{D}}$ is the weighted arithmetic mean, ${}^{1}{\overline{D}}$ is the weighted geometric mean, and in general ${}^{q}{\overline{D}}$ is the weighted power mean of exponent $1 - q$ . One consequence is that, for ultrametric trees in which every transverse interval contains branches of equal size, the set of within-interval values will be the same for every $q$ value but the ${}^{q}{\overline{D}}$ values will be different. Moreover, as $q$ becomes larger, ${}^{q}{\overline{D}}$ increasingly gives larger weight to smaller within-interval diversities. As $q \to \infty$ , the ${}^{q}D$ value assigned to each interval approaches the reciprocal of the maximum branch size within the interval. Counter-intuitively, ${}^{q}{\overline{D}}$ approaches the minimum of these within-interval ${}^{q}D$ values.

These peculiar properties of ${}^{q}{\overline{D}}$ are unnecessary and have no obvious advantages. The Hill numbers ${}^{q}D$ , which are used to assign a diversity value to each interval, necessarily relate to different types of weighted mean (Equation 0.1). But the method of averaging between intervals need not depend on the method of calculating diversity within intervals. Every Hill number ${}^{q}D$ can be extended to account for tree shape using the weighted arithmetic mean, the weighted geometric mean, or any other weighted power mean of the within-interval diversities by varying exponent $r$ of the longitudinal mean diversity index

M_{long, r} ({}^{q}D) = {(\frac{1}{\overline{h}} \sum_{i \in I} S_{i} h_{i} {[{}^{q}D (P_{i})]}^{r})}^{\frac{1}{r}} .

The same choice exists when defining node-wise means and star means. To avoid incompatibilities within our system, we define all our diversity indices as weighted geometric means ( $r \to 0$ ). The following example illustrates the problem and our solution.

Example 0.6

Consider again the four-leaf tree of Example 0.3 (Figure 3a). The longitudinal mean diversity values assigned to this tree are

\begin{matrix} {{}^{q}D}_{L} = {}^{1}{\overline{D}} = \exp (λ \log 2 + (1 - λ) \log 4) = 2^{2 - λ} \text{ for all } q ⩾ 0, \\ \text{and } {}^{0}{\overline{D}} = 2 λ + 4 (1 - λ) = 2 (2 - λ), \end{matrix}

which are unequal except when $λ = 0$ or $λ = 1$ (Figure 3b, black and grey curves). In particular, in the case of uniform branch lengths $(λ = \frac{1}{2})$ , we find ${}^{1}{\overline{D}} = 2 \sqrt{2} \approx 2.83$ and ${}^{0}{\overline{D}} = 3$ (Figure 3c, dashed curve). As derived in Example 0.3, the node-wise mean diversity for this tree is

{{}^{q}D}_{N} = \begin{cases} 4^{1 - λ} & \text{if } λ < \frac{1}{2} \\ 2 & \text{otherwise,} \end{cases}

for all $q ⩾ 0$ . Choosing the arithmetic mean instead of the geometric mean would instead give

M_{node, 1} ({}^{q}D) = \begin{cases} 4 (1 - λ) & \text{if } λ < \frac{1}{2} \\ 2 & \text{otherwise.} \end{cases}

Hence ${{}^{q}D}_{N} : = M_{node, 0} ({}^{q}D) \neq M_{node, 1} ({}^{q}D)$ for all $q ⩾ 0$ and all $λ$ with $0 < λ < \frac{1}{2}$ (Figure 3b, dark blue and pale blue curves). As $q \to \infty$ , ${}^{q}{\overline{D}} \to 2$ (Figure 3c, dashed curve), while ${{}^{q}D}_{L}$ remains constant (Figure 3c, solid line).
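The dependence of ${}^{q}{\overline{D}}$ on $q$ in Example 0.6 can be reproduced with a generic weighted power mean. This is a sketch for the leafy ultrametric four-leaf tree only, where $S_{i} = 1$ and $Q_{i} = P_{i}$ ; the helper names `hill`, `power_mean` and `four_leaf_indices` are hypothetical.

```python
import math

def hill(props, q):
    """Hill number of order q for proportions that sum to one."""
    if q == 1:
        return math.exp(-sum(p * math.log(p) for p in props if p > 0))
    return sum(p ** q for p in props if p > 0) ** (1 / (1 - q))

def power_mean(values, weights, r):
    """Weighted power mean of exponent r (geometric mean when r == 0)."""
    total = sum(weights)
    if r == 0:
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)
    return (sum(w * v ** r for v, w in zip(values, weights)) / total) ** (1 / r)

def four_leaf_indices(q, lam=0.5):
    """Return (qD-bar of Chao et al., qD_L) for the four-leaf tree."""
    # Two transverse intervals: two equal branches, then four equal branches.
    intervals = [(lam, [0.5, 0.5]), (1 - lam, [0.25] * 4)]
    lengths = [h for h, _ in intervals]
    diversities = [hill(props, q) for _, props in intervals]
    chao = power_mean(diversities, lengths, 1 - q)  # exponent depends on q
    ours = power_mean(diversities, lengths, 0)      # always the geometric mean
    return chao, ours

for q in (0, 1, 2, 200):
    print(q, four_leaf_indices(q))
```

Because the within-interval Hill numbers are 2 and 4 for every $q$ , any change in the first returned value comes solely from the exponent $1 - q$ of the between-interval mean.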

${{}^{q}J}_{N}$ improves on all prior tree balance and imbalance indices

As previously explained (Lemant et al., 2022) and as summarized in Tables 1 and 3, conventional tree balance and imbalance indices including Sackin’s index, Colless’ index, the total cophenetic index, and others (reviewed by Fischer et al. (2023)) have important shortcomings. In the first place, these indices account for neither node sizes nor branch lengths. This means, for example, that these indices consider all star trees maximally balanced and all caterpillar trees maximally imbalanced, even as the relative sizes of some nodes or the relative lengths of some branches approach zero (Figure 5, green lines). The tree balance index $J^{q}$ defined by Lemant et al. (2022) varies continuously with changing node sizes but is independent of branch lengths (Figure 5, dashed purple curves). ${{}^{q}J}_{N}$ improves on $J^{q}$ by also varying continuously with branch lengths (Figure 5, solid purple curves).

Fig. 5. — Values of three tree balance indices for a tree undergoing continuous changes. $J^{1}$ is the index introduced by Lemant et al. (2022), which is equal to ${{}^{1}J}_{N}$ in the central third of the plot. $I_{S, norm}$ is the normalized Sackin index, which is undefined for the leftmost, linear tree. We plot $1 - I_{S, norm}$ for fair comparison because $I_{S, norm}$ is an imbalance index whereas $J^{1}$ and ${{}^{1}J}_{N}$ are balance indices. The normalized Colless index is equal to $I_{S, norm}$ in the rightmost third of the plot and is otherwise undefined. The normalized total cophenetic index is equal to $I_{S, norm}$ throughout the plot.

Lemant et al. (2022) further showed that, even when restricted to the tree types on which conventional tree balance indices are defined, and even when all node sizes are equal, $J^{q}$ enables a more meaningful comparison of trees with different degree distributions or different numbers of leaves. For example, when applied to leafy caterpillar trees with uniform branch lengths and uniform node sizes, $J^{q}$ considers long trees (those with many leaves) to be less balanced than short ones, whereas conventional indices consider them equally imbalanced. ${{}^{q}J}_{N}$ , as an extension of $J^{q}$ , shares this useful property.

Inequalities between indices

Choosing self-consistent definitions ensures that our diversity indices are related by simple sets of inequalities, which formalize and generalize the results of previous sections (Figure 6a). Hill (1973) showed that ${}^{q}D ⩾ {}^{r}D$ for all $r ⩾ q ⩾ 0$ . Because ${{}^{q}D}_{L}$ and ${{}^{q}D}_{N}$ are geometric weighted means of ${}^{q}D$ values with weights independent of $q$ , it follows that they obey corresponding inequalities:

Fig. 6. — a) Inequalities between diversity indices for all $q ⩾ 0$ and all $r ⩾ q$ . b) Examples of leafy trees with uniform branch lengths for which various index values are equal for all $q, r ⩾ 0$ . The top left corner of each panel contains a grid, whose twelve squares correspond to the twelve indices shown in the key. A line connecting two grid squares indicates that the corresponding indices are equal for the tree shown in the panel. Instances where evenness indices are equal to 1 are indicated in the third grid column. c) A tree for which ${{}^{q}J}_{L} > {}^{q}J$ and ${{}^{q}J}_{L} {{}^{q}J}_{N}$ .

Property 0.1

For all rooted trees, ${{}^{q}D}_{L} ⩾ {{}^{r}D}_{L}$ , ${{}^{q}D}_{N} ⩾ {{}^{r}D}_{N}$ and ${{}^{q}D}_{S} ⩾ {{}^{r}D}_{S}$ for all $r ⩾ q ⩾ 0$ .

Additional inequalities exist between different types of diversity index but not among the evenness indices:

Proposition 1

For all rooted trees, ${}^{0}D ⩾ {{}^{q}D}_{L}$ for all $q ⩾ 0$ . For all leafy ultrametric trees, but not for all rooted trees, ${}^{q}D ⩾ {{}^{q}D}_{L}$ for all $q ⩾ 0$ .

Proposition 2

For all rooted trees, ${{}^{q}D}_{L} ⩾ {{}^{q}D}_{N}$ for all $q ⩾ 0$ .

Proposition 3

For $q > 0$ , no single ordering of ${}^{q}J$ , ${{}^{q}J}_{L}$ and ${{}^{q}J}_{N}$ applies to all leafy ultrametric trees.

Proofs of these three propositions can be found in the Appendix. Informally, the reason why the second inequality in Proposition 1 applies only to leafy ultrametric trees is that ${{}^{q}D}_{L}$ , unlike ${}^{q}D$ , is independent of the size of the root node (and any node arbitrarily close to the root).

Special cases

Our consistent definitions further yield numerous simple equations that unite our indices in special cases. To simplify the statement of these results, we will assume that all branch sizes are greater than zero. This assumption implies no loss of generality because our index definitions are invariant to the addition or removal of subtrees containing only zero-sized branches (which in an evolutionary tree correspond to extinct lineages). The properties in this section hold for all $q$ , $r ⩾ 0$ .

We begin with cases in which diversities based on the same type of average but with different $q$ values are equal. These first four properties, which are illustrated by simple examples in the top row of Figure 6b, follow immediately from the definitions.

Property 0.2

${}^{q}D = {}^{r}D \Leftrightarrow {}^{q}J = 1$ if and only if all counted nodes have equal size.

Property 0.3

${{}^{q}D}_{L} = {{}^{r}D}_{L} \Leftrightarrow {{}^{q}J}_{L} = 1$ if and only if the branch sizes at every depth are equal. This also implies ${{}^{q}D}_{N} = {{}^{r}D}_{N}$ and ${{}^{q}J}_{N} = 1$ .

Property 0.4

${{}^{q}D}_{N} = {{}^{r}D}_{N} \Leftrightarrow {{}^{q}J}_{N} = 1$ if and only if every internal node’s child branches have equal size.

Property 0.5

$({}^{q}D = {}^{r}D \text{ and } {{}^{q}D}_{L} = {{}^{r}D}_{L}) \Leftrightarrow ({}^{q}J = 1 \text{ and } {{}^{q}J}_{L} = 1)$ if and only if the branch sizes at every depth are equal and all node sizes are equal. This implies that the tree is ultrametric and perfectly symmetric, and that ${{}^{q}D}_{N} = {{}^{r}D}_{N}$ and ${{}^{q}J}_{N} = 1$ .

In other special cases, we find equality among diversities of different types but with equal $q$ values. Again, these properties are directly implied by the definitions. Simple examples are shown in the middle row of Figure 6b.

Property 0.6

${{}^{q}D}_{L} = {}^{q}D$ if and only if the tree is a leafy ultrametric tree in which no non-root node has outdegree greater than 1. This also implies ${{}^{q}J}_{L} = {}^{q}J$ .

Property 0.7

${{}^{q}D}_{N} = {{}^{q}D}_{L}$ if and only if the tree is a piecewise star tree. This also implies ${{}^{q}J}_{N} = {{}^{q}J}_{L}$ .

Property 0.8

${{}^{q}D}_{S} = {{}^{q}D}_{N} = {{}^{q}D}_{L}$ if and only if the tree is a star tree. This also implies ${{}^{q}J}_{S} = {{}^{q}J}_{N} = {{}^{q}J}_{L}$ .

Property 0.9

${{}^{q}D}_{S} = {{}^{q}D}_{N} = {{}^{q}D}_{L} = {}^{q}D$ if and only if the tree is a leafy ultrametric star tree. This also implies ${{}^{q}J}_{S} = {{}^{q}J}_{N} = {{}^{q}J}_{L} = {}^{q}J$ .

It follows that equality both within and between types applies under more restrictive conditions, as illustrated in the bottom row of Figure 6b:

Property 0.10

${{}^{q}D}_{L} = {}^{r}D$ if and only if the tree is a leafy ultrametric tree with equally sized leaves in which only the root has outdegree greater than 1. This also implies ${{}^{q}D}_{N} = {{}^{r}D}_{N}$ , ${{}^{q}D}_{S} = {{}^{r}D}_{S}$ and ${{}^{q}J}_{S} = {{}^{q}J}_{N} = {{}^{q}J}_{L} = {}^{q}J = 1$ .

Property 0.11

${{}^{q}D}_{N} = {{}^{r}D}_{L}$ if and only if the tree is a piecewise star tree with equal branch sizes at every depth. This also implies ${{}^{q}J}_{N} = {{}^{q}J}_{L} = 1$ .

Property 0.12

${{}^{q}D}_{S} = {{}^{q}D}_{N} = {{}^{r}D}_{L}$ if and only if the tree is a star tree with equally sized leaves. This also implies ${{}^{q}J}_{S} = {{}^{q}J}_{N} = {{}^{q}J}_{L} = 1$ .

Property 0.13

${{}^{q}D}_{S} = {{}^{q}D}_{N} = {{}^{q}D}_{L} = {}^{r}D$ if and only if the tree is a leafy ultrametric star tree with equally sized leaves. This also implies ${{}^{q}J}_{S} = {{}^{q}J}_{N} = {{}^{q}J}_{L} = {}^{q}J = 1$ .

In yet another set of special cases, the evenness formulas simplify to ratios. The following two results are immediate consequences of ${{}^{0}D}_{L}$ or ${{}^{0}D}_{N}$ being constant under the specified conditions.

Property 0.14

If the branch count across the tree is constant and greater than one then

{{}^{q}J}_{L} = \frac{\log {{}^{q}D}_{L}}{\log {{}^{0}D}_{L}} .

Property 0.15

If the tree has uniform outdegree greater than one and the branches present at every depth in the tree have equal lengths then

{{}^{q}J}_{N} = \frac{\log {{}^{q}D}_{N}}{\log {{}^{0}D}_{N}} .

All properties described in this section would also hold if we were to define all our richness and diversity indices as weighted arithmetic, rather than geometric, means of interval or node values (or indeed any other weighted power mean). Our preference for geometric means will be justified in the next section.

The leafy tree identity

For an important class of trees, our index definitions lead to a surprisingly simple, fundamental connection between tree balance, Shannon’s diversity index, Sackin’s index, and outdegree. This result is less obvious than the properties of the previous section and requires a more substantial proof. We term this unifying relationship the leafy tree identity.

Lemma 0.7

If the tree is leafy and all branches have equal length $l > 0$ then

\log {{}^{1}D}_{N} = \frac{{}^{1}H l}{\overline{h}} .

If additionally all $n$ leaves have equal size then

\log {{}^{1}D}_{N} = \frac{n \log n}{I_{S}},

where $I_{S}$ is Sackin’s index.

Proof.

The proof is identical to the proof of Proposition 6 in Lemant et al. (2022), except for the base of the logarithms and the additional factor $l$ .

Proposition 4

(The leafy tree identity; generalization of Proposition 6 in Lemant et al. (2022)) If the tree is leafy and has uniform branch lengths and all internal nodes have outdegree $m > 1$ then

{{}^{1}J}_{N} = \frac{{}^{1}H l}{\overline{h} \log m} .

(0.14)

If additionally all $n$ leaves have equal size then

{{}^{1}J}_{N} = \frac{n \log_{m} n}{I_{S}} .

(0.15)

Proof.

The result follows immediately from Lemma 0.7 and Property 0.15.

The leafy tree identity implies that, among leafy trees with uniform branch lengths and uniform outdegrees, tree balance depends only on node sizes and node depths. If two such trees have equal effective heights relative to branch length $(\overline{h} / l)$ , equal outdegrees $(m)$ , and equal node size Shannon entropy values $(^{1} H)$ then they must have equal balance $({{}^{1}J}_{N})$ , irrespective of topology and number of leaves. For example, Figure 7a and 7b show a pair of bifurcating leafy ultrametric trees with uniform leaf sizes and uniform branch lengths. Because these trees have equal outdegrees, leaf counts, and Sackin’s index values, the special form of the leafy tree identity (Equation 0.15) implies they must be equally balanced (other equal index values are recorded in Figure 7f). The following example applies the more general form of the leafy tree identity (Equation 0.14) to trees that are less obviously similar.

Fig. 7. — a-b) Two leafy bifurcating trees with uniform node sizes and uniform branch lengths, which differ in topology but are equally balanced. c-e) Three leafy bifurcating trees with uniform branch lengths, which differ in topology and number of leaves but are equally balanced. Nodes are labelled with their sizes. f) Table recording where pairs of trees have equal or unequal index values. Parameter $q$ can take any non-negative value.

Example 0.8

Consider the bifurcating leafy ultrametric tree with four leaves, uniform branch lengths, and leaf sizes $\frac{3}{8}$, $\frac{1}{8}$, $\frac{1}{4}$ and $\frac{1}{4}$ (Figure 7c). Now suppose we retain the leaf sizes but rearrange the nodes and branches to form a caterpillar tree with the node of size $\frac{3}{8}$ at depth $l$ and one of the nodes of size $\frac{1}{4}$ at depth $2 l$ (Figure 7d). Finally, consider a six-leaf caterpillar tree with uniform branch lengths and proportional leaf sizes (in order of increasing depth) $\frac{1}{2}$, $\frac{1}{4}$, $x$, $y$, $p$ and $p$, where $x = 2 p$, $y = \frac{1}{4} - 4 p$ and $p \approx 0.026606$ (Figure 7e). All three trees have identical values of $m$, $\overline{h}$ and ${}^{1}H$ (see Appendix for derivation). Hence the leafy tree identity implies that they have equal ${{}^{1}D}_{N}$ and ${{}^{1}J}_{N}$ values. All three trees also have equal values of ${}^{1}D = \exp {}^{1}H \approx 3.75$ and ${{}^{0}D}_{N} = m = 2$. Other index values shared by pairs of trees are indicated in Figure 7f.
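
The equalities claimed for the two four-leaf trees can be checked numerically. The following Python sketch is an illustration rather than the authors' implementation: it assumes each leafy tree is summarized by its proportional leaf sizes and its leaf depths in units of the branch length $l$, and all function names are hypothetical. The depths for Figure 7d follow the description above, with the two remaining leaves (sizes $\frac{1}{8}$ and $\frac{1}{4}$) at depth $3 l$.

```python
import math

def shannon_entropy(sizes):
    """Shannon entropy (natural log) of proportional node sizes, i.e. ^1H."""
    return -sum(f * math.log(f) for f in sizes if f > 0)

def relative_effective_height(sizes, depths):
    """Size-weighted mean leaf depth in units of branch length, i.e. h-bar / l."""
    return sum(f * d for f, d in zip(sizes, depths))

def leafy_tree_balance(sizes, depths, m):
    """^1J_N from Equation 0.14; only meaningful for leafy trees with
    uniform branch lengths and uniform outdegree m > 1."""
    height = relative_effective_height(sizes, depths)
    return shannon_entropy(sizes) / (height * math.log(m))

# Figure 7c: all four leaves at depth 2 (ultrametric).
tree_c = ([3/8, 1/8, 1/4, 1/4], [2, 2, 2, 2])
# Figure 7d: caterpillar tree with the same leaf sizes.
tree_d = ([3/8, 1/4, 1/8, 1/4], [1, 2, 3, 3])

balance_c = leafy_tree_balance(*tree_c, m=2)
balance_d = leafy_tree_balance(*tree_d, m=2)
print(balance_c, balance_d)
```

Both trees have relative effective height 2 and the same entropy, so the two printed balance values agree up to floating-point rounding.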

Equation 0.15 is especially useful because the numerator $n \log_{m} n$ is the minimum value that $I_{S}$ can attain on leafy $n$-leaf trees with uniform branch lengths, uniform node sizes, and uniform outdegree $m > 1$. Hence $(n \log_{m} n) / I_{S}$ lies between 0 and 1 and is equal to 1 if and only if the tree is fully balanced. We previously showed (Proposition 7 in Lemant et al. (2022)) that, among all node-wise arithmetic mean indices with $w_{i} = n_{i}$, ${{}^{1}J}_{N}$ is the only index that satisfies Equation 0.15. Our previous proof can be straightforwardly generalized to show that Equation 0.15 cannot hold for any index of the form $M_{node, r} ({}^{q}J)$ with $r \neq 1$ or $q \neq 1$. Therefore ${{}^{1}J}_{N}$ is the only tree balance index for which this useful, unifying identity holds.
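
As a small illustration of Equation 0.15, the sketch below evaluates $(n \log_{m} n) / I_{S}$ from a list of leaf depths. It assumes equally sized leaves, uniform branch lengths, uniform outdegree $m$, and depths counted in branches; the function name `sackin_balance` is hypothetical and is not taken from the authors' software.

```python
import math

def sackin_balance(leaf_depths, m):
    """(n log_m n) / I_S for a leafy tree with equally sized leaves, uniform
    branch lengths and uniform outdegree m; depths are counted in branches."""
    n = len(leaf_depths)
    sackin_index = sum(leaf_depths)  # Sackin's index: sum of leaf depths
    return n * math.log(n, m) / sackin_index

balanced = sackin_balance([2, 2, 2, 2], m=2)      # fully balanced four-leaf tree
caterpillar = sackin_balance([1, 2, 3, 3], m=2)   # four-leaf caterpillar tree
print(balanced, caterpillar)
```

The fully balanced tree attains the maximum value 1, whereas the caterpillar tree, with $I_{S} = 9$, gives $8 / 9$.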

An example cross-disciplinary application

We illustrate the universality of our methods by using them to compare the shapes of two trees from different fields of research, representing dissimilar processes and constructed using different methods. The first of these trees depicts the evolution of the Human Immunodeficiency Virus (HIV) within a host, as inferred from molecular data and as used in another recent study of tree shape indices (Barzilai and Schrago, 2023). The second tree represents the diversification of the Uralic language family (Honkola et al., 2013). To simplify the exposition we assign size zero to all internal nodes and an equal size to all leaves.

If we disregard the inferred branch lengths then it is difficult by eye to assess which tree is the more diverse or more balanced (Figure 8a, b). These apparent similarities are borne out in the shape index values (Figure 8c, d). Excepting one node, both trees are bifurcating and therefore both have ${{}^{0}D}_{N} \approx 2$ . The two trees have similar branch counts in total ( ${{}^{0}D}_{S} = 33$ and 32) and at each depth $(3 < {{}^{0}D}_{L} < 4)$ . The ${{}^{1}D}_{N}$ , ${{}^{1}D}_{S}$ and ${{}^{1}D}_{L}$ values are somewhat lower than the corresponding richness values due to imbalances, as captured by our evenness indices, which are likewise similar for the two trees ( ${{}^{1}J}_{N}$ and ${{}^{1}J}_{L}$ between 0.7 and 0.8; ${{}^{1}J}_{S} \approx 0.86$ ). Lemma 0.7 further implies similar $I_{S}$ values (93 and 97).

Fig. 8. — a-b) Trees with equalized branch lengths representing the within-host evolution of HIV (a) and the evolutionary history of the Uralic languages (b). c) Diversity index values for the two trees with equalized branch lengths. d) Evenness index values for the two trees with equalized branch lengths. e-f) The same trees but with the originally inferred branch lengths. g) Diversity index values, accounting for branch lengths. h) Evenness index values, accounting for branch lengths. In all cases, leaves are assigned equal size and internal nodes are assigned size zero. The HIV tree was sourced from the GitHub repository associated with Barzilai and Schrago (2023) (file PIC38051.tre) and the languages tree from the D-PLACE database (Kirby et al., 2016) (folder honkola et al2013).

When we restore the inferred branch lengths, the two trees no longer look alike (Figure 8e, f). The HIV phylogeny approximates a non-ultrametric star tree, with long branches originating close to the root. The average effective out-degree of the HIV tree, accounting for unequal branch lengths, is substantially higher than two ( ${{}^{0}D}_{N} \approx 9$ ); the effective number of branches is three times lower than when branch lengths are ignored $({{}^{0}D}_{S} \approx 11)$ ; and there are more than twice as many parallel branches ( ${{}^{0}D}_{L} \approx 10$ ). Because the HIV tree is approximately a star tree with equal node sizes, all its diversity indices are approximately equal and all its evenness indices are close to one (Property 0.12). In the case of the languages tree, accounting for the inferred branch lengths – which are approximately exponentially distributed and not nearly so depth-dependent – has only a small effect on most index values. The diversity indices for the languages tree remain far from equal. Altogether our indices thus show that the HIV tree is much bushier, has a larger number of effective types, and is in every sense more balanced than the languages tree (Figure 8g, h).

In summary, the clear differences between these two trees, implying different modes of evolution, are captured only by indices that account for their different branch length distributions. An analysis based on prior tree balance indices, which ignore branch lengths, would incorrectly conclude that the trees have very similar shapes and plausibly resulted from similar processes.

An example application to model-generated trees

As a final demonstration of the potential for our indices to distinguish trees generated by different processes, we reanalyse results of a recent computational modelling study of tumour evolution by Lewinsohn et al. (2023). The original study sought to infer differences between the shapes of evolutionary trees corresponding to alternative modes of tumour expansion – boundary-driven growth (BDG) versus unrestricted growth. On average, the BDG model was found to generate ultrametric time trees with higher variance in their terminal branch lengths, and non-ultrametric gene trees with higher variance in their leaf depths (mutations per cell).

To see how our tree shape indices vary with simulated tumour growth mode, we consider the two representative simulated tumours from Figure 1 of Lewinsohn et al. (2023). The time trees (Figure 9a-c) have the same number of leaves and almost identical effective numbers of non-root nodes ( ${{}^{0}D}_{S} \approx 117$ and 118). However, the BDG time tree has 22% higher effective branch count ( ${{}^{1}D}_{S} \approx 65$ versus 53), 26% higher branch count across the tree ( ${{}^{0}D}_{L} \approx 28$ versus 22), and 25% higher leaf diversity ( ${{}^{1}D}_{L} \approx 21$ versus 17).

Fig. 9. — a-b) Time trees generated by computational models of tumour evolution with boundary-driven growth (a) or unrestricted growth (b). Leaves represent extant cells and branch lengths are proportional to time elapsed between cell division events. c) Tree shape index ratios for the two time trees. d-e) Gene trees generated by the same simulations as the time trees. Leaves represent extant cells and branch lengths are proportional to genetic distances. f) Tree shape index ratios for the two gene trees. All tree data was obtained from the GitHub repository associated with Lewinsohn et al. (2023).

The gene trees (Figure 9d-f) likewise have the same number of leaves and almost identical effective numbers of non-root nodes ( ${{}^{0}D}_{S} \approx 136$ in both cases). But the BDG gene tree, being less star-like, has substantially lower average effective outdegree ( ${{}^{0}D}_{N} \approx 2.6$ versus 3.0; ${{}^{1}D}_{N} \approx 2.1$ versus 2.6), 20% fewer branches across the tree ( ${{}^{0}D}_{L} \approx 17$ versus 21), and 26% lower leaf diversity ( ${{}^{1}D}_{L} \approx 11$ versus 15). The BDG gene tree is also less balanced ( ${{}^{1}J}_{N} \approx 0.76$ versus 0.84).

Whereas well chosen problem-specific indices might give greater statistical power for distinguishing particular tree types, an advantage of our multi-dimensional system is that it is designed to be universally applicable, to facilitate comparisons between studies and data sets. Leaf depth variance, for instance, cannot by itself tell apart ultrametric trees, while terminal branch length variance is inapplicable to trees with uniform (or unknown) branch lengths.

Discussion

The seminal paper of Hill (1973) cautions that there is “almost unlimited scope for mathematical generality in relation to measures of diversity and taxonomic difference” and therefore “Simple and well-understood indices should be used”. In accordance with this advice, here we have constructed new tree shape indices as weighted means of the most standard, basic diversity and evenness indices. This systematic approach ensures that all our indices are not only robust and universally applicable but also have simple, consistent interpretations and clear interrelationships.

Some of the indices we have defined here are refinements of prior approaches to assessing tree shape. Our ${{}^{q}D}_{L}$ and $H_{P}^{'}$ are similar to the ${}^{q}{\overline{D}}$ of Chao et al. (2010) and the phylogenetic entropy $H_{P}$ of Allen et al. (2009), respectively, but are more self-consistent and can be meaningfully applied to non-ultrametric trees. ${{}^{q}J}_{N}$ builds on the ideas of Lemant et al. (2022) but, by accounting for branch lengths – a key advantage of prior phylogenetic diversity and phylogenetic entropy indices, not shared by any prior tree balance indices – generalizes the concept of tree balance to a wider class of trees. These new indices share all the desirable properties but not the shortcomings of their predecessors and can therefore universally supersede them (Table 3). For the remainder of our indices describing average effective out-degree, effective numbers of nodes and branches, and evenness of branch sizes, we know of no precedents. In combination, our indices provide a more sophisticated, general, multidimensional description of tree shape than has previously been possible.

Whereas we have focussed on a system built around ${}^{q}D$ and ${}^{q}J$, it is easy to use our general definitions of the longitudinal, node-wise, and star means to quantify other aspects of tree shape. A parallel, self-consistent system of indices can be defined by setting $w_{i} = h_{i}$ instead of $w_{i} = S_{i} h_{i}$ in Equations 0.3 and 0.4, and setting $w_{k} = 1$ instead of $w_{k} = S_{C_{k}}$ in Equations 0.8 and 0.9. These indices, which are robust to small changes in branch lengths but not node sizes, are normalized by dividing by $h$ instead of $\overline{h}$. Alternatively, ${}^{q}D$ can be replaced by another basic diversity index, or ${}^{q}J$ by another evenness index, such as the ratio ${}^{q}D / {}^{0}D$ preferred by Hill (1973) (see also Smith and Wilson (1996); Jost (2010); Tuomisto (2012)). Based on the means, we can also straightforwardly derive expressions for higher moments to obtain indices that, for example, quantify how much effective out-degree varies across all nodes or varies with node depth.

There are nevertheless several reasons for preferring our specific definitions. First, the foundational ${}^{q}D$ and ${}^{q}J$ are the most popular diversity and evenness indices among biologists (Tucker et al., 2017; Tuomisto, 2012). Second, defining entropy and evenness indices as weighted arithmetic means, and diversity indices as weighted geometric means, results in relatively simple expressions, especially in the case of leafy ultrametric trees. Third, ${{}^{1}J}_{N}$ is the only universal tree balance index for which the unifying leafy tree identity holds. In summary, we have taken the best of the existing indices, improved them, unified them, and filled in the gaps to create a coherent system (Table 2).

Given the ubiquity of tree structures, we expect our multidimensional method of describing tree shape to empower research and inform decision making in diverse domains. Our initial development of universal, robust indices was motivated by the need to compare and categorize non-leafy, non-ultrametric trees representing the clonal evolution of human tumours, where node sizes (corresponding to cell subpopulation sizes) and branch lengths (genetic distances) convey valuable information (Noble et al., 2022). Tree structures with node sizes and branch lengths are likewise centrally important in community ecology, conservation biology, systematic biology, and the study of microbial evolution. For instance, our indices can be used instead of conventional tree balance indices to evaluate alternative models of speciation, or to investigate how the mode of evolution of a pathogenic virus varies with geographical location, time period, or strain. In place of phylogenetic diversity and phylogenetic entropy, our non-normalized diversity indices could be used to inform policy making by quantifying how different actions would affect biodiversity. Beyond biology, obvious subjects for analysis include phylogenetic trees of language evolution, hierarchical organizational structures, and the tree data structures that abound in computing. As we have illustrated, our generic indices can be used not only within but also across domains to uncover similarities and differences in, say, the evolution of organisms, languages, and technologies.

One key topic for further theoretical research is to derive the expected values and covariances of our indices under standard tree generation models, such as the uniform model and the Yule process, for comparison with empirical data. Relationships between our indices and distance-based metrics such as the mean pairwise distance (which lacks a universal normalization (Tsirogiannis et al., 2012)) also remain to be examined. In the same vein as Figure 7, we are investigating sets of distinct trees to which our indices assign equal values, to determine whether additional indices might ever be needed to distinguish between trees in typical applications. Towards establishing a universal standard for describing tree shape, we are developing software packages for calculating index values that can be integrated with popular tree inference methods. Just as the first step in analysing a set of measurements is to calculate the mean and variance, so we propose that, whenever one encounters a rooted tree, a useful first step will be to describe its shape by evaluating our indices.

Acknowledgements

We are grateful to Kerry Manson for helpful comments on an earlier draft of this manuscript, to Lucia Barzilai and Chiara Barbieri for helping us obtain suitable empirical tree data, and to anonymous reviewers for suggesting various improvements.

Funding

This work was supported by the National Cancer Institute at the National Institutes of Health (grant number U54CA217376) to RN. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

APPENDIX

Derivation of ${{}^{q}D}_{N}$ in Example 0.3

For the root $r$, we have ${\overline{h}}_{C_{r}} = λ$ and $A_{r} = \{r\}$. For either of the other internal nodes $k$, we have ${\overline{h}}_{C_{k}} = (1 - λ) / 2$ and $A_{k} = \{k, r\}$. The subtree weights are

S_{C_{r}} (x) = \begin{cases} 1 & \text{if } 0 ⩽ x < λ, \\ 0 & \text{otherwise}, \end{cases} \qquad S_{C_{k}} (x) = \begin{cases} \frac{1}{2} & \text{if } λ ⩽ x < 1, \\ 0 & \text{otherwise}. \end{cases}

The ancestor weights are

\begin{matrix} v_{r r} = \int_{0}^{\infty} S_{C_{r}} (x) \, d x = \int_{0}^{λ} 1 \, d x = λ, \qquad v_{k k} = \int_{λ}^{2 λ} S_{C_{k}} (x) \, d x = \begin{cases} \int_{λ}^{2 λ} \frac{1}{2} \, d x = \frac{λ}{2} & \text{if } λ < \frac{1}{2}, \\ \int_{λ}^{1} \frac{1}{2} \, d x = \frac{1 - λ}{2} & \text{otherwise}, \end{cases} \\ v_{r k} = \int_{2 λ}^{\infty} S_{C_{k}} (x) \, d x = \begin{cases} \int_{2 λ}^{1} \frac{1}{2} \, d x = \frac{1 - 2 λ}{2} & \text{if } λ < \frac{1}{2}, \\ 0 & \text{otherwise}. \end{cases} \end{matrix}

The node diversity values are, for all $q ⩾ 0$ ,

{}^{q}D (P_{r} (x)) = \begin{cases} 2 & \text{if } 0 ⩽ x < λ, \\ 4 & \text{if } λ ⩽ x < 1, \\ 0 & \text{otherwise}, \end{cases} \qquad {}^{q}D (P_{k} (x)) = \begin{cases} 2 & \text{if } λ ⩽ x < 1, \\ 0 & \text{otherwise}. \end{cases}

Hence

\begin{matrix} \int_{0}^{h} S_{C_{r}} (x) \, {}^{q}H_{N} (P_{r} (x)) \, d x = \int_{0}^{λ} 1 \cdot \log 2 \, d x = λ \log 2, \\ \int_{0}^{h} S_{C_{k}} (x) \, {}^{q}H_{N} (P_{k} (x)) \, d x = \int_{λ}^{1} \frac{1}{2} \log 2 \, d x = \frac{(1 - λ) \log 2}{2}, \\ \int_{0}^{h} S_{C_{k}} (x) \, {}^{q}H_{N} (P_{r} (x)) \, d x = \int_{λ}^{1} \frac{1}{2} \log 4 \, d x = (1 - λ) \log 2 . \end{matrix}

For all $q ⩾ 0$ , it follows from Equation 0.8 that if $λ ⩾ \frac{1}{2}$ then

{{}^{q}D}_{N} = \exp \left( \frac{1}{λ} λ (λ \log 2) + 2 \times \frac{1}{(1 - λ) / 2} \left( \frac{1 - λ}{2} \cdot \frac{(1 - λ) \log 2}{2} \right) \right) = 2,

and otherwise

{{}^{q}D}_{N} = \exp \left( \frac{1}{λ} λ (λ \log 2) + 2 \times \frac{1}{(1 - λ) / 2} \left( \frac{λ}{2} \cdot \frac{(1 - λ) \log 2}{2} + \frac{1 - 2 λ}{2} (1 - λ) \log 2 \right) \right) = 4^{1 - λ} .
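
The two closed forms can be checked by assembling ${{}^{q}D}_{N}$ directly from the ancestor weights and integrals listed above. The Python sketch below is only an illustrative check under the assumptions of this derivation ($0 < λ < 1$, and no additional prefactor beyond the terms shown in the two expressions); the function name is hypothetical.

```python
import math

def node_diversity_example(lam):
    """^qD_N for the example tree, built from the weights and integrals above.
    Assumes 0 < lam < 1."""
    log2 = math.log(2)
    # Root: u_r = v_rr = lam, and the root integral equals lam * log 2.
    root_term = (lam / lam) * (lam * log2)
    # Each of the two other internal nodes k has u_k = (1 - lam) / 2.
    u_k = (1 - lam) / 2
    if lam >= 0.5:
        v_kk, v_rk = (1 - lam) / 2, 0.0
    else:
        v_kk, v_rk = lam / 2, (1 - 2 * lam) / 2
    own_integral = (1 - lam) * log2 / 2   # integral of S_Ck * H(P_k)
    root_integral = (1 - lam) * log2      # integral of S_Ck * H(P_r)
    node_term = (v_kk * own_integral + v_rk * root_integral) / u_k
    return math.exp(root_term + 2 * node_term)

for lam in (0.25, 0.5, 0.75):
    print(lam, node_diversity_example(lam))
```

For $λ < \frac{1}{2}$ the function reproduces $4^{1 - λ}$, and for $λ ⩾ \frac{1}{2}$ it returns 2; the two branches agree at $λ = \frac{1}{2}$.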

Proof of Proposition 1

Proof.

For $q ⩾ 0$ , let $k (q) \in I (T)$ such that ${}^{q}D (P_{k (q)}) ⩾ {}^{q}D (P_{i})$ for all $i \in I (T)$ , and let $b_{1}, \dots, b_{|P_{k (q)}|}$ denote the non-zero-sized branches in the interval $k (q)$ . Then, by a basic property of generalized means,

{{}^{q}D}_{L} (T) := M_{long, 0} ({}^{q}D) (T) ⩽ \lim_{r \to \infty} M_{long, r} ({}^{q}D) (T) = \max_{i \in I (T)} {}^{q}D (P_{i}) = {}^{q}D (P_{k (q)}) .

For the first part of the proposition, we note that for any interval $i \in I (T)$ , the number of non-zero-sized branches in $i$ is $|P_{i}|$ , which cannot exceed the number of counted nodes ${}^{0}D (T)$ . Hence, for any rooted tree $T$ and all $q ⩾ 0$ ,

{{}^{q}D}_{L} (T) ⩽ {}^{q}D (P_{k (q)}) : = {(\sum_{b \in B_{k (q)}} p_{b}^{q})}^{\frac{1}{1 - q}} = {(\sum_{j = 1}^{|P_{k (q)}|} p_{b_{j}}^{q})}^{\frac{1}{1 - q}} ⩽ {(\sum_{j = 1}^{0} p_{b_{j}}^{q})}^{\frac{1}{1 - q}} ⩽ {}^{0}D (T) .

We now turn to the second part. By definition, for all $i \in I$ , if $b \in B_{i}$ then branch size $s_{b} = \sum_{x \in V_{b}} f_{x}$ , where $V_{b}$ is the set of all nodes that descend from $b$ , and $f_{x}$ is the proportional size of node $x$ . For all rooted trees we have $V_{b_{1}} \cap V_{b_{2}} = \emptyset$ for all $b_{1}, b_{2} \in B_{i}$ with $b_{1} \neq b_{2}$ . For ultrametric trees, $⋃_{b \in B_{i}} V_{b} = L$ , where $L$ is the set of all leaves in the tree. For leafy ultrametric trees, $S_{i} = 1$ for all $i$ and hence $p_{b} = s_{b}$ for all $b \in B_{i}$ . Then for any leafy ultrametric tree $T$ and all $q ⩾ 0$ ,

\begin{array}{l} {{}^{q}D}_{L} (T) ⩽ {}^{q}D (P_{k (q)}) : = {(\sum_{b \in B_{k (q)}} p_{b}^{q})}^{\frac{1}{1 - q}} = {(\sum_{b \in B_{k (q)}} s_{b}^{q})}^{\frac{1}{1 - q}} = {(\sum_{b \in B_{k (q)}} {(\sum_{x \in V_{b}} f_{x})}^{q})}^{\frac{1}{1 - q}} \\ ⩽ {(\sum_{b \in B_{k (q)}} \sum_{x \in V_{b}} f_{x}^{q})}^{\frac{1}{1 - q}} = {(\sum_{x \in L (T)} f_{x}^{q})}^{\frac{1}{1 - q}} = {}^{q}D (T) . \end{array}

Finally we will prove that this inequality does not hold for all rooted trees. We will do so in a more general context to show that the result is independent of our choice of weight function $w$ and exponent $r$. Let ${{}^{q}D}_{L, r w} = M_{long, r} ({}^{q}D; w)$ for real $r$, where $w$ is a continuous, monotonically increasing function of $S_{i}$ and $h_{i}$ such that $w_{i} = w (S_{i}, h_{i}) > 0$ when $S_{i} > 0$ and $h_{i} > 0$, and $w_{i} \to 0$ as $S_{i} \to 0$ or $h_{i} \to 0$. First consider the leafy but non-ultrametric three-leaf star tree $T_{1}$ in which one leaf has size $1 - p$ and depth $λ$, and the other two leaves have size $\frac{p}{2}$ and depth $1 + λ$ (as in Figure 4 but with one more leaf). Now

{{}^{q}D}_{L, r w} (T_{1}) = {\left( \frac{\sum_{i \in I (T_{1})} w_{i} {[{}^{q}D (P_{i})]}^{r}}{\sum_{i \in I (T_{1})} w_{i}} \right)}^{\frac{1}{r}} = {\left( \frac{w_{1} {\left( (1 - p)^{q} + 2 {(\frac{p}{2})}^{q} \right)}^{\frac{r}{1 - q}} + w_{2} {\left( 2 {(\frac{1}{2})}^{q} \right)}^{\frac{r}{1 - q}}}{w_{1} + w_{2}} \right)}^{\frac{1}{r}} .

Since $w_{1}$ depends only on $λ$ and $w_{2}$ depends only on $p$ , we can make $λ$ a function of $p$ such that $\frac{w_{1}}{w_{2}} \to 0$ as $λ \to 0$ and $p \to 0$ , in which case ${{}^{q}D}_{L, r w} (T_{1}) \to 2$ as $λ \to 0$ and $p \to 0$ . Also, for all $q > 0$ , ${}^{q}D (T_{1}) \to 1$ as $p \to 0$ . Hence ${{}^{q}D}_{L, r w} (T_{1}) > {}^{q}D (T_{1})$ as $λ \to 0$ and $p \to 0$ . Instead setting $λ = 0$ makes $T_{1}$ ultrametric but non-leafy, with ${{}^{q}D}_{L, r w} (T_{1}) \to 2$ and ${}^{q}D (T_{1}) \to 1$ as $p \to 0$ as before.
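
The limiting behaviour of this counterexample can be illustrated numerically. The Python sketch below makes several simplifying assumptions that are not part of the general proof: it uses the weights $w_{i} = S_{i} h_{i}$, the geometric ($r \to 0$) longitudinal mean, $q = 1$, and the particular choice $λ = p^{2}$, so that $w_{1} / w_{2} = p \to 0$. The function name is hypothetical.

```python
import math

def counterexample_diversities(p):
    """Return (^1D_L, ^1D) for the three-leaf star tree T_1 with lambda = p**2.
    Assumes weights w_i = S_i * h_i and the geometric longitudinal mean."""
    lam = p ** 2
    # Interval 1 (depth 0 to lambda): three branches of sizes 1-p, p/2, p/2.
    entropy_1 = -(1 - p) * math.log(1 - p) - p * math.log(p / 2)
    # Interval 2 (depth lambda to 1+lambda): two equally sized branches.
    entropy_2 = math.log(2)
    w1 = 1.0 * lam   # S_1 = 1, h_1 = lambda
    w2 = p * 1.0     # S_2 = p, h_2 = 1
    d_long = math.exp((w1 * entropy_1 + w2 * entropy_2) / (w1 + w2))
    d_tree = math.exp(entropy_1)  # leaf-size diversity ^1D(T_1)
    return d_long, d_tree

for p in (0.1, 0.01, 0.001):
    print(p, counterexample_diversities(p))
```

As $p$ decreases, the first value approaches 2 while the second approaches 1, in line with the limits stated in the proof.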

Proof of Proposition 2

Proof.

For every node $k$ , every $j \in A_{k}$ , and at every depth $x$ , we have $P_{j} (x) \subseteq P (x)$ and so ${}^{q}H (P_{j} (x)) ⩽ {}^{q}H (P (x))$ for all $q ⩾ 0$ . Hence

\begin{array}{l} {{}^{q}D}_{N} := \exp \left( \frac{1}{\overline{h}} \sum_{k \in V} \sum_{j \in A_{k}} \frac{v_{j k}}{u_{k}} \int_{0}^{h} S_{C_{k}} (x) \, {}^{q}H (P_{j} (x)) \, d x \right) \\ ⩽ \exp \left( \frac{1}{\overline{h}} \sum_{k \in V} \max_{j \in A_{k}} \int_{0}^{h} S_{C_{k}} (x) \, {}^{q}H (P_{j} (x)) \, d x \right) ⩽ \exp \left( \frac{1}{\overline{h}} \sum_{k \in V} \int_{0}^{h} S_{C_{k}} (x) \, {}^{q}H (P (x)) \, d x \right) \\ = \exp \left( \frac{1}{\overline{h}} \int_{0}^{h} {}^{q}H (P (x)) \sum_{k \in V} S_{C_{k}} (x) \, d x \right) = \exp \left( \frac{1}{\overline{h}} \int_{0}^{h} {}^{q}H (P (x)) \, S (x) \, d x \right) = {{}^{q}D}_{L} . \end{array}

Proof of Proposition 3

Proof.

The top left panel of Figure 6b shows a leafy ultrametric tree for which ${}^{q}J = 1$, ${{}^{q}J}_{L} < 1$ and ${{}^{q}J}_{N} < 1$. The third panel in the top row of Figure 6b shows a leafy ultrametric tree for which ${}^{q}J < 1$, ${{}^{q}J}_{L} < 1$ and ${{}^{q}J}_{N} = 1$. Now consider the four-leaf, bifurcating, leafy ultrametric tree with uniform branch lengths, such that the sizes of each pair of sibling leaves are $ϵ$ and $\frac{1}{2} - ϵ$ (Figure 6c). As $ϵ \to 0$, we have ${}^{q}J \to \frac{1}{2}$, ${{}^{q}J}_{L} \to \frac{3}{4}$ and ${{}^{q}J}_{N} \to \frac{1}{2}$. The different orderings of ${}^{q}J$, ${{}^{q}J}_{L}$ and ${{}^{q}J}_{N}$ across these three trees are inconsistent with any universal ordering.

Derivation of index values in Example 0.8

Since the tree of Figure 7c is ultrametric, $\overline{h} / l = 2$ , where $l$ is the branch length. The node size entropy is

{}^{1}H = - \frac{3}{8} \log \frac{3}{8} - 2 \left( \frac{1}{4} \log \frac{1}{4} \right) - \frac{1}{8} \log \frac{1}{8} = \frac{5}{2} \log 2 - \frac{3}{8} \log 3 \approx 1.32 .

The tree of Figure 7d has the same node sizes and therefore the same ${}^{1}H$ value as the previous tree. It also has the same relative effective height, as

\frac{\overline{h}}{l} = \frac{3}{8} + 2 (\frac{1}{4}) + 3 (\frac{1}{8} + \frac{1}{4}) = 2 .

For the tree of Figure 7e, since proportional node sizes must sum to unity we have

1 = \frac{1}{2} + \frac{1}{4} + x + y + 2 p ⟹ x + y + 2 p = \frac{1}{4} .

For this tree to have the same $\overline{h}$ and ${}^{1}H$ values as the four-leaf trees we additionally require

\begin{matrix} 2 = \frac{\overline{h}}{l} = \frac{1}{2} + 2 (\frac{1}{4}) + 3 x + 4 y + 5 (2 p) ⟹ 3 x + 4 y + 10 p = 1, \\ \frac{5}{2} \log 2 - \frac{3}{8} \log 3 = {}^{1}H = - \frac{1}{2} \log \frac{1}{2} - \frac{1}{4} \log \frac{1}{4} - x \log x - y \log y - 2 p \log p . \end{matrix}

The first two equations together imply $x = 2 p$ and $y = \frac{1}{4} - 4 p$. After substituting these results into the third equation we obtain the numerical solution $p \approx 0.026606$. Since all three trees have identical values of $m$, $\overline{h}$ and ${}^{1}H$, the leafy tree identity implies that they have equal ${{}^{1}D}_{N}$ values and equal balance:

{{}^{1}D}_{N} = \exp \left( \frac{{}^{1}H l}{\overline{h}} \right) = \exp \left( \frac{{}^{1}H}{2} \right) \approx 1.94, \qquad {{}^{1}J}_{N} = \frac{{}^{1}H l}{\overline{h} \log m} = \frac{{}^{1}H}{2 \log 2} \approx 0.95 .
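
The numerical solution for $p$ can be reproduced with a simple root-finding routine. The Python sketch below uses bisection, which is one possible method and not necessarily the one the authors used; the function names and the search bracket are assumptions. The bracket $[10^{-6}, 0.04]$ is chosen so that the entropy is increasing in $p$ and the reported root is isolated, since the entropy equation is not monotone over the full admissible range $0 < p < \frac{1}{16}$.

```python
import math

# ^1H of the four-leaf trees, from the derivation above.
TARGET_ENTROPY = 2.5 * math.log(2) - 0.375 * math.log(3)

def six_leaf_entropy(p):
    """^1H of the six-leaf caterpillar tree with x = 2p and y = 1/4 - 4p."""
    sizes = [0.5, 0.25, 2 * p, 0.25 - 4 * p, p, p]
    return -sum(f * math.log(f) for f in sizes)

def solve_p(lo=1e-6, hi=0.04, tol=1e-12):
    """Bisection on a bracket where six_leaf_entropy is increasing in p."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if six_leaf_entropy(mid) < TARGET_ENTROPY:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

p = solve_p()
diversity = math.exp(TARGET_ENTROPY / 2)       # ^1D_N with h-bar / l = 2
balance = TARGET_ENTROPY / (2 * math.log(2))   # ^1J_N with h-bar / l = 2, m = 2
print(p, diversity, balance)
```

The printed values can be compared with $p \approx 0.026606$, ${{}^{1}D}_{N} \approx 1.94$ and ${{}^{1}J}_{N} \approx 0.95$ given above.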

Data and code availability

All data sets used in this study have previously been published; the captions of Figures 8 and 9 provide precise references. An open source R package to calculate our new tree shape indices for trees in Newick, NEXUS or phylo format is at https://github.com/kimverity/RUIindices.

References

Albers Susanne and Westbrook Jeffery. Self-organizing data structures. Online Algorithms: The state of the art, pages 13–51, 2005. [Google Scholar]
Allen Benjamin, Kon Mark, and Bar-Yam Yaneer. A New Phylogenetic Diversity Measure Generalizing the Shannon Index and Its Application to Phyllostomid Bats. The American Naturalist, 174(2):236–243, 2009. [DOI] [PubMed] [Google Scholar]
Atkinson Quentin D. and Gray Russell D.. Curious Parallels and Curious Connections—Phylogenetic Thinking in Biology and Historical Linguistics. Systematic Biology, 54(4):513–526, 2005. [DOI] [PubMed] [Google Scholar]
Barzilai Lucia P and Schrago Carlos G. Signatures of natural selection in tree topology shape of serially sampled viral phylogenies. Molecular Phylogenetics and Evolution, 183:107776, 2023. [DOI] [PubMed] [Google Scholar]
Chao Anne, Chiu Chun Huo, and Jost Lou. Phylogenetic diversity measures based on Hill numbers. Philosophical Transactions of the Royal Society B: Biological Sciences, 365 (1558):3599–3609, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chindelevitch Leonid, Hayati Maryam, Poon Art FY, and Colijn Caroline. Network science inspires novel tree shape statistics. Plos one, 16(12):e0259877, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Colijn Caroline and Gardy Jennifer. Phylogenetic tree shapes resolve disease transmission patterns. Evolution, medicine, and public health, 2014(1):96–108, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Colless Donald H. Review of Phylogenetics, The Theory and Practice of Phylogenetic Systematics. Systematic Zoology, 31(1):100–104, 1982. [Google Scholar]
Faith Daniel P.. Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1):1–10, 1992. [Google Scholar]
Fischer Mareike, Herbst Lina, Kersting Sophie, Kühn Luise, and Wicke Kristina. Tree Balance Indices: A Comprehensive Survey. Springer Nature, 2023. [Google Scholar]
Hill Mark. Diversity and Evenness: A Unifying Notation and Its Consequences. Ecology, 54:427–432, 1973. [Google Scholar]
Honkola Terhi, Vesakoski Outi, Korhonen Kalle, Lehtinen Jyri, Kaj Syrjänen, and Niklas Wahlberg. Cultural and climatic changes shape the evolutionary history of the uralic languages. Journal of Evolutionary Biology, 26(6):1244–1253, 2013. [DOI] [PubMed] [Google Scholar]
Jost Lou. Entropy and diversity. Oikos, 113(2):363–375, 2006. [Google Scholar]
Jost Lou. The Relation between Evenness and Diversity. Diversity, 2(2):207–232, 2010. [Google Scholar]
Kirby Kathryn R, Gray Russell D, Greenhill Simon J, Jordan Fiona M, Gomes-Ng Stephanie, Bibiko Hans-Jörg, Blasi Damián E, Botero Carlos A, Bowern Claire, Ember Carol R, et al. D-place: A global database of cultural, linguistic and environmental diversity. PloS one, 11(7):e0158391, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leinster Tom and Cobbold Christina A.. Measuring diversity: the importance of species similarity. Ecology, 93(3):477–489, 2012. [DOI] [PubMed] [Google Scholar]
Lemant Jeanne, Sueur Cécile Le, Manojlović Veselin, and Noble Robert. Robust, Universal Tree Balance Indices. Systematic Biology, 71(5):1210–1224, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leventhal Gabriel E, Kouyos Roger, Stadler Tanja, Von Wyl Viktor, Yerly Sabine, Böni Jürg, Cellerai Cristina, Klimkait Thomas, Günthard Huldrych F, and Bonhoeffer Sebastian. Inferring epidemic contact structure from phylogenetic trees. PLoS computational biology, 8(3):e1002413, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lewinsohn Maya A, Bedford Trevor, Müller Nicola F, and Feder Alison F. State-dependent evolutionary models reveal modes of solid tumour growth. Nature Ecology & Evolution, 7(4):581–596, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mir Arnau, Rosselló Francesc, and Rotger Lucía. A new balance index for phylogenetic trees. Mathematical Biosciences, 241(1):125–136, 2013. arXiv: 1202.1223. [DOI] [PubMed] [Google Scholar]
Mir Arnau, Rotger Lucía, and Rosselló Francesc. Sound Colless-like balance indices for multifurcating trees. PLoS ONE, 13(9):559–560, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mooers Arne O. and Heard Stephen B.. Inferring Evolutionary Process from Phylogenetic Tree Shape. The Quarterly Review of Biology, 72(1):31–54, 1997. [Google Scholar]
Noble Robert, Burri Dominik, Sueur Cécile Le, Lemant Jeanne, Viossat Yannick, Kather Jakob Nikolas, and Beerenwinkel Niko. Spatial structure governs the mode of tumour evolution. Nature Ecology & Evolution, 6(2):207–217, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pavoine Sandrine and Bonsall Michael B. Measuring biodiversity to explain community assembly: a unified approach. Biological Reviews, 86(4):792–812, 2011. [DOI] [PubMed] [Google Scholar]
Pielou E. C.. The measurement of diversity in different types of biological collections. Journal of Theoretical Biology, 13:131–144, 1966. [Google Scholar]
Purvis Andy and Agapow Paul-Michael. Phylogeny imbalance: taxonomic level matters. Systematic Biology, 51(6):844–854, 2002. [DOI] [PubMed] [Google Scholar]
Rényi Alfréd. On measures of entropy and information Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Volume 1: Contributions to the theory of statistics, 4:547–562, 1961. [Google Scholar]
Rogers James S.. Response of Colless’s Tree Imbalance to Number of Terminal Taxa. Systematic Biology, 42(1):102–105, 1993. [Google Scholar]
Sackin M. J.. “Good” and “Bad” Phenograms. Systematic Biology, 21(2):225–226, 1972. [Google Scholar]
Scott Jacob G, Maini Philip K, Anderson Alexander RA A, and Fletcher Alexander G. Inferring Tumor Proliferative Organization from Phylogenetic Tree Measures in a Computational Model. Systematic Biology, 69(4):623–637, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shannon C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
Shao Kwang-Tsao and Sokal Robert R. Tree Balance. Systematic Zoology, 39(3):266, 1990.
Smith Benjamin and Wilson J. Bastow. A Consumer’s Guide to Evenness Indices. Oikos, 76(1):70–82, 1996.
Tsirogiannis Constantinos, Sandel Brody, and Cheliotis Dimitris. Efficient computation of popular phylogenetic tree measures. In International Workshop on Algorithms in Bioinformatics, pages 30–43. Springer, 2012.
Tucker Caroline M., Cadotte Marc W., Carvalho Silvia B., Davies T. Jonathan, Ferrier Simon, Fritz Susanne A., Grenyer Rich, Helmus Matthew R., Jin Lanna S., Mooers Arne O., Pavoine Sandrine, Purschke Oliver, Redding David W., Rosauer Dan F., Winter Marten, and Mazel Florent. A guide to phylogenetic metrics for conservation, community ecology and macroecology. Biological Reviews, 92(2):698–715, 2017.
Tuomisto Hanna. An updated consumer’s guide to evenness and related indices. Oikos, 121(8):1203–1218, 2012.
Veron Simon, Saito Victor, Padilla-García Nélida, Forest Félix, and Bertheau Yves. The use of phylogenetic diversity in conservation biology and community ecology: A common base but different approaches. Quarterly Review of Biology, 94(2):123, 2019.

