
This is a preprint.

It has not yet been peer reviewed by a journal.


bioRxiv [Preprint]. 2024 Dec 17:2023.07.17.549219. [Version 4] doi: 10.1101/2023.07.17.549219

A new universal system of tree shape indices

Robert Noble 1,*, Kimberley Verity 1
PMCID: PMC10705254  PMID: 38077096

Abstract

The comparison and categorization of tree diagrams is fundamental to large parts of biology, linguistics, computer science, and other fields, yet the indices currently applied to describing tree shape have important flaws that complicate their interpretation and limit their scope. Here we introduce a new system of indices with no such shortcomings. Our indices account for node sizes and branch lengths and are robust to small changes in either attribute. Unlike currently popular phylogenetic diversity, phylogenetic entropy, and tree balance indices, our definitions assign interpretable values to all rooted trees and enable meaningful comparison of any pair of trees. Our self-consistent definitions further unite measures of diversity, richness, balance, symmetry, effective height, effective outdegree, and effective branch count in a coherent system, and we derive numerous simple relationships between these indices. The main practical advantages of our indices are in 1) quantifying diversity in non-ultrametric trees; 2) assessing the balance of trees that have non-uniform branch lengths or node sizes; 3) comparing the balance of trees with different leaf counts or outdegrees; 4) obtaining a coherent, generic, multidimensional quantification of tree shape that is robust to sampling error and inferential error. We illustrate these features by comparing the shapes of trees representing the evolution of HIV and of Uralic languages, and trees generated by computational models of tumour evolution. Given the ubiquity of tree structures, we identify a wide range of applications across diverse domains.

Keywords: tree indices, tree shape, tree balance, phylogenetic diversity, phylogenetic entropy, rooted trees


Tree shape indices that quantify key properties of rooted trees – such as the effective number of leaves, average out-degree, and balance – have myriad applications. Conservation biologists use phylogenetic diversity values to determine which actions will preserve the most biodiversity (Tucker et al., 2017; Veron et al., 2019). Tree balance indices are used to compare models and to infer parameter values in systematic biology (Mooers and Heard, 1997; Purvis and Agapow, 2002), virology (Chindelevitch et al., 2021; Barzilai and Schrago, 2023), epidemiology (Leventhal et al., 2012; Colijn and Gardy, 2014), and oncology (Scott et al., 2020; Noble et al., 2022). Computer scientists seek to balance binary trees to make them more efficient as data structures (Albers and Westbrook, 2005). Numerous indices designed for such tasks have previously been proposed (Pavoine and Bonsall, 2011; Tucker et al., 2017; Fischer et al., 2023). However, no existing index provides a general purpose method for fairly evaluating the shape of any rooted tree. This paper introduces a system of such indices.

Rather than simply adding to a profusion of indices, our aim here is to solve important open problems: How can we modify existing phylogenetic diversity and entropy indices so that they are meaningful when applied to non-ultrametric trees? How can we define a tree balance index that accounts for both branch lengths and node sizes? How can we likewise generalize the concepts of outdegree, branch count, and node count? How can we unite all these types of tree shape index in a coherent system, so that their interrelationships can be easily understood? Only by solving these problems can we arrive at a general purpose method for fairly evaluating the shape of any rooted tree.

Among current diversity indices for generic rooted trees, arguably the most sophisticated are those introduced by Chao et al. (2010), which generalize and unify previous definitions of Hill (1973), Faith (1992), Jost (2006) and Allen et al. (2009). In quantifying the effective number of types in a data set, the Dq indices of Chao et al. (2010) account for both node sizes (type frequencies) and branch lengths (degree of dissimilarity between types). Nevertheless, a critical shortcoming of these indices, which limits their applications, is that they assign meaningful values only to leafy ultrametric trees (that is, trees in which the only non-zero-sized nodes are leaves, all equally distant from the root) (Chao et al., 2010; Leinster and Cobbold, 2012). We will further show that the Dq indices of Chao et al. (2010) are not fully self-consistent and have peculiar properties for q>1. Moreover, the relationships between these diversity indices and other types of index, such as tree balance indices, are generally opaque, which thwarts multi-dimensional analysis.

Conventional tree balance and imbalance indices – including those attributed to Sackin (1972) and Colless (1982), the total cophenetic index of Mir et al. (2013), and others reviewed by Fischer et al. (2023) – are also flawed. These indices, which are meant to quantify the extent to which each internal node splits its descendants into equally sized subtrees, are not defined for all rooted trees, do not permit meaningful comparison of trees with differing leaf counts, and are highly sensitive to the addition or removal of rare types (Noble et al., 2022; Lemant et al., 2022). We recently introduced a family of tree balance indices that solve these problems and that have additional desirable properties (Lemant et al., 2022). Our previous indices are defined for any degree distribution, account for node sizes, and enable meaningful comparison of trees with different numbers of leaves. However, they do not account for branch lengths, which restricts their applications because branch lengths often convey important information (for example, genetic distance in virus evolution, or elapsed time in the evolution of species).

Here we define a new system of indices that resolve all the aforementioned problems by accounting for node sizes and branch lengths, being robust to small changes to the tree, assigning meaningful values to all rooted trees, and belonging to a coherent framework, so that mathematical relationships between the indices are well characterized. Our system captures fundamental properties such as diversity (effective number of leaves), tree balance (the extent to which each internal node splits its descendants into equally sized subtrees), and bushiness (average effective outdegree). Given that our indices share the desirable properties but not the flaws of prior indices, we discuss their potential to supersede current methods in a wide range of applications.

Materials and Methods

Hill numbers as a basis for defining robust, universal, interpretable tree indices

A rooted tree is a tree in which one node is designated the root and all branches are directed away from the root. Our aim is to define indices that are useful for categorizing and comparing the shapes of unlabelled rooted trees that have three attributes: tree topology, non-negative node sizes, and non-negative branch lengths. These indices should be generic and model-agnostic, meaning that they make no assumptions about what the tree represents or the process by which it was generated. In evolutionary trees, for example, the size of a node can correspond to the population size of the respective biological type, or simply to whether a type is extant (node size 1) or extinct (0), while branch lengths can represent genetic distance, morphological difference, or elapsed time. Linguists use similar structures with unequal branch lengths to study the evolution of languages (Honkola et al., 2013; Atkinson and Gray, 2005). In computing, the size of a search tree node corresponds to the probability of it being visited.

In this general context, a useful index should be robust, universal, and interpretable. A loose definition of robustness is that small changes to the tree have only small effects on the index value, except where sensitivity is desirable; universal means that the index is defined for all rooted trees; and interpretable implies a simple, consistent interpretation, enabling meaningful comparison of any pair of rooted trees. Lemant et al. (2022) provides more rigorous, axiomatic definitions. We follow these axiomatic definitions and call a tree index with all three properties an RUI index. In practical terms, robustness implies that an index is relatively insensitive to the effects of issues such as sampling error, inferential error, omission of rare types, imperfect genetic sequencing, and incomplete resolution of ancestral relationships. All our indices are dimensionless but the diversity indices can be re-scaled in terms of the branch length unit where desired.

We begin by recalling the family of diversity indices attributed to Hill (1973). These Hill numbers are functions of a set of proportions $P = \{p_1, \dots, p_n\}$ with $0 \le p \le 1$ for all $p \in P$ and $\sum_{i=1}^{n} p_i = 1$. Every Hill number of order $q \ge 0$ can be written as

$$D_q(P) := \left( \sum_{i=1}^{n} p_i^{q} \right)^{\frac{1}{1-q}} \quad \text{with} \quad D_1(P) := \lim_{q \to 1} D_q(P) = \exp\left( -\sum_{i=1}^{n} p_i \log p_i \right).$$

Hence $D_q$ is the exponential of the Rényi entropy of order $q$ (Rényi, 1961), which we will denote $H_q$, and $H_1$ is Shannon’s entropy (Shannon, 1948). Another important special case is

$$D_0(P) := |\{ p \in P : p > 0 \}|,$$

which is simply the number of types, or richness. Following Pielou (1966) and Jost (2010), we further define the evenness indices

$$J_q(P) := \begin{cases} \dfrac{\log D_q(P)}{\log D_0(P)} \in [0, 1] & \text{if } D_0(P) > 1, \\ 1 & \text{otherwise.} \end{cases}$$

For completeness, we set $D_q(\emptyset) = 0$ and $J_q(\emptyset) = 1$.
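As a minimal illustrative sketch (not part of the original analysis), the Hill numbers and evenness indices defined above can be computed as follows. The function names `hill_number` and `evenness` are hypothetical, and zero proportions are dropped before the sums, which is an assumption made so that the order-0 case counts only non-zero types.

```python
import math


def hill_number(proportions, q):
    """Hill number of order q >= 0 for a collection of proportions.

    Zero proportions are ignored, so order 0 returns the richness.
    An empty collection returns 0, following the stated convention.
    """
    p = [x for x in proportions if x > 0]
    if not p:
        return 0.0
    if q == 1:
        # Limit q -> 1: exponential of Shannon's entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def evenness(proportions, q):
    """Evenness: log of the order-q Hill number over log of the richness.

    Returns 1 when the richness is at most 1 (including the empty case).
    """
    richness = hill_number(proportions, 0)
    if richness <= 1:
        return 1.0
    return math.log(hill_number(proportions, q)) / math.log(richness)


# Usage: an uneven three-type community.
P = [0.5, 0.3, 0.2]
for q in (0, 1, 2):
    print(q, hill_number(P, q), evenness(P, q))
```

For equal proportions every order returns the number of types, and the evenness is 1, which is a quick sanity check on any implementation.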

We can apply these indices to a rooted tree $T$ simply by equating $P(T) = \{p_1, \dots, p_n\}$ to the proportional sizes of the $n$ nodes of $T$, including the internal nodes. Assigning non-zero sizes to internal nodes makes sense, for example, in the case of a tumour clone tree (Noble et al., 2022). The richness index $D_0(T) = D_0(P(T))$ then quantifies the number of non-zero-sized nodes in the tree, which we will refer to as the counted nodes. In an evolutionary tree, counted nodes correspond to extant types. For each $q > 0$, the diversity index $D_q(T) = D_q(P(T))$ can be interpreted as an effective number of counted nodes, while $J_q(T) = J_q(P(T))$ gauges the evenness of the counted node sizes.

Clearly Dq and Jq are insensitive to small changes to proportional node sizes. For q>0, Dq is also generally robust to the addition or removal of relatively small nodes (and the degree of robustness increases with q), whereas D0 and Jq are not, as is appropriate for indices that are meant to quantify richness and evenness. Dq and Jq are universal because they can be applied to any set of node sizes, and they are interpretable as described above. Yet although these indices are RUI, they are inadequate for assessing tree shape because they depend only on node sizes, ignoring both tree topology and branch lengths.

Many indices that capture aspects of tree shape have previously been defined (surveys include Pavoine and Bonsall (2011); Tucker et al. (2017); Fischer et al. (2023)) but, to the best of our knowledge, none is RUI (Table 1). We address this deficiency by developing new RUI tree indices that extend the basic indices Dq and Jq to account for tree topology and branch lengths. We do this using three types of weighted mean, which we refer to as the longitudinal mean, the node-wise mean, and the star mean (Table 2). Our consistent definitions ensure that our indices can be precisely related to each other and to Dq and Jq in numerous meaningful ways, so that all the indices belong to a single coherent system.

Table 1.

Properties of some previously defined non-RUI tree indices (see main text for definitions and citations).

Robust Universal Interpretable

Robustly accounts for node sizes? Robustly accounts for branch lengths? Defined for all rooted trees? Has a simple, consistent interpretation? Can meaningfully compare any pair of rooted trees?

Faith’s PD No Yes

Allen et al’s HP Yes Only for leafy ultrametric trees
Chao et al’s Dq

Sackin’s index No Yes No
Colless’s index
Total cophenetic index

Lemant et al’s Jq Yes No Yes Only if uniform branch lengths

Table 2.

Nature, notation and interpretation of RUI tree indices, including prior indices (top row) and new indices (second, third and fourth rows). Counted nodes are those with non-zero size.

| Branches or nodes | Type of average | Richness | Diversity (with q > 0) | Evenness (with q > 0) |
| --- | --- | --- | --- | --- |
| Nodes | None | $D_0$ = number of counted nodes | $D_q$ = effective number of counted nodes | $J_q$ = evenness of counted node sizes |
| Branches | Longitudinal mean | $D_0^L$ = average branch count across the tree | $D_q^L$ = effective number of maximally distant leaves | $J_q^L$ = evenness of branch sizes across the tree (tree symmetry if leafy and ultrametric) |
| Branches | Node-wise mean | $D_0^N$ = average effective outdegree, ignoring branch sizes | $D_q^N$ = average effective outdegree, accounting for branch sizes | $J_q^N$ = tree balance |
| Branches | Star mean | $D_0^S$ = effective number of non-root nodes | $D_q^S$ = effective number of branches, accounting for branch sizes | $J_q^S$ = evenness of all branch sizes |

Further preliminary definitions

In a rooted tree, the depth of a node is the sum of the branch lengths along the unidirectional path from the root to the node. The height of the tree is the maximum depth of its non-zero-sized nodes. Nodes with no descendants are called leaves and non-leaves are called internal nodes. We define the size of a branch as the sum of the proportional node sizes that descend (directly or indirectly) from the branch. For example, in the three-leaf tree depicted in Figure 1a, the branches descending from the root have sizes $\frac{1}{3}$ and $\frac{2}{3}$, and the other two branches each have size $\frac{1}{3}$. The size of any segment of a branch is the same as the size of the branch.

Fig. 1.


a) A leafy bifurcating ultrametric tree with three equally sized leaves. In this and every subsequent tree diagram, open circles indicate zero-sized nodes. b) Index values versus branch length λ for the three-leaf tree. The y-axis is log-transformed so that the curves for all diversity indices appear piecewise linear. D0 is slightly greater than J1N whenever 0<λ<1.

A leafy tree is such that all internal nodes have zero size (equivalently, all counted nodes are leaves). A tree is ultrametric if all its leaves have the same depth after the removal of all subtrees that contain only zero-sized branches (corresponding to extinct lineages in an evolutionary tree). A caterpillar tree is a bifurcating tree in which every internal node except one has exactly one child leaf. A star tree is a tree in which all non-zero-sized branches are attached to the root. We define a piecewise star tree as a tree that can be divided into transverse intervals such that, within each interval, all the non-zero-sized branches are attached to a common node. For example, the leafy ultrametric tree in Figure 1a is a star tree if λ=0 or λ=1 and is otherwise a caterpillar tree. To simplify our notation, we will usually omit the tree as a function argument (for example, writing D0 instead of D0(T)).

It will be helpful to recall that, for a sequence of positive real numbers $X = \{x_1, \dots, x_n\}$, real number $r \ne 0$, and set of positive weights $W = \{w_1, \dots, w_n\}$, the weighted power mean of exponent $r$ is

$$M_r(X; W) := \left( \frac{\sum_{i=1}^{n} w_i x_i^{r}}{\sum_{i=1}^{n} w_i} \right)^{\frac{1}{r}}.$$

$M_0$ is defined from the limit as

$$M_0(X; W) := \exp\left( \frac{\sum_{i=1}^{n} w_i \log x_i}{\sum_{i=1}^{n} w_i} \right).$$

$M_{-1}$, $M_0$ and $M_1$ are respectively the weighted harmonic, geometric, and arithmetic means. $M_{-\infty}$ and $M_{\infty}$ respectively return the minimum and the maximum. Power means are closely related to Hill numbers as, for all $q \ge 0$ and any sequence of proportions $P$,

$$D_q(P) = M_{q-1}(P; P)^{-1}. \tag{0.1}$$
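The relationship in Equation 0.1 can be checked numerically. The following is a small sketch under the assumption that all proportions are strictly positive (so that the power mean is well defined for negative exponents); `power_mean`, `hill_number` and `identity_gap` are hypothetical helper names.

```python
import math


def power_mean(xs, ws, r):
    """Weighted power mean of exponent r; r = 0 gives the geometric mean."""
    total = sum(ws)
    if r == 0:
        return math.exp(sum(w * math.log(x) for x, w in zip(xs, ws)) / total)
    return (sum(w * x ** r for x, w in zip(xs, ws)) / total) ** (1.0 / r)


def hill_number(p, q):
    """Hill number of order q for strictly positive proportions p."""
    if q == 1:
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def identity_gap(p, q):
    """Absolute difference between the two sides of Equation 0.1."""
    return abs(hill_number(p, q) - 1.0 / power_mean(p, p, q - 1))


# Usage: the gap should be at the level of floating-point rounding.
P = [0.5, 0.3, 0.2]
for q in (0, 0.5, 1, 2, 3):
    print(q, identity_gap(P, q))
```

The proportions serve both as the values being averaged and as the weights, which is why the same list is passed twice to `power_mean`.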

Prior tree balance and imbalance indices

The most popular conventional tree imbalance indices can be expressed in the form

$$I_A = \sum_{i \in V} n_i F_A(i),$$

where $V$ is the set of all internal nodes and $n_i$ is the number of leaves that descend from node $i$, and $F_A(i)$ is a function that defines a particular index. For $I_S$ (Sackin’s index), $I_C$ (Colless’ index) and $I_\Phi$ (the total cophenetic index) we have

$$F_S(i) = 1, \quad F_C(i) = |p_{i1} - p_{i2}|, \quad F_\Phi(i) = \frac{n_i - 1}{2},$$

where $p_{i1}$ is the proportion of the $n_i$ leaves that descend from the left child branch of $i$, and $p_{i2}$ is the proportion that descend from the right child branch. Being imbalance indices, these three indices assign higher values to less balanced trees. $I_C$ is defined only for bifurcating trees (in which all internal nodes have outdegree two). $I_S$ and $I_\Phi$ are defined only for trees in which all internal nodes have outdegree greater than one. By convention, each index is normalized over the set of trees on $n > 2$ leaves by subtracting its minimum value over such trees and then dividing by the difference between its maximum and its minimum. The minima of $I_S$, $I_C$ and $I_\Phi$ are $n$, 0 and 0, and the maxima are $(n+2)(n-1)/2$, $\binom{n-1}{2}$ and $\binom{n+1}{3}$, respectively (Shao and Sokal, 1990; Rogers, 1993; Mir et al., 2013).

Lemant et al. (2022) proposed instead defining tree balance or imbalance indices in the form of the weighted arithmetic mean

$$\frac{1}{\sum_{i \in V} w_i} \sum_{i \in V} w_i F(i),$$

where $w_i$ is the weight assigned to node $i$, and $F(i)$ quantifies the degree to which node $i$ splits its descendants into equally sized subtrees. For example, we can obtain an alternative normalization of Colless’ index by setting $w_i = n_i$ and $F(i) = F_C(i)$. The normalizing factor $\sum_{i \in V} w_i$ is then Sackin’s index. An advantage of this approach is that it allows us to compare the balance of any pair of trees for which $F$ is defined, rather than only trees with equal leaf counts.
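To make the weighted-mean normalization concrete, here is an illustrative sketch for bifurcating trees encoded as nested 2-tuples (an encoding chosen purely for this example). With weights equal to the descendant leaf counts, the weighted mean of the Colless node function reduces to Colless’ index divided by Sackin’s index. All function names are hypothetical.

```python
def leaf_count(tree):
    """Number of leaves below a node; any non-tuple value is a leaf."""
    if not isinstance(tree, tuple):
        return 1
    return sum(leaf_count(child) for child in tree)


def internal_nodes(tree):
    """Yield every internal node (tuple) of the tree, root included."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree:
            yield from internal_nodes(child)


def sackin(tree):
    """Sackin's index: sum of descendant leaf counts over internal nodes."""
    return sum(leaf_count(node) for node in internal_nodes(tree))


def colless(tree):
    """Colless' index for a bifurcating tree."""
    return sum(abs(leaf_count(node[0]) - leaf_count(node[1]))
               for node in internal_nodes(tree))


def weighted_mean_colless(tree):
    """Weighted arithmetic mean of the Colless node function,
    with weights equal to the descendant leaf counts."""
    return colless(tree) / sackin(tree)


# Usage: a balanced tree and a caterpillar tree on four leaves.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(sackin(balanced), colless(balanced), weighted_mean_colless(balanced))
print(sackin(caterpillar), colless(caterpillar), weighted_mean_colless(caterpillar))
```

Because the normalizing factor is computed from the tree itself, the resulting value can be compared between trees with different numbers of leaves.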

Definition of the normalizing factor $\bar{h}$

Consistent with Lemant et al. (2022), our new index definitions are based on weighted means. Our preferred weights require us to define the normalizing factor

$$\bar{h} := \sum_{b \in B} s_b l_b \le h,$$

where $B$ is the set of all branches in the tree, $s_b \in [0, 1]$ is the size of branch $b$, $l_b$ is the length of branch $b$, and $h$ is the tree height. We can interpret $\bar{h}$ (denoted $\bar{T}$ in Chao et al. (2010)) as the effective tree height or as the average counted node depth. In computer science, $\bar{h}$ is called the weighted path length (Albers and Westbrook, 2005). For leafy trees with uniform leaf sizes and uniform branch lengths, $\bar{h} = l I_S / D_0$, where $l$ is the branch length and $I_S$ is Sackin’s index. Hence $\bar{h}$ can also be considered a generalization of Sackin’s index. Indeed, we have previously argued that Sackin’s index is best interpreted not as a general imbalance index but rather as a normalizing factor, which works as an imbalance index only in the special case of trees with uniform node sizes, uniform branch lengths, and uniform outdegree (Lemant et al., 2022). $\bar{h} = h$ if and only if the tree is leafy and ultrametric.
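The following sketch computes branch sizes, the effective height and the tree height for a small tree. The dictionary encoding `{node: (parent, branch_length, node_size)}` and the helper names are assumptions made for illustration, as is the particular three-leaf layout (one leaf attached directly to the root and two leaves below an internal node at depth `lam`).

```python
def effective_height(tree):
    """Return (effective height, tree height) for a rooted tree.

    `tree` maps each node to (parent, branch_length, node_size);
    the root has parent None. Node sizes are normalized to proportions.
    """
    total = sum(size for _, _, size in tree.values())
    depth = {}

    def node_depth(node):
        if node not in depth:
            parent, length, _ = tree[node]
            depth[node] = 0.0 if parent is None else node_depth(parent) + length
        return depth[node]

    # branch_size[n] is the size of the branch entering node n:
    # the summed proportional sizes of n and all of its descendants.
    branch_size = dict.fromkeys(tree, 0.0)
    for node, (_, _, size) in tree.items():
        node_depth(node)
        ancestor = node
        while ancestor is not None:
            branch_size[ancestor] += size / total
            ancestor = tree[ancestor][0]

    h_bar = sum(branch_size[n] * length
                for n, (parent, length, _) in tree.items() if parent is not None)
    h = max((depth[n] for n, (_, _, s) in tree.items() if s > 0), default=0.0)
    return h_bar, h


def three_leaf_tree(lam, left_length=1.0):
    """Hypothetical three-leaf tree with an internal node X at depth lam."""
    return {
        "root": (None, 0.0, 0.0),
        "A": ("root", left_length, 1.0),
        "X": ("root", lam, 0.0),
        "B": ("X", 1.0 - lam, 1.0),
        "C": ("X", 1.0 - lam, 1.0),
    }


# Usage: ultrametric case, then a non-ultrametric variant with a shorter branch to A.
print(effective_height(three_leaf_tree(0.5)))
print(effective_height(three_leaf_tree(0.5, left_length=0.5)))
```

In the ultrametric case the two returned values coincide, whereas shortening one leaf branch lowers the effective height below the tree height.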

Definition of the longitudinal mean

The basic idea of the longitudinal mean is that we split the tree into transverse intervals, calculate an index value based on the proportional sizes of the branch segments within each interval, and then take a weighted average of these within-interval index values. Let $I$ denote the set of transverse intervals created by locating an interval boundary at every node depth (dashed lines in Figure 1a), excluding intervals that contain only zero-sized branches. Each interval $i \in I$ then contains a set $B_i$ of branch segments, all of the same length, which we will refer to as the interval height $h_i$. Let

$$S_i := \sum_{b \in B_i} s_b \in (0, 1],$$

where $s_b$ is the size of branch segment $b$. Then $S_i = 1$ for all intervals $i$ if and only if the tree is leafy and ultrametric. It follows that

$$\sum_{i \in I} S_i h_i = \bar{h}.$$

Now for each $b \in B_i$, define the within-interval proportional branch size $p_b := s_b / S_i$ and let $P_i := \{p_b : b \in B_i, p_b > 0\}$. Then $\sum_{p \in P_i} p = \sum_{b \in B_i} p_b = 1$ for all intervals $i \in I$.

Finally, for index $F$ and tree $T$, we define the longitudinal mean of order $r$ of $F$ as the functional $F \mapsto M_{\mathrm{long},r}(F)$ such that

$$M_{\mathrm{long},r}(F)(T; w) := \begin{cases} \left( \dfrac{\sum_{i \in I(T)} w_i \left[ F(P_i) \right]^{r}}{\sum_{i \in I(T)} w_i} \right)^{\frac{1}{r}} & \text{if } h > 0, \\ F(\emptyset) & \text{otherwise,} \end{cases} \tag{0.2}$$

where the weight w>0 is a function of i that remains to be specified. Hence Mlong,r(F) is a weighted power mean of the F values assigned to the intervals. For succinctness, we will omit the argument T and specify w only where necessary.

Example 0.1

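For the function $F(x_1, \dots, x_n) = \sum_{k=1}^{n} x_k$ we have

$$M_{\mathrm{long},r}(F) = \left( \frac{\sum_{i \in I} w_i \left( \sum_{p \in P_i} p \right)^{r}}{\sum_{i \in I} w_i} \right)^{\frac{1}{r}} = 1.$$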

New longitudinal mean indices

We define new tree indices as longitudinal means of Dq and Jq with wi=Sihi, so that the index value assigned to each interval i is weighted by the product of the length hi and the summed sizes Si of the branch segments that i contains. First, we define

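$$D_q^L := M_{\mathrm{long},0}(D_q). \tag{0.3}$$

This is equivalent to $H_q^L = M_{\mathrm{long},1}(H_q)$ with $D_q^L = \exp(H_q^L)$. In particular,

$$D_0^L = \begin{cases} \exp\left( \dfrac{1}{\bar{h}} \sum_{i \in I} S_i h_i \log |P_i| \right) & \text{if } h > 0, \\ 0 & \text{otherwise,} \end{cases}$$

$$D_1^L = \begin{cases} \exp\left( -\dfrac{1}{\bar{h}} \sum_{i \in I} h_i \sum_{b \in B_i} s_b \log \dfrac{s_b}{S_i} \right) & \text{if } h > 0, \\ 0 & \text{otherwise.} \end{cases}$$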

We can interpret D0L as the average tree width or, more precisely, as the geometric mean number of branches counted across the tree. In an evolutionary tree where branch lengths correspond to elapsed time, D0L equates to average richness across time, excluding extinct lineages. For q>0, DqL can be interpreted as the effective number of counted nodes maximally distant from the root or – because all maximally distant counted nodes must be leaves – as the effective number of maximally distant leaves. In biological terms, this corresponds to the effective number of extant types maximally distinct from the root type.

Second, we define

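$$J_q^L := M_{\mathrm{long},1}(J_q) = \begin{cases} \dfrac{1}{\bar{h}} \sum_{i \in I} S_i h_i J_q(P_i) & \text{if } h > 0, \\ 1 & \text{otherwise.} \end{cases} \tag{0.4}$$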

Just as Jq measures the evenness of node sizes, so JqL measures the average evenness of branch sizes across the tree. If the tree is leafy and ultrametric then JqL=1 for q>0 if and only if the tree is fully symmetric. Hence, when applied to leafy ultrametric trees, JqL can be interpreted as a symmetry index (also known as a sound balance index (Mir et al., 2018)).

Figure 1b illustrates how D0L, D1L and J1L (and other index values yet to be defined) vary with branch length λ for the three-leaf tree of Figure 1a.
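The longitudinal mean diversity and evenness can be computed directly from the interval construction described above. The sketch below is illustrative only: the tree encoding `{node: (parent, branch_length, node_size)}`, the helper names, and the layout of the three-leaf tree (one leaf on a root branch of length 1, two leaves below an internal node at depth `lam`) are assumptions, and the recursion is intended for small trees.

```python
import math


def hill(p, q):
    p = [x for x in p if x > 0]
    if not p:
        return 0.0
    if q == 1:
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def evenness(p, q):
    d0 = hill(p, 0)
    return 1.0 if d0 <= 1 else math.log(hill(p, q)) / math.log(d0)


def longitudinal_indices(tree, q):
    """Return (longitudinal mean diversity, longitudinal mean evenness).

    Each interval's weight is its summed branch size times its height.
    """
    total = sum(s for _, _, s in tree.values())
    depth, bsize = {}, dict.fromkeys(tree, 0.0)

    def node_depth(n):
        if n not in depth:
            parent, length, _ = tree[n]
            depth[n] = 0.0 if parent is None else node_depth(parent) + length
        return depth[n]

    for n, (_, _, s) in tree.items():
        node_depth(n)
        a = n
        while a is not None:          # add n's size to every ancestral branch
            bsize[a] += s / total
            a = tree[a][0]

    cuts = sorted(set(depth.values()))  # interval boundaries at node depths
    log_d = j = norm = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        # Non-zero-sized branches that span the whole interval (lo, hi).
        segs = [bsize[n] for n, (par, _, _) in tree.items()
                if par is not None and bsize[n] > 0
                and depth[par] <= lo and depth[n] >= hi]
        if not segs:
            continue                    # interval has only zero-sized branches
        S = sum(segs)
        P = [s / S for s in segs]
        w = S * (hi - lo)
        log_d += w * math.log(hill(P, q))   # geometric mean of diversities
        j += w * evenness(P, q)             # arithmetic mean of evenness
        norm += w
    if norm == 0:
        return 0.0, 1.0
    return math.exp(log_d / norm), j / norm


def three_leaf_tree(lam):
    """Hypothetical layout of a three-leaf ultrametric tree of height 1."""
    return {"root": (None, 0.0, 0.0), "A": ("root", 1.0, 1.0),
            "X": ("root", lam, 0.0),
            "B": ("X", 1.0 - lam, 1.0), "C": ("X", 1.0 - lam, 1.0)}


for q in (0, 1):
    print(q, longitudinal_indices(three_leaf_tree(0.5), q))
```

For this layout the order-0 value is the geometric mean of two branches over the first interval and three over the second, weighted by the interval heights.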

Definition of the node-wise mean: first special case

In the special case in which all branches have the same length l, we can obtain a node-wise mean by calculating an index value for each node, based on the node’s child branch sizes, and then taking a weighted average of these node index values. We previously used this approach to define new tree balance indices (Lemant et al., 2022).

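Let $V$ denote the set of all internal nodes, excluding nodes with only zero-sized descendants. Let $C_i$ denote the subtree containing only $i$ and its children. For $i \in V$ and $b \in C_i$, let $s_b$ denote the size of $b$ and define

$$S_i = \sum_{b \in C_i} s_b \in (0, 1].$$

Then $S_i = 1$ for all nodes $i$ if and only if the tree is a leafy piecewise star tree. It follows that

$$\sum_{i \in V} S_i l = \bar{h}.$$

Now for each $b \in C_i$, define the proportional branch size $p_b := s_b / S_i$ and let $P_i := \{p_b : b \in C_i, p_b > 0\}$. We then define the node-wise mean of order $r$ of index $F$ as the weighted power mean of the $F$ values assigned to the nodes:

$$M_{\mathrm{node},r}(F)(T; w) := \begin{cases} \left( \dfrac{\sum_{i \in V(T)} w_i \left[ F(P_i) \right]^{r}}{\sum_{i \in V(T)} w_i} \right)^{\frac{1}{r}} & \text{if } h > 0, \\ F(\emptyset) & \text{otherwise,} \end{cases} \tag{0.5}$$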

where the weight w>0 is a function of i that remains to be specified.

Definition of the node-wise mean: second special case

In the case of a piecewise star tree with h>0, we can set the index value of each internal node k as the longitudinal mean index value of the subtree Ck. We then have

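$$M_{\mathrm{node},r,t}(F)(T; u, w) = \left( \frac{\sum_{k \in V(T)} u_k \left[ M_{\mathrm{long},r}(F)(C_k; w) \right]^{t}}{\sum_{k \in V(T)} u_k} \right)^{\frac{1}{t}} = \left( \frac{1}{\sum_{k \in V(T)} u_k} \sum_{k \in V(T)} u_k \left( \frac{\sum_{i \in I(T)} w_{ik} \left[ F(P_{ik}) \right]^{r}}{\sum_{i \in I(T)} w_{ik}} \right)^{\frac{t}{r}} \right)^{\frac{1}{t}}, \tag{0.6}$$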

where t is the exponent of the across-nodes power mean, uk>0 is the weight assigned to node k, Pik contains the proportional sizes of all branch segments that belong to both subtree Ck and interval i, and wik>0 is the weight assigned to k associated with interval i.

To keep our system internally consistent we would like, in the case of piecewise star trees, the node-wise mean of any index to be equal to the longitudinal mean of the same index. Comparing Equation 0.6 with the definition of the longitudinal mean (Equation 0.2), we see that the right-hand sides are equivalent if and only if three conditions hold:

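$$r = t, \qquad \sum_{i \in I(T)} w_{ik} = u_k, \qquad \sum_{k \in V(T)} u_k = \sum_{i \in I(T)} w_i.$$

Under these conditions, summing index values across subtree intervals and then across nodes gives the same result as summing across tree intervals. We then have for any piecewise star tree $T$ with $h > 0$,

$$M_{\mathrm{node},r}(F)(T; w) = \left( \frac{\sum_{k \in V(T)} \sum_{i \in I(T)} w_{ik} \left[ F(P_{ik}) \right]^{r}}{\sum_{i \in I(T)} w_i} \right)^{\frac{1}{r}}.$$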

In the particular case F=Dq, the index value assigned to each node k (that is, the longitudinal mean index value of the subtree Ck) measures the diversity of the child branches of k. When Ck has m branches of equal length and size, the node diversity of k is m. In the case m>1, as one branch length is reduced towards zero while all else is kept constant, the node diversity of k decreases continuously to m-1. Decreasing instead the size of one branch has the same effect provided q>0. Hence the diversity value assigned to each node can be interpreted as an effective outdegree, and the node-wise mean diversity can be interpreted as an average effective outdegree. When q=0 the effective outdegree ignores branch sizes. As q increases, the effective outdegree gives less weight to branches of smaller size. We would like to retain this interpretation as we generalize the definition of the node-wise mean.
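The behaviour of the effective outdegree can be illustrated numerically for a single node whose child branches all end in leaves. The sketch below treats that node and its children as a star subtree and takes the interval-weighted geometric mean of the within-interval Hill numbers; the function names and the representation of the subtree as parallel lists of branch lengths and leaf sizes are assumptions made for this example.

```python
import math


def hill(p, q):
    p = [x for x in p if x > 0]
    if not p:
        return 0.0
    if q == 1:
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def effective_outdegree(lengths, sizes, q):
    """Diversity of a node's child branches, treated as a star subtree.

    lengths[i] and sizes[i] describe the i-th child branch (each child
    is assumed to be a leaf). Intervals are delimited by the distinct
    branch lengths, and each interval is weighted by its summed branch
    size times its height.
    """
    cuts = sorted(set([0.0] + list(lengths)))
    num = den = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        # Branches that are still present throughout the interval (lo, hi).
        segs = [s for l, s in zip(lengths, sizes) if l >= hi and s > 0]
        if not segs:
            continue
        S = sum(segs)
        w = S * (hi - lo)
        num += w * math.log(hill([s / S for s in segs], q))
        den += w
    return math.exp(num / den) if den > 0 else 0.0


third = 1.0 / 3.0
# Three equal branches, then one branch shrinking towards zero length.
for short in (1.0, 0.5, 0.1, 1e-9):
    print(short, effective_outdegree([1.0, 1.0, short], [third] * 3, 0))
# Shrinking the size (rather than the length) of one branch, with q = 1.
print(effective_outdegree([1.0, 1.0, 1.0], [0.49, 0.49, 0.02], 1))
```

As one branch length shrinks, the printed values move continuously from three towards two, and shrinking one branch size has a comparable effect when the order is positive.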

Definition of the node-wise mean: general case

In extending the definition to all rooted trees, we want to ensure that, as with the longitudinal mean, the node-wise mean changes continuously as we vary branch lengths. We illustrate this general issue with an example.

Example 0.2

Consider a leafy ultrametric tree with six leaves such that the root has two descendant branches each of length $\lambda$, and both non-root internal nodes have three descendant branches, all of length $1 - \lambda$. When $\lambda = \frac{1}{2}$ (Figure 2a), it follows from our special-case definition (Equation 0.5) that the root has richness 2, the internal nodes each have richness 3, and the node-wise mean richness is intermediate between 2 and 3. As $\lambda$ increases from $\frac{1}{2}$ to 1, the node richness values should remain unchanged but the root node richness should be given greater weight, so that the node-wise mean richness (which we will denote $D_0^N$) approaches 2 continuously as $\lambda \to 1$ (Figure 2b).

Fig. 2.


a) The six-leaf tree considered in Example 0.2 with branch length $\lambda = \frac{1}{2}$. b) As $\lambda \to 1$, the tree approaches a two-leaf star tree. c) As $\lambda \to 0$, the tree approaches a six-leaf star tree.

At the other extreme, as $\lambda$ decreases from $\frac{1}{2}$ to 0, we would like $D_0^N$ to increase continuously to 6 (Figure 2c). Given that the weight assigned to the root node richness should decrease as $\lambda$ decreases, the only way to achieve the required increase in $D_0^N$ is to increase the richness value assigned to each non-root internal node $k$. We can do this by making the richness value assigned to $k$ depend not only on the child branches of $k$ but also, to an increasing degree as $\lambda$ decreases, on the other branches that run alongside the branches of $k$.

Generalizing from the example we conclude that, when the distance between node $k$ and any ancestor $j$ of $k$ (in the example, the root) is less than the height of $C_k$ (in the example, when $\lambda < \frac{1}{2}$), the index value assigned to $k$ should depend not only on the branches of $C_k$ (the child branches of $k$) but also on branch segments that descend from $j$ and that coexist in transverse intervals with the branches of $C_k$. The weight assigned to $k$ depends only on $C_k$ but the index value assigned to $k$ is a weighted average of index values across $k$ and all ancestors of $k$.

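To formalize this concept, we first define, for interval $i \in I$ and node $j \in V$,

$$S_{iT_j} = \sum_{b \in B_i \cap T_j} s_b \in [0, 1], \qquad S_{iC_j} = \sum_{b \in B_i \cap C_j} s_b \in [0, 1],$$

where $T_j$ is the subtree containing $j$ and all its descendants. This implies

$$\sum_{i \in I} S_{iT_r} h_i = \sum_{i \in I} \sum_{j \in V} S_{iC_j} h_i = \bar{h},$$

where $r$ is the root (and hence $T_r$ is the entire tree). $S_{iC_j}$ is a generalization of the $S_i$ used in our previous definitions, whereas $S_{iT_j}$ is a new concept. For each $b \in B_i \cap T_j$, let

$$p_b = \begin{cases} s_b / S_{iT_j} & \text{if } S_{iT_j} > 0, \\ 0 & \text{otherwise,} \end{cases}$$

and define $P_{ij} = \{p_b : b \in B_i \cap T_j, p_b > 0\}$. We then define the node-wise average as the triple power mean

$$M_{\mathrm{node},r,s,t}(F)(T; u, v, w) = \left( \frac{1}{\sum_{k \in V(T)} u_k} \sum_{k \in V(T)} u_k \left( \frac{1}{\sum_{j \in A_k} v_{jk}} \sum_{j \in A_k} v_{jk} \left( \frac{\sum_{i \in I(T)} w_{ik} \left[ F(P_{ij}) \right]^{r}}{\sum_{i \in I(T)} w_{ik}} \right)^{\frac{s}{r}} \right)^{\frac{t}{s}} \right)^{\frac{1}{t}},$$

where $A_k$ is the set containing $k$ and all ancestors of $k$, $s$ is the exponent of the across-ancestors power mean, and $v_{jk}$ are the ancestor weights. This expression is consistent with Equation 0.6 if and only if

$$t = s = r, \qquad \sum_{j \in A_k} v_{jk} = u_k = \sum_{i \in I(T)} w_{ik}, \qquad \sum_{k \in V(T)} u_k = \sum_{i \in I(T)} w_i. \tag{0.7}$$

We then arrive at a simpler general definition

$$M_{\mathrm{node},r}(F)(T; v, w) := \begin{cases} \left( \dfrac{1}{\sum_{k \in V(T)} u_k} \sum_{k \in V(T)} \sum_{j \in A_k} \dfrac{v_{jk}}{u_k} \sum_{i \in I(T)} w_{ik} \left[ F(P_{ij}) \right]^{r} \right)^{\frac{1}{r}} & \text{if } h > 0, \\ F(\emptyset) & \text{otherwise.} \end{cases}$$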

Integral forms of the node-wise and longitudinal means

Since our preferred ancestor weights are best expressed as integrals, we will find it useful to define the longitudinal and node-wise means even more generally by integrating over depths instead of summing over intervals. Suppose we assign a non-negative density $f_b(x)$ to every branch $b$ at every depth $x$, with $f_b(x) = 0$ for every $x$ at which $b$ is absent. Define the tree height $h := \max\{x : f_b(x) > 0, b \in B\}$, where $B$ is the set of all branches. We can then define branch size $s_b$ as the non-increasing function of depth $x$:

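$$\hat{h} := \sum_{b \in B} \int_0^h f_b(x)\,dx, \qquad s_b(x) := \begin{cases} \dfrac{1}{\hat{h}} \sum_{b' \in G_b} \int_x^h f_{b'}(t)\,dt & \text{if } \hat{h} > 0, \\ 0 & \text{otherwise,} \end{cases}$$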

where Gb is the set containing b and all branches that descend from b. Let

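$$S_{T_j}(x) := \sum_{b \in B_j} s_b(x) \in [0, 1].$$

For each $b \in B_j$, define the proportional branch size

$$p_{bj}(x) := \begin{cases} s_b(x) / S_{T_j}(x) & \text{if } S_{T_j}(x) > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Let $P_j(x) := \{p_{bj}(x) : b \in B_j, p_{bj}(x) > 0\}$. We then define the node-wise mean of an index $F$ as

$$M_{\mathrm{node},r}(F)(T; v, w) := \begin{cases} \left( \dfrac{1}{\sum_{k \in V(T)} u_k} \sum_{k \in V(T)} \sum_{j \in A_k} \dfrac{v_{jk}}{u_k} \int_0^h w_k(x) \left[ F(P_j(x)) \right]^{r} dx \right)^{\frac{1}{r}} & \text{if } h > 0, \\ F(\emptyset) & \text{otherwise,} \end{cases}$$

where $w_k(x)$ is the weight assigned to node $k$ at depth $x$, and

$$u_k = \int_0^h w_k(x)\,dx.$$

The longitudinal mean can similarly be defined in terms of integrals as

$$M_{\mathrm{long},r}(F)(T; w) := \begin{cases} \left( \dfrac{\int_0^h w(x) \left[ F(P(x)) \right]^{r} dx}{\int_0^h w(x)\,dx} \right)^{\frac{1}{r}} & \text{if } h > 0, \\ F(\emptyset) & \text{otherwise,} \end{cases}$$

where $P(x) := \{p_b(x) : b \in B, p_b(x) > 0\}$,

$$p_b(x) := \begin{cases} s_b(x) / S(x) & \text{if } S(x) > 0, \\ 0 & \text{otherwise,} \end{cases} \qquad S(x) := \sum_{b \in B} s_b(x) = \sum_{j \in V} \sum_{b \in B_j} s_b(x).$$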

Our previous definitions are included as special cases in which the branch density is zero except at each counted node, where it is equal to the node size. In an evolutionary tree, branch density corresponds to population size, and branch size corresponds to number of extant descendants. Although it is beyond the scope of the current manuscript, we note that the integral forms would permit us to apply our indices to a more general class of tree, such that the size of any branch is allowed to vary along its length.

New node-wise mean indices

To define new tree indices as node-wise means of Dq and Jq, we first set wk=SCk, where

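$$S_{C_k}(x) := \sum_{b \in C_k} s_b(x),$$

and we define the normalization factor

$$\bar{h}_{C_j} := \int_0^h S_{C_j}(x)\,dx, \qquad \bar{h} = \sum_{j \in V} \bar{h}_{C_j} = \int_0^h S(x)\,dx.$$

Let $d_k$ denote the depth of node $k$ and let $d_{jk} = d_k - d_j$ denote the distance from $j$ to $k$. Let $j'$ denote the parent of node $j$. The ancestor weight function $v$ should have three properties. First, as an assumption of our general definition (Equation 0.7),

$$\sum_{j \in A_k} v_{jk} = u_k = \int_0^h w_k(x)\,dx.$$

Second, $v_{jk}$ should decrease as $d_{j'j}$ decreases. Third, $v_{jk}$ should increase as the overlap between $C_j$ and $C_k$ increases. A simple way to satisfy all three conditions is to set

$$v_{jk} = \int_{\alpha_{jk}}^{\beta_{jk}} S_{C_k}(x)\,dx,$$

where $\alpha_{jk} := d_k + d_{jk}$ and

$$\beta_{jk} := \begin{cases} \alpha_{jk} + d_{j'j} & \text{if } j \text{ is not the root}, \\ \infty & \text{otherwise.} \end{cases}$$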

Given the above choices of w and v, we define the node-wise mean diversity of order q as

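$$D_q^N := M_{\mathrm{node},0}(D_q). \tag{0.8}$$

This is equivalent to $H_q^N = M_{\mathrm{node},1}(H_q)$ with $D_q^N = \exp(H_q^N)$. In particular,

$$D_0^N = \begin{cases} \exp\left( \dfrac{1}{\bar{h}} \sum_{k \in V} \dfrac{1}{\bar{h}_{C_k}} \sum_{j \in A_k} v_{jk} \int_0^h S_{C_k}(x) \log |P_j(x)|\,dx \right) & \text{if } h > 0, \\ 0 & \text{otherwise,} \end{cases}$$

$$D_1^N = \begin{cases} \exp\left( \dfrac{1}{\bar{h}} \sum_{k \in V} \dfrac{1}{\bar{h}_{C_k}} \sum_{j \in A_k} v_{jk} \int_0^h S_{C_k}(x)\, H_1(P_j(x))\,dx \right) & \text{if } h > 0, \\ 0 & \text{otherwise,} \end{cases}$$

where

$$H_1(P_j(x)) = -\sum_{b \in B_j} \frac{s_b(x)}{S_{T_j}(x)} \log \frac{s_b(x)}{S_{T_j}(x)}.$$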

As previously explained, we can interpret DqN as an average effective outdegree (branching factor in computer science) that accounts for branch lengths only (q=0) or for both branch lengths and branch sizes (q>0). Less formally, DqN quantifies the bushiness of the tree.

With the same w and v, we define the universal tree balance JqN as

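$$J_q^N := M_{\mathrm{node},1}(J_q) = \begin{cases} \dfrac{1}{\bar{h}} \sum_{k \in V} \dfrac{1}{\bar{h}_{C_k}} \sum_{j \in A_k} v_{jk} \int_0^h S_{C_k}(x)\, J_q(P_j(x))\,dx & \text{if } h > 0, \\ 1 & \text{otherwise.} \end{cases} \tag{0.9}$$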

In the case of uniform branch lengths, this definition simplifies to

$$J_q^N = \frac{1}{\bar{h}} \sum_{i \in V} S_i J_q(P_i),$$

where Si and Pi are defined as in Equation 0.5. This means that for trees with uniform branch lengths, JqN is identical to our previous definition of the tree balance index Jq (Lemant et al., 2022), excepting one important difference. Whereas our prior index assigns a balance score of zero to any node that has outdegree 1, the above definition instead assigns a balance score of one. Therefore linear trees are considered maximally unbalanced according to Jq but maximally balanced according to JqN. This difference ensures that all our new evenness indices have consistent definitions and interpretations.

Example 0.3

Consider the perfectly balanced, bifurcating, leafy tree with four leaves and branch lengths $\lambda$ (upper two branches) and $1-\lambda$ (lower four branches), as shown in Figure 3a. For all $q \ge 0$, if $\lambda \ge \tfrac{1}{2}$ then ${}^qD^N = 2$, and otherwise ${}^qD^N = 4^{1-\lambda}$, as shown in Figure 3b (dark blue curve). A step-by-step derivation is in the Appendix.
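The piecewise result of Example 0.3 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it hard-codes the four-leaf tree, assumes equal leaf sizes of 1/4 and zero-sized internal nodes (so the effective height is 1), and applies the ancestor weights as we read them from the definition above (each internal node is credited with the stretch of its child branches from depth λ to 2λ, and the root with any remainder). The function names are hypothetical.

```python
import math

def nodewise_log_diversity(lam):
    """Log of the node-wise mean diversity for the four-leaf tree of Example 0.3.

    Assumed layout: root at depth 0, two internal nodes at depth lam,
    four equally sized leaves at depth 1 (effective height 1).
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    # Root: its two child branches span [0, lam] with total size 1.
    total = lam * math.log(2)
    # Each internal node: its two child branches span [lam, 1], total size 1/2.
    # The node's own weight covers depths [lam, 2*lam]; any remaining stretch
    # is credited to the root, below which there are four branches.
    split = min(2.0 * lam, 1.0)
    own_weight = 0.5 * (split - lam)
    root_weight = 0.5 * (1.0 - split)
    total += 2 * (own_weight * math.log(2) + root_weight * math.log(4))
    return total  # effective height is 1, so no further normalization

def closed_form(lam):
    """Closed-form value stated in Example 0.3."""
    return 2.0 if lam >= 0.5 else 4.0 ** (1.0 - lam)

for lam in (0.1, 0.25, 0.5, 0.75):
    print(lam, math.exp(nodewise_log_diversity(lam)), closed_form(lam))
```

Because all branches within each interval have equal size, the within-interval Hill numbers do not depend on q, which is why the sketch needs no q argument.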

Fig. 3.


a) The four-leaf tree considered in Examples 0.3 and 0.6. b) Index values versus branch length $\lambda$ for the tree of Example 0.6. Curves for indices with parameter $q$ are independent of the value of $q \ge 0$. The y-axis is log-transformed so that the curves for all diversity indices except ${}^0\bar{D}$ and $M_{\mathrm{node},1}\,{}^qD$ appear piecewise linear. c) ${}^q\bar{D}$ and ${}^qD^L$ values for the four-leaf tree considered in Example 0.6, for varied $q$ with $\lambda = \tfrac{1}{2}$.

The above example illustrates that, for leafy ultrametric trees, the node-wise mean diversity, like the longitudinal mean diversity, is a piecewise exponential function of branch lengths. Equivalently, the entropy indices are piecewise linear. This property depends on our defining the ancestor weight function vjk as an integral of SCk. Because SCk is a step function, the integrals in all our node-wise mean index definitions are simply sums of areas of rectangles, and the widths of these rectangles are linear functions of branch lengths. Our definitions are designed so that, although Equations 0.8 and 0.9 might appear complicated, in practice they produce relatively simple expressions.

The star mean and new star mean indices

Like the longitudinal and node-wise means, the star mean is based on branch sizes. Unlike those other two means, but in common with the node-size indices Dq and Jq, the star mean ignores tree topology. The idea is that, in effect, we rearrange the tree by reattaching all branches to the root to form a star tree, while retaining branch sizes and lengths, and then calculate the longitudinal (equivalently node-wise) mean index value of the star tree. For index F and tree T, we define the star mean of order r of F such that

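$$M_{\mathrm{star},r}F(T; w^*) := \begin{cases} \left( \dfrac{\int_0^h w^*(x)\,[F(P^*(x))]^r\,dx}{\int_0^h w^*(x)\,dx} \right)^{1/r} & \text{if } h > 0 \\ F(\emptyset) & \text{otherwise,} \end{cases} \tag{0.10}$$

where $P^*(x) := \{p_b^*(x) : b \in B,\ p_b^*(x) > 0\}$,

$$p_b^*(x) := \begin{cases} s_b(x + d_b)/S^*(x) & \text{if } S^*(x) > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad S^*(x) := \sum_{b \in B} s_b(x + d_b),$$

and $d_b$ is the depth of the parent node of branch $b$. Note that

$$\int_0^h S^*(x)\,dx = \int_0^h S(x)\,dx = \bar{h}.$$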

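With $w^* = S^*$, we define the star mean diversity of order $q$ as

$${}^qD^S := M_{\mathrm{star},0}\,{}^qD, \tag{0.11}$$

which is equivalent to ${}^qH^S = M_{\mathrm{star},1}\,{}^qH$ with ${}^qD^S = \exp {}^qH^S$. In particular,

$${}^0D^S = \begin{cases} \exp\left( \dfrac{1}{\bar{h}} \int_0^h S^*(x) \log|P^*(x)|\,dx \right) & \text{if } h > 0 \\ F(\emptyset) & \text{otherwise,} \end{cases}$$
$${}^1D^S = \begin{cases} \exp\left( \dfrac{1}{\bar{h}} \int_0^h S^*(x)\,{}^1H(P^*(x))\,dx \right) & \text{if } h > 0 \\ F(\emptyset) & \text{otherwise.} \end{cases}$$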

DqS quantifies the effective number of branches in the tree, either accounting for branch lengths only (q=0) or for both branch lengths and branch sizes (q>0). Because every non-root node has exactly one parent branch, and because D0S accounts for branch lengths but not sizes, D0S can also be interpreted as an effective number of non-root nodes. We also define an index that quantifies the evenness of all branch sizes:

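$${}^qJ^S := M_{\mathrm{star},1}\,{}^qJ = \begin{cases} \dfrac{1}{\bar{h}} \int_0^h S^*(x)\,{}^qJ(P^*(x))\,dx & \text{if } h > 0 \\ 1 & \text{otherwise.} \end{cases} \tag{0.12}$$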

Figures 1b and 3b illustrate how D0S, D1S and J1S values vary with branch lengths for three- and four-leaf trees.
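To make the star mean concrete, here is a minimal sketch of the star mean diversity for q = 0 and q = 1. It is an illustration under stated assumptions rather than the authors' software: each branch is represented by a hypothetical (length, size) pair, branch size is assumed constant along the branch, and the empty-tree case is reported as an error instead of returning the limiting value used in Equation 0.10.

```python
import math

def star_mean_diversity(branches, q):
    """Star mean diversity of order q (this sketch supports q = 0 and q = 1).

    `branches` is a list of (length, size) pairs; each branch is assumed to
    keep a constant size along its whole length.
    """
    if q not in (0, 1):
        raise ValueError("this sketch handles q = 0 and q = 1 only")
    # Once every branch is reattached to the root, branch b spans [0, length_b],
    # so the set of branches present changes only at the distinct lengths.
    cuts = sorted({length for length, size in branches if length > 0 and size > 0})
    weighted_log_sum = 0.0   # integral of S*(x) times the log-diversity
    effective_height = 0.0   # integral of S*(x)
    lower = 0.0
    for upper in cuts:
        sizes = [s for length, s in branches if length >= upper and s > 0]
        total = sum(sizes)
        props = [s / total for s in sizes]
        if q == 0:
            log_div = math.log(len(props))
        else:
            log_div = -sum(p * math.log(p) for p in props)
        width = upper - lower
        weighted_log_sum += width * total * log_div
        effective_height += width * total
        lower = upper
    if effective_height == 0:
        raise ValueError("no branches with positive length and size")
    return math.exp(weighted_log_sum / effective_height)

# Four-leaf tree of Example 0.3 with lambda = 1/2: six branches of length 1/2.
tree = [(0.5, 0.5)] * 2 + [(0.5, 0.25)] * 4
print(star_mean_diversity(tree, 0), star_mean_diversity(tree, 1))
```

With uniform branch lengths the q = 0 value reduces to the plain branch count, which matches the interpretation of the index as an effective number of branches.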

Non-normalized indices

Although our focus is on indices that describe shape, rather than size, we note that every longitudinal, node-wise, or star mean diversity index can be converted into a non-normalized diversity index simply by omitting the normalization factor. Such indices are useful in applications where the unit of branch length should be retained, such as when assessing loss of richness or diversity due to the removal of a node. In particular, we will find it useful to define the non-normalized entropy index

$$\tilde{H}_P := -\sum_{b \in B} l_b\,p_b \log p_b = \sum_{i \in I} h_i \log {}^1D(P_i). \tag{0.13}$$

Results

${}^qD^L$ improves on prior indices for non-ultrametric trees

Our indices ${}^0D^L$ and ${}^1D^L$ are similar to well-known pre-existing indices but with important improvements (Table 3). The phylogenetic diversity of Faith (1992), which is popular among conservation biologists, is defined as

$$PD := \sum_{b \in B} l_b.$$

Table 3.

Advantages of using our indices instead of previously-defined indices.

Prior index | Proposed replacement | Equation | Advantages of replacement
----------- | -------------------- | -------- | -------------------------
Allen et al.'s $H_P$ | $\tilde{H}_P$ | 0.13 | Interpretable for non-ultrametric trees
Chao et al.'s ${}^q\bar{D}$ | ${}^qD^L$ | 0.3 | Bounded and interpretable for non-ultrametric trees; more self-consistent; more intuitive for $q > 1$
All prior tree balance and imbalance indices | ${}^qJ^N$ | 0.9 | Defined for all rooted trees; can meaningfully compare any pair of trees; accounts for node sizes and branch lengths

Phylogenetic entropy (Allen et al., 2009) – a previous generalization of Shannon’s entropy – is defined in our notation as

$$H_P := -\sum_{b \in B} l_b\,s_b \log s_b.$$

Chao et al. (2010) defined normalized versions of these indices that can be written as

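$${}^0\bar{D} = \frac{PD}{\bar{h}} = \frac{1}{\bar{h}} \sum_{i \in I} h_i |B_i| = \frac{\sum_{i \in I} h_i\,{}^0D(Q_i)}{\sum_{i \in I} S_i h_i},$$
$${}^1\bar{D} = \exp\frac{H_P}{\bar{h}} = \exp\left( -\frac{1}{\bar{h}} \sum_{i \in I} h_i \sum_{b \in B_i} s_b \log s_b \right) = \exp\frac{\sum_{i \in I} h_i \log {}^1D(Q_i)}{\sum_{i \in I} S_i h_i},$$

where $Q_i = \{s_b : b \in B_i\}$.

A first problem with these definitions is that, for non-ultrametric trees, phylogenetic entropy lacks a clear interpretation. This issue is due to $H_P$ being defined in terms of sets of branch sizes $Q_i$ instead of sets of within-interval proportional branch sizes $P_i = \{p_b = s_b/S_i : b \in B_i\}$, as illustrated by the following example.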

Example 0.4

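Consider the three-node, two-leaf tree with leaf sizes $p$ and $1-p$, and leaf depths $1+\lambda$ and $\lambda$, respectively (Figure 4a). For this tree, as $\lambda \to 0$,

$$PD = 1 + 2\lambda \to 1, \qquad \exp H_P = \exp\left[ -(1+\lambda)\,p \log p - \lambda (1-p) \log(1-p) \right] \to p^{-p}.$$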
Fig. 4.


a) The two-leaf tree considered in Examples 0.4 and 0.5. b) Index values for the tree of Example 0.5 with $p = \tfrac{1}{4}$. As branch length $\lambda$ decreases, the previously defined indices ${}^0\bar{D}$ and ${}^1\bar{D}$ (grey curves) increase monotonically until both ${}^0\bar{D} > {}^0D$ and ${}^1\bar{D} > {}^0D$. In contrast, our new indices ${}^0D^L$ and ${}^1D^L$ (black curves) decrease monotonically as $\lambda$ decreases, with ${}^0D^L < {}^0D$ and ${}^1D^L < {}^0D$ for all values of $\lambda$.

Therefore $PD$ behaves as expected but, except when $p = 0$ or $p = 1$, $\exp H_P$ approaches a limit greater than 1. Hence $\exp H_P$ (which is supposed to be a measure of diversity) is greater than $PD$ (a measure of richness). Moreover, whereas we expect diversity to be maximal when node sizes are equal, $\exp H_P$ is maximal when the node sizes are unequal (specifically, $\exp H_P \approx 1.44$ when $p = e^{-1} \approx 0.37$). If we instead use our index $\tilde{H}_P$ (Equation 0.13) then we obtain

$$\exp \tilde{H}_P = \exp\left[ \lambda \left( -p \log p - (1-p) \log(1-p) \right) \right] \to 1,$$

as we would expect.
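The limits in Example 0.4 are easy to reproduce. The sketch below simply evaluates the three closed-form expressions of the example for the two-leaf tree; the function name and return layout are hypothetical, and p is assumed to lie strictly between 0 and 1.

```python
import math

def two_leaf_entropy_indices(p, lam):
    """Return (PD, exp of Allen et al.'s entropy, exp of the Equation 0.13 index)
    for the two-leaf tree with leaf sizes p, 1 - p and leaf depths 1 + lam, lam."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    rest = 1.0 - p
    pd = (1.0 + lam) + lam
    # Allen et al.: branch length times s*log(s), sizes not renormalized.
    h_allen = -(1.0 + lam) * p * math.log(p) - lam * rest * math.log(rest)
    # Equation 0.13: only the interval [0, lam] contains two branches,
    # so only that interval contributes entropy.
    h_new = lam * (-p * math.log(p) - rest * math.log(rest))
    return pd, math.exp(h_allen), math.exp(h_new)

# Near lam = 0 the prior index tends to p**(-p), which peaks at p = 1/e.
pd, allen, new = two_leaf_entropy_indices(math.exp(-1), 1e-9)
print(pd, allen, new)
```

At p = 1/e the prior index tends to e^(1/e), about 1.44, while the richness tends to 1, which is the inconsistency the example describes.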

A second problem is that if the tree is not ultrametric then ${}^0\bar{D}$ and ${}^1\bar{D}$ do not correspond to weighted means. If and only if the tree is leafy and ultrametric, $S_i = 1$ and $Q_i = P_i$ for all $i$ and so

$${}^0\bar{D} = M_{\mathrm{long},1}\,{}^0D, \qquad {}^1\bar{D} = M_{\mathrm{long},0}\,{}^1D = {}^1D^L,$$

with $w_i = S_i h_i = h_i$ in both cases. Otherwise, the numerator weights $h_i$ are unequal to the denominator weights $S_i h_i$. As previously noted (Chao et al., 2010; Leinster and Cobbold, 2012), this implies that ${}^0\bar{D}$ and ${}^1\bar{D}$ can take values exceeding the number of counted nodes when applied to non-ultrametric (or non-leafy) trees. Therefore these normalized indices lack a universal interpretation in terms of effective numbers of counted nodes (or extant types) (Leinster and Cobbold, 2012).

We avoid both problems by defining our richness and diversity indices as weighted means of the within-interval proportional branch sizes in all cases. As illustrated by the following example, the differences between ${}^0D^L$ and ${}^0\bar{D}$ and between ${}^1D^L$ and ${}^1\bar{D}$ are generally unbounded and can be relatively large even when branch sizes and node sizes are not very unequal.

Example 0.5

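Consider the three-node tree of Figure 4a with $p < \tfrac{1}{2}$. We have ${}^0D = 2$, $\bar{h} = p + \lambda$, and

$${}^0\bar{D} = \frac{(1+\lambda) + \lambda}{p + \lambda} > \frac{1 + 2\lambda}{\tfrac{1}{2} + \lambda} = 2, \qquad {}^1\bar{D} = \exp\frac{-(1+\lambda)\,p \log p - \lambda (1-p) \log(1-p)}{p + \lambda} \to \frac{1}{p} > 2 \text{ as } \lambda \to 0.$$

It follows that ${}^0\bar{D} > {}^0D$ for all $\lambda$, and we can choose $\lambda$ sufficiently small such that also ${}^1\bar{D} > {}^0D$ (Figure 4b, grey curves). For the same three-node tree, our new indices are instead

$${}^0D^L = \exp\frac{\lambda \log 2}{p + \lambda} < \exp\frac{\lambda \log 2}{\lambda} = 2, \qquad {}^1D^L = \exp\frac{\lambda \left( -p \log p - (1-p) \log(1-p) \right)}{p + \lambda} < \exp\frac{\lambda \log 2}{\lambda} = 2.$$

Therefore ${}^0D^L < {}^0D$ and ${}^1D^L < {}^0D$ for all $\lambda \ge 0$, as we would expect (Figure 4b, black curves). As $\lambda \to 0$, both ${}^0D^L$ and ${}^1D^L$ approach 1, consistent with the fact that the tree has exactly one non-root node when $\lambda = 0$. As $\lambda \to \infty$, the tree becomes increasingly close to being an ultrametric star tree, and hence ${}^0D^L \to {}^0\bar{D}$ and ${}^1D^L \to {}^1\bar{D}$ (convergence between dashed curves and between solid curves in Figure 4b).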
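The comparison in Example 0.5 can be tabulated directly from the closed-form expressions. This is a small sketch with hypothetical names; it evaluates the formulas of the example and makes no attempt to handle general trees.

```python
import math

def two_leaf_diversities(p, lam):
    """Prior (Chao et al.) and longitudinal mean indices for the two-leaf tree
    with leaf sizes p, 1 - p and leaf depths 1 + lam, lam (0 < p < 1, lam > 0)."""
    rest = 1.0 - p
    eff_height = p + lam                      # effective height of this tree
    entropy = -p * math.log(p) - rest * math.log(rest)
    prior_d0 = ((1.0 + lam) + lam) / eff_height
    prior_d1 = math.exp(
        (-(1.0 + lam) * p * math.log(p) - lam * rest * math.log(rest)) / eff_height
    )
    new_d0 = math.exp(lam * math.log(2) / eff_height)
    new_d1 = math.exp(lam * entropy / eff_height)
    return {"prior_d0": prior_d0, "prior_d1": prior_d1,
            "new_d0": new_d0, "new_d1": new_d1}

for lam in (0.01, 1.0, 1000.0):
    print(lam, two_leaf_diversities(0.25, lam))
```

For p = 1/4 and small λ the two prior indices exceed the number of counted nodes (two), whereas the longitudinal means stay below it; for large λ each pair of indices converges.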

${}^qD^L$ is more self-consistent and intuitive than the ${}^q\bar{D}$ of Chao et al. (2010)

Additional problems with the ${}^q\bar{D}$ indices of Chao et al. (2010) are that they are not self-consistent, and that they have counter-intuitive properties when $q > 1$. The general definition can be expressed as

$${}^q\bar{D} = \left( \frac{1}{\bar{h}} \sum_{i \in I} h_i \sum_{b \in B_i} s_b^q \right)^{\frac{1}{1-q}},$$

which can be restructured as

$${}^q\bar{D} = \left( \frac{1}{\bar{h}} \sum_{i \in I} h_i \left[ \left( \sum_{b \in B_i} s_b^q \right)^{\frac{1}{1-q}} \right]^{1-q} \right)^{\frac{1}{1-q}} = \left( \frac{1}{\bar{h}} \sum_{i \in I} h_i \left[ {}^qD(Q_i) \right]^{1-q} \right)^{\frac{1}{1-q}}.$$

Hence for leafy ultrametric trees we have

$${}^q\bar{D} = M_{\mathrm{long},1-q}\,{}^qD,$$

with $w_i = h_i = S_i h_i$. We have thus shown that, in the case of leafy ultrametric trees, every ${}^q\bar{D}$ can be expressed as a weighted mean of within-interval diversities. But ${}^0\bar{D}$ is the weighted arithmetic mean, ${}^1\bar{D}$ is the weighted geometric mean, and in general ${}^q\bar{D}$ is the weighted power mean of exponent $1-q$. One consequence is that, for ultrametric trees in which every transverse interval contains branches of equal size, the set of within-interval values will be the same for every $q$ value but the ${}^q\bar{D}$ values will be different. Moreover, as $q$ becomes larger, ${}^q\bar{D}$ increasingly gives larger weight to smaller within-interval diversities. As $q \to \infty$, the ${}^qD$ value assigned to each interval approaches the reciprocal of the maximum branch size within the interval. Counter-intuitively, ${}^q\bar{D}$ approaches the minimum of these within-interval ${}^qD$ values.

These peculiar properties of ${}^q\bar{D}$ are unnecessary and have no obvious advantages. The Hill numbers ${}^qD$, which are used to assign a diversity value to each interval, necessarily relate to different types of weighted mean (Equation 0.1). But the method of averaging between intervals need not depend on the method of calculating diversity within intervals. Every Hill number ${}^qD$ can be extended to account for tree shape using the weighted arithmetic mean, the weighted geometric mean, or any other weighted power mean of the within-interval diversities by varying exponent $r$ of the longitudinal mean diversity index

$$M_{\mathrm{long},r}\,{}^qD = \left( \frac{1}{\bar{h}} \sum_{i \in I} S_i h_i \left[ {}^qD(Q_i) \right]^r \right)^{\frac{1}{r}}.$$

The same choice exists when defining node-wise means and star means. To avoid incompatibilities within our system, we define all our diversity indices as weighted geometric means ($r \to 0$). The following example illustrates the problem and our solution.

Example 0.6

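Consider again the four-leaf tree of Example 0.3 (Figure 3a). The longitudinal mean diversity values assigned to this tree are

$${}^qD^L = {}^1\bar{D} = \exp\left( \lambda \log 2 + (1-\lambda) \log 4 \right) = 2^{2-\lambda} \text{ for all } q \ge 0, \quad \text{and} \quad {}^0\bar{D} = 2\lambda + 4(1-\lambda) = 2(2-\lambda),$$

which are unequal except when $\lambda = 0$ or $\lambda = 1$ (Figure 3b, black and grey curves). In particular, in the case of uniform branch lengths $\lambda = \tfrac{1}{2}$, we find ${}^1\bar{D} = 2\sqrt{2} \approx 2.83$ and ${}^0\bar{D} = 3$ (Figure 3c, dashed curve). As derived in Example 0.3, the node-wise mean diversity for this tree is

$${}^qD^N = \begin{cases} 4^{1-\lambda} & \text{if } \lambda < \tfrac{1}{2} \\ 2 & \text{otherwise,} \end{cases}$$

for all $q \ge 0$. Choosing the arithmetic mean instead of the geometric mean would instead give

$$M_{\mathrm{node},1}\,{}^qD = \begin{cases} 4(1-\lambda) & \text{if } \lambda < \tfrac{1}{2} \\ 2 & \text{otherwise.} \end{cases}$$

Hence ${}^qD^N := M_{\mathrm{node},0}\,{}^qD \ne M_{\mathrm{node},1}\,{}^qD$ for all $q \ge 0$ and all $\lambda$ with $0 < \lambda < \tfrac{1}{2}$ (Figure 3b, dark blue and pale blue curves). As $q \to \infty$, ${}^q\bar{D} \to 2$ (Figure 3c, dashed curve), while ${}^qD^L$ remains constant (Figure 3c, solid line).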
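The dependence of the prior index on the power-mean exponent can be seen in a few lines. The sketch below assumes the four-leaf tree of this example, whose two transverse intervals have within-interval diversities 2 and 4 and weights λ and 1-λ; the helper names are hypothetical.

```python
import math

def power_mean(values, weights, r):
    """Weighted power mean of exponent r; r = 0 gives the weighted geometric mean."""
    total = sum(weights)
    if r == 0:
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)
    return (sum(w * v ** r for v, w in zip(values, weights)) / total) ** (1.0 / r)

def prior_diversity(lam, q):
    """Index of Chao et al. for the four-leaf tree: power mean of exponent 1 - q."""
    return power_mean([2.0, 4.0], [lam, 1.0 - lam], 1.0 - q)

def longitudinal_diversity(lam):
    """Longitudinal mean diversity: always the geometric mean, whatever q is."""
    return power_mean([2.0, 4.0], [lam, 1.0 - lam], 0)

for q in (0, 1, 2, 10, 60):
    print(q, prior_diversity(0.5, q), longitudinal_diversity(0.5))
```

The within-interval values (2 and 4) never change with q, yet the prior index drifts from 3 towards 2 as q grows, while the geometric-mean index stays at 2√2.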

${}^qJ^N$ improves on all prior tree balance and imbalance indices

As previously explained (Lemant et al., 2022) and as summarized in Tables 1 and 3, conventional tree balance and imbalance indices including Sackin’s index, Colless’ index, the total cophenetic index, and others (reviewed by Fischer et al. (2023)) have important shortcomings. In the first place, these indices account for neither node sizes nor branch lengths. This means, for example, that these indices consider all star trees maximally balanced and all caterpillar trees maximally imbalanced, even as the relative sizes of some nodes or the relative lengths of some branches approach zero (Figure 5, green lines). The tree balance index Jq defined by Lemant et al. (2022) varies continuously with changing node sizes but is independent of branch lengths (Figure 5, dashed purple curves). JqN improves on Jq by also varying continuously with branch lengths (Figure 5, solid purple curves).

Fig. 5.


Values of three tree balance indices for a tree undergoing continuous changes. J1 is the index introduced by Lemant et al. (2022), which is equal to J1N in the central third of the plot. IS,norm is the normalized Sackin index, which is undefined for the leftmost, linear tree. We plot 1-IS,norm for fair comparison because IS,norm is an imbalance index whereas J1 and J1N are balance indices. The normalized Colless index is equal to IS,norm in the rightmost third of the plot and is otherwise undefined. The normalized total cophenetic index is equal to IS,norm throughout the plot.

Lemant et al. (2022) further showed that, even when restricted to the tree types on which conventional tree balance indices are defined, and even when all node sizes are equal, Jq enables a more meaningful comparison of trees with different degree distributions or different numbers of leaves. For example, when applied to leafy caterpillar trees with uniform branch lengths and uniform node sizes, Jq considers long trees (those with many leaves) to be less balanced than short ones, whereas conventional indices consider them equally imbalanced. JqN, as an extension of Jq, shares this useful property.

Inequalities between indices

Choosing self-consistent definitions ensures that our diversity indices are related by simple sets of inequalities, which formalize and generalize the results of previous sections (Figure 6a). Hill (1973) showed that ${}^qD \ge {}^rD$ for all $r \ge q \ge 0$. Because ${}^qD^L$ and ${}^qD^N$ are geometric weighted means of ${}^qD$ values with weights independent of $q$, it follows that they obey corresponding inequalities:

Fig. 6.


a) Inequalities between diversity indices for all q0 and all rq. b) Examples of leafy trees with uniform branch lengths for which various index values are equal for all q,r0. The top left corner of each panel contains a grid, whose twelve squares correspond to the twelve indices shown in the key. A line connecting two grid squares indicates that the corresponding indices are equal for the tree shown in the panel. Instances where evenness indices are equal to 1 are indicated in the third grid column. c) A tree for which JqL>Jq and JqLJqN.

Property 0.1

For all rooted trees, ${}^qD^L \ge {}^rD^L$, ${}^qD^N \ge {}^rD^N$ and ${}^qD^S \ge {}^rD^S$ for all $r \ge q \ge 0$.

Additional inequalities exist between different types of diversity index but not among the evenness indices:

Proposition 1

For all rooted trees, ${}^0D \ge {}^qD^L$ for all $q \ge 0$. For all leafy ultrametric trees, but not for all rooted trees, ${}^qD \ge {}^qD^L$ for all $q \ge 0$.

Proposition 2

For all rooted trees, ${}^qD^L \ge {}^qD^N$ for all $q \ge 0$.

Proposition 3

For q>0, no single ordering of Jq, JqL and JqN applies to all leafy ultrametric trees.

Proofs of these three propositions can be found in the Appendix. Informally, the reason why the second inequality in Proposition 1 applies only to leafy ultrametric trees is that DqL, unlike Dq, is independent of the size of the root node (and any node arbitrarily close to the root).

Special cases

Our consistent definitions further yield numerous simple equations that unite our indices in special cases. To simplify the statement of these results, we will assume that all branch sizes are greater than zero. This assumption implies no loss of generality because our index definitions are invariant to the addition or removal of subtrees containing only zero-sized branches (which in an evolutionary tree correspond to extinct lineages). The properties in this section hold for all $q, r \ge 0$.

We begin with cases in which diversities based on the same type of average but with different q values are equal. These first four properties, which are illustrated by simple examples in the top row of Figure 6b, follow immediately from the definitions.

Property 0.2

${}^qD = {}^rD \iff {}^qJ = 1$ if and only if all counted nodes have equal size.

Property 0.3

${}^qD^L = {}^rD^L \iff {}^qJ^L = 1$ if and only if the branch sizes at every depth are equal. This also implies ${}^qD^N = {}^rD^N$ and ${}^qJ^N = 1$.

Property 0.4

${}^qD^N = {}^rD^N \iff {}^qJ^N = 1$ if and only if every internal node’s child branches have equal size.

Property 0.5

(${}^qD = {}^rD$ and ${}^qD^L = {}^rD^L$) $\iff$ (${}^qJ = 1$ and ${}^qJ^N = 1$) if and only if the branch sizes at every depth are equal and all node sizes are equal. This implies that the tree is ultrametric and perfectly symmetric, and that ${}^qD^N = {}^rD^N$ and ${}^qJ^N = 1$.

In other special cases, we find equality among diversities of different types but with equal q values. Again, these properties are directly implied by the definitions. Simple examples are shown in the middle row of Figure 6b.

Property 0.6

DqL=Dq if and only if the tree is a leafy ultrametric tree in which no non-root node has outdegree greater than 1. This also implies JqL=Jq.

Property 0.7

DqN=DqL if and only if the tree is a piecewise star tree. This also implies JqN=JqL.

Property 0.8

DqS=DqN=DqL if and only if the tree is a star tree. This also implies JqS=JqN=JqL.

Property 0.9

DqS=DqN=DqL=Dq if and only if the tree is a leafy ultrametric star tree. This also implies JqS=JqN=JqL=Jq.

It follows that equality both within and between types applies under more restrictive conditions, as illustrated in the bottom row of Figure 6b:

Property 0.10

DqL=Dr if and only if the tree is a leafy ultrametric tree with equally sized leaves in which only the root has outdegree greater than 1. This also implies DqN=DrN, DqS=DrS and JqS=JqN=JqL=Jq=1.

Property 0.11

DqN=DrL if and only if the tree is a piecewise star tree with equal branch sizes at every depth. This also implies JqN=JqL=1.

Property 0.12

DqS=DqN=DqL if and only if the tree is a star tree with equally sized leaves. This also implies JqS=JqN=JqL=1.

Property 0.13

DqS=DqN=DqL=Dr if and only if the tree is a leafy ultrametric star tree with equally sized leaves. This also implies JqS=JqN=JqL=Jq=1.

In yet another set of special cases, the evenness formulas simplify to ratios. The following two results are immediate consequences of D0L or D0N being constant under the specified conditions.

Property 0.14

If the branch count across the tree is constant and greater than one then

$${}^qJ^L = \frac{\log {}^qD^L}{\log {}^0D^L}.$$

Property 0.15

If the tree has uniform outdegree greater than one and the branches present at every depth in the tree have equal lengths then

$${}^qJ^N = \frac{\log {}^qD^N}{\log {}^0D^N}.$$

All properties described in this section would also hold if we were to define all our richness and diversity indices as weighted arithmetic, rather than geometric, means of interval or node values (or indeed any other weighted power mean). Our preference for geometric means will be justified in the next section.

The leafy tree identity

For an important class of trees, our index definitions lead to a surprisingly simple, fundamental connection between tree balance, Shannon’s diversity index, Sackin’s index, and outdegree. This result is less obvious than the properties of the previous section and requires a more substantial proof. We term this unifying relationship the leafy tree identity.

Lemma 0.7

If the tree is leafy and all branches have equal length l>0 then

$$\log {}^1D^N = \frac{{}^1H\,l}{\bar{h}}.$$

If additionally all $n$ leaves have equal size then

$$\log {}^1D^N = \frac{n \log n}{I_S},$$

where IS is Sackin’s index.

Proof.

The proof is identical to the proof of Proposition 6 in Lemant et al. (2022), except for the base of the logarithms and the additional factor l.

Proposition 4

(The leafy tree identity; generalization of Proposition 6 in Lemant et al. (2022)) If the tree is leafy and has uniform branch lengths and all internal nodes have outdegree m>1 then

$${}^1J^N = \frac{{}^1H\,l}{\bar{h} \log m}. \tag{0.14}$$

If additionally all $n$ leaves have equal size then

$${}^1J^N = \frac{n \log_m n}{I_S}. \tag{0.15}$$

Proof.

The result follows immediately from Lemma 0.7 and Property 0.15.

The leafy tree identity implies that, among leafy trees with uniform branch lengths and uniform outdegrees, tree balance depends only on node sizes and node depths. If two such trees have equal effective heights relative to branch length (h/l), equal outdegrees (m), and equal node size Shannon entropy values (1H) then they must have equal balance J1N, irrespective of topology and number of leaves. For example, Figure 7a and 7b show a pair of bifurcating leafy ultrametric trees with uniform leaf sizes and uniform branch lengths. Because these trees have equal outdegrees, leaf counts, and Sackin’s index values, the special form of the leafy tree identity (Equation 0.15) implies they must be equally balanced (other equal index values are recorded in Figure 7f). The following example applies the more general form of the leafy tree identity (Equation 0.14) to trees that are less obviously similar.

Fig. 7.


a-b) Two leafy bifurcating trees with uniform node sizes and uniform branch lengths, which differ in topology but are equally balanced. c-e) Three leafy bifurcating trees with uniform branch lengths, which differ in topology and number of leaves but are equally balanced. Nodes are labelled with their sizes. f) Table recording where pairs of trees have equal or unequal index values. Parameter q can take any non-negative value.

Example 0.8

Consider the bifurcating leafy ultrametric tree with four leaves, uniform branch lengths, and leaf sizes $\tfrac{3}{8}$, $\tfrac{1}{8}$, $\tfrac{1}{4}$ and $\tfrac{1}{4}$ (Figure 7c). Now suppose we retain the leaf sizes but rearrange the nodes and branches to form a caterpillar tree with the node of size $\tfrac{3}{8}$ at depth $l$ and one of the nodes of size $\tfrac{1}{4}$ at depth $2l$ (Figure 7d). Finally, consider a six-leaf caterpillar tree with uniform branch lengths and proportional leaf sizes (in order of increasing depth) $\tfrac{1}{2}$, $\tfrac{1}{4}$, $x$, $y$, $p$ and $p$, with $p \approx 0.026606$ (Figure 7e). All three trees have identical values of $m$, $\bar{h}$ and ${}^1H$ (see Appendix for derivation). Hence the leafy tree identity implies that they have equal ${}^1D^N$ and ${}^1J^N$ values. All three trees also have equal values of ${}^1D = \exp {}^1H \approx 3.75$ and ${}^0D^N = m = 2$. Other index values shared by pairs of trees are indicated in Figure 7f.
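A quick numerical check of Example 0.8 follows. It is a sketch under several assumptions that are not stated in this section: branch length is set to l = 1; the effective height of a leafy tree is taken to be the size-weighted mean leaf depth; the six-leaf caterpillar has leaf depths 1, 2, 3, 4, 5, 5; and x = 2p, y = 1/4 - 4p, which is what we obtain by requiring the sizes to sum to one and the effective height to equal 2 (the Appendix derivation is not reproduced here). All names are hypothetical.

```python
import math

def shannon_entropy(sizes):
    """Shannon entropy of proportional leaf sizes."""
    return -sum(s * math.log(s) for s in sizes if s > 0)

def effective_height(sizes, depths):
    """Size-weighted mean leaf depth (assumed effective height of a leafy tree)."""
    return sum(s * d for s, d in zip(sizes, depths))

def balance_from_identity(sizes, depths, m=2, branch_length=1.0):
    """Right-hand side of Equation 0.14 for a leafy tree with uniform outdegree m."""
    return (shannon_entropy(sizes) * branch_length
            / (effective_height(sizes, depths) * math.log(m)))

p = 0.026606
trees = {
    "balanced (Fig. 7c)": ([3/8, 1/8, 1/4, 1/4], [2, 2, 2, 2]),
    "caterpillar (Fig. 7d)": ([3/8, 1/4, 1/8, 1/4], [1, 2, 3, 3]),
    "six-leaf caterpillar (Fig. 7e)": ([1/2, 1/4, 2 * p, 1/4 - 4 * p, p, p],
                                       [1, 2, 3, 4, 5, 5]),
}
for name, (sizes, depths) in trees.items():
    print(name, effective_height(sizes, depths),
          math.exp(shannon_entropy(sizes)), balance_from_identity(sizes, depths))
```

Under these assumptions the three trees agree on effective height and entropy, and therefore on the balance value given by the identity, up to the rounding of p.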

Equation 0.15 is especially useful because the numerator $n \log_m n$ is the minimum value that $I_S$ can attain on leafy $n$-leaf trees with uniform branch lengths, uniform node sizes, and uniform outdegree $m > 1$. Hence $n \log_m n / I_S$ lies between 0 and 1 and is equal to 1 if and only if the tree is fully balanced. We previously showed (Proposition 7 in Lemant et al. (2022)) that, among all node-wise arithmetic mean indices with $w_i = n_i$, ${}^1J^N$ is the only index that satisfies Equation 0.15. Our previous proof can be straightforwardly generalized to show that Equation 0.15 cannot hold for any index of the form $M_{\mathrm{node},r}\,{}^qJ$ with $r \ne 1$ or $q \ne 1$. Therefore ${}^1J^N$ is the only tree balance index for which this useful, unifying identity holds.
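Equation 0.15 is simple enough to evaluate by hand, but a short sketch makes the normalization explicit. It assumes leafy trees with equal leaf sizes, uniform branch lengths and uniform outdegree m, with leaf depths measured in units of branch length; the helper names are hypothetical.

```python
import math

def caterpillar_leaf_depths(n):
    """Leaf depths of a bifurcating caterpillar tree with n >= 2 leaves."""
    return list(range(1, n)) + [n - 1]

def sackin_balance(leaf_depths, m=2):
    """n * log_m(n) divided by Sackin's index (the sum of leaf depths)."""
    n = len(leaf_depths)
    sackin_index = sum(leaf_depths)
    return n * math.log(n, m) / sackin_index

print(sackin_balance([2, 2, 2, 2]))                 # fully balanced four-leaf tree
print(sackin_balance(caterpillar_leaf_depths(4)))   # four-leaf caterpillar
print(sackin_balance(caterpillar_leaf_depths(8)))   # eight-leaf caterpillar
```

The fully balanced tree attains the maximum value of 1, and the value for caterpillar trees falls as the number of leaves grows, in line with the earlier remark that long caterpillar trees are considered less balanced than short ones.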

An example cross-disciplinary application

We illustrate the universality of our methods by using them to compare the shapes of two trees from different fields of research, representing dissimilar processes and constructed using different methods. The first of these trees depicts the evolution of the Human Immunodeficiency Virus (HIV) within a host, as inferred from molecular data and as used in another recent study of tree shape indices (Barzilai and Schrago, 2023). The second tree represents the diversification of the Uralic language family (Honkola et al., 2013). To simplify the exposition we assign size zero to all internal nodes and an equal size to all leaves.

If we disregard the inferred branch lengths then it is difficult by eye to assess which tree is the more diverse or more balanced (Figure 8a, b). These apparent similarities are borne out in the shape index values (Figure 8c, d). Excepting one node, both trees are bifurcating and therefore both have ${}^0D^N \approx 2$. The two trees have similar branch counts in total (${}^0D^S = 33$ and $32$) and at each depth ($3 < {}^0D^L < 4$). The ${}^1D^N$, ${}^1D^S$ and ${}^1D^L$ values are somewhat lower than the corresponding richness values due to imbalances, as captured by our evenness indices, which are likewise similar for the two trees (${}^1J^N$ and ${}^1J^L$ between 0.7 and 0.8; ${}^1J^S \approx 0.86$). Lemma 0.7 further implies similar $I_S$ values (93 and 97).

Fig. 8.


a-b) Trees with equalized branch lengths representing the within-host evolution of HIV (a) and the evolutionary history of the Uralic languages (b). c) Diversity index values for the two trees with equalized branch lengths. d) Evenness index values for the two trees with equalized branch lengths. e-f) The same trees but with the originally inferred branch lengths. g) Diversity index values, accounting for branch lengths. h) Evenness index values, accounting for branch lengths. In all cases, leaves are assigned equal size and internal nodes are assigned size zero. The HIV tree was sourced from the GitHub repository associated with Barzilai and Schrago (2023) (file PIC38051.tre) and the languages tree from the D-PLACE database (Kirby et al., 2016) (folder honkola et al2013).

When we restore the inferred branch lengths, the two trees no longer look alike (Figure 8e, f). The HIV phylogeny approximates a non-ultrametric star tree, with long branches originating close to the root. The average effective out-degree of the HIV tree, accounting for unequal branch lengths, is substantially higher than two (${}^0D^N \approx 9$); the effective number of branches is three times lower than when branch lengths are ignored (${}^0D^S \approx 11$); and there are more than twice as many parallel branches (${}^0D^L \approx 10$). Because the HIV tree is approximately a star tree with equal node sizes, all its diversity indices are approximately equal and all its evenness indices are close to one (Property 0.12). In the case of the languages tree, accounting for the inferred branch lengths, which are approximately exponentially distributed and not nearly so depth-dependent, has only a small effect on most index values. The diversity indices for the languages tree remain far from equal. Altogether our indices thus show that the HIV tree is much bushier, has a larger number of effective types, and is in every sense more balanced than the languages tree (Figure 8g, h).

In summary, the clear differences between these two trees, implying different modes of evolution, are captured only by indices that account for their different branch length distributions. An analysis based on prior tree balance indices, which ignore branch lengths, would incorrectly conclude that the trees have very similar shapes and plausibly resulted from similar processes.

An example application to model-generated trees

As a final demonstration of the potential for our indices to distinguish trees generated by different processes, we reanalyse results of a recent computational modelling study of tumour evolution by Lewinsohn et al. (2023). The original study sought to infer differences between the shapes of evolutionary trees corresponding to alternative modes of tumour expansion – boundary-driven growth (BDG) versus unrestricted growth. On average, the BDG model was found to generate ultrametric time trees with higher variance in their terminal branch lengths, and non-ultrametric gene trees with higher variance in their leaf depths (mutations per cell).

To see how our tree shape indices vary with simulated tumour growth mode, we consider the two representative simulated tumours from Figure 1 of Lewinsohn et al. (2023). The time trees (Figure 9a-c) have the same number of leaves and almost identical effective numbers of non-root nodes (${}^0D^S \approx 117$ and $118$). However, the BDG time tree has 22% higher effective branch count (${}^1D^S \approx 65$ versus $53$), 26% higher branch count across the tree (${}^0D^L \approx 28$ versus $22$), and 25% higher leaf diversity (${}^1D^L \approx 21$ versus $17$).

Fig. 9.


a-b) Time trees generated by computational models of tumour evolution with boundary-driven growth (a) or unrestricted growth (b). Leaves represent extant cells and branch lengths are proportional to time elapsed between cell division events. c) Tree shape index ratios for the two time trees. d-e) Gene trees generated by the same simulations as the time trees. Leaves represent extant cells and branch lengths are proportional to genetic distances. f) Tree shape index ratios for the two gene trees. All tree data was obtained from the GitHub repository associated with Lewinsohn et al. (2023).

The gene trees (Figure 9d-f) likewise have the same number of leaves and almost identical effective numbers of non-root nodes (${}^0D^S \approx 136$ in both cases). But the BDG gene tree, being less star-like, has substantially lower average effective outdegree (${}^0D^N \approx 2.6$ versus $3.0$; ${}^1D^N \approx 2.1$ versus $2.6$), 20% fewer branches across the tree (${}^0D^L \approx 17$ versus $21$), and 26% lower leaf diversity (${}^1D^L \approx 11$ versus $15$). The BDG gene tree is also less balanced (${}^1J^N \approx 0.76$ versus $0.84$).

Whereas well chosen problem-specific indices might give greater statistical power for distinguishing particular tree types, an advantage of our multi-dimensional system is that it is designed to be universally applicable, to facilitate comparisons between studies and data sets. Leaf depth variance, for instance, cannot by itself tell apart ultrametric trees, while terminal branch length variance is inapplicable to trees with uniform (or unknown) branch lengths.

Discussion

The seminal paper of Hill (1973) cautions that there is “almost unlimited scope for mathematical generality in relation to measures of diversity and taxonomic difference” and therefore “Simple and well-understood indices should be used”. In accordance with this advice, here we have constructed new tree shape indices as weighted means of the most standard, basic diversity and evenness indices. This systematic approach ensures that all our indices are not only robust and universally applicable but also have simple, consistent interpretations and clear interrelationships.

Some of the indices we have defined here are refinements of prior approaches to assessing tree shape. Our ${}^qD^L$ and $\tilde{H}_P$ are similar to the ${}^q\bar{D}$ of Chao et al. (2010) and the phylogenetic entropy $H_P$ of Allen et al. (2009), respectively, but are more self-consistent and can be meaningfully applied to non-ultrametric trees. ${}^qJ^N$ builds on the ideas of Lemant et al. (2022) but generalizes the concept of tree balance to a wider class of trees by accounting for branch lengths, a key advantage of prior phylogenetic diversity and phylogenetic entropy indices that is not shared by any prior tree balance indices. These new indices share all the desirable properties but not the shortcomings of their predecessors and can therefore universally supersede them (Table 3). For the remainder of our indices describing average effective out-degree, effective numbers of nodes and branches, and evenness of branch sizes, we know of no precedents. In combination, our indices provide a more sophisticated, general, multidimensional description of tree shape than has previously been possible.

Whereas we have focussed on a system built around ${}^qD$ and ${}^qJ$, it is easy to use our general definitions of the longitudinal, node-wise, and star means to quantify other aspects of tree shape. A parallel, self-consistent system of indices can be defined by setting $w_i = h_i$ instead of $w_i = S_i h_i$ in Equations 0.3 and 0.4, and setting $w_k = 1$ instead of $w_k = S_{C_k}$ in Equations 0.8 and 0.9. These indices, which are robust to small changes in branch lengths but not node sizes, are normalized by dividing by $h$ instead of $\bar{h}$. Alternatively, ${}^qD$ can be replaced by another basic diversity index, or ${}^qJ$ by another evenness index, such as the ratio ${}^qD/{}^0D$ preferred by Hill (1973) (see also Smith and Wilson (1996); Jost (2010); Tuomisto (2012)). Based on the means, we can also straightforwardly derive expressions for higher moments to obtain indices that, for example, quantify how much effective out-degree varies across all nodes or varies with node depth.

There are nevertheless several reasons for preferring our specific definitions. First, the foundational Dq and Jq are the most popular diversity and evenness indices among biologists (Tucker et al., 2017; Tuomisto, 2012). Second, defining entropy and evenness indices as weighted arithmetic means, and diversity indices as weighted geometric means, results in relatively simple expressions, especially in the case of leafy ultrametric trees. Third, J1N is the only universal tree balance index for which the unifying leafy tree identity holds. In summary, we have taken the best of the existing indices, improved them, unified them, and filled in the gaps to create a coherent system (Table 2).

Given the ubiquity of tree structures, we expect our multidimensional method of describing tree shape to empower research and inform decision making in diverse domains. Our initial development of universal, robust indices was motivated by the need to compare and categorize non-leafy, non-ultrametric trees representing the clonal evolution of human tumours, where node sizes (corresponding to cell subpopulation sizes) and branch lengths (genetic distances) convey valuable information (Noble et al., 2022). Tree structures with node sizes and branch lengths are likewise centrally important in community ecology, conservation biology, systematic biology, and the study of microbial evolution. For instance, our indices can be used instead of conventional tree balance indices to evaluate alternative models of speciation, or to investigate how the mode of evolution of a pathogenic virus varies with geographical location, time period, or strain. In place of phylogenetic diversity and phylogenetic entropy, our non-normalized diversity indices could be used to inform policy making by quantifying how different actions would affect biodiversity. Beyond biology, obvious subjects for analysis include phylogenetic trees of language evolution, hierarchical organizational structures, and the tree data structures that abound in computing. As we have illustrated, our generic indices can be used not only within but also across domains to uncover similarities and differences in, say, the evolution of organisms, languages, and technologies.

One key topic for further theoretical research is to derive the expected values and covariances of our indices under standard tree generation models, such as the uniform model and the Yule process, for comparison with empirical data. Relationships between our indices and distance-based metrics such as the mean pairwise distance (which lacks a universal normalization (Tsirogiannis et al., 2012)) also remain to be examined. In the same vein as Figure 7, we are investigating sets of distinct trees to which our indices assign equal values, to determine whether additional indices might ever be needed to distinguish between trees in typical applications. Towards establishing a universal standard for describing tree shape, we are developing software packages for calculating index values that can be integrated with popular tree inference methods. Just as the first step in analysing a set of measurements is to calculate the mean and variance, so we propose that, whenever one encounters a rooted tree, a useful first step will be to describe its shape by evaluating our indices.

Acknowledgements

We are grateful to Kerry Manson for helpful comments on an earlier draft of this manuscript, to Lucia Barzilai and Chiara Barbieri for helping us obtain suitable empirical tree data, and to anonymous reviewers for suggesting various improvements.

Funding

This work was supported by the National Cancer Institute at the National Institutes of Health (grant number U54CA217376) to RN. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

APPENDIX

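Derivation of ${}^qD_N$ in Example 0.3

For the root $r$, we have $h_{C_r} = \lambda$ and $A_r = \{r\}$. For either of the other internal nodes $k$, we have $h_{C_k} = (1-\lambda)/2$ and $A_k = \{k, r\}$. The subtree weights are

$$S_{C_r}(x) = \begin{cases} 1 & \text{if } 0 \le x < \lambda, \\ 0 & \text{otherwise,} \end{cases} \qquad S_{C_k}(x) = \begin{cases} \frac{1}{2} & \text{if } \lambda \le x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

The ancestor weights are

$$v_{rr} = \int_0^{\infty} S_{C_r}(x)\,dx = \int_0^{\lambda} 1\,dx = \lambda,$$

$$v_{kk} = \int_{\lambda}^{2\lambda} S_{C_k}(x)\,dx = \begin{cases} \int_{\lambda}^{2\lambda} \frac{1}{2}\,dx = \frac{\lambda}{2} & \text{if } \lambda < \frac{1}{2}, \\ \int_{\lambda}^{1} \frac{1}{2}\,dx = \frac{1-\lambda}{2} & \text{otherwise,} \end{cases}$$

$$v_{rk} = \int_{2\lambda}^{\infty} S_{C_k}(x)\,dx = \begin{cases} \int_{2\lambda}^{1} \frac{1}{2}\,dx = \frac{1-2\lambda}{2} & \text{if } \lambda < \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

The node diversity values are, for all $q \ge 0$,

$${}^qD(P_r(x)) = \begin{cases} 2 & \text{if } 0 \le x < \lambda, \\ 4 & \text{if } \lambda \le x < 1, \\ 0 & \text{otherwise,} \end{cases} \qquad {}^qD(P_k(x)) = \begin{cases} 2 & \text{if } \lambda \le x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Hence

$$\int_0^h S_{C_r}(x)\,{}^qH(P_r(x))\,dx = \int_0^{\lambda} 1 \cdot \log 2\,dx = \lambda \log 2,$$

$$\int_0^h S_{C_k}(x)\,{}^qH(P_k(x))\,dx = \int_{\lambda}^{1} \frac{1}{2} \log 2\,dx = \frac{(1-\lambda)\log 2}{2},$$

$$\int_0^h S_{C_k}(x)\,{}^qH(P_r(x))\,dx = \int_{\lambda}^{1} \frac{1}{2} \log 4\,dx = (1-\lambda)\log 2.$$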

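For all $q \ge 0$, it follows from Equation 0.8 that if $\lambda \ge \frac{1}{2}$ then

$${}^qD_N = \exp\left[\frac{1}{\lambda}\,\lambda\,(\lambda \log 2) + 2 \times \frac{1}{(1-\lambda)/2} \cdot \frac{1-\lambda}{2} \cdot \frac{(1-\lambda)\log 2}{2}\right] = 2,$$

and otherwise

$${}^qD_N = \exp\left[\frac{1}{\lambda}\,\lambda\,(\lambda \log 2) + 2 \times \frac{1}{(1-\lambda)/2}\left(\frac{\lambda}{2} \cdot \frac{(1-\lambda)\log 2}{2} + \frac{1-2\lambda}{2}\,(1-\lambda)\log 2\right)\right] = 4^{1-\lambda}.$$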
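The two closed forms can be checked numerically. The sketch below is a minimal Python illustration that plugs the subtree heights, ancestor weights, and integrals listed in this derivation into the exponent of Equation 0.8. It assumes, as in the displayed expressions, that each node's terms are divided by $h_{C_r} = \lambda$ or $h_{C_k} = (1-\lambda)/2$ and that the overall normalizing height is 1. The function name is hypothetical.

```python
import math

def dqn_example(lam):
    """Node-wise diversity for the tree of Example 0.3, for 0 < lam < 1."""
    log2 = math.log(2.0)
    # Integrals of S_C(x) * qH(P_j(x)) taken from the derivation
    int_rr = lam * log2                 # root's child branches, evaluated at the root
    int_kk = (1.0 - lam) * log2 / 2.0   # k's child branches, evaluated at k
    int_rk = (1.0 - lam) * log2         # k's child branches, evaluated at the root
    # Ancestor weights, which depend on whether lam < 1/2
    v_rr = lam
    if lam < 0.5:
        v_kk = lam / 2.0
        v_rk = (1.0 - 2.0 * lam) / 2.0
    else:
        v_kk = (1.0 - lam) / 2.0
        v_rk = 0.0
    u_r = lam                           # h_{C_r}
    u_k = (1.0 - lam) / 2.0             # h_{C_k}, equal to v_kk + v_rk in both cases
    # One root term plus two identical terms for the other internal nodes
    exponent = v_rr * int_rr / u_r + 2.0 * (v_kk * int_kk + v_rk * int_rk) / u_k
    return math.exp(exponent)

for lam in (0.1, 0.3, 0.5, 0.8):
    print(lam, dqn_example(lam), max(2.0, 4.0 ** (1.0 - lam)))
```

For every value of lam in the loop the two printed numbers agree, since $4^{1-\lambda} \ge 2$ exactly when $\lambda \le \frac{1}{2}$.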

Proof of Proposition 1

Proof.

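For $q \ge 0$, let $k(q) \in I(T)$ be such that ${}^qD(P_{k(q)}) \ge {}^qD(P_i)$ for all $i \in I(T)$, and let $b_1, \dots, b_{|P_{k(q)}|}$ denote the non-zero-sized branches in the interval $k(q)$. Then, by a basic property of generalized means,

$${}^qD_L(T) := M_{\mathrm{long},0}\left({}^qD\right)(T) \le \lim_{r \to \infty} M_{\mathrm{long},r}\left({}^qD\right)(T) = \max_{i \in I(T)} {}^qD(P_i) = {}^qD(P_{k(q)}).$$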

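For the first part of the proposition, we note that for any interval $i \in I(T)$, the number of non-zero-sized branches in $i$ is $|P_i|$, which cannot exceed the number of counted nodes ${}^0D(T)$. Hence, for any rooted tree $T$ and all $q \ge 0$,

$${}^qD_L(T) \le {}^qD(P_{k(q)}) := \left(\sum_{b \in B_{k(q)}} p_b^q\right)^{\frac{1}{1-q}} = \left(\sum_{j=1}^{|P_{k(q)}|} p_{b_j}^q\right)^{\frac{1}{1-q}} \le |P_{k(q)}| \le {}^0D(T).$$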

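We now turn to the second part. By definition, for all $i \in I$, if $b \in B_i$ then branch size $s_b = \sum_{x \in V_b} f_x$, where $V_b$ is the set of all nodes that descend from $b$, and $f_x$ is the proportional size of node $x$. For all rooted trees we have $V_{b_1} \cap V_{b_2} = \emptyset$ for all $b_1, b_2 \in B_i$ with $b_1 \ne b_2$. For ultrametric trees, $\bigcup_{b \in B_i} V_b = L$, where $L$ is the set of all leaves in the tree. For leafy ultrametric trees, $S_i = 1$ for all $i$ and hence $p_b = s_b$ for all $b \in B_i$. Then for any leafy ultrametric tree $T$ and all $q \ge 0$,

$${}^qD_L(T) \le {}^qD(P_{k(q)}) := \left(\sum_{b \in B_{k(q)}} p_b^q\right)^{\frac{1}{1-q}} = \left(\sum_{b \in B_{k(q)}} s_b^q\right)^{\frac{1}{1-q}} = \left(\sum_{b \in B_{k(q)}} \left(\sum_{x \in V_b} f_x\right)^q\right)^{\frac{1}{1-q}} \le \left(\sum_{b \in B_{k(q)}} \sum_{x \in V_b} f_x^q\right)^{\frac{1}{1-q}} = \left(\sum_{x \in L(T)} f_x^q\right)^{\frac{1}{1-q}} = {}^qD(T).$$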

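Finally we will prove that this inequality does not hold for all rooted trees. We will do so in a more general context to show that the result is independent of our choice of weight function $w$ and exponent $r$. Let ${}^qD_{L,r}^{w} = M_{\mathrm{long},r}\left({}^qD; w\right)$ for real $r$, where $w$ is a continuous, monotonically increasing function of $S_i$ and $h_i$ such that $w_i = w(S_i, h_i) > 0$ when $S_i > 0$ or $h_i > 0$, and $w_i \to 0$ as $S_i \to 0$ or $h_i \to 0$. First consider the leafy but non-ultrametric three-leaf star tree $T_1$ in which one leaf has size $1-p$ and depth $\lambda$, and the other two leaves have size $\frac{p}{2}$ and depth $1+\lambda$ (as in Figure 4 but with one more leaf). Now

$${}^qD_{L,r}^{w}(T_1) = \left(\frac{\sum_{i \in I(T_1)} w_i\,{}^qD(P_i)^r}{\sum_{i \in I(T_1)} w_i}\right)^{\frac{1}{r}} = \left(\frac{w_1\left((1-p)^q + 2\left(\frac{p}{2}\right)^q\right)^{\frac{r}{1-q}} + w_2\left(2\left(\frac{1}{2}\right)^q\right)^{\frac{r}{1-q}}}{w_1 + w_2}\right)^{\frac{1}{r}}.$$

Since $w_1$ depends only on $\lambda$ and $w_2$ depends only on $p$, we can make $\lambda$ a function of $p$ such that $\frac{w_1}{w_2} \to 0$ as $\lambda \to 0$ and $p \to 0$, in which case ${}^qD_{L,r}^{w}(T_1) \to 2$ as $\lambda \to 0$ and $p \to 0$. Also, for all $q > 0$, ${}^qD(T_1) \to 1$ as $p \to 0$. Hence ${}^qD_{L,r}^{w}(T_1) > {}^qD(T_1)$ as $\lambda \to 0$ and $p \to 0$. Instead setting $\lambda = 0$ makes $T_1$ ultrametric but non-leafy, with ${}^qD_{L,r}^{w}(T_1) \to 2$ and ${}^qD(T_1) \to 1$ as $p \to 0$ as before.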
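The counterexample can be illustrated numerically. The following minimal Python sketch evaluates the expression above for the three-leaf star tree under stated assumptions: the weight function is taken to be $w_i = S_i h_i$ (so that $w_1 = \lambda$ and $w_2 = p$), $\lambda$ is set to $p^2$ so that $w_1/w_2 \to 0$, and the particular values $q = 2$, $r = 1$, $p = 0.01$ are arbitrary choices. The function names are hypothetical.

```python
def hill(proportions, q):
    """Hill number of order q for proportions summing to one (q != 1)."""
    return sum(x ** q for x in proportions) ** (1.0 / (1.0 - q))

def longitudinal_mean_star(p, lam, q, r):
    """Weighted power mean (exponent r != 0) of interval diversities for T_1."""
    # Interval 1: depths 0..lam, all three branches present, total size 1
    w1 = 1.0 * lam
    d1 = hill([1.0 - p, p / 2.0, p / 2.0], q)
    # Interval 2: depths lam..1+lam, only the two small branches, total size p
    w2 = p * 1.0
    d2 = hill([0.5, 0.5], q)
    return ((w1 * d1 ** r + w2 * d2 ** r) / (w1 + w2)) ** (1.0 / r)

p = 0.01
lam = p ** 2                      # makes w1 / w2 = p, which tends to zero with p
q, r = 2.0, 1.0
d_long = longitudinal_mean_star(p, lam, q, r)
d_plain = hill([1.0 - p, p / 2.0, p / 2.0], q)   # node-size diversity of T_1
print(d_long, d_plain)
```

With these values the longitudinal mean is close to 2 while the node-size diversity is close to 1, in line with the limits stated in the proof.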

Proof of Proposition 2

Proof.

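For every node $k$, every $j \in A_k$, and at every depth $x$, we have $P_j(x) \subseteq P(x)$ and so ${}^qH(P_j(x)) \le {}^qH(P(x))$ for all $q \ge 0$. Hence

$${}^qD_N := \exp\left[\frac{1}{\bar{h}} \sum_{k \in V} \sum_{j \in A_k} \frac{v_{jk}}{u_k} \int_0^h S_{C_k}(x)\,{}^qH(P_j(x))\,dx\right] \le \exp\left[\frac{1}{\bar{h}} \sum_{k \in V} \max_{j \in A_k} \int_0^h S_{C_k}(x)\,{}^qH(P_j(x))\,dx\right] \le \exp\left[\frac{1}{\bar{h}} \sum_{k \in V} \int_0^h S_{C_k}(x)\,{}^qH(P(x))\,dx\right] = \exp\left[\frac{1}{\bar{h}} \int_0^h {}^qH(P(x)) \sum_{k \in V} S_{C_k}(x)\,dx\right] = \exp\left[\frac{1}{\bar{h}} \int_0^h {}^qH(P(x))\,S(x)\,dx\right] = {}^qD_L.$$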

Proof of Proposition 3

Proof.

The top left panel of Figure 6b shows a leafy ultrametric tree for which ${}^qJ = 1$, ${}^qJ_L < 1$ and ${}^qJ_N < 1$. The third panel in the top row of Figure 6b shows a leafy ultrametric tree for which ${}^qJ < 1$, ${}^qJ_L < 1$ and ${}^qJ_N = 1$. Now consider the four-leaf, bifurcating, leafy ultrametric tree with uniform branch lengths, such that the sizes of each pair of sibling leaves are $\epsilon$ and $\frac{1}{2} - \epsilon$ (Figure 6c). As $\epsilon \to 0$, we have ${}^qJ \to \frac{1}{2}$, ${}^qJ_L \to \frac{3}{4}$ and ${}^qJ_N \to \frac{1}{2}$. The different orderings of ${}^qJ$, ${}^qJ_L$ and ${}^qJ_N$ for three trees are inconsistent with any universal ordering.
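Two of the three limits for the four-leaf tree can be reproduced with a short calculation. The Python sketch below is an illustration for $q = 1$ only, and it rests on assumptions that are not spelled out in this appendix: that the evenness of a set of proportions is its Shannon entropy divided by the logarithm of the number of items, and that the longitudinal evenness is the arithmetic mean of interval evenness values weighted by $S_i h_i$ (equal weights here, because both intervals have total size 1 and the same height). The node-wise index is not reproduced. The function names are hypothetical.

```python
import math

def shannon_evenness(proportions):
    """Shannon entropy divided by the log of the number of items (q = 1)."""
    entropy = -sum(x * math.log(x) for x in proportions if x > 0)
    return entropy / math.log(len(proportions))

def four_leaf_evenness(eps):
    """Return (J, J_L) for sibling leaf sizes eps and 1/2 - eps."""
    leaves = [eps, 0.5 - eps, eps, 0.5 - eps]
    j_plain = shannon_evenness(leaves)
    # Upper interval: two branches of size 1/2; lower interval: the four leaves
    j_upper = shannon_evenness([0.5, 0.5])
    j_lower = shannon_evenness(leaves)
    j_long = 0.5 * (j_upper + j_lower)   # equal weights S_i * h_i
    return j_plain, j_long

j_plain, j_long = four_leaf_evenness(1e-6)
print(j_plain, j_long)
```

For small eps the two printed values approach 1/2 and 3/4 respectively, matching the limits quoted for ${}^qJ$ and ${}^qJ_L$.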

Derivation of index values in Example 0.2

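Since the tree of Figure 7c is ultrametric, $\bar{h}/l = 2$, where $l$ is the branch length. The node size entropy is

$${}^1H = -\frac{3}{8}\log\frac{3}{8} - 2 \cdot \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} = \frac{5}{2}\log 2 - \frac{3}{8}\log 3 \approx 1.32.$$

The tree of Figure 7d has the same node sizes and therefore the same ${}^1H$ value as the previous tree. It also has the same relative effective height, as

$$\frac{\bar{h}}{l} = \frac{3}{8} + 2 \cdot \frac{1}{4} + 3\left(\frac{1}{8} + \frac{1}{4}\right) = 2.$$

For the tree of Figure 7e, since proportional node sizes must sum to unity we have

$$1 = \frac{1}{2} + \frac{1}{4} + x + y + 2p \implies x + y + 2p = \frac{1}{4}.$$

For this tree to have the same $\bar{h}$ and ${}^1H$ values as the four-leaf trees we additionally require

$$2 = \frac{\bar{h}}{l} = \frac{1}{2} + 2 \cdot \frac{1}{4} + 3x + 4y + 5(2p) \implies 3x + 4y + 10p = 1,$$

$$\frac{5}{2}\log 2 - \frac{3}{8}\log 3 = {}^1H = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - x\log x - y\log y - 2p\log p.$$

The first two equations together imply $x = 2p$ and $y = \frac{1}{4} - 4p$. After substituting these results into the third equation we obtain the numerical solution $p \approx 0.026606$. Since all three trees have identical values of $m$, $\bar{h}$ and ${}^1H$, the leafy tree identity implies that they have equal ${}^1D_N$ values and equal balance:

$${}^1D_N = \exp\left(\frac{{}^1H\,l}{\bar{h}}\right) = \exp\left(\frac{{}^1H}{2}\right) \approx 1.94, \qquad {}^1J_N = \frac{{}^1H\,l}{\bar{h}\log m} = \frac{{}^1H}{2\log 2} \approx 0.95.$$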
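The numerical solution for $p$ and the resulting index values can be reproduced with a few lines of code. The following minimal Python sketch solves the entropy equation by bisection after substituting $x = 2p$ and $y = \frac{1}{4} - 4p$. The bracket $[0.001, 0.04]$ is an assumption chosen to isolate the root quoted above: the left-hand side is concave in $p$, so the search is restricted to the interval on which it is increasing. The function names are hypothetical.

```python
import math

def entropy(sizes):
    """Shannon entropy (natural log) of proportional sizes."""
    return -sum(s * math.log(s) for s in sizes if s > 0)

# Leaf sizes shared by the four-leaf trees of Figure 7c and 7d
H1 = entropy([3 / 8, 1 / 4, 1 / 4, 1 / 8])

def entropy_gap(p):
    """Entropy of the six-leaf tree minus the target, with x = 2p, y = 1/4 - 4p."""
    x, y = 2.0 * p, 0.25 - 4.0 * p
    return entropy([0.5, 0.25, x, y, p, p]) - H1

def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid      # root lies in the upper half
        else:
            hi = mid                   # root lies in the lower half
    return 0.5 * (lo + hi)

p = bisect(entropy_gap, 0.001, 0.04)
x, y = 2.0 * p, 0.25 - 4.0 * p
rel_height = 0.5 * 1 + 0.25 * 2 + 3 * x + 4 * y + 5 * (2 * p)   # effective height / l
d1n = math.exp(H1 / 2.0)               # leafy tree identity with relative height 2
j1n = H1 / (2.0 * math.log(2.0))       # m = 2 for bifurcating trees
print(p, rel_height, d1n, j1n)
```

The relative effective height evaluates to 2 for any admissible $p$, because the substitutions for $x$ and $y$ already encode the size and height constraints; only the entropy equation pins down $p$.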

Data and code availability

All data sets used in this study have previously been published; the captions of Figures 8 and 9 provide precise references. An open source R package to calculate our new tree shape indices for trees in Newick, NEXUS or phylo format is at https://github.com/kimverity/RUIindices.

References

  1. Albers Susanne and Westbrook Jeffery. Self-organizing data structures. Online Algorithms: The state of the art, pages 13–51, 2005.
  2. Allen Benjamin, Kon Mark, and Bar-Yam Yaneer. A New Phylogenetic Diversity Measure Generalizing the Shannon Index and Its Application to Phyllostomid Bats. The American Naturalist, 174(2):236–243, 2009.
  3. Atkinson Quentin D. and Gray Russell D. Curious Parallels and Curious Connections—Phylogenetic Thinking in Biology and Historical Linguistics. Systematic Biology, 54(4):513–526, 2005.
  4. Barzilai Lucia P and Schrago Carlos G. Signatures of natural selection in tree topology shape of serially sampled viral phylogenies. Molecular Phylogenetics and Evolution, 183:107776, 2023.
  5. Chao Anne, Chiu Chun Huo, and Jost Lou. Phylogenetic diversity measures based on Hill numbers. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1558):3599–3609, 2010.
  6. Chindelevitch Leonid, Hayati Maryam, Poon Art FY, and Colijn Caroline. Network science inspires novel tree shape statistics. PLoS ONE, 16(12):e0259877, 2021.
  7. Colijn Caroline and Gardy Jennifer. Phylogenetic tree shapes resolve disease transmission patterns. Evolution, Medicine, and Public Health, 2014(1):96–108, 2014.
  8. Colless Donald H. Review of Phylogenetics, The Theory and Practice of Phylogenetic Systematics. Systematic Zoology, 31(1):100–104, 1982.
  9. Faith Daniel P. Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1):1–10, 1992.
  10. Fischer Mareike, Herbst Lina, Kersting Sophie, Kühn Luise, and Wicke Kristina. Tree Balance Indices: A Comprehensive Survey. Springer Nature, 2023.
  11. Hill Mark. Diversity and Evenness: A Unifying Notation and Its Consequences. Ecology, 54:427–432, 1973.
  12. Honkola Terhi, Vesakoski Outi, Korhonen Kalle, Lehtinen Jyri, Syrjänen Kaj, and Wahlberg Niklas. Cultural and climatic changes shape the evolutionary history of the Uralic languages. Journal of Evolutionary Biology, 26(6):1244–1253, 2013.
  13. Jost Lou. Entropy and diversity. Oikos, 113(2):363–375, 2006.
  14. Jost Lou. The Relation between Evenness and Diversity. Diversity, 2(2):207–232, 2010.
  15. Kirby Kathryn R, Gray Russell D, Greenhill Simon J, Jordan Fiona M, Gomes-Ng Stephanie, Bibiko Hans-Jörg, Blasi Damián E, Botero Carlos A, Bowern Claire, Ember Carol R, et al. D-PLACE: A global database of cultural, linguistic and environmental diversity. PLoS ONE, 11(7):e0158391, 2016.
  16. Leinster Tom and Cobbold Christina A. Measuring diversity: the importance of species similarity. Ecology, 93(3):477–489, 2012.
  17. Lemant Jeanne, Le Sueur Cécile, Manojlović Veselin, and Noble Robert. Robust, Universal Tree Balance Indices. Systematic Biology, 71(5):1210–1224, 2022.
  18. Leventhal Gabriel E, Kouyos Roger, Stadler Tanja, Von Wyl Viktor, Yerly Sabine, Böni Jürg, Cellerai Cristina, Klimkait Thomas, Günthard Huldrych F, and Bonhoeffer Sebastian. Inferring epidemic contact structure from phylogenetic trees. PLoS Computational Biology, 8(3):e1002413, 2012.
  19. Lewinsohn Maya A, Bedford Trevor, Müller Nicola F, and Feder Alison F. State-dependent evolutionary models reveal modes of solid tumour growth. Nature Ecology & Evolution, 7(4):581–596, 2023.
  20. Mir Arnau, Rosselló Francesc, and Rotger Lucía. A new balance index for phylogenetic trees. Mathematical Biosciences, 241(1):125–136, 2013. arXiv: 1202.1223.
  21. Mir Arnau, Rotger Lucía, and Rosselló Francesc. Sound Colless-like balance indices for multifurcating trees. PLoS ONE, 13(9):559–560, 2018.
  22. Mooers Arne O. and Heard Stephen B. Inferring Evolutionary Process from Phylogenetic Tree Shape. The Quarterly Review of Biology, 72(1):31–54, 1997.
  23. Noble Robert, Burri Dominik, Le Sueur Cécile, Lemant Jeanne, Viossat Yannick, Kather Jakob Nikolas, and Beerenwinkel Niko. Spatial structure governs the mode of tumour evolution. Nature Ecology & Evolution, 6(2):207–217, 2022.
  24. Pavoine Sandrine and Bonsall Michael B. Measuring biodiversity to explain community assembly: a unified approach. Biological Reviews, 86(4):792–812, 2011.
  25. Pielou E. C. The measurement of diversity in different types of biological collections. Journal of Theoretical Biology, 13:131–144, 1966.
  26. Purvis Andy and Agapow Paul-Michael. Phylogeny imbalance: taxonomic level matters. Systematic Biology, 51(6):844–854, 2002.
  27. Rényi Alfréd. On measures of entropy and information. Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Volume 1: Contributions to the theory of statistics, 4:547–562, 1961.
  28. Rogers James S. Response of Colless’s Tree Imbalance to Number of Terminal Taxa. Systematic Biology, 42(1):102–105, 1993.
  29. Sackin M. J. “Good” and “Bad” Phenograms. Systematic Biology, 21(2):225–226, 1972.
  30. Scott Jacob G, Maini Philip K, Anderson Alexander RA, and Fletcher Alexander G. Inferring Tumor Proliferative Organization from Phylogenetic Tree Measures in a Computational Model. Systematic Biology, 69(4):623–637, 2020.
  31. Shannon C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
  32. Shao Kwang-Tsao and Sokal Robert R. Tree Balance. Systematic Zoology, 39(3):266, 1990.
  33. Smith Benjamin and Wilson J. Bastow. A Consumer’s Guide to Evenness Indices. Oikos, 76(1):70–82, 1996.
  34. Tsirogiannis Constantinos, Sandel Brody, and Cheliotis Dimitris. Efficient computation of popular phylogenetic tree measures. In International Workshop on Algorithms in Bioinformatics, pages 30–43. Springer, 2012.
  35. Tucker Caroline M., Cadotte Marc W., Carvalho Silvia B., Davies T. Jonathan, Ferrier Simon, Fritz Susanne A., Grenyer Rich, Helmus Matthew R., Jin Lanna S., Mooers Arne O., Pavoine Sandrine, Purschke Oliver, Redding David W., Rosauer Dan F., Winter Marten, and Mazel Florent. A guide to phylogenetic metrics for conservation, community ecology and macroecology. Biological Reviews, 92(2):698–715, 2017.
  36. Tuomisto Hanna. An updated consumer’s guide to evenness and related indices. Oikos, 121(8):1203–1218, 2012.
  37. Veron Simon, Saito Victor, Padilla-García Nélida, Forest Félix, and Bertheau Yves. The use of phylogenetic diversity in conservation biology and community ecology: A common base but different approaches. Quarterly Review of Biology, 94(2):123, 2019.

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
