Abstract
In mathematical phylogenetics, the Sackin index, measuring the sum of path lengths between leaves and the root, is one of the most frequently used measures of balance for phylogenetic trees. The uniform model, in which all rooted binary labeled trees for a given set of leaf labels are assumed to be equiprobable, is one of the most frequently used models for describing a probability distribution on the set of rooted binary labeled trees. This note provides a simple new derivation of the mean value of the Sackin index of tree balance under the uniform model on rooted binary labeled trees. The new derivation suggests a simple form of the mean Sackin index in terms of the Catalan numbers, quickly enabling a verification of the asymptotic value for the mean.
Keywords: Catalan numbers, phylogenetics, tree balance
In recent years, Mathematical Biosciences has reported several studies of the properties of tree balance statistics, measures that report the extent to which nodes of an evolutionary tree are “balanced” in having subtrees with similar numbers of descendants [2, 6, 16]. For trees with a fixed number of leaves, these and related studies [5, 9, 10] have examined features such as the minima and maxima of two of the earliest and most popular tree balance statistics—the Sackin [20] and Colless [8] indices—and the means and variances of these statistics under two of the most popular probability models for evolutionary tree shapes, the Yule–Harding and uniform models.
The Sackin index sums path lengths from leaves to the root, considering all leaves [22, p. 53]. For binary trees, the Colless index sums the absolute difference between the numbers of leaves in the two subtrees descended from internal nodes, considering all internal nodes [22, p. 53]. In the Yule–Harding model for n taxa, beginning with a single species and proceeding iteratively, species are all equally likely to be the next to speciate, inducing a particular probability distribution on rooted binary labeled trees with n leaves [22, p. 43]. In the uniform model, all rooted binary labeled trees with n leaves are equiprobable [22, p. 50].
Mathematical properties of the Sackin and Colless indices under the Yule–Harding and uniform models have long been of interest, with an initial derivation of the mean of the Colless index under the Yule–Harding model [14] and several subsequent studies [3, 4, 15, 17–19] preceding the recent wave of investigations. Hence, in returning to these indices, with all the attention that had previously been focused on them, Mir et al. [16] were evidently surprised to discover that one of their formulas—a fundamental result that solves one of the most natural problems that might be posed concerning a tree balance index—appeared to be novel. They wrote “we obtain an exact formula for the expected value of the Sackin index under the uniform model, a result that seems to be new in the literature.”
Formally, for n ≥ 2, consider the set RB(X) of rooted binary labeled trees whose leaves are bijectively labeled by n distinct labels in a set X [22, p. 12]. RB(X) contains $rb(n) = (2n-3)!! = (2n-3)(2n-5) \cdots (5)(3)(1) = (2n-2)!/[2^{n-1}(n-1)!]$ trees [22, eq. 2.2].
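The two expressions for the tree count can be checked numerically. The following is a minimal sketch, not part of the original note; the function names are illustrative, and the value rb(1) = 1 for a single leaf is a convention assumed here.

```python
from math import factorial


def rb(n: int) -> int:
    """Count rooted binary labeled trees on n leaves as the double factorial (2n-3)!!.

    rb(1) = 1 (a single leaf) is a convention assumed in this sketch.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    count = 1
    for odd in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= odd
    return count


def rb_closed_form(n: int) -> int:
    """The same count written as (2n-2)! / [2^(n-1) (n-1)!]."""
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))


# The two forms agree for small n.
for n in range(1, 12):
    assert rb(n) == rb_closed_form(n)

print([rb(n) for n in range(1, 8)])  # [1, 1, 3, 15, 105, 945, 10395]
```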
Definition 1. Consider a rooted binary labeled tree T ∈ RB(X). For each leaf x ∈ X, let ℓ(x) give the length in edges of the directed path from the root ρ of T to leaf x. The Sackin index for T is
$$S_T = \sum_{x \in X} \ell(x).$$
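Definition 1 translates directly into a short recursion. The sketch below assumes a hypothetical encoding in which an internal node is a Python pair and a leaf is any non-tuple label; this encoding is chosen for illustration only.

```python
def sackin(tree, depth=0):
    """Sum of leaf depths, where depth is the number of edges from the root.

    Assumed encoding: an internal node is a 2-tuple of subtrees,
    and a leaf is any non-tuple label.
    """
    if not isinstance(tree, tuple):
        return depth  # a leaf contributes its own depth
    left, right = tree
    return sackin(left, depth + 1) + sackin(right, depth + 1)


# Two trees on four leaves: a caterpillar and a fully balanced tree.
caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))

print(sackin(caterpillar))  # depths 3 + 3 + 2 + 1 = 9
print(sackin(balanced))     # depths 2 + 2 + 2 + 2 = 8
```

The less balanced caterpillar has the larger index, consistent with the use of the Sackin index as a measure of imbalance.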
Definition 2. For n ≥ 2, given a probability distribution θ on the set of rooted binary labeled trees RB(X) with |X| = n, let $S_n$ denote the random variable obtained by randomly choosing T ∈ RB(X) according to θ and computing the value $S_T$.
An asymptotic large-n expectation under the uniform model, $\mathbb{E}[S_n] \sim \sqrt{\pi}\, n^{3/2}$, was studied by Blum et al. [4] (see also the table on p. 14 of [1], in which the uniform model corresponds to the β = −1.5 case of the more general beta-splitting model, and the table entry gives the asymptotic mean path length to the root for a node chosen at random in a tree under the uniform model).
The exact $\mathbb{E}[S_n]$ was only recently obtained by Mir et al. [16].
Theorem 3. ([16, Theorem 22]) The expectation of the Sackin index under the uniform model on rooted binary labeled trees of n leaves is
$$\mathbb{E}[S_n] = n \left[ \frac{4^{n-1}}{\binom{2n-2}{n-1}} - 1 \right].$$
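For small n, the expectation can be checked by brute force. The sketch below is not the method of the note or of Mir et al.; it enumerates all rooted binary labeled trees by attaching each new leaf at every possible position, then averages the Sackin index exactly and compares it with the closed form $n[4^{n-1}/\binom{2n-2}{n-1} - 1]$. The nested-pair encoding and all function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` above some node of `tree`."""
    yield (tree, leaf)  # attach above the root of this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def all_trees(n):
    """All rooted binary labeled trees on leaves 1..n, each generated once."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for tree in trees for t in insertions(tree, leaf)]
    return trees


def sackin(tree, depth=0):
    """Sum of leaf depths for a tree encoded as nested pairs."""
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin(child, depth + 1) for child in tree)


def mean_sackin_bruteforce(n):
    """Exact mean Sackin index over all equiprobable trees on n leaves."""
    trees = all_trees(n)
    return Fraction(sum(sackin(t) for t in trees), len(trees))


def mean_sackin_formula(n):
    """Closed form n * (4^(n-1) / binom(2n-2, n-1) - 1)."""
    return n * (Fraction(4 ** (n - 1), comb(2 * n - 2, n - 1)) - 1)


for n in range(2, 8):
    assert mean_sackin_bruteforce(n) == mean_sackin_formula(n)

print(mean_sackin_formula(4))  # 44/5
```

For n = 4, the 15 trees consist of 12 caterpillars with index 9 and 3 balanced trees with index 8, giving (108 + 24)/15 = 44/5.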
The proof by Mir et al. [16] of Theorem 3 begins with enumerations of classes of trees defined by the path length to the root from a leaf node with a specified label. The enumerations give rise to a sum that is solved after much algebra with the help of three sums evaluated by automatic summation algorithms.
The purpose of this note is to produce a new, simple proof of Theorem 3 characterizing the expected value of the Sackin index under the uniform model. To provide the new proof, we use the fact that taxa are exchangeable in the uniform model [22, p. 52], so that the probability under the model of a rooted binary labeled tree T in RB(X) can be computed from the shape of T, disregarding the labels.
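Definition 4. A probability distribution θ on RB(X) satisfies the exchangeability property if for each rooted binary labeled tree T in RB(X) and each permutation σ of its leaf labels, $\mathbb{P}_\theta(T) = \mathbb{P}_\theta(\sigma(T))$, where σ(T) is the tree obtained from T by relabeling its leaves according to σ.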
The exchangeability of the uniform model enables use of a result from Than & Rosenberg [23]. A subset A of the label set X is said to represent a cluster in a labeled tree T if for some node v of T, the leaves descended from v are bijectively labeled by the elements of A [22, p. 18].
Proposition 5. ([23, Lemma 6], [22, Proposition 3.5]) If a probability distribution θ on RB(X) satisfies the exchangeability property, then
$$\mathbb{E}_\theta[S_n] = \sum_{k=1}^{n-1} k \binom{n}{k} p_n(k),$$
where n = |X|, and $p_n(k)$ is the probability that a given subset A ⊆ X with |A| = k, 1 ≤ k ≤ n, is a cluster of a tree of n leaves sampled from RB(X) according to θ.
This result is obtained by noting that for T ∈ RB(X), the sum $S_T = \sum_{x \in X} \ell(x)$ that computes the Sackin index, proceeding over leaves of T, can be converted to a sum over edges of T. In particular, for each leaf v of T and each edge e ancestral to v, v appears in the cluster of T immediately descended from e. Each edge of T contributes a count to $S_T$ equal to the number of leaves in the subtree rooted below the edge. Hence, $S_T = \sum_{k=1}^{n-1} k L_k$, where $L_k$ counts the number of clusters in T with size k leaves. Taking the expectation of $L_k$ over all trees sampled from RB(X) and using $\mathbb{E}_\theta[L_k] = \binom{n}{k} p_n(k)$, the proposition follows.
The probabilities $p_n(1)$ and $p_n(n)$ satisfy $p_n(1) = p_n(n) = 1$, as each leaf (|A| = 1) is a cluster of each tree in RB(X), as is the full set of leaves (|A| = n). For 2 ≤ k ≤ n − 1, $0 < p_n(k) < 1$. In particular, the number of rooted binary labeled trees for the k leaves in a cluster A with |A| = k is rb(k); treating the cluster A as a node, the number of rooted binary labeled trees that contain the remaining n − k leaves and the cluster A is rb(n − k + 1). As each rooted binary labeled tree of n leaves has probability 1/rb(n) under the uniform model, we therefore have the following result.
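The conversion from a sum over leaves to a sum over edges can be illustrated on a single tree. This sketch assumes the same hypothetical nested-pair encoding of trees as plain Python tuples; the function names are illustrative.

```python
from collections import Counter


def leaf_depth_sum(tree, depth=0):
    """Sackin index as a sum over leaves of the depth of each leaf."""
    if not isinstance(tree, tuple):
        return depth
    return sum(leaf_depth_sum(child, depth + 1) for child in tree)


def count_clusters(tree, counts, is_root=True):
    """Return the number of leaves below `tree`.

    As a side effect, record in `counts` the size of every cluster that sits
    below an edge, which is every cluster except the one at the root.
    """
    if isinstance(tree, tuple):
        size = sum(count_clusters(child, counts, is_root=False) for child in tree)
    else:
        size = 1
    if not is_root:
        counts[size] += 1
    return size


def edge_sum(tree):
    """Sackin index as the sum over cluster sizes k of k times L_k."""
    counts = Counter()
    count_clusters(tree, counts)
    return sum(k * l_k for k, l_k in counts.items())


tree = ((("a", "b"), "c"), ("d", "e"))
print(leaf_depth_sum(tree))  # 3 + 3 + 2 + 2 + 2 = 12
print(edge_sum(tree))        # 1*5 + 2*2 + 3*1 = 12
```

In this example the non-root clusters are the five single leaves, the two cherries {a, b} and {d, e}, and the triple {a, b, c}.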
Proposition 6. ([23, eq. 10], [22, eq. 3.4]) Under the uniform model, for 1 ≤ k ≤ n,
$$p_n(k) = \frac{rb(k)\, rb(n-k+1)}{rb(n)},$$
where rb(1) = 1.
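The counting argument gives the cluster probability as rb(k) rb(n − k + 1)/rb(n), and inserting it into the cluster-size sum of Proposition 5 yields the mean directly. The following sketch evaluates both with exact rational arithmetic; the function names are illustrative, and rb(1) = 1 is assumed as a convention.

```python
from fractions import Fraction
from math import comb, factorial


def rb(n):
    """Number of rooted binary labeled trees on n leaves, with rb(1) = 1."""
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))


def cluster_probability(n, k):
    """Probability under the uniform model that a given k-subset is a cluster."""
    return Fraction(rb(k) * rb(n - k + 1), rb(n))


def mean_sackin_from_clusters(n):
    """Sum over k = 1..n-1 of k * binom(n, k) * p_n(k)."""
    return sum(k * comb(n, k) * cluster_probability(n, k) for k in range(1, n))


print(cluster_probability(4, 2))     # 1/5: a given pair is a cherry in 3 of 15 trees
print(mean_sackin_from_clusters(4))  # 4 + 12/5 + 12/5 = 44/5
```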
We now provide the proof of Theorem 3.
Proof of Theorem 3. By Propositions 5 and 6,
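$$\mathbb{E}[S_n] = \sum_{k=1}^{n-1} k \binom{n}{k} \frac{rb(k)\, rb(n-k+1)}{rb(n)} = \frac{n}{\binom{2n-2}{n-1}} \sum_{k=0}^{n-2} \binom{2k}{k} \binom{2n-2-2k}{n-1-k}. \tag{1}$$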
This expression is simplified by a “remarkable property of the ‘middle’ elements of Pascal’s triangle” [13, eq. 5.39], the identity
$$\sum_{k=0}^{m} \binom{2k}{k} \binom{2m-2k}{m-k} = 4^m. \tag{2}$$
Adding a term for k = n − 1 to the sum in eq. 1, we take m = n − 1 in the “remarkable” eq. 2, obtaining
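$$\mathbb{E}[S_n] = \frac{n}{\binom{2n-2}{n-1}} \left[ 4^{n-1} - \binom{2n-2}{n-1} \right] = n \left[ \frac{4^{n-1}}{\binom{2n-2}{n-1}} - 1 \right].$$
□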
The identity in eq. 2 is quickly obtained by expressing coefficients of the series expansion for $f(z) = (1-4z)^{-1}$ in two different ways. Trivially, $[z^m] f(z) = 4^m$. We also have $f(z) = g(z)^2$ for $g(z) = (1-4z)^{-1/2}$. Taking the series expansion of g(z), we have $[z^k] g(z) = \binom{2k}{k}$, so that the identity follows from $[z^m] g(z)^2 = \sum_{k=0}^{m} \left( [z^k] g(z) \right) \left( [z^{m-k}] g(z) \right)$.
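Both steps of the generating-function argument can be confirmed numerically: the coefficients of $(1-4z)^{-1/2}$ are the central binomial coefficients, and their convolution with themselves gives powers of 4. The sketch below is an illustrative check with assumed function names, not a proof.

```python
from fractions import Fraction
from math import comb, factorial


def g_coefficient(k):
    """Coefficient of z^k in (1 - 4z)^(-1/2), from the generalized binomial theorem.

    It equals binom(-1/2, k) * (-4)^k, where binom(-1/2, k) is the product
    (-1/2)(-1/2 - 1)...(-1/2 - k + 1) divided by k!.
    """
    falling = Fraction(1)
    for j in range(k):
        falling *= Fraction(-1, 2) - j
    return falling / factorial(k) * (-4) ** k


def central_binomial_convolution(m):
    """Sum over k = 0..m of binom(2k, k) * binom(2m - 2k, m - k)."""
    return sum(comb(2 * k, k) * comb(2 * (m - k), m - k) for k in range(m + 1))


# The series coefficients are the central binomial coefficients ...
assert all(g_coefficient(k) == comb(2 * k, k) for k in range(20))
# ... and their self-convolution is 4^m, the coefficient of z^m in 1/(1 - 4z).
assert all(central_binomial_convolution(m) == 4 ** m for m in range(40))
```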
The asymptotic mean of the Sackin index can be computed by application of Stirling’s formula to the expression in Theorem 3; we can also quickly deduce the asymptotic mean by rewriting the expression in Theorem 3 in terms of a Catalan number. Recalling that the Catalan number $C_n$ satisfies $C_n = \frac{1}{n+1} \binom{2n}{n}$, we obtain the following alternate formula.
Corollary 7. The expectation of the Sackin index under the uniform model on rooted binary labeled trees of n leaves is
$$\mathbb{E}[S_n] = \frac{4^{n-1}}{C_{n-1}} - n.$$
We compute the asymptotic expression for the mean Sackin index from the asymptotic expression for the Catalan numbers, $C_n \sim 4^n / (n^{3/2} \sqrt{\pi})$.
Corollary 8. As n → ∞, the expectation of the Sackin index under the uniform model on rooted binary labeled trees of n leaves satisfies
$$\mathbb{E}[S_n] \sim \sqrt{\pi}\, n^{3/2}.$$
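The rate of convergence to the asymptotic value can be examined numerically. The sketch below evaluates the Catalan-number form $4^{n-1}/C_{n-1} - n$ exactly and divides it by $\sqrt{\pi}\, n^{3/2}$; the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, pi, sqrt


def catalan(n):
    """Catalan number C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def mean_sackin(n):
    """Exact mean Sackin index in the Catalan form 4^(n-1) / C_(n-1) - n."""
    return Fraction(4 ** (n - 1), catalan(n - 1)) - n


def ratio_to_asymptote(n):
    """Ratio of the exact mean to sqrt(pi) * n^(3/2)."""
    return float(mean_sackin(n)) / (sqrt(pi) * n ** 1.5)


for n in (10, 100, 1000, 10000):
    print(n, round(ratio_to_asymptote(n), 4))
```

The ratio approaches 1 from below, and slowly: the subtracted term n is smaller than the leading term only by a factor of order $n^{-1/2}$, so the relative gap shrinks like $1/\sqrt{\pi n}$.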
With the considerable attention devoted to the Sackin index in the nearly 50 years since its introduction, we can add to the surprise of Mir et al. [16] in finding that their result on its expectation under the uniform model has a simple proof.
Interestingly, two reviewers pointed us to additional proofs. Coronado et al. [10] obtained a closed form for a class of recurrences that includes a recurrence for the mean Sackin index. The Sackin index satisfies a stochastic recurrence $S_n = S_k + S_{n-k} + n$ [22, eq. 3.12]. Under the uniform model, the probability that the “left” subtree of a rooted binary tree with n leaves has k leaves, 1 ≤ k ≤ n − 1, is [1, eq. 5],
$$\frac{1}{2} \binom{n}{k} \frac{rb(k)\, rb(n-k)}{rb(n)}.$$
The expected Sackin index then has recurrence
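$$\mathbb{E}[S_n] = n + \sum_{k=1}^{n-1} \binom{n}{k} \frac{rb(k)\, rb(n-k)}{rb(n)}\, \mathbb{E}[S_k], \tag{3}$$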
with $\mathbb{E}[S_1] = 0$. Noting that , the recurrence is solved by the special case of Proposition 6 of Coronado et al. [10] with $X_1 = 0$, $a_1 = 1$, $a_\ell = 0$ for 2 ≤ ℓ ≤ n − 1, and $b_\ell = 0$ for all ℓ, recovering Theorem 3.
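The recurrence route can be checked numerically without the closed-form machinery of Coronado et al. The sketch below assumes the recurrence in the form $\mathbb{E}[S_n] = n + \sum_{k=1}^{n-1} \binom{n}{k} \frac{rb(k)\, rb(n-k)}{rb(n)} \mathbb{E}[S_k]$ with $\mathbb{E}[S_1] = 0$, which follows from the stochastic recurrence and the symmetry of the split probability between the two subtrees. It iterates the recurrence exactly and compares the result with $n[4^{n-1}/\binom{2n-2}{n-1} - 1]$. Function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def rb(n):
    """Number of rooted binary labeled trees on n leaves, with rb(1) = 1."""
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))


def mean_sackin_by_recurrence(n_max):
    """Iterate the recurrence for the mean Sackin index from E[S_1] = 0."""
    mean = {1: Fraction(0)}
    for n in range(2, n_max + 1):
        # The factor 2 from the left/right symmetry cancels the 1/2 in the
        # probability of a left subtree with k leaves.
        mean[n] = n + sum(
            Fraction(comb(n, k) * rb(k) * rb(n - k), rb(n)) * mean[k]
            for k in range(1, n)
        )
    return mean


def mean_sackin_formula(n):
    """Closed form n * (4^(n-1) / binom(2n-2, n-1) - 1)."""
    return n * (Fraction(4 ** (n - 1), comb(2 * n - 2, n - 1)) - 1)


means = mean_sackin_by_recurrence(30)
assert all(means[n] == mean_sackin_formula(n) for n in range(2, 31))
print(means[2], means[3], means[4])  # 2 5 44/5
```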
Fuchs & Jin [12] reported the mean depth of a leaf chosen at random under the uniform model on rooted binary labeled trees, or $\mathbb{E}[S_n]/n$. They exploited an oft-noted mapping [1, 4, 7] that connects the rooted binary labeled trees and the Catalan trees, a class of rooted binary unlabeled trees in which left and right descendants are distinguished and internal nodes have either a left descendant node, a right descendant node, or both. In a uniform probability model on the Catalan trees, quantities related to the Sackin index have long been studied [11, 21]; Fuchs & Jin [12] obtained and solved a version of the recurrence in eq. 3, reporting their Theorem 4 in a form similar to that of our Corollary 7.
Note that our new proof provides a method for evaluating the expected Sackin index for any exchangeable probability model for which the quantity $p_n(k)$ can be calculated. For the Yule–Harding model, $p_n(k) = \frac{2n}{k(k+1)} \binom{n}{k}^{-1}$ for 1 ≤ k ≤ n − 1 [22, eq. 3.5]. As was noted by Steel [22, eq. 3.11], the approach in our proof of Theorem 3 provides a computation of the mean Sackin index under the Yule–Harding model as well, $\mathbb{E}[S_n] = 2n \sum_{k=2}^{n} \frac{1}{k}$.
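The same cluster-size sum applies to the Yule–Harding model. The sketch below assumes the standard Yule–Harding cluster probability $p_n(k) = 2n / [k(k+1)\binom{n}{k}]$ for 1 ≤ k ≤ n − 1 and checks that the sum $\sum_{k=1}^{n-1} k \binom{n}{k} p_n(k)$ reduces to $2n \sum_{k=2}^{n} 1/k$; the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def yule_cluster_probability(n, k):
    """Assumed Yule-Harding probability that a given k-subset is a cluster (k < n)."""
    return Fraction(2 * n, k * (k + 1) * comb(n, k))


def mean_sackin_yule(n):
    """Sum over k = 1..n-1 of k * binom(n, k) * p_n(k) under Yule-Harding."""
    return sum(k * comb(n, k) * yule_cluster_probability(n, k) for k in range(1, n))


def harmonic_form(n):
    """The same mean written as 2n * (1/2 + 1/3 + ... + 1/n)."""
    return 2 * n * sum(Fraction(1, k) for k in range(2, n + 1))


assert all(mean_sackin_yule(n) == harmonic_form(n) for n in range(2, 60))
print(mean_sackin_yule(4))  # 26/3
```

For n = 4, the value 26/3 agrees with a direct calculation: under the Yule–Harding model the caterpillar shape (index 9) has probability 2/3 and the balanced shape (index 8) has probability 1/3.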
Highlights
- A new proof is provided for the mean of the Sackin index of tree balance under the uniform model on rooted binary labeled trees.
- The new proof makes connections to results on exchangeable probability distributions and binomial identities.
- The proof simplifies the understanding of a fundamental result in the study of tree balance.
- The mean Sackin index under the uniform model can be reformulated in terms of Catalan numbers.
- This reformulation leads quickly to the asymptotic result for the mean Sackin index under the uniform model.
Acknowledgments.
We are grateful to two reviewers who pointed us to additional proofs of Corollary 7 and to the quick proof of eq. 2 by use of generating functions. We thank Mike Steel for suggesting a close look at [1]. Support was provided by NIH grant R01 GM131404.
Footnotes
Authors Matthew King and Noah Rosenberg have no conflicts of interest.
References
- [1] Aldous D. Probability distributions on cladograms. In Aldous D and Pemantle R, editors, Random Discrete Structures, pages 1–18. Springer-Verlag, New York, 1996.
- [2] Bartoszek K, Coronado TM, Mir A, and Rosselló F. Squaring within the Colless index yields a better balance index. Mathematical Biosciences, 331:108503, 2021.
- [3] Blum MGB and François O. On statistical tests of phylogenetic tree imbalance: the Sackin and other indices revisited. Mathematical Biosciences, 195:141–153, 2005.
- [4] Blum MGB, François O, and Janson S. The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance. Annals of Applied Probability, 16:2195–2214, 2006.
- [5] Cardona G, Mir A, and Rosselló F. Exact formulas for the variances of several balance indices under the Yule model. Journal of Mathematical Biology, 67:1833–1846, 2013.
- [6] Cardona G, Mir A, Rosselló F, and Rotger L. The expected value of the squared cophenetic metric under the Yule and the uniform models. Mathematical Biosciences, 295:73–85, 2018.
- [7] Chang H and Fuchs M. Limit theorems for patterns in phylogenetic trees. Journal of Mathematical Biology, 60:481–512, 2012.
- [8] Colless D. Review of “Phylogenetics: the theory and practice of phylogenetic systematics”. Systematic Zoology, 31:100–104, 1982.
- [9] Coronado TM, Fischer M, Herbst L, Rosselló F, and Wicke K. On the minimum value of the Colless index and the bifurcating trees that achieve it. Journal of Mathematical Biology, 80:1993–2054, 2020.
- [10] Coronado TM, Mir A, Rosselló F, and Rotger L. On Sackin’s original proposal: the variance of the leaves’ depths as a phylogenetic balance index. BMC Bioinformatics, 21:154, 2020.
- [11] Fill JA and Kapur N. Limiting distributions for additive functionals on Catalan trees. Theoretical Computer Science, 326:69–102, 2004.
- [12] Fuchs M and Jin EY. Equality of Shapley value and fair proportion index in phylogenetic trees. Journal of Mathematical Biology, 71:1133–1147, 2015.
- [13] Graham RL, Knuth DE, and Patashnik O. Concrete Mathematics. Addison-Wesley, Boston, 2nd edition, 1994.
- [14] Heard SB. Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution, 46:1818–1826, 1992.
- [15] Kirkpatrick M and Slatkin M. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution, 47:1171–1181, 1993.
- [16] Mir A, Rosselló F, and Rotger L. A new balance index for phylogenetic trees. Mathematical Biosciences, 241:125–136, 2013.
- [17] Rogers JS. Response of Colless’s tree imbalance to number of terminal taxa. Systematic Biology, 42:102–105, 1993.
- [18] Rogers JS. Central moments and probability distribution of Colless’s coefficient of tree imbalance. Evolution, 48:2026–2036, 1994.
- [19] Rogers JS. Central moments and probability distributions of three measures of phylogenetic tree imbalance. Systematic Biology, 45:99–110, 1996.
- [20] Sackin M. ‘Good’ and ‘bad’ phenograms. Systematic Zoology, 21:225–226, 1972.
- [21] Sedgewick R and Flajolet P. An Introduction to the Analysis of Algorithms. Addison-Wesley, Upper Saddle River, NJ, 2nd edition, 2013.
- [22] Steel M. Phylogeny: Discrete and Random Processes in Evolution. Society for Industrial and Applied Mathematics, Philadelphia, 2016.
- [23] Than CV and Rosenberg NA. Mean deep coalescence cost under exchangeable probability distributions. Discrete Applied Mathematics, 174:11–26, 2014.
