Skip to main content
Springer logoLink to Springer
. 2021 Feb 15;82(3):19. doi: 10.1007/s00285-021-01553-x

Information geometry for phylogenetic trees

M K Garba 1,2, T M W Nye 1, J Lueg 3, S F Huckemann 3,
PMCID: PMC7884381  PMID: 33590321

Abstract

We propose a new space of phylogenetic trees which we call wald space. The motivation is to develop a space suitable for statistical analysis of phylogenies, but with a geometry based on more biologically principled assumptions than existing spaces: in wald space, trees are close if they induce similar distributions on genetic sequence data. As a point set, wald space contains the previously developed Billera–Holmes–Vogtmann (BHV) tree space; it also contains disconnected forests, like the edge-product (EP) space but without certain singularities of the EP space. We investigate two related geometries on wald space. The first is the geometry of the Fisher information metric of character distributions induced by the two-state symmetric Markov substitution process on each tree. Infinitesimally, the metric is proportional to the Kullback–Leibler divergence, or equivalently, as we show, to any f-divergence. The second geometry is obtained analogously but using a related continuous-valued Gaussian process on each tree, and it can be viewed as the trace metric of the affine-invariant metric for covariance matrices. We derive a gradient descent algorithm to project from the ambient space of covariance matrices to wald space. For both geometries we derive computational methods to compute geodesics in polynomial time and show numerically that the two information geometries (discrete and continuous) are very similar. In particular, geodesics are approximated extrinsically. Comparison with the BHV geometry shows that our canonical and biologically motivated space is substantially different.

Keywords: Phylogenetic tree, Information geometry, Tree space

Introduction

Evolutionary relationships between species are represented by phylogenetic trees, in which the leaves represent present-day species, internal vertices represent speciation events, and edge lengths represent the degree of evolutionary divergence between species (Semple and Steel 2003). Evolutionary relationships are often subject to a high degree of uncertainty, and so it is natural to consider the space of all possible relationships and probability distributions on this space. Billera et al. (2001) were the first to construct a space of all phylogenetic trees on a fixed set of leaves. This space, known as Billera–Holmes–Vogtmann or BHV tree space, has a very rich geometry: in particular there is a unique geodesic, or shortest length path, between any two points in the space. BHV tree space is a so-called CAT(0) space (Billera et al. 2001), meaning it has globally non-positive curvature, and many of its attractive geometric properties follow from this condition. A polynomial time algorithm for computing geodesics and their lengths was subsequently developed (Owen and Provan 2011). A number of statistical methods for analysing samples of phylogenetic trees have been established, which rely fundamentally on the geometry of BHV tree space by transferring conventional multivariate statistical methods into the new geometrical context. Algorithms have been developed for computing sample means (Bačák 2014; Miller et al. 2015), for constructing confidence regions for the population mean (Willis 2019), and for performing principal component analysis (Nye 2011; Feragen et al. 2013; Nye 2014; Nye et al. 2017). An alternative geometry for phylogenetic trees, known as the tropical tree space (Speyer and Sturmfels 2004; Lin et al. 2018), arises from regarding phylogenetic trees as distance matrices between the species at the leaves. Statistical methods such as calculation of sample means (Lin and Yoshida 2018) and principal component analysis (Yoshida et al. 2019), have also been developed in tropical tree space.

In the BHV and tropical tree spaces, trees are regarded primarily as geometric or algebraic objects, without specific consideration to how phylogenetic trees are estimated or interpreted. Phylogenetic trees are typically inferred from genetic sequence data via Markov models of sequence evolution over the edges of the tree (Yang 2006), and we are only concerned with such trees. Each phylogenetic tree can therefore be regarded as a probability model for genetic sequence data, and a space of all tree-like probability models can be constructed. This idea was first considered by Kim (2000), and then developed more formally in subsequent papers (Moulton and Steel 2004; Gill et al. 2008). The space is known as the phylogenetic orange space or edge-product space. While the space has been studied from the viewpoint of algebraic geometry (Zwiernik and Smith 2012; Engström et al. 2013), metric geometry on the space has received little attention. Recently, methods for approximately computing ‘probabilistic’ metrics on the edge-product space have been developed (Garba et al. 2018). These metrics are defined by mapping each tree to its associated distribution on sequence data, and using a metric between these probability distributions. Specifically, each tree represents a distribution on characters, where a character is a map from the N leaves of the tree to some alphabet of letters Ω. The Hellinger and Jensen–Shannon metrics are defined between distributions on ΩN and are pulled back to give metrics between trees. Exact calculation of these metrics involves summation over all possible characters, and so when N is large, Garba et al. (2018) use a simulation procedure to estimate the distance between any pair of trees. The probabilistic metrics have substantially different properties than the BHV and tropical metrics. For example, if all the edge lengths in a pair of given trees are scaled up linearly, then the BHV and tropical distance between the trees both scale in the same way, while in contrast the probabilistic metrics eventually tend to zero. This is because the letters at the leaves of the trees become independent from one another as the edge lengths increase, due to genetic saturation. The distributions on characters represented by the two trees therefore converge to one another as the edge lengths are scaled up, and the distance tends to zero (see Fig. 2 in Garba et al. (2018)).

The metrics studied by Garba et al. (2018) arise from embedding tree space into the larger ‘ambient’ space of all distributions on characters. They are obtained from the lengths of ‘chordal paths’ in the ambient space which do not generally lie within the embedded tree space, and are hence called extrinsic metrics. In contrast, the BHV metric is an intrinsic metric, since it is obtained from the lengths of paths lying within tree space. For trees sharing a common branching pattern the BHV metric agrees with the corresponding extrinsic metric obtained via an embedding into Euclidean space. The statistical methods developed on BHV tree space rely heavily on the intrinsic nature of the metric, and this motivated us to seek intrinsic analogs of the probabilistic metrics.

The aim of this article is to realize intrinsic metrics and their associated geodesics in a new space of forests, the wald space,1 that is related to the edge-product tree space (for the subtle, yet essential differences see the discussion in Sect. 6), when the underlying assumptions are similar to those for the probabilistic metrics. We assume that the infinitesimal distance between two trees is measured using the Fisher information matrix. We show that this is equivalent to assuming the infinitesimal squared distance is the Kullback–Leibler divergence, or equivalently, any f-divergence. Our approach uses ideas from information geometry, which is the study of Riemannian differential geometry on spaces of probability distributions. The purpose of developing this geometry on this space of forests is with the ultimate aim of obtaining statistical methods analogous to those on other tree spaces. The probabilistic metrics and the information geometry have an important advantage over the BHV and tropical geometries: they have by definition a direct biological interpretation in terms of the evolution of genetic sequences. In the information geometry, two trees are close when they determine similar distributions of characters, and as a result they would be potentially indistinguishable if inferred from experimental samples of sequence data. Conversely, trees are distant in the information geometry when they induce substantially different distributions. In contrast, the BHV and tropical metrics are defined more abstractly without reference to evolutionary models or processes. Examples of the biological interpretation of the probabilistic metrics were given by Garba et al. (2018).

Our approach has two main parts. First, we consider geodesics in the information geometry when the model associated with each phylogenetic tree is the two-state symmetric Markov process. This is the simplest discrete Markov model of sequence evolution, for which there are two letters in the alphabet, Ω={0,1}. This model is introduced in Sect. 2 along with a formal definition of the wald space and a brief review of BHV space. The thesis of Garba (2019) contains some comparisons of results obtained using the two-state model versus models with the DNA alphabet. Geodesics in wald space are constructed locally by numerically integrating a certain differential equation determined by the assumptions on the Riemannian metric. We explore geodesics on the space of unrooted trees with 5 leaves, for which visualization is relatively straightforward, and compare the results with those for BHV tree space. This forms Sect. 3 of the paper. Secondly, in order to improve computational tractability, we consider an alternative continuous-valued model of evolution on each tree. This consists of a Gaussian process which approximates the two-state Markov process by matching its moments. The continuous random variables at the leaves of the tree have a multivariate normal distribution with zero mean, for which the covariance matrix is related to the matrix of path lengths between the leaves. Numerically solving the differential equations for geodesics is much faster under this set of assumptions, and the geodesics closely resemble those for the two-state model. However, solutions are still restricted to trees sharing a common branching pattern, or topology. The definition of the Gaussian process on trees and numerical solution of geodesics in the corresponding information geometry are described in Sect. 4. The information geometry of multivariate normal distributions with zero mean corresponds to a certain geometry on the space of symmetric positive definite matrices, known as the Fisher–Rao or affine-invariant geometry, and the map from the wald space to covariance matrices is an isometric embedding in this space. The geometry on the space of symmetric positive definite matrices is analytically tractable, and geodesics can be computed in polynomial time. The embedding therefore gives intrinsic and extrinsic metrics on the wald space. We describe a projection algorithm from the space of symmetric positive definite matrices into the embedded wald space. We then use this algorithm to project geodesics in the ambient space down into wald space in various ways to obtain approximate geodesics between trees with different topologies. The embedding in the space of symmetric positive definite matrices and the associated geometry is described in Sect. 5. We conclude in Sect. 6 with a detailed discussion of the promises and challenges of our new wald space.

Background and the new wald space

Phylogenetic trees

For N=2,3, we define UN to be the set of unrooted phylogenetic trees on N taxa. More specifically, a tree T is an element of UN if it satisfies the following conditions. First, T contains exactly N vertices with degree 1, which are called leaves, and these are bijectively labelled 1,,N. Secondly, T must contain no vertices with degree 2. Thirdly, each edge e in T is assigned a length e0 with e0 if e contains a leaf. An edge in a tree is called a pendant edge if it contains a leaf; otherwise it is called an internal edge. Similarly, the vertices which are not leaves are called internal vertices.

The edge lengths e on any given tree TUN define a path length distance between any pair of leaves. The path length on T between u,v{1,,N} will be denoted uv.

Each tree TUN contains at most 2N-3 edges, in which case the tree is called fully resolved or bifurcating, and all internal vertices have degree 3. Trees with fewer edges are called unresolved, and for N>3, these contain at least one vertex with degree 4 or more. Trees which contain only the N pendant edges joined at a single degree-N internal vertex are called star trees.

A tree T is rooted when some internal point ρT is labelled as being the root. This is conveniently achieved by adding an additional taxon labelled 0 to the tree via a pendant edge of length zero. It follows that the set of rooted phylogenetic trees satisfies the same conditions as UN, except the leaves are bijectively labelled 0,1,,N, and the pendant edge containing taxon 0 has zero length. We will work with unrooted trees, but our results are easily transferred to the space of rooted trees via this relationship.

Every fully resolved tree will correspond to a fully resolved BHV-tree (reviewed in Sect. 2.2) and to a fully resolved wald, as introduced below in Sect. 2.4. In both BHV tree space  and in wald space, unresolved trees will be identified with other trees with certain internal edges having zero length, so that conceptually a missing edge is the same as a zero length edge.

Billera–Holmes–Vogtmann tree space

Billera et al. (2001) defined a space of phylogenetic trees, subsequently known as BHV tree space, and described its geometry. BHV tree space can be described via an embedding in Rd for dimension d which increases exponentially with the number of leaves. However, we have chosen to describe BHV tree space in a way different from the original authors, and we define it as a quotient space. As a result, the wald space introduced in the next section is a superset of BHV tree space when the spaces are regarded simply as sets, clarifying the relationship between the two spaces. Importantly, we allow internal edges on trees to have length zero, and under the quotient these are equivalent to trees with those edges missing. A second difference is that while Billera et al. (2001) worked with rooted trees, we work with unrooted trees. As described in Sect. 2.3, the distribution on binary characters determined by a tree does not depend on the root position under the two-state symmetric model, and so unrooted trees are more natural to work with.

BHV tree space is defined using the notion of splits, where a split is a bipartition of the leaf labels 1,,N into two disjoint sets. Cutting an edge of a tree induces such a bipartition of the leaves, and so each edge on a tree corresponds to a split, and the terms split and edge can be used interchangeably. The set of splits represented by a tree is called its topology.

Arbitrary sets of splits do not typically determine valid tree topologies: the splits of a tree must satisfy a compatibility condition. For example, the splits {1,2},{3,4,,N} and {1,3},{2,4,,N} are incompatible, since leaf 1 cannot be grouped next to both 2 and 3 on the same tree. For any topology τ with k internal edges, 0kN-3, the set of trees in UN with that topology is bijectively parametrized by R>0N×Oτ where the first term in the product parametrizes the pendant edge lengths that, by definition, are strictly positive, and Oτ=R0k parametrizes the internal edge lengths.

The set Oτ is called the orthant associated with topology τ, and we identify the set of all trees with topology τ with R>0N×Oτ. Under this identification, the set of all trees UN, as defined in Sect. 2.1, is the disjoint union

UN=R>0N×τOτ

where the disjoint union is taken over all possible topologies τ.

The unrooted BHV tree space UN is obtained by taking the quotient of UN with respect to an equivalence relation:

UN=UN/.

Two trees in UN are equivalent under if and only if they are identical modulo the presence of internal splits with zero length, as shown in Fig. 1.

Fig. 1.

Fig. 1

Two trees in UN are equivalent under the relation when they are identical after internal edges with length zero are removed, and the vertices at the end of every such edge are merged. ABCD represent different subtrees joined by edges of length A,B,C,D to an internal edge with length =0 on the left. The Markov process X(t) cannot change state on any edge with length zero, so the distribution on X1,,XN is unchanged by removing such edges in this way

The quotient space factorizes as

UN=R>0N×BHVN

where the first term parametrizes the lengths of the pendant edges and the space BHVN parametrizes the topology and internal edge lengths of the BHV-trees. When τ is fully resolved, Oτ is called a maximal orthant. Unresolved trees correspond to points on the boundaries of maximal orthants; they can be obtained from fully resolved trees by shrinking internal edge lengths down to zero.

Since there are (2N-5)!! fully resolved unrooted topologies, BHVN can be thought of as being constructed by gluing this number of maximal orthants together along their boundaries, where two points are identified if they correspond to the same tree. For example, when N=4, there are three fully resolved topologies, each of which contains a single internal edge. The space BHV4 therefore consists of three copies of R0 glued together at the origin. The origin corresponds to the star trees, while the location along each of the three copies of R0 gives the length of the internal edge in each of the three possible fully resolved topologies. For N=5 there are 15 possible unrooted tree topologies, each of which contains two internal edges. It follows that BHV5 consists of 15 copies of R02 glued along their boundaries. At each codimension-1 boundary, three maximal orthants are joined together. This is because when a single internal edge is contracted to length zero, a degree 4 vertex is obtained, and there are 3 possible ways to add in an edge, including the original edge, in order to obtain a fully resolved topology, as illustrated by Fig. 2.

Fig. 2.

Fig. 2

When an internal edge from a fully resolved topology is contracted down to length zero (left to centre), there are two fully resolved topologies which can be obtained by expanding out an alternative edge (right). ABCD represent subtrees. The operation of contracting an internal edge and expanding out an alternative edge is called nearest neighbour interchange. It follows that at each codimension-1 boundary, three maximal orthants are glued together

The metric on BHVN is constructed as follows. The basic idea is that for trees with the same fully-resolved topology but different vectors of internal edge lengths, say 1 and 2, the distance is the Euclidean distance 1-2, and the corresponding geodesic is the straight line segment in the orthant containing the trees. Billera et al. (2001) showed that there exists a unique shortest path between any two points in BHVN, for which path length is measured using the Euclidean distance in each orthant, and the length of these defines a metric on BHVN which we denote dBHV. A metric on UN, denoted dUN, is obtained as the product metric when the metric on pendant edges is taken to be the Euclidean distance. An algorithm has been developed which constructs geodesics and calculates their lengths in O(N4) time (Owen and Provan 2011).

The two-state symmetric Markov model

Genetic sequence evolution is typically modelled using discrete-valued continuous-time Markov processes defined over the edges of a tree T (Yang 2006; Bryant et al. 2005). DNA sequence evolution is modelled by associating to each point tT, a random variable X(t) which takes values in an alphabet {A,C,G,T}. In this paper, however, we will consider the two-state symmetric Markov process with alphabet Ω={0,1}. This simplification is made in order to make the mathematics more tractable and for computational speed. Nonetheless, some of the calculations using the two-state symmetric can readily be performed using DNA models. More details are given in the thesis of Garba (2019) in which simulations show similarity of geometries obtained from the two- and the four-state process. The transition probability of the symmetric two-state model is defined in terms of the path length t1t2 between any two points t1,t2T:

PrX(t2)=X(t1)=121+e-t1t2,andPrX(t2)X(t1)=121-e-t1t2. 2.1

The stationary distribution of this Markov process is Bern(1/2), and the process is assumed to be in its stationary state over the tree. As a result, for all tT, X(t) has a marginal Bernoulli distribution, X(t)Bern(1/2). While the random variables X1,,XN at the leaves of the tree have the same marginal distributions, they are not independent since the tree imposes a dependence structure. The following lemma determines certain moments of the process X(t) giving insight on the dependence structure of X1,,XN. The proof is straightforward using the transition probabilities in Eq. (2.1).

Lemma 2.1

  1. If X1,,XN are the random variables at the leaves of a tree TUN determined by the discrete Markov process defined above, then CovXu,Xv=14exp(-uv) where uv is the path length between leaves u and v.

  2. If t1,t2T are path length t1t2 apart, then the conditional distribution of X(t2) given X(t1)=ω{0,1} has variance 14(1-exp(-2t1t2)).

It is straightforward to simulate realizations of X(t) in the following way. First simulate a Poisson process with rate 1 independently on each edge of the tree. The positions of the simulated events correspond to points at which X(t) changes parity. Secondly, pick any point t0T which is not a change point and sample X(t0) from Bern(1/2). The change points generated from the Poisson process then determine the value of X(t) for all other tT. The distribution obtained is independent of the choice of t0, because the Markov process is reversible. In particular, the Markov process is independent of the choice of t0, which could be considered as a root.

Under the model, each edge length can be interpreted as the expected number of change points that occur over the edge. Internal edges are allowed to have length zero, which means that no change in X(t) occurs over the edge. On the other hand, when edges are long, the number of changes is likely to be large, and the letters at either end of the edge are weakly correlated. Biologists refer to this effect as saturation. A fixed change of edge length δ therefore has more effect on the distribution of characters when applied to a short edge as opposed to a long edge in some given tree. For example, an increase of δ=0.1 to an edge with length =0.1 approximately doubles the probability that the letters at either end of the edge are different, but the same change to an edge of length =10 has almost no effect on this probability, which due to saturation is very close to 1/2. This idea becomes important when we consider defining distances between trees via the information they represent, in particular using the probability mass function of the nontrivial distribution of (X1,,XN).

Remark 2.1

  1. The probability mass function of (X1,,XN) determined by T is denoted pT(s) where s{0,1}N is called a binary character. Given any binary character s, the values of pT(s) can be evaluated via a recursive algorithm (Bryant et al. 2005), described in Appendix A. Appendix A also contains a modified form of the algorithm which is used to compute exactly the derivatives of pT(s) with respect to the edge lengths.

  2. Also, as shown in Appendix A, pT(s) is a multivariate polynomial in 1+ek and 1-ek where k ranges over the edge lengths in T.

  3. Crucially, the map TpT from fully resolved trees in UN to the space of probability mass functions on {0,1}N is injective up to the equivalence relation introduced in Sect. 2.2 (Rogers 1997; Allman et al. 2008). This has two implications: first that the probability mass function pT uniquely characterizes each element of UN, and secondly that metrics on distributions of characters pull back to define a metric between fully resolved trees, as described in Sect. 3.2.

A new forest space: the wald space

The following wald space gives an alternative viewpoint of phylogenetic trees by regarding them as Markov models for sequence evolution (Kim 2000; Moulton and Steel 2004; Gill et al. 2008). We will give a description of the wald space that is related to the edge-product space of previous authors, by defining it as a quotient space which adds trees with infinitely long edges to BHV tree space. As described in the introduction, each probabilistic metric considered by Garba et al. (2018) converges to zero in the limit as all the edge lengths in a pair of trees simultaneously tend to infinity. This behaviour indicates that shortest paths might cross a tree with infinitely long edges, which is why we add this boundary at infinity. Furthermore, we allow pendant edges of length zero under certain conditions. Thus, in the wald space, edge lengths e take values in R0{}. It is convenient to reparametrize to the λ-parametrization by defining weights λe=1-exp(-e), so λe[0,1]. Under this transformation, BHVN becomes a set of unit cubes, rather than orthants, glued along faces where λe=0 for one or more edges. The wald space is defined by imposing additional gluing rules on faces where λe=1.

In order to be able to identify trees with infinitely long edges along faces where λe=1, we construct the wald space from forests, that is, disjoint unions of unrooted trees. We start with a preliminary space leading to the definition of the wald space further below. Let WN be the collection of forests satisfying the following necessary and sufficient conditions for each FWN.

  1. The forest F contains exactly N labelled vertices, these are called leaves and labelled 1,,N.

  2. There are no unlabelled vertices in F of degree 0, 1 and 2.

  3. For every pair of leaves u,v in the same tree in F, at least one edge e on the unique path from u to v satisfies λe>0.

Clearly, UNWN. The condition on the edge weights ensures that no pair of leaves is coincident and consequently that metrics are always well-defined, as described in Sect. 3.2. We impose an equivalence relation on WN, defined by the following two rules.

BHV boundary rule: Given F1,F2WN, suppose all internal edges with λe=0 are removed from the forests, and the vertices at either end of each such edge are merged. If the resulting forests are identical, then F1F2. The rule is the same as that in Fig. 1.

Boundary at infinity rule: Suppose FWN contains an edge with λe=1, and that F is modified as follows. The edge with λe=1 is removed, disconnecting the tree it belongs to. If this results in any unlabelled vertex having degree 2, then those vertices are removed. If v is such a vertex, and the two adjacent edges e,e~ have weights λe,λe~[0,1], then e,e~ are replaced by a single edge with weight λe+λe~-λeλe~, as is further explained below. Now suppose F1,F2WN, and this process of modifying unit-weight edges is applied to both forests. Then F1F2 if the resultant forests are identical, as illustrated in Fig. 3.

Fig. 3.

Fig. 3

Illustration of boundary at infinity rule used to define WN. In both (a) and (b), the forests on the left are equivalent to the forests on the right. ABCD are subtrees. a Internal edge with weight 1: the edge is deleted, disconnecting the tree. The resulting edges between subtrees A and B are replaced by a single edge with weight λAB=λA+λB-λAλB and similarly for CD. b Pendant edge, where L is a leaf. The pendant edge with weight 1 is removed, and the resulting edges between A and B are replaced by a single edge with weight λAB. The term F in both panels refers to other disconnected components in the forests

The wald space WN is defined to be the quotient WN/ and it immediately follows that as sets UNWN, but the geometry imposed on WN will be completely different from the geometry of the BHV space.

The boundary rule at infinity requires some explanation. The rule declares that edges of weight λe=1 (or equivalently length e=) can be deleted from a forest F, but unlike the BHV rule for which the vertices at the ends of the edge are merged, edge removal disconnects a tree in F. When resulting degree-2 vertices are removed, the edge length is preserved so that the new edge has length e+e~. The corresponding weight λ is given by λ=1-exp(-(e+e~))=λe+λe~-λeλe~. Unlike the BHV boundary rule, in which finitely many trees are identified in each equivalence class, infinitely many combinations of edge weights λe,λe~ give rise to the same value λe+λe~-λeλe~. It follows that an uncountable collection of forests can be identified into a single equivalence class in WN.

In the edge-product space (Moulton and Steel 2004; Gill et al. 2008), an alternative parametrization is used, defining μe=1-λe=exp(-e) to be the weight of edge e. This parametrization has the advantage that sums of edge lengths 1++m become products of edge weights μ1××μm (hence the name ‘edge-product’). The boundary at infinity rule is simpler under this parametrization: the weights in Fig. 3 panel (b) become μe, μe~ and 0 on the left and μeμe~ on the right. However, under the μ-parametrization, the BHV boundary with e=0 lies on faces of cubes with μe=1, whereas the boundary at infinity has μe=0. We prefer to work with the λ-parametrization since it gives a more intuitive interpretation of the weights, i.e. e=0 corresponds to λe=0 and e= corresponds to λe=1. Forests which contain more than one connected component lie in the faces of cubes with at least one λe=1. Since the pendant edges can be expanded out to infinite length, they are also subject to the boundary at infinity rule, and so the representation of pendant edge lengths in WN is not via a product geometry, as it is for BHV tree space. While the star trees correspond to all internal edges having zero length, WN also contains a point which consists of N isolated vertices.

The BHV boundaries enable tree topologies to be changed via nearest neighbour interchange (NNI) operations (as illustrated by Fig. 2). The boundary at infinity corresponds to a different topological operation, called tree bisection and reconnection (TBR) (Allen and Steel 2001). Under this operation, an edge e in a tree can be expanded up to the boundary λe=1. Removing the edge bisects the tree, and the two components can be reconnected by an edge e~ with λe~=1 placed arbitrarily between the two trees. Reducing the weight λe~ down from 1 then gives a tree with a topology different from the original tree. It follows that there exist continuous paths in the wald space between trees with different topologies, which pass through the boundary at infinity and, as a result, change tree topology via TBR operations. This is in contrast to BHV tree space in which paths between trees of different topologies involve only NNI operations, as edges are contracted down to length zero and alternative edges are expanded out.

While the set WN was defined above via an equivalence relation on forests, we also need to understand how it parametrizes Markov models and then characterise its elements again as probability mass functions on {0,1}N. The two-state symmetric Markov process extends from being defined on trees to forests by taking the process on each connected component in a forest to be independent of the other components. This defines a distribution pF on {0,1}N for each FWN. In fact the distribution uniquely determines the equivalence class of F, and vice versa, as the following lemma shows.

Lemma 2.2

Given F1,F2WN, then F1F2 if and only if pF1(s)=pF2(s) for all s.

A proof is given in the Appendix.

Note that the forest consisting of N isolated vertices corresponds to the random variables X1,,XN being independent, and this can be obtained from any tree by expanding all edges one after the other, as, by definition, there is at least one edge between any two leaves.

Information geometry for the two-state symmetric model

Information geometry provides methods for constructing metrics and geodesics on parametrized sets of probability distributions. In this section we embed wald space in the space of distributions of two-state characters, and investigate the corresponding information geometry analytically and computationally.

Geometry of embeddings

Suppose that θ:XY where (Yd) is a metric space and θ is injective. We will say that X is embedded in Y, and refer to Y as the ambient space. The embedding can be used to construct certain metrics on X. First, since θ is injective, d pulls back to define a metric on X which we denote dX:

dX(x1,x2)=d(θ(x1),θ(x2))

for all x1,x2X. The pullback metric is often called the induced extrinsic metric and, it is simply the restriction of d to XY, and so when the context is clear, it is also denoted d. The probabilistic metrics described in Sect. 3.2 are constructed in this way. A second metric, called the induced intrinsic metric and denoted d(x1,x2), is defined as the infimum of the length of all possible paths in X between x1,x2XY when path length is measured using the metric d. If no path with finite length exists between x1 and x2 then d(x1,x2)=, in which case d is not a metric. Details of this construction of the induced intrinsic metric are given by Bridson and Haefliger (2011). The metrics d and d, if well-defined, give X the structure of a length space, which is a space in which the metric between points x1,x2 is the infimum of the lengths of paths between those points. Length spaces are similar to geodesic metric spaces, except that the infimum is not necessarily achieved by a path lying within the space; in a geodesic metric space a minimum length path exists between every pair of points, and so every geodesic metric space is a length space. An example of a length space which is not a geodesic metric space is R2 with the origin removed and the Euclidean metric. Points antipodal to the origin cannot be joined by a geodesic, but the distance between them is the infimum of the lengths of paths joining the points.

In order to illustrate the relationship between d and d, consider the example of the embedding of the unit sphere X=S2 in Y=R3 equipped with the Euclidean metric d. For any two points x1,x2, d(x1,x2) is the length of the straight line segment in the ambient space R3 joining the points. This metric is usually called the chordal metric on S2. However, when we consider paths between x1,x2 which are restricted to lie in S2, the shortest paths (with respect to d) are great circles, and the induced metric d is the arc length metric. In fact, S2 is a geodesic metric space, since the infimum of path length is always achieved by a great circle.

In the following Sect. 3.2, the wald space WN will be embedded in the space of distributions of characters. Later in Sect. 4 it will be embedded in the space of N×N symmetric positive definite matrices. Each embedding will be used to construct metrics on WN.

Probabilistic metrics

Here we briefly describe the probabilistic metrics developed by Garba et al. (2018) since these will be used for comparison with other metrics. The Kullback-Leibler divergence is a commonly used measure of the difference between two distributions. Given two probability mass functions pq on characters {0,1}N, the Kullback–Leibler divergence from q to p is defined as

DKL(p;q)=s{0,1}Np(s)logp(s)q(s)

provided p(s)=0 only when q(s)=0. The Kullback–Leibler divergence is not a metric since it is not symmetric. However, metrics can be defined as follows: the Jensen–Shannon metric dJS is defined by

dJS(p,q)2=12DKLp;p+q2+12DKLq;p+q2

and the Hellinger metric dH is defined by

dH(p,q)2=s{0,1}N(p(s)-q(s))2.

Recently, probabilistic metrics have been developed which are based on distributions of gene trees instead of distributions of characters (Adams and Castoe 2020).

The Kullback–Leibler divergence, squared Jensen–Shannon metric and squared Hellinger metric are all examples of a more general class of distances between probability distributions known as f-divergences. Given any convex function f(t) such that f(1)=0, the f-divergence of p from q is defined as

Df(p;q)=s{0,1}Nq(s)fp(s)q(s). 3.1

The Kullback–Leibler divergence DKL(p;q) is obtained by taking f(t)=tlogt, while the reversed divergence DKL(q;p) is obtained with f(t)=-log(t). The squared Jensen–Shannon metric and squared Hellinger metric can also be obtained by using more complicated functions f, cf. Sason and Verdu (2016).

Now, let FWN be a forest representative of a wald [F]WN. As described in Sect. 2.4, the distributions at leaves of different trees of F are independent. For two leaves in the same tree in F, some degree of evolution occurs between them since by definition of WN no two leaves are coincident. Therefore, all characters are possible, giving

pF(s)0foralls{0,1}N. 3.2

It follows that the Kullback–Leibler divergence is always well-defined between distributions of characters corresponding to forest representatives from the wald space. Since by Lemma 2.2 the map [F]pF is injective for [F]WN the Jensen–Shannon and Hellinger metrics pull back to define extrinsic metrics on the wald space WN (analogously to (Garba et al. 2018)). As already mentioned in the introduction, statistical methods rely heavily on the intrinsic nature of metrics, and thus we aim for more geometrical structure in the next section by imposing the Fisher information metric (a Riemannian metric) onto the wald space.

A two-state process geometry for the wald space

BHV tree space UN and wald space WN both do not have the structure of a manifold globally, but the interior of each maximal orthant is a manifold parametrized by or λ. Therefore we consider first the information geometry on the subspaces of wald space corresponding to a fixed fully resolved tree topology—here every wald has only one single tree representative, since every wald corresponding to a forest with more than one component, as well as a wald containing a pendant edge with length zero, lies on the boundary of unit cubes corresponding to fully resolved tree topologies. Secondly, we establish global results about the constructed geometry of WN.

Thus suppose τ is a fully resolved tree topology, and that trees with this topology are parametrized by =(1,,2N-3)R>0N×Oτ.

Let p(s) be the probability mass function pT(s) associated with tree T determined by τ,. Recalling that p(s)>0 for all s, due to (3.2), the Fisher information matrix at is

gij()=s{0,1}Np(s)(ilogp(s))(jlogp(s)) 3.3

for i,j=1,,2N-3 where i=/i. This defines a Riemannian inner product on the tangent space of R>0N×Oτ at (that is a copy of R2N-3) which gives a way to measure the lengths of paths. Specifically, if p and p+δ lie infinitesimally close on a path, then the squared path length between them is defined to be i,jδigij()δj. Standard results from Riemannian differential geometry show that if (t) is a path in R>0N×Oτ then it is locally a geodesic (i.e. it minimizes path length) if it satisfies the differential equation

d2kdt2+i,jΓijk()didtdjdt=0,k=1,,2N-3 3.4

where Γijk() are the Christoffel symbols

Γijk()=l12gklglij+glji-gijl.

The matrix gij is the inverse of gij i.e. kgikgkj=δji where δji is the Kronecker delta. It is important to note that the geodesic equation and loci of solutions are invariant under changes of parametrization, and so the equations can be formulated using lengths i or the weights λi. On the boundary, however, this is no longer necessarily true.

The Riemannian metric defined by the Fisher information matrix is related to the Kullback–Leibler divergence and other f-divergences by the following lemma.

Lemma 3.1

Suppose Df is an f-divergence given by some convex function f with f(1)=0, as defined by (3.1). Consider a small perturbation δ=(δ1,,δ2N-3) of the edge lengths of a tree (τ,). Then

i,jδigij()δj=2f(1)Df(p+δ;p)+O(|δ|3) 3.5

where the error term consists of third-order products of the elements of δ and

Df(p+δ;p)=12f(1)s(p+δ(s)-p(s))2p(s)+O(|δ|3). 3.6

In other words, the norm of the perturbation, as measured with respect to the Riemannian inner product, is proportional to the f-divergence of p+δ from p.

The proof is given in the Appendix. Since the lemma applies to an arbitrary f-divergence, the term on the right-hand side of Eq. (3.5) can be the Kullback–Leibler divergence or the squared Jensen–Shannon metric, for example. Lemma 3.1 gives the fundamental assumption behind the geometries we construct on WN: that distances are locally measured by the infinitesimal Kullback–Leibler divergence between probability distributions associated with trees, or equivalently, by any f-divergence. As a corollary of Lemma 3.1, it follows that the metric defined by Eq. (3.3) is positive definite (i.e. the metric is not semi-Riemannian). This is because the map from trees to distributions of characters is injective, and since Df(p;q)>0 for all pq, it follows that the right-hand side of Eq. (3.5) is strictly positive for all small non-zero perturbations δ.

Let D{0,1}N be the space of distributions on {0,1}N. By Lemma 2.2, the map from WN to distributions of characters determines an embedding of WN in D{0,1}N. Given a metric d on D{0,1}N, let d be the induced intrinsic metric on WN, as described in Sect. 3.1.

Theorem 3.1

Let d and d0 be metrics on D{0,1}N which are the square root of an f- and f0-divergence, respectively. Then for any [F],[G]WN:

  1. d([F],[G])< and thus d is well-defined.

  2. d([F],[G])=c·d0([F],[G]) for some constant c>0.

  3. Any path which realizes the distance d([F],[G]) is a solution of Eq. (3.4) at any point in the interior of a maximal orthant.

The proof is given in the Appendix. Valid choices for the metric d in Theorem 3.1 include the Jensen–Shannon metric or Hellinger metric. Theorem 3.1 establishes WN with metric d as a length space. Finiteness of d shows, for example, that points in WN corresponding to disconnected forests (or equivalently, trees with infinite edge lengths) are at a finite distance away from orthant interiors. The second assertion implies a scaling of the induced intrinsic metric under changes of the function f, which, in turn, substantiates our conclusions drawn from Lemma 3.1 that the geometry of WN induced by d is invariant under the choice of f.

Numerical investigation of the geometry

The geodesic equation (3.4) can be solved numerically on the interior of any maximal orthant given some initial conditions (0)=0 and d(0)/dt=v0. As described in Sect. 2.3, the first and second derivatives of p(s) with respect to the edge lengths i can be computed analytically. Calculation of gij() consists of a sum over all 2N possible characters involving the derivatives of p(s). The Christoffel symbols can similarly be calculated as sums over characters. A fourth-order Runge–Kutta method was used to integrate the ODEs.

This numerical scheme was used to construct and visualize geodesics on a single orthant in W5. The particular topology and edge lengths for the orthant are represented by the Newick string ((1:1,2:2):6,3:3,(4:4,5:5):7). Parameters 1,,5 are the pendant edge lengths and 6,7 the lengths of the two internal edges. We restricted to N=5 leaves in order to enable easy visualization of geodesics. Integration was stopped whenever any internal edge was assigned a length 0, corresponding to the boundary of the orthant. If this occurred for a pendant edge, the pendant edge length was given value zero at that step, and integration was continued.

Figure 4 shows typical results. The figure shows the orthant representing the two internal edge lengths, with geodesics ‘fired’ from some fixed starting point 0 in 24 different directions. Also marked on the plots are contours of distance from the starting point. Each panel shows results for a different initial tree 0. It is evident that the geodesics are not the same as BHV geodesics, which are straight lines radiating from the initial point with equally spaced circular contours of distance. Figure 4 shows curved geodesics with irregularly spaced contours of distance. Contours appear to stack up towards the origin and codimension-1 BHV boundaries, but are more spaced out as geodesics move out towards the boundaries at infinity. This is more obvious when initial internal edges are long (top row of Fig. 4). On the other hand, the geodesics are more similar to the BHV geodesics when internal edges are short and pendant edges are long, as in the bottom two rows of the right hand column. In all cases, in contrast to BHV tree space, geodesics seem to be slightly attracted toward the star trees. The pendant edges do not behave as they do in BHV tree space: they can change value even when their initial velocity is zero.

Fig. 4.

Fig. 4

Solutions to the geodesic equation (black radiating curves) and contours of distance (grey) in a single maximal orthant of WN. Each panel shows the trajectories for the internal edge lengths 6,7. Rows correspond to different initial sets of internal edge lengths. Columns correspond to different initial pendant edge lengths: i=0.1 for i=1,,5 on the left, and i=0.5 for i=1,,5 on the right. The initial velocity on pendant edge lengths was zero in all cases. Dots mark points at which a pendant edge was assigned length zero, and at all subsequent points, to avoid negative values. Contours near the origin in the top two plots have been removed: they stack up as the origin is approached and obscure the appearance of the geodesics

Figure 5 provides more detail for certain geodesics in Fig. 4. The graphs in the figure show each edge weight λi versus time, and the time is proportional to distance travelled. The λ-parametrization was used for these plots since it shows the results most clearly. Panel (a) shows that, when contours become more and more widely spaced, the boundary at infinity (λ6=λ7=1) can be reached in finite time, rather than asymptotically. This shows that points corresponding to trees with infinitely long edges are a finite distance away from the starting point, as established by Theorem 3.1. In the -parametrization, the internal edge lengths rapidly blow up to infinity after time t=0.8. It follows that the shortest path between two trees with finite edge lengths might involve trees with infinitely long edges. On the other hand, for some panels in Fig. 4, the contours become increasingly close as BHV boundaries are approached, but as panel (b) in Fig. 5 shows, points on BHV boundaries are in fact finitely close to orthant interiors, since the boundary is reached in finite time.

Fig. 5.

Fig. 5

Graphs showing edge weights λi vs time along geodesics in top left panel of Fig. 4. a Geodesic heading in North East compass direction. b Geodesic heading West

Figure 6 replots the top left and middle left panels from Fig. 4 using the λ-parametrization, so that the boundary at infinity corresponds to edges of the unit square with weight 1. These plots suggest that trees in which one of the two internal edges is finitely long are ‘repellant’ since the geodesics fired in the North and East compass directions end up passing through the disconnected forest with λ6=λ7=1. Indeed, the points on the boundary with λ6=λ7=1 appear to be ‘attractive’, with geodesics pulled round to pass through these points and arriving in finite time.

Fig. 6.

Fig. 6

Top left and middle left panels on Fig. 4 redrawn in the λ-parametrization

While these results show how the information geometry on WN differs substantially from the BHV geometry, the method for constructing geodesics by integrating the geodesic differential equation suffers from several disadvantages. First, it requires summation over the elements of {0,1}N which makes it infeasible for large N (exponential computation time in N). Secondly, only local geodesics are computed by solving the initial value problem (geodesic “shooting” or “firing”) for the differential equation valid only in maximal orthants. Thirdly, for practical applications, an algorithm which takes two points in wald space and joins them by a globally shortest path is more useful, and so it is desirable to solve the boundary value problem for geodesic construction. The next section attempts to deal with some of these shortcomings.

Information geometry for a Gaussian process on trees

In this section we develop the information geometry of a continuous-valued Markov process associated to each tree which is more computationally and analytically tractable than the information geometry for the symmetric two-state Markov process. It has the advantage that the geodesic equation (3.4) can be solved numerically much faster than the corresponding equation for the symmetric two-state model, but the solutions for the two models are very similar.

Definition of the Gaussian process

Our aim is to construct a Gaussian process which is a continuous-valued analogue of the symmetric two-state process by matching the moments specified in Lemma 2.1. Consider the Ornstein–Uhlenbeck process Z(t) on T which satisfies

Z(t2)|Z(t1)=zN(ze-t1t2,1-e-2t1t2) 4.1

where t1t2 is the path length distance between t1 and t2 on T. The stationary distribution is the standard normal distribution N(0, 1), and we assume the process is stationary over T. The Markov process satisfies the detailed balance equation, and so is reversible in its stationary state. As a result, realizations of the process can be simulated by fixing an arbitrary root t0T for the tree, simulating Z(t0) from N(0, 1) and then using Eq. (4.1) to simulate a realization Z(t) for any other tT. Reversibility of the process ensures the distribution obtained is invariant of the choice of root. A short calculation shows that the covariance matrix of the random variables Z1,,ZN at the leaves of T is given by CovZu,Zv=exp(-uv). Since uu=0 for all u=1,,N, this gives VarZu=1. Similarly, the conditional distribution of Z(t2) given Z(t1)=z has variance 1-exp(-2t1t2). These moments match those in Lemma 2.1 up to the constant factor of 1/4, and we will show later that this factor makes no difference to the geometry obtained. The process Z(t) can be thought of in two ways. First, it approximates the binomial random variables obtained when many independent binary characters evolve under the symmetric two-state process. Secondly, it could be regarded as an evolutionary model of a continuous trait for which the observations are standardized to have zero mean and unit variance in each population. Mean-reverting Gaussian processes like this are sometimes used to model continuous traits for which there is some constant optimal value for survival (Hansen and Martins 1996). The definition of Z(t) extends from trees to forests by taking the process to be independent on each connected component.

Given a forest FWN, the distribution of the random variables Z1,,ZN at the leaves is multivariate normal with zero mean and covariance matrix SF where

SF=(exp(-uv))u,v=1N. 4.2

The path distance uv is the sum of the lengths e on edges e between u and v, and is taken to be infinite when u and v are in different components of F. This defines a map FϕF from forests to multivariate normal distributions where ϕF is the probability density function of N(0,SF). A similar result to Lemma 2.2 holds: whenever F1F2, the distributions ϕF1 and ϕF2 are the same, so the map is well-defined on WN.

A Gaussian process geometry for the wald space

The information geometry of multivariate normal distributions with zero mean has been studied previously (Lenglet et al. 2006) and is analytically tractable. The theory described in Sect. 3 needs adapting to account for the change from discrete to continuous characters. The Fisher information metric of WN in Eq. (3.3) becomes an integral over RN rather than a sum, and the mass function p is replaced with the density function ϕ of the Gaussian corresponding to a fully resolved tree with edge lengths :

gij()=RNϕ(s)(ilogϕ(s))(jlogϕ(s))ds. 4.3

The geodesic equation (3.4) remains the same. Although pT(s) and its derivatives could be evaluated exactly for the two-state symmetric model, evaluation of the metric and Christoffel symbols required a sum over all characters. For the continuous model, the corresponding integrals have closed form, as we describe below.

Gaussian distributions with zero mean are parametrized by their covariance matrices, namely by the space of N×N symmetric positive definite matrices, which we will denote SN+. The set of covariance matrices associated with forests forms a subset within SN+, as determined by the following theorem.

Theorem 4.1

  1. The covariance matrix SF associated to any FWN, as defined by Eq. (4.2), is positive definite so lies in SN+, and

  2. the map [F]SF from WN to SN+ is injective and so it determines a well-defined embedding of WN into SN+.

The proof is given in the Appendix.

For Gaussians with zero mean, it can be shown that the Fisher information metric at a point with covariance matrix S is

X,Y=12trS-1XS-1Y,

where XY are matrices in the tangent space at S, i.e. symmetric matrices (Lenglet et al. 2006). This expression is obtained by evaluating the integral in Eq. (4.3). Working in some fixed maximal orthant parametrized by edge lengths , let S be the corresponding covariance matrix defined in Eq. (4.2). For each edge eF define the split matrix σe by

σuve=1,ifelies on the path from leafuto leafv,and0,otherwise, 4.4

for u,v=1,,N. Then the path length between u and v is uv=eeσuve where the sum is over all edges in F. Equation (4.2) becomes

SF=S=(eexp(-eσuve))u,v=1N. 4.5

An entry above is zero if u and v are in different connected components, or equivalently, if they are separated by an infinitely long edge.

By differentiating Eq. (4.5), it can be seen that the tangent space at S is spanned by matrices of the form σeS for each edge e, where denotes the Hadamard matrix product. The Fisher information metric (4.3) for WN for a fully resolved tree becomes

gij()=12tr(S-1(Sσi)S-1(Sσj)) 4.6

where i,j=1,,2N-3 index edges. Algebraic expressions for the first and second derivatives of the Fisher information metric can similarly be obtained, and hence for the Christoffel symbols. Note that scaling S by some positive constant has no effect on the metric, and so the factor of 1/4 difference between the covariance matrices obtained from the discrete process X(t) and continuous process Z(t) has no effect on the geometry.

The inner product and its derivatives can be computed in polynomial time and the paths obtained by integrating the geodesic ODE for the continuous Markov model within orthant interiors in WN resemble those for the two-state model very closely. Namely, results for the same initial conditions as Fig. 4 were obtained but omitted, since the plots were almost indistinguishable from those for the two-state model. However, the inner products defined using the two different models are not identical: both inner products can be written down explicitly in the case N=2, using the transition probabilities for the discrete model and Eq. (4.6) for the continuous model. The two inner products differ when the length of the single edge in the tree is small, but converge as the edge length tends to infinity.

Using Eq. (4.6) and its derivatives, we derived algebraic expressions for the Riemannian curvature tensor and the sectional curvatures at any point in WN. We implemented these expressions in R, and hence evaluated these quantities for certain trees in W5. We found that at randomly selected points in W5, and hence in all spaces with N5, the sectional curvatures had mixed signs. As a result, there is no global sign condition on curvature like that for BHV tree space, which is globally non-positively curved.

Geometry via embedding in SN+

The information geometry on Gaussians with zero mean can equivalently be regarded as a geometry for the space of N×N symmetric positive definite matrices SN+. This is a useful viewpoint to adopt, first because it highlights the fact that the geometry we develop on WN is based entirely on the covariance between the leaves induced by the Markov process Z(t), and secondly, because other metrics on SN+ have been studied (Dryden et al. 2009). These alternative metrics on SN+ could in turn define different metrics on WN, although they will not be considered any further in this paper. The metric on SN+ obtained from the information geometry of Gaussian distributions with zero mean will be denoted dcov. The metric and its associated geodesics in SN+ can be computed in polynomial time (Lenglet et al. 2006). The main idea in this section is to use the analytically tractable geometry in SN+, combined with a projection algorithm from the ambient space SN+ to the embedded space WN, to construct approximate geodesics in WN.

Given the embedding [F]SF of WN within SN+, we can consider the intrinsic metric dcov on WN induced by dcov. By construction, the induced metric corresponds to the information geometry on WN for the continuous Markov model considered in Sect. 4. The following theorem is analogous to Theorem 3.1 for the discrete Markov substitution models. A proof is given in the appendix.

Theorem 5.1

For any [F],[G]WN the induced intrinsic metric dcov([F],[G]) is finite and therefore well-defined. Any path which realizes the distance dcov([F],[G]) is a solution of Eq. (3.4) at any point in the interior of a maximal orthant, where the Riemannian inner product is given by Eq. (4.5).

Lenglet et al. (2006) give formulae for the distance and geodesics between pairs of points in SN+. The distance between S1,S2SN+ is defined by

dcov(S1,S2)2=12tr(log(S1-1/2S2S1-1/2)2) 5.1

where log denotes the matrix logarithm.

Since the map from WN to SN+ is injective, dcov pulls back to define an extrinsic metric on WN:

dcov([F1],[F2])=dcov(SF1,SF2). 5.2

In fact, the space SN+ equipped with dcov has globally non-positive curvature (Skovgaard 1984; Ballmann et al. 1985) and so there is a unique geodesic between any two points in the ambient space. The point at proportion t[0,1] along the geodesic between S1,S2SN+ is

ΓS1,S2(t)=S11/2exp(tU)S11/2whereU=log(S1-1/2S2S1-1/2). 5.3

Equations (5.1) to (5.3) involve eigen-decompositions of N×N matrices, and so can be computed in O(N4) steps.

Figure 7 shows a comparison of BHV metric dBHV, tropical metric, path distance metric, Jensen–Shannon metric dJS and dcov for every pair of trees in a sample of 100 trees obtained by bootstrap replication during maximum likelihood inference of a phylogenetic tree. The trees were inferred using the MrBayes software (Huelsenbeck and Ronquist 2001), and a sample data set of DNA from 12 primates provided with the software. The path difference metric (Steel and Penny 1993) between trees T,T is (u,v(uv-uv)2)1/2. The Jensen–Shannon metric was calculated exactly by summing over all 212 characters for the two-state model, as described in Sect. 3.2. The covariance metric was calculated using Eqs. (5.1) and (5.2). The BHV and tropical metrics are quite closely correlated, as are the Jenson–Shannon and covariance matrices. The path difference metric has a similar correlation with the BHV, tropical and Jensen–Shannon metrics (approximately 0.7–0.8), but is relatively weakly correlated with dcov. This suggests that the BHV and tropical metrics are based on features of the data which are rather distinct from those for the Jensen–Shannon and covariance metrics. The covariance metric has the advantage over the Jensen–Shannon metric of being computable in polynomial time.

Fig. 7.

Fig. 7

Comparison of the BHV metric dBHV, tropical metric, path difference metric, Jensen–Shannon metric dJS and dcov for every pair from a sample of 100 trees obtained from bootstrap replicates during maximum likelihood estimation. The trees were inferred from DNA data from 12 species of primate. The correlation coefficients are shown in the bottom left panels

Projection into wald space

To approximate geodesics in the extrinsic covariance metric dcov of WN, we construct a projection from SN+ onto WN, that is, given S0SN+, we aim to find an element [F]WN which minimizes dcov(S0,SF). Suppose that F is a fully resolved tree with edge lengths . The expression for dcov(S0,SF)2 can be differentiated with respect to edge lengths of F and gives

idcov(S0,SF)2=tr(logS0-1/2SFS0-1/2S01/2SF-1(iSF)S0-1/2)

where i=/i (e.g. Moakher (2005)). Moreover, iSF=SFσi where denotes the Hadamard or element-wise matrix product and σi is the split matrix associated with edge i, as defined in Eq. (4.4).

This analytic expression for the derivative can be used to implement a gradient descent algorithm. Within each maximal orthant Oτ the edge lengths were updated according to the rule

k+1=k-αkdcov(S0,Sk)2

where k denotes the edge lengths at iteration k and Sk the corresponding covariance matrix. The step size αk was determined using the Barzilai–Borwein method. Two versions of the algorithm were used. In the first, the algorithm was halted whenever an internal edge was assigned a negative length. As a result, the algorithm was constrained to lie within the orthant Oτ containing the initial tree. This algorithm was used for N=5 by running the algorithm 15 times, each time with an initial tree in one of the 15 maximal orthants of U5, and the overall tree closest to S0 found. The second version of the algorithm was able to cross codimension-1 BHV boundaries as follows. If an edge length was assigned a negative value, then trees in the two corresponding neighbouring orthants to τ were considered, taking the absolute value of elements in k+1 as edge lengths. The tree at step k+1 was taken to be whichever of these two trees was closest to S0.

In general, the closest point in WN to a covariance matrix S0SN+ is not necessarily unique as illustrated in Fig. 8. Moreover, the gradient descent algorithm can converge to local minima, and so the result obtained is sensitive to the tree used to initialize the algorithm.

Fig. 8.

Fig. 8

Schematic diagram for N=4 showing three neighbouring orthants (curved heavy lines) embedded in SN+. The black circle is the BHV boundary between the orthants. The extrinsic geodesic between trees [F1],[F2] is depicted as a straight line segment. The projection of this segment consists of a path from [F1] to [G1] within the orthant, but then jumps to [G2] in a different orthant. The dashed lines show the orthogonal projection of the point S along the extrinsic geodesic, and S is equidistant from [G1] and [G2]

Construction of geodesics in WN via projection of extrinsic geodesics

Since construction of geodesics in SN+ between any two given points and projection from SN+ into WN can both be performed efficiently, we aim to combine these algorithms to give an efficient means of constructing geodesics within WN between any two given end points. A naive approach, given [F1],[F2]WN, is to simply project the extrinsic geodesic between SF1 and SF2 from SN+ into WN. This approach works for the example of the unit sphere S2 embedded within R3: the projection of the chord between two points in the sphere is a great circle between those two points. However, this approach fails for WNSN+ since the projected paths are often discontinuous and jump between different orthants, as illustrated in Fig. 8.

The following recursive algorithm for constructing an approximate geo-desic in WN is therefore proposed, which is intended to overcome this issue. Let ti=i/k for i=0,,k and suppose [F1],[F2]WN. The algorithm outputs a sequence [G0],,[Gk]WN where [G0]=[F1] and [Gk]=[F2]. For each iteration i=1,,k-1 of the algorithm, the following steps are performed.

  1. Compute the extrinsic geodesic Γ from SGi-1 to SGk using Eq. (5.3).

  2. Find the point SSN+ at proportion 1/(k-i+1) along Γ.

  3. Let [Gi] be the projection of S into WN.

The idea is that at each iteration, a new extrinsic geodesic is constructed from the previous point [Gi-1] to the destination [F2], a small step is taken along that geodesic, and that point is projected into WN to give [Gi]. For the results in this paper, the projection at Step 3 was performed using the second version of the projection algorithm described in Sect. 5.1, rather than using the less efficient search over all orthants. The gradient descent for the projection to obtain [Gi] at Step 3 was initialized using the edge lengths from the forest [Gi-1].

This algorithm has the disadvantage that it is not symmetric under swapping the end points [F1],[F2], whereas the image of the geodesic should be invariant under this operation.

The following symmetrized version of the algorithm overcomes this issue.

The algorithm produces a sequence [G0],,[Gk],[Hk],,[H0]WN where the initial values are taken to be [G0]=[F1] and [H0]=[F2]. For each iteration i=1,,k-1 of the algorithm, the following steps are performed.

  1. Compute the extrinsic geodesic Γ from SGi-1 to SHi-1 using Eq. (5.3).

  2. Find the points R,SSN+ at proportions 1/(k-i+1) and 1-1/(k-i+1) along Γ.

  3. Let [Gi] and [Hi] be the projections of R and S into WN respectively.

The quality of the approximate geodesics produced by the symmetrized algorithm can be assessed by comparing them to geodesics in a single orthant constructed by integrating the geodesic equation as described in Sect. 4.2. Given an initial tree [F1]W5 and an initial velocity for , the geodesic equation was integrated until the path obtained reached some specified length. The final point reached was taken to be [F2], and the symmetrized algorithm was then used to obtain an approximate geodesic between [F1] and [F2]. In all cases, the paths obtained with the two methods matched very closely, with the quality of the match improving for shorter internal edge lengths. Figure 9 shows typical results.

Fig. 9.

Fig. 9

Comparison of paths obtained by a integrating the geodesic ODE for the Gaussian process model and b by applying the symmetrized projection method to the end points obtained in (a)

In contrast to the methods presented in Sects. 3 and 4, these algorithms are not based on ‘firing’ geodesics, and can produce approximate geodesics between end points in different orthants. Figures 10 and 11 show results obtained when the end points are separated by 1 or 2 nearest neighbour interchange operations respectively in W5. More precisely, we consider trees corresponding to Newick strings

F1:((1:0.1,2:0.1):6,3:0.1,(4:0.1,5:0.1):7),F2:((2:0.1,3:0.1):6,1:0.1,(4:0.1,5:0.1):7)andF3:((2:0.1,3:0.1):6,5:0.1,(1:0.1,4:0.1):7). 5.4

In both figures, the approximate geodesics constructed using the recursive algorithm are not symmetric under interchange of the end points, and differ from the paths obtained using the symmetric algorithm. The lengths of the paths can be computed by using large k and summing dcov between successive points in the output. In all the examples we explored, the symmetrized algorithm produced shorter paths. The approximate geodesics shown in the figures are significantly different from BHV geodesics, which consist of straight (or once broken) lines between the end points in both plots.

Fig. 10.

Fig. 10

Comparison of approximate geodesics in W5 constructed beween trees F1 and F2 from (5.4) in neighbouring orthants. The vertical axis 7 represents a codimension-1 BHV boundary between two orthants. When, due to a nearest neighbor interchange, crossing it, 6 tends to zero, another edge appears, with negative length corresponding to the negative values on the 6 axis. Three approximate geodesics are shown: (i) construction via the recursive algorithm from F1 and F2, (ii) using the same algorithm but reversing the end-points, and (iii) construction via the symmetrized algorithm

Fig. 11.

Fig. 11

Comparison of approximate geodesics constructed between trees F1 and F3 from (5.4) in orthants separated by two nearest neighbour interchanges. Three neighbouring orthants in W5 are shown, and the bottom right-hand orthant does not correspond to a valid tree topology. As in Fig. 10, negative values on axes correspond to negative lengths of new edges

We apply the symmetrized algorithm to investigate the distance from trees in the interior of a maximal orthant to the star stratum on the boundary of that orthant. If [G1],[G2],,[Gk]WN is an approximated geodesic between [G1] and [Gk] computed by the symmetrized algorithm, the intrinsic distance between [G1] and [Gk] can be approximated by dcov([G1],[Gk])i=1k-1dcov([Gi],[Gi+1]). Consider the following setup. For λ0(0,1], let G1=G(λ0) be the forest corresponding to the Newick string ((1:λ0,2:λ0):λ0,(3:λ0,4:λ0)) in λ-parametrization. This is a fully resolved 4-taxon tree on which each edge has weight λ0. By symmetry, the edges on the tree in the star stratum closest to G(λ0) must all have equal weight λ(0,1]. Thus, let Gk=F(λ) be the star tree corresponding to the Newick string (1:λ,2:λ,3:λ,4:λ), again in λ-parametrization. Figure 12 shows for each λ0{0.1,0.5,0.9,0.95} the approximated values of dcov([G(λ0)],[F(λ)]) against λ. Obviously, F(λ) is closest for λ slightly larger than λ0 with distance decreasing as λ01. This suggests that the star stratum is closer to the tree G(λ0) than the forest consisting of 4 isolated points (obtained from F(λ) as λ1), even for λ0 values close to 1. Note, though, that the forest is a boundary point of the star stratum. For any G(λ0) the distance to F(λ) tends to infinity as λ0. Indeed, SF(0)S4+ is not of full rank.

Fig. 12.

Fig. 12

Approximated distances λdcov([G(λ0)],[F(λ)]) for different values of λ0

Discussion

In order to do statistical inference on data sets of phylogenetic trees one needs a structure rich enough to enable the use of geometric statistical methods. Recent research has produced geometries such as the BHV and tropical tree spaces and statistical methods adapted to these geometries. Based on a more principled set of underlying assumptions by regarding phylogenetic trees as probability models for genetic sequence data, we have developed a canonical and biologically motivated geometry on tree space by applying tools from information geometry, giving the wald space. In particular, unlike previous related work (Garba et al. 2018) in which various extrinsic metrics were considered, in this paper we have focused on developing intrinsic metrics and their associated geodesics to explore and to enable accessing the geometry, for this is a key ingredient for statistical inference on non-Euclidean spaces.

There are two main difficulties with achieving our aim. First, the discrete-valued Markov process on trees with genetic alphabet Ω characterizes trees as probability models with sample space ΩN where N is the number of phylogenetic taxa. Therefore, calculations of distances and construction of geodesics involve summations over |Ω|N terms, which becomes infeasible for large N. In order to establish computational tractability, we generalized the discrete-valued probability model to a continuous-valued Markov process in a canonical way and applied the information geometry again.

Secondly, information geometry is formulated for parametrized probability models that are a manifold, whereas tree space is a union of manifolds having different dimensions due to the orthants (representing forests with different number of edges) being glued together in a certain way. One has to be careful to compare the structure to the one defined in Moulton and Steel (2004), for example, as we are not including forests with coincident leaves and furthermore we consider a different topology induced by the Fisher information metric. We tackled this issue of not having a single connected parametrized manifold by using the continuous-valued Markov models to embed wald space in the ambient space of symmetric positive definite matrices, which has an analytically tractable geometry and thus allows for approximation of geodesics in the embedded space WN. Our computational results show that the geometry obtained is significantly different from the BHV and tropical geometries, partly due to the inclusion of trees with infinitely long edges in wald space.

Several questions about the geometry of the wald space WN remain. While we have shown that trees with infinitely long edges are a finite distance away from other trees (Theorems 3.1 and 5.1), computational results suggest that parts of this subspace are repulsive and are avoided by geodesics (see Figs. 6 and 12). An explanation for this behaviour might be obtained via calculations or results about curvature for such points of WN, but further investigation is required. Secondly, Theorems 3.1 and 5.1 establish WN as length spaces for the two induced intrinsic metrics we study. It is desirable to strengthen these results and prove that the distance between every pair of points in the space is realized by at least one path, so that our wald space is a geodesic metric space as opposed to a length space. It appears that such a proof requires thorough analysis of the condition on edge weights which excludes trees with coincident leaves. Furthermore, the methods and results presented in Sect. 5.2 represent a first step towards the development of more sophisticated and efficient algorithms for the construction of information geodesics in wald space via the embedding in the space of covariance matrices. A more thorough evaluation of the computational cost as N increases could be carried out, and a more rigorous treatment might establish convergence properties for the symmetrized algorithm. Alternatively, existing algorithms taken from computational Riemannian geometry could be adapted to work in wald space (see Schmidt et al. (2006) for example) and might offer better performance.

The underlying motivation for this work has been to obtain a novel geometric framework for the space of phylogenetic trees which has more principled biological justification than existing geometries, and which can be used to develop statistical methods for analysing data sets of trees. Ultimately, realizing this aim is still some way off. For example, given a sample of points {x1,,xn}X in a metric space (Xd), the Fréchet mean x¯X is a point which minimises the sum of squared distances to the data:

x¯=arg minxXi=1nd(x,xi)2.

In general, the Fréchet mean does not always exist, or it can fail to be unique, but in globally non-positively curved spaces such as (SN+,dcov) and (UN,dBHV) there exists a unique Fréchet mean (Bridson and Haefliger 2011). Development of methods for calculation of a Fréchet mean using an intrinsic information metric in wald space seems very challenging, and the curvature calculations in Sect. 4.2 have implications for the existence and uniqueness of Fréchet means. On the other hand, given any sample of trees in WN, there is a unique inrinsic Fréchet mean in SN+ and an algorithm for computing the mean is given by Lenglet et al. (2006). Our projection algorithm could be used to project this to an extrinsic mean back into WN. Properties of the projected Fréchet mean tree could be investigated.

In comparison to the BHV and tropical metrics, the intrinsic information metrics have the advantage of interpretability in terms of genetic substitutions and the distributions of characters represented by two trees. This suggests the information metrics might be better suited for statistical tasks such as hypothesis testing. In the BHV and tropical geometries, contraction and expansion of edges offer the means of moving between different topologies. In the wald space, additional topological transformations are possible via expanding edges to infinite length, and these correspond to tree bisection and reconnection (TBR) operations. Many applications in phylogenetics require searches over the space of phylogenetic trees, and movement along information geodesics in the wald space might have advantages over existing methods.

Acknowledgements

The second and the last author express their thanks to the Oberwolfach 1804 meeting “Statistics for Data with Geometric Structure” in which wald space was first discussed. The last two authors gratefully acknowledge support from DFG GRK 2088. The last author was supported by the Niedersachsen Vorab of the Volkswagen Foundation.

Appendix A: Calculation of pT(s) and its derivatives

The probability pT(s) of any binary character s on a tree TUN can be computed efficiently via the following algorithm (Semple and Steel 2003), often called the Felsenstein pruning algorithm. First, an arbitrary internal vertex v0T is chosen and used to root the tree. The two-state symmetric model is a reversible Markov process, and so the choice of the root does not affect the value of pT(s). The root determines ancestral relations on the tree, and we let Tv denote the subtree of T descended from vertex v. We let Lv denote the leaves of Tv, and given a binary characer s, let sv denote the restriction of s to Lv. Finally, we let pTv(sv|ω) be the probability of sv on Tv given the letter ω{0,1} at v:

pTv(sv|ω)=PruLvX(u)=s(u)|X(v)=ω,

since sv(u)=s(u) for all uLv. The theorem of total probability gives

pT(s)=12ω{0,1}pTv0(sv0|ω). 6.1

For an interior vertex v, if we let vi, i=1,,m be the vertices immediately descended from v via edges of length i, then the transition probabilities in Eq. (2.1) give

pTv(sv|ω)=i=1m12((1+e-i)pTvi(svi|ω)+(1-e-i)pTvi(svi|ω¯)) 6.2

where ω¯=1-ω. For a leaf u, we have pTu(su|ω)=pTu(s(u)|ω)=1 if s(u)=ω and zero otherwise. This and Equation (6.2) can be applied recursively to compute the terms pTv(sv|ω) for each vertex vT, starting at the leaves and working up the tree to the root v0. Finally, pT(s) can be computed using Equation (6.1), and it follows from the recursion that pT(s) is a multivariate polynomial with arguments of the form 1+e-k and 1-e-k, where k ranges over the edges of T. The coefficients of the polynomial depend on the topology of T.

Equations (6.1) and (6.2) can also be used to differentiate pT(s) with respect to any edge length parameter. These derivatives are required in Sect. 3. Suppose e is an edge of T and we wish to compute the derivative pT(s)/e. Since we are free to choose v0, the calculation is simplified if we let v0 be an internal vertex at one end of edge e. We can order the vertices vi attached to v0 so that e=(v0,v1). Equation (6.2) then gives

pTv0(sv0|ω)e=12e-e(pTv1(sv1|ω¯)-pTv1(sv1|ω))×i=2degv012((1+e-i)pTvi(svi|ω)+(1-e-i)pTvi(svi|ω¯)),

where the pTvi terms can be calculated recursively. Second derivatives of the mass function can be calculated analytically in a similar way.

Appendix B: Proof of Lemma 2.2

First, suppose F1F2. The BHV boundary rule does not affect the distribution on characters induced by a tree, because the same distribution is obtained whether an edge of length zero is present in a tree or not. Similarly, if F1,F2 are equal modulo an application of the boundary rule at infinity, then pF1(s)=pF2(s) since an edge with weight 1 results in independence between the letters at leaves at either side of the edge under the transition probabilities in Eq. (2.1). Specifically, if v0,v1 are vertices at the ends of an edge e with λe=1, then

X(v1)|X(v0)=ωBern(1/2)

where ω{0,1}, so the conditional distribution of X(v1) is the same as its marginal. The map [F]pF from elements of WN to distributions on characters is therefore well-defined. In fact, the work of Allman et al. (2008) shows the map is injective, and this establishes the lemma.

Appendix C: Proof of Lemma 3.1

The Riemannian metric in Eq. (3.3) can be expanded as

δigij()δj=sp(s)(δiilogp(s))(δjjlogp(s))=sp(s)(δi1p(s)p(s)i)(δj1p(s)p(s)j)=s1p(s)(δip(s)i)(δjp(s)j).

The Taylor expansion of p(s) is

p+δ(s)-p(s)=iδip(s)i+O(|δ|2).

Substituting this into the expression for the Riemannian metric gives

i,jδigij()δj=s(p+δ(s)-p(s)+O(|δ|2))2p(s)=s(p+δ(s)-p(s))2p(s)+O(|δ|3)

since p+δ(s)-p(s) is O(|δ|). On the other hand, a Taylor expansion of f around 1 gives

Df(p+δ;p)=sp(s)fp+δ(s)p(s)=sp(s)(f(1)+f(1)p+δ(s)-p(s)p(s)+12f(1)(p+δ(s)-p(s)p(s))2+O(|p+δ(s)-p(s)p(s)|3))=f(1)+f(1)sp+δ(s)-f(1)sp(s)+12f(1)s(p+δ(s)-p(s))2p(s)+O|δ|3.

The first three terms vanish since f(1)=0 and since p(s)=1 for all . It follows that

i,jδigij()δj=2f(1)Df(p+δ;p)+O(|δ|3)

and the lemma is established.

Appendix D: Proof of Theorem 3.1

We need to show that the induced intrinsic metric d([F1],[F2]) is finite for any [F1],[F2]WN. Start by choosing the representative F1WN for the equivalence class [F1] to be a connected tree with edge weights λ1, some elements of which might have value 1. The tree F1 can be continuously deformed within the orthant corresponding to its topology, to the star tree F on which all pendant edges have weight λ=1/2, by changing λ along the obvious linear path. If the path has finite length, then the first part of the theorem has been established, since any [F1] and [F2] can be joined to [F] in this way. As shown in Remark 2.1, each pλ(s) for λ along the path from F1 to F is a polynomial in λ. It follows that if λ and λ+δλ represent the edge weights at two nearby points on the path then

pλ+δλ(s)-pλ(s)=πi(s,λ)δλi+O(|δλ|2)

where for each character s and i=1,,2N-3, πi(s,λ) is a polynomial in λ. Then

(pλ+δλ(s)-pλ(s))2=(πi(s,λ)δλi)2+O(|δλ|3).

The path distance between any pair of leaves is continuous along this path, and is strictly positive since the pendant edge lengths are non-zero at all points on the path, apart from potentially at F1. Pendant edge lengths can be zero on F1, but by the definition of WN, the path distance between leaves is non-zero. It follows that pλ(s) is also bound away from zero. Thus there is a constant C(s) such that

(pλ+δλ(s)-pλ(s))2pλ(s)C(s)(πi(s,λ)δλi)2C(s)(iπi(s,λ)2)δλ2

where the second line comes from the Cauchy-Schwarz inequality and the norm is the Euclidean norm. Since the πi(s,λ) are polynomials in λ and the elements of λ lie between 0 and 1, the πi(s,λ) are bounded from above and we obtain

(pλ+δλ(s)-pλ(s))2pλ(s)B(s)δλ2

for some constant B(s). Now suppose that Df=df2 is a f-divergence, where df is a metric. Applying Eq. (3.6) from Lemma 3.1 with the λ-parametrization gives

df2(pλ+δλ,pλ)=12f(1)s(pλ+δλ(s)-pλ(s))2pλ(s)+O(|δλ|3)12f(1)sB(s)δλ2+O(|δλ|3)Kδλ2 6.3

for some constant K. Thus the infinitesimal path length in WN as measured by the metric d is bounded by some multiple of the Euclidean path length on λ. The length of the linear path from F1 to F measured with d is therefore finite, since the Euclidean length of this path is finite, and hence d([F1],[F2]) is finite.

For the second part of the theorem, suppose that Df0=df02 is a f0-divergence, where df0 is a metric. Further, suppose that F1 and F2 are given by λ1 and λ2, respectively. It suffices to consider λ1,λ2 from the same topology and sufficiently close such that the image tλ(t), λ(0)=λ1 and λ(1)=λ2 of the geodesic from pλ1 to pλ2 in the metric induced by the Riemannian metric gij lies fully in a convex λ coordinate patch and has finite Euclidean length there, say L. Hence, for every nN, there are δλ(j) with δλ(j)L/n, j{1,,n} such that λ2=λ1+j=1nδλ(j). Then the second assertion of the theorem follows from (6.3), setting c=f(1)/f0(1), as n, because

|df(pλ1,pλ2)2-c·df0(pλ1,pλ2)2|=|j=1n-1O(|δλ(j)|3)|=OLn2.

The second equality sign holds because the constants in the individual summands O(|δλ(j)|3) (1jn) can be bounded by the supremum of absolute values of the gradient of pλ with respect to λ, in the coordinate patch, as can be seen from the last lines of the proof of Lemma 3.1.

The third part of the theorem, which states that minimal length paths satisfy the geodesic equation locally, is part of the standard theory for Riemannian geometry on manifolds, e.g. (Lee 1997, Sect. 4).

Appendix E: Proof of Theorem 4.1

The theorem is trivial for N=2, so suppose FWN with N3 and that the first assertion holds for all GWN-1. The matrix SF is not changed by inserting edges e with λe=1 to connect trees in F, or by adding edges with λe=0, so without loss of generality we may assume F is a fully resolved tree. We may also assume there is a cherry between leaves N-1 and N since each bifurcating tree with N3 has a cherry and permuting the labels of F results in a tree with covariance PTSFP (with permutation matrix P), where positive definiteness is preserved.

Let eN-1 and eN be the edges incident to leaves N-1 and N, respectively. Since F is bifurcating, there is exactly one edge, say e0, incident to eN-1 and eN. Let SF=(suv)u,v=1N. The tree GWN-1 obtained by deleting eN and leaf N and merging e0 and eN-1 to e~ with weight λe~=1-(1-λe0)(1-λeN-1) has covariance SG=(suv)u,v=1N-1, which is by induction positive definite. Using this and Sylvester’s criterion (that a matrix is positive if and only if all principal minors have positive determinant) it suffices to show det(SF)>0. We have for all 1uN-1 that suN=sNu=(1-λeN)cu, where

cu=1-λeN-1whenu=N-1,andeeN,eN-1(1-λe)σNuefor1uN-2

and σNue is defined by Eq. (4.4). Note that for u,vN-1, cu and suv do not involve λeN, and that sNN=1. If SN denotes the set of permutations of {1,,N}, then the Leibniz formula for determinants gives

det(SF)=(τSNτ(N)=Nsgn(τ)u=1Nsuτ(u))+(τSNτ(N)Nsgn(τ)u=1Nsuτ(u))=det((suv)u,v=1N-1)+(1-λeN)2(τSNτ(N)Nsgn(τ)cτ(N)cτ-1(N)u=1uτ-1(N)N-1suτ(u)),

so det(SF) is linear in x:=(1-λeN)2. By symmetry of the cherry, det(SF) is also linear in y:=(1-λeN-1)2 as well. We write g(x,y)=det(SF). For x=0, we have sNu=0 for u<N and sNN=1, so g(0,y)=det(SF)=det(SG)>0 for all y[0,1], and similarly g(x,0)>0 for all x[0,1]. Furthermore, g(1,1)=0, since in that case the last two rows of SF coincide. Since g is linear in x and in y, respectively, we have g(x,y)>0 for all (x,y)[0,1]2\{(1,1)}, so that det(SF)>0 for all (λeN-1,λeN)[0,1]2\{(0,0)}. If λeN-1=λeN=0, we would have dN(N-1)=0, but this is not allowed by the definition of WN.

We also need to show that the map [F]SF is injective on WN where [F] denotes the equivalence class of FWN. This is trivial, however, since whenever F1,F2WN are in different equivalence classes, the matrix of distances between the leaves is different.

Appendix F: Proof of Theorem 5.1

The proof is similar to that for Theorem 3.1, and so we give a brief sketch. We consider the same path between the trees [F1],[F]WN, and show that each element of gij(λ) is bound from above along the path. Working in the λ-parametrization of an orthant, Eq. (4.5) becomes

Sλ=(e(1-λe)σuve)u,v=1N.

Each element of the matrix is therefore a polynomial in the elements of λ, and their derivatives with respect to λ are also polynomials. Recalling that the tangent space of WN at Sλ in a maximal orthant is spanned by iSλ, where i{1,,2N-3} ranges over the edges in that maximal orthant, Eq. (4.6) becomes

gij(λ)=12tr(Sλ-1(iSλ)Sλ-1(jSλ)),

i,j{1,,2N-3}.

Applying the Cauchy-Schwartz inequality |tr(ATB)|2tr(ATA)tr(BTB) gives

|gij(λ)|214tr((iSλ)2Sλ-2)tr((jSλ)2Sλ-2)14tr(Sλ-4)tr((iSλ)4)12tr((jSλ)4)12.

The first term in this product is bounded on a geodesic path from F1 to F, since Sλ is positive definite and its eigenvalues are bound away from zero. The other two terms are also bounded from above, because the derivatives of Sλ are polynomials in λ, Thus |gij(λ)|C for some constant C at all points along that path, and the same argument as for Theorem 3.1 shows that dcov([F1],[F]) is finite.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Footnotes

1

This space was first discussed by the authors at the Oberwolfach 1804 meeting “Statistics for Data with Geometric Structure” in the Schwarzwald (Black Forest) in 2018.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

M. K. Garba, Email: m.k.garba1@ncl.ac.uk

T. M. W. Nye, Email: tom.nye@ncl.ac.uk

J. Lueg, Email: jonas.lueg@stud.uni-goettingen.de, Email: lueg@math.uni-goettingen.de

S. F. Huckemann, Email: huckeman@math.uni-goettingen.de

References

  1. Adams RH, Castoe TA. Probabilistic species tree distances: implementing the multispecies coalescent to compare species trees within the same model-based framework used to estimate them. Syst Biol. 2020;69(1):194–207. doi: 10.1093/sysbio/syz031. [DOI] [PubMed] [Google Scholar]
  2. Allen BL, Steel M. Subtree transfer operations and their induced metrics on evolutionary trees. Ann Comb. 2001;5(1):1–15. doi: 10.1007/s00026-001-8006-8. [DOI] [Google Scholar]
  3. Allman ES, Ané C, Rhodes JA. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv Appl Probab. 2008;40(1):229–249. doi: 10.1239/aap/1208358894. [DOI] [Google Scholar]
  4. Ballmann W, Gromov M, Schroeder V. Manifolds of nonpositive curvature. Progress in mathematics. Basel: Birkhäuser; 1985. [Google Scholar]
  5. Bačák M. Computing medians and means in Hadamard spaces. SIAM J Optim. 2014;24(3):1542–1566. doi: 10.1137/140953393. [DOI] [Google Scholar]
  6. Billera L, Holmes S, Vogtman K. Geometry of the space of phylogenetic trees. Adv Appl Math. 2001;27:733–767. doi: 10.1006/aama.2001.0759. [DOI] [Google Scholar]
  7. Bridson MR, Haefliger A. Metric spaces of non-positive curvature. Berlin: Springer; 2011. [Google Scholar]
  8. Bryant D, Galtier N, Poursat M-A. Likelihood calculation in molecular phylogenetics. In: Gascuel O, editor. Mathematics of evolution and phylogeny. Oxford: Oxford University Press; 2005. pp. 33–62. [Google Scholar]
  9. Dryden IL, Koloydenko A, Zhou D, et al. Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Ann Appl Stat. 2009;3(3):1102–1123. doi: 10.1214/09-AOAS249. [DOI] [Google Scholar]
  10. Engström A, Hersh P, Sturmfels B. Toric cubes. Rendiconti del Circolo Matematico di Palermo. 2013;62(1):67–78. doi: 10.1007/s12215-013-0115-9. [DOI] [Google Scholar]
  11. Feragen A, Owen M, Petersen J, Wille M, Thomsen L, Dirksen A, de Bruijne M (2013) Tree-space statistics and approximations for large-scale analysis of anatomical trees. In: 23rd biennial international conference on information processing in medical imaging (IPMI) [DOI] [PubMed]
  12. Garba MK (2019) Information geometry for phylogenetic trees. Ph.D. thesis, School of Mathematics, Statistics and Physics, Newcastle University
  13. Garba MK, Nye TMW, Boys RJ. Probabilistic distances between trees. Syst Biol. 2018;67(2):320–327. doi: 10.1093/sysbio/syx080. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Gill J, Linusson S, Moulton V, Steel M. A regular decomposition of the edge-product space of phylogenetic trees. Adv Appl Math. 2008;41(2):158–176. doi: 10.1016/j.aam.2006.07.007. [DOI] [Google Scholar]
  15. Hansen TF, Martins EP. Translating between microevolutionary process and macroevolutionary patterns: the correlation structure of interspecific data. Evolution. 1996;50(4):1404–1417. doi: 10.1111/j.1558-5646.1996.tb03914.x. [DOI] [PubMed] [Google Scholar]
  16. Huelsenbeck JP, Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17(8):754–755. doi: 10.1093/bioinformatics/17.8.754. [DOI] [PubMed] [Google Scholar]
  17. Kim J. Slicing hyperdimensional oranges: the geometry of phylogenetic estimation. Mol Phylogenet Evol. 2000;17(1):58–75. doi: 10.1006/mpev.2000.0816. [DOI] [PubMed] [Google Scholar]
  18. Lee JM. Riemannian manifolds: an introduction to curvature. Berlin: Springer; 1997. [Google Scholar]
  19. Lenglet C, Rousson M, Deriche R, Faugeras O. Statistics on the manifold of multivariate normal distributions: theory and application to diffusion tensor MRI processing. J Math Imaging Vis. 2006;25(3):423–444. doi: 10.1007/s10851-006-6897-z. [DOI] [Google Scholar]
  20. Lin B, Yoshida R. Tropical Fermat-Weber points. SIAM J Discrete Math. 2018;32(2):1229–1245. doi: 10.1137/16M1071122. [DOI] [Google Scholar]
  21. Lin B, Monod A, Yoshida R (2018) Tropical foundations for probability and statistics on phylogenetic tree space. arXiv preprint arXiv:1805.12400
  22. Miller E, Owen M, Provan JS. Polyhedral computational geometry for averaging metric phylogenetic trees. Adv Appl Math. 2015;68:51–91. doi: 10.1016/j.aam.2015.04.002. [DOI] [Google Scholar]
  23. Moakher M. A differential geometric approach to the geometric mean of symmetric positive-definite matrices. SIAM J Matrix Anal Appl. 2005;26:735–747. doi: 10.1137/S0895479803436937. [DOI] [Google Scholar]
  24. Moulton V, Steel M. Peeling phylogenetic oranges. Adv Appl Math. 2004;33(4):710–727. doi: 10.1016/j.aam.2004.03.003. [DOI] [Google Scholar]
  25. Nye TMW. Principal components analysis in the space of phylogenetic trees. Ann Stat. 2011;39(5):2716–2739. doi: 10.1214/11-AOS915. [DOI] [Google Scholar]
  26. Nye T. An algorithm for constructing principal geodesics in phylogenetic treespace. IEEE ACM Trans Comput Biol. 2014;11(2):304–315. doi: 10.1109/TCBB.2014.2309599. [DOI] [PubMed] [Google Scholar]
  27. Nye TMW, Tang X, Weyenberg G, Yoshida R. Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees. Biometrika. 2017;104(4):901–922. doi: 10.1093/biomet/asx047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Owen M, Provan JS. A fast algorithm for computing geodesic distances in tree space. IEEE ACM Trans Comput Biol. 2011;8(1):2–13. doi: 10.1109/TCBB.2010.3. [DOI] [PubMed] [Google Scholar]
  29. Rogers JS. On the consistency of maximum likelihood estimation of phylogenetic trees from nucleotide sequences. Syst Biol. 1997;46(2):354–357. doi: 10.1093/sysbio/46.2.354. [DOI] [PubMed] [Google Scholar]
  30. Sason I, Verdu S. f-divergence inequalities. IEEE Trans Inf Theory. 2016;62(11):5973–6006. doi: 10.1109/TIT.2016.2603151. [DOI] [Google Scholar]
  31. Schmidt FR, Clausen M, Cremers D (2006) Shape matching by variational computation of geodesics on a manifold. In: Joint pattern recognition symposium. Springer, pp 142–151
  32. Semple C, Steel M. Phylogenetics. Oxford lecture series in mathematics and its applications. Oxford: Oxford University Press; 2003. [Google Scholar]
  33. Skovgaard LT. A Riemannian geometry of the multivariate normal model. Scand J Stat. 1984;11(4):211–223. [Google Scholar]
  34. Speyer D, Sturmfels B. The tropical Grassmannian. Adv Geom. 2004;4(3):389–411. doi: 10.1515/advg.2004.023. [DOI] [Google Scholar]
  35. Steel MA, Penny D. Distributions of tree comparison metrics—some new results. Syst Biol. 1993;42(2):126–141. [Google Scholar]
  36. Willis A. Confidence sets for phylogenetic trees. J Am Stat Assoc. 2019;114(525):235–244. doi: 10.1080/01621459.2017.1395342. [DOI] [Google Scholar]
  37. Yang Z. Computational molecular evolution. Oxford: Oxford University Press; 2006. [Google Scholar]
  38. Yoshida R, Zhang L, Zhang X. Tropical principal component analysis and its application to phylogenetics. Bull Math Biol. 2019;81(2):568–597. doi: 10.1007/s11538-018-0493-4. [DOI] [PubMed] [Google Scholar]
  39. Zwiernik P, Smith JQ. Tree cumulants and the geometry of binary tree models. Bernoulli. 2012;18(1):290–321. doi: 10.3150/10-BEJ338. [DOI] [Google Scholar]

Articles from Journal of Mathematical Biology are provided here courtesy of Springer

RESOURCES