Biometrika. 2017 Sep 27;104(4):901–922. doi: 10.1093/biomet/asx047

Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees

Tom M W Nye 1, Xiaoxian Tang 2, Grady Weyenberg 3, Ruriko Yoshida 4
PMCID: PMC5793493  PMID: 29422694

Summary

Evolutionary relationships are represented by phylogenetic trees, and a phylogenetic analysis of gene sequences typically produces a collection of these trees, one for each gene in the analysis. Analysis of samples of trees is difficult due to the multi-dimensionality of the space of possible trees. In Euclidean spaces, principal component analysis is a popular method of reducing high-dimensional data to a low-dimensional representation that preserves much of the sample’s structure. However, the space of all phylogenetic trees on a fixed set of species does not form a Euclidean vector space, and methods adapted to tree space are needed. Previous work introduced the notion of a principal geodesic in this space, analogous to the first principal component. Here we propose a geometric object for tree space similar to the $k$th principal component in Euclidean space: the locus of the weighted Fréchet mean of $k+1$ vertex trees when the weights vary over the $k$-simplex. We establish some basic properties of these objects, in particular showing that they have dimension $k$, and propose algorithms for projection onto these surfaces and for finding the principal locus associated with a sample of trees. Simulation studies demonstrate that these algorithms perform well, and analyses of two datasets, containing Apicomplexa and African coelacanth genomes respectively, reveal important structure from the second principal components.

Keywords: Fréchet mean, Phylogenetic tree, Principal component analysis, Tree space

1. Introduction

A great opportunity offered by modern genomics is that phylogenetics applied on a genomic scale, or phylogenomics, should be especially powerful for elucidating gene and genome evolution, relationships among species and populations, and processes of speciation and molecular evolution. However, a well-recognized hurdle is the sheer volume of genomic data that can now be generated relatively cheaply and quickly, but for which analytical tools are lacking. There is a major need to explore new approaches that will enable us to undertake comparative genomic and phylogenomic studies much more rapidly and robustly than existing tools allow.

Datasets consisting of collections of phylogenetic trees are challenging to analyse, due to their high dimensionality and the complexity of the space containing the data. Multivariate statistical procedures such as outlier detection (Weyenberg et al., 2014), clustering (Gori et al., 2016) and multi-dimensional scaling (Hillis et al., 2005) have previously been applied to such datasets, but principal component analysis is perhaps the most useful multivariate statistical tool for exploring high-dimensional datasets. For example, Zha et al. (2001) and Ding & He (2004) showed that principal component analysis automatically projects to the subspace where the global solution of $K$-means clustering lies, and so helps $K$-means clustering to find near-optimal solutions. Although principal component analysis for data in $\mathbb{R}^m$ can be defined in several different ways, the following description is natural for reformulating the procedure in tree space. Suppose we have data $X = \{x_1, \ldots, x_n\}$ where $x_i \in \mathbb{R}^m$ for $i = 1, \ldots, n$. For any set of $k+1$ points $V = \{v_0, \ldots, v_k\} \subset \mathbb{R}^m$ we can define

$$\Pi(V) = \left\{ \sum_{i=0}^{k} p_i v_i \;:\; p_0, \ldots, p_k \in \mathbb{R},\ \sum_{i=0}^{k} p_i = 1 \right\}, \tag{1}$$

so that $\Pi(V)$ is the affine subspace of $\mathbb{R}^m$ containing $v_0, \ldots, v_k$. The orthogonal $L^2$ distance of any point $y \in \mathbb{R}^m$ from $\Pi(V)$ is denoted by $d(y, \Pi(V))$, and the sum of squared projected distances of the data $X$ onto $\Pi(V)$ is denoted by

$$D_X^2(\Pi(V)) = \sum_{i=1}^{n} d(x_i, \Pi(V))^2.$$

Then the $k$th principal component $\Pi_k$ corresponds to a choice of $V$ which minimizes this sum. In $\mathbb{R}^m$, $\Pi_0$ is the sample mean, $\Pi_1$ is the line through the sample mean which minimizes the sum of squared projected distances, and so on for $k = 2, 3, \ldots$. Although it is not explicit in the definition above, in $\mathbb{R}^m$ the principal components are nested, i.e., $\Pi_0 \subset \Pi_1 \subset \Pi_2 \subset \cdots$. This description of principal component analysis relies heavily on the vector space properties of $\mathbb{R}^m$: $\Pi(V)$ is defined as a linear combination of vectors and the procedure uses orthogonal projection.
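As a small illustration of the criterion just described, the following sketch evaluates the sum of squared projected distances for the case of a line through two points in Euclidean space, and compares two candidate lines. The function names and the toy data are hypothetical and chosen only for this example; it is not the procedure used in the paper, which works in tree space.

```python
from typing import Sequence

Vector = Sequence[float]


def sq_dist_to_line(y: Vector, v0: Vector, v1: Vector) -> float:
    """Squared orthogonal distance from y to the affine line through v0 and v1."""
    direction = [b - a for a, b in zip(v0, v1)]
    offset = [c - a for a, c in zip(v0, y)]
    dd = sum(d * d for d in direction)
    if dd == 0.0:
        raise ValueError("v0 and v1 must be distinct points")
    # Coefficient of the orthogonal projection of y onto the line.
    t = sum(o * d for o, d in zip(offset, direction)) / dd
    residual = [o - t * d for o, d in zip(offset, direction)]
    return sum(r * r for r in residual)


def sum_sq_projected(data: Sequence[Vector], v0: Vector, v1: Vector) -> float:
    """Sum of squared projected distances of the data onto the line through v0, v1."""
    return sum(sq_dist_to_line(x, v0, v1) for x in data)


# Toy data lying close to the diagonal y = x (illustrative values only).
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
diagonal_fit = sum_sq_projected(data, (0.0, 0.0), (1.0, 1.0))
horizontal_fit = sum_sq_projected(data, (0.0, 1.5), (1.0, 1.5))
```

On this toy dataset the diagonal line gives a much smaller sum than the horizontal line, which is the sense in which a first principal component is the best-fitting line.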

However, the space of phylogenetic trees on a fixed set of leaves is not a Euclidean vector space, so we cannot directly apply classical principal component analysis to a dataset of phylogenetic trees. Instead, Billera et al. (2001) showed that the set $\mathcal{T}_N$ of all phylogenetic trees with $N+1$ leaves labelled $0, 1, \ldots, N$ forms a CAT(0) space as defined by Bridson & Haefliger (2011, Definition II.1.1). In CAT(0) spaces any pair of points is joined by a unique geodesic, or shortest-length path, and an algorithm exists that computes $\mathcal{T}_N$ geodesics in $O(N^4)$ steps (Owen & Provan, 2011). Furthermore, projection onto closed convex sets is well defined in CAT(0) spaces.

The analogue of the zeroth principal component is the unweighted Fréchet mean of the data $X$. The Fréchet mean is a statistic which characterizes the central tendency of a distribution in arbitrary metric spaces. For any metric space $\mathcal{X}$ equipped with metric $d(\cdot, \cdot)$, the Fréchet population mean $\mu$ with respect to the distribution $\nu$ is defined by

$$\mu = \arg\min_{x \in \mathcal{X}} \int_{\mathcal{X}} d(x, y)^2 \, \mathrm{d}\nu(y).$$

The discrete analogue, the weighted Fréchet mean of a sample $X = \{x_1, \ldots, x_n\}$ with respect to a weight vector $w = (w_1, \ldots, w_n)$, is

$$\mu(X, w) = \arg\min_{x \in \mathcal{X}} \sum_{i=1}^{n} w_i \, d(x, x_i)^2,$$

where the weights $w_i$ satisfy $w_i \geq 0$ for $i = 1, \ldots, n$. In any CAT(0) space, $\mu(X, w)$ is a well-defined unique point given data $X$ and weight vector $w$. The definition of the zeroth principal component $\Pi_0$ in $\mathbb{R}^m$ given above coincides with the definition of the Fréchet sample mean with equal weights $w_1 = \cdots = w_n$ in any CAT(0) space. Several algorithms for computing the Fréchet sample mean in $\mathcal{T}_N$ have been developed (Bačák, 2014; Miller et al., 2015) and we review these in § 2.2, as they play an important role in our method. The term Fréchet mean will be used throughout to refer to a sample mean unless stated otherwise.

Methods for constructing a principal geodesic in tree space, an analogue of $\Pi_1$ as defined above, have recently been developed. In Nye (2011), the approach involved firing geodesics from some mean tree. For each candidate geodesic $\Gamma$, the sum of squared projected distances $D_X^2(\Gamma)$ was computed and a greedy algorithm was used to adjust $\Gamma$ in order to minimize $D_X^2(\Gamma)$. The geodesics considered were infinitely long, which has the disadvantage that in some cases many such geodesics fit the data equally well. Subsequent approaches therefore considered finitely long geodesic segments (Feragen et al., 2013; Nye, 2014). The geodesic segment between two points $v_0, v_1$ is analogous to $\Pi(V)$ in (1) with $k = 1$, except that the weights $p_0$ and $p_1$ must be constrained to form a valid probability vector; that is, $p_0$ and $p_1$ must be nonnegative and sum to 1. Feragen et al. (2013) constrained the ends of the geodesic to be points in the sample $X$ and sought the corresponding geodesic $\Gamma$ which minimizes $D_X^2(\Gamma)$, whereas Nye (2014) did not restrict the geodesic and used a stochastic optimization algorithm to perform the minimization.

In this paper we address two fundamental questions: (i) which geometric object most naturally plays the role of a $k$th principal component in tree space; and (ii) given such an object, how can we efficiently project data points onto the object? Our proposed solution is to replace the definition of $\Pi(V)$ given in (1) with the locus of the weighted Fréchet mean of points $v_0, \ldots, v_k$ in tree space. Specifically, suppose $V = \{v_0, \ldots, v_k\} \subset \mathcal{T}_N$ and define $\Pi(V) \subset \mathcal{T}_N$ by

$$\Pi(V) = \left\{ \mu(V, p) \;:\; p \in \mathcal{S}^k \right\},$$

where $\mathcal{S}^k$ is the $k$-dimensional simplex of probability vectors,

$$\mathcal{S}^k = \left\{ (p_0, \ldots, p_k) \;:\; p_i \geq 0 \text{ for } i = 0, \ldots, k,\ \sum_{i=0}^{k} p_i = 1 \right\},$$

and $\mu(V, p)$ is the Fréchet mean of the points in set $V$ with weights $p$. We call $\Pi(V)$ the locus of the Fréchet mean of $V$. Our choice of notation is intended to emphasize the analogy between the definition of $\Pi(V)$ in tree space and the corresponding definition for $\mathbb{R}^m$ in (1). The locus of the Fréchet mean is a type of minimal surface, as the following physical analogy suggests. Imagine connecting a point $y \in \mathcal{T}_N$ to points $v_0, \ldots, v_k$ by $k+1$ pieces of elastic. When the point $y$ is free to move, it will move under the action of the elastic into an equilibrium position in tree space. If the stiffness of each piece of elastic is allowed to vary independently, corresponding to different choices for $p$, the equilibrium point will move about in tree space, tracing out a surface. In Euclidean space the locus of the Fréchet mean of some collection of points is contained in an affine subspace; however, in tree space, the locus can be curved. Surfaces of this kind have recently been studied in the context of Riemannian manifolds and other geodesic metric spaces (Pennec, 2015). We discuss the relationship of the present paper to that work in § 6.

Our main theoretical results are as follows. First, when Inline graphic we derive a set of local implicit equations for $\Pi(V)$. These allow us to derive conditions for $\Pi(V)$ to be locally flat, and also enable us to construct explicit realizations of $\Pi(V)$ in certain cases. Secondly, using the implicit equations we show that the locus of the Fréchet mean $\Pi(V)$ in $\mathcal{T}_N$ is locally $k$-dimensional for generic nondegenerate choices of $V$, and thus forms a suitable candidate for a $k$th principal component. Thirdly, we present an algorithm for projection onto $\Pi(V)$ which relies only on the CAT(0) properties of $\mathcal{T}_N$. We demonstrate the accuracy of the projection algorithm via a simulation study.

2. The geometry of tree space

2.1. Construction of tree space and its geodesics

Throughout the paper, the $m$-dimensional Euclidean vector space is denoted by $\mathbb{R}^m$. The nonnegative and positive orthants in $\mathbb{R}^m$ are denoted by $\mathbb{R}^m_{\geq 0}$ and $\mathbb{R}^m_{>0}$, respectively. For any vectors $a, b \in \mathbb{R}^m$, $\|a\|$ denotes the Euclidean norm of $a$ and $\langle a, b \rangle$ denotes the Euclidean inner product.

A phylogenetic tree with leaf set $S = \{0, 1, \ldots, N\}$ is an undirected weighted acyclic graph with $N+1$ degree-1 vertices labelled $0, 1, \ldots, N$ and with no degree-2 vertices. We consider rooted trees, and the root is the leaf labelled 0. Each such tree contains $N+1$ pendant edges, which connect to the leaves, and up to $N-2$ internal edges. The maximum number of internal edges is achieved when the tree is binary, in which case all non-leaf vertices have degree 3, and the tree is said to be fully resolved. If a tree contains fewer edges, then it is said to be unresolved and there must be at least one vertex with degree 4 or higher. Apart from the root edge containing taxon 0, each edge in a phylogeny is assigned a strictly positive weight, also called the edge length. Given a tree $x$, the set of edges of $x$ is denoted by $E(x)$, and the weight assigned to $e \in E(x)$ is denoted by $|e|_x$. It is convenient to define $|e|_x$ to be zero whenever $e$ is not contained in $E(x)$.

Tree space Inline graphic is the set of all phylogenetic trees with leaf set Inline graphic (Billera et al., 2001). Tree space can be embedded in Inline graphic for Inline graphic in the following way. If we cut any edge Inline graphic, then the tree Inline graphic splits into two disconnected pieces. This determines a split Inline graphic of the leaf set Inline graphic, where Inline graphic and Inline graphic. By convention we choose Inline graphic to be the set containing the root 0, and so there are Inline graphic possible splits of Inline graphic. The collection of splits represented by a tree Inline graphic is called the topology of Inline graphic. Since edges and splits are equivalent, we use the notation Inline graphic to also represent the set of splits in Inline graphic. By choosing some arbitrary ordering of the set of all splits, each tree Inline graphic can be represented as a vector in Inline graphic with up to Inline graphic positive entries given by the edge weights of Inline graphic and zeros for each split that is not contained in Inline graphic. However, an arbitrary choice of vector will not necessarily represent a tree; for example, the splits Inline graphic and Inline graphic cannot both be contained in the same tree, so any vector for which these splits both have a strictly positive value does not represent a tree. Two splits Inline graphic and Inline graphic are compatible if one of the four sets Inline graphic, Inline graphic, Inline graphic and Inline graphic is empty, in which case there is at least one tree containing both splits. Any collection of pairwise compatible splits determines a valid tree topology (Semple & Steel, 2003, Theorem 3.1.4).
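The compatibility rule for splits stated above can be checked mechanically. The sketch below represents a split as a pair of disjoint leaf sets, with the first set containing the root 0 as in the convention described in the text; the type alias, helper names and the five-leaf example are hypothetical and serve only to illustrate the rule.

```python
from typing import FrozenSet, Iterable, Tuple

Split = Tuple[FrozenSet[int], FrozenSet[int]]


def make_split(leaves: Iterable[int], side: Iterable[int]) -> Split:
    """Build the split of `leaves` induced by `side`, root-containing part first."""
    part = frozenset(side)
    rest = frozenset(leaves) - part
    if not part or not rest:
        raise ValueError("both parts of a split must be nonempty")
    return (part, rest) if 0 in part else (rest, part)


def compatible(s1: Split, s2: Split) -> bool:
    """Two splits are compatible if one of the four pairwise intersections is empty."""
    return any(not (p & q) for p in s1 for q in s2)


leaves = range(5)                      # leaf set {0, 1, 2, 3, 4}, root labelled 0
s = make_split(leaves, {3, 4})         # {0, 1, 2} | {3, 4}
t = make_split(leaves, {2, 3, 4})      # {0, 1} | {2, 3, 4}
u = make_split(leaves, {2, 4})         # {0, 1, 3} | {2, 4}
```

Here `s` and `t` are compatible, so some tree contains both, whereas `s` and `u` are not, so no vector giving both a positive value can represent a tree.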

The embedding into Euclidean space reveals the combinatorial structure of Inline graphic. Every tree Inline graphic contains Inline graphic pendant edges other than the root edge, so Inline graphic is the product of Inline graphic and a space corresponding to the internal edges. It is therefore convenient to ignore the pendant edges and consider the corresponding embedding of tree space into Inline graphic. Given any tree topology Inline graphic containing Inline graphic internal edges, the set of trees with topology Inline graphic corresponds to a subset Inline graphic which is isomorphic to Inline graphic with respect to the local Euclidean structure. Each such region is called the orthant for topology Inline graphic. The boundary of Inline graphic in Inline graphic corresponds to trees obtained by removing one or more internal edges from Inline graphic. Equivalently, the trees on the boundary can be obtained by taking a tree Inline graphic in Inline graphic and continuously shrinking one or more internal edges down to length zero. Thus, for a fully resolved topology Inline graphic, the codimension-Inline graphic boundaries of Inline graphic correspond to trees containing Inline graphic internal edges, and in general each codimension-Inline graphic boundary corresponds to trees containing Inline graphic internal edges, for Inline graphic. There are Inline graphic possible fully resolved rooted tree topologies, and so Inline graphic is built from Inline graphic orthants isomorphic to Inline graphic together with the boundaries of these orthants which correspond to trees that are not fully resolved. Orthants are glued together at their boundaries, since a given unresolved tree containing Inline graphic internal edges can be obtained by removing edges from several different trees containing Inline graphic edges. Orthants corresponding to fully resolved topologies are glued at their codimension-Inline graphic boundaries in a relatively simple way. 
If a single internal edge in a tree with fully resolved topology Inline graphic is contracted to length zero and removed from the tree, the result is a vertex of degree Inline graphic. There are three possible ways to add in an extra edge to give a fully resolved topology, so each codimension-Inline graphic face of Inline graphic is glued to two other such orthants. Trees containing no internal edges are called star trees; the point Inline graphic corresponds to the set of star trees and is contained in the boundary of every orthant Inline graphic.
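To give a sense of how quickly the number of orthants grows, the sketch below evaluates the standard count $(2N-3)!! = 1 \cdot 3 \cdot 5 \cdots (2N-3)$ of fully resolved rooted tree topologies on $N$ non-root leaves. That this is the count intended in the paragraph above is an assumption, since the formula itself did not survive in the text; the function name is hypothetical.

```python
def num_resolved_topologies(n_leaves: int) -> int:
    """Number of fully resolved rooted topologies on n_leaves labelled leaves.

    Uses the standard double-factorial count (2N - 3)!!, assumed here to be
    the count of top-dimensional orthants in tree space.
    """
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2N - 3.
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


counts = {n: num_resolved_topologies(n) for n in range(3, 11)}
```

Even for ten leaves there are already tens of millions of orthants, which is one reason why exhaustive search over topologies is impractical.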

The topology of $\mathcal{T}_N$ is taken to be that induced by the embedding into Euclidean space. Geodesics are constructed by considering continuous paths in $\mathcal{T}_N$ which are Euclidean straight-line segments in each orthant. The length of a path is the sum of the Euclidean segment lengths. As shown by Billera et al. (2001), the shortest such path or geodesic between two points $x, y \in \mathcal{T}_N$ is unique, and it will be denoted by $\Gamma(x, y)$. The distance $d(x, y)$ is defined to be the length of $\Gamma(x, y)$, and this defines the metric $d(\cdot, \cdot)$ on $\mathcal{T}_N$. By definition, $d(x, y)$ incorporates information about both the topologies and the edge lengths of $x$ and $y$. Given two points $x$ and $y$ in the same orthant, $\Gamma(x, y)$ is simply the Euclidean line segment between $x$ and $y$, whereas when $x$ and $y$ are in different orthants, $\Gamma(x, y)$ consists of a series of straight-line segments traversing orthants corresponding to different topologies. Billera et al. (2001) proved that $\mathcal{T}_N$ is a CAT(0) space, so it has several additional geometrical properties (Bridson & Haefliger, 2011).

Owen & Provan (2011) established an $O(N^4)$ algorithm to compute the geodesic between any two trees in $\mathcal{T}_N$. The details of their algorithm are not important for the present application, but we do require some notation for the form of the geodesics it constructs. Given $x, y \in \mathcal{T}_N$, let $C$ be the set of splits in $E(x) \cup E(y)$ which are compatible with every split in $E(x)$ and every split in $E(y)$. Adopting notation from Owen & Provan (2011), the geodesic $\Gamma(x, y)$ is characterized by disjoint sets of internal splits

$$A_1, \ldots, A_r \subseteq E(x) \setminus C, \qquad B_1, \ldots, B_r \subseteq E(y) \setminus C,$$

where $r$ is an integer that depends on $x$ and $y$. These sets of splits determine the order in which edges are removed and added as the geodesic is traversed; the $i$th topology visited contains splits

$$B_1 \cup \cdots \cup B_i \cup A_{i+1} \cup \cdots \cup A_r \cup C.$$

The union $A_1 \cup \cdots \cup A_r$ is $E(x) \setminus C$ and similarly for tree $y$. We let $\mathcal{A}$ be the ordered list of sets $(A_1, \ldots, A_r)$ and similarly define $\mathcal{B}$. The support of $\Gamma(x, y)$, defined to be the triple $(\mathcal{A}, \mathcal{B}, C)$, characterizes the sequence of orthants the geodesic traverses. For any set $A \subseteq E(x)$ we adopt the notation

$$\|A\|_x = \Big( \sum_{e \in A} |e|_x^2 \Big)^{1/2},$$

and similarly for subsets of $E(y)$. Owen & Provan (2011) showed that

$$d(x, y)^2 = \big\| \, \|\mathcal{A}\|_x + \|\mathcal{B}\|_y \, \big\|^2 + \| x_C - y_C \|^2, \tag{2}$$

where $\|\mathcal{A}\|_x$ is the $r$-dimensional vector whose $i$th element is $\|A_i\|_x$, and similarly for $\|\mathcal{B}\|_y$ the $i$th element is $\|B_i\|_y$. The vectors $x_C$ and $y_C$ have dimension $|C|$ and respectively contain the edge lengths $|e|_x$ and $|e|_y$ for $e \in C$. It follows from (2) that

$$d(x, y)^2 = \|x\|^2 + \|y\|^2 + 2 \big\langle \|\mathcal{A}\|_x, \|\mathcal{B}\|_y \big\rangle - 2 \langle x_C, y_C \rangle, \tag{3}$$

where $\|x\|^2$ is the sum of squared edge lengths in $x$ and similarly for $y$.
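The two expressions for the squared geodesic distance can be compared numerically once a support is known. The sketch below assumes the standard Owen & Provan form, in which the squared distance is the sum over support pairs of $(\|A_i\|_x + \|B_i\|_y)^2$ plus the squared differences of the common edge lengths, and checks it against the expanded form in terms of squared tree norms and inner products. The function names and edge lengths are hypothetical, and the code takes the support as given; it does not compute or validate a geodesic support.

```python
import math
from typing import Sequence


def norm(lengths: Sequence[float]) -> float:
    """Euclidean norm of a collection of edge lengths."""
    return math.sqrt(sum(v * v for v in lengths))


def squared_distance(a_sets, b_sets, common_x, common_y) -> float:
    """Squared distance from the support: sum of (|A_i| + |B_i|)^2 plus common part."""
    if len(a_sets) != len(b_sets) or len(common_x) != len(common_y):
        raise ValueError("support lists must have matching lengths")
    crossing = sum((norm(a) + norm(b)) ** 2 for a, b in zip(a_sets, b_sets))
    shared = sum((u - v) ** 2 for u, v in zip(common_x, common_y))
    return crossing + shared


def squared_distance_expanded(a_sets, b_sets, common_x, common_y) -> float:
    """Expanded form: |x|^2 + |y|^2 + 2 <|A|, |B|> - 2 <x_C, y_C>."""
    sq_x = sum(v * v for a in a_sets for v in a) + sum(u * u for u in common_x)
    sq_y = sum(v * v for b in b_sets for v in b) + sum(u * u for u in common_y)
    cross = sum(norm(a) * norm(b) for a, b in zip(a_sets, b_sets))
    inner = sum(u * v for u, v in zip(common_x, common_y))
    return sq_x + sq_y + 2.0 * cross - 2.0 * inner


# Hypothetical support: one pair (A_1, B_1) and a single common edge.
a_sets = [[3.0, 4.0]]      # edge lengths of A_1 in x, so |A_1| = 5
b_sets = [[5.0, 12.0]]     # edge lengths of B_1 in y, so |B_1| = 13
common_x, common_y = [1.0], [2.0]
direct = squared_distance(a_sets, b_sets, common_x, common_y)
expanded = squared_distance_expanded(a_sets, b_sets, common_x, common_y)
```

Both functions return the same value for this example, reflecting that the expanded form is obtained simply by multiplying out the squares.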

The following definition characterizes certain geodesics which behave rather like Euclidean straight lines.

Definition 1

(Simple geodesic). Suppose that $x, y \in \mathcal{T}_N$ are fully resolved. The geodesic $\Gamma(x, y)$ is said to be simple if each of the sets $A_i$ and $B_i$ in its support contains exactly one element for $i = 1, \ldots, r$. Equivalently, $\Gamma(x, y)$ is simple if and only if at most one edge length at a time contracts to zero as the geodesic is traversed.

The following definition determines the set of trees $x$ such that the geodesics $\Gamma(x, y)$ to a fixed point $y$ all share the same support.

Definition 2

(Support region). Fix some point $y \in \mathcal{T}_N$ and an orthant $\mathcal{O}_\tau$ corresponding to a fully resolved topology $\tau$. Let $\sigma = (\mathcal{A}, \mathcal{B}, C)$ be the support of $\Gamma(x, y)$ for some $x \in \mathcal{O}_\tau$. Then the set

$$\mathcal{R}(y, \sigma) = \left\{ x \in \mathcal{O}_\tau \;:\; \Gamma(x, y) \text{ has support } \sigma \right\}$$

is called a support region. The number of support regions for fixed $y$ and $\tau$ is finite since geodesics of the form $\Gamma(x, y)$ for $x \in \mathcal{O}_\tau$ have finitely many distinct supports.

Miller et al. (2015) considered very similar subsets of $\mathcal{T}_N$ and established their properties. This relied on a map on $\mathcal{O}_\tau$ defined by squaring edge lengths. In the image of this map, Miller et al. (2015) showed that each support region is defined by a set of linear inequalities and that the boundaries between support regions are codimension-1 hyperplanes. It follows, by inverting the squaring map, that the union over the set $\Sigma$ of possible supports, $\bigcup_{\sigma \in \Sigma} \mathcal{R}^\circ(y, \sigma)$, is dense in $\mathcal{O}_\tau$, where $\mathcal{R}^\circ(y, \sigma)$ denotes the interior of each support region; it also follows that the boundaries between the support regions are continuous codimension-1 surfaces within each orthant.

2.2. Algorithms for computing the Fréchet mean

Several algorithms for computing the unweighted or weighted Fréchet mean of a sample in $\mathcal{T}_N$ have been developed (Sturm, 2003; Bačák, 2014; Miller et al., 2015). These algorithms have the following general structure. Suppose we have a set $X = \{x_1, \ldots, x_n\} \subset \mathcal{T}_N$. At the $j$th iteration there is an estimate $\mu_j$ of the Fréchet mean of $X$. To find the next estimate, $\mu_{j+1}$, a data point $x_i$ is selected, either deterministically or stochastically depending on the particular algorithm. The geodesic $\Gamma(\mu_j, x_i)$ is constructed, and $\mu_{j+1}$ is taken to be the point a certain proportion of the distance along the geodesic. This proportion can depend on the weights when the weighted Fréchet mean is estimated. In each case, some form of convergence of the sequence $\mu_0, \mu_1, \ldots$ to the Fréchet mean of $X$ can be proved, independent of the initial estimate $\mu_0$.

Our method does not make direct use of these algorithms. However, as described in § 4.1, our proposed algorithm for projecting data onto the locus of the Fréchet mean is adapted from the algorithm of Sturm (2003), which computes the Fréchet mean of $X$ using weights $w = (w_1, \ldots, w_n)$. By definition, the Fréchet mean is invariant under positive scaling of the weights, so we can assume $\sum_{i=1}^{n} w_i = 1$ without loss of generality. Sturm’s algorithm proceeds in the following way.

Algorithm 1.

Sturm’s algorithm for the weighted Fréchet mean.

Fix an initial estimate $\mu_0 \in \mathcal{T}_N$ and set $j = 0$.

Repeat:

  Sample $x_i \in X$ such that $x_i$ is selected with probability $w_i$.

  Construct $\Gamma(\mu_j, x_i)$.

  Let $\mu_{j+1}$ be the point a proportion $s_j$ along $\Gamma(\mu_j, x_i)$, where Inline graphic.

  Set $j = j + 1$.

Until the sequence $\mu_0, \mu_1, \ldots$ converges.

Convergence can be tested in various ways, for example by repeating until a specified number of consecutive estimates $\mu_j$ all lie within distance $\epsilon$ of each other. Sturm proved that the points $\mu_j$ converge in probability to the Fréchet mean of the distribution defined by sampling $x_i$ according to probabilities $w_i$.
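The structure of Algorithm 1 can be illustrated in Euclidean space, where the geodesic between two points is the straight-line segment and the weighted Fréchet mean is the weighted average. The sketch below is only an analogue under stated assumptions: it uses the step proportion $1/(j+1)$ at iteration $j = 1, 2, \ldots$, a common choice for Sturm-type inductive means that is assumed here rather than taken from the text, and it runs a fixed number of iterations instead of testing convergence. The function names and the example points are hypothetical.

```python
import random
from typing import List, Sequence


def point_along(x: Sequence[float], y: Sequence[float], s: float) -> List[float]:
    """Point a proportion s along the straight-line geodesic from x to y."""
    return [a + s * (b - a) for a, b in zip(x, y)]


def sturm_mean(points: Sequence[Sequence[float]],
               weights: Sequence[float],
               n_iter: int = 20000,
               seed: int = 0) -> List[float]:
    """Stochastic estimate of the weighted Frechet mean (Euclidean analogue)."""
    rng = random.Random(seed)
    total = sum(weights)
    probs = [w / total for w in weights]   # rescale weights to sum to one
    estimate = list(points[0])             # initial estimate
    for j in range(1, n_iter + 1):
        # Sample a data point with probability given by its weight.
        target = rng.choices(points, weights=probs, k=1)[0]
        # Move a shrinking proportion along the geodesic towards it
        # (assumed step size 1 / (j + 1)).
        estimate = point_along(estimate, target, 1.0 / (j + 1))
    return estimate


points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
weights = [1.0, 1.0, 2.0]
estimate = sturm_mean(points, weights)
exact = [sum(w * p[d] for w, p in zip(weights, points)) / sum(weights)
         for d in range(2)]
```

In tree space the only change in structure would be to replace `point_along` by a routine returning a point on the tree-space geodesic, for example one built on the algorithm of Owen & Provan (2011).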

The deterministic algorithm of Bačák (2014) for computing the weighted Fréchet mean is similar to Sturm’s algorithm, except that the data points are used cyclically, as opposed to being randomly sampled, and the weighting is instead taken into account in the definition of the proportion moved along each geodesic. We use the algorithm of Bačák (2014) for computing the Fréchet mean in order to test our projection algorithm, and this procedure is also described in § 4.1.

2.3. Convex hulls

Nye (2014) suggested that the convex hull of $k+1$ points in $\mathcal{T}_N$ might be a suitable geometrical object to represent a $k$th principal component. A set $A \subseteq \mathcal{T}_N$ is convex if and only if for all points $x, y \in A$ the geodesic $\Gamma(x, y)$ is also contained in $A$. The convex hull of a set of points is the smallest convex set containing those points. Any geodesic segment is the convex hull of its endpoints, and using the convex hull of three points to represent a second principal component is a natural generalization of the idea of a principal geodesic. Convexity is also a desirable property when performing projections, as occurs in principal component analysis. However, convex hulls in tree space do not have the correct dimension. Examples for which the convex hull of three points is three-dimensional can readily be constructed, as shown in a 2015 University of Kentucky PhD thesis by G. Weyenberg and in Lubiw et al. (2017). Lin et al. (2016, § 3) show that the dimension of a convex hull of three points in $\mathcal{T}_N$ can be arbitrarily high as $N$ increases. More generally, convex hulls in tree space are difficult to characterize geometrically, and several fundamental questions remain unanswered. These issues make convex hulls less appealing as geometrical objects to represent principal components, so we focus our attention on the locus of the Fréchet mean. We shall, however, demonstrate the relationship between the locus of the Fréchet mean and the convex hull for an explicit configuration of three points $v_0, v_1, v_2$ later in § 3.4.

3. The locus of the Fréchet mean

3.1. Basic properties

Throughout this section we work with $k+1$ vertex points $v_0, \ldots, v_k \in \mathcal{T}_N$ and let $V = \{v_0, \ldots, v_k\}$. As in § 1, we define $\mu(V, p)$ for $p \in \mathcal{S}^k$ by

$$\mu(V, p) = \arg\min_{x \in \mathcal{T}_N} \sum_{i=0}^{k} p_i \, d(x, v_i)^2,$$

and denote the associated locus of the Fréchet mean by $\Pi(V)$.

Here we establish some basic properties of $\Pi(V)$, while § 3.2 presents a more detailed analysis of $\Pi(V)$ within orthant interiors. First, the map $p \mapsto \mu(V, p)$ is continuous and so $\Pi(V)$ is compact, since it is the continuous image of a compact set. Continuity of $\mu$ can be proved using the deterministic algorithm for calculating the weighted Fréchet mean given by Bačák (2014); the output of the algorithm depends continuously on the inputs $V$ and $p$. Secondly, the points $v_i$ are contained in $\Pi(V)$, since $v_i = \mu(V, e_i)$ where $e_i$ denotes the $i$th standard basis vector in $\mathbb{R}^{k+1}$. Similarly, each geodesic $\Gamma(v_i, v_j)$ is contained in $\Pi(V)$, by taking $p$ to be a convex combination of $e_i$ and $e_j$. By the same argument, if $V'$ is a nonempty subset of $V$, then $\Pi(V)$ contains $\Pi(V')$.

In Euclidean space the convex hull of $k+1$ points coincides with the locus of the Fréchet mean of the points. However, this is not the case in tree space, though $\Pi(V)$ is contained in the closure of the convex hull of $V$. This latter property follows because any point in $\Pi(V)$ can be approximated arbitrarily closely by performing a finite number of steps in the algorithm of Bačák (2014), as described in § 2.2. Provided the algorithm is initialized with one of the points $v_i$, each of these steps remains within the convex hull, and so the limit point is contained in the closure of the convex hull. Note that $\Pi(V)$ is itself generally not convex, so there may not be a unique closest point on $\Pi(V)$ to any given point $x \in \mathcal{T}_N$, although the minimum distance of $x$ from $\Pi(V)$ is well defined. By using $\Pi(V)$ as a principal component we have therefore lost the desirable property of uniqueness of projection.

Fréchet means in tree space exhibit a property called stickiness (Hotz et al., 2013). This essentially means that for fixed $V$ the map $p \mapsto \mu(V, p)$ can fail to be injective. Specifically, depending on the points in $V$, there may exist open sets in $\mathcal{S}^k$ which all map to the same point in tree space. This has implications when we project data points onto $\Pi(V)$: given a data point $x$, the value of $p$ which minimizes $d(x, \mu(V, p))$ might be nonunique, even if there is a unique closest point $y \in \Pi(V)$ to $x$.

3.2. Implicit equations for the locus of the Fréchet mean

The algebraic form of tree space geodesics described in § 2.1 can be used to derive implicit equations for the edge lengths of trees lying on the locus of the Fréchet mean $\Pi(V)$, and these equations are fundamental to establishing the dimension of $\Pi(V)$. For fixed $V$, consider the objective function $\Omega(x, p)$ defined by

$$\Omega(x, p) = \sum_{i=0}^{k} p_i \, d(x, v_i)^2.$$

Suppose we fix an orthant $\mathcal{O}_\tau$ for a fully resolved topology $\tau$. Let $x \in \mathcal{O}_\tau$ have edge lengths $|e|_x$ where $e \in E(x)$. Miller et al. (2015) showed that functions of the form $x \mapsto d(x, y)^2$ are continuously differentiable on $\mathcal{O}_\tau$ with respect to the edge lengths $|e|_x$. In order to minimize $\Omega$ we also assume that $x$ lies in a set

$$\mathcal{R} = \mathcal{R}(v_0, \sigma_0) \cap \mathcal{R}(v_1, \sigma_1) \cap \cdots \cap \mathcal{R}(v_k, \sigma_k) \tag{4}$$

for some choice of supports $\sigma_0, \ldots, \sigma_k$, where $\sigma_i = (\mathcal{A}^{(i)}, \mathcal{B}^{(i)}, C_i)$ is a support for geodesics from $\mathcal{O}_\tau$ to $v_i$. We call sets of this form mutual support regions with respect to $V$. For each $i$ the sets $\mathcal{R}^\circ(v_i, \sigma_i)$ are open and the union over possible choices $\sigma_i$ is dense in $\mathcal{O}_\tau$, as shown in § 2.1. Since the intersection of finitely many dense open sets is also dense, it follows that the union of sets of the form $\mathcal{R}$ in (4) over all choices $\sigma_0, \ldots, \sigma_k$ is dense in $\mathcal{O}_\tau$. Each mutual support region is essentially a piece of tree space for which the combinatorics of the geodesics to $v_0, \ldots, v_k$ do not vary as a reference point moves around the region. An example of a decomposition of orthants into mutual support regions is given in § 3.4. Under this assumption on $x$, we can write down the algebraic form of $\Omega$ using (3), to give

$$\Omega(x, p) = \sum_{i=0}^{k} p_i \left\{ \|x\|^2 + \|v_i\|^2 + 2 \big\langle \|\mathcal{A}^{(i)}\|_x, \|\mathcal{B}^{(i)}\|_{v_i} \big\rangle - 2 \big\langle x_{C_i}, (v_i)_{C_i} \big\rangle \right\},$$

so that, since the weights $p_i$ sum to 1,

$$\frac{\partial \Omega}{\partial |e|_x} = 2 |e|_x + 2 \sum_{i=0}^{k} p_i \left\{ \frac{\partial}{\partial |e|_x} \big\langle \|\mathcal{A}^{(i)}\|_x, \|\mathcal{B}^{(i)}\|_{v_i} \big\rangle - \frac{\partial}{\partial |e|_x} \big\langle x_{C_i}, (v_i)_{C_i} \big\rangle \right\}. \tag{5}$$

If the point $x$ lies on the locus of the Fréchet mean $\Pi(V)$, then $\partial \Omega / \partial |e|_x = 0$ for all $e \in E(x)$, and so we want to evaluate these derivatives to obtain implicit equations relating the edge lengths $|e|_x$ to the vector $p$.

Let $y$ be any of the trees $v_i$, with support $(\mathcal{A}, \mathcal{B}, C)$. By definition, $\langle x_C, y_C \rangle = \sum_{e \in C} |e|_x |e|_y$, so

$$\frac{\partial}{\partial |e|_x} \langle x_C, y_C \rangle = \begin{cases} |e|_y, & e \in C, \\ 0, & e \notin C, \end{cases}$$

since $|e|_y$ is the length of split $e$ in $y$. The derivative of $\langle x_C, y_C \rangle$ is therefore a constant. The term $\langle \|\mathcal{A}\|_x, \|\mathcal{B}\|_y \rangle$ has a more general functional dependence on $|e|_x$. By definition,

$$\big\langle \|\mathcal{A}\|_x, \|\mathcal{B}\|_y \big\rangle = \sum_{j=1}^{r} \Big( \sum_{e' \in A_j} |e'|_x^2 \Big)^{1/2} \|B_j\|_y.$$

For any edge $e \in C$ this expression does not depend on $|e|_x$, so the derivative is zero. When $e \notin C$, only the first term in brackets will depend on $|e|_x$. Since the sets $A_j$ are disjoint, it must be the case that $e$ is contained in exactly one set, and we define $j(e)$ to be the index $j$ of that set when $e \notin C$. Then

$$\frac{\partial}{\partial |e|_x} \big\langle \|\mathcal{A}\|_x, \|\mathcal{B}\|_y \big\rangle = |e|_x \, \frac{\|B_{j(e)}\|_y}{\|A_{j(e)}\|_x}.$$

In the case where $A_{j(e)}$ contains only $e$ and no other splits, we have $\|A_{j(e)}\|_x = |e|_x$, so the expression becomes $\|B_{j(e)}\|_y$, which is also a constant. Substituting these expressions into (5) gives

$$\frac{\partial \Omega}{\partial |e|_x} = 2 |e|_x + 2 \sum_{i=0}^{k} p_i \left\{ \big(1 - \delta_i(e)\big) \, |e|_x \, \frac{\|B^{(i)}_{j_i(e)}\|_{v_i}}{\|A^{(i)}_{j_i(e)}\|_x} - \delta_i(e) \, |e|_{v_i} \right\}, \tag{6}$$

where $\delta_i(e) = 1$ if $e \in C_i$ and 0 otherwise, and $j_i(e)$ is the index $j(e)$ for the support $\sigma_i$.

We define $F(x, p)$, the vector with one component $F_e$ for each $e \in E(x)$, by

$$F_e(x, p) = \frac{\partial \Omega}{\partial |e|_x}(x, p). \tag{7}$$

Miller et al. (2015) showed that the function $d(x, y)^2$ for fixed $y$ is continuously differentiable on $\mathcal{O}_\tau$ with respect to $|e|_x$. Higher derivatives exist within each support region $\mathcal{R}(y, \sigma)$. It follows that $F$ is continuously differentiable with respect to the edge lengths for all $x$ lying within the interior of mutual support regions, and that $F$ is continuous on $\mathcal{O}_\tau$. However, $F$ may not be differentiable on the boundary between mutual support regions. In § 3.3 we show that the matrix of second derivatives of $\Omega$ is positive definite on each mutual support region, and so every solution to $F(x, p) = 0$ is a minimum. It follows that $\Pi(V)$ is locally the solution to $F(x, p) = 0$.

The following lemma establishes conditions for Inline graphic to be a flat affine subspace within the mutual support region Inline graphic.

Lemma 1.

If the supports Inline graphic are such that the geodesics Inline graphic are simple for all Inline graphic, in the sense of Definition 1, then Inline graphic is an affine subspace of dimension Inline graphic or lower in Inline graphic.

Proof.

If all the geodesics Inline graphic are simple for Inline graphic, then each set Inline graphic contains exactly one split. Then (6) becomes

graphic file with name Equation23.gif

for some constants Inline graphic. Solving Inline graphic gives each edge length Inline graphic as a linear combination of Inline graphic, which establishes the result. Generically, Inline graphic is therefore locally a Inline graphic-dimensional affine subspace of Inline graphic, but the dimension may be lower. Further discussion of the dimension is given in § 3.3. □

3.3. The dimension of the locus of the Fréchet mean

That Inline graphic has dimension Inline graphic in each mutual support region follows quickly from the form of Inline graphic in (7) through application of the implicit function theorem.

Lemma 2.

The matrix with elements Inline graphic is positive definite for all Inline graphic in mutual support region Inline graphic.

A proof of this lemma can be found in the Supplementary Material.

Theorem 1.

Within the mutual support region Inline graphic, the locus of the Fréchet mean Inline graphic is a submanifold of dimension Inline graphic or lower. For generic selections of the points Inline graphic, the dimension is Inline graphic.

Proof.

Application of the implicit function theorem to the map Inline graphic when Inline graphic establishes that there is a locally defined function Inline graphic such that Inline graphic and that the locus Inline graphic is a Inline graphic-dimensional submanifold of Inline graphic. In fact, the image Inline graphic will be Inline graphic-dimensional when Inline graphic, the derivative of Inline graphic with respect to Inline graphic, has rank Inline graphic, which holds for generic selections of Inline graphic in tree space. This is analogous to considering the unique affine subspace containing Inline graphic given points in Euclidean space: generically the subspace has dimension Inline graphic, but it can be lower. □
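The Euclidean analogy at the end of the proof can be checked numerically. The following is a minimal sketch, not part of the analysis in this paper: the function name `affine_hull_dimension` and the example points are hypothetical, and the computation is for points in Euclidean space rather than tree space. It computes the dimension of the affine subspace through a set of points as the rank of the matrix of differences from one of the points.

```python
import numpy as np

def affine_hull_dimension(points, tol=1e-9):
    """Dimension of the smallest affine subspace containing the given points."""
    pts = np.asarray(points, dtype=float)
    # Subtracting one point turns the affine hull into a linear span,
    # whose dimension is the rank of the matrix of differences.
    differences = pts[1:] - pts[0]
    return int(np.linalg.matrix_rank(differences, tol=tol))

# Hypothetical examples: three points in general position, and three collinear points.
generic = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
degenerate = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
print(affine_hull_dimension(generic), affine_hull_dimension(degenerate))
```

Three points in general position span a plane, while three collinear points span only a line, which mirrors the statement that the dimension is generic but can be lower.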

3.4. Explicit calculation

In this subsection we construct an explicit example of the locus of the Fréchet mean for three points in Inline graphic. This example helps to demonstrate the nature of geodesics in tree space, the derivation of the implicit equations for Inline graphic, the relationship with the convex hull, and other geometrical features. We start by fixing Inline graphic and Inline graphic to have the topologies and edge lengths shown in Fig. 1(a). We will ignore the pendant edge lengths, and so the orthants containing these trees can be identified with three orthants in Inline graphic equipped with standard coordinates Inline graphic. There are five splits contained in these trees, excluding the pendant splits; they will be written as Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic by neglecting the complements in Inline graphic. We then let Inline graphic denote the length associated with split Inline graphic in tree Inline graphic, for example. Under the identification with Inline graphic we have

graphic file with name Equation24.gif

and Inline graphic. Figure 1(b) shows the location of trees Inline graphic under this identification. The orthant Inline graphic does not correspond to a valid tree topology as Inline graphic is not compatible with Inline graphic. At each codimension-Inline graphic face between the orthants shown there is in fact a third orthant in Inline graphic glued at the same boundary, but these orthants do not play a role in this example.

Fig. 1.


(a) Topologies for the trees Inline graphic of the example in § 3.4; the circled numbers are weights for internal edges. (b) Coordinates of the trees Inline graphic under the identification with orthants in Inline graphic; the Inline graphic axis points out of the page. The geodesics between Inline graphic are shown: Inline graphic kinks around the origin; the dashed line is between points Inline graphic and Inline graphic on Inline graphic and Inline graphic, respectively; the lower left quadrant does not correspond to any tree topology, and is not a part of the space.

In Fig. 1(b) it can be seen that the geodesics Inline graphic and Inline graphic are straight-line segments under the identification with Inline graphic, while the geodesic Inline graphic kinks at a codimension-Inline graphic face. This behaviour is typical of geodesics in Inline graphic: they are straight-line segments within each orthant but can contain kinks at the boundaries between orthants. Figure 1(b) also shows how the convex hull of Inline graphic has dimension 3. The dashed line shows the geodesic between points Inline graphic and Inline graphic on Inline graphic and Inline graphic, respectively. The convex hull therefore contains the points Inline graphic and Inline graphic, so there are four points which are not coplanar within each orthant of the convex hull.

Figure 2 shows the decomposition of the orthants into mutual support regions for Inline graphic and Inline graphic. There are five regions in total, and the geodesics Inline graphic are simple for all Inline graphic when Inline graphic is contained in three of the regions. Lemma 1 shows that Inline graphic is therefore planar in those regions with equation

graphic file with name Equation25.gif

Fig. 2.


Decomposition of the locus of the Fréchet mean into mutual support regions. There are five such regions, represented by shading: two mutual support regions are dark grey, and two are mid-grey. The dashed lines show the geodesics between a point Inline graphic and the points Inline graphic: (a) when Inline graphic is contained in the light grey mutual support region, none of the geodesics Inline graphic hit codimension-Inline graphic orthant faces, so Lemma 1 shows that Inline graphic is planar within the region; the same applies to the two mutual support regions shaded mid-grey; (b) when Inline graphic is contained in one of the dark grey shaded regions, then Inline graphic is not simple as it intersects a codimension-Inline graphic boundary, so the part of Inline graphic lying within this region is not planar.

We can also explicitly calculate equations for Inline graphic in the mutual support region contained in Inline graphic and shown in dark grey at the top-left of each panel in Fig. 2. For Inline graphic contained in this region, the squared distances to the vertices are

graphic file with name Equation26.gif

where Inline graphic has coordinates Inline graphic. These can be used to write down an equation for Inline graphic, and then (6) becomes

graphic file with name Equation27.gif

Then Inline graphic can be solved to give

graphic file with name Equation28.gif

whenever Inline graphic, where Inline graphic. The resulting surface is shown in Fig. 3, from which we can see that Inline graphic forms a nonconvex two-dimensional surface that is contained within the convex hull.

Fig. 3.


Perspective view of Inline graphic for the example in § 3.4. The locus of the Fréchet mean is a two-dimensional surface which resembles a rubber sheet pulled taut between the corners.

4. Projection onto the locus of the Fréchet mean and principal component analysis

4.1. Projection

In order to use the surface Inline graphic as a principal component, we need to be able to project data onto Inline graphic. Let Inline graphic denote a data point and fix Inline graphic. A projection of Inline graphic onto Inline graphic is a point which minimizes Inline graphic. This point may not be unique as Inline graphic is not convex. A naive algorithm to find a projection is to perform an exhaustive search, as described in Algorithm 2.

Algorithm 2.

Exhaustive search to project Inline graphic onto Inline graphic.

Construct a lattice of points Inline graphic. For Inline graphic this is a triangular lattice.

For each point Inline graphic use a standard algorithm to compute Inline graphic.

Find Inline graphic which minimizes Inline graphic.

We implemented this algorithm for Inline graphic and used the algorithm of Bačák (2014) in the second step to compute Fréchet means. Algorithm 2 is computationally very expensive, since the resolution of the lattice Inline graphic needs to be quite fine in order to obtain accurate results. Consequently we use the exhaustive search algorithm only as a benchmark for assessing other methods.
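The three steps of Algorithm 2 can be sketched in a simplified setting. The code below is an illustration only and is not the implementation used in this paper: it works in Euclidean space, where the weighted Fréchet mean reduces to the weighted average of the vertices, and all function names (`simplex_lattice`, `euclidean_frechet_mean`, `exhaustive_projection`) and the lattice resolution are assumptions. In tree space the argument `mean_fn` would have to be replaced by a routine that computes weighted Fréchet means, such as the algorithm of Bačák (2014), and the Euclidean norm by the geodesic distance.

```python
import itertools
import numpy as np

def simplex_lattice(n_vertices, resolution):
    """All weight vectors with entries j/resolution that sum to one."""
    grid = []
    for combo in itertools.product(range(resolution + 1), repeat=n_vertices - 1):
        if sum(combo) <= resolution:
            last = resolution - sum(combo)
            grid.append([c / resolution for c in combo] + [last / resolution])
    return np.array(grid)

def euclidean_frechet_mean(vertices, weights):
    # Euclidean stand-in (assumption): the weighted Frechet mean is the weighted average.
    return weights @ vertices

def exhaustive_projection(z, vertices, resolution=50, mean_fn=euclidean_frechet_mean):
    """Return the lattice weights, the point and the distance closest to z."""
    vertices = np.asarray(vertices, dtype=float)
    z = np.asarray(z, dtype=float)
    lattice = simplex_lattice(len(vertices), resolution)            # step 1
    means = np.array([mean_fn(vertices, w) for w in lattice])       # step 2
    distances = np.linalg.norm(means - z, axis=1)
    best = int(np.argmin(distances))                                # step 3
    return lattice[best], means[best], distances[best]

# Hypothetical example: project (1, 1) onto the triangle with the vertices below.
weights, point, distance = exhaustive_projection([1.0, 1.0], [[0, 0], [1, 0], [0, 1]])
print(weights, point, distance)
```

The number of lattice points grows rapidly with the resolution and with the number of vertices, and each lattice point requires one Fréchet mean computation, which is the source of the computational cost noted above.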

We would like a more efficient algorithm defined entirely in terms of the geodesic geometry, since any reliance on local differentiable structure is likely to be problematic at orthant boundaries. We propose Algorithm 3, which we call the geometric projection algorithm.

Algorithm 3.

Geometric projection algorithm to project Inline graphic onto Inline graphic.

Fix an initial estimate Inline graphic of the projection of Inline graphic, let Inline graphic, and set Inline graphic.

Repeat:

 Construct Inline graphic for Inline graphic.

 For Inline graphic let Inline graphic be the point a proportion Inline graphic along Inline graphic.

 Find Inline graphic which minimizes Inline graphic.

 Set Inline graphic and Inline graphic, where Inline graphic is the Inline graphicth standard basis vector in Inline graphic.

 Set Inline graphic.

Until the sequence Inline graphic converges.

Algorithm 3 is a modification of Sturm’s algorithm for computing the Fréchet mean of Inline graphic, Algorithm 1. At each step of Sturm’s algorithm, one of the points Inline graphic is used as the new estimate Inline graphic, and the point Inline graphic is sampled according to a fixed probability vector Inline graphic. Here, the new estimate for the projection, Inline graphic, is again chosen from Inline graphic but is selected to greedily minimize the distance from Inline graphic. The vector Inline graphic estimates the weight vector associated with the projected point: at iteration Inline graphic, Inline graphic is a vector with integer entries which counts the number of times the algorithm has moved the estimate of the projection towards each vertex in Inline graphic. The computational cost of the algorithm is similar to that for computing a single Fréchet mean using the Sturm algorithm. For Inline graphic the initial point Inline graphic is sampled uniformly from the perimeter of Inline graphic. Convergence is tested as follows: at iteration Inline graphic it is determined whether Inline graphic for all Inline graphic, where Inline graphic and Inline graphic are fixed; if that is the case, then the algorithm terminates. The output from the algorithm after Inline graphic iterations is an estimate Inline graphic of the projection of Inline graphic and a vector Inline graphic.

The geometric projection algorithm is presented here without a proof of convergence and without further theoretical study of its properties. Instead we rely on a simulation study in the next subsection to assess its effectiveness.
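The greedy update in Algorithm 3 can be illustrated with a Euclidean stand-in. This is a sketch under stated assumptions, not the implementation used in this paper: geodesics are replaced by straight-line segments, so that the locus of the weighted mean is the convex hull of the vertices; the step proportion is taken to be 1/(k+1) at iteration k, as in Sturm's algorithm; the initial estimate is set to the first vertex, with that vertex counted once in the weight vector; and a fixed number of iterations replaces the convergence test described above. The function name `geometric_projection` is hypothetical.

```python
import numpy as np

def geometric_projection(z, vertices, n_iter=2000):
    """Greedy Sturm-type projection of z onto the hull of the vertices (Euclidean sketch)."""
    vertices = np.asarray(vertices, dtype=float)
    z = np.asarray(z, dtype=float)
    mu = vertices[0].copy()            # initial estimate (assumption: start at a vertex)
    counts = np.zeros(len(vertices))
    counts[0] = 1.0                    # the starting vertex is counted once
    for k in range(1, n_iter + 1):
        # Candidate points a proportion 1/(k+1) along the segment from mu to each vertex.
        candidates = mu + (vertices - mu) / (k + 1)
        # Greedy choice: the candidate closest to the data point z.
        i_star = int(np.argmin(np.linalg.norm(candidates - z, axis=1)))
        mu = candidates[i_star]
        counts[i_star] += 1.0          # integer counts of moves towards each vertex
    return mu, counts / counts.sum()

# Hypothetical example: a data point inside a triangle should project to itself.
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
estimate, weights = geometric_projection([0.3, 0.2], triangle)
print(estimate, weights)
```

In this Euclidean setting the estimate after k iterations is the average of the vertices chosen so far, so the normalized counts are exactly the weights of the estimate. No such identity is claimed for tree space, where the counts only estimate the weight vector.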

4.2. Simulations

We ran simulations designed to demonstrate that, specifically in the case of Inline graphic, Algorithm 3 converges to a tree on Inline graphic which minimizes Inline graphic. For each iteration of the simulation, a random species tree Inline graphic with Inline graphic taxa was generated under the Kingman (1982) coalescent. Three trees Inline graphic and a fourth test tree Inline graphic were then generated under a coalescent model constrained to be contained within the tree Inline graphic, and thus corresponded to gene trees coming from the underlying species tree Inline graphic. Maddison (1997) describes in detail the relationship between species trees and gene trees. The DendroPy library (Sukumaran & Holder, 2010) was used to generate these trees. The test tree Inline graphic was then projected onto Inline graphic for Inline graphic using the exhaustive search algorithm and the geometric projection algorithm. All calculations were carried out ignoring pendant edges. This particular simulation scheme was chosen in order to generate a variety of different geometrical configurations for the points Inline graphic and Inline graphic, as well as being biologically reasonable. If the trees were sampled with topologies chosen independently uniformly at random, for example, the simulation procedure would only have explored instances of Inline graphic with widely differing vertices.

The results obtained from the two algorithms were compared in two ways. First, the distances from the data tree to the projected trees obtained with the two algorithms were computed and checked to ensure that the projection algorithm yielded a distance less than or equal to the exhaustive search. Second, the distance between the tree from geometric projection and the tree from exhaustive search was checked to ensure that the two trees were close together. For the second check we considered any distance greater than 1% of the total internal length of the data tree to be a failure.

In a run of 10 000 replications of this procedure, 95.7% of the replications passed the two tests. However, even the set of failing replications produced a projection result that was quite close to the exhaustive search result. Among the 435 failing replications, the perpendicular distance for the projection was an average of 3.7% greater than the perpendicular distance of the exhaustive search, and the distance between the two results was an average of 4.7% of the total internal length of the data tree.

We believe that the failing results are attributable to the projection algorithm becoming trapped in local minima of the perpendicular distance. Starting the algorithm from several locations and comparing the results would help to mitigate this problem. However, for the present purpose of fitting higher principal components to a collection of data trees, we believe these small deviations from the exhaustive search solution are an acceptable trade for the increase in computational speed.

4.3. Stochastic optimization for principal component analysis

Given data Inline graphic, our objective is to find Inline graphic that minimizes the sum of squared projected distances Inline graphic. We henceforth restrict ourselves to the case Inline graphic. The geometric projection algorithm is used to compute Inline graphic given Inline graphic, at least approximately, so we must now consider how to search over the possible configurations of the vertices Inline graphic. We adopt a stochastic optimization approach, Algorithm 4 below, which is similar to that used for fitting principal geodesics in Nye (2014). We assume that we have available a set of proposals Inline graphic, each of which is a map from Inline graphic to the set of distributions on Inline graphic. In particular, given any tree Inline graphic, each Inline graphic is assumed to be a distribution on Inline graphic from which we can easily sample.

Algorithm 4.

Stochastic optimization algorithm to fit Inline graphic to Inline graphic.

Fix an initial set Inline graphic and compute Inline graphic.

Repeat:

  For Inline graphic:

  For Inline graphic:

    Sample a tree Inline graphic from Inline graphic.

    Let Inline graphic be the set Inline graphic but with Inline graphic replacing Inline graphic.

    Compute Inline graphic using the geometric projection algorithm.

    If Inline graphic set Inline graphic.

Until convergence.

The optimization algorithm attempts to minimize Inline graphic by stochastically varying one point Inline graphic at a time using the proposals Inline graphic. The algorithm is greedy: whenever a configuration Inline graphic improves upon the current configuration Inline graphic we replace Inline graphic with Inline graphic. Convergence is assessed by considering the relative change in Inline graphic over a certain fixed number of iterations. If this is less than some proportion then the algorithm terminates. We used three different types of proposal. The first samples a tree uniformly at random with replacement from the dataset Inline graphic. The second type is a refinement of the first: given a tree Inline graphic it similarly samples a tree Inline graphic uniformly at random with replacement from the dataset Inline graphic; then the geodesic Inline graphic is computed, and a beta distribution is used to sample a tree some proportion of the distance along Inline graphic. The third type of proposal is a random walk starting from Inline graphic, as described in Nye (2014). The random walk proposals can have different numbers of steps and step sizes. The algorithm is not guaranteed to find a global optimum, and it can become stuck in local minima, so it must be run from several different starting points for each dataset and the results of the runs compared.
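The greedy structure of Algorithm 4 can be sketched independently of tree space. The code below is an illustration under stated assumptions rather than the implementation used in this paper: the objective and the proposals are passed in as functions, a fixed number of sweeps replaces the relative-change convergence test, and the toy example uses points on the real line, where the fitted object is simply the interval between two vertices. All names (`stochastic_fit`, `interval_objective`, `sample_from_data`, `between_current_and_data`) and the beta parameters are hypothetical. The two toy proposals mimic the first two proposal types described above; the random walk proposal is omitted.

```python
import random

def stochastic_fit(data, vertices, proposals, objective, n_sweeps=100, rng=None):
    """Greedy stochastic search over vertex configurations."""
    rng = rng or random.Random(0)
    vertices = list(vertices)
    best = objective(data, vertices)
    history = [best]
    for _ in range(n_sweeps):
        for i in range(len(vertices)):            # vary one vertex at a time
            for propose in proposals:             # try each proposal type in turn
                candidate = list(vertices)
                candidate[i] = propose(vertices[i], data, rng)
                value = objective(data, candidate)
                if value < best:                  # greedy acceptance
                    vertices, best = candidate, value
        history.append(best)
    return vertices, best, history

def interval_objective(data, vertices):
    # Toy objective: sum of squared distances from each point to the interval
    # spanned by the vertices.
    lo, hi = min(vertices), max(vertices)
    return sum((min(max(x, lo), hi) - x) ** 2 for x in data)

def sample_from_data(current, data, rng):
    # First proposal type: a data point chosen uniformly at random.
    return rng.choice(data)

def between_current_and_data(current, data, rng):
    # Second proposal type: a beta-distributed proportion of the way from the
    # current vertex towards a randomly chosen data point.
    return current + rng.betavariate(2.0, 2.0) * (rng.choice(data) - current)

points = [0.0, 1.0, 2.0, 3.0, 4.0]
fitted, value, history = stochastic_fit(
    points, [2.0, 2.0], [sample_from_data, between_current_and_data], interval_objective
)
print(fitted, value)
```

Because a proposal is accepted only when it lowers the objective, the recorded objective values never increase. This does not protect against local minima, which is why several starting configurations are needed in practice.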

Two statistics can be used to summarize the fit of Inline graphic to a dataset Inline graphic: the sum of squared projected distances Inline graphic and a non-Euclidean proportion of variance statistic, denoted by Inline graphic. If the projection of each data point Inline graphic onto Inline graphic is denoted by Inline graphic and Inline graphic denotes the Fréchet mean of Inline graphic, then

graphic file with name Equation29.gif

The denominator in this expression varies with Inline graphic since Pythagoras’ theorem does not hold in tree space. Unlike Inline graphic, the Inline graphic statistic is quite sensitive to small changes in Inline graphic, but it can be interpreted broadly as the proportion of variance explained by Inline graphic.

To assess the performance of the algorithm we conducted a small simulation study. Eight datasets of 100 trees containing Inline graphic taxa were generated in the following way. For each dataset a tree topology was sampled from a coalescent process, and each edge length was sampled from a gamma distribution with shape Inline graphic and rate Inline graphic, to give a tree Inline graphic. Two trees Inline graphic and Inline graphic were then obtained by applying random topological operations to Inline graphic. In four of the datasets, Inline graphic and Inline graphic were obtained by performing nearest-neighbour interchange operations, while in the other four datasets subtree prune and regraft operations were used. Then, to construct each dataset given Inline graphic, 100 points were sampled from a Dirichlet distribution on Inline graphic with parameter Inline graphic, and the corresponding points on Inline graphic were found using the Bačák algorithm. Each point was then perturbed by using a random walk, so that each dataset resembled a cloud of points around the surface Inline graphic. The step size of the random walk was tuned to produce datasets classified as having either low or high dispersion. Table 1 summarizes the datasets used and the simulation results. The stochastic optimization algorithm performs well in every scenario.

Table 1.

Simulations to assess the stochastic optimization algorithm: the leftmost column describes the number and type of topological operation used to obtain Inline graphic and Inline graphic from Inline graphic for each dataset; in each scenario, two datasets were generated by perturbing points on Inline graphic via random walks, with low and high dispersions. Shown are the fitted values Inline graphic computed with the geometric projection algorithm, with reference values Inline graphic in parentheses, computed with the exhaustive projection algorithm, together with the non-Euclidean Inline graphic statistic, with reference values in parentheses

Low dispersion High dispersion
Topological scenario Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic nearest-neighbour interchange Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic nearest-neighbour interchange Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic subtree prune and regraft Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic subtree prune and regraft Inline graphic Inline graphic Inline graphic Inline graphic

5. Results

5.1. Coelacanths genome and transcriptome data

We applied our method to the dataset comprising 1290 nuclear genes encoding 690 838 amino acid residues obtained from genome and transcriptome data by Liang et al. (2013). Over the past few decades researchers have worked on the phylogenetic relations between coelacanths, lungfishes and tetrapods, but controversy remains despite several studies (Hedges, 2009). Most morphological and palaeontological studies support the hypothesis that lungfishes are closer to tetrapods than they are to coelacanths. However, some research supports alternative hypotheses: that coelacanths are closer to tetrapods; that coelacanths and lungfish are closest; or that tetrapods, lungfishes and coelacanths cannot be resolved. Liang et al. (2013) present these four hypotheses in their Fig. 1, Trees 1–4, respectively.

We reconstructed gene trees using the R (R Development Core Team, 2017) package Phangorn (Schliep, 2011), with each gene tree estimated using maximum likelihood under the Le & Gascuel (2008) model. The dataset consisted of 1290 gene alignments for 10 species: lungfish, Protopterus annectens, and coelacanth, Latimeria chalumnae; three tetrapods, frog, Xenopus tropicalis, chicken, Gallus gallus, and human, Homo sapiens; two ray-finned fish, Danio rerio and Takifugu rubripes; and three cartilaginous fish included as an out-group, Scyliorhinus canicula, Leucoraja erinacea and Callorhinchus milii.

Analysis was performed ignoring pendant edge lengths. A total of 97 outlying trees were removed using KDETrees (Weyenberg et al., 2016), so that 1193 gene trees remained. The Fréchet mean was computed using the Bačák algorithm and its topology is shown in Fig. 4. The mean tree does not resolve whether coelacanth or lungfish is the closest relative of the tetrapods. The sum of squared distances of the data points to the Fréchet mean was 19.7. A principal geodesic was constructed using the algorithm from Nye (2014): the sum of squared projected distances was 9.53 and the non-Euclidean Inline graphic statistic was 51.4%. Traversing the principal geodesic gives trees with the same topology as the Fréchet mean that contract down to a star tree at one end of the geodesic and expand in size at the other end. This shows that the principal source of variation in the dataset is the overall scale of the gene trees or, in other words, the total amount of evolutionary divergence for each gene.

Fig. 4.


The second principal component computed from the lungfish dataset: (a) the simplex shaded according to the topology of the corresponding points on Inline graphic, with the projections of the data points also displayed; (b) topologies of trees on Inline graphic. Species abbreviations are based on the binary nomenclature: lungfish, Pa; coelacanth, Lc; frog, Xt; chicken, Gg; human, Hs; ray-finned fish, Dr and Tr; cartilaginous fish, Sc, Le and Cm. The number of data points projecting to each topology is displayed in brackets.

Figure 4 illustrates the second principal component. The sum of squared projected distances was 7.29 and the non-Euclidean Inline graphic statistic was 61.8%. This represents a relatively small increase in the proportion of variance in relation to the principal geodesic. Three runs of Algorithm 4 were performed to construct the second principal component. The results obtained had very similar summary statistics, but the topologies displayed on the surfaces were more variable, so Fig. 4 is a representative choice. Although the projected points are clustered towards the bottom of the simplex in the figure, the full simplex was drawn to show all the different topological regions. Of the 1193 gene trees, 1094 projected to points with topology 1, which supports lungfish being the closest relative of the tetrapods. From the remaining projected data points, 75 have topology 5, placing both lungfish and coelacanth in a clade with the tetrapods. The topologies 3, 4, 6 and 7 have biologically implausible relationships. However, the projected data points lying outside topology 1 all lie close to the boundary of their respective orthants, having at least one edge length less than 0.0005. For example, the projected data points with topology 3 have very short edge lengths for the biologically implausible clades, such as the grouping of X. tropicalis with S. canicula, and so lie close to trees with more plausible topologies.

Overall, the second principal component suggests that the data support topology 1, with lungfish as the closest relative of tetrapods, and that most of the variation within the data comes from edge length variation within that topology rather than from conflicting topologies. Although the estimates are subject to random variation, it is interesting that the Fréchet mean and principal geodesic did not exhibit topology 1, while the second principal component suggests a solution to the controversial relationship between coelacanth, lungfish and tetrapods. The exhaustive projection algorithm was used to project the data onto the surface Inline graphic produced by Algorithm 4, in order to compare with the results obtained by geometric projection. The sum of squared distances between the projected trees obtained with the two different algorithms was 0.004, a small fraction of the sum of squared projected distances 7.29 for Inline graphic.

5.2. Apicomplexa

We also applied our method to a set of trees constructed from 268 orthologous sequences from eight species of protozoa in the Apicomplexa phylum, previously presented by Kuo et al. (2008). The same dataset was also analysed by Weyenberg et al. (2016), and more details are given in that paper, such as the gene sequences used to infer each tree. The phylum Apicomplexa contains many important protozoan pathogens (Levine, 1988), including the mosquito-transmitted Plasmodium species, the causative agent of malaria; T. gondii, which is one of the most prevalent zoonotic pathogens worldwide; and the water-borne pathogen Cryptosporidium species. Several members of the Apicomplexa also cause significant morbidity and mortality in both wildlife and domestic animals. These include the Theileria and Babesia species, which are tick-borne haemoprotozoan ungulate pathogens, and several species of Eimeria, which are enteric parasites that are particularly detrimental to the poultry industry. Because of their medical and veterinary importance, whole-genome sequencing projects have been completed for multiple prominent members of the Apicomplexa. We removed 16 outlier trees previously identified by Weyenberg et al. (2016) before fitting principal components.

The trees were analysed ignoring pendant edges. The Fréchet mean was computed using the Bačák algorithm: the corresponding tree topology was unresolved, and is shown in Fig. 5. The sum of squared distances from the mean to the data points was 24.6. The principal geodesic was estimated using the algorithm from Nye (2014). The principal geodesic has a non-Euclidean Inline graphic statistic of 40%, and the sum of squared projected distances was 14.2. The principal geodesic displays two main effects. First, the edges leading to the P. vivax and P. falciparum clade, the E. tenella and T. gondii clade, and the B. bovis and T. annulata clade vary substantially in length. The second is a topological rearrangement whereby the clade containing P. vivax and P. falciparum paired with E. tenella and T. gondii is replaced with a clade containing P. vivax and P. falciparum paired with B. bovis and T. annulata. However, the second effect involved very short internal edges, so that along its length, the trees on the principal geodesic resembled the mean tree shown in Fig. 5 but with different overall scale. The principal geodesic therefore reflects variation in the scale of the tree.

Fig. 5.


The second principal component computed from the Apicomplexa dataset: (a) the simplex shaded according to the topology of the corresponding points on Inline graphic, with the projections of the data points also displayed; (b) topologies of trees on Inline graphic. Species abbreviations are based on the species’ binary nomenclature. The number of data points projecting to each topology is displayed in brackets.

Figure 5 illustrates the second principal component, with the simplex shaded according to the corresponding tree topology on Inline graphic. Three separate runs of Algorithm 4 converged to give similar results. The summary statistics for the second principal component are: sum of squared projected distances 10.3; non-Euclidean Inline graphic statistic 56%. While these summary statistics were consistent between runs, the set of topologies displayed on Inline graphic was subject to more variation, so Fig. 5 is a representative choice, although topologies 1, 4 and 6 were present in all runs. The results show how the second principal component is able to tease out more from the data than the variation in overall scale captured by the principal geodesic. Topology 4 is congruent with the generally accepted phylogeny of taxa within the Apicomplexa and is a resolution of the Fréchet mean tree: T. annulata and B. bovis group together; the two Plasmodium species group together; C. parvum is the deepest rooting apicomplexan; and P. vivax, P. falciparum, T. annulata and B. bovis are monophyletic. The latter group are all haemosporidians or blood parasites.

Figure 5 shows that the second principal component corresponds to variation in topology consisting of nearest-neighbour interchange operations that transform topology 4 into topologies 1 and 6. None of the projected trees have topology 5, although this is the topology of one of the vertices of Inline graphic. This topology appears to be present in order for Inline graphic to be positioned in such a way as to capture the other topologies. Topology 2 shows evidence of stickiness, as discussed in § 3.1. Although the topology is unresolved, so that the coloured triangle lies in a codimension-Inline graphic region of tree space, it occupies a nonzero area on the simplex. As for the lungfish dataset, the exhaustive and geometric projection algorithms were compared on the surface Inline graphic produced by Algorithm 4. The distances between the projected points obtained with the two algorithms were very small compared to the distances of the data points from Inline graphic: the sum of squared distances between pairs of projected points was Inline graphic.

6. Discussion

This paper presents three main innovations: (i) use of the locus of the Fréchet mean Inline graphic as an analogue of a principal component in tree space; (ii) proof that Inline graphic has the desired dimension; and (iii) the geometric projection algorithm for projecting data onto Inline graphic. The locus of the Fréchet mean was first proposed as a geometric object for principal component analysis in tree space in a 2015 University of Kentucky PhD thesis by G. Weyenberg. Pennec (2015) made a similar proposal for an analogue of principal component analysis in Riemannian manifolds and other geodesic metric spaces, called barycentric subspace analysis. The barycentric subspaces of Pennec correspond exactly to the surfaces Inline graphic considered in this paper, except that the weights Inline graphic are not constrained to lie in the simplex and can be negative. Pennec’s approach, however, is developed principally for Riemannian manifolds rather than for tree space, though he points out the potential for generalization. There are substantial differences between barycentric subspace analysis and the method presented in this paper. In particular, a key aim of barycentric subspace analysis is to produce nested principal components, Inline graphic, whereas we impose no such restriction here. The nesting is achieved by either adding or removing points from Inline graphic in order to obtain, respectively, a higher- or lower-order nested principal component. This is also possible in the context of our analysis, but the Inline graphicth principal component would in each case form part of the boundary of the Inline graphicth principal component. This is undesirable as it leads to poorly fitting principal components. For example, suppose that the second principal component is constructed by adding an extra vertex to the principal geodesic; many data points would project onto the edge of the second principal component corresponding to the principal geodesic rather than being distributed over the interior of the surface. Similar problems arise if the analysis is performed by removing points from Inline graphic sequentially. These problems do not arise with Pennec’s methodology, because the weights Inline graphic are not restricted to the simplex, so a nested principal component can lie in the interior of higher-order components. In contrast, the existing algorithms for computing the Fréchet mean in tree space and our algorithm for projection onto Inline graphic all require the weights Inline graphic to lie in the simplex, and this motivated the decision to consider principal components which are not nested in this paper. If these algorithms could be adapted to allow negative values for the weights, then a nested principal component analysis would be possible in tree space.
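The role of the simplex constraint can be illustrated with a random-walk scheme in the spirit of the law of large numbers of Sturm (2003). The following Python sketch is an illustration under stated assumptions, not the implementation used in this paper: the function names are hypothetical, and the demonstration geodesic is the Euclidean straight line, standing in for a tree-space geodesic. At each step a vertex is sampled with probability equal to its weight and the current estimate moves a fraction 1/(k+1) of the way along the geodesic towards it. Because the weights act as sampling probabilities, negative weights have no meaning in such a scheme.

```python
import random
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")  # a point in the geodesic metric space


def inductive_weighted_mean(
    points: Sequence[P],
    weights: Sequence[float],
    geodesic: Callable[[P, P, float], P],
    n_iter: int = 20000,
    seed: int = 0,
) -> P:
    """Approximate a weighted Frechet mean by a Sturm-type random walk.

    geodesic(x, y, t) is assumed to return the point a fraction t of the
    way along the geodesic from x to y (t = 0 gives x, t = 1 gives y).
    """
    if not points or len(points) != len(weights):
        raise ValueError("points and weights must be non-empty and of equal length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        # The weights are used as sampling probabilities, so they must
        # lie in the simplex (up to normalization).
        raise ValueError("weights must be non-negative with positive sum")

    rng = random.Random(seed)
    current = rng.choices(points, weights=weights, k=1)[0]
    for k in range(1, n_iter):
        target = rng.choices(points, weights=weights, k=1)[0]
        # Move 1/(k+1) of the way from the current estimate to the sample.
        current = geodesic(current, target, 1.0 / (k + 1))
    return current


def euclidean_geodesic(x, y, t):
    """Stand-in geodesic: straight-line interpolation in Euclidean space."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(x, y))


if __name__ == "__main__":
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    print(inductive_weighted_mean(vertices, [0.5, 0.3, 0.2], euclidean_geodesic))
```

In the Euclidean stand-in the iterate is simply the running average of the sampled vertices, so it approaches the weighted average of the vertices. In tree space the same loop would call a geodesic routine such as that of Owen & Provan (2011) in place of `euclidean_geodesic`.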

Our analysis has been restricted to datasets with relatively few taxa and to the construction of the first and second principal components. The algorithms presented in this paper scale linearly with respect to the number of data points Inline graphic, but run in polynomial time with respect to the number of taxa Inline graphic. However, by partitioning the dataset for the geometric projection algorithm, parallel computer architectures can be employed and the speed-up is approximately proportional to the number of processors used. While the geometric projection algorithm runs relatively quickly, the calculations involved in searching for the optimal set of vertices Inline graphic can be very substantial. The experimental datasets in § 5 took between one and three days to analyse, running on four processors each. For higher-order components, Inline graphic, this computational burden will increase, and it is likely that finding a global minimum for Inline graphic will be more difficult. While the method presented in this paper generalizes to arbitrary Inline graphic, including the geometric projection algorithm, computational issues limited our analysis to Inline graphic. However, fitting a principal component Inline graphic with Inline graphic would give an upper bound on Inline graphic even if a global minimum were not found, and hence an approximate lower bound on the non-Euclidean Inline graphic statistic. Consequently, even a poorly fit principal component with Inline graphic might give some indication of the additional variance explained by higher-order components.
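The partitioning strategy mentioned above can be sketched as follows. This is a minimal Python illustration with hypothetical names, not the geophytter+ implementation: the data points are split into near-equal chunks, each chunk is projected independently, and the results are concatenated in the original order. The function `project_onto_unit_segment` is a Euclidean stand-in for the geometric projection algorithm, used only to make the sketch runnable.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def partition(items: Sequence[T], n_chunks: int) -> List[List[T]]:
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    n_chunks = min(n_chunks, len(items)) or 1
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def _project_chunk(args: Tuple[Callable[[T], R], List[T]]) -> List[R]:
    """Project every point in one chunk; runs inside a single worker."""
    project_one, chunk = args
    return [project_one(p) for p in chunk]


def project_all(
    points: Sequence[T],
    project_one: Callable[[T], R],
    executor: Executor,
    n_chunks: int,
) -> List[R]:
    """Project all points, one chunk per task, preserving input order."""
    chunks = partition(points, n_chunks)
    results = executor.map(_project_chunk, [(project_one, c) for c in chunks])
    return [proj for chunk_result in results for proj in chunk_result]


def project_onto_unit_segment(p: Tuple[float, float]) -> Tuple[float, float]:
    """Stand-in projection: nearest point on the segment from (0,0) to (1,0)."""
    x, _ = p
    return (min(1.0, max(0.0, x)), 0.0)


if __name__ == "__main__":
    data = [(-1.0, 2.0), (0.25, -3.0), (0.5, 0.5), (2.0, 1.0)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(project_all(data, project_onto_unit_segment, pool, n_chunks=4))
```

Since the projections of different data points do not depend on one another, any `concurrent.futures.Executor` can be supplied. A thread pool is used here only to keep the example portable; for a compute-bound projection a process pool would be the more natural choice.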

Uncertainty in estimated principal components could be assessed by bootstrap methods; for example, one can generate replicate datasets by resampling the data Inline graphic and constructing principal components for each replicate. An alternative bootstrap procedure involves estimating a principal component Inline graphic for Inline graphic and then generating replicate datasets by randomly perturbing the projection of each point Inline graphic onto Inline graphic using a random walk, in a similar way to the simulations in § 4.3. However, both these approaches are highly computationally expensive, and would only be feasible for relatively small datasets. Obtaining analytical results about uncertainty, such as proving validity of the bootstrap procedure or establishing confidence regions for principal components, would involve development of asymptotic theory on the space of configurations of the vertices Inline graphic, and this lies well beyond existing probability theory on tree space (Barden et al., 2013).
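The first of these bootstrap procedures, resampling the data with replacement and refitting, can be sketched generically in Python. The names below are hypothetical and the fitting routine is passed in as an argument; in practice `fit` would be the search for a principal component, which is exactly where the computational expense arises. A sample mean of real numbers is used as a stand-in so that the sketch runs as written.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # a data point (for example, a tree)
F = TypeVar("F")  # a fitted object (for example, a principal component)


def bootstrap_fits(
    data: Sequence[T],
    fit: Callable[[List[T]], F],
    n_replicates: int = 200,
    seed: int = 0,
) -> List[F]:
    """Refit on replicate datasets drawn from data with replacement."""
    if not data:
        raise ValueError("data must be non-empty")
    rng = random.Random(seed)
    n = len(data)
    fits = []
    for _ in range(n_replicates):
        # Each replicate has the same size as the original dataset.
        replicate = [data[rng.randrange(n)] for _ in range(n)]
        fits.append(fit(replicate))
    return fits


def sample_mean(xs: List[float]) -> float:
    """Stand-in for an expensive fitting routine."""
    return sum(xs) / len(xs)


if __name__ == "__main__":
    observations = [1.0, 2.0, 2.5, 3.0, 10.0]
    estimates = bootstrap_fits(observations, sample_mean, n_replicates=5)
    print(estimates)
```

The spread of the returned fits gives an empirical indication of uncertainty. The total cost is `n_replicates` times the cost of a single fit, which is why the approach is only feasible for relatively small datasets.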

The figures in § 5 demonstrate the potential for creating visualizations of the data which reveal meaningful biological structure. The patterns of projected points obtained for the experimental datasets we considered were very similar to results obtained via multi-dimensional scaling. However, multi-dimensional scaling is not capable of revealing the features of the dataset that cause the observed variation. More information could be included in the graphical representation of our results, such as the distance of the data points from their projections, information about the principal geodesic, and the proximity of points to orthant boundaries.

Our software for finding principal components in tree space is available to download from http://www.mas.ncl.ac.uk/~ntmwn/geophytterplus/. The datasets analysed in this paper are also available from that website. An optional R package used to produce the figures in this article can be obtained from https://github.com/grady/geophyttertools.

We presented Algorithm 3, the geometric projection algorithm, without a proof of convergence, and we used simulation to assess its accuracy. The algorithm is attractive in that it is defined entirely in terms of the geodesic structure on tree space, so it could be used on any geodesic metric space, including Riemannian manifolds. The algorithm clearly deserves further investigation, and we intend to study its properties in future work.

Supplementary Material

Supplementary Data 1
Supplementary Data 2

Acknowledgement

The authors thank D. Howe from the University of Kentucky for useful comments on the analysis of the Apicomplexa dataset. Grady Weyenberg acknowledges support from the Wellcome Trust and the Medical Research Council Integrative Epidemiology Unit, University of Bristol, U.K. Xiaoxian Tang acknowledges support from the Zentrale Forschungsförderung of the University of Bremen, Germany.

Supplementary material

Supplementary material available at Biometrika online includes the proof of Lemma 2 and the geophytter+ software, which implements the algorithms described in this paper.

References

  1. Barden D., Le H. & Owen M. (2013). Central limit theorems for Fréchet means in the space of phylogenetic trees. Electron. J. Prob. 18, 1–25.
  2. Bačák M. (2014). Computing medians and means in Hadamard spaces. SIAM J. Optimiz. 24, 1542–66.
  3. Billera L. J., Holmes S. P. & Vogtmann K. (2001). Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27, 733–67.
  4. Bridson M. R. & Haefliger A. (2011). Metric Spaces of Non-Positive Curvature. Berlin: Springer.
  5. Ding C. & He X. (2004). K-means clustering via principal component analysis. In Proc. 21st Int. Conf. Mach. Learn. Banff: Association for Computing Machinery, p. 29.
  6. Feragen A., Owen M., Petersen J., Wille M. M. W., Thomsen L. H., Dirksen A. & de Bruijne M. (2013). Tree-space statistics and approximations for large-scale analysis of anatomical trees. In Information Processing in Medical Imaging (23rd Int. Conf. Proc.), Gee J. C., Joshi S., Pohl K. M., Wells W. M. & Zollei L. eds. Berlin: Springer.
  7. Gori K., Suchan T., Alvarez N., Goldman N. & Dessimoz C. (2016). Clustering genes of common evolutionary history. Molec. Biol. Evol. 33, 1590–605.
  8. Hedges S. (2009). Vertebrates (Vertebrata). In The Timeline of Life, Hedges S. B. & Kumar S. eds. New York: Oxford University Press, pp. 309–14.
  9. Hillis D. M., Heath T. A. & St. John K. (2005). Analysis and visualization of tree space. Syst. Biol. 54, 471–82.
  10. Hotz T., Huckemann S., Le H., Marron J. S., Mattingly J. C., Miller E., Nolen J., Owen M., Patrangenaru V. & Skwerer S. (2013). Sticky central limit theorems on open books. Ann. Appl. Prob. 23, 2238–58.
  11. Kingman J. F. C. (1982). The coalescent. Stoch. Proces. Appl. 13, 235–48.
  12. Kuo C., Wares J. P. & Kissinger J. C. (2008). The Apicomplexan whole-genome phylogeny: An analysis of incongruence among gene trees. Molec. Biol. Evol. 25, 2689–98.
  13. Le S. Q. & Gascuel O. (2008). An improved general amino acid replacement matrix. Molec. Biol. Evol. 25, 1307–20.
  14. Levine N. D. (1988). Progress in taxonomy of the Apicomplexan protozoa. J. Eukaryot. Microbiol. 35, 518–20.
  15. Liang D., Shen X. X. & Zhang P. (2013). One thousand two hundred ninety nuclear genes from a genome-wide survey support lungfishes as the sister group of tetrapods. Molec. Biol. Evol. 30, 1803–7.
  16. Lin B., Sturmfels B., Tang X. & Yoshida R. (2016). Convexity in tree spaces. arXiv: 1510.08797v3.
  17. Lubiw A., Maftuleac D. & Owen M. (2017). Shortest paths and convex hulls in 2D complexes with non-positive curvature. arXiv: 1603.00847v4.
  18. Maddison W. P. (1997). Gene trees in species trees. Syst. Biol. 46, 523–36.
  19. Miller E., Owen M. & Provan J. S. (2015). Polyhedral computational geometry for averaging metric phylogenetic trees. Adv. Appl. Math. 68, 51–91.
  20. Nye T. M. W. (2011). Principal components analysis in the space of phylogenetic trees. Ann. Statist. 39, 2716–39.
  21. Nye T. M. W. (2014). An algorithm for constructing principal geodesics in phylogenetic treespace. IEEE/ACM Trans. Comp. Biol. Bioinfo. 11, 304–15.
  22. Owen M. & Provan J. S. (2011). A fast algorithm for computing geodesic distances in tree space. IEEE/ACM Trans. Comp. Biol. Bioinfo. 8, 2–13.
  23. Pennec X. (2015). Barycentric subspaces and affine spans in manifolds. In Geometric Science of Information (2nd Int. Conf. Proc.), Nielsen F. & Barbaresco F. eds. Palaiseau, France: Springer.
  24. R Development Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.
  25. Schliep K. P. (2011). Phangorn: Phylogenetic analysis in R. Bioinformatics 27, 592–3.
  26. Semple C. & Steel M. A. (2003). Phylogenetics. Oxford: Oxford University Press.
  27. Sturm K.-T. (2003). Probability measures on metric spaces of nonpositive curvature. In Heat Kernels and Analysis on Manifolds, Graphs, and Metric Spaces, Auscher P., Coulhon T. & Grigor’yan A. eds. Providence, Rhode Island: American Mathematical Society, pp. 357–90.
  28. Sukumaran J. & Holder M. T. (2010). Dendropy: A Python library for phylogenetic computing. Bioinformatics 26, 1569–71.
  29. Weyenberg G., Huggins P. M., Schardl C. L., Howe D. K. & Yoshida R. (2014). KDEtrees: Non-parametric estimation of phylogenetic tree distributions. Bioinformatics 30, 2280–7.
  30. Weyenberg G., Yoshida R. & Howe D. (2016). Normalizing kernels in the Billera-Holmes-Vogtmann treespace. IEEE/ACM Trans. Comp. Biol. Bioinfo. doi: 10.1109/TCBB.2016.2565475.
  31. Zha H., Ding C., Gu M., He X. & Simon H. D. (2001). Spectral relaxation for K-means clustering. Neural Info. Proces. 14, 1057–64.

Articles from Biometrika are provided here courtesy of Oxford University Press
